Tokenizing and Stemming Natural Language in Rust

Tokenizing Multilingual Input

There is a bit of a chicken-and-egg problem in natural language processing when it comes to language detection versus tokenization: which one should come first? A multilingual tokenizer can be immensely useful for language detection, but knowing the language also helps with tokenization. The problem is most visible in languages that do not put spaces between words, such as Chinese or Japanese. For this application I chose a multilingual tokenizer and carved out exceptions for those languages. The tokenizer is exposed as a convenience command in the so_many_words project. Using it looks like this:

cargo run --bin tokenize a_document.txt
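
To give a feel for what a Unicode-aware tokenizer can look like in Rust, here is a minimal sketch built on the unicode-segmentation crate. This is an illustration under my own assumptions, not the actual so_many_words implementation, and CJK text would still need the extra handling mentioned above, since Unicode word boundaries alone do not segment unspaced scripts.

    // Minimal tokenizer sketch (illustrative, not the so_many_words code).
    // Requires the unicode-segmentation crate in Cargo.toml.
    use unicode_segmentation::UnicodeSegmentation;

    fn tokenize(text: &str) -> Vec<String> {
        // unicode_words() splits on Unicode word boundaries (UAX #29),
        // which works well for space-delimited languages. Chinese and
        // Japanese would need dictionary- or model-based segmentation.
        text.unicode_words()
            .map(|word| word.to_lowercase())
            .collect()
    }

    fn main() {
        let tokens = tokenize("Das ist ein Test, n'est-ce pas?");
        println!("{:?}", tokens);
    }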

Stemming Multilingual Input

If you already know which language you are looking at, there are some nice tools in Rust for stemming. There is a port of the Snowball stemmers that supports 17 different languages. Wow! I put this facility behind another command, and using it looks like this:

cargo run --bin stem fra le_document.txt
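
The post does not name the crate, but the best-known Snowball port in Rust is rust-stemmers, so here is a small sketch of what the core of such a stem command might look like with it. The French algorithm matches the fra argument used above; the word list is just for demonstration.

    // Sketch of Snowball stemming, assuming the rust-stemmers crate.
    use rust_stemmers::{Algorithm, Stemmer};

    fn main() {
        // French Snowball stemmer, matching the `fra` argument above.
        let stemmer = Stemmer::create(Algorithm::French);

        for word in ["continuation", "continuelles", "continuons"] {
            // stem() returns a Cow<str>: borrowed when the word is
            // already a stem, owned when a suffix was stripped.
            println!("{} -> {}", word, stemmer.stem(word));
        }
    }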

Detecting Language Without External Data Dependencies

There is a crate called whatlang that implements a very convenient language detection algorithm. The classifier relies on character trigrams to identify the language. The downside is that this approach is not as robust as some heavier classification methods, but the results are good enough for real-world applications, as the crate's relative popularity suggests. The benefits are that there is no need to preprocess the text and that the data requirement is very small: all data dependencies ship with the crate and amount to only a few kilobytes. whatlang supports text classification for dozens of languages.

cargo run --bin detect le_document.txt
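
For completeness, here is roughly what the detection step looks like with whatlang's public API. The example text and the surrounding main function are my own, not taken from the project.

    // Language detection with the whatlang crate.
    use whatlang::detect;

    fn main() {
        let text = "Ĉu vi ne volas eklerni Esperanton? Bonvolu!";

        // detect() returns None if the text is too short or ambiguous.
        if let Some(info) = detect(text) {
            println!("language:   {:?}", info.lang());
            println!("script:     {:?}", info.script());
            println!("confidence: {:.2}", info.confidence());
            println!("reliable:   {}", info.is_reliable());
        }
    }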