Now that we have a fledgling machine learning framework, we should gather some data. Python has good access to corpus data, such as the Gutenberg Corpus.
Then we write a little script to download the corpus into a text file:
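A minimal sketch of such a script. The plain-text URL pattern and the example book id are assumptions; point it at whichever Gutenberg texts you want in your corpus.

```python
import urllib.request
from pathlib import Path

# Assumed plain-text URL pattern for Project Gutenberg books.
GUTENBERG_URL = "https://www.gutenberg.org/files/{id}/{id}-0.txt"

def download_book(book_id: int, out_dir: str = "corpus") -> Path:
    """Fetch one book as plain text and save it to out_dir/<book_id>.txt."""
    Path(out_dir).mkdir(exist_ok=True)
    out_path = Path(out_dir) / f"{book_id}.txt"
    if not out_path.exists():  # skip books we already have on disk
        with urllib.request.urlopen(GUTENBERG_URL.format(id=book_id)) as resp:
            out_path.write_bytes(resp.read())
    return out_path

# download_book(1342)  # 1342 is Pride and Prejudice
```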
Rust does not currently have libraries that make such simple tasks as convenient. What we do get from Rust is good performance, specifically space efficiency. One of the largest hurdles in natural language processing is having enough memory to work with the data you gather. Rust is very good at that, and also pleasant enough to work with, so I chose it for this task.
Processing Corpus Data in Rust
Now that we have some data, let's start analyzing it. To process it we will need several new facilities. First, the ability to multiply tensors by constants and to add tensors to one another:
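A minimal sketch of those two operations, assuming a `Vec`-backed `Tensor` type; the framework's actual tensor will differ, but the operator implementations look much the same.

```rust
use std::ops::{Add, Mul};

#[derive(Debug, Clone, PartialEq)]
struct Tensor {
    data: Vec<f32>,
}

// Scalar multiplication: scale every element by the constant.
impl Mul<f32> for Tensor {
    type Output = Tensor;
    fn mul(self, c: f32) -> Tensor {
        Tensor { data: self.data.iter().map(|x| x * c).collect() }
    }
}

// Element-wise addition of two tensors of the same shape.
impl Add for Tensor {
    type Output = Tensor;
    fn add(self, other: Tensor) -> Tensor {
        assert_eq!(self.data.len(), other.data.len(), "shape mismatch");
        Tensor {
            data: self.data.iter().zip(&other.data).map(|(a, b)| a + b).collect(),
        }
    }
}

fn main() {
    let a = Tensor { data: vec![1.0, 2.0, 3.0] };
    let b = Tensor { data: vec![4.0, 5.0, 6.0] };
    println!("{:?}", a * 2.0 + b); // element-wise: scale then add
}
```

Implementing `Mul` and `Add` as operator traits keeps call sites readable (`a * 2.0 + b`) without pulling in a tensor library.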
Now might be a good time to ask "why so many tensors but no TensorFlow?". My reason for not working with TensorFlow right now is that it is a humongous dependency of which I only need a small portion. Tensors are a simple concept from a mathematical point of view, and if I won't be using any of the sophisticated algorithms the library already provides, then it doesn't justify its cost. Agile processes dictate choosing the path of least overhead. Here, that means delaying the integration of larger toolsets until it becomes clear which tools are best fit for the job.
Defining each Layer of our Neural Network
For our tensor network we will define each layer as a list of words, each associated with a tensor value indicating its usage. We can reuse the Trie data structure to implement this concept of a layer.
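A sketch of one such layer, assuming a byte-keyed trie like the one built earlier; the node fields and method names here are illustrative, not the framework's actual API.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    value: Option<f32>, // word-usage value stored at the end of a word
}

struct Layer {
    root: TrieNode,
}

impl Layer {
    fn new() -> Self {
        Layer { root: TrieNode::default() }
    }

    // Record one more usage of `word` in this layer.
    fn bump(&mut self, word: &str) {
        let mut node = &mut self.root;
        for b in word.bytes() {
            node = node.children.entry(b).or_default();
        }
        *node.value.get_or_insert(0.0) += 1.0;
    }

    // Look up the tensor value for `word`, if it has been seen.
    fn get(&self, word: &str) -> Option<f32> {
        let mut node = &self.root;
        for b in word.bytes() {
            node = node.children.get(&b)?;
        }
        node.value
    }
}
```

The trie shares prefixes between words, which is exactly the kind of space efficiency we picked Rust for.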
Tokenization works better when hard-coded. For example, it would be annoying to try to learn all of the numerals from 0 to 9999999999999. Hell, that number doesn't even pass the Rust lexer. By comparison, a simple regular expression can do this work without much hassle. We should also tokenize punctuation while we're at it.