Continuing Natural Language Processing in Rust

Gathering Data for Machine Learning

Now that we have a fledgling Machine Learning framework, we should gather some data. Python has good access to corpus data, like that found in the Gutenberg Corpus, through NLTK.

pip3 install nltk
python3
>>> import nltk
>>> nltk.download("gutenberg")
>>> nltk.download("punkt")

Then we write a little script to dump the corpus sentences into a text file, one sentence per line:

from nltk.corpus import gutenberg

# Write every sentence from every file in the corpus, one per line
with open("gutenberg_sentences.txt", "w") as out:
   for fileid in gutenberg.fileids():
      for sent in gutenberg.sents(fileid):
         out.write(" ".join(sent) + "\n")

Currently Rust does not have libraries that make such simple tasks this convenient. What we do expect from Rust is good performance, specifically when it comes to space efficiency. One of the largest hurdles in Natural Language Processing is having enough space to work with the data you gather. Rust is very good at that, and also nice enough to work with, so I chose it for this task.
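As a quick illustration of that space efficiency, here is a minimal sketch of streaming the gathered corpus from Rust one sentence at a time, never holding the whole file in memory (the file name matches the Python script above):

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
   // Stream the corpus line by line instead of reading it all at once
   let file = File::open("gutenberg_sentences.txt")?;
   let mut count = 0;
   for line in BufReader::new(file).lines() {
      let _sentence = line?;
      count += 1;
   }
   println!("{} sentences", count);
   Ok(())
}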

Processing Corpus Data in Rust

Now that we have some data, let's start analyzing it. To process it we will need several new facilities. First, we will need the ability to multiply tensors by constants and to add tensors to one another:

impl std::ops::Add for WordUsageTensor {
   type Output = Self;

   // Element-wise sum of two tensors
   fn add(self, other: Self) -> Self {
      let mut tensor: [f64; 32] = [0.0; 32];
      for i in 0..32 {
         tensor[i] = self.tensor[i] + other.tensor[i];
      }
      WordUsageTensor {
         tensor
      }
   }
}

impl WordUsageTensor {
   // Scale the tensor in place by a constant
   pub fn multiply(&mut self, c: f64) {
      for i in 0..32 {
         self.tensor[i] *= c;
      }
   }
}
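To see these two operations together, here is a small sketch that averages two usage tensors. It assumes WordUsageTensor is declared roughly as a Copy wrapper around a [f64; 32], per the previous article; adjust to however the struct was actually declared.

// Assumes: #[derive(Clone, Copy)] pub struct WordUsageTensor { tensor: [f64; 32] }
fn average(a: WordUsageTensor, b: WordUsageTensor) -> WordUsageTensor {
   let mut avg = a + b;   // element-wise sum via std::ops::Add
   avg.multiply(0.5);     // scale by a constant in place
   avg
}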

Now might be a good time to ask "why so many tensors but no TensorFlow?". My reason for not working with TensorFlow right now is that it is a humongous dependency of which I only need a small portion. Tensors are a simple concept from a mathematical point of view, and if I won't be using any of the cool algorithms already built into the library, then it just doesn't justify its cost. Agile processes dictate that one should choose the path of least overhead. That choice, for now, is to delay the integration of larger toolsets until it becomes clear which tools are best fit for the job.

Defining Each Layer of Our Neural Network

For our Tensor Network we will define each layer as a list of words, each associated with a tensor value indicating its word usage. We can reuse the Trie data structure to implement this concept of a layer.

use radix_trie::Trie;

pub struct DictionLayer {
   diction: Trie<String, WordUsageTensor>
}

impl DictionLayer {
   pub fn new() -> DictionLayer {
      DictionLayer {
         diction: Trie::new()
      }
   }
   // Accumulate a word's usage tensor, summing with any existing record.
   // Note: reading *record out of the mutable reference requires
   // WordUsageTensor to be Copy.
   pub fn add(&mut self, key: String, usage: WordUsageTensor) {
      if let Some(record) = self.diction.get_mut(&key) {
         *record = *record + usage;
      } else {
         self.diction.insert(key, usage);
      }
   }
}
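As a usage sketch, filling a layer from a stream of (word, tensor) pairs might look like this; repeated adds for the same word accumulate into a single record:

fn build_layer(pairs: Vec<(String, WordUsageTensor)>) -> DictionLayer {
   let mut layer = DictionLayer::new();
   for (word, usage) in pairs {
      // Duplicate words are summed rather than overwritten
      layer.add(word, usage);
   }
   layer
}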

Tokenizing Input

Tokenization works better when hard coded. For example, it would be annoying to try to learn all of the numerals from 0 to 9999999999999. Hell, that number doesn't even fit in Rust's default i32 integer type. By comparison, a simple regular expression can do this work with little hassle. We should also tokenize punctuation while we are at it.

use regex::Regex;

pub fn tokenize(s: &str) -> Vec<String> {
   let whitespace = Regex::new(r"^\s+").unwrap();
   let numeral = Regex::new(r"^\d+").unwrap();
   let word = Regex::new(r"^\w+").unwrap();
   let mut ts = Vec::new();
   // Borrow successive slices of the input instead of
   // reallocating a new String on every step
   let mut rest = s;
   while !rest.is_empty() {
      if let Some(m) = whitespace.find(rest) {
         // Skip whitespace without emitting a token
         rest = &rest[m.end()..];
      } else if let Some(m) = numeral.find(rest) {
         ts.push(rest[..m.end()].to_string());
         rest = &rest[m.end()..];
      } else if let Some(m) = word.find(rest) {
         ts.push(rest[..m.end()].to_string());
         rest = &rest[m.end()..];
      } else {
         // Anything else becomes a one-character token; len_utf8
         // keeps us from slicing through a multi-byte character
         let n = rest.chars().next().unwrap().len_utf8();
         ts.push(rest[..n].to_string());
         rest = &rest[n..];
      }
   }
   ts
}

// A token is a word if \w+ consumes the entire string
pub fn is_word(s: &str) -> bool {
   let word = Regex::new(r"^\w+$").unwrap();
   word.is_match(s)
}

pub fn is_numeral(s: &str) -> bool {
   let numeral = Regex::new(r"^[0-9]+$").unwrap();
   numeral.is_match(s)
}

// Punctuation is anything containing no whitespace, digit,
// or word characters (\w already covers the digits)
pub fn is_punctuation(s: &str) -> bool {
   let nonpunct = Regex::new(r"[\s\w]").unwrap();
   !nonpunct.is_match(s)
}
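A quick sketch of these functions on a sample sentence, with the output these rules should produce:

fn main() {
   let ts = tokenize("It costs 12 dollars, I think.");
   // ["It", "costs", "12", "dollars", ",", "I", "think", "."]
   println!("{:?}", ts);
   assert!(is_word("dollars"));
   assert!(is_numeral("12"));
   assert!(is_punctuation(","));
}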

We continue this adventure in the next article, Directed Graphs, Tensors, and Neural Networks in Rust.