Building Phonetic Auto Correction in Rust

Finding Possible Misspelling Suggestions

Levenshtein Distance, Edit Distance, or Manhattan Distance are metrics to measure how many corrections are needed to get from one word to another. Corrections are measured one for each deleted characters, one for each added character, and one for each substituted character. There is a very efficient crate in Rust called fst that can be used to very quickly generate a list of possible corrections within N edit distance of a given string.

Ranking Spelling Suggestions by Phonetic Similarity

What happens if you search for "funetic", instead of "phonetic". These two terms are phonetically nearly identical but the spelling is quite different. For these cases, we will use a Rust crate called eudex. So for our autocorrect we have two steps. One, find a list of possible misspellings. Two, rank the misspellings in order of phonetic similarity.

Implementing Dyslexic Search

First let's build a convenience function for building a phonetic index from a dictionary.

use std::env;
use so_many_words::util::read_lines;
use std::fs::File;
use fst::MapBuilder;
use eudex::Hash;

fn main() {
   let lang = "eng".to_string();
   let args: Vec<String> = env::args().collect();
   let wtr = std::io::BufWriter::new(File::create(&format!("data/phoneme_{}.fst",lang)).expect("Unable to create file"));
   let mut build = MapBuilder::new(wtr).expect("Unable to create fst::MapBuilder");

   let mut i = 0;
   for fp in args[1..].iter() {
      if let Ok(lines) = read_lines(fp) {
         for line in lines {
            if let Ok(sent) = line {
               if let Some(word) = sent.split_whitespace().next() {
                  i += 1;
                  if (i%1000)==0 { println!("pass build line: {}", i); }
                  let phone: u64 = Hash::new(word).into();
                  build.insert(word, phone).unwrap();
               }
            }
         }
      }
   }
   build.finish().expect("Unable to finish build of fst::MapBuilder");
}

Searching the Phonetic Index

Now to search the phonetic index we create a Levenshtein search over the fst data. After we find a list of similar words we then rank them by their phonetic similarity. The results are fairly good, particularly for short words.

use std::env;
use std::fs::File;
use fst::{Map,IntoStreamer,Streamer};
use fst::automaton::Levenshtein;
use memmap::Mmap;
use eudex::Hash;

fn main() {
   let lang = "eng";
   let load_file = format!("data/phoneme_{}.fst", lang);
   let mmap = unsafe {
      Mmap::map(
         &File::open(&load_file).expect(&format!("No file data file: {}", load_file))
      ).expect("unable to mmap file")
   };
   let map = Map::new(mmap).expect("Could not create fst::Map from file");

   let args: Vec<String> = env::args().collect();
   for term in args[1..].iter() {
      let th = Hash::new(term);
      let lev = Levenshtein::new(term, 3).unwrap();
      let mut stream = map.search(&lev).into_stream();

      println!("{}", term);
      let mut results = Vec::new();
      while let Some((key,_)) = stream.next() {
         let key = String::from_utf8_lossy(key).to_string();
         let kh = Hash::new(&key);
         let dh = (th - kh).dist();
         results.push((dh, key));
      }

      results.sort();
      for (_,term) in results[..10].iter() {
         println!("\t{}", term);
      }
   }
}