fst. There is a very efficient crate in Rust that can be used to very quickly generate a list of possible corrections within N edit distance of a"/>
Building Phonetic Auto Correction in Rust
Finding Possible Misspelling Suggestions
Levenshtein Distance, Edit Distance, or Manhattan Distance are metrics to measure how many corrections are needed to get from one word to another. Corrections are measured one for each deleted characters, one for each added character, and one for each substituted character. There is a very efficient crate in Rust called fst that can be used to very quickly generate a list of possible corrections within N edit distance of a given string.
Ranking Spelling Suggestions by Phonetic Similarity
What happens if you search for "funetic", instead of "phonetic". These two terms are phonetically nearly identical but the spelling is quite different. For these cases, we will use a Rust crate called eudex. So for our autocorrect we have two steps. One, find a list of possible misspellings. Two, rank the misspellings in order of phonetic similarity.
Implementing Dyslexic Search
First let's build a convenience function for building a phonetic index from a dictionary.
src/bin_build_phoneme.rs
usestd::env;useso_many_words::util::read_lines;usestd::fs::File;usefst::MapBuilder;useeudex::Hash;fnmain(){let lang ="eng".to_string();let args:Vec<String>=env::args().collect();let wtr =std::io::BufWriter::new(File::create(&format!("data/phoneme_{}.fst",lang)).expect("Unable to create file"));letmut build =MapBuilder::new(wtr).expect("Unable to create fst::MapBuilder");letmut i =0;for fp in args[1..].iter(){ifletOk(lines)=read_lines(fp){for line in lines {ifletOk(sent)= line {ifletSome(word)= sent.split_whitespace().next(){
i +=1;if(i%1000)==0{println!("pass build line: {}", i);}let phone:u64=Hash::new(word).into();
build.insert(word, phone).unwrap();}}}}}
build.finish().expect("Unable to finish build of fst::MapBuilder");}
Searching the Phonetic Index
Now to search the phonetic index we create a Levenshtein search over the fst data. After we find a list of similar words we then rank them by their phonetic similarity. The results are fairly good, particularly for short words.
bin_search_phoneme.rs
usestd::env;usestd::fs::File;usefst::{Map,IntoStreamer,Streamer};usefst::automaton::Levenshtein;usememmap::Mmap;useeudex::Hash;fnmain(){let lang ="eng";let load_file =format!("data/phoneme_{}.fst", lang);let mmap =unsafe{Mmap::map(&File::open(&load_file).expect(&format!("No file data file: {}", load_file))).expect("unable to mmap file")};let map =Map::new(mmap).expect("Could not create fst::Map from file");let args:Vec<String>=env::args().collect();for term in args[1..].iter(){let th =Hash::new(term);let lev =Levenshtein::new(term,3).unwrap();letmut stream = map.search(&lev).into_stream();println!("{}", term);letmut results =Vec::new();whileletSome((key,_))= stream.next(){let key =String::from_utf8_lossy(key).to_string();let kh =Hash::new(&key);let dh =(th - kh).dist();
results.push((dh, key));}
results.sort();for(_,term)in results[..10].iter(){println!("\t{}", term);}}}