Compiling Word Lists from Wiktionary Content

Building a crawler that doesn't crash or go berserk

Sometimes a link is broken. Sometimes Wiktionary declines your request. Sometimes a response takes a long time to arrive. These are generic web crawler concerns, not unique to Wiktionary, so it helps to have a good crawling strategy. Wiktionary is also a work of good will, so hammering it with requests is not a good strategy either. A good web crawler should:

  1. support multiple open requests simultaneously
  2. enforce an upper and a lower bound on request frequency
  3. handle miscellaneous errors gracefully
  4. back off exponentially whenever a request fails for any reason
  5. extract high-fidelity information from pages
  6. maintain a queue of newly discovered pages that have yet to be crawled
  7. store extracted data somewhere
  8. resume crawling without losing the crawl queue or the extracted data

Downloading data from a URL

That is a long list of requirements, but all of them are needed for a responsible and efficient web crawler. To start, let's try downloading data from a URL. This is a bit ugly in Rust, but only because it jumps straight into async features rather than making a simple blocking HTTP request.

use error_chain::error_chain;

// Collect the error types we can hit (I/O and HTTP) into a single Result type.
error_chain! {
    foreign_links {
        Io(std::io::Error);
        HttpRequest(reqwest::Error);
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    // Start from a single seed page and fetch its raw HTML.
    let seed = "https://en.wiktionary.org/wiki/seed";
    let response = reqwest::get(seed).await?;
    let _content = response.text().await?;

    Ok(())
}
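
The example above assumes the error-chain, reqwest, and tokio crates are declared as dependencies. A Cargo.toml along these lines should work; the exact version numbers are only indicative:

[dependencies]
error-chain = "0.12"
reqwest = "0.11"
tokio = { version = "1", features = ["full"] }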

Rate limiting and initializing our web crawler

We will set a maximum rate of 5 requests per second and a minimum rate of 1 request per minute. If any of our HTTP requests fails for any reason, we will impose an exponential backoff towards the minimum request rate (sketched below, after the struct definition). We will also store partial data in a local file in case the process is killed or otherwise dies. While gathering data we will periodically flush all of it to a file from which the crawl can be resumed later. This also benefits Wiktionary, since we won't need to revisit URLs unless we intend to.

use std::collections::HashSet;

// Bounds on the delay between requests, in seconds: never faster than one
// request every 0.2 s (5 per second), never slower than one per minute.
pub const MAX_REQUEST_INTERVAL: f64 = 60.0;
pub const MIN_REQUEST_INTERVAL: f64 = 0.2;

pub struct Crawler {
    pub request_interval: f64,
    crawl_visited_urls: HashSet<String>,
    crawl_visited_words: HashSet<String>,
    crawl_queue: HashSet<String>,
}
impl Crawler {
    pub fn new() -> std::io::Result<Crawler> {
        // Start at the fastest allowed rate; errors will slow us down later.
        let cr = Crawler {
            request_interval: MIN_REQUEST_INTERVAL,
            crawl_visited_urls: HashSet::new(),
            crawl_visited_words: HashSet::new(),
            crawl_queue: HashSet::new(),
        };
        Ok(cr)
    }
}
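
The struct above does not yet implement the backoff itself. A minimal sketch of that policy could double the interval after every failed request, capped at MAX_REQUEST_INTERVAL, and snap back to MIN_REQUEST_INTERVAL after a success. The method names here (on_request_error, on_request_success, wait) are illustrative, not part of any library:

impl Crawler {
    // Exponential backoff: double the delay after a failure, but never wait
    // longer than the minimum allowed rate of one request per minute.
    pub fn on_request_error(&mut self) {
        self.request_interval = (self.request_interval * 2.0).min(MAX_REQUEST_INTERVAL);
    }

    // A successful request lets us return to the fastest allowed rate.
    pub fn on_request_success(&mut self) {
        self.request_interval = MIN_REQUEST_INTERVAL;
    }

    // Sleep for the current interval before issuing the next request.
    pub async fn wait(&self) {
        tokio::time::sleep(std::time::Duration::from_secs_f64(self.request_interval)).await;
    }
}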

Persisting crawl queue to file and resuming

One of our stated requirements was that we can stop and restart the crawler without losing too much state. To accomplish this we will periodically flush the queue to a file and read it back in once when the crawler starts.

use std::io::{BufRead, Write};

impl Crawler {
    pub fn persist_queue_to_file(&mut self, fp: &str) -> std::io::Result<()> {
        let mut file = std::fs::File::create(fp)?;
        // One URL per line, followed by whether it has been visited yet.
        for url in self.crawl_visited_urls.iter() {
            file.write_all(format!("{} true\n", url).as_bytes())?;
        }
        for url in self.crawl_queue.iter() {
            file.write_all(format!("{} false\n", url).as_bytes())?;
        }
        Ok(())
    }
    pub fn resume_queue_from_file(&mut self, fp: &str) -> std::io::Result<()> {
        // Create an empty state file on the first run so opening it succeeds.
        if !std::path::Path::new(fp).exists() {
            std::fs::File::create(fp)?;
        }
        let file = std::fs::File::open(fp)?;
        for line in std::io::BufReader::new(file).lines() {
            let line = line?;
            let ls = line.split_whitespace().collect::<Vec<&str>>();
            if ls.len() != 2 { continue; }
            let url = ls[0].to_string();
            let visited = ls[1] == "true";
            if visited {
                self.crawl_visited_urls.insert(url);
            } else {
                self.crawl_queue.insert(url);
            }
        }
        Ok(())
    }
}
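
Tying it together, a run of the crawler might look roughly like the following sketch; queue.txt is just a placeholder file name:

fn run() -> std::io::Result<()> {
    let mut crawler = Crawler::new()?;
    // Pick up whatever state a previous run left behind, if any.
    crawler.resume_queue_from_file("queue.txt")?;

    // ... crawl: pop URLs from the queue, fetch them, record new words and
    // newly discovered URLs ...

    // Flush the current state so a later run can resume from it.
    crawler.persist_queue_to_file("queue.txt")?;
    Ok(())
}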