Building a crawler that doesn't crash or go berserk
Sometimes a link is broken. Sometimes Wiktionary declines your request. Sometimes Wiktionary takes a long time to respond. These are generic web crawler concerns, not unique to Wiktionary, so it helps to have a good crawling strategy. Wiktionary is also a work of good will, so hammering it with requests is not a good strategy either. A good web crawler should support:
multiple open requests simultaneously
an upper and lower bound on request frequency
miscellaneous error handling
exponential backoff in case of error for any reason
extraction of high-fidelity information from pages
a queue of pages found while crawling that have yet to be crawled
storage for extracted data
resuming a crawl without losing the state of the crawl queue or the extracted data
Downloading data from a URL
That is a lot of requirements, but all of them are needed for a responsible and efficient web crawler. To start, let's try downloading data from a URL. This is a bit ugly in Rust, but only because there is no simple built-in way to make a blocking HTTP request, so even this first step jumps into advanced features.
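To give a feel for what a blocking download involves, here is a minimal sketch that uses only the standard library. It is an illustration under stated assumptions, not necessarily the approach a real crawler would take: it speaks plain HTTP/1.0 on port 80, while Wiktionary is served over HTTPS, so a real crawler would need an HTTP client crate with TLS support. The function names `build_request`, `parse_response`, and `download`, and the `example-crawler/0.1` user agent, are all hypothetical.

```rust
use std::io::{self, Read, Write};
use std::net::TcpStream;

/// Builds a minimal HTTP/1.0 GET request. HTTP/1.0 is an assumption made
/// for simplicity: the server closes the connection when the body ends,
/// so we can read until end of stream.
fn build_request(host: &str, path: &str) -> String {
    format!(
        "GET {} HTTP/1.0\r\nHost: {}\r\nUser-Agent: example-crawler/0.1\r\nConnection: close\r\n\r\n",
        path, host
    )
}

/// Splits a raw HTTP response into its status code and body.
/// Returns None if the response has no header/body separator or status code.
fn parse_response(raw: &str) -> Option<(u16, &str)> {
    let (head, body) = raw.split_once("\r\n\r\n")?;
    let status_line = head.lines().next()?;
    let code = status_line.split_whitespace().nth(1)?.parse::<u16>().ok()?;
    Some((code, body))
}

/// Blocking download over plain HTTP (no TLS, no redirects, no timeouts).
fn download(host: &str, path: &str) -> io::Result<(u16, String)> {
    let mut stream = TcpStream::connect((host, 80))?;
    stream.write_all(build_request(host, path).as_bytes())?;

    // Read everything the server sends until it closes the connection.
    let mut raw = Vec::new();
    stream.read_to_end(&mut raw)?;

    let text = String::from_utf8_lossy(&raw);
    let (code, body) = parse_response(&text)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "malformed HTTP response"))?;
    Ok((code, body.to_string()))
}
```

Calling `download("example.org", "/")` would return the status code and the body as a string. Keeping request building and response parsing in separate functions means both can be checked without touching the network.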
Rate limiting and initializing our web crawler
We will set a maximum rate of 5 requests per second and a minimum rate of 1 request per minute. If an HTTP request fails for any reason, we will back off exponentially towards the minimum request rate. We will also store partial data in a local file in case the process is killed or otherwise dies. While gathering data, we will occasionally flush all of it to a file from which the crawl can be resumed later. This also benefits Wiktionary, since we won't need to revisit URLs unless we intend to.
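The rate bounds and the backoff can be sketched as a small state machine. This is one possible shape, not a definitive implementation: 5 requests per second becomes a 200 ms minimum delay, 1 request per minute becomes a 60 s maximum delay, and every failure doubles the current delay up to that cap. The `RateLimiter` type and its `record` method are hypothetical names, and resetting to the full rate after a single success is an assumption of this sketch.

```rust
use std::time::Duration;

/// Shortest wait between requests: 5 requests per second.
const MIN_DELAY: Duration = Duration::from_millis(200);
/// Longest wait between requests: 1 request per minute.
const MAX_DELAY: Duration = Duration::from_secs(60);

/// Tracks how long to wait before sending the next request.
struct RateLimiter {
    delay: Duration,
}

impl RateLimiter {
    /// Starts at the maximum request rate.
    fn new() -> Self {
        RateLimiter { delay: MIN_DELAY }
    }

    /// Records the outcome of a request and returns the delay to wait
    /// before the next one.
    fn record(&mut self, success: bool) -> Duration {
        self.delay = if success {
            // Assumption: one success restores the full request rate.
            MIN_DELAY
        } else {
            // Exponential backoff: double the delay, capped at the
            // minimum request rate of one request per minute.
            (self.delay * 2).min(MAX_DELAY)
        };
        self.delay
    }
}
```

A crawl loop would call something like `std::thread::sleep(limiter.record(ok))` after each request. Starting from 200 ms, nine consecutive failures are enough to reach the 60 s cap.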
Persisting crawl queue to file and resuming
One of our stated requirements was that we can stop and restart the crawler without losing too much state. To accomplish this, we will periodically flush the queue to a file and read it back in once when the crawler starts.
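A minimal sketch of that flush-and-resume cycle follows, assuming the queue is a `VecDeque<String>` of URLs stored one per line. The names `save_queue` and `load_queue` and the file format are assumptions made for illustration. Writing to a temporary file and renaming it into place is a design choice of this sketch, meant to keep the previous snapshot intact if the process dies mid-write.

```rust
use std::collections::VecDeque;
use std::fs;
use std::io;
use std::path::Path;

/// Writes the queue to disk, one URL per line. The data goes to a
/// temporary file first and is then renamed over the real file.
fn save_queue(queue: &VecDeque<String>, path: &Path) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut contents = String::new();
    for url in queue {
        contents.push_str(url);
        contents.push('\n');
    }
    fs::write(&tmp, contents)?;
    fs::rename(&tmp, path)
}

/// Reads the queue back in the order it was saved. A missing file is
/// treated as a fresh crawl and yields an empty queue.
fn load_queue(path: &Path) -> io::Result<VecDeque<String>> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(text
            .lines()
            .filter(|line| !line.is_empty())
            .map(String::from)
            .collect()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(VecDeque::new()),
        Err(e) => Err(e),
    }
}
```

The crawler would call `load_queue` once at startup and `save_queue` every so often while crawling, for example after a fixed number of pages.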