Follow-up post to some Reflections on modifying the Nutch crawler.

The current implementation of Our Crawler (OC) has the following limitations:

  1. No support for distributed crawling.
    I neglected to mention that by building the fetchlist offline, the Nutch Crawler (NC) has an easier job splitting the crawling amongst different crawl servers. Furthermore, because the database of fetched URLs (WebDB) is also modified offline, its real easy to check if a URL has already been fetched.

    Since both fetchlist building and fetched URL DB is modified online in OC, the multiple crawl server scenario complicates things somewhat. My justification to omit this in the initial phase at least, is that for most constrained crawls, a single crawl server (with multiple threads) is the most probable use case, and also most likely sufficient.

  2. No load-balancing of URLs amongst threads.
    Currently URL distribution is excessively simple (url.getHost().hashCode() % numberOfThreads), and the only guarantee the Fetcher makes, is that every URL from the same host will go to the same thread (an important guarantee, since each fetcherthread maintains its own fetchlist and database of fetched URLs). This also means that its pointless using 20 threads to crawl 3 hosts (check out the previous post if you're not clear why)

    However, when some hosts are significantly larger than others, then its highly probable that the number of pages each thread has to fetch is uneven, resulting in sub-optimal fetch times. This calls for a more sophisticacted method of assigning URLs to threads, but still maintaining the thread-host contract.

  3. No bot-trap detection.
    With NC, since the depth of the crawl (determined by the number of iterations the fetcher is run) is limited, bot-traps (intentional or otherwise) aren't a major concern.

    When letting OC loose on a site, however, bot-traps can be a big problem because OC will continue to run as long as there are still items in the fetchlist.

  4. Disk writes in multiple threads may nullify SequenceFile performance gains
    Each thread maintains its own database of fetched URLs (via a MapFile). When using alot of threads, its quite likely that multiple threads will be writing to disk at once. To be honest, I don't quite know enough about this to know if its a problem, but depending on hardware and HD utilization, I think its possible that the disk heads may end up jumping around to satisfy the simultaneous writes? If so, SequenceFile's premise of fast sequential write-access is no longer valid.

These are what I can think off right now. Will update this list as I go along.