The code for my modifications to the Nutch-based fetcher is at alpha quality, meaning that it compiles and I've tested it on smallish (<100 pages) sites. Some unit tests are written, but they're not as extensive as I'd like.

Some thoughts:

  1. The Nutch crawler (NC) uses an offline fetchlist building strategy, whilst our crawler (OC) supports both online and offline fetchlist building. Fetchlist building is basically the process of determining which URLs to crawl next. In Nutch, it goes like this:
    1. Inject seed URLs into database of URLs
    2. Generate fetchlist
    3. Fetch URLs in fetchlist
    4. Parse fetched pages for links, adding them into database of URLs selectively (depending on URL filter rules)
    5. Rank fetched pages (based on PageRank)
    6. Generate fetchlist using top x pages
    7. Repeat steps 3–6

    Since step 5 is both time-consuming and computationally intensive, NC's fetchlist building has to be an offline step. (NB: breadth-first crawling can possibly be just as effective as PageRank-based crawling.)

    In crawls where the number of fetched pages is limited and the order in which the pages are fetched doesn't matter, an offline fetchlist building strategy offers little or no advantage over an online one.
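
    To make the cycle concrete, here is a minimal, in-memory sketch of the inject → generate → fetch → update → rank loop. All class and method names are mine, purely for illustration; the real NC works over on-disk structures and a proper web database, and the "ranking" step here is only a placeholder for PageRank.

        import java.util.*;
        import java.util.stream.Collectors;

        // Simplified, in-memory sketch of offline fetchlist building
        // (inject -> generate -> fetch -> update -> rank -> repeat).
        // Names are illustrative; Nutch itself works over on-disk
        // segments and a web database, not a HashMap.
        public class OfflineCrawlSketch {

            private final Map<String, Double> db = new HashMap<>();   // URL -> score
            private final Set<String> fetched = new HashSet<>();

            // Step 1: inject seed URLs into the database.
            void inject(Collection<String> seeds) {
                for (String url : seeds) db.putIfAbsent(url, 1.0);
            }

            // Steps 2/6: generate a fetchlist of the top-n unfetched URLs by score.
            List<String> generate(int topN) {
                return db.entrySet().stream()
                        .filter(e -> !fetched.contains(e.getKey()))
                        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                        .limit(topN)
                        .map(Map.Entry::getKey)
                        .collect(Collectors.toList());
            }

            // Steps 3/4: "fetch" each URL and add its outlinks to the database
            // (URL filter rules omitted).
            void fetchAndUpdate(List<String> fetchlist) {
                for (String url : fetchlist) {
                    fetched.add(url);
                    for (String outlink : fetchOutlinks(url)) {
                        db.putIfAbsent(outlink, 0.5);
                    }
                }
            }

            // Step 5: re-score pages. A trivial bump of unfetched pages' scores
            // stands in for a real PageRank pass.
            void rank() {
                db.replaceAll((url, score) -> fetched.contains(url) ? score : score + 0.1);
            }

            private List<String> fetchOutlinks(String url) {
                return Collections.emptyList();   // placeholder: no real network access here
            }

            public static void main(String[] args) {
                OfflineCrawlSketch crawl = new OfflineCrawlSketch();
                crawl.inject(Collections.singletonList("http://example.com/"));
                for (int round = 0; round < 3; round++) {   // steps 3-6, repeated
                    crawl.fetchAndUpdate(crawl.generate(10));
                    crawl.rank();
                }
            }
        }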

  2. Crawlers should be polite and not hammer a web server repeatedly with requests, so the crawler has to implement some kind of server politeness policy. The accepted norm is to wait a period of time between consecutive requests to the same server. To maximize crawl speed while staying polite, the crawler should try to download from as many different hosts simultaneously as possible.

    Both of Nutch's current HTTP plugins (protocol-http and protocol-httpclient) are implemented in such a way that optimum crawl times can only be achieved when

    • the order in which the URLs are downloaded is sufficiently random
    • the set of hosts is sufficiently large
    • the set of hosts is well distributed

    In NC, this currently consists of:
    1. Generating an MD5 hash of each URL, then sorting fetchlist entries by this hash (sketched below). Since a single character change in the URL will result in a wildly different hash, this approach does a good job of satisfying the requirement that URLs be randomized. The other two requirements (a large and well-distributed set of hosts) are satisfied by the very definition of the WWW.
    2. Enforcing a fixed delay between consecutive requests to the same web server (this is the fetcher.server.delay config variable).
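
    As an illustration of step 1 (and nothing more), hash-ordering a fetchlist might look like the sketch below. This is not Nutch's actual code, just the idea: sort by the MD5 of the URL string so that pages from the same host end up scattered across the fetchlist.

        import java.math.BigInteger;
        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.util.*;

        // Sketch: order fetchlist entries by the MD5 hash of the URL so that
        // URLs from the same host are interleaved with everyone else's.
        public class Md5OrderingSketch {

            static String md5Hex(String url) throws Exception {
                byte[] digest = MessageDigest.getInstance("MD5")
                        .digest(url.getBytes(StandardCharsets.UTF_8));
                return String.format("%032x", new BigInteger(1, digest));
            }

            public static void main(String[] args) {
                List<String> urls = new ArrayList<>(Arrays.asList(
                        "http://big-host.com/page1", "http://big-host.com/page2",
                        "http://big-host.com/page3", "http://small-host.com/"));

                // A one-character difference gives a wildly different hash, so
                // page1/page2/page3 end up far apart in the sorted order.
                urls.sort(Comparator.comparing((String url) -> {
                    try { return md5Hex(url); } catch (Exception e) { throw new RuntimeException(e); }
                }));
                urls.forEach(System.out::println);
            }
        }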

    In the case of constrained crawling, when either the size or the distribution of hosts falls short of these requirements, it's not difficult to see how crawl time is severely hampered: if there are too few hosts, or if the distribution of URLs is skewed towards a couple of large hosts, the crawl time will always be grossly sub-optimal.

    OC's default fetchlist prioritizes hosts by the number of unfetched pages they contain. This satisfies the prerequisite for optimal crawl speed and minimizes the chance of URLs getting "backed up" behind large, slow-to-fetch hosts, where the number of active fetching threads effectively drops to the number of outstanding hosts. OC also waits a period of time between consecutive requests to the same host, and it's also possible to vary this delay with the time taken to download a URL (a slow server can mean an overloaded server, or it can mean a slow network).
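
    A rough sketch of that idea is below, with invented names and numbers (this is not OC's actual interface): pick the eligible host with the largest backlog of unfetched pages, and give each host a per-host "not before" time that stretches when downloads from it are slow.

        import java.util.*;

        // Sketch of an online, host-aware fetchlist: the next URL comes from the
        // host with the most unfetched pages, subject to a per-host politeness
        // delay that grows when downloads from that host are slow.
        public class HostPriorityFetchlistSketch {

            private static final long BASE_DELAY_MS = 5_000;   // illustrative value only

            private static class Host {
                final Deque<String> unfetched = new ArrayDeque<>();
                long nextAllowedFetchMs = 0;   // politeness: earliest time of the next request
            }

            private final Map<String, Host> hosts = new HashMap<>();

            public void add(String host, String url) {
                hosts.computeIfAbsent(host, h -> new Host()).unfetched.add(url);
            }

            // Pick the eligible host with the largest backlog, or null if none is ready yet.
            public String next(long nowMs) {
                Host best = null;
                for (Host h : hosts.values()) {
                    if (h.unfetched.isEmpty() || h.nextAllowedFetchMs > nowMs) continue;
                    if (best == null || h.unfetched.size() > best.unfetched.size()) best = h;
                }
                return best == null ? null : best.unfetched.poll();
            }

            // After a download finishes, push the host's next slot back; slow
            // downloads (overloaded server or slow network) earn a longer delay.
            public void done(String host, long nowMs, long downloadTimeMs) {
                long delay = Math.max(BASE_DELAY_MS, 2 * downloadTimeMs);
                hosts.get(host).nextAllowedFetchMs = nowMs + delay;
            }
        }

    A real implementation would index hosts more cleverly than a linear scan, but the scan keeps the sketch short.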

    Note that OC's fetchlist is an interface, which means that Nutch's offline fetchlist building can be replicated by:
    1. Seed the crawl from a file (same as Nutch)
    2. When the crawl finishes, run the Nutch fetchlist building steps
    3. Continue the crawl using a simple fetchlist which reads entries from the Nutch fetchlist (.fl)
    4. Repeat steps 2 and 3.

    Actually, the current codebase already contains the necessary classes to do this.
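
    For illustration, a file-backed fetchlist along the lines of step 3 might look roughly like this. The interface shape and the one-URL-per-line format are assumptions made for the sketch; OC's actual fetchlist interface and Nutch's .fl format are richer than this.

        import java.io.*;

        // Hypothetical shape of a fetchlist interface, plus a trivial implementation
        // that replays entries produced by an offline (Nutch-style) fetchlist build.
        interface FetchList {
            String next() throws IOException;   // next URL to fetch, or null when exhausted
        }

        class FileFetchList implements FetchList {
            private final BufferedReader reader;

            FileFetchList(File fetchlistFile) throws IOException {
                this.reader = new BufferedReader(new FileReader(fetchlistFile));
            }

            @Override
            public String next() throws IOException {
                String line = reader.readLine();
                if (line == null) {
                    reader.close();
                    return null;
                }
                return line.trim();
            }
        }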

  3. Java needs a library of external memory algorithms and data structures (I'm working on this now). Examples from Nutch are SequenceFile, MapFile and ArrayFile, as well as the external mergesort algorithm. Other examples are B+-tree, insertion sort for nearly sorted files, buffer tree, heap file, etc. The list goes on.
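
    As a taste of what I mean, here is a bare-bones external mergesort over the lines of a text file: spill sorted chunks to temporary files, then k-way merge them with a priority queue. It is deliberately simple (no compression, no custom comparators, no temp-file cleanup) and is not Nutch's SequenceFile-based sort.

        import java.io.*;
        import java.nio.file.*;
        import java.util.*;

        // Bare-bones external mergesort over the lines of a text file:
        // phase 1 spills sorted chunks to temp files, phase 2 k-way merges them.
        public class ExternalMergeSort {

            // Phase 1: read the input in chunks that fit in memory, sort each, spill to disk.
            static List<Path> spillSortedChunks(Path input, int linesPerChunk) throws IOException {
                List<Path> chunks = new ArrayList<>();
                try (BufferedReader in = Files.newBufferedReader(input)) {
                    List<String> buffer = new ArrayList<>();
                    String line;
                    while ((line = in.readLine()) != null) {
                        buffer.add(line);
                        if (buffer.size() == linesPerChunk) {
                            chunks.add(writeSortedChunk(buffer));
                            buffer.clear();
                        }
                    }
                    if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
                }
                return chunks;
            }

            private static Path writeSortedChunk(List<String> lines) throws IOException {
                Collections.sort(lines);
                Path chunk = Files.createTempFile("sort-chunk", ".txt");
                Files.write(chunk, lines);
                return chunk;
            }

            // One open chunk file plus its current head line.
            private static class Cursor {
                final BufferedReader reader;
                String line;
                Cursor(Path chunk) throws IOException {
                    reader = Files.newBufferedReader(chunk);
                    line = reader.readLine();
                }
            }

            // Phase 2: repeatedly emit the smallest head line among all chunks.
            static void merge(List<Path> chunks, Path output) throws IOException {
                PriorityQueue<Cursor> heads =
                        new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.line));
                for (Path chunk : chunks) {
                    Cursor c = new Cursor(chunk);
                    if (c.line != null) heads.add(c);
                }
                try (BufferedWriter out = Files.newBufferedWriter(output)) {
                    while (!heads.isEmpty()) {
                        Cursor smallest = heads.poll();
                        out.write(smallest.line);
                        out.newLine();
                        smallest.line = smallest.reader.readLine();
                        if (smallest.line != null) heads.add(smallest);
                        else smallest.reader.close();
                    }
                }
            }

            public static void main(String[] args) throws IOException {
                List<Path> chunks = spillSortedChunks(Paths.get(args[0]), 100_000);
                merge(chunks, Paths.get(args[1]));
            }
        }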