Alright. I know I've blogged about this before. Well, I'm revisiting it again.

My sense is that there's a real need for a simple crawler that's easy to use as an API and doesn't try to be everything to everyone.

Yes, Nutch is cool, but I'm so tired of fiddling around with configuration files, the proprietary file formats, and the filesystem dependence of plugins. Also, crawl progress reporting is poor unless you plan on parsing log files.

Here are some thoughts on what a simple crawler might look like:

Download all pages in a site


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setOutput(new SaveToDisk(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.setScope(new HostScope(url));
    c.start();

Download all URLs from a file (depth-1 crawl)


    SimpleCrawler c = new SimpleCrawler();
    c.setMaxConnectionsPerHost(5);
    c.setIntervalBetweenConsecutiveRequests(1000);
    c.addURLs(new File(file));
    c.setLinkExtractor(null); // no link extraction, so only the listed URLs get fetched
    c.setOutput(new DirectoryPerDomain(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

Page through search results via a regex


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setLinkExtractor(new RegexLinkExtractor(regex));
    c.setOutput(new SaveToDisk(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

Save to a Nutch segment for compatibility


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setOutput(new NutchSegmentOutput(segmentdir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();
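
To make that concrete, here's roughly what the pluggable pieces behind those calls would be. The type names come straight from the snippets above; the method signatures are just my guess at a sketch, nothing more:


    // Sketch only -- names taken from the examples above, signatures invented.
    import java.io.IOException;
    import java.net.URL;
    import java.util.List;

    interface LinkExtractor {
        // Return the URLs to follow from a fetched page.
        List<URL> extractLinks(URL pageUrl, byte[] content);
    }

    interface Output {
        // Called once per successfully fetched page (SaveToDisk,
        // DirectoryPerDomain, NutchSegmentOutput would implement this).
        void write(URL url, byte[] content) throws IOException;
    }

    interface ProgressListener {
        void fetched(URL url, int statusCode);
        void failed(URL url, Exception cause);
        void finished(int succeeded, int failed);
    }

    interface Scope {
        // Decide whether a discovered URL belongs in the crawl (e.g. same host).
        boolean inScope(URL url);
    }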

I'm basically trying to find the sweet spot between Commons HttpClient and a full-blown crawler app like Nutch.
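
For contrast, the Commons HttpClient end of that spectrum gives you a single fetch and nothing else; the frontier, politeness, link extraction, and storage are all your problem. Something like this (3.x-style, just a rough sketch):


    import java.io.IOException;
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;

    HttpClient client = new HttpClient();
    GetMethod get = new GetMethod("http://example.com/");
    try {
        int status = client.executeMethod(get);  // one fetch
        byte[] body = get.getResponseBody();
        // ...queueing, scoping, link extraction and output are all up to you
    } catch (IOException e) {
        // handling and reporting failures is up to you, too
    } finally {
        get.releaseConnection();
    }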

Thoughts?