Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about crawling

Search and crawling internship

Posted by Kelvin on 31 Oct 2006 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

I'm looking for a competent Java programmer who wants to get into the world of search engines and crawlers. The internship will involve a mixture of (some) training and (mostly) hands-on projects with real-world clients.

In particular, my area of expertise is in vertical search (real estate, jobs, classifieds), so more than likely, that will be the bulk of the exposure.

You'll be working mostly with Lucene and Nutch, but more than just using them, you'll be delving into their innards and pushing the limits of these libraries.

Upon graduation, if you're interested, you'll be invited to stay on and work alongside me on contracting projects, some of which might involve on-site work, perhaps overseas.

The background to this is that I'm basically tired of turning down consulting work because of my limited time, and want to start building a search/crawling consulting practice.

So if you're interested, send your resume along to internship AT supermind.org, with a brief note on your background and interest in this field.

Nutch 0.8, Map & Reduce, here I come!

Posted by Kelvin on 09 Aug 2006 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Finally taking the plunge to Nutch 0.8 after exclusively working with 0.7 for over a year (and something like 5 projects).

From initial experiences, it appears that using M&R does obfuscate the code somewhat for a developer who wants to build an app off the Nutch infrastructure instead of using it out of the box. For example, trying to decipher what's going on in org.apache.nutch.indexer.Indexer is pretty difficult compared to its 0.7 counterpart (IndexSegment).

Some serious documentation needs to be written around the implementation details of M&R. I'll keep posting about what I learn.

Lucene scoring for dummies

Posted by Kelvin on 08 Mar 2006 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling

The factors involved in Lucene's scoring algorithm are as follows:

1. tf = term frequency in document = measure of how often a term appears in the document
2. idf = inverse document frequency = measure of how often the term appears across the index
3. coord = number of terms in the query that were found in the document
4. lengthNorm = measure of the importance of a term according to the total number of terms in the field
5. queryNorm = normalization factor so that queries can be compared
6. boost (index) = boost of the field at index-time
7. boost (query) = boost of the field at query-time

The implementation, implication and rationale of factors 1, 2, 3 and 4 in DefaultSimilarity.java (which is what you get if you don't explicitly specify a similarity) are:

note: the implication of these factors should be read as, "Everything else being equal, … "

1. tf
Implementation: sqrt(freq)
Implication: the more frequently a term occurs in a document, the greater its score
Rationale: documents which contain more occurrences of a term are generally more relevant

2. idf
Implementation: log(numDocs/(docFreq+1)) + 1
Implication: the greater the occurrence of a term in different documents, the lower its score
Rationale: common terms are less important than uncommon ones

3. coord
Implementation: overlap / maxOverlap
Implication: a document that contains more of the query's terms will have a higher score
Rationale: self-explanatory

4. lengthNorm
Implementation: 1/sqrt(numTerms)
Implication: a term matched in a field with fewer terms has a higher score
Rationale: a term in a field with fewer terms is more important than one in a field with more

queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. It is implemented as 1/sqrt(sumOfSquaredWeights)
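
To get a feel for the numbers, here's a quick back-of-the-envelope calculation of factors 1, 2 and 4 using the formulas above (plain Java arithmetic, not calling Lucene itself; the index statistics are made up):


// Made-up index: 1,000,000 docs; the term appears in 100 of them,
// 3 times in our document, in a field of 25 terms.
int numDocs = 1000000, docFreq = 100, freq = 3, numTerms = 25;

double tf         = Math.sqrt(freq);                                 // ~1.73
double idf        = Math.log(numDocs / (double) (docFreq + 1)) + 1;  // ~10.2
double lengthNorm = 1 / Math.sqrt(numTerms);                         // 0.2

// Doubling freq to 6 only raises tf to ~2.45 (diminishing returns), while a term
// that appears in 10,000 docs instead of 100 sees its idf drop to ~5.6.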

So, roughly speaking (quoting Mark Harwood from the mailing list),
* Documents containing *all* the search terms are good
* Matches on rare words are better than for common words
* Long documents are not as good as short ones
* Documents which mention the search terms many times are good

The mathematical definition of the scoring can be found at http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html

Hint: look at NutchSimilarity in Nutch to see an example of how web pages can be scored for relevance

Customizing scoring

It's easy to customize the scoring algorithm. Just subclass DefaultSimilarity and override the methods you want to change.

For example, if you want to ignore how commonly a term appears across the index:


Similarity sim = new DefaultSimilarity() {
  // ignore document frequency entirely: every term gets the same idf
  public float idf(int docFreq, int numDocs) {
    return 1;
  }
};

And if you think that for the title field, more terms is better:


Similarity sim = new DefaultSimilarity() {
  // reward (rather than penalize) longer title fields; other fields keep the default
  public float lengthNorm(String field, int numTerms) {
    if (field.equals("title")) return (float) (0.1 * Math.log(numTerms));
    else return super.lengthNorm(field, numTerms);
  }
};
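
For the custom Similarity to take effect consistently, it needs to be plugged in at both index time and search time, so that the stored norms and the query-time scoring agree. Roughly, using Lucene 1.9/2.x-era method names (treat this as a sketch and check the javadocs of your version):


Similarity sim = ...; // one of the anonymous subclasses above

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
writer.setSimilarity(sim);    // index-time: affects the norms written to the index
// ... writer.addDocument(...) calls, then writer.close() ...

IndexSearcher searcher = new IndexSearcher("/path/to/index");
searcher.setSimilarity(sim);  // query-time: affects how hits are scored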

OC and focused crawling

Posted by Kelvin on 26 Feb 2006 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling

I've had the good fortune to get paid to work on OC (Our Crawler). The features I've been developing are for focused crawling.

Specifically:

  1. Ranking content by relevance to a supplied query and crawling the most relevant links first, with the possibility of specifying a score threshold
  2. Checkpointing the crawl output (which is a Nutch segment) at time intervals, e.g. every 60 minutes. This is insurance against hung crawls, or against the crawler hitting a bot-trap and being unable to exit.
  3. Time-limited "perpetual crawling", where the crawler keeps going until a time limit is reached, at which point it stops all threads and exits gracefully.
  4. Introducing various fetchlist filters which reduce the chances of getting lost in bot-traps, such as not going further than x levels deep within a host and rejecting URLs which repeatedly increase the number of query parameters.
  5. MySQL and BDB-backed persistence.

In addition, some refactoring has taken place that makes it easier to run crawls via the API (as opposed to the command line or Spring). The role of Spring has also been relegated from obligatory to optional (but sweet to have, all the same).

We're still discussing the details of whether all of the code can be open-sourced, though. I'm keeping my fingers crossed.

Next on the plate is support for distributed crawling. Will OC use Nutch's Map-Reduce? That remains to be seen…

The next few months for OC

Posted by Kelvin on 28 Jan 2006 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling

I had a chat with Mike from Atlassian recently, and have arrived at the conclusion that the future of OC lies in being a crawler API, much like what Lucene does for searching. I suppose it will lie somewhere between Nutch (full-blown whole-web crawler) and Commons HTTPClient.

Some directions I will explore include:

  • Introducing checkpointing to recover from hung/crashed crawls
  • A GUI app (probably thinlet-based) for monitoring crawl status
  • Authentication databases for sites (username/password or cookies)
  • Alternatives to Nutch's SegmentFile

I expect to have some free time on my hands to resume work on OC in the coming months.

Update 270106:
Well, checkpointing at the segment level is done at least. So I can't yet recover from a failed crawl, but at least I don't lose everything. 🙂 It's a timer-based checkpointer, so it closes the segment writers every 60 minutes and opens new ones for writing.

Database mining would be very cool, with support for paging and stuff, though it feels somewhat peripheral to OC. If we include some kind of regex/parsing framework for screen-scraping, we would already have something like 85% of what we need to build a vertical search portal.

There is already an alternative to SegmentFile for online DB updates, and it's BerkeleyDB. A simple BerkeleyDB persister has already been written… but I don't like that it's GPL (albeit a looser version of GPL). So, one day when I'm feeling particularly inspired, I'll hack up my own BTree and external HashMap implementation.

Now, a GUI for crawling would be totally sweet. In fact, that's probably what I'm going to work on next.

Crawling Basics

Posted by Kelvin on 27 Oct 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch, crawling

Here's my attempt at providing an introduction to crawling, in the hope that it'll clarify some of my own thoughts on the topic.

The most basic tasks of an HTTP crawler are:

  1. Download urls via the HTTP protocol.
  2. Extract urls from downloaded pages to be added to the download queue.
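
A toy, single-threaded version of just these two tasks might look something like this (download() and extractLinks() are hypothetical placeholders; there's no politeness, robots.txt handling, error handling or scoping here):


Queue<String> fetchlist = new LinkedList<String>(seedUrls);
Set<String> seen = new HashSet<String>(seedUrls);

while (!fetchlist.isEmpty()) {
  String url = fetchlist.remove();
  String page = download(url);               // task 1: fetch the url via HTTP
  for (String link : extractLinks(page)) {   // task 2: extract urls from the page
    if (seen.add(link)) {                    // add() returns false if already seen
      fetchlist.add(link);
    }
  }
}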

That's actually very simple, and crawlers are very simple applications at heart. Two factors complicate crawler development:

  • they deal with large datasets, and require care when designing the appropriate data structures and algorithms for acceptable performance.
  • they require either multi-threaded or non-blocking IO implementations, and both are non-trivial to do well.

To be a bit more intelligent/useful, crawlers usually also provide some way of

  1. restricting the scope of the crawl
  2. prioritizing the download queue
  3. storing the downloaded pages for easy retrieval, preferably in compressed form
  4. logging and status reporting
  5. normalizing the urls
  6. being compliant with HTTP specifications, e.g. using robots.txt
  7. being polite to servers
  8. recrawling urls

Let's go through the interesting ones…

  • restricting the scope of the crawl: generally accomplished via 2 methods – duplicate url detection (removing urls that have already been downloaded or are already in the queue) and user-defined crawl scopes. The first is performed automatically by the crawler, the second is configured by the user. Ways in which the user can limit the crawl scope include: limit urls to the same hosts as the seed urls, limit by depth from seed, regex, etc. (a rough sketch of the user-defined side follows this list)

    A note about restricting the scope of a crawl by content, for whole-web focused crawlers: content-based sites tend to cluster (i.e. link to each other), and crawling an entire cluster is easy. However, clusters not linked from the seed urls are difficult to get to, and moving up a cluster might be difficult, i.e. given 2 clusters A and B and 2 link hubs A1 and B1, A1 and B1 might share a link, but the common site may not link to both or either of them.

  • prioritizing the download queue: important for both whole-web and focused crawlers, but for different reasons.

    Whole-web crawlers
    The last thing a whole-web crawler wants to do is crawl pages indiscriminately. It generally wants to crawl pages that are both authoritative and have a high incoming and outgoing link count (hubs) as early as possible. It also needs to distribute requests to be polite to its servers.

    Focused crawlers
    Focused crawlers can be further broken down into

    • whole-web focused crawlers which are interested in a subset of the whole-web, usually partitioned by language or content. Queue prioritization would be similar to whole-web.
    • Site mirroring focused crawlers (SMFC) which are only interested in their seed urls (and possibly all pages within those domains/hosts). SMFCs can radically decrease overall crawl time by being smart about prioritizing their urls, especially when the number of hosts to crawl is small (<100). Btw, I suspect that the majority of developers who use Nutch actually belong to this category.

  • url normalization: ensures that urls that would be filtered out don't sneak through. Includes remembering redirects encountered, for example.
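
Here's the promised sketch of the user-defined side of scope restriction (same-host, depth and regex limits). This is a hypothetical illustration, not a class from Nutch or OC:


import java.net.URL;
import java.util.Set;
import java.util.regex.Pattern;

public class CrawlScope {
  private final Set<String> seedHosts;  // limit urls to the same hosts as the seed urls
  private final int maxDepth;           // limit by depth from seed
  private final Pattern allowPattern;   // regex limit

  public CrawlScope(Set<String> seedHosts, int maxDepth, Pattern allowPattern) {
    this.seedHosts = seedHosts;
    this.maxDepth = maxDepth;
    this.allowPattern = allowPattern;
  }

  public boolean inScope(URL url, int depthFromSeed) {
    return seedHosts.contains(url.getHost())
        && depthFromSeed <= maxDepth
        && allowPattern.matcher(url.toString()).matches();
  }
}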

More advanced crawlers also

  • checkpoint at regular intervals so interrupted crawls can be recovered
  • detect and avoid bot traps
  • throttle bandwidth usage according to bandwidth availability and server response time
  • allow crawling speed to be scaled (linearly) by adding more crawl boxes

Ignoring the advanced features, the high-level modules are (a rough interface sketch follows the list):

  • HTTP request and response module
  • Link extractor (can be regex or html parser)
  • download queue (we'll call this the fetchlist from now on), plus a record of which urls are currently being downloaded
  • database of downloaded urls
  • database of downloaded pages
  • link graph/database
  • url normalizer
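
To make that concrete, here's one hypothetical way those modules could be expressed as interfaces. The names are mine, not Nutch's or OC's:


import java.util.List;

interface HttpFetcher   { byte[] fetch(String url) throws Exception; }                    // HTTP request and response
interface LinkExtractor { List<String> extractLinks(byte[] page, String baseUrl); }       // regex or html parser
interface Fetchlist     { void add(String url); String next(); boolean isEmpty(); }       // download queue
interface FetchedUrlDb  { boolean contains(String url); void add(String url); }           // database of downloaded urls
interface PageStore     { void store(String url, byte[] page); byte[] get(String url); }  // database of downloaded pages
interface LinkDb        { void addLink(String fromUrl, String toUrl); }                   // link graph/database
interface UrlNormalizer { String normalize(String url); }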

That's all for now. In a follow-up post, I'll walk through the typical application flow of a crawler, and the interactions between the modules/data structures.

Practical introduction to Nutch MapReduce

Posted by Kelvin on 28 Sep 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch, crawling

Some terminology first:

Mapper
Performs the map() function. The name will make sense when you look at it as "mapping" a function/operation to elements in a list.
Reducer
Performs the reduce() function. Merges multiple input values with the same key to produce a single output value.
OutputFormat/InputFormat
Classes which tell Nutch how to process input files and what output format the MapReduce job is to produce.

Currently implemented classes: MapFile (output only), Text and SequenceFile. What this basically means is that out of the box, Nutch supports SequenceFiles and text files as input formats for MapReduce jobs.

Partitioner
Splits up the map output into n different partitions, where n is the number of reduce tasks.
Combiner
Javadoc says: "Implements partial value reduction during mapping." There is no special interface for a Combiner. Rather, it is a Reducer which is configured to be used during the mapping phase.
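
To see the map and reduce roles without any Nutch classes in the way, here's the idea boiled down to plain Java, counting words across two "input records" by hand:


import java.util.*;

public class MapReduceByHand {
  public static void main(String[] args) {
    List<String> inputRecords = Arrays.asList("hello world", "hello nutch");

    // "map" phase: each record is independently mapped to (word, 1) pairs
    List<Map.Entry<String, Integer>> mapped = new ArrayList<Map.Entry<String, Integer>>();
    for (String record : inputRecords) {
      for (String word : record.split(" ")) {
        mapped.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
      }
    }

    // "reduce" phase: all values sharing the same key are merged into one output value
    Map<String, Integer> reduced = new TreeMap<String, Integer>();
    for (Map.Entry<String, Integer> pair : mapped) {
      Integer sum = reduced.get(pair.getKey());
      reduced.put(pair.getKey(), (sum == null ? 0 : sum) + pair.getValue());
    }

    System.out.println(reduced); // {hello=2, nutch=1, world=1}
  }
}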

To complete a MapReduce job successfully, Nutch requires at least

  • a directory containing input files (in the format determined by JobConf.setInputFormat, which defaults to newline-terminated text files)
  • a mapper or reducer (though technically the job will still complete if neither is provided, as the HelloWorld example shows)
  • an output format

To map or to reduce?
Rule of thumb: when performing an operation on a single input key/value pair, use a Mapper; when multiple inputs need to be combined, or files merged, use a Reducer.

Technically, the implementation of MapReduce in Nutch allows for the map function to be performed by only the Reducer, although this would seem to be a rather inappropriate use of a Reducer.

Hello World for MapReduce

Posted by Kelvin on 28 Sep 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch, crawling

Here's a Hello World tutorial as part of my attempts to grok MapReduce.

HelloWorld.java :

import org.apache.nutch.mapred.JobClient;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.util.NutchConf;
import java.io.File;

public class HelloWorld {

  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.out.println("HelloWorld ");
      System.exit(-1);
    }

    NutchConf defaults = NutchConf.get();           // reads nutch-default.xml / nutch-site.xml
    JobConf job = new JobConf(defaults);
    job.setInputDir(new File(args[0]));             // directory containing the input text file(s)
    job.setOutputFormat(ConsoleOutputFormat.class); // print each value to System.out
    JobClient.runJob(job);                          // no mapper/reducer set, so input records pass straight through
  }
}

and ConsoleOutputFormat.java (for printing to System.out):

import org.apache.nutch.fs.NutchFileSystem;
import org.apache.nutch.io.Writable;
import org.apache.nutch.io.WritableComparable;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.mapred.OutputFormat;
import org.apache.nutch.mapred.RecordWriter;
import org.apache.nutch.mapred.Reporter;

import java.io.IOException;

public class ConsoleOutputFormat implements OutputFormat {
  public RecordWriter getRecordWriter(NutchFileSystem fs, JobConf job, String name) throws IOException {
    return new RecordWriter() {
      public void write(WritableComparable key, Writable value) {
        System.out.println(value);
      }

      public void close(Reporter reporter) {

      }
    };
  }
}

And now create a new directory someplace, and create a new file (say, foo.txt). Fire up your text editor, and in foo.txt, enter:

Hello World

Important: There MUST be a newline at the end of the file, following the words "Hello World". It is also important (for the purposes of this tutorial) that the directory is newly created and contains only this one file.

Now, run the HelloWorld application, providing the location of the directory where foo.txt resides, for example

java HelloWorld /tmp/mapreduce/

Note: It's best to run the application directly from your IDE, so you won't have to worry about adding the necessary libs to the classpath.

After some output from Nutch about parsing the config files, you should see the text Hello World.

Congratulations! You've run your first MapReduce program!

OC and Nutch MapReduce

Posted by Kelvin on 15 Sep 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch, crawling

(or what's next for OC)…

I've received a couple of emails about the future of OC vis-à-vis incorporation into the Nutch codebase, the upcoming MapReduce merge into trunk, etc.

My thoughts are:

  1. When MapReduce is merged into trunk, I'll make appropriate changes to OC to support MapReduce.
  2. This MapReduce-compatible OC will be offered to the Nutch codebase. As of today, I've removed most usages of JDK1.5 features, so the other thing that needs to be removed is the dependency on the Spring Framework.
  3. I _might_ keep a version of OC which is more experimental and uses Spring, and can operate on a more standalone basis. The advantages (more autonomy over what I can do and not, Spring support) will have to be balanced against the disadvantages (duplication).

27092005 edit
I'm browsing the MapReduce sources now, and I've found that the complexity of the Fetcher has increased rather significantly. I'm leaning towards maintaining a simple non-mapred fetcher for folks who don't need the MapReduce scalability.

Inside Our Crawler

Posted by Kelvin on 25 Aug 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Inspired by http://wiki.apache.org/nutch/DissectingTheNutchCrawler

I guess the contents of this post will eventually make it to javadocs. Note that Our Crawler (OC) is really not so different from Nutch Crawler (NC). This document highlights the main differences, as well as important classes.

CrawlTool
CrawlTool is the point of entry to OC. It doesn't do very much, really; it just calls Fetcher.run().

Fetcher
The Fetcher doesn't do very much either. It starts the FetcherThreads and provides them SeedURLs from a CrawlSeedSource. Its 2 other main responsibilities are distributing URLs amongst threads and reporting FetcherStatus. In a minor design quirk, the PostFetchProcessor is also found in Fetcher and not in FetcherThread instances, thus imposing the requirement on PostFetchProcessors to be thread-safe. The rationale behind this is to be able to write to a single Nutch segment, instead of requiring an x-way post-fetch segment merge, where x is the number of fetcher threads. (I haven't put much thought into this. If post-fetch processing is substantial, then it makes more sense to make this a per-thread thing.)

FetcherThread
A FetcherThread has the following high-level responsibilities:

  1. Maintain the FetchList and the db of fetched urls for a set of urls (as determined by the Fetcher's url distribution strategy). The FetcherThread is also responsible for avoiding duplicate URLs (whether fetched, parsed or output)
  2. Delegate the actual URL downloading to Http and HttpResponse classes
  3. Process the outcome of downloading, which primarily has the following steps:
    1. Parse the HTML
    2. Extract links from the parsed page and run each through the FetchListScope, adding it to the relevant thread's link queue (for subsequent addition to the fetchlist) if the link is allowed
    3. Run the fetch output through the PostFetchScope, passing it to the PostFetchProcessor if allowed

FetcherThread periodically moves items from its link queue to its fetchlist. The link queue is the thread-safe holding area where other fetcher threads add URLs to be fetched.

A note about the PostFetchProcessor : I know it sounds weird :-).
Initially I called it PageOutputter or something like that, which leaves an equally bad taste in my mouth. I wanted to capture the idea of something that happens _after_ a url/page is downloaded, whether that's saving to a Nutch segment, sending an email to someone every 100 downloaded pages or triggering a backup.

FetchList
Now on to the fetchlist system, one of the biggest differences between NC and OC. The main actors are FetchList and HostQueue. Unlike Nutch, where a FetchList is a sequence of URLs, our FetchList is a sequence of HostQueues, which are in turn sequences of URLs with the same host. The FetchList also manages the server politeness policy (unlike Nutch, where this is done in the Http class).

Different implementations of FetchList may choose different strategies of how to go about prioritizing certain hosts/URLs over others. This process can even be randomized (in which case, it somewhat simulates the Nutch fetchlist, although Nutch's fetchlist is deterministic and OC's DefaultFetchList is NOT).
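
To illustrate the structure (this is a hypothetical sketch, not OC's actual code): a fetchlist holds one queue per host, and politeness lives in the fetchlist itself by tracking when each host may next be hit.


import java.util.*;

class HostQueue {
  final String host;
  final Queue<String> urls = new LinkedList<String>();  // urls sharing this host
  long earliestNextFetch = 0;                           // politeness: when this host may be hit again
  HostQueue(String host) { this.host = host; }
}

class SimpleFetchList {
  private final Map<String, HostQueue> hosts = new HashMap<String, HostQueue>();
  private final long politenessDelayMs;

  SimpleFetchList(long politenessDelayMs) { this.politenessDelayMs = politenessDelayMs; }

  synchronized void add(String host, String url) {
    HostQueue q = hosts.get(host);
    if (q == null) hosts.put(host, q = new HostQueue(host));
    q.urls.add(url);
  }

  // Returns a url whose host is allowed to be fetched right now, or null if none is ready.
  // A prioritizing implementation would pick hosts/urls in a smarter order here.
  synchronized String next() {
    long now = System.currentTimeMillis();
    for (HostQueue q : hosts.values()) {
      if (!q.urls.isEmpty() && now >= q.earliestNextFetch) {
        q.earliestNextFetch = now + politenessDelayMs;
        return q.urls.remove();
      }
    }
    return null;
  }
}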

LastModifiedDB and FetchedURLs
These are both new to OC, even though Nutch's WebDB serves roughly the same purpose. The javadocs should be sufficient for understanding the roles of these 2 classes. One thing to note about FetchedURLs: to save space, the current implementation does _not_ save the actual URL, but rather a 64-bit checksum of it.

Scopes and Filters
A Scope consists of zero or more filters, where given an input, each filter replies ALLOW, REJECT or ABSTAIN. Self-explanatory. When all filters in a scope abstain, then the scope's allowByDefault value kicks in (also used when a scope has no filters).

The different scopes in use are: FetchListScope, ParseScope and PostFetchScope.
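
A hypothetical sketch of that contract (not OC's actual code; here the first filter that doesn't abstain decides, which is one reasonable way to combine verdicts):


import java.util.List;

enum Verdict { ALLOW, REJECT, ABSTAIN }

interface Filter {
  Verdict check(String url);
}

class Scope {
  private final List<Filter> filters;
  private final boolean allowByDefault;  // used when every filter abstains (or there are no filters)

  Scope(List<Filter> filters, boolean allowByDefault) {
    this.filters = filters;
    this.allowByDefault = allowByDefault;
  }

  boolean isAllowed(String url) {
    for (Filter f : filters) {
      Verdict v = f.check(url);
      if (v == Verdict.ALLOW) return true;
      if (v == Verdict.REJECT) return false;
      // ABSTAIN: fall through to the next filter
    }
    return allowByDefault;
  }
}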
