Thoughts on Lucene, Solr, Nutch and vertical search 

crawling

Archived Posts from this Category

java.net.URL synchronization bottleneck

Posted by Kelvin on 08 Dec 2009 | Tagged as: crawling, programming

This is interesting because I haven’t found anything on Google about it.

There’s a static Hashtable in java.net.URL (urlStreamHandlers) which is accessed on every constructor call. Since Hashtable methods are synchronized, every thread constructing a URL contends for the same lock. It turns out that when you’re running a crawler with, say, 50 threads, this becomes a major bottleneck.

Of the 70 threads I had running, 48 were blocked in the java.net.URL constructor. I was using the URL class to resolve relative URLs to absolute ones.
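A count like this can be read off a thread dump (jstack, for example), or taken from inside the JVM. The following is a minimal sketch and not the tooling used for the numbers above: the class name BlockedThreadCounter and the method countBlockedIn are made up for illustration. It counts threads that are in the BLOCKED state and have a java.net.URL frame somewhere on their stack.

```java
import java.util.Map;

// Hypothetical helper, for illustration only.
final class BlockedThreadCounter {

    // Counts threads that are BLOCKED and have at least one stack frame
    // belonging to the given class. The state is read slightly after the
    // stack snapshot was taken, so treat the result as an approximation.
    static int countBlockedIn(Map<Thread, StackTraceElement[]> dump, String className) {
        int blocked = 0;
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            if (entry.getKey().getState() != Thread.State.BLOCKED) {
                continue;
            }
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().equals(className)) {
                    blocked++;
                    break;
                }
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        int blocked = countBlockedIn(dump, "java.net.URL");
        System.out.println(blocked + " of " + dump.size() + " threads blocked in java.net.URL");
    }
}
```

Sampling this a few times while the crawler runs gives a rough picture of how much contention there is.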

Since I had previously written a URL parser to parse out the parts of a URL, I went ahead and implemented my own URL resolution function.
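The custom resolver itself isn't shown here. As a stand-in, here is a minimal sketch of resolving relative URLs without going through the java.net.URL constructor, using java.net.URI from the standard library instead (a swapped-in technique, not the hand-written parser described above). The class name RelativeUrlResolver is hypothetical. One caveat: java.net.URI is stricter about illegal characters than java.net.URL, so URLs scraped from real pages may need cleaning up before they parse.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, for illustration only.
final class RelativeUrlResolver {

    // Resolves a possibly relative reference against an absolute base URL.
    // java.net.URI does not look up a URLStreamHandler, so it never touches
    // the shared handler table inside java.net.URL.
    static String resolve(String base, String relative) throws URISyntaxException {
        URI baseUri = new URI(base);
        URI resolved = baseUri.resolve(new URI(relative));
        return resolved.normalize().toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        String base = "http://example.com/a/b/c.html";
        System.out.println(resolve(base, "../d.html"));
        System.out.println(resolve(base, "/e/f.html"));
        System.out.println(resolve(base, "http://other.example/x"));
    }
}
```

Because each call works only on its own objects, there is no shared lock for the crawler threads to queue up on.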

Went from

Status: 12.407448 pages/s, 207.06316 kb/s, 2136.143 bytes/page

to

Status: 43.9947 pages/s, 557.29156 kb/s, 1621.4071 bytes/page

after increasing the number of threads to 100 (which would not have made much difference with the java.net.URL implementation).

Cool stuff.

Average length of a URL

Posted by Kelvin on 06 Nov 2009 | Tagged as: Lucene / Solr / Nutch, crawling, programming

Aug 16 update: I ran a more comprehensive analysis with a more complete dataset. Find out the new figures for the average length of a URL.

I’ve always been curious what the average length of a URL is, mostly for estimating the memory needed to store URLs in RAM.

Well, I dumped the DMOZ URLs, then sorted and uniq-ed the list.

Ended up with 4074300 unique URLs weighing in at 139406406 bytes, which works out to roughly 34 characters per URL.
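The arithmetic behind that figure, plus the kind of back-of-the-envelope memory estimate it is useful for, can be sketched as follows. The class name UrlLengthEstimate and the 100 million URL count are hypothetical, and the estimate covers raw URL bytes only, not the per-object overhead of storing them as Java Strings.

```java
// Hypothetical helper, for illustration only.
final class UrlLengthEstimate {

    // Average size of one URL in bytes.
    static double averageBytes(long totalBytes, long urlCount) {
        return (double) totalBytes / urlCount;
    }

    // Raw bytes needed to hold urlCount URLs of the given average size.
    static long estimateRawBytes(long urlCount, double averageBytes) {
        return Math.round(urlCount * averageBytes);
    }

    public static void main(String[] args) {
        // Figures from the DMOZ dump above.
        double average = averageBytes(139_406_406L, 4_074_300L);
        System.out.printf("average: %.1f bytes per URL%n", average);

        // Assumed crawl size, purely illustrative.
        long assumedUrlCount = 100_000_000L;
        long rawBytes = estimateRawBytes(assumedUrlCount, average);
        System.out.println("raw bytes for " + assumedUrlCount + " URLs: " + rawBytes);
    }
}
```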

07/04/09 | Kelvin Tan | Lucene Solr Nutch Consultant