Kelvin Tan - Solr/Elasticsearch Consultant - java.net.URL synchronization bottleneck

Posted by Kelvin on 08 Dec 2009 at 02:40 pm | Tagged as: programming, crawling

This is interesting because I haven't found anything on Google about it.

There's a static Hashtable in java.net.URL (urlStreamHandlers) that gets accessed on every constructor call. Hashtable methods are synchronized, so every thread constructing a URL contends for the same lock. When you're running a crawler with, say, 50 threads, that turns out to be a major bottleneck.

Of the 70 threads I had running, 48 were blocked on the java.net.URL constructor. I was using the URL class for resolving relative URLs to absolute ones.
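For anyone wanting to check their own crawler for the same symptom, here is a rough sketch of counting blocked threads from inside the JVM. It is not how the numbers above were obtained; the class and method names (BlockedThreadCounter, countBlockedIn) are made up for illustration, and the count is only approximate because each thread's state and stack trace are sampled at slightly different moments.

```java
import java.util.Map;

class BlockedThreadCounter {

    /**
     * Counts live threads that are in the BLOCKED state and have at least
     * one stack frame belonging to the given class (hypothetical helper).
     */
    static int countBlockedIn(String className) {
        int blocked = 0;
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            // Only threads waiting to enter a synchronized block or method count.
            if (entry.getKey().getState() != Thread.State.BLOCKED) {
                continue;
            }
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().equals(className)) {
                    blocked++;
                    break; // count each thread once
                }
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("Threads blocked in java.net.URL: "
                + countBlockedIn("java.net.URL"));
    }
}
```

Calling countBlockedIn("java.net.URL") periodically from a monitoring thread while the crawler runs gives a quick read on how many workers are stuck. A plain thread dump shows the same information.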

Since I had previously written a URL parser to parse out the parts of a URL, I went ahead and implemented my own URL resolution function.
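My own parser isn't shown here, but as a minimal sketch of the same idea (resolving relative links without constructing a java.net.URL), the standard java.net.URI class can do the resolution as plain string parsing, with no stream handler lookup. The class name RelativeUrlResolver and the example URLs are assumptions for illustration only.

```java
import java.net.URI;

class RelativeUrlResolver {

    /**
     * Resolves a possibly relative link against an absolute base URL
     * using java.net.URI instead of java.net.URL.
     * Throws IllegalArgumentException if either string is not a valid URI.
     */
    static String resolve(String base, String link) {
        URI baseUri = URI.create(base);
        // resolve() merges the paths and removes "." and ".." segments.
        return baseUri.resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b.html";
        System.out.println(resolve(base, "c.html"));             // http://example.com/a/c.html
        System.out.println(resolve(base, "../c.html"));          // http://example.com/c.html
        System.out.println(resolve(base, "/x/y"));               // http://example.com/x/y
        System.out.println(resolve(base, "http://other.org/z")); // http://other.org/z
    }
}
```

One caveat: java.net.URI is strict about syntax, so links scraped from real pages (unescaped spaces and the like) can make URI.create throw. A crawler would need to catch that or clean the link first, which is one reason a hand-written resolver can still be the better fit.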

Throughput went from

Status: 12.407448 pages/s, 207.06316 kb/s, 2136.143 bytes/page

to

Status: 43.9947 pages/s, 557.29156 kb/s, 1621.4071 bytes/page

after increasing the number of threads to 100 (which would not have made much difference with the java.net.URL implementation).

Cool stuff.


Supermind Search Consulting Blog Solr - Elasticsearch - Big Data
