Average length of a URL
Posted by Kelvin on 06 Nov 2009 at 06:48 pm | Tagged as: Lucene / Solr / Nutch, crawling, programming
Aug 16 update: I ran a more comprehensive analysis with a more complete dataset. Find out the new figures for the average length of a URL.
I’ve always been curious what the average length of a URL is, mostly when approximating memory requirements of storing URLs in RAM.
Well, I did a dump of the DMOZ URLs, sorted and uniq-ed the list of URLs.
Ended up with 4074300 unique URLs weighing in at 139406406 bytes, which works out to roughly 34 characters per URL (139406406 / 4074300 ≈ 34.2).
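For anyone who wants to repeat the measurement on their own list, here is a minimal sketch of the same sort, uniq and divide steps in Python. The function name and the sample URLs are made up for illustration, and counting UTF-8 bytes per URL (without the trailing newline) is an assumption about how the byte total should be measured.

```python
def average_url_length(urls):
    """Return (unique_count, total_bytes, mean_bytes) for an iterable of URLs."""
    # Strip whitespace, drop blank lines, then de-duplicate (the "sort | uniq" step).
    unique = sorted({u.strip() for u in urls if u.strip()})
    if not unique:
        return 0, 0, 0.0
    # Measure each URL in bytes, since the goal is a memory estimate.
    total_bytes = sum(len(u.encode("utf-8")) for u in unique)
    return len(unique), total_bytes, total_bytes / len(unique)


if __name__ == "__main__":
    # Hypothetical sample input; a real run would read the URL dump line by line.
    sample = [
        "http://example.com/",
        "http://example.com/",
        "http://example.org/about",
    ]
    count, total, mean = average_url_length(sample)
    print(count, total, mean)
```

To run it against a dump, pass an open file object instead of the sample list, since iterating over a file yields one line per URL.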

Interesting, but I would be interested to see more statistical analysis of the URLs. i.e. smallest, largest, and 95%/98%/99.5% confidence for length of URL. I see that the raw mean helps with your requirement, but it doesn’t answer questions like “How long should I make a database field to store a URL?”.
If we knew that (as an example), 98% of URLs will fit into a VARCHAR(100) field, and 99.5% will fit into VARCHAR(200) then it would make it easier to make these choices. Perhaps you would like to contribute such knowledge to the benefit of all?
Regards,
Brodie
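The kind of breakdown asked for above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the percentile uses the simple nearest-rank method (other definitions give slightly different values), and the lengths in the usage example are invented, not taken from the DMOZ data.

```python
import math


def length_percentile(lengths, pct):
    """Nearest-rank percentile: smallest length that covers pct% of the URLs."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def fraction_fitting(lengths, limit):
    """Fraction of URLs whose length is at most `limit`, e.g. a VARCHAR size."""
    return sum(1 for n in lengths if n <= limit) / len(lengths)


if __name__ == "__main__":
    # Invented lengths for illustration only.
    lengths = [18, 22, 25, 31, 34, 40, 57, 88, 140, 310]
    print("min:", min(lengths), "max:", max(lengths))
    for pct in (95, 98, 99.5):
        print(pct, "percentile:", length_percentile(lengths, pct))
    for limit in (100, 200):
        print("fits in VARCHAR(%d):" % limit, fraction_fitting(lengths, limit))
```

Both functions assume a non-empty list of lengths; on a real dump the lengths would come from `len(url)` for each unique URL.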
But these are mostly top level pages which usually have shorter URLs.
@Luying - you’re absolutely right, and when I have a bit more time, I’m planning to perform a more comprehensive survey based on blog URLs, news feeds and the lot.
[...] 16 Aug 2010 at 02:49 pm | Tagged as: programming Here’s a follow-up on my previous attempt at calculating the average length of a URL, which was naive and totally [...]