Average length of a URL
Posted by Kelvin on 06 Nov 2009 at 06:48 pm | Tagged as: Lucene / Solr / Nutch, crawling, programming
I’ve always been curious about the average length of a URL, mostly for estimating how much memory it takes to hold URLs in RAM.
Well, I dumped the DMOZ URLs, then sorted and uniq-ed the list.
I ended up with 4074300 unique URLs weighing in at 139406406 bytes, which works out to about 34.2 bytes, or roughly 34 characters, per URL.
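For anyone who wants to repeat the measurement on their own list, here is a minimal Python sketch of the same idea. The function name `average_url_length` is made up for illustration, and it assumes one URL per string, measured in UTF-8 bytes; it is not the exact procedure used for the figures above.

```python
def average_url_length(urls):
    """Return (unique_count, total_bytes, mean_bytes) for an iterable of URL strings.

    Hypothetical helper: de-duplicates the input (the "sort | uniq" step),
    then measures each URL in UTF-8 bytes.
    """
    # Strip whitespace and newlines, drop blank lines, and de-duplicate.
    unique = {u.strip() for u in urls if u.strip()}
    if not unique:
        raise ValueError("no URLs given")
    total = sum(len(u.encode("utf-8")) for u in unique)
    return len(unique), total, total / len(unique)


# The same division applied to the figures reported in the post.
count, total_bytes = 4074300, 139406406
print(round(total_bytes / count, 2))  # 34.22
```

If the byte count was taken from a file with one URL per line, it would include one newline per URL, so the true mean would be about one byte lower. The post does not say how the bytes were counted, so treat that as a guess.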

Interesting, but I would like to see more statistical analysis of the URLs, e.g. the smallest and largest lengths and the 95th/98th/99.5th percentiles of URL length. I see that the raw mean helps with your requirement, but it doesn’t answer questions like “How long should I make a database field to store a URL?”
If we knew, for example, that 98% of URLs fit into a VARCHAR(100) field and 99.5% fit into VARCHAR(200), it would be easier to make these choices. Perhaps you would like to contribute such knowledge for the benefit of all?
Regards,
Brodie
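
The distribution asked about in the comment could be computed along these lines. This is only a sketch: the function names are hypothetical, percentiles use the simple nearest-rank definition, and lengths are counted in characters on the assumption that the VARCHAR limit is a character count (some databases count bytes instead). No results for the DMOZ data are implied.

```python
import math


def length_percentile(lengths, pct):
    """Nearest-rank percentile: the smallest length covering pct% of the URLs."""
    ordered = sorted(lengths)
    if not ordered:
        raise ValueError("no lengths given")
    # Rank is 1-based; clamp to 1 so very small percentiles still return a value.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_fitting(lengths, limit):
    """Fraction of URLs whose length is at most `limit` (e.g. a VARCHAR size)."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no lengths given")
    return sum(1 for n in lengths if n <= limit) / len(lengths)


def summarize(urls, percentiles=(95, 98, 99.5), limits=(100, 200)):
    """Smallest, largest, chosen percentiles, and fit rates for candidate field sizes."""
    lengths = [len(u) for u in urls]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "percentiles": {p: length_percentile(lengths, p) for p in percentiles},
        "fits": {n: fraction_fitting(lengths, n) for n in limits},
    }
```

Running `summarize` over the unique URL list would give the smallest and largest lengths, the requested percentiles, and the share of URLs that fit in each candidate field size.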