TagSoup has an annoying habit: it puts every element of the HTML it cleans into the XHTML namespace (http://www.w3.org/1999/xhtml).

This annoyance becomes a major hindrance when writing XPath queries against TagSoup-cleaned HTML.

Instead of using

//body/a/@href


we have to do

//html:body/html:a/@href
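
The reason is that in XPath 1.0 an unprefixed name like body only matches elements in no namespace, so the html prefix has to be bound to the XHTML namespace before the query finds anything. Here is a rough sketch of that binding using only the JDK's javax.xml.xpath API. The class name PrefixedXPathDemo, the helper evaluate and the sample page are made up for illustration, and the input is a hand-written namespaced string standing in for TagSoup output, not something TagSoup actually produced.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PrefixedXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Evaluates an XPath expression as a string, with "html" bound to the XHTML namespace. */
    static String evaluate(String expression, String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep the namespaces, as TagSoup output would
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "html" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                            ? Collections.singletonList("html").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample page, written by hand for this sketch.
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"http://example.com/\">link</a></body></html>";
        System.out.println("prefixed:   [" + evaluate("//html:body/html:a/@href", page) + "]");
        System.out.println("unprefixed: [" + evaluate("//body/a/@href", page) + "]");
    }
}
```

The prefix itself is arbitrary. Only the namespace URI it maps to has to match what is in the document.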

I spent a couple of hours trying to figure out how to disable namespace prefixes in TagSoup.

This does not work:

parser.setFeature(org.ccil.cowan.tagsoup.Parser.namespacesFeature, false);


This doesn’t work either:

parser.setFeature(org.ccil.cowan.tagsoup.Parser.namespacePrefixesFeature, false);

I finally stumbled on a crude brute-force solution at http://www.mail-archive.com/dom4j-user%40lists.sourceforge.net/msg02511.html

Grrrrrrr… but at least we can say goodbye to prefixes in XPath queries.
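
For the record, here is one way to brute-force the namespaces away at the SAX level. This is my own sketch and not necessarily what the linked post does: a small XMLFilterImpl subclass that re-reports every element and attribute with an empty namespace URI and drops the xmlns declarations. The names StripNamespacesDemo, NamespaceStripper and firstHref are invented for illustration, and to keep it runnable with nothing but the JDK it wraps the JDK's own SAX parser and a hand-written XHTML string rather than TagSoup.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

public class StripNamespacesDemo {

    /** SAX filter that reports every element and attribute without a namespace. */
    static class NamespaceStripper extends XMLFilterImpl {
        NamespaceStripper(XMLReader parent) {
            super(parent);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            AttributesImpl plain = new AttributesImpl();
            for (int i = 0; i < atts.getLength(); i++) {
                String name = atts.getQName(i);
                if (name.equals("xmlns") || name.startsWith("xmlns:")) {
                    continue; // drop namespace declarations if the parser reports them
                }
                String local = atts.getLocalName(i);
                plain.addAttribute("", local, local, atts.getType(i), atts.getValue(i));
            }
            super.startElement("", localName, localName, plain);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            super.endElement("", localName, localName);
        }

        @Override
        public void startPrefixMapping(String prefix, String uri) {
            // swallowed on purpose: downstream never hears about the namespace
        }

        @Override
        public void endPrefixMapping(String prefix) {
            // swallowed on purpose
        }
    }

    /** Parses through the filter, builds a DOM, and runs the prefix-free query on it. */
    static String firstHref(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // the filter needs real local names
            XMLReader stripped = new NamespaceStripper(factory.newSAXParser().getXMLReader());

            // The identity transform just copies the filtered SAX events into a DOM tree.
            DOMResult dom = new DOMResult();
            TransformerFactory.newInstance().newTransformer().transform(
                    new SAXSource(stripped, new InputSource(new StringReader(xhtml))), dom);

            return XPathFactory.newInstance().newXPath()
                    .evaluate("//body/a/@href", dom.getNode());
        } catch (Exception e) {
            throw new IllegalStateException("parsing or XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample page, written by hand for this sketch.
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"http://example.com/\">link</a></body></html>";
        System.out.println(firstHref(page));
    }
}
```

Since org.ccil.cowan.tagsoup.Parser is itself an XMLReader, it should be possible to pass a TagSoup parser instance to the NamespaceStripper constructor instead of the JDK parser, but I have only tried the sketch with the JDK parser as shown, so treat that swap as an assumption.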