Dom4j + XPath + TagSoup - Namespaces = sweet!
Posted by Kelvin on 20 Jan 2010 at 04:02 pm | Tagged as: programming
TagSoup has the annoying habit of adding namespaces to the HTML it cleans.
This annoyance becomes a major hindrance when formulating XPath queries against TagSoup-cleaned HTML.
Instead of using
//body/a/@href
we have to do
//html:body/html:a/@href
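To see why the prefixed form is needed, here is a minimal sketch that uses only the JDK's built-in DOM and XPath classes rather than dom4j, so it runs without extra jars. It assumes the cleaned markup lives in the XHTML namespace (http://www.w3.org/1999/xhtml) and that the html prefix is bound to it; the class name PrefixDemo, the eval helper, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of TagSoup or dom4j.
final class PrefixDemo {
    // Assumption: namespaced output uses the XHTML namespace.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Parses namespace-aware XML and returns the string value of the XPath result.
    static String eval(String xml, String expr) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the "html" prefix to the XHTML namespace for use in queries.
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) { return null; }
            @Override public Iterator<String> getPrefixes(String uri) { return null; }
        });
        return xpath.evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html xmlns=\"" + XHTML_NS + "\"><body>"
                + "<a href=\"http://example.com/\">link</a></body></html>";
        // The unprefixed query matches nothing, so the result is an empty string.
        System.out.println("[" + eval(xml, "//body/a/@href") + "]");
        // The prefixed query finds the attribute.
        System.out.println("[" + eval(xml, "//html:body/html:a/@href") + "]");
    }
}
```

In XPath 1.0 an unprefixed name test only matches elements in no namespace, which is why the short query comes back empty once every element carries the XHTML namespace.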
I spent a couple of hours trying to figure out how to disable namespace prefixes in TagSoup.
This does not work:
parser.setFeature(org.ccil.cowan.tagsoup.Parser.namespacesFeature, false);
This doesn’t work either:
parser.setFeature(org.ccil.cowan.tagsoup.Parser.namespacePrefixesFeature, false);
I finally stumbled on a crude brute-force solution at http://www.mail-archive.com/dom4j-user%40lists.sourceforge.net/msg02511.html
Grrrrrrr… but at least we can say goodbye to prefixes in XPath queries.
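For readers who want the general shape of a brute-force fix, here is a minimal sketch of one such approach: walk the parsed tree and move every element into no namespace. It is not copied from the linked message and may differ from it. It uses the JDK's own DOM instead of dom4j so that it runs without extra jars (in dom4j the analogous move would presumably be resetting each element's QName while visiting the tree). The class name StripNamespaces and its helpers are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of TagSoup or dom4j.
final class StripNamespaces {
    // Recursively renames every element to its local name in no namespace.
    static void strip(Document doc, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // renameNode may return a replacement node, so keep its result.
            node = doc.renameNode(node, null, node.getLocalName());
            // Drop the leftover default-namespace declaration, if any.
            ((Element) node).removeAttribute("xmlns");
        }
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the sibling first in case the child gets replaced.
            Node next = child.getNextSibling();
            strip(doc, child);
            child = next;
        }
    }

    // Parses, strips namespaces, then evaluates a plain unprefixed XPath.
    static String eval(String xml, String expr) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        strip(doc, doc.getDocumentElement());
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"http://example.com/\">link</a></body></html>";
        // After stripping, the short query works without any prefix.
        System.out.println(eval(xml, "//body/a/@href"));
    }
}
```

The trade-off is an extra pass over the whole tree after parsing, which is crude but keeps every query free of prefixes.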
