Dom4j + XPath + TagSoup - Namespaces = sweet!
Posted by Kelvin on 20 Jan 2010 at 04:02 pm | Tagged as: programming
TagSoup has the annoying habit of putting every element of the HTML it cleans into the XHTML namespace.
This annoyance becomes a major hindrance when writing XPath queries against TagSoup-cleaned HTML.
Instead of using
we have to do
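To make the difference concrete, here is a minimal sketch. It uses the JDK's built-in javax.xml.xpath rather than dom4j so it runs without extra jars, and the class name, the sample document and the html prefix are all assumptions for illustration. The point it shows holds for XPath 1.0 in general: an unprefixed name only matches elements in no namespace, so namespaced elements need a prefix bound to the namespace URI.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of dom4j or TagSoup.
public class PrefixedXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Counts the nodes an XPath expression selects in a tiny XHTML-namespaced document. */
    static int count(String expression) {
        try {
            String xml = "<html xmlns=\"" + XHTML_NS + "\"><body><p>hi</p></body></html>";
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the (assumed) prefix "html" to the XHTML namespace URI.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "html" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                            ? Collections.singletonList("html").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//body/p"));            // unprefixed query
        System.out.println(count("//html:body/html:p"));  // prefixed query
    }
}
```

The unprefixed query selects nothing, while the prefixed one finds the paragraph.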
I spent a couple hours trying to figure out how to disable namespace prefixes in TagSoup.
This does not work:
This doesn’t work either:
Finally stumbled on a crude brute-force solution at http://www.mail-archive.com/dom4j-user%40lists.sourceforge.net/msg02511.html:
    import java.util.List;

    import org.dom4j.Attribute;
    import org.dom4j.Document;
    import org.dom4j.Element;
    import org.dom4j.Namespace;
    import org.dom4j.Node;
    import org.dom4j.QName;

    // The class name and the declaration of the removeNamespaces flag are
    // assumptions: the snippet uses the flag but never declares it.
    public class NamespaceUtil {

        private static boolean removeNamespaces = true;

        /**
         * Removes namespaces if removeNamespaces is true.
         * Only the root's content is rewritten; the root element itself
         * keeps its namespace.
         */
        public static void fixNamespaces(Document doc) {
            Element root = doc.getRootElement();
            if (removeNamespaces && root.getNamespace() != Namespace.NO_NAMESPACE) {
                removeNamespaces(root.content());
            }
        }

        /**
         * Puts the original namespace back on the root's content if
         * removeNamespaces is true.
         */
        public static void unfixNamespaces(Document doc, Namespace original) {
            Element root = doc.getRootElement();
            if (removeNamespaces && original != null) {
                setNamespaces(root.content(), original);
            }
        }

        /**
         * Sets the namespace of the element to the given namespace.
         */
        public static void setNamespace(Element elem, Namespace ns) {
            elem.setQName(QName.get(elem.getName(), ns, elem.getQualifiedName()));
        }

        /**
         * Recursively removes the namespace of the element and all its
         * children: sets them to Namespace.NO_NAMESPACE.
         */
        public static void removeNamespaces(Element elem) {
            setNamespaces(elem, Namespace.NO_NAMESPACE);
        }

        /**
         * Recursively removes the namespace of every node in the list and
         * all their children: sets them to Namespace.NO_NAMESPACE.
         */
        public static void removeNamespaces(List<?> l) {
            setNamespaces(l, Namespace.NO_NAMESPACE);
        }

        /**
         * Recursively sets the namespace of the element and all its children.
         */
        public static void setNamespaces(Element elem, Namespace ns) {
            setNamespace(elem, ns);
            setNamespaces(elem.content(), ns);
        }

        /**
         * Recursively sets the namespace of every element in the list and
         * of all their children.
         */
        public static void setNamespaces(List<?> l, Namespace ns) {
            for (Object o : l) {
                Node n = (Node) o;
                // Attribute branch kept from the original snippet.
                if (n.getNodeType() == Node.ATTRIBUTE_NODE) {
                    ((Attribute) n).setNamespace(ns);
                }
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    setNamespaces((Element) n, ns);
                }
            }
        }
    }
Grrrrrrr... but at least we can say goodbye to prefixes in XPath queries.

Thanks for those parser parameters. They worked very well for me. Your post saved my life!
Hi Kelvin,
I’ve solved this same issue two different ways in the past.
The easiest (brute-force) way is to call SAXReader.setXMLFilter() with a filter that strips off namespaces.
The other approach is to use a utility routine that rewrites the XPath path with the required namespace prefix. With that approach I also wound up having to set the namespace context (XPath.setNamespaceContext(new SimpleNamespaceContext(map))) with a map from the prefix to the full namespace URI.
-- Ken
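The XPath-rewriting approach Ken mentions might look roughly like the sketch below. This is not Ken's code: the class XPathPrefixer and its addPrefix method are hypothetical, and the regex only copes with simple location paths. It leaves attributes, wildcards, function calls, axes and already-prefixed steps alone, and it will get confused by names inside predicates or by string literals that contain a slash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a naive sketch, not a full XPath parser.
public class XPathPrefixer {

    // Matches an element name at the start of the path or right after '/',
    // provided the name is not followed by ':' (prefix or axis) or '(' (function).
    private static final Pattern STEP =
            Pattern.compile("(^|/)([A-Za-z_][A-Za-z0-9_.-]*)(?![A-Za-z0-9_.-]|:|\\()");

    /** Adds "prefix:" in front of every unprefixed element step it recognises. */
    static String addPrefix(String xpath, String prefix) {
        Matcher m = STEP.matcher(xpath);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + prefix + ":" + m.group(2);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(addPrefix("//body/div[@id='main']/p", "html"));
        System.out.println(addPrefix("/html/body/text()", "html"));
    }
}
```

The rewritten path still needs the prefix bound to the namespace URI before it is evaluated, which is where the XPath.setNamespaceContext(new SimpleNamespaceContext(map)) call from the comment above comes in.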
@Ken - that’s a great idea with SAXReader.setXMLFilter(). Probably much cleaner than the method I posted about.
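For anyone who wants to try the filter route, a namespace-stripping filter might look something like this sketch. The class name NamespaceStrippingFilter and the seenElements demo helper are made up for illustration, and the demo runs against the JDK's own SAX parser rather than TagSoup so that it needs no extra jars. Attributes are passed through untouched, so namespaced attributes would need the same treatment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: reports every element as if it were in no namespace.
public class NamespaceStrippingFilter extends XMLFilterImpl {

    // Falls back to the qualified name (minus any prefix) if no local name is reported.
    private static String local(String localName, String qName) {
        if (localName != null && !localName.isEmpty()) return localName;
        int colon = qName.indexOf(':');
        return colon < 0 ? qName : qName.substring(colon + 1);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        String name = local(localName, qName);
        super.startElement("", name, name, atts); // drop the namespace URI and any prefix
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String name = local(localName, qName);
        super.endElement("", name, name);
    }

    // Swallow prefix mappings so no namespace declarations reach the handler.
    @Override public void startPrefixMapping(String prefix, String uri) { }
    @Override public void endPrefixMapping(String prefix) { }

    /** Demo helper: parses xml through the filter and records "{uri}qName" per element. */
    static List<String> seenElements(String xml) {
        final List<String> seen = new ArrayList<>();
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader parser = spf.newSAXParser().getXMLReader();
            NamespaceStrippingFilter filter = new NamespaceStrippingFilter();
            filter.setParent(parser);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes a) {
                    seen.add("{" + uri + "}" + qName);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(seenElements(
                "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>"));
    }
}
```

With dom4j and TagSoup, the wiring would presumably be reader.setXMLReader(new org.ccil.cowan.tagsoup.Parser()) followed by reader.setXMLFilter(new NamespaceStrippingFilter()) on a SAXReader, after which unprefixed XPath queries should match. I have not verified that combination here.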