Dom4j + XPath + TagSoup - Namespaces = sweet!
Posted by Kelvin on 20 Jan 2010 at 04:02 pm | Tagged as: programming
TagSoup has the annoying habit of putting the HTML it cleans into a namespace (the XHTML one).
This annoyance becomes a major hindrance when formulating XPath queries against TagSoup-cleaned HTML.
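Instead of using a plain, unprefixed path, we have to bind a prefix to that namespace and repeat the prefix on every single step of the query.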
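To make the difference concrete, here is a small sketch. It uses the XPath API bundled with the JDK rather than dom4j and TagSoup, purely so that it runs with no extra jars. The sample markup, the `h` prefix and the class name are made up for illustration; the markup only stands in for what TagSoup would hand back.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PrefixedXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for cleaned HTML: every element sits in the XHTML namespace.
    static final String SAMPLE =
        "<html xmlns=\"" + XHTML_NS + "\">"
        + "<body><div>one</div><div>two</div></body></html>";

    // Parses the string into a namespace-aware DOM document.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Counts the nodes matched by expr; the prefix "h" is bound to the XHTML namespace.
    static int count(String expr, Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });
        NodeList hits = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(count("//body/div", doc));     // 0: unprefixed names only match no-namespace elements
        System.out.println(count("//h:body/h:div", doc)); // 2: every step needs the prefix
    }
}
```

The query you actually want to write finds nothing, and the one that works drags a prefix along on every step.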
I spent a couple of hours trying to figure out how to disable namespace prefixes in TagSoup.
Neither of the two approaches I tried worked.
I finally stumbled on a crude brute-force solution at http://www.mail-archive.com/dom4j-user%40lists.sourceforge.net/msg02511.html:
    import java.util.List;

    import org.dom4j.Attribute;
    import org.dom4j.Document;
    import org.dom4j.Element;
    import org.dom4j.Namespace;
    import org.dom4j.Node;
    import org.dom4j.QName;

    // The class name and the declaration of the removeNamespaces flag are assumptions:
    // the original snippet used the flag without showing where it was declared.
    public class NamespaceFixer {

        /** When false, fixNamespaces and unfixNamespaces do nothing. */
        public static boolean removeNamespaces = true;

        /**
         * Removes namespaces if removeNamespaces is true.
         * Only the root's content is processed; the root element itself is left as it is.
         */
        public static void fixNamespaces(Document doc) {
            Element root = doc.getRootElement();
            if (removeNamespaces && root.getNamespace() != Namespace.NO_NAMESPACE) {
                removeNamespaces(root.content());
            }
        }

        /**
         * Puts the original namespace back on the root's content if removeNamespaces is true.
         */
        public static void unfixNamespaces(Document doc, Namespace original) {
            Element root = doc.getRootElement();
            if (removeNamespaces && original != null) {
                setNamespaces(root.content(), original);
            }
        }

        /**
         * Sets the namespace of the element to the given namespace.
         */
        public static void setNamespace(Element elem, Namespace ns) {
            elem.setQName(QName.get(elem.getName(), ns, elem.getQualifiedName()));
        }

        /**
         * Recursively removes the namespace of the element and all its children:
         * sets it to Namespace.NO_NAMESPACE.
         */
        public static void removeNamespaces(Element elem) {
            setNamespaces(elem, Namespace.NO_NAMESPACE);
        }

        /**
         * Recursively removes the namespace of every node in the list and all their children:
         * sets it to Namespace.NO_NAMESPACE.
         */
        public static void removeNamespaces(List<?> l) {
            setNamespaces(l, Namespace.NO_NAMESPACE);
        }

        /**
         * Recursively sets the namespace of the element and all its children.
         */
        public static void setNamespaces(Element elem, Namespace ns) {
            setNamespace(elem, ns);
            setNamespaces(elem.content(), ns);
        }

        /**
         * Recursively sets the namespace of every element and attribute node in the list,
         * descending into the children of each element.
         */
        public static void setNamespaces(List<?> l, Namespace ns) {
            for (int i = 0; i < l.size(); i++) {
                Node n = (Node) l.get(i);
                if (n.getNodeType() == Node.ATTRIBUTE_NODE) {
                    ((Attribute) n).setNamespace(ns);
                }
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    setNamespaces((Element) n, ns);
                }
            }
        }
    }
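To see what the brute-force pass buys you, here is the same idea sketched against the DOM that ships with the JDK, so it runs without dom4j or TagSoup on the classpath. This is not the code from the mailing list: it swaps in `Document.renameNode` to drop each element's namespace, and unlike `fixNamespaces` it also renames the root element. The class name and sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StripNamespacesDemo {
    // Hypothetical stand-in for cleaned HTML in the XHTML namespace.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<body><div>one</div><div>two</div></body></html>";

    // Parses the string into a namespace-aware DOM document.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Recursively renames every namespaced element to its local name with no namespace.
    static void stripNamespaces(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE && node.getNamespaceURI() != null) {
            // renameNode may return a replacement node, so continue from its result.
            node = node.getOwnerDocument().renameNode(node, null, node.getLocalName());
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before touching the child
            stripNamespaces(child);
            child = next;
        }
    }

    // Counts the nodes matched by an XPath expression with no namespace bindings at all.
    static int count(String expr, Document doc) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println("before: " + count("//body/div", doc));
        stripNamespaces(doc.getDocumentElement());
        System.out.println("after:  " + count("//body/div", doc));
    }
}
```

The unprefixed query should come up empty before the pass and find both `div` elements after it, which is the whole point of the exercise.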
Grrrrrrr..... but at least we can say goodbye to prefixes in xpath queries.
