Kelvin Tan - Solr/Elasticsearch Consultant - Dom4j + XPath + TagSoup

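TagSoup has an annoying habit of putting every element of the HTML it cleans into the XHTML namespace.

This annoyance becomes a major hindrance when formulating XPath queries against TagSoup-cleaned HTML, because an unprefixed name in XPath only matches elements that are in no namespace.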

Instead of using

//body/a/@href

we have to do

//html:body/html:a/@href
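
To make the difference concrete, here is a minimal sketch. It uses the JDK's built-in DOM and XPath classes rather than dom4j so that it runs without extra jars, and the SAMPLE string is a hand-written stand-in for TagSoup output, not something TagSoup actually produced. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: JDK XPath, not dom4j.
class PrefixedXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Assumed stand-in for TagSoup-cleaned markup: all elements sit in the XHTML namespace.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<a href=\"http://example.com/\">link</a></body></html>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this the parser ignores namespaces
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Counts the nodes selected by expr, with the prefix "html" bound to the XHTML namespace.
    static int countMatches(Document doc, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "html" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                    ? Collections.singletonList("html").iterator()
                    : Collections.<String>emptyIterator();
            }
        });
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(countMatches(doc, "//body/a/@href"));           // prints 0
        System.out.println(countMatches(doc, "//html:body/html:a/@href")); // prints 1
    }
}
```

dom4j offers the same kind of prefix binding through XPath.setNamespaceURIs, but either way every query has to carry the prefix.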

I spent a couple of hours trying to figure out how to disable namespace prefixes in TagSoup.

This does not work:

parser.setFeature(org.ccil.cowan.tagsoup.Parser.namespacesFeature, false);

This doesn't work either:

parser.setFeature(org.ccil.cowan.tagsoup.Parser.namespacePrefixesFeature, false);

I finally stumbled on a crude brute-force solution at http://www.mail-archive.com/dom4j-user%40lists.sourceforge.net/msg02511.html

import java.util.List;

import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.Namespace;
import org.dom4j.Node;
import org.dom4j.QName;

// NamespaceUtil is a placeholder name; the mailing-list snippet shows only the methods.
public class NamespaceUtil {

    // The snippet reads this flag without declaring it; a static field is assumed here.
    private static boolean removeNamespaces = true;

    /**
     * Removes namespaces from everything below the root if removeNamespaces is true.
     * Note that the root element itself keeps its namespace.
     */
    public static void fixNamespaces(Document doc) {
        Element root = doc.getRootElement();
        if (removeNamespaces && root.getNamespace() != Namespace.NO_NAMESPACE) {
            removeNamespaces(root.content());
        }
    }

    /**
     * Puts the original namespace back on everything below the root
     * if removeNamespaces is true.
     */
    public static void unfixNamespaces(Document doc, Namespace original) {
        Element root = doc.getRootElement();
        if (removeNamespaces && original != null) {
            setNamespaces(root.content(), original);
        }
    }

    /**
     * Sets the namespace of the element to the given namespace.
     */
    public static void setNamespace(Element elem, Namespace ns) {
        elem.setQName(QName.get(elem.getName(), ns, elem.getQualifiedName()));
    }

    /**
     * Recursively removes the namespace of the element and all its
     * children: sets it to Namespace.NO_NAMESPACE.
     */
    public static void removeNamespaces(Element elem) {
        setNamespaces(elem, Namespace.NO_NAMESPACE);
    }

    /**
     * Recursively removes the namespace of every node in the list and all
     * their children: sets it to Namespace.NO_NAMESPACE.
     */
    public static void removeNamespaces(List<?> l) {
        setNamespaces(l, Namespace.NO_NAMESPACE);
    }

    /**
     * Recursively sets the namespace of the element and all its children.
     */
    public static void setNamespaces(Element elem, Namespace ns) {
        setNamespace(elem, ns);
        setNamespaces(elem.content(), ns);
    }

    /**
     * Recursively sets the namespace of every attribute and element in the
     * list; other node types (text, comments) are skipped.
     */
    public static void setNamespaces(List<?> l, Namespace ns) {
        for (int i = 0; i < l.size(); i++) {
            Node n = (Node) l.get(i);
            if (n.getNodeType() == Node.ATTRIBUTE_NODE) {
                ((Attribute) n).setNamespace(ns);
            }
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                setNamespaces((Element) n, ns);
            }
        }
    }
}
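
If you want to try the idea without dom4j on the classpath, the same brute-force walk can be sketched with the JDK's own DOM API. This is an analogue of the code above, not a port of it: it relies on Document.renameNode to move each element into no namespace, the SAMPLE string is again a hand-written stand-in for TagSoup output, and the class and method names are invented for the example. Unlike fixNamespaces above, it also strips the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: JDK DOM analogue of the dom4j namespace stripping.
class StripNamespacesDemo {
    // Assumed stand-in for TagSoup-cleaned markup.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<a href=\"http://example.com/\">link</a></body></html>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Recursively moves node and its descendant elements into no namespace.
    // Returns the (possibly replaced) node so the caller can keep walking siblings.
    static Node stripNamespaces(Document doc, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            node = doc.renameNode(node, null, node.getLocalName());
            // Drop the default-namespace declaration so it does not linger as an attribute.
            ((Element) node).removeAttribute("xmlns");
        }
        for (Node child = node.getFirstChild(); child != null; ) {
            child = stripNamespaces(doc, child).getNextSibling();
        }
        return node;
    }

    // Evaluates expr with no namespace bindings and returns its string value.
    static String evaluate(Document doc, String expr) throws Exception {
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println("before: [" + evaluate(doc, "//body/a/@href") + "]");
        stripNamespaces(doc, doc.getDocumentElement());
        System.out.println("after:  [" + evaluate(doc, "//body/a/@href") + "]");
    }
}
```

The trade-off is the same as with the dom4j version: the whole tree gets walked once after parsing, which is wasteful but keeps every later query prefix-free.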

Grrrrrrr... but at least we can say goodbye to prefixes in XPath queries.


Supermind Search Consulting Blog
Solr - Elasticsearch - Big Data

Dom4j + XPath + TagSoup – Namespaces = sweet!
