Kelvin Tan - Solr/Elasticsearch Consultant - Tokenizing second-level and top-level domain for a URL in Lucene and Solr

In my previous post, I described how to extract second- and top-level domains from a URL in Java.

Now, I'll build a Lucene Tokenizer out of it, and a Solr TokenizerFactory class.

DomainTokenizer doesn't do anything really fancy. It first returns the hostname as the first token, then the 2nd-level domain as the second token, and the top-level domain as the last token.

e.g. given the URL http://www.supermind.org, it'll return

www.supermind.org
.supermind.org
.org

Doing so allows you to quickly return all documents in the Lucene or Solr index matching the second-level domain or the TLD.

package org.supermind.solr.analysis;
 
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
 
import java.io.IOException;
import java.io.Reader;
import java.net.URL;
 
public class DomainTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
 
  public static final int STATE_UNINITIALIZED = -1;
  public static final int STATE_INITIALIZED = 0;
  public static final int STATE_2LD = 1;
  public static final int STATE_TLD = 2;
  public static final int STATE_DONE = 3;
 
  private int state = STATE_UNINITIALIZED;
 
  private URL url = null;
  private SecondLDExtractor extractor;
  private boolean index2LD;
  private boolean indexTLD;
 
  public DomainTokenizer(Reader input, SecondLDExtractor extractor, boolean index2LD, boolean indexTLD) {
    super(input);
    this.extractor = extractor;
    this.index2LD = index2LD;
    this.indexTLD = indexTLD;
  }
 
  @Override
  public boolean incrementToken() throws IOException {
    if (state == STATE_DONE) return false;
 
    clearAttributes();
    if (this.url == null) {
      state = STATE_INITIALIZED;
 
      StringBuilder sb = new StringBuilder();
      int upto = 0;
      char[] buffer = new char[512];
      while (true) {
        final int length = input.read(buffer, upto, buffer.length - upto);
        if (length == -1) break;
        upto += length;
        sb.append(buffer);
      }
      this.url = new URL(sb.toString());
      if (!index2LD && !indexTLD) state = STATE_DONE;
      termAtt.append(url.getHost());
      return true;
    } else if (index2LD && state < STATE_2LD) {
      state = STATE_2LD;
      String twold = extractor.extract2LD(url.getHost());
      termAtt.append("."+twold);
      return true;
    } else if (indexTLD && state < STATE_TLD) {
      state = STATE_TLD;
      String tld = extractor.extractTLD(url.getHost());
      termAtt.append(tld);
      return true;
    }
    state = STATE_DONE;
    return false;
  }
}

and here's the corresponding Solr TokenizerFactory.

package org.supermind.solr.analysis;
 
import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;
 
import java.io.Reader;
import java.util.Map;
 
public class DomainTokenizerFactory extends BaseTokenizerFactory {
  private SecondLDExtractor extractor;
  private boolean index2LD;
  private boolean indexTLD;
 
  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    assureMatchVersion();
    index2LD = getBoolean("index2LD", true);
    indexTLD = getBoolean("indexTLD", true);
    if (index2LD || indexTLD) {
      initTLDExtractor();
    }
  }
 
  private void initTLDExtractor() {
    extractor = new SecondLDExtractor();
    extractor.init();
  }
 
  public Tokenizer create(Reader input) {
    DomainTokenizer tokenizer = new DomainTokenizer(input, extractor, index2LD, indexTLD);
    return tokenizer;
  }
}

Here's a sample fieldType definition.

<fieldType name="domain" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="org.supermind.solr.analysis.DomainTokenizerFactory"/>
      </analyzer>
</fieldType>

1 Comment »

Supermind Search Consulting Blog Solr - Elasticsearch - Big Data

Tokenizing second-level and top-level domain for a URL in Lucene and Solr

Supermind Search Consulting Blog
Solr - Elasticsearch - Big Data