Introduction to Solr analysis

Solr filters are declared in Solr's schema as part of analyzer chains.

In Lucene/Solr/Elasticsearch, analyzers process text, breaking it up into terms or tokens, which are then indexed.

An analyzer chain in Solr/Elasticsearch consists of, in order of execution:
1) zero or more character filters
2) a tokenizer
3) zero or more token filters

Token filters operate on tokens/terms, as opposed to character filters, which operate on the raw character stream before it reaches the tokenizer.

Examples of token filters are the LowerCaseFilter, which lowercases each term, the StopFilter, which removes stopwords, and the TrimFilter, which trims whitespace from terms.
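
To make the ordering concrete, here is a sketch of how such a chain might be declared in a Solr schema, using built-in factories (the particular filters chosen here are just for illustration):

<analyzer>
      <!-- character filters run first, over the raw character stream -->
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <!-- the tokenizer then breaks the stream into tokens -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- token filters run last, each transforming the token stream in turn -->
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
</analyzer>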

Basic implementation of a TokenFilter

Without further ado, here is the simplest possible TokenFilter and TokenFilterFactory implementation. (I've included some javadoc from TokenFilter.java to provide a bit more background context.)

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    // note: in Lucene 9+ this class lives in org.apache.lucene.analysis.TokenFilterFactory
    import org.apache.lucene.analysis.util.TokenFilterFactory;

    public class MyTokenFilterFactory extends TokenFilterFactory {
        public MyTokenFilterFactory(Map<String, String> args) {
            super(args);
        }

        @Override
        public TokenStream create(TokenStream input) {
            return new MyTokenFilter(input);
        }
    }

    final class MyTokenFilter extends TokenFilter {
        public MyTokenFilter(TokenStream input) {
            super(input);
        }

        /**
         * Consumers (i.e., {@link IndexWriter}) use this method to advance the stream to
         * the next token. Implementing classes must implement this method and update
         * the appropriate {@link AttributeImpl}s with the attributes of the next
         * token.
         *
         * The producer must make no assumptions about the attributes after the method
         * has been returned: the caller may arbitrarily change it. If the producer
         * needs to preserve the state for subsequent calls, it can use
         * {@link #captureState} to create a copy of the current attribute state.
         *
         * This method is called for every token of a document, so an efficient
         * implementation is crucial for good performance. To avoid calls to
         * {@link #addAttribute(Class)} and {@link #getAttribute(Class)},
         * references to all {@link AttributeImpl}s that this stream uses should be
         * retrieved during instantiation.
         *
         * To ensure that filters and consumers know which attributes are available,
         * the attributes must be added during instantiation. Filters and consumers
         * are not required to check for availability of attributes in
         * {@link #incrementToken()}.
         *
         * @return false for end of stream; true otherwise
         */
        @Override
        public boolean incrementToken() throws IOException {
            // A no-op filter: pass each token from the wrapped stream through unchanged.
            if (!input.incrementToken()) {
                return false;
            }
            return true;
        }
    }

Some notes:
1. MyTokenFilterFactory exists solely to create MyTokenFilter. Configuration arguments can be obtained via its constructor (see the sketch after these notes).
2. MyTokenFilter must be final (Lucene checks, when Java assertions are enabled, that TokenStream subclasses are final or at least declare a final incrementToken()).
3. MyTokenFilter implements a single method, incrementToken().
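
As a sketch of that configuration pattern: suppose we wanted a hypothetical replacement argument, declared in the schema as <filter class="com.mycompany.solr.MyTokenFilterFactory" replacement="Hello"/>. The factory could consume it like so (the parameter name and default value are made up for illustration):

    public class MyTokenFilterFactory extends TokenFilterFactory {
        private final String replacement; // hypothetical parameter, for illustration only

        public MyTokenFilterFactory(Map<String, String> args) {
            super(args);
            // get(...) reads and removes the named argument from the map
            this.replacement = get(args, "replacement", "Hello");
            if (!args.isEmpty()) {
                // Lucene factory convention: fail fast on unrecognized parameters
                throw new IllegalArgumentException("Unknown parameters: " + args);
            }
        }

        @Override
        public TokenStream create(TokenStream input) {
            return new MyTokenFilter(input); // a real filter would pass `replacement` through
        }
    }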

Now, this token filter is a no-op implementation and not very useful. Here is a (slightly) modified version which replaces every term with "Hello".

  1. final class MyTokenFilter extends TokenFilter {
  2.     private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  3.     public MyTokenFilter(TokenStream input) {
  4.         super(input);
  5.     }
  6.  
  7.     @Override
  8.     public boolean incrementToken() throws IOException {
  9.         if (!input.incrementToken()) {
  10.             return false;
  11.         }
  12.         termAtt.setEmpty().append("Hello");
  13.         return true;
  14.     }
  15. }

Some notes:
1. Line 2 shows the TokenFilter pattern of adding attributes at construction time. See the org.apache.lucene.analysis.tokenattributes package for the available attributes.
2. In this filter, we are using CharTermAttribute, which holds the text of the current term. Each incrementToken() invocation populates it with the term produced by the wrapped stream.
3. To modify the term that will be returned, call termAtt.setEmpty().append(someString).
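
To see the filter in action outside Solr, here is a minimal test harness, assuming the Lucene analysis jars are on the classpath (the class name and input text are arbitrary):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class MyTokenFilterDemo {
        public static void main(String[] args) throws Exception {
            WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
            tokenizer.setReader(new StringReader("the quick brown fox"));

            try (TokenStream stream = new MyTokenFilter(tokenizer)) {
                // returns the same attribute instance the filter registered in its constructor
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.println(term); // prints "Hello" four times
                }
                stream.end();
            }
        }
    }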

Deploying to Solr

Compile these classes into a .jar file and place it in your core's lib directory. You can then use this filter in your field types like so:

<analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="com.mycompany.solr.MyTokenFilterFactory"/>
</analyzer>
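
The <analyzer> element itself lives inside a <fieldType> declaration in the schema. A sketch, with illustrative names:

<fieldType name="text_hello" class="solr.TextField">
      <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="com.mycompany.solr.MyTokenFilterFactory"/>
      </analyzer>
</fieldType>
<field name="greeting" type="text_hello" indexed="true" stored="true"/>

Any field of type text_hello will then run MyTokenFilter at both index and query time, since no separate query-time analyzer is declared.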