Introduction to Solr analysis

Solr filters are declared in Solr's schema as part of analyzer chains.

In Lucene/Solr/Elasticsearch, analyzers process text, breaking it up into terms or tokens, which are then indexed.

An analyzer chain in Solr/Elasticsearch consists of, in order of execution:
1) zero or more character filters
2) a tokenizer
3) zero or more token filters

Token filters operate on tokens/terms, as opposed to character filters, which operate on the raw character stream before it reaches the tokenizer.

Examples of token filters are the LowerCaseFilter, which lowercases each term, the StopFilter, which removes stopwords, and the TrimFilter, which trims whitespace from terms.
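
To make the ordering concrete, here is a sketch of how such a chain might be declared in a Solr schema, using built-in factories (the particular filters chosen here are just for illustration):

<analyzer>
      <!-- character filters run first, over the raw character stream -->
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <!-- the tokenizer then breaks the stream into tokens -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- token filters run last, each transforming the token stream in turn -->
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
</analyzer>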

Basic implementation of a TokenFilter

Without further ado, here is the simplest possible TokenFilter and TokenFilterFactory implementation. (I've included some javadoc from TokenFilter.java to provide a bit more background context.)

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    // note: in Lucene 9+ this class lives in org.apache.lucene.analysis.TokenFilterFactory
    import org.apache.lucene.analysis.util.TokenFilterFactory;

    public class MyTokenFilterFactory extends TokenFilterFactory {
        public MyTokenFilterFactory(Map<String, String> args) {
            super(args);
        }

        @Override
        public TokenStream create(TokenStream input) {
            return new MyTokenFilter(input);
        }
    }

    final class MyTokenFilter extends TokenFilter {
        public MyTokenFilter(TokenStream input) {
            super(input);
        }

        /**
         * Consumers (i.e., {@link IndexWriter}) use this method to advance the stream to
         * the next token. Implementing classes must implement this method and update
         * the appropriate {@link AttributeImpl}s with the attributes of the next
         * token.
         *
         * The producer must make no assumptions about the attributes after the method
         * has been returned: the caller may arbitrarily change it. If the producer
         * needs to preserve the state for subsequent calls, it can use
         * {@link #captureState} to create a copy of the current attribute state.
         *
         * This method is called for every token of a document, so an efficient
         * implementation is crucial for good performance. To avoid calls to
         * {@link #addAttribute(Class)} and {@link #getAttribute(Class)},
         * references to all {@link AttributeImpl}s that this stream uses should be
         * retrieved during instantiation.
         *
         * To ensure that filters and consumers know which attributes are available,
         * the attributes must be added during instantiation. Filters and consumers
         * are not required to check for availability of attributes in
         * {@link #incrementToken()}.
         *
         * @return false for end of stream; true otherwise
         */
        @Override
        public boolean incrementToken() throws IOException {
            // A no-op filter: pass each token from the wrapped stream through unchanged.
            if (!input.incrementToken()) {
                return false;
            }
            return true;
        }
    }

Some notes:
1. MyTokenFilterFactory exists solely to create MyTokenFilter. Configuration arguments can be obtained via its constructor (see the sketch after these notes).
2. MyTokenFilter must be final (Lucene checks, when Java assertions are enabled, that TokenStream subclasses are final or at least declare a final incrementToken()).
3. MyTokenFilter implements a single method, incrementToken().
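
As a sketch of that configuration pattern: suppose we wanted a hypothetical replacement argument, declared in the schema as <filter class="com.mycompany.solr.MyTokenFilterFactory" replacement="Hello"/>. The factory could consume it like so (the parameter name and default value are made up for illustration):

    public class MyTokenFilterFactory extends TokenFilterFactory {
        private final String replacement; // hypothetical parameter, for illustration only

        public MyTokenFilterFactory(Map<String, String> args) {
            super(args);
            // get(...) reads and removes the named argument from the map
            this.replacement = get(args, "replacement", "Hello");
            if (!args.isEmpty()) {
                // Lucene factory convention: fail fast on unrecognized parameters
                throw new IllegalArgumentException("Unknown parameters: " + args);
            }
        }

        @Override
        public TokenStream create(TokenStream input) {
            return new MyTokenFilter(input); // a real filter would pass `replacement` through
        }
    }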

Now, this token filter is a no-op implementation and not very useful. Here is a (slightly) modified version which replaces every term with "Hello".

  1. final class MyTokenFilter extends TokenFilter {
  2.     private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  3.     public MyTokenFilter(TokenStream input) {
  4.         super(input);
  5.     }
  6.  
  7.     @Override
  8.     public boolean incrementToken() throws IOException {
  9.         if (!input.incrementToken()) {
  10.             return false;
  11.         }
  12.         termAtt.setEmpty().append("Hello");
  13.         return true;
  14.     }
  15. }

Some notes:
1. Line 2 shows the TokenFilter pattern of adding attributes at construction time. See the org.apache.lucene.analysis.tokenattributes package for the available attributes.
2. In this filter, we are using CharTermAttribute, which holds the text of the current term. Each incrementToken() invocation populates it with the term produced by the wrapped stream.
3. To modify the term that will be returned, call termAtt.setEmpty().append(someString).
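
To see the filter in action outside Solr, here is a minimal test harness, assuming the Lucene analysis jars are on the classpath (the class name and input text are arbitrary):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class MyTokenFilterDemo {
        public static void main(String[] args) throws Exception {
            WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
            tokenizer.setReader(new StringReader("the quick brown fox"));

            try (TokenStream stream = new MyTokenFilter(tokenizer)) {
                // returns the same attribute instance the filter registered in its constructor
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.println(term); // prints "Hello" four times
                }
                stream.end();
            }
        }
    }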

Deploying to Solr

Compile these classes into a .jar file and place it in your core's lib directory. You can then use this filter in your field types like so:

<analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="com.mycompany.solr.MyTokenFilterFactory"/>
</analyzer>
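
The <analyzer> element itself lives inside a <fieldType> declaration in the schema. A sketch, with illustrative names:

<fieldType name="text_hello" class="solr.TextField">
      <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="com.mycompany.solr.MyTokenFilterFactory"/>
      </analyzer>
</fieldType>
<field name="greeting" type="text_hello" indexed="true" stored="true"/>

Any field of type text_hello will then run MyTokenFilter at both index and query time, since no separate query-time analyzer is declared.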