Posted by Kelvin on 16 Sep 2013 at 03:45 pm | Tagged as: Lucene / Solr / Elasticsearch / Nutch
Solr has a number of Autocomplete implementations which are great for most purposes. However, a client of mine recently had some fairly specific requirements for autocomplete:
1. phrase-based substring matching
2. out-of-order matches ('foo bar' should match 'the bar is foo')
3. fallback matching to a secondary field when substring matching on the primary field fails, e.g. 'windstopper jac' doesn't match anything in the 'title' field, but does match in the 'category' field
The most direct way to model this would probably have been to create a separate Solr core and use ngram + shingles indexing and Solr queries to obtain results. However, because the index was fairly small, I decided to go with an in-memory approach.
The general strategy was:
1. For each entry in the primary field, create ngram tokens and add them to a Guava Table, where the row key was the ngram, the column key was the original string, and the value was a distance score.
2. For each entry in the secondary field, create ngram tokens and add them to a Guava Multimap, where the key was the ngram and the value was the term.
3. When an autocomplete query is received, split it on spaces, then look up each token in the primary Table.
4. If no matches are found, look up the tokens in the secondary Multimap.
5. Return results.
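The steps above can be sketched roughly as follows. This is not the original code: plain nested maps stand in for the Guava Table and Multimap so the sketch runs without dependencies, and the class name, method names, minimum ngram length, and the placeholder position-based score are all assumptions made for illustration.

```java
import java.util.*;

// Hypothetical sketch. Nested maps stand in for Guava's Table and Multimap.
class AutocompleteSketch {
    static final int MIN_GRAM = 2; // assumed minimum ngram length

    // ngram -> (original string -> score); plays the role of the primary Table
    private final Map<String, Map<String, Double>> primary = new HashMap<>();
    // ngram -> terms; plays the role of the secondary Multimap
    private final Map<String, Set<String>> secondary = new HashMap<>();

    // All substrings of a word that are at least MIN_GRAM characters long.
    static Set<String> ngrams(String word) {
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + MIN_GRAM; end <= word.length(); end++) {
                grams.add(word.substring(start, end));
            }
        }
        return grams;
    }

    void addPrimary(String title) {
        String[] words = title.toLowerCase().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            // Placeholder score (an assumption): earlier words score higher.
            double score = 1.0 / (pos + 1);
            for (String gram : ngrams(words[pos])) {
                primary.computeIfAbsent(gram, k -> new HashMap<>())
                       .merge(title, score, Math::max);
            }
        }
    }

    void addSecondary(String term) {
        for (String word : term.toLowerCase().split("\\s+")) {
            for (String gram : ngrams(word)) {
                secondary.computeIfAbsent(gram, k -> new LinkedHashSet<>()).add(term);
            }
        }
    }

    List<String> suggest(String query) {
        String[] tokens = query.trim().toLowerCase().split("\\s+");

        // Primary lookup: keep only strings matched by every token,
        // in any order, and sum their per-token scores.
        Map<String, Double> hits = null;
        for (String token : tokens) {
            Map<String, Double> forToken =
                primary.getOrDefault(token, Collections.emptyMap());
            if (hits == null) {
                hits = new HashMap<>(forToken);
            } else {
                hits.keySet().retainAll(forToken.keySet());
                hits.replaceAll((title, s) -> s + forToken.get(title));
            }
        }
        if (hits != null && !hits.isEmpty()) {
            final Map<String, Double> scores = hits;
            List<String> ranked = new ArrayList<>(scores.keySet());
            // Highest score first; ties broken alphabetically.
            ranked.sort(Comparator.comparingDouble((String t) -> -scores.get(t))
                                  .thenComparing(Comparator.naturalOrder()));
            return ranked;
        }

        // Fallback: same intersection against the secondary map, unscored.
        Set<String> fallback = null;
        for (String token : tokens) {
            Set<String> forToken =
                secondary.getOrDefault(token, Collections.emptySet());
            if (fallback == null) {
                fallback = new LinkedHashSet<>(forToken);
            } else {
                fallback.retainAll(forToken);
            }
        }
        return fallback == null ? new ArrayList<>() : new ArrayList<>(fallback);
    }
}
```

With 'the bar is foo' indexed as a primary entry and 'windstopper jackets' as a secondary one, the query 'foo bar' should return the primary entry, while 'windstopper jac' should find nothing in the primary map and fall through to the secondary one. Storing every substring of every word is memory-hungry, which is only reasonable here because the index is small.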
The scoring for the primary Table was simple, based on the length of the word and the distance of the token from the start of the string.
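The exact formula isn't given here, so the following is only one plausible shape for such a score, not the one actually used. It assumes the score grows with the fraction of the word that the ngram covers and shrinks with the word's position in the string; the class name, method name, and parameters are hypothetical.

```java
// Hypothetical scoring sketch; the formula is an assumption.
class NgramScore {
    /**
     * @param ngramLength  length of the matched ngram
     * @param wordLength   length of the word the ngram was taken from
     * @param wordPosition zero-based position of that word in the string
     * @return a score in (0, 1]; higher is better
     */
    static double score(int ngramLength, int wordLength, int wordPosition) {
        // Fraction of the word covered by the ngram: a full-word match gives 1.0.
        double coverage = (double) ngramLength / wordLength;
        // Penalise words that sit further from the start of the string.
        double positionWeight = 1.0 / (1 + wordPosition);
        return coverage * positionWeight;
    }
}
```

Under this formula a full-word match on the first word scores 1.0, and the same match on the second word scores half of that, so strings that start with the typed text tend to rank first.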