Supermind Consulting Blog 
Solr - ElasticSearch - Big Data

Posts about Lucene / Solr / Elastic Search / Nutch

Monier-Williams Sanskrit-English-IAST search engine

Posted by Kelvin on 17 Sep 2015 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming, Python

I just launched a search application for the Monier-Williams dictionary, which is the definitive Sanskrit-English dictionary.

See it in action here:

The app is built in Python and uses the Whoosh search engine. I chose Whoosh instead of Solr or ElasticSearch because I wanted to try building a search app which didn't depend on Java.

Features include:
– full-text search in Devanagari, English, IAST, ASCII and Harvard-Kyoto (HK) transliteration
– results link to page scans
– more frequently occurring word senses are boosted higher in search results
– visually displays the MW level or depth of a word with list indentation

A HTML5 ElasticSearch Query DSL Builder

Posted by Kelvin on 16 Sep 2015 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming

TL;DR: I parsed the ElasticSearch source and generated an HTML app that lets you build ElasticSearch queries using its JSON Query DSL. You can see it in action here:

I really like ElasticSearch's JSON-based Query DSL – it lets you create fairly complex search queries in a relatively painless fashion.

I do not, however, fancy the query DSL documentation. I've often found it inadequate, inconsistent with the source, and at times downright confusing.

Browsing the source, I realised that ES parses JSON queries in a fairly regular fashion, which lends itself well to regex-based parsing of the Java source in order to generate documentation of the JSON 'schema'.

I did the parsing in Java; the actual query builder UI is in HTML and JavaScript. The Java phase outputs a JSON data model of the query DSL, which the HTML app then uses to dynamically build the HTML forms etc.
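To give a flavour of the extraction step, here's a simplified, illustrative sketch (the real tool handles many more cases; the class name, regex and output handling are just stand-ins). It leans on the convention in the 1.x parsers of comparing currentFieldName against string literals:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative sketch only: walk an ES source tree, find the *QueryParser.java files,
// and collect the JSON field names each parser compares currentFieldName against.
public class QueryDslFieldExtractor {
  private static final Pattern FIELD =
      Pattern.compile("\"([a-zA-Z_]+)\"\\.equals\\(currentFieldName\\)");

  public static void main(String[] args) throws Exception {
    // args[0] = path to an unpacked elasticsearch source tree
    List<Path> sources = Files.walk(Paths.get(args[0]))
        .filter(p -> p.toString().endsWith("QueryParser.java"))
        .collect(Collectors.toList());

    Map<String, Set<String>> dsl = new TreeMap<>();
    for (Path source : sources) {
      Matcher m = FIELD.matcher(new String(Files.readAllBytes(source)));
      Set<String> fields = new TreeSet<>();
      while (m.find()) fields.add(m.group(1));
      dsl.put(source.getFileName().toString(), fields);
    }
    System.out.println(dsl); // the real tool emits this as a JSON data model
  }
}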

Because of the consistent naming conventions of the objects, I was also able to embed links to documentation and github source within the page itself. Very useful!

You can see the result in action here:

PS: I first did this for ES version 1.2.1, and then subsequently for 1.4.3 and now 1.7.2. The approach seems to work consistently across versions, with minor changes required in the Java backend parsing between version bumps. Hopefully this remains the case when we go to ES 2.x.

Phrase-based Out-of-order Solr Autocomplete Suggester

Posted by Kelvin on 16 Sep 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

Solr has a number of Autocomplete implementations which are great for most purposes. However, a client of mine recently had some fairly specific requirements for autocomplete:

1. phrase-based substring matching
2. out-of-order matches ('foo bar' should match 'the bar is foo')
3. fallback matching to a secondary field when substring matches on the primary field fails, e.g. 'windstopper jac' doesn't match anything on the 'title' field, but matches on the 'category' field

The most direct way to model this would probably have been to create a separate Solr core and use ngram + shingles indexing and Solr queries to obtain results. However, because the index was fairly small, I decided to go with an in-memory approach.

The general strategy was:

1. For each entry in the primary field, create ngram tokens and add entries to a Guava Table, where the row key was the ngram, the column key was the original string, and the value was a distance score.
2. For each entry in the secondary field, create ngram tokens and add entries to a Guava Multimap, where the key was the ngram and the value was the term.
3. When an autocomplete query is received, split it by space, then do lookups against the primary Table.
4. If no matches are found, look up against the secondary Multimap.
5. Return results.

The scoring for the primary Table was a simple one, based on the length of the word and the distance of the token from the start of the string. A rough sketch of the whole approach follows below.
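Here's a minimal, illustrative sketch of those structures and the lookup flow. Analysis, scoring and thread-safety are all simplified, and the names are mine, not the production code:

import com.google.common.collect.HashBasedTable;
import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;
import com.google.common.collect.Table;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified sketch of the in-memory suggester described above.
public class InMemorySuggester {
  // row key = ngram, column key = original phrase, value = distance score
  private final Table<String, String, Double> primary = HashBasedTable.create();
  // key = ngram, value = secondary-field term
  private final Multimap<String, String> secondary = HashMultimap.create();

  public void addPrimary(String phrase) {
    String[] tokens = phrase.toLowerCase().split("\\s+");
    for (int pos = 0; pos < tokens.length; pos++) {
      // edge ngrams per token; a full substring version would enumerate all ngrams
      for (int len = 1; len <= tokens[pos].length(); len++) {
        String ngram = tokens[pos].substring(0, len);
        // shorter phrases and tokens nearer the start of the phrase score higher
        double score = 1.0 / (1 + pos) + 1.0 / phrase.length();
        primary.put(ngram, phrase, score);
      }
    }
  }

  public void addSecondary(String term) {
    String token = term.toLowerCase();
    for (int len = 1; len <= token.length(); len++) {
      secondary.put(token.substring(0, len), term);
    }
  }

  public Collection<String> suggest(String query) {
    // out-of-order matching: each query token is looked up independently
    Map<String, Double> scores = new HashMap<>();
    for (String part : query.toLowerCase().split("\\s+")) {
      for (Map.Entry<String, Double> e : primary.row(part).entrySet()) {
        scores.merge(e.getKey(), e.getValue(), Double::sum);
      }
    }
    if (!scores.isEmpty()) {
      List<String> results = new ArrayList<>(scores.keySet());
      results.sort(Comparator.comparing(scores::get).reversed());
      return results;
    }
    // fallback: secondary field lookup
    Set<String> fallback = new LinkedHashSet<>();
    for (String part : query.toLowerCase().split("\\s+")) {
      fallback.addAll(secondary.get(part));
    }
    return fallback;
  }
}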

Custom Solr QueryParsers for fun and profit

Posted by Kelvin on 09 Sep 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

In this post, I'll show you what you need to do to implement a custom Solr QueryParser.

Step 1

Extend QParserPlugin.

public class TestQueryParserPlugin extends QParserPlugin {
  public void init(NamedList namedList) {
  }

  @Override public QParser createParser(String s, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    return new TestQParser(s, localParams, params, req);
  }
}
This is the class you'll define in solrconfig.xml, informing Solr of your queryparser. Define it like so:

<queryParser name="myfunparser" class="org.supermind.solr.queryparser.TestQParserPlugin"/>

Step 2

Extend QParser.

public class TestQParser extends QParser {
  public TestQParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    super(qstr, localParams, params, req);
  }

  @Override public Query parse() throws SyntaxError {
    return null;
  }
}
Step 3

Actually implement the parsing in the parse() method.

Suppose we wanted to make a really simple parser for term queries, which are space-delimited. Here's how I'd do it:

@Override public Query parse() throws SyntaxError {
  String defaultField = req.getSchema().getDefaultSearchFieldName();
  QueryParser.Operator defaultOperator = QueryParser.Operator.valueOf(req.getSchema().getQueryParserDefaultOperator());
  BooleanClause.Occur op = (defaultOperator == QueryParser.Operator.AND) ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
  String[] arr = qstr.split(" ");
  BooleanQuery bq = new BooleanQuery(true);
  for (String s : arr) {
    if (s.trim().length() == 0) continue;
    bq.add(new TermQuery(new Term(defaultField, s)), op);
  }
  return bq;
}

Step 4

In your query, use the nested query syntax to call your queryparser, e.g. q={!myfunparser}foo bar


Maybe in a follow-up post, I'll post the full code with jars and all.

High-level overview of Latent Semantic Analysis / LSA

Posted by Kelvin on 09 Sep 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming

I've just spent the last couple of days wrapping my head around implementing Latent Semantic Analysis, and after wading through a number of research papers and quite a bit of linear algebra, I've finally emerged on the other end, and thought I'd write something about it to lock the knowledge in. I'll do my best to keep it non-technical, yet accurate.

Step One – Build the term-document matrix

Input : documents
Output : term-document matrix

Latent Semantic Analysis has the same starting point as most Information Retrieval algorithms : the term-document matrix. Specifically, columns are documents, and rows are terms. If a document contains a term, then the value of that row-column is 1, otherwise 0.

If you start with a corpus of documents, or a database table or something, then you'll need to index this corpus into this matrix. Meaning, lowercasing, removing stopwords, maybe stemming etc. The typical Lucene/Solr analyzer chain, basically.
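As a toy illustration of this step, here's a sketch that builds the binary matrix from documents that have already been run through an analyzer chain (the names and signatures are mine, purely for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy construction of a binary term-document matrix from documents that have
// already been analyzed (lowercased, stopwords removed, etc.).
// Rows are terms, columns are documents; 1.0 means the term occurs in the doc.
public class TermDocumentMatrix {
  public static double[][] build(List<List<String>> analyzedDocs, List<String> vocabulary) {
    Map<String, Integer> termIndex = new HashMap<>();
    for (int i = 0; i < vocabulary.size(); i++) {
      termIndex.put(vocabulary.get(i), i);
    }
    double[][] matrix = new double[vocabulary.size()][analyzedDocs.size()];
    for (int doc = 0; doc < analyzedDocs.size(); doc++) {
      for (String term : analyzedDocs.get(doc)) {
        Integer row = termIndex.get(term);
        if (row != null) {
          matrix[row][doc] = 1.0;
        }
      }
    }
    return matrix;
  }
}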

Step Two – Decompose the matrix

Input : term-document matrix
Output : 3 matrices, U, S and V

Apply Singular Value Decomposition (SVD) to the matrix. This is the computationally expensive step of the whole operation.

SVD is a fairly technical concept and quite an involved process (if you're doing it by hand). If you do a bit of googling, you're going to find all kinds of mathematical terms related to this, like matrix decomposition, eigenvalues, eigenvectors, PCA (principal component analysis), random projection etc.

The 5-second explanation of this step is that the original term-document matrix gets broken down into 3 simpler matrices: a term-term matrix (also known as U, or the left matrix), a matrix consisting of the singular values (also known as S), and a document-document matrix (also known as V, or the right matrix).

Something which usually also happens in the SVD step for LSA, and which is important, is rank reduction. In this context, rank reduction means that the original term-document matrix gets somehow "factorized" into its constituent factors, and the k most significant factors or features are retained, where k is some number greater than zero and less than the original size of the term-document matrix. For example, a rank 3 reduction means that the 3 most significant factors are retained. This is important for you to know because most LSA/LSI applications will ask you to specify the value of k, meaning the application wants to know how many features you want to retain.

So what's actually happening in this SVD rank reduction is basically an approximation of the original term-document matrix, allowing you to compare features in a fast and efficient manner. Smaller k values generally run faster and use less memory, but are less accurate. Larger k values are more "true" to the original matrix, but take longer to compute. Note: this statement may not be true of the stochastic SVD implementations (involving random projection or some other method), where an increase in k doesn't lead to a linear increase in running time, but more like a log(n) increase in running time.
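For illustration, here's roughly what the decomposition plus rank reduction could look like using Apache Commons Math. The library choice is mine, not something the LSA literature prescribes, and the helper is a sketch rather than a production implementation:

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

// Sketch: decompose the term-document matrix and keep only the top k factors.
public class LsaDecomposition {
  public static RealMatrix[] decompose(double[][] termDoc, int k) {
    SingularValueDecomposition svd =
        new SingularValueDecomposition(new Array2DRowRealMatrix(termDoc));

    // U: term factors, S: singular values, V: document factors (truncated to rank k)
    RealMatrix uk = svd.getU().getSubMatrix(0, termDoc.length - 1, 0, k - 1);
    RealMatrix sk = svd.getS().getSubMatrix(0, k - 1, 0, k - 1);
    RealMatrix vk = svd.getV().getSubMatrix(0, termDoc[0].length - 1, 0, k - 1);
    return new RealMatrix[] { uk, sk, vk };
  }
}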

Step Three – Build query vector

Input : query string
Output : query vector

From here, we're on the downhill stretch. The query string needs to be expressed in the same reduced space as the documents: run it through the same analyzer chain used for indexing, build a term vector from it, and then fold that vector into the k-dimensional space produced in step two so it can be compared against the document vectors.
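Continuing the Commons Math sketch above (the helper name and the raw double[] representation are my own assumptions), the folding step looks something like:

import org.apache.commons.math3.linear.RealMatrix;

// Fold a raw query term vector q into the reduced k-dimensional space:
// q_k = S_k^-1 * U_k^T * q
public class QueryFolder {
  public static double[] fold(double[] queryTermVector, RealMatrix uk, RealMatrix sk) {
    double[] projected = uk.transpose().operate(queryTermVector); // U_k^T * q
    double[] folded = new double[projected.length];
    for (int i = 0; i < folded.length; i++) {
      folded[i] = projected[i] / sk.getEntry(i, i); // scale by inverse singular value
    }
    return folded;
  }
}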

Step Four – Compute cosine distance

Input : query vector, document matrix
Output : document scores

To determine how similar each document is to the query, aka the doc score, we go through each document vector in the matrix and calculate its cosine distance to the query vector.
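And a bare-bones cosine similarity helper for this last step (sketch only):

// Cosine similarity between the folded query vector and a document vector;
// higher means more similar.
public class CosineSimilarity {
  public static double similarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}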


Naive Solr Did You Mean re-searcher SearchComponent

Posted by Kelvin on 05 Sep 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

Solr makes Spellcheck easy. Super-easy in fact. All you need to do is to change some stuff in solrconfig.xml, and voila, spellcheck suggestions!

However, that's not how Google does spellchecking. What Google does is determine whether the query has a misspelling, and if so, transparently correct the misspelled term and perform the search, while also giving you the option of searching for the original term via a link.

Now, whilst it'd be uber-cool to have an exact equivalent in Solr, you'd need some statistical data to be able to perform this efficiently. A naive version is to use spellcheck corrections to transparently perform a new query when the original query returned less than x hits, where x is some arbitrarily small number.

Here's a simple SearchComponent that does just that:

import java.io.IOException;

import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class AutoSpellcheckResearcher extends QueryComponent {
  // if less than *threshold* hits are returned, a re-search is triggered
  private int threshold = 0;

  @Override public void init(NamedList args) {
    this.threshold = (Integer) args.get("threshold");
  }

  @Override public void prepare(ResponseBuilder rb) throws IOException {
  }

  @Override public void process(ResponseBuilder rb) throws IOException {
    long hits = rb.getNumberDocumentsFound();
    if (hits <= threshold) {
      final NamedList responseValues = rb.rsp.getValues();
      NamedList spellcheckresults = (NamedList) responseValues.get("spellcheck");
      if (spellcheckresults != null) {
        NamedList suggestions = (NamedList) spellcheckresults.get("suggestions");
        if (suggestions != null) {
          final NamedList collation = (NamedList) suggestions.get("collation");
          if (collation != null) {
            String collationQuery = (String) collation.get("collationQuery");
            if (responseValues != null) {
              responseValues.add("researched.original", rb.getQueryString());
              responseValues.add("researched.replaced", collationQuery);
            }
          }
        }
      }
    }
  }

  @Override public String getDescription() {
    return "AutoSpellcheckResearcher";
  }

  @Override public String getSource() {
    return "1.0";
  }
}

Reading ElasticSearch server book…

Posted by Kelvin on 23 May 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

Just got my hands on a review copy of PacktPub's ElasticSearch Server book, which I believe is the first ES book on the market.

Review to follow shortly..

New file formats in Lucene 4.1+ index

Posted by Kelvin on 30 Apr 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

Lucene 4.1 introduces new files in the index.

Here's a link to the documentation:

The different types of files are:

.tim: Term Dictionary
.tip: Term Index
.doc: Frequencies and Skip Data
.pos: Positions
.pay: Payloads and Offsets

Permission filtering in Solr using an ACL permissions string

Posted by Kelvin on 03 Apr 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

For an app I'm working on, the permissions ACL is stored in a string, in the form:

category1=level1|category2=level2|...

Both users and documents have an ACL string.

The number represents the access level for that category. Bigger numbers mean higher access.

In the previous Lucene-based iteration, to perform permission filtering, I just loaded the entire field into memory and did quick in-memory lookups. In this current iteration, I'm trying something different.

I'm creating one field per category, and populating the field values accordingly. Then when searching, I need to search across all the possible categories using range queries, including specifying empty fields where applicable. Works pretty well. The main drawback (and it's a severe one) is that I need to know all the categories a priori. This is not a problem for this app, but might be for other folks.

Here's an example of how it looks.

Document A: user=300|moderator=100
maps to the indexed fields acl_user=300, acl_moderator=100

User A: moderator=300

Filter Query to determine if User A can access Document A:
-acl_user:[* TO *] acl_moderator:[0 TO 300]
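For illustration, here's a small sketch of how such a filter query could be assembled from a user's ACL string, given the full list of known categories. The class and method names, and the acl_ field prefix handling, are mine:

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: build the Solr filter query for a user ACL string like "moderator=300",
// given the full set of known categories. Categories the user doesn't have must be
// absent from the document; categories the user has are matched with a range query.
public class AclFilterBuilder {
  public static String buildFilterQuery(String userAcl, Iterable<String> allCategories) {
    Map<String, Integer> levels = new LinkedHashMap<>();
    for (String pair : userAcl.split("\\|")) {
      String[] kv = pair.split("=");
      levels.put(kv[0], Integer.parseInt(kv[1]));
    }
    StringBuilder fq = new StringBuilder();
    for (String category : allCategories) {
      if (fq.length() > 0) fq.append(' ');
      Integer level = levels.get(category);
      if (level == null) {
        fq.append("-acl_").append(category).append(":[* TO *]"); // field must be empty
      } else {
        fq.append("acl_").append(category).append(":[0 TO ").append(level).append(']');
      }
    }
    return fq.toString();
  }
}

With the example above, buildFilterQuery("moderator=300", Arrays.asList("user", "moderator")) produces exactly the filter query shown.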

Solr DateField java dateformat

Posted by Kelvin on 03 Apr 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

Grrrr… keep forgetting the Solr DateField dateformat, so here it is for posterity.

new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
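One gotcha worth noting: Solr expects these dates in UTC, and the 'Z' in this pattern is a literal rather than a timezone, so it's safest to pin the formatter to UTC explicitly. A quick usage sketch:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateFormatExample {
  public static void main(String[] args) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
    df.setTimeZone(TimeZone.getTimeZone("UTC")); // the 'Z' is a literal, so set UTC here
    System.out.println(df.format(new Date()));   // e.g. 2013-04-03T10:15:30Z
  }
}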
