Supermind Search Consulting Blog 
Solr - ElasticSearch - Big Data

Posts about Lucene / Solr / Elastic Search / Nutch

Embed custom Javascript and HTML in a Kibana 4.x visualization

Posted by Kelvin on 11 Jan 2016 | Tagged as: Lucene / Solr / Elastic Search / Nutch

The embarrassingly simple answer to embedding ANY Javascript and HTML into a Kibana vis is to hack the markdown_vis plugin to not use markdown at all, but just display the HTML as-is.

Modify src/plugins/markdown_vis/public/markdown_vis_controller.js, and comment out

$scope.html = $sce.trustAsHtml(marked(html));

and replace it with

$scope.html = $sce.trustAsHtml(html);

You'll need to recreate the bundles (installing, or removing and reinstalling, a plugin such as Sense will trigger this) and restart Kibana for the change to take effect. It's pretty awesome, because now the sky's the limit!

Lucene 5 NRT Example

Posted by Kelvin on 16 Dec 2015 | Tagged as: Lucene / Solr / Elastic Search / Nutch

I just added an NRT search example for Lucene 5.x to

Check it out here:

Pain-free Solr replication

Posted by Kelvin on 02 Dec 2015 | Tagged as: Lucene / Solr / Elastic Search / Nutch

Here's a setup I use for totally pain-free Solr replication. It lets you switch masters and slaves quickly without messing with config files.

Add this to solrconfig.xml

<requestHandler name="/replication" class="solr.ReplicationHandler" >
  <str name="maxNumberOfBackups">1</str>
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="confFiles">solrconfig.xml,schema.xml,stopwords.txt,elevate.xml</str>
    <str name="commitReserveDuration">00:00:10</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://${replication.master}:8983/solr/corename</str>
    <str name="pollInterval">00:00:20</str>
    <str name="compression">internal</str>
    <str name="httpConnTimeout">5000</str>
    <str name="httpReadTimeout">10000</str>
  </lst>
</requestHandler>

Substitute "corename" with your actual core name.

Now create your master core with this command (substitute MASTER_IP_ADDRESS as appropriate):

curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=corename&instanceDir=corename&schema=schema.xml&config=solrconfig.xml&dataDir=data&property.enable.master=true&property.enable.slave=false&property.replication.master=MASTER_IP_ADDRESS"

And this for your slaves:

curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=corename&instanceDir=corename&schema=schema.xml&config=solrconfig.xml&dataDir=data&property.enable.master=false&property.enable.slave=true&property.replication.master=MASTER_IP_ADDRESS"

Now when you need to promote a slave to a master, just do this on the new master:

curl "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=corename" && curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=corename&instanceDir=corename&schema=schema.xml&config=solrconfig.xml&dataDir=data&property.enable.master=true&property.enable.slave=false&property.replication.master=NEW_MASTER_IP_ADDRESS"

And this on all slaves:

curl "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=corename" && curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=corename&instanceDir=corename&schema=schema.xml&config=solrconfig.xml&dataDir=data&property.enable.master=false&property.enable.slave=true&property.replication.master=NEW_MASTER_IP_ADDRESS"

Copy these commands and do a search and replace on "corename" for your actual core.

If you have a cssh cluster setup, you can update all slaves in one fell swoop.

Monier-Williams Sanskrit-English-IAST search engine

Posted by Kelvin on 17 Sep 2015 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming, Python

I just launched a search application for the Monier-Williams dictionary, which is the definitive Sanskrit-English dictionary.

See it in action here:

The app is built in Python and uses the Whoosh search engine. I chose Whoosh instead of Solr or ElasticSearch because I wanted to try building a search app which didn't depend on Java.

Features include:
– full-text search in Devanagari, English, IAST, ASCII and HK
– results link to page scans
– more frequently occurring word senses are boosted higher in search results
– visually displays the MW level or depth of a word with list indentation

An HTML5 ElasticSearch Query DSL Builder

Posted by Kelvin on 16 Sep 2015 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming

TL;DR: I parsed the ElasticSearch source and generated an HTML app that lets you build ElasticSearch queries using its JSON Query DSL. You can see it in action here:

I really like ElasticSearch's JSON-based Query DSL – it lets you create fairly complex search queries in a relatively painless fashion.

I do not, however, fancy the query DSL documentation. I've often found it inadequate, inconsistent with the source, and at times downright confusing.

Browsing the source, I realised that ES parses JSON queries in a fairly regular fashion, which lends itself well to regex-based parsing of the Java source in order to generate documentation of the JSON 'schema'.

The parsing I did in Java, and the actual query builder UI is in HTML and Javascript. The Java phase outputs a JSON data model of the query DSL, which the HTML app then uses to dynamically build the HTML forms etc.
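To give a flavour of what regex-based extraction can look like, here is a minimal sketch. It assumes the parser source compares field names with expressions of the form "name".equals(currentFieldName); both that pattern and the DslFieldExtractor class are illustrative assumptions, not the actual code behind the builder.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls JSON field names out of Java parser source.
class DslFieldExtractor {
    // Assumed shape of a field check in the parser source:
    //   "some_field".equals(currentFieldName)
    private static final Pattern FIELD_CHECK =
            Pattern.compile("\"([^\"]+)\"\\.equals\\(currentFieldName\\)");

    // Returns the distinct field names in the order they first appear.
    static List<String> fieldNames(String javaSource) {
        Set<String> names = new LinkedHashSet<>();
        Matcher matcher = FIELD_CHECK.matcher(javaSource);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return new ArrayList<>(names);
    }
}
```

Running fieldNames over the source of each query parser would yield the list of fields that parser accepts, which can then be serialised into the JSON data model that drives the forms.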

Because of the consistent naming conventions of the objects, I was also able to embed links to documentation and github source within the page itself. Very useful!

You can see the result in action here:

PS: I first did this for ES version 1.2.1, and then subsequently for 1.4.3 and now 1.7.2. The approach seems to work consistently across versions, with minor changes required in the Java backend parsing between version bumps. Hopefully this remains the case when we go to ES 2.x.

Phrase-based Out-of-order Solr Autocomplete Suggester

Posted by Kelvin on 16 Sep 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

Solr has a number of Autocomplete implementations which are great for most purposes. However, a client of mine recently had some fairly specific requirements for autocomplete:

1. phrase-based substring matching
2. out-of-order matches ('foo bar' should match 'the bar is foo')
3. fallback matching to a secondary field when substring matches on the primary field fail, e.g. 'windstopper jac' doesn't match anything on the 'title' field, but matches on the 'category' field

The most direct way to model this would probably have been to create a separate Solr core and use ngram + shingles indexing and Solr queries to obtain results. However, because the index was fairly small, I decided to go with an in-memory approach.

The general strategy was:

1. For each entry in the primary field, create ngram tokens, adding entries to a Guava Table, where the row key was the ngram, the column key was the original string, and the value was a distance score.
2. For each entry in the secondary field, create ngram tokens and add entries to a Guava Multimap, where the key was the ngram and the value was the term.
3. When an autocomplete query is received, split it by space, then do lookups against the primary Table.
4. If no matches were found, look up against the secondary Multimap.
5. Return results.

The scoring for the primary Table was a simple one based on length of word and distance of token from the start of the string.
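The strategy above can be sketched roughly as follows. This is not the client code: it swaps the Guava Table and Multimap for plain java.util maps so it runs without dependencies, and the ngram scheme (all substrings of at least 2 characters per word), the scoring formula, and every name in it are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AutocompleteSketch {
    // ngram -> (entry -> distance score); stands in for the Guava Table
    private final Map<String, Map<String, Integer>> primary = new HashMap<>();
    // ngram -> entries; stands in for the Guava Multimap
    private final Map<String, Set<String>> secondary = new HashMap<>();

    // Assumed ngram scheme: every substring of the word with length >= 2.
    static List<String> ngrams(String word) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + 2; end <= word.length(); end++) {
                grams.add(word.substring(start, end));
            }
        }
        return grams;
    }

    public void addPrimary(String entry) {
        String[] words = entry.toLowerCase().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            for (String gram : ngrams(words[pos])) {
                // Assumed score: word position plus unmatched characters (lower is better).
                int score = pos + (words[pos].length() - gram.length());
                primary.computeIfAbsent(gram, k -> new HashMap<>()).merge(entry, score, Math::min);
            }
        }
    }

    public void addSecondary(String entry) {
        for (String word : entry.toLowerCase().split("\\s+")) {
            for (String gram : ngrams(word)) {
                secondary.computeIfAbsent(gram, k -> new TreeSet<>()).add(entry);
            }
        }
    }

    public List<String> suggest(String query) {
        String[] tokens = query.trim().toLowerCase().split("\\s+");
        // Every token must match the same entry, in any order; scores are summed.
        Map<String, Integer> totals = null;
        for (String token : tokens) {
            Map<String, Integer> hits = primary.getOrDefault(token, Collections.emptyMap());
            if (totals == null) {
                totals = new HashMap<>(hits);
            } else {
                totals.keySet().retainAll(hits.keySet());
                for (Map.Entry<String, Integer> e : totals.entrySet()) {
                    e.setValue(e.getValue() + hits.get(e.getKey()));
                }
            }
        }
        if (totals != null && !totals.isEmpty()) {
            final Map<String, Integer> scores = totals;
            List<String> result = new ArrayList<>(scores.keySet());
            result.sort((a, b) -> {
                int byScore = Integer.compare(scores.get(a), scores.get(b));
                return byScore != 0 ? byScore : a.compareTo(b);
            });
            return result;
        }
        // Fallback: no primary match, so try the secondary field.
        Set<String> fallback = null;
        for (String token : tokens) {
            Set<String> hits = secondary.getOrDefault(token, Collections.emptySet());
            if (fallback == null) {
                fallback = new TreeSet<>(hits);
            } else {
                fallback.retainAll(hits);
            }
        }
        return fallback == null ? new ArrayList<>() : new ArrayList<>(fallback);
    }
}
```

With "the bar is foo" added as a primary entry and "windstopper jackets" as a secondary one, suggest("foo bar") finds the primary entry despite the word order, while suggest("windstopper jac") finds nothing in the primary map and falls through to the secondary one.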

Custom Solr QueryParsers for fun and profit

Posted by Kelvin on 09 Sep 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

In this post, I'll show you what you need to do to implement a custom Solr QueryParser.

Step 1

Extend QParserPlugin.

public class TestQueryParserPlugin extends QParserPlugin {
  public void init(NamedList namedList) {
    // no configuration needed for this example
  }

  @Override public QParser createParser(String s, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    return new TestQParser(s, localParams, params, req);
  }
}

This is the class you'll define in solrconfig.xml, informing Solr of your queryparser. Define it like so:

<queryParser name="myfunparser" class="org.supermind.solr.queryparser.TestQueryParserPlugin"/>

Step 2

Extend QParser.

  public class TestQParser extends QParser {
    public TestQParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
      super(qstr, localParams, params, req);
    }

    @Override public Query parse() throws SyntaxError {
      return null;
    }
  }

Step 3

Actually implement the parsing in the parse() method.

Suppose we wanted to make a really simple parser for term queries, which are space-delimited. Here's how I'd do it:

@Override public Query parse() throws SyntaxError {
      // Pick up the default field and operator from the schema
      String defaultField = req.getSchema().getDefaultSearchFieldName();
      QueryParser.Operator defaultOperator = QueryParser.Operator.valueOf(req.getSchema().getQueryParserDefaultOperator());
      BooleanClause.Occur op = (defaultOperator == QueryParser.Operator.AND) ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
      // One TermQuery per space-delimited token, skipping empty tokens
      String[] arr = qstr.split(" ");
      BooleanQuery bq = new BooleanQuery(true);
      for(String s: arr) {
        if(s.trim().length() == 0) continue;
        bq.add(new TermQuery(new Term(defaultField, s)), op);
      }
      return bq;
}

Step 4

In your query, use the nested query syntax to call your queryparser, e.g.


Maybe in a follow-up post, I'll post the full code with jars and all.

High-level overview of Latent Semantic Analysis / LSA

Posted by Kelvin on 09 Sep 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming

I've just spent the last couple days wrapping my head around implementing Latent Semantic Analysis, and after wading through a number of research papers and quite a bit of linear algebra, I've finally emerged on the other end, and thought I'd write something about it to lock the knowledge in. I'll do my best to keep it non-technical, yet accurate.

Step One – Build the term-document matrix

Input : documents
Output : term-document matrix

Latent Semantic Analysis has the same starting point as most Information Retrieval algorithms : the term-document matrix. Specifically, columns are documents, and rows are terms. If a document contains a term, then the value of that row-column is 1, otherwise 0.

If you start with a corpus of documents, or a database table or something, then you'll need to index this corpus into this matrix. Meaning, lowercasing, removing stopwords, maybe stemming etc. The typical Lucene/Solr analyzer chain, basically.
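As a rough illustration of this step, here is a minimal sketch that builds a binary term-document matrix from a few strings. The tokenisation (lowercase, split on non-alphanumerics, drop stopwords, no stemming) is a simplifying assumption standing in for a real analyzer chain, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

class TermDocumentMatrix {
    final List<String> terms; // row labels, in sorted order
    final int[][] matrix;     // matrix[term row][document column]: 1 if present, else 0

    TermDocumentMatrix(List<String> documents, Set<String> stopwords) {
        List<Set<String>> termsPerDoc = new ArrayList<>();
        SortedSet<String> vocabulary = new TreeSet<>();
        for (String doc : documents) {
            Set<String> seen = new HashSet<>();
            // Very small stand-in for an analyzer: lowercase, split, remove stopwords.
            for (String token : doc.toLowerCase().split("[^a-z0-9]+")) {
                if (!token.isEmpty() && !stopwords.contains(token)) {
                    seen.add(token);
                }
            }
            termsPerDoc.add(seen);
            vocabulary.addAll(seen);
        }
        terms = new ArrayList<>(vocabulary);
        matrix = new int[terms.size()][documents.size()];
        for (int row = 0; row < terms.size(); row++) {
            for (int col = 0; col < documents.size(); col++) {
                matrix[row][col] = termsPerDoc.get(col).contains(terms.get(row)) ? 1 : 0;
            }
        }
    }
}
```

For the two documents "The cat sat" and "The dog sat down" with "the" as a stopword, the rows are cat, dog, down and sat, and only the "sat" row has a 1 in both columns.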

Step Two – Decompose the matrix

Input : term-document matrix
Output : 3 matrices, U, S and V

Apply Singular Value Decomposition (SVD) to the matrix. This is the computationally expensive step of the whole operation.

SVD is a fairly technical concept and quite an involved process (if you're doing it by hand). If you do a bit of googling, you're going to find all kinds of mathematical terms related to this, like matrix decomposition, eigenvalues, eigenvectors, PCA (principal component analysis), random projection etc.

The 5 second explanation of this step is that the original term-document matrix gets broken down into 3 simpler matrices: a term-term matrix (also known as U, or the left matrix), a matrix comprising the singular values (also known as S), and a document-document matrix (also known as V, or the right matrix).

Something which usually also happens in the SVD step for LSA, and which is important, is rank reduction. In this context, rank reduction means that the original term-document matrix gets somehow "factorized" into its constituent factors, and the k most significant factors or features are retained, where k is some number greater than zero and less than the original size of the term-document matrix. For example, a rank 3 reduction means that the 3 most significant factors are retained. This is important for you to know because most LSA/LSI applications will ask you to specify the value of k, meaning the application wants to know how many features you want to retain.

What's actually happening in this SVD rank reduction is basically an approximation of the original term-document matrix, which lets you compare features in a fast and efficient manner. Smaller k values generally run faster and use less memory, but are less accurate. Larger k values are more "true" to the original matrix, but take longer to compute. Note: this may not hold for the stochastic SVD implementations (involving random projection or some other method), where an increase in k doesn't lead to a linear increase in running time, but more like a log(n) increase in running time.
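In symbols, and assuming the usual convention of m terms and n documents (the notation here is an assumption for illustration, not tied to any particular library), the rank-k reduction can be written as:

```latex
A \approx A_k = U_k S_k V_k^{T}, \qquad
A \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
S_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\;
V_k \in \mathbb{R}^{n \times k},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0
```

Here U_k, S_k and V_k are U, S and V with only the first k columns (and the k largest singular values) kept. Among all matrices of rank at most k, A_k is the closest to A in the least-squares sense.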

Step Three – Build query vector

Input : query string
Output : query vector

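From here, we're on the downhill stretch. The query string needs to be expressed in the same terms as the documents, i.e. as a vector in the reduced k-dimensional space, so that it can be compared against the document vectors.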
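One common way to do this is usually called "folding in": build a term vector q for the query the same way as a document column, then project it with q_k = S_k^-1 U_k^T q. The post doesn't say which method it has in mind, so treat the sketch below as an assumption; the class, method and parameter names are hypothetical.

```java
// Hypothetical helper: projects a query term vector into the reduced k-dimensional space.
class QueryFolding {
    /**
     * @param q  query term vector, one entry per term (same row order as the term-document matrix)
     * @param uK the first k columns of U, indexed as uK[term][factor]
     * @param sK the k largest singular values
     * @return the query expressed in the k-dimensional space: S_k^-1 * U_k^T * q
     */
    static double[] fold(double[] q, double[][] uK, double[] sK) {
        int k = sK.length;
        double[] folded = new double[k];
        for (int factor = 0; factor < k; factor++) {
            double sum = 0.0;
            for (int term = 0; term < q.length; term++) {
                sum += uK[term][factor] * q[term]; // U_k^T * q
            }
            folded[factor] = sum / sK[factor];     // scale by S_k^-1
        }
        return folded;
    }
}
```

The output has k entries regardless of vocabulary size, which is what makes the comparison in the next step cheap.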

Step Four – Compute cosine distance

Input : query vector, document matrix
Output : document scores

To obtain how similar each document is to the query, aka the doc score, we have to go through each document vector in the matrix and calculate the cosine of the angle between it and the query vector. The closer that value is to 1, the more similar the document is to the query.
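A minimal sketch of that scoring loop, assuming the query and document vectors are already in the same space and have the same length (the class and method names are hypothetical):

```java
// Hypothetical helper: scores documents against a query by cosine similarity.
class CosineScorer {
    // Cosine of the angle between a and b; 0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // One score per document vector, in the same order as docVectors.
    static double[] score(double[] query, double[][] docVectors) {
        double[] scores = new double[docVectors.length];
        for (int d = 0; d < docVectors.length; d++) {
            scores[d] = cosine(query, docVectors[d]);
        }
        return scores;
    }
}
```

Sorting documents by descending score gives the ranked result list.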


Naive Solr Did You Mean re-searcher SearchComponent

Posted by Kelvin on 05 Sep 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

Solr makes Spellcheck easy. Super-easy in fact. All you need to do is to change some stuff in solrconfig.xml, and voila, spellcheck suggestions!

However, that's not how Google does spellchecking. What Google does is determine whether the query has a misspelling, and if so, transparently correct the misspelled term for you and perform the search, while also giving you the option of searching for the original term via a link.

Now, whilst it'd be uber-cool to have an exact equivalent in Solr, you'd need some statistical data to be able to perform this efficiently. A naive version is to use spellcheck corrections to transparently perform a new query when the original query returned fewer than x hits, where x is some arbitrarily small number.

Here's a simple SearchComponent that does just that:

import java.io.IOException;

import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class AutoSpellcheckResearcher extends QueryComponent {
  // if *threshold* or fewer hits are returned, a re-search is triggered
  private int threshold = 0;

  @Override public void init(NamedList args) {
    this.threshold = (Integer) args.get("threshold");
  }

  @Override public void prepare(ResponseBuilder rb) throws IOException {
  }

  @Override public void process(ResponseBuilder rb) throws IOException {
    long hits = rb.getNumberDocumentsFound();
    if (hits <= threshold) {
      // Look for a spellcheck collation in the response built so far
      final NamedList responseValues = rb.rsp.getValues();
      NamedList spellcheckresults = (NamedList) responseValues.get("spellcheck");
      if (spellcheckresults != null) {
        NamedList suggestions = (NamedList) spellcheckresults.get("suggestions");
        if (suggestions != null) {
          final NamedList collation = (NamedList) suggestions.get("collation");
          if (collation != null) {
            String collationQuery = (String) collation.get("collationQuery");
            if (responseValues != null) {
              // Record the original query and the collation that replaces it.
              // Note: re-running the search with collationQuery is not shown in this listing.
              responseValues.add("researched.original", rb.getQueryString());
              responseValues.add("researched.replaced", collationQuery);
            }
          }
        }
      }
    }
  }

  @Override public String getDescription() {
    return "AutoSpellcheckResearcher";
  }

  @Override public String getSource() {
    return "1.0";
  }
}

Reading ElasticSearch server book…

Posted by Kelvin on 23 May 2013 | Tagged as: Lucene / Solr / Elastic Search / Nutch

Just got my hands on a review copy of PacktPub's ElasticSearch Server book, which I believe is the first ES book on the market.

Review to follow shortly.
