Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about Lucene / Solr / Elasticsearch / Nutch

SolrCloud search request lifecycle and writing distributed Solr SearchComponents

Posted by Kelvin on 02 Jul 2018 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

This entry is part 4 of 4 in the Developing Custom Solr SearchComponents series

Recap

To recap, in a previous article, we saw that a SearchComponent consists of a prepare() and a process() method:

public class TestSearchComponent extends SearchComponent {
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
 
    }
 
    @Override
    public void process(ResponseBuilder rb) throws IOException {
 
    }
}

and that the standalone (non-SolrCloud) Solr application flow for a search request looks more or less like:

for( SearchComponent c : components ) {
  c.prepare(rb);
}
for( SearchComponent c : components ) {
  c.process(rb);
}

SolrCloud high-level search request lifecycle

When a search request is received by a SolrCloud collection, the following sequence of events takes place:

1. A node receives the search request.
2. On this node, each SearchComponent's prepare() method is called.
3. A loop is entered where the request is handled in stages. The 'standard' stages, as defined in ResponseBuilder, are:

  public static int STAGE_START = 0;
  public static int STAGE_PARSE_QUERY = 1000;
  public static int STAGE_TOP_GROUPS = 1500;
  public static int STAGE_EXECUTE_QUERY = 2000;
  public static int STAGE_GET_FIELDS = 3000;
  public static int STAGE_DONE = Integer.MAX_VALUE;

In each stage, each SearchComponent gets to 'vote' on what the next stage should be (via its distributedProcess() method). The lowest stage voted across all SearchComponents determines the actual next stage.

The request may not go through every stage. It all depends on what SearchComponents are handling this request.
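Inside SearchHandler, the loop that drives these stages looks roughly like the following simplified sketch (the real code also dispatches any queued ShardRequests and calls handleResponses() and finishStage() between stages, which we will get to shortly):

int nextStage = ResponseBuilder.STAGE_START;
do {
  rb.stage = nextStage;
  nextStage = ResponseBuilder.STAGE_DONE;
  for( SearchComponent c : components ) {
    // each component 'votes' on the next stage; the lowest vote wins
    nextStage = Math.min(nextStage, c.distributedProcess(rb));
  }
  // ... send out queued ShardRequests, call handleResponses()/finishStage() ...
} while (nextStage < ResponseBuilder.STAGE_DONE);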

For illustrative purposes, this is what QueryComponent's stage handling code looks like:

  protected int regularDistributedProcess(ResponseBuilder rb) {
    if (rb.stage < ResponseBuilder.STAGE_PARSE_QUERY)
      return ResponseBuilder.STAGE_PARSE_QUERY;
    if (rb.stage == ResponseBuilder.STAGE_PARSE_QUERY) {
      createDistributedStats(rb);
      return ResponseBuilder.STAGE_EXECUTE_QUERY;
    }
    if (rb.stage < ResponseBuilder.STAGE_EXECUTE_QUERY) return ResponseBuilder.STAGE_EXECUTE_QUERY;
    if (rb.stage == ResponseBuilder.STAGE_EXECUTE_QUERY) {
      createMainQuery(rb);
      return ResponseBuilder.STAGE_GET_FIELDS;
    }
    if (rb.stage < ResponseBuilder.STAGE_GET_FIELDS) return ResponseBuilder.STAGE_GET_FIELDS;
    if (rb.stage == ResponseBuilder.STAGE_GET_FIELDS && !rb.onePassDistributedQuery) {
      createRetrieveDocs(rb);
      return ResponseBuilder.STAGE_DONE;
    }
    return ResponseBuilder.STAGE_DONE;
  }

It is beyond the scope of this article to fully go into each stage, but it is sufficient for you to know that search requests are handled in stages.

4. At each stage, SearchComponents may send out requests to the other shards, requesting data in some form. This is known as a ShardRequest. Every ShardRequest has an integer field called purpose. The ShardRequest purpose should not be confused with the ResponseBuilder stage. They are both integer fields, but that's where the similarity ends.

For example, in QueryComponent's STAGE_EXECUTE_QUERY, createMainQuery() is called, in which a ShardRequest is sent out to the other shards, requesting them to execute the query and return the top matching docs. This ShardRequest has the purpose of ShardRequest.PURPOSE_GET_TOP_IDS.
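As a rough sketch (not QueryComponent's actual code), issuing such a ShardRequest from within a custom SearchComponent's distributedProcess() might look something like this, using ModifiableSolrParams and CommonParams from org.apache.solr.common.params:

  @Override
  public int distributedProcess(ResponseBuilder rb) throws IOException {
    if (rb.stage == ResponseBuilder.STAGE_EXECUTE_QUERY) {
      ShardRequest sreq = new ShardRequest();
      sreq.purpose = ShardRequest.PURPOSE_GET_TOP_IDS;
      // copy the original request params and force a local (non-distributed) query on each shard
      sreq.params = new ModifiableSolrParams(rb.req.getParams());
      sreq.params.set(CommonParams.DISTRIB, "false");
      rb.addRequest(this, sreq);  // queued here; SearchHandler sends it to the shards
      return ResponseBuilder.STAGE_GET_FIELDS;
    }
    return ResponseBuilder.STAGE_DONE;
  }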

5. The handling of the ShardRequest and its specific purpose occurs in 2 separate methods of the same SearchComponent: process() and handleResponses().

process() is executed on each shard which receives the ShardRequest.

handleResponses() is executed on the shard that initiated the ShardRequest, and is usually where merging and collation take place. handleResponses() would be much better named handleShardResponses(). The signature of handleResponses() is

public void handleResponses(ResponseBuilder rb, ShardRequest sreq)

The responses of each shard to this ShardRequest are available in sreq.responses.
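To make this concrete, here is a minimal, purely illustrative sketch of a SearchComponent that contributes a per-shard document count in process() and collates the counts in handleResponses(). The 'myCount' response key and the component itself are hypothetical; only the method hooks and the sreq.purpose/sreq.responses fields come from Solr:

import java.io.IOException;

import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.handler.component.ShardRequest;
import org.apache.solr.handler.component.ShardResponse;

public class CountingSearchComponent extends SearchComponent {
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
    }

    // Runs on each shard that receives the ShardRequest: contribute this shard's piece of data.
    @Override
    public void process(ResponseBuilder rb) throws IOException {
        rb.rsp.add("myCount", rb.req.getSearcher().getIndexReader().numDocs());
    }

    // Runs on the shard that initiated the ShardRequest: collate the per-shard pieces.
    @Override
    public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {
        if ((sreq.purpose & ShardRequest.PURPOSE_GET_TOP_IDS) == 0) return;
        long total = 0;
        for (ShardResponse srsp : sreq.responses) {
            NamedList<?> shardResponse = srsp.getSolrResponse().getResponse();
            Number count = (Number) shardResponse.get("myCount");  // hypothetical key added in process()
            if (count != null) total += count.longValue();
        }
        rb.rsp.add("myCountTotal", total);  // collated value returned to the client
    }

    @Override
    public String getDescription() {
        return "Illustrative per-shard document counter";
    }
}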

This step is potentially the most confusing conceptual bit of the entire lifecycle, so let's stop here to recap.

The key realization is that even though ShardRequest creation, ShardRequest processing and ShardResponse handling all take place in a single SearchComponent, they are executed on different shards.

For example, suppose we have 2 shards, ShardA and ShardB, the request arrives at ShardB, and we are at STAGE_EXECUTE_QUERY in QueryComponent.

a. In ShardB's QueryComponent.createMainQuery(), a ShardRequest with purpose ShardRequest.PURPOSE_GET_TOP_IDS is sent to all shards.
b. ShardA's and ShardB's QueryComponents receive the ShardRequest and respond to it in process().
c. ShardB's QueryComponent.handleResponses() is called with the ShardRequest containing the 2 ShardResponses, and collates the results.

6. After each stage is complete, finishStage() is called on each SearchComponent. The signature of finishStage() is

public void finishStage(ResponseBuilder rb)

7. The above steps are repeated until we arrive at the ResponseBuilder stage of STAGE_DONE.

This concludes the core lifecycle of a search request in SolrCloud.

Summary

In summary, a SolrCloud collection is made up of a number of shards. For a search request to be handled, each shard needs to provide data for the search request.

A distributed search request is handled in (ResponseBuilder) stages.

In any of the ResponseBuilder stages, a SearchComponent may initiate ShardRequests to the shards in a collection.

That same SearchComponent (though potentially residing on a different machine) also responds to that ShardRequest (in process()), and collates the ShardResponses from the various shards to that ShardRequest (in handleResponses() or finishStage()).

In this fashion, the necessary computation required to build the response to the search request is executed.

Notes

1. For simplicity's sake, I opted to use the terminology of shards as opposed to replicas, since this is more of a conceptual overview. I am aware that it is actually replicas that initiate and respond to ShardRequests.

2. Similarly, I use the terminology of machines as opposed to nodes. Nodes is the 'proper' term used in Solr documentation and code.

Introducing SolrCloud

Posted by Kelvin on 02 Jul 2018 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

This entry is part 3 of 4 in the Developing Custom Solr SearchComponents series

In this post, we will introduce SolrCloud and examine how it differs from standalone Solr.

Introducing SolrCloud

In a standalone Solr installation, the data resides on a single machine and all requests are served from this machine.

SolrCloud is the operation mode in Solr where the data resides on multiple machines (known as a cluster), and requests are served from this cluster of machines.

SolrCloud terminology

Where a Solr 'database' in standalone Solr is known as a Solr core, a Solr 'database' in SolrCloud is known as a Solr collection.

Further, a Solr collection is split into a number of partitions known as shards. A shard is a logical partition of a collection, comprising a subset of all the documents in the collection. It is, however, a conceptual partitioning, in the sense that a shard actually consists of one or more replicas. A replica is a full 'instance' or 'copy' of a shard. Replicas provide fault-tolerance to the Solr collection, so that when a machine goes down, there can still be replicas of that machine's shards on other machines.

Developing a Solr SearchComponent for standalone Solr

Posted by Kelvin on 02 Jul 2018 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

This entry is part 2 of 4 in the Developing Custom Solr SearchComponents series

In this article, I discuss SearchComponents briefly, and show how straightforward it is (in non-SolrCloud mode) to develop a custom Solr SearchComponent. In the next articles in the series, we examine the flow of logic in SolrCloud mode, and how you might be able to develop a SolrCloud-compatible SearchComponent.

Standalone Solr/non-SolrCloud

In standalone Solr mode, there are 3 methods to implement when developing a custom SearchComponent. Here is the simplest no-op SearchComponent.

public class TestSearchComponent extends SearchComponent {
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
 
    }
 
    @Override
    public void process(ResponseBuilder rb) throws IOException {
 
    }
 
    @Override
    public String getDescription() {
        return "My custom search component";
    }
}

The Java class that encapsulates Solr's search workflow is SearchHandler.

For non-SolrCloud mode, the general application flow is:

for( SearchComponent c : components ) {
  c.prepare(rb);
}
for( SearchComponent c : components ) {
  c.process(rb);
}

If you look at some of Solr's standard SearchComponents, you'll see that it is fairly arbitrary what goes into prepare() vs process(). It is entirely possible to do all the work in either prepare() or process(). If you'd like to modify how a certain component works, or to modify its output, you can either subclass it and override the relevant method altogether, or add a component before or after it and modify its side-effects (in the request params or the response), as sketched below.
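For instance, here is a minimal, hypothetical sketch of a component that runs after QueryComponent (by being declared after 'query' in the components list) and simply adds an extra section to the response. The 'mytag' parameter and response key are made up for illustration:

import java.io.IOException;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class AnnotatingSearchComponent extends SearchComponent {
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // Echo a request parameter back into the response.
        // Because this component is declared after 'query', QueryComponent's
        // results are already present in the response at this point.
        String tag = rb.req.getParams().get("mytag", "none");  // hypothetical parameter
        rb.rsp.add("mytag", tag);
    }

    @Override
    public String getDescription() {
        return "Adds a custom section to the response";
    }
}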

Deploying your custom SearchComponent

To deploy your new SearchComponent, bundle it as a jar and place it in the lib folder of your Solr core. If this is your first SearchComponent, this lib folder *does not exist* and must be created.

The path to the lib directory will be something like:

/{solr.home}/{core}/lib

e.g.

/var/solr/data/mycorename/lib

The lib directory is at the same level as the conf and data directories for your Solr core.

Then, in solrconfig.xml, add

<searchComponent name="test" class="com.company.solr.TestSearchComponent"/>
<requestHandler name="/test" class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
    </lst>
    <arr name="components">
      <str>test</str>
      <str>query</str>
    </arr>
  </requestHandler>

As the Solr guide shows, you can also use first-components and last-components in place of components when declaring your component.

Conclusion

This concludes this article in the series. In the next article, we will examine how search requests are served in SolrCloud and how the application flow becomes more involved.

Introducing this series on developing Solr SearchComponents

Posted by Kelvin on 01 Jul 2018 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

This entry is part 1 of 4 in the Developing Custom Solr SearchComponents series

Solr and Elasticsearch more or less have feature-parity. Elsewhere, I have examined in detail the similarities and differences between Elasticsearch and Solr. One of the major differences between the 2 open-source search products, though, is how easy Solr makes it to plug custom logic into the search workflow. In Solr, you do this by implementing a SearchComponent class.

Embed custom Javascript and HTML in a Kibana 4.x visualization

Posted by Kelvin on 11 Jan 2016 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

The embarrassingly simple answer to embedding ANY Javascript and HTML into a Kibana vis is to hack the markdown_vis plugin to not use markdown at all, but just display the HTML as-is.

Modify src/plugins/markdown_vis/public/markdown_vis_controller.js, and comment out

$scope.html = $sce.trustAsHtml(marked(html));

and replace it with

$scope.html = $sce.trustAsHtml(html);

You'll need to recreate the bundles (just install or remove/reinstall the Sense plugin, for example) and restart Kibana for this to take effect. It's pretty awesome, because now the sky's the limit!

Lucene 5 NRT Example

Posted by Kelvin on 16 Dec 2015 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

I just added an NRT search example for Lucene 5.x to lucenetutorial.com.

Check it out here: http://www.lucenetutorial.com/lucene-nrt-hello-world.html

Pain-free Solr replication

Posted by Kelvin on 02 Dec 2015 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

Here's a setup I use for totally pain-free Solr replication, which allows you to switch masters/slaves quickly without messing with config files.

Add this to solrconfig.xml

<requestHandler name="/replication" class="solr.ReplicationHandler" >
  <str name="maxNumberOfBackups">1</str>
  <lst name="master">
        <str name="enable">${enable.master:false}</str>
        <str name="replicateAfter">startup</str>
        <str name="replicateAfter">commit</str>
        <str name="confFiles">solrconfig.xml,schema.xml,stopwords.txt,elevate.xml</str>
        <str name="commitReserveDuration">00:00:10</str>
    </lst>
    <lst name="slave">
        <str name="enable">${enable.slave:false}</str>
        <str name="masterUrl">http://${replication.master}:8983/solr/corename</str>
        <str name="pollInterval">00:00:20</str>
        <str name="compression">internal</str>
        <str name="httpConnTimeout">5000</str>
        <str name="httpReadTimeout">10000</str>
     </lst>
</requestHandler>

Substitute "corename" with your actual core name.

Now create your master core with this command (substitute MASTER_IP_ADDRESS as appropriate):

curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=corename&instanceDir=corename&schema=schema.xml&config=solrconfig.xml&dataDir=data&property.enable.master=true&property.enable.slave=false&property.replication.master=MASTER_IP_ADDRESS"

And this for your slaves:

curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=corename&instanceDir=corename&schema=schema.xml&config=solrconfig.xml&dataDir=data&property.enable.master=false&property.enable.slave=true&property.replication.master=MASTER_IP_ADDRESS"

Now when you need to promote a slave to a master, just do this on the new master:

curl "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=corename" && curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=corename&instanceDir=corename&schema=schema.xml&config=solrconfig.xml&dataDir=data&property.enable.master=true&property.enable.slave=false&property.replication.master=NEW_MASTER_IP_ADDRESS"

And this on all slaves:

curl "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=corename" && curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=corename&instanceDir=corename&schema=schema.xml&config=solrconfig.xml&dataDir=data&property.enable.master=false&property.enable.slave=true&property.replication.master=NEW_MASTER_IP_ADDRESS"

Copy these commands and do a search and replace on "corename" for your actual core.

If you have a cssh cluster setup, you can update all slaves in one fell swoop.

Monier-Williams Sanskrit-English-IAST search engine

Posted by Kelvin on 17 Sep 2015 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, Python

I just launched a search application for the Monier-Williams dictionary, which is the definitive Sanskrit-English dictionary.

See it in action here: http://sanskrit.supermind.org

The app is built in Python and uses the Whoosh search engine. I chose Whoosh instead of Solr or ElasticSearch because I wanted to try building a search app which didn't depend on Java.

Features include:
– full-text search in Devanagari, English, IAST, ascii and HK
– results link to page scans
– more frequently occurring word senses are boosted higher in search results
– visually displays the MW level or depth of a word with list indentation

A HTML5 ElasticSearch Query DSL Builder

Posted by Kelvin on 16 Sep 2015 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch


TL;DR: I parsed ElasticSearch source and generated an HTML app that allows you to build ElasticSearch queries using its JSON Query DSL. You can see it in action here: http://supermind.org/elasticsearch/query-dsl-builder.html


I really like ElasticSearch's JSON-based Query DSL – it lets you create fairly complex search queries in a relatively painless fashion.

I do not, however, fancy the query DSL documentation. I've often found it inadequate, inconsistent with the source, and at times downright confusing.

Browsing the source, I realised that ES parses JSON queries in a fairly regular fashion, which lends itself well to regex-based parsing of the Java source in order to generate documentation of the JSON 'schema'.

I did the parsing in Java, and the actual query builder UI is in HTML and Javascript. The Java phase outputs a JSON data model of the query DSL, which the HTML app then uses to dynamically build the HTML forms etc.

Because of the consistent naming conventions of the objects, I was also able to embed links to documentation and github source within the page itself. Very useful!

You can see the result in action here: http://supermind.org/elasticsearch/query-dsl-builder.html

PS: I first did this for ES version 1.2.1, and then subsequently for 1.4.3 and now 1.7.2. The approach seems to work consistently across versions, with minor changes required in the Java backend parsing between version bumps. Hopefully this remains the case when we go to ES 2.x.

Phrase-based Out-of-order Solr Autocomplete Suggester

Posted by Kelvin on 16 Sep 2013 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

Solr has a number of Autocomplete implementations which are great for most purposes. However, a client of mine recently had some fairly specific requirements for autocomplete:

1. phrase-based substring matching
2. out-of-order matches ('foo bar' should match 'the bar is foo')
3. fallback matching to a secondary field when substring matching on the primary field fails, e.g. 'windstopper jac' doesn't match anything on the 'title' field, but matches on the 'category' field

The most direct way to model this would probably have been to create a separate Solr core and use ngram + shingles indexing and Solr queries to obtain results. However, because the index was fairly small, I decided to go with an in-memory approach.

The general strategy was:

1. For each entry in the primary field, create ngram tokens, adding entries to a Guava Table, where the key was the ngram, the column was the original string, and the value was a distance score.
2. For each entry in the secondary field, create ngram tokens and add entries to a Guava Multimap, where the key was the ngram, and the value was the term.
3. When an autocomplete query is received, split it by space, then do lookups against the primary Table.
4. If no matches are found, look up against the secondary Multimap.
5. Return results.

The scoring for the primary Table was a simple one based on length of word and distance of token from the start of the string.
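As a rough illustration of this strategy (my own minimal sketch, not the client code; the substring lengths and the scoring formula are made-up stand-ins), an in-memory suggester along these lines might look like:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Multimap;
import com.google.common.collect.Table;

public class InMemorySuggester {
    // row key = substring token, column key = original phrase, value = score
    private final Table<String, String, Double> primary = HashBasedTable.create();
    // key = substring token, value = secondary-field term
    private final Multimap<String, String> secondary = ArrayListMultimap.create();

    private static final int MIN_GRAM = 2;

    public void addPrimary(String phrase) {
        String[] words = phrase.toLowerCase().split("\\s+");
        for (int w = 0; w < words.length; w++) {
            // made-up stand-in for the original scoring: shorter words and
            // earlier positions in the phrase score higher
            double score = 1.0 / (words[w].length() + w + 1);
            for (String gram : substrings(words[w])) {
                primary.put(gram, phrase, score);
            }
        }
    }

    public void addSecondary(String term) {
        for (String gram : substrings(term.toLowerCase())) {
            secondary.put(gram, term);
        }
    }

    public List<String> suggest(String query) {
        String[] tokens = query.toLowerCase().split("\\s+");
        Map<String, Double> scores = new HashMap<>(primary.row(tokens[0]));
        for (int i = 1; i < tokens.length; i++) {
            Map<String, Double> matches = primary.row(tokens[i]);
            scores.keySet().retainAll(matches.keySet());  // every token must match, in any order
            for (Map.Entry<String, Double> e : matches.entrySet()) {
                scores.computeIfPresent(e.getKey(), (k, v) -> v + e.getValue());
            }
        }
        if (!scores.isEmpty()) {
            List<String> results = new ArrayList<>(scores.keySet());
            results.sort(Comparator.comparingDouble((String s) -> scores.get(s)).reversed());
            return results;
        }
        // fall back to the secondary field when the primary yields nothing
        Set<String> fallback = new LinkedHashSet<>();
        for (String token : tokens) {
            fallback.addAll(secondary.get(token));
        }
        return new ArrayList<>(fallback);
    }

    // all substrings of length >= MIN_GRAM; feasible in memory because the index is small
    private static List<String> substrings(String word) {
        List<String> grams = new ArrayList<>();
        for (int len = MIN_GRAM; len <= word.length(); len++) {
            for (int i = 0; i + len <= word.length(); i++) {
                grams.add(word.substring(i, i + len));
            }
        }
        return grams;
    }
}

With something like this in place, suggest("foo bar") would match a primary entry such as 'the bar is foo', satisfying the out-of-order requirement, and a query like 'windstopper jac' that misses the primary field would fall through to the secondary Multimap.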
