Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about Lucene / Solr / Elasticsearch / Nutch

Tokenizing second-level and top-level domain for a URL in Lucene and Solr

Posted by Kelvin on 12 Nov 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

In my previous post, I described how to extract second- and top-level domains from a URL in Java.

Now, I'll build a Lucene Tokenizer out of it, and a Solr TokenizerFactory class.

DomainTokenizer doesn't do anything really fancy. It returns the hostname as the first token, the second-level domain as the second token, and the top-level domain as the last token.

e.g. given the URL http://www.supermind.org, it'll return

www.supermind.org
.supermind.org
.org

Doing so allows you to quickly return all documents in the Lucene or Solr index matching the second-level domain or the TLD.

package org.supermind.solr.analysis;
 
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
 
import java.io.IOException;
import java.io.Reader;
import java.net.URL;
 
public class DomainTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
 
  public static final int STATE_UNINITIALIZED = -1;
  public static final int STATE_INITIALIZED = 0;
  public static final int STATE_2LD = 1;
  public static final int STATE_TLD = 2;
  public static final int STATE_DONE = 3;
 
  private int state = STATE_UNINITIALIZED;
 
  private URL url = null;
  private SecondLDExtractor extractor;
  private boolean index2LD;
  private boolean indexTLD;
 
  public DomainTokenizer(Reader input, SecondLDExtractor extractor, boolean index2LD, boolean indexTLD) {
    super(input);
    this.extractor = extractor;
    this.index2LD = index2LD;
    this.indexTLD = indexTLD;
  }
 
  @Override
  public boolean incrementToken() throws IOException {
    if (state == STATE_DONE) return false;
 
    clearAttributes();
    if (this.url == null) {
      state = STATE_INITIALIZED;
 
      // read the entire input (the URL) into a StringBuilder
      StringBuilder sb = new StringBuilder();
      char[] buffer = new char[512];
      while (true) {
        final int length = input.read(buffer, 0, buffer.length);
        if (length == -1) break;
        sb.append(buffer, 0, length);
      }
      this.url = new URL(sb.toString());
      if (!index2LD && !indexTLD) state = STATE_DONE;
      termAtt.append(url.getHost());
      return true;
    } else if (index2LD && state < STATE_2LD) {
      state = STATE_2LD;
      String twold = extractor.extract2LD(url.getHost());
      termAtt.append("."+twold);
      return true;
    } else if (indexTLD && state < STATE_TLD) {
      state = STATE_TLD;
      String tld = extractor.extractTLD(url.getHost());
      termAtt.append(tld);
      return true;
    }
    state = STATE_DONE;
    return false;
  }
}

and here's the corresponding Solr TokenizerFactory.

package org.supermind.solr.analysis;
 
import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;
 
import java.io.Reader;
import java.util.Map;
 
public class DomainTokenizerFactory extends BaseTokenizerFactory {
  private SecondLDExtractor extractor;
  private boolean index2LD;
  private boolean indexTLD;
 
  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    assureMatchVersion();
    index2LD = getBoolean("index2LD", true);
    indexTLD = getBoolean("indexTLD", true);
    if (index2LD || indexTLD) {
      initTLDExtractor();
    }
  }
 
  private void initTLDExtractor() {
    extractor = new SecondLDExtractor();
    extractor.init();
  }
 
  public Tokenizer create(Reader input) {
    DomainTokenizer tokenizer = new DomainTokenizer(input, extractor, index2LD, indexTLD);
    return tokenizer;
  }
}

Here's a sample fieldType definition.

<fieldType name="domain" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="org.supermind.solr.analysis.DomainTokenizerFactory"/>
      </analyzer>
</fieldType>
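
To sanity-check the analysis chain outside of Solr, you can drive DomainTokenizer directly. Here's a minimal sketch (the demo class name is just for illustration, and effective_tld_names.dat must be on the classpath next to SecondLDExtractor, as described in the previous post below):

package org.supermind.solr.analysis;
 
import java.io.StringReader;
 
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
 
public class DomainTokenizerDemo {
  public static void main(String[] args) throws Exception {
    SecondLDExtractor extractor = new SecondLDExtractor();
    extractor.init();
    Tokenizer tok = new DomainTokenizer(new StringReader("http://www.supermind.org"),
        extractor, true, true);
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    while (tok.incrementToken()) {
      System.out.println(term.toString()); // www.supermind.org, .supermind.org, .org
    }
    tok.close();
  }
}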

Extracting second-level domains and top-level domains (TLD) from a URL in Java

Posted by Kelvin on 12 Nov 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

It turns out that extracting second- and top-level domains is not a simple task. The primary difficulty is that in addition to the usual suspects (.com, .org, .net, etc.), there are country-code suffixes (.uk, .it, .de, etc.) which need to be accounted for.

Regex alone has no way of handling this. http://publicsuffix.org/list/ contains a somewhat authoritative list of TLDs and ccTLDs that we can use.

Here follows a Java class which parses this list, builds a regex from it, and extracts the TLD and second-level domain from a hostname. You'll need to download effective_tld_names.dat from http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 and place it in the same directory as the Java class.

In my next post, I'll build a Lucene Tokenizer out of this, so it can be used in Lucene and Solr.

package org.supermind.solr.analysis;
 
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class SecondLDExtractor {
  private StringBuilder sb = new StringBuilder();
  private Pattern pattern;
 
  public void init() {
    try {
      ArrayList<String> terms = new ArrayList<String>();
 
      BufferedReader br = new BufferedReader(new InputStreamReader(getClass().getResourceAsStream("effective_tld_names.dat")));
      String s = null;
      while ((s = br.readLine()) != null) {
        s = s.trim();
        if (s.length() == 0 || s.startsWith("//") || s.startsWith("!")) continue;
        terms.add(s);
      }
      Collections.sort(terms, new StringLengthComparator());
      for(String t: terms) add(t);
      compile();
      br.close();
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }
 
  protected void add(String s) {
    s = s.replace(".", "\\.");
    // handle wildcard rules such as "*.ck" before prepending the leading dot
    // (a wildcard rule matches exactly one label)
    if (s.startsWith("*")) {
      s = s.replace("*", "[^.]+");
    }
    s = "\\." + s;
    sb.append(s).append("|");
  }
 
  public void compile() {
    if (sb.length() > 0) sb.deleteCharAt(sb.length() - 1);
    sb.insert(0, "[^.]+?(");
    sb.append(")$");
    pattern = Pattern.compile(sb.toString());
    sb = null;
  }
 
  public String extract2LD(String host) {
    Matcher m = pattern.matcher(host);
    if (m.find()) {
      return m.group(0);
    }
    return null;
  }
 
  public String extractTLD(String host) {
    Matcher m = pattern.matcher(host);
    if (m.find()) {
      return m.group(1);
    }
    return null;
  }
 
  public static class StringLengthComparator implements Comparator<String> {
    public int compare(String s1, String s2) {
      if (s1.length() > s2.length()) return -1;
      if (s1.length() < s2.length()) return 1;
      return 0;
    }
  }
}
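
For reference, here's roughly how the extractor behaves once initialized. This is a quick sketch; the expected output (derived from the regex above, assuming the standard co.uk entry is in the list) is shown in the comments:

public static void main(String[] args) {
  SecondLDExtractor extractor = new SecondLDExtractor();
  extractor.init(); // reads effective_tld_names.dat from the classpath
  System.out.println(extractor.extract2LD("www.supermind.org")); // supermind.org
  System.out.println(extractor.extractTLD("www.supermind.org")); // .org
  System.out.println(extractor.extractTLD("www.bbc.co.uk"));     // .co.uk
}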

Book review of Apache Solr 3.1 Cookbook

Posted by Kelvin on 30 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

I recently got a chance to review Apache Solr 3.1 Cookbook by Rafal Kuc, published by PacktPub.

Now, to give a bit of context: I help folks implement and customize Solr professionally, so I know a fair bit about how Solr works, and am also quite familiar with the code internals. I was, therefore, pleasantly surprised, when leafing through the table of contents, to find at least a couple of entries which had me wondering: Now how would I do that?

Here's the high-level TOC:

Chapter 1: Apache Solr Configuration
Chapter 2: Indexing your Data
Chapter 3: Analyzing your Text Data
Chapter 4: Solr Administration
Chapter 5: Querying Solr
Chapter 6: Using Faceting Mechanism
Chapter 7: Improving Solr Performance
Chapter 8: Creating Applications that use Solr and Developing your Own Solr Modules
Chapter 9: Using Additional Solr Functionalities
Chapter 10: Dealing with Problems

And here's a list of the recipes in Chapter 5, to give you a feel of the recipes:

Chapter 5: Querying Solr
Introduction
Asking for a particular field value
Sorting results by a field value
Choosing a different query parser
How to search for a phrase, not a single word
Boosting phrases over words
Positioning some documents over others on a query
Positioning documents with words closer to each other first
Sorting results by a distance from a point
Getting documents with only a partial match
Affecting scoring with function
Nesting queries

You can view the full table of contents from the PacktPub website.

Now, first of all, this is a cookbook-style book with lots of snippets showing how to do stuff in Solr. If you know next to nothing about Solr, this book is not for you. As the PacktPub site says:

This book is part of Packt's Cookbook series… The recipes deal with common problems of working with Solr by using easy-to-understand, real-life examples. The book is not in any way a complete Apache Solr reference…

If, however, you're just past beginner level and wanting to dig a little deeper into Solr and find the FAQs, tutorials, Solr Wiki etc too confusing/verbose/unorganized, then I think Apache Solr 3.1 Cookbook is probably exactly what you need.

The examples are concise, stand-alone, and can be readily implemented in 5 minutes or less. They're a non-threatening way to get past the beginner level, and also offer a glimpse at some of Solr's more advanced functionality.

Oddly enough, the reviews on the web (Amazon, Goodreads and Google Books) all rate this book as mediocre, with an average of 3+ stars. In my opinion, this book easily deserves at least 4, if not 4.5 stars, assuming you're not a complete Solr n00b.

OK, I admit the writing is a little repetitive at times (the author is Polish), and some of the recipes are really, really basic, but nonetheless, for a cookbook-style guide aimed at the beginner-intermediate crowd, I think it's great!

Get it from amazon here: http://amzn.to/LNHQxo.
More details at PacktPub

Simplistic noun-phrase chunking with POS tags in Java

Posted by Kelvin on 16 Jun 2012 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

I needed to extract noun phrases from text. The way this is generally done is using Part-of-Speech (POS) tags. OpenNLP has both a POS-tagger and a noun-phrase chunker. However, it's really, really slow!

I decided to look into alternatives, and chanced upon QTag.

QTag is a "freely available, language independent POS-Tagger. It is implemented in Java, and has been successfully tested on Mac OS X, Linux, and Windows."

It's waaay faster than OpenNLP for POS-tagging, though I haven't done any benchmarks on accuracy.

Here's my really simplistic but adequate implementation of noun-phrase chunking using QTag.

  private static Qtag qt; // static so it can be lazily initialized from the static method below
  public static List<String> chunkQtag(String str) throws IOException {
    List<String> result = new ArrayList<String>();
    if (qt == null) {
      qt = new Qtag("lib/english");
      qt.setOutputFormat(2);
    }
 
    String[] split = str.split("\n");
    for (String line : split) {
      String s = qt.tagLine(line, true);
      String lastTag = null;
      String lastToken = null;
      StringBuilder accum = new StringBuilder();
      for (String token : s.split("\n")) {
        String[] s2 = token.split("\t");
        if (s2.length < 2) continue;
        String tag = s2[1];
 
        if (tag.equals("JJ")
            || tag.startsWith("NN")
            || tag.startsWith("??")
            || (lastTag != null && lastTag.startsWith("NN") && s2[0].equalsIgnoreCase("of"))
            || (lastToken != null && lastToken.equalsIgnoreCase("of") && s2[0].equalsIgnoreCase("the"))
            ) {
          accum.append(s2[0]).append("-");
        } else {
          if (accum.length() > 0) {
            accum.deleteCharAt(accum.length() - 1);
            result.add(accum.toString());
            accum = new StringBuilder();
          }
        }
        lastTag = tag;
        lastToken = s2[0];
      }
      if (accum.length() > 0) {
        accum.deleteCharAt(accum.length() - 1);
        result.add(accum.toString());
      }
    }
    return result;
  }

The method returns a list of noun phrases.
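
For example, a call like the following (QTag's actual tagging determines the output, so the phrases shown are only illustrative):

List<String> phrases = chunkQtag("The quick brown fox jumps over the lazy dog.");
System.out.println(phrases); // e.g. [quick-brown-fox, lazy-dog]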

Separating relevance signals from document content in Solr or Lucene

Posted by Kelvin on 16 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

Full-text search has traditionally been about the indexing and ranking of a corpus of unstructured text content.

The vector space model (VSM) and its cousins, in addition to structural ranking algorithms such as PageRank, have been the authoritative ways of ranking documents.

However, with the recent proliferation of personalization, analytics, social networks and the like, there are increasing ways of determining document relevance, both globally and on a per-user basis. Some call these relevance signals.

Global relevance signals are simple to incorporate into Solr, either as a separate field + query-time boost, or an index-time document boost.

However, there has not traditionally been a satisfactory way of incorporating per-user relevance signals in Lucene/Solr's search process. We'll therefore be focusing on user-specific relevance signals for the rest of this document…

Before going further, here are some examples of user-specific relevance signals:

  • clickstream data
  • search logs
  • user preferences
  • likes, +1, etc
  • purchase history
  • blog, twitter, tumblr feed
  • social graph

I'm going to describe a system of incorporating user-specific relevance signals into your Solr searches in a scalable fashion.

Index

In your Lucene/Solr index, store the documents you want searched. This can be products, companies, jobs etc. It can be multiple data-types, and each doc needs a unique id.

Relevance signals

In a separate SQL/NoSQL database, store your relevance signals. They should be structured in a way which doesn't require complex joins, and be keyed by user id, i.e. with a single get() query you should be able to retrieve all the necessary relevance data for that user.

One way of doing this is storing the relevance data as json, with individual fields as object ids.

You should also preferably pre-process the relevance data so there is a float/integer which provides the "score" or "value" of that signal.

For example:

{"SPN1002":10,"SPN399":89,"SPN19":1}

In this JSON example, the SPNxxx are product ids, and the integer value is the score.

Integrate

Now implement a custom FunctionQuery in Solr which accepts the userid as a parameter. Usage will look something like this: influence(201)^0.5 where influence is the name of the functionquery and 201 is the user id, 0.5 being the weight boost.

In the FunctionQuery, issue the DB request and obtain the relevance signal json, e.g. the example above.
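
For example, the lookup step inside the function query might look something like this. This is only a sketch: the "signals" hash name is made up, and Jedis plus json-smart are just stand-ins for whatever store and JSON library you use.

String userId = fp.parseArg();                                 // e.g. influence(201)
String json = new Jedis("localhost").hget("signals", userId);  // a single get() per request
JSONObject jsonObj = (json != null)
    ? (JSONObject) JSONValue.parse(json)                       // e.g. {"SPN1002":10,"SPN399":89,"SPN19":1}
    : new JSONObject();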

Now, within the ValueSource itself, load the document ids via the FieldCache and look them up against the JSON. The code looks something like:

@Override public DocValues getValues(Map context, IndexReader reader) throws IOException {
    // map each Lucene doc id to the value of the Solr id field
    final String[] lookup = FieldCache.DEFAULT.getStrings(reader, idField);
    return new DocValues() {
      @Override public float floatVal(int doc) {
        final String id = lookup[doc];
        if (jsonObj == null) return 0;
        Object v = jsonObj.get(id);
        if (v == null) return 0;
        // the JSON values are typically integers, so accept any Number
        if (v instanceof Number) {
          return ((Number) v).floatValue();
        }
        return 0;
      }
 
      @Override public String toString(int doc) {
        return Float.toString(floatVal(doc));
      }
    };
}

What's happening here is that the value of the id field is looked up from the Lucene document id via the FieldCache. With our JSON example above, the id value could be something like "SPN332".

This is then used to check against the JSON object. If it exists, the integer/float value is returned as the functionquery score of that doc. Else, 0 is returned.

ElasticSearch 0.19 extension points

Posted by Kelvin on 14 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

A list of the extension points exposed by ElasticSearch (as of 0.19.4)

  • Analysis plugins – use different kinds of analyzers
  • River plugins – A river is an external datasource which ES indexes
  • Transport plugins – Different means of exposing ES API, e.g. Thrift, memcached
  • Site plugins – for running various ES-related webapps, like the ES head admin webapp
  • Custom REST endpoint – lets you define a REST action by extending BaseRestHandler.
  • Scripting plugins – providing support for using different scripting languages as search scripts
  • NativeScripts – loosely equivalent to Solr's FunctionQuery. Allows you to return "script fields", custom scores or perform search filtering.

As far as I can tell (from the source), there's no equivalent of Solr's SearchComponent, which allows you to modify the search request processing pipeline in an extremely flexible manner.

Connecting Redis to ElasticSearch for custom scoring with nativescripts

Posted by Kelvin on 14 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

After connecting Redis and MongoDB to Solr, I figured it'd be interesting to do the same with ElasticSearch. Here's the result of my experiments:

We'll be implementing this using AbstractSearchScript, which is roughly ElasticSearch's version of Solr's FunctionQuery.

ES' NativeScriptFactory corresponds loosely to Solr's ValueSourceParser, and AbstractSearchScript to ValueSource.

public class RedisNativeScriptFactory implements NativeScriptFactory {
  @Override public ExecutableScript newScript(@Nullable Map<String, Object> params) {
    return new RedisScript(params);
  }
}
public class RedisScript extends AbstractFloatSearchScript {
  private String idField;
  private String redisKey;
  private String redisValue;
  private final Jedis jedis;
  private JSONObject obj;
 
  public RedisScript(Map<String, Object> params) {
    this.idField = (String) params.get("idField");
    this.redisKey = (String) params.get("redisKey");
    this.redisValue = (String) params.get("redisValue");
    jedis = new Jedis("localhost");
    String v = jedis.hget(redisKey, redisValue);
    if (v != null) {
      obj = (JSONObject) JSONValue.parse(v);
    } else {
      obj = new JSONObject();
    }
  }
 
  @Override public float runAsFloat() {
    String id = doc().field(idField).stringValue();
    Object v = obj.get(id);
    if (v != null) {
      try {
        return Float.parseFloat(v.toString());
      } catch (NumberFormatException e) {
        return 0;
      }
    }
    return 0;
  }
}

Now in config/elasticsearch.yml, add this:

script.native:
  redis.type: org.supermind.es.redis.RedisNativeScriptFactory

Change redis to whatever you want the script name to be, and change the class name accordingly too.

Now, to use this:

curl -XGET 'http://localhost:9200/electronics/product/_search' -d '{
  "query" :{
     "custom_score": {
       "query" : { "match_all": {}},
       "script" : "redis",
       "params" :{
          "idField": "id",
          "redisKey": "bar",
          "redisValue" : "500"
       },
       "lang": "native"
     }
  }
}'

PS: My implementation of RedisScript assumes a Redis hash has been populated with JSON objects whose keys are idField values. Here's a snippet that populates the Redis hash. JSON objects are created with the json-smart package, but you can plug in your favourite JSON lib:

public static void main(String[] args) {
  Jedis jedis = new Jedis("localhost");
  int num = 100000;
  Random r = new Random();
  for (int i = 0; i < num; ++i) {
    JSONObject o = new JSONObject();
    int numberOfEntries = r.nextInt(100);
    for (int j = 0; j < numberOfEntries; ++j) {
      o.put("es" + j, r.nextInt(100));
    }
    String json = o.toJSONString(JSONStyle.MAX_COMPRESS);
    jedis.hset("bar", Integer.toString(i), json);
  }
}

Using MongoDB from within Solr for boosting documents

Posted by Kelvin on 09 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

Previously, I blogged about connecting Redis to Solr for relevance boosting via a custom FunctionQuery. Now, I'll talk about doing the same with MongoDB.

In solrconfig.xml, declare your ValueSourceParser.

<valueSourceParser name="mongo" class="org.supermind.solr.mongodb.MongoDBValueSourceParser">
  <str name="host">localhost</str>
  <str name="dbName">solr</str>
  <str name="collectionName">electronics</str>
  <str name="key">userId</str>
  <str name="idField">id</str>
</valueSourceParser>

The host, dbName and collectionName parameters are self-explanatory.

The key parameter specifies the MongoDB field used to look up the document, with the function-query argument supplying its value. The idField parameter declares the Solr field whose values are matched against the fields of that MongoDB document.

Here's the ValueSourceParser.

public class MongoDBValueSourceParser extends ValueSourceParser {
 
  private String idField;
  private String dbName;
  private String collectionName;
  private String key;
  private String host;
  private DBCollection collection;
 
  @Override public void init(NamedList args) {
    host = (String) args.get("host");
    idField = (String) args.get("idField");
    dbName = (String) args.get("dbName");
    collectionName = (String) args.get("collectionName");
    key = (String) args.get("key");
    try {
      Mongo mongo = new Mongo(host);
      collection = mongo.getDB(dbName).getCollection(collectionName);
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException(e);
    }
  }
 
  @Override public ValueSource parse(FunctionQParser fp) throws ParseException {
    String value = fp.parseArg();
 
    final DBObject obj = collection.findOne(new BasicDBObject(key, value));
    return new MongoDBValueSource(idField, obj, value);
  }
}

Here's the interesting method in MongoDBValueSource.

  @Override public DocValues getValues(Map context, IndexReader reader) throws IOException {
    final String[] lookup = FieldCache.DEFAULT.getStrings(reader, idField);
    return new DocValues() {
      @Override public byte byteVal(int doc) {
        return (byte) intVal(doc);
      }
 
      @Override public short shortVal(int doc) {
        return (short) intVal(doc);
      }
 
      @Override public float floatVal(int doc) {
        final String id = lookup[doc];
        if (obj == null) return 0;
        Object v = obj.get(id);
        if (v == null) return 0;
        if (v instanceof Float) {
          return ((Float) v);
        } else if (v instanceof Integer) {
          return ((Integer) v);
        } else if (v instanceof String) {
          try {
            return Float.parseFloat((String) v);
          } catch (NumberFormatException e) {
            return 0;
          }
        }
        return 0;
      }
 
      @Override public int intVal(int doc) {
        final String id = lookup[doc];
        if (obj == null) return 0;
        Object v = obj.get(id);
        if (v == null) return 0;
        if (v instanceof Integer) {
          return (Integer) v;
        } else if (v instanceof String) {
          try {
            return Integer.parseInt((String) v);
          } catch (NumberFormatException e) {
            return 0;
          }
        }
        return 0;
      }
 
      @Override public long longVal(int doc) {
        return intVal(doc);
      }
 
      @Override public double doubleVal(int doc) {
        return floatVal(doc);
      }
 
      @Override public String strVal(int doc) {
        final String id = lookup[doc];
        if (obj == null) return null;
        Object v = obj.get(id);
        return v != null ? v.toString() : null;
      }
 
      @Override public String toString(int doc) {
        return strVal(doc);
      }
    };
  }

You can now use the FunctionQuery mongo in your search requests. For example:
http://localhost:8983/solr/select?defType=edismax&q=cat:electronics&bf=mongo(1377)
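
For completeness, the MongoDB collection is expected to hold one document per user: the key field (userId here) identifies the user, and the remaining fields map Solr ids to boost values. Here's a throwaway seeding sketch (the ids and values are made up):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
 
public class SeedMongoBoosts {
  public static void main(String[] args) throws Exception {
    DBCollection c = new Mongo("localhost").getDB("solr").getCollection("electronics");
    // one doc per user: "userId" is the lookup key, the other fields map Solr ids to boosts
    c.insert(new BasicDBObject("userId", "1377")
        .append("SPN1002", 10)
        .append("SPN399", 89)
        .append("SPN19", 1));
  }
}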

Connecting Redis to Solr for boosting documents

Posted by Kelvin on 07 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

There are a number of instances in Solr where it's desirable to retrieve data from an external datastore for boosting purposes instead of trying to contort Solr with multiple queries, joins etc.

Here's a trivial example:

Jobs are stored as documents in Solr. Users of the application can rank a job from 1-10. We need to boost each job with the user's rank if it exists.

Now, attempting to model this fully in Solr would be fairly inefficient, especially for a large number of jobs and/or users, since each time a user ranks a job, the searcher has to reload in order for that data to be available for searching.

A much more efficient way of implementing this is to store the rank data in a NoSQL store like Redis, retrieve the rank at query time, and use it to boost the documents accordingly.

This can be accomplished using a custom FunctionQuery. I've blogged about how to create custom function queries in Solr before, so this is simply an application of the subject.

Here's the code:

public class RedisValueSourceParser extends ValueSourceParser {
  @Override public ValueSource parse(FunctionQParser fp) throws ParseException {
    String idField = fp.parseArg();
    String redisKey = fp.parseArg();
    String redisValue = fp.parseArg();
    return new RedisValueSource(idField, redisKey, redisValue);
  }
}

This FunctionQuery accepts 3 arguments:
1. the field to use as the id field
2. redisKey
3. redisValue
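
On the Redis side, redisKey is the name of a hash, redisValue is the hash field (here, the user id), and the stored value is a JSON map from Solr id field values to scores. A quick seeding sketch using Jedis (the key names and ids are illustrative):

Jedis jedis = new Jedis("localhost");
// equivalent to: HSET influence 1001 '{"SPN1002":10,"SPN399":89,"SPN19":1}'
jedis.hset("influence", "1001", "{\"SPN1002\":10,\"SPN399\":89,\"SPN19\":1}");
jedis.disconnect();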

Here's what the salient part of RedisValueSource looks like:

  @Override public DocValues getValues(Map context, IndexReader reader) throws IOException {
    final String[] lookup = FieldCache.DEFAULT.getStrings(reader, idField);
    final Jedis jedis = new Jedis("localhost");
    String v = jedis.hget(redisKey, redisValue);
    final JSONObject obj;
    if (v != null) {
      obj = (JSONObject) JSONValue.parse(v);
    } else {
      obj = new JSONObject();
    }
    jedis.disconnect();
    return new DocValues() {
      @Override public float floatVal(int doc) {
        final String id = lookup[doc];
        Object v = obj.get(id);
        if (v != null) {
          try {
            return Float.parseFloat(v.toString());
          } catch (NumberFormatException e) {
            return 0;
          }
        }
        return 0;
      }
 
      @Override public int intVal(int doc) {
        final String id = lookup[doc];
        Object v = obj.get(id);
        if (v != null) {
          try {
            return Integer.parseInt(v.toString());
          } catch (NumberFormatException e) {
            return 0;
          }
        }
        return 0;
      }
 
      @Override public String strVal(int doc) {
        final String id = lookup[doc];
        Object v = obj.get(id);
        return v != null ? v.toString() : null;
      }
 
      @Override public String toString(int doc) {
        return strVal(doc);
      }
    };
  }

From here, you can use the following Solr query to perform boosting based on the Redis value:
http://localhost:8983/solr/select?defType=edismax&q=cat:electronics&bf=redis(id,influence,1001)&debugQuery=on

The explain output looks like this:

3.4664698 = (MATCH) sum of:
  1.070082 = (MATCH) weight(cat:electronics in 2), product of:
    0.80067647 = queryWeight(cat:electronics), product of:
      1.3364723 = idf(docFreq=14, maxDocs=21)
      0.59909695 = queryNorm
    1.3364723 = (MATCH) fieldWeight(cat:electronics in 2), product of:
      1.0 = tf(termFreq(cat:electronics)=1)
      1.3364723 = idf(docFreq=14, maxDocs=21)
      1.0 = fieldNorm(field=cat, doc=2)
  2.3963878 = (MATCH) FunctionQuery(redis(id,influence,1001)), product of:
    4.0 = 4.0
    1.0 = boost
    0.59909695 = queryNorm

Lucene multi-point spatial search

Posted by Kelvin on 14 Apr 2012 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

This post describes a method of augmenting the lucene-spatial contrib package to support multi-point searches. It is quite similar to the method described at http://www.supermind.org/blog/548/multiple-latitudelongitude-pairs-for-a-single-solrlucene-doc, with some minor modifications.

The problem is as follows:

A company (mapped as a Lucene doc) has an address associated with it. It also has a list of store locations, which each have an address. Given a lat/long point, return a list of companies which have either a store location or an address within x miles from that point. There should be the ability to search on just company addresses, store locations, or both. EDIT: There is also the need to sort by distance and return distance from the point, not just filter by distance.

This problem requires that you index a "primary" lat/long pair, and multiple "secondary" lat/long pairs, and be able to search only primary lat/long, only secondary lat/long or both.

This excludes the possibility of using SOLR-2155 or LUCENE-3795 as-is. I'm sure it would have been possible to patch either to do so, though.

Also, SOLR-2155 depended on Solr, and I needed a pure Lucene 3.5 solution. And MultiValueSource, which SOLR-2155 uses, does not appear to be supported in Lucene 3.5.

The SOLR-2155 implementation is also pretty inefficient: it creates a List object for every single doc in the index in order to support multi-point search.

The general outline of the method is:

1. Search store locations index and collect company IDs and distances
2. Augment DistanceFilter with store location distances
3. Add a BooleanQuery with company IDs. This is to include companies in the final result-set whose address does not match, but have one or more store locations which do
4. Search company index
5. Return results

The algorithm in detail:

1. Index the company address with the company document, i.e. the document containing company fields such as name, etc.

2. In a separate index (or in the same index but in a different document "type"), index the store locations, adding the company ID as a field.

3. Given a lat/long point to search, first search the store locations index. Collect a unique list of company doc-ids:distance in a LinkedHashMap, checking for duplicates. Note that this is the lucene doc-id of the store location's corresponding company, NOT the company ID field value. This will be used to augment the distancefilter in the next stage.

Hint: you'll need to use TermDocs to get this, like so:

for (int i = 0; i < locationHits.docs.totalHits; ++i) {
  int locationDocId = locationHits.docs.scoreDocs[i].doc;
  int companyId = companyIds[locationDocId];
  double distance = locationHits.distanceFilter.getDistance(locationDocId);
  Term t = new Term("id", Integer.toString(companyId));
  TermDocs td = companyReader.termDocs(t);
  if (td.next()) {
    int companyDocId = td.doc();
    // hits are distance-sorted, so only the first (nearest) entry per company is kept
    if (!companyDistances.containsKey(companyDocId)) {
      companyDistances.put(companyDocId, distance);
    }
  }
  td.close();
}

Since the search returns results sorted by distance (using lucene-spatial's DistanceFilter), you're assured to have a list of company doc ids in ascending order of distance.

In this same pass, also collect a list of company IDs. This will be used to build the BooleanQuery used in the company search.

4. Set company DistanceFilter's distances. Note: in Lucene 3.5, I added a one-line patch to DistanceFilter so that setDistances() calls putAll() instead of replacing the map.

final DistanceQueryBuilder dq = new DistanceQueryBuilder(centerLat, centerLng, milesF, "lat", "lng", CartesianTierPlotter.DEFALT_FIELD_PREFIX, true, 0, 20);
dq.getDistanceFilter().setDistances(companyDistances);

5. Build BooleanQuery including company IDs

    BooleanQuery bq = new BooleanQuery();
    for(Integer id: companyIds) bq.add(new TermQuery(new Term("id", Integer.toString(id))), BooleanClause.Occur.SHOULD);
    bq.add(distanceQuery, BooleanClause.Occur.SHOULD);

6. Search and return results
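
A rough sketch of this final step, assuming companySearcher is an IndexSearcher over the company index and dq/bq are the objects built in steps 4 and 5. The sort uses lucene-spatial's DistanceFieldComparatorSource; the exact wiring may differ in your code:

Sort sort = new Sort(new SortField("distance",
    new DistanceFieldComparatorSource(dq.getDistanceFilter())));
TopDocs hits = companySearcher.search(bq, null, 100, sort);
for (ScoreDoc sd : hits.scoreDocs) {
  double distance = dq.getDistanceFilter().getDistance(sd.doc); // miles from the center point
  // read fields from companySearcher.doc(sd.doc) and add them to the result-set
}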
