This post describes a method of augmenting the lucene-spatial contrib package to support multi-point searches. It is quite similar to the method described with some minor modifications.

The problem is as follows:

A company (mapped as a Lucene doc) has an address associated with it. It also has a list of store locations, which each have an address. Given a lat/long point, return a list of companies which have either a store location or an address within x miles from that point. There should be the ability to search on just company addresses, store locations, or both. EDIT: There is also the need to sort by distance and return distance from the point, not just filter by distance.

This problem requires that you index a "primary" lat/long pair, and multiple "secondary" lat/long pairs, and be able to search only primary lat/long, only secondary lat/long or both.
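To pin down the requirement, here is a brute-force reference sketch in plain Java with no Lucene involved. All class and method names (`GeoPoint`, `Company`, `SearchMode`, `Hit`, `BruteForceSearch`) are hypothetical and exist only for illustration. Distances use the haversine formula with an approximate mean Earth radius. It is a specification of the expected results, not the indexing approach described in this post.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical in-memory model of the problem; none of these names come from Lucene. */
final class GeoPoint {
    final double lat;
    final double lng;

    GeoPoint(double lat, double lng) {
        this.lat = lat;
        this.lng = lng;
    }
}

final class Company {
    final String id;
    final GeoPoint address;      // the "primary" lat/long pair
    final List<GeoPoint> stores; // the "secondary" lat/long pairs

    Company(String id, GeoPoint address, List<GeoPoint> stores) {
        this.id = id;
        this.address = address;
        this.stores = stores;
    }
}

enum SearchMode { PRIMARY, SECONDARY, BOTH }

final class Hit {
    final String companyId;
    final double miles; // distance of the nearest matching point

    Hit(String companyId, double miles) {
        this.companyId = companyId;
        this.miles = miles;
    }
}

final class BruteForceSearch {
    static final double EARTH_RADIUS_MILES = 3958.8; // approximate mean radius

    /** Great-circle distance in miles (haversine formula). */
    static double miles(GeoPoint a, GeoPoint b) {
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLng = Math.toRadians(b.lng - a.lng);
        double h = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                * Math.pow(Math.sin(dLng / 2), 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(h));
    }

    /** Returns companies with a qualifying point within the radius, nearest first. */
    static List<Hit> search(List<Company> companies, GeoPoint origin,
                            double radiusMiles, SearchMode mode) {
        List<Hit> hits = new ArrayList<Hit>();
        for (Company c : companies) {
            double best = Double.MAX_VALUE;
            if (mode != SearchMode.SECONDARY) {
                best = Math.min(best, miles(origin, c.address));
            }
            if (mode != SearchMode.PRIMARY) {
                for (GeoPoint store : c.stores) {
                    best = Math.min(best, miles(origin, store));
                }
            }
            if (best <= radiusMiles) {
                hits.add(new Hit(c.id, best));
            }
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.miles));
        return hits;
    }
}
```

Each company appears at most once in the result, carrying the distance of its nearest qualifying point. The indexed approach below has to reproduce the same behaviour.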

This excludes the possibility of using SOLR-2155 or LUCENE-3795 as-is. I'm sure it would have been possible to patch either to do so.

Also, SOLR-2155 depended on Solr, and I needed a pure Lucene 3.5 solution. And MultiValueSource, which SOLR-2155 uses, does not appear to be supported in Lucene 3.5.

The SOLR-2155 implementation is also pretty inefficient: it creates a List object for every single doc in the index in order to support multi-point search.

The general outline of the method is:

1. Search store locations index and collect company IDs and distances
2. Augment DistanceFilter with store location distances
3. Add a BooleanQuery with company IDs. This includes in the final result set those companies whose address does not match but which have one or more store locations that do
4. Search company index
5. Return results

The algorithm in detail:

1. Index the company address with the company document, i.e. the document containing company fields such as name, etc.

2. In a separate index (or in the same index but in a different document "type"), index the store locations, adding the company ID as a field.

3. Given a lat/long point to search, first search the store locations index. Collect a unique list of company doc-id:distance pairs in a LinkedHashMap, checking for duplicates. Note that this is the Lucene doc-id of the store location's corresponding company, NOT the company ID field value. This will be used to augment the DistanceFilter in the next stage.

Hint: you'll need to use TermDocs on the "id" field to get this.

Since the search returns results sorted by distance (using lucene-spatial's DistanceFilter), you're guaranteed a list of company doc-ids in ascending order of distance.

In this same pass, also collect a list of company IDs. This will be used to build the BooleanQuery used in the company search.
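The collection pass in step 3 can be sketched in plain Java as follows. `StoreHit` and `StoreHitCollector` are hypothetical names, and the `docIdByCompanyId` map is an assumed stand-in for the TermDocs lookup (in Lucene 3.5 that lookup would go through `IndexReader.termDocs(new Term("id", companyId))`). The sketch relies on the hits arriving nearest first, so the first distance seen for a company is its smallest.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one hit from the store locations search. */
final class StoreHit {
    final String companyId; // value of the company ID field on the store document
    final double miles;     // distance from the search point

    StoreHit(String companyId, double miles) {
        this.companyId = companyId;
        this.miles = miles;
    }
}

final class StoreHitCollector {
    // Company doc-id -> distance of its nearest matching store.
    // LinkedHashMap preserves insertion order, i.e. ascending distance.
    final Map<Integer, Double> distancesByCompanyDoc = new LinkedHashMap<Integer, Double>();

    // Company ID field values, collected in the same pass for the BooleanQuery.
    final List<String> companyIds = new ArrayList<String>();

    /**
     * @param hitsByDistance   store hits, already sorted nearest first
     * @param docIdByCompanyId assumed lookup from company ID to company doc-id;
     *                         stands in for the TermDocs lookup
     */
    void collect(List<StoreHit> hitsByDistance, Map<String, Integer> docIdByCompanyId) {
        for (StoreHit hit : hitsByDistance) {
            Integer companyDoc = docIdByCompanyId.get(hit.companyId);
            if (companyDoc == null) {
                continue; // store refers to a company that is not in the company index
            }
            if (distancesByCompanyDoc.containsKey(companyDoc)) {
                continue; // duplicate: a nearer store of this company was already seen
            }
            distancesByCompanyDoc.put(companyDoc, hit.miles);
            companyIds.add(hit.companyId);
        }
    }
}
```

After `collect` returns, `distancesByCompanyDoc` is what gets handed to the company DistanceFilter, and `companyIds` feeds the BooleanQuery.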

4. Set the company DistanceFilter's distances, with the filter built on the company's "lat" and "lng" fields. Note: in Lucene 3.5, I added a one-line patch to DistanceFilter so that setDistances() calls putAll() instead of replacing the map.
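The difference between replacing the map and calling putAll() can be shown without Lucene. `DistanceHolder` below is a hypothetical stand-in for the filter's doc-id to distance map, not the real DistanceFilter class; the two setter names are made up to contrast the unpatched and patched behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the filter's doc-id -> distance map. */
final class DistanceHolder {
    private Map<Integer, Double> distances = new HashMap<Integer, Double>();

    /** Unpatched behaviour as described in the post: the map is replaced wholesale. */
    void setDistancesReplacing(Map<Integer, Double> incoming) {
        this.distances = incoming;
    }

    /** Patched behaviour: incoming entries are merged into the existing map. */
    void setDistancesMerging(Map<Integer, Double> incoming) {
        this.distances.putAll(incoming);
    }

    /** Returns the stored distance, or null if the doc has none. */
    Double getDistance(int docId) {
        return distances.get(docId);
    }

    int size() {
        return distances.size();
    }
}
```

With the merging variant, distances already held for other docs survive the call. When the same doc-id appears on both sides, putAll() keeps the incoming value.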

5. Build a BooleanQuery that includes the company IDs, matched against the "id" field.

6. Search and return results

  • David Smiley

    Hello Kelvin.
    I'm the author of SOLR-2155 which you refer to. If you have questions/concerns about it, feel free to add a comment to the JIRA issue, or even email me directly. I'd like to clear up a few misunderstandings you have about SOLR-2155 here:
    1. At the core of SOLR-2155 is a Lucene Filter. I noticed your use-case doesn't care about sorting, and so this filter is all you need from it; you can ignore any of the Solr parts.
    2. You say that SOLR-2155 is inefficient due to the List per document. That is *only* required for _sorting_ by distance, not for _filtering_ by it. But yes, this data structure could certainly be improved, and I already know how I will do it when it becomes a priority.
    3. SOLR-2155 *can* meet the requirements of the use-case you present here. You would simply have a SOLR-2155 field for the business address (single-valued) and another for the store locations (multi-valued). If a query needs to cover both fields, you would search both and combine the two filters in a BooleanFilter, adding each as a "SHOULD" clause (in effect creating an OR between the two queries). Another approach, simpler and faster to search, is to index a third field with the fields combined.

    I see that you are using the old Lucene spatial contrib module; I recommend you not waste any time with it.

    By the way, the overall approach you have here of using a side-index for multi-value geospatial search was repeated here:

    ~ David

    p.s. I discovered your post via dzone, which had no way for me to comment:

  • Kelvin Tan

    Hi David, thanks for your comment.

    1. You're right that the use-case didn't mention sorting by distance. I did however actually need distance sorting. Post amended accordingly.

    2. As mentioned in my post, as far as I can tell, SOLR-2155 requires MultiValueSource, which does not appear to be supported in Lucene 3.5. This makes it a non-starter for a pure Lucene 3.5 implementation.

    3. Thanks for linking to the Lucid blog on multi-valued geospatial search! I hadn't known about that.

  • David Smiley

    Thanks for the clarifications, Kelvin. I want to make sure that Lucene 4's new spatial module (which was derived from SOLR-2155) addresses as many use-cases as possible. Some people may want a memory-resident data structure, but I can see that for use-cases where a geo query will only match a fairly small-ish number of points, putting the field in memory is not worth it.