Using MongoDB from within Solr for boosting documents
Posted by Kelvin on 09 Jun 2012 | Tagged as: Lucene / Solr / Elastic Search / Nutch
Previously, I blogged about connecting Redis to Solr for relevance boosting via a custom FunctionQuery. Now, I'll talk about doing the same with MongoDB.
In solrconfig.xml, declare your ValueSourceParser.
<str name="host">localhost</str>
<str name="dbName">solr</str>
<str name="collectionName">electronics</str>
<str name="key">userId</str>
<str name="idField">id</str>
</valueSourceParser>
The host, dbName and collectionName parameters are self-explanatory.
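The key parameter names the MongoDB field that the function's argument is matched against to find a MongoDB document (here, userId). The idField parameter declares the Solr field whose value is then looked up inside that document to obtain the boost.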
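To make the matching concrete, here's a minimal sketch that uses a plain Java map in place of MongoDB. It assumes each MongoDB document carries the key field (userId here) plus one entry per Solr document id mapping to a boost value. That document shape, the class name and the sample ids are illustrative assumptions inferred from the code in this post, not a schema the post spells out.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map stands in for the MongoDB document returned by findOne().
final class MongoBoostLookupSketch {

    // Same idea as floatVal(): look up the Solr id inside the user's document.
    static float boostFor(Map<String, Object> userDoc, String solrId) {
        if (userDoc == null) return 0f;          // no document matched the key
        Object v = userDoc.get(solrId);
        if (v instanceof Number) return ((Number) v).floatValue();
        if (v instanceof String) {
            try {
                return Float.parseFloat((String) v);
            } catch (NumberFormatException e) {
                return 0f;
            }
        }
        return 0f;                               // id absent or value not numeric
    }

    public static void main(String[] args) {
        // Hypothetical document for userId=1377: Solr ids map to boost values.
        Map<String, Object> userDoc = new HashMap<>();
        userDoc.put("userId", "1377");
        userDoc.put("SP2514N", 4.0);
        userDoc.put("6H500F0", "2.5");

        System.out.println(boostFor(userDoc, "SP2514N"));
        System.out.println(boostFor(userDoc, "unknown-id"));
    }
}
```

Documents whose id has no entry simply get a boost of 0, so they are neither promoted nor excluded.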
Here's the ValueSourceParser.
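// NOTE: class name assumed; the declaration line is missing from this listing.
public class MongoDBValueSourceParser extends ValueSourceParser {
private String idField;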
private String dbName;
private String collectionName;
private String key;
private String host;
private DBCollection collection;
@Override public void init(NamedList args) {
host = (String) args.get("host");
idField = (String) args.get("idField");
dbName = (String) args.get("dbName");
collectionName = (String) args.get("collectionName");
key = (String) args.get("key");
try {
Mongo mongo = new Mongo(host);
collection = mongo.getDB(dbName).getCollection(collectionName);
} catch (UnknownHostException e) {
throw new IllegalArgumentException(e);
}
}
@Override public ValueSource parse(FunctionQParser fp) throws ParseException {
String value = fp.parseArg();
final DBObject obj = collection.findOne(new BasicDBObject(key, value));
return new MongoDBValueSource(idField, obj, value);
}
}
Here's the interesting method in MongoDBValueSource.
@Override public DocValues getValues(Map context, IndexReader reader) throws IOException {
final String[] lookup = FieldCache.DEFAULT.getStrings(reader, idField);
return new DocValues() {
@Override public byte byteVal(int doc) {
return (byte) intVal(doc);
}
@Override public short shortVal(int doc) {
return (short) intVal(doc);
}
@Override public float floatVal(int doc) {
final String id = lookup[doc];
if (obj == null) return 0;
Object v = obj.get(id);
if (v == null) return 0;
// MongoDB returns floating-point numbers as Double, so accept any Number
if (v instanceof Number) {
return ((Number) v).floatValue();
} else if (v instanceof String) {
try {
return Float.parseFloat((String) v);
} catch (NumberFormatException e) {
return 0;
}
}
return 0;
}
@Override public int intVal(int doc) {
final String id = lookup[doc];
if (obj == null) return 0;
Object v = obj.get(id);
if (v == null) return 0;
// Accept any numeric type (Integer, Long, Double), not just Integer
if (v instanceof Number) {
return ((Number) v).intValue();
} else if (v instanceof String) {
try {
return Integer.parseInt((String) v);
} catch (NumberFormatException e) {
return 0;
}
}
return 0;
}
@Override public long longVal(int doc) {
return intVal(doc);
}
@Override public double doubleVal(int doc) {
return floatVal(doc);
}
@Override public String strVal(int doc) {
final String id = lookup[doc];
if (obj == null) return null;
Object v = obj.get(id);
return v != null ? v.toString() : null;
}
@Override public String toString(int doc) {
return strVal(doc);
}
};
}
You can now use the FunctionQuery mongo in your search requests. For example:
http://localhost:8983/solr/select?defType=edismax&q=cat:electronics&bf=mongo(1377)
Connecting Redis to Solr for boosting documents
Posted by Kelvin on 07 Jun 2012 | Tagged as: Lucene / Solr / Elastic Search / Nutch
There are a number of instances in Solr where it's desirable to retrieve data from an external datastore for boosting purposes instead of trying to contort Solr with multiple queries, joins etc.
Here's a trivial example:
Jobs are stored as documents in Solr. Users of the application can rank a job from 1-10. We need to boost each job with the user's rank if it exists.
Modeling this fully in Solr would be fairly inefficient, especially for a large number of jobs and/or users, since each time a user ranks a job, the searcher has to be reloaded for that data to become available for searching.
A much more efficient approach is to store the rank data in a NoSQL store like Redis, retrieve the rank at query time, and use it to boost the documents accordingly.
This can be accomplished using a custom FunctionQuery. I've blogged about how to create custom function queries in Solr before, so this is simply an application of the subject.
Here's the code:
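// NOTE: class name assumed; the declaration line is missing from this listing.
public class RedisValueSourceParser extends ValueSourceParser {
@Override public ValueSource parse(FunctionQParser fp) throws ParseException {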
String idField = fp.parseArg();
String redisKey = fp.parseArg();
String redisValue = fp.parseArg();
return new RedisValueSource(idField, redisKey, redisValue);
}
}
This FunctionQuery accepts 3 arguments, in this order:
1. the field to use as an id field
2. redisKey
3. redisValue
Here's what the salient part of RedisValueSource looks like:
@Override public DocValues getValues(Map context, IndexReader reader) throws IOException {
final String[] lookup = FieldCache.DEFAULT.getStrings(reader, idField);
final Jedis jedis = new Jedis("localhost");
String v = jedis.hget(redisKey, redisValue);
final JSONObject obj;
if (v != null) {
obj = (JSONObject) JSONValue.parse(v);
} else {
obj = new JSONObject();
}
jedis.disconnect();
return new DocValues() {
@Override public float floatVal(int doc) {
final String id = lookup[doc];
Object v = obj.get(id);
if(v != null) {
try {
return Float.parseFloat(v.toString());
} catch (NumberFormatException e) {
return 0;
}
} return 0;
}
@Override public int intVal(int doc) {
final String id = lookup[doc];
Object v = obj.get(id);
if(v != null) {
try {
return Integer.parseInt(v.toString());
} catch (NumberFormatException e) {
return 0;
}
} return 0;
}
@Override public String strVal(int doc) {
final String id = lookup[doc];
Object v = obj.get(id);
return v != null ? v.toString() : null;
}
@Override public String toString(int doc) {
return strVal(doc);
}
};
}
From here, you can use the following Solr query to perform boosting based on the Redis value:
http://localhost:8983/solr/select?defType=edismax&q=cat:electronics&bf=redis(id,influence,1001)&debugQuery=on
The explain output looks like this:
3.4664698 = (MATCH) sum of:
  1.070082 = (MATCH) weight(cat:electronics in 2), product of:
    0.80067647 = queryWeight(cat:electronics), product of:
      1.3364723 = idf(docFreq=14, maxDocs=21)
      0.59909695 = queryNorm
    1.3364723 = (MATCH) fieldWeight(cat:electronics in 2), product of:
      1.0 = tf(termFreq(cat:electronics)=1)
      1.3364723 = idf(docFreq=14, maxDocs=21)
      1.0 = fieldNorm(field=cat, doc=2)
  2.3963878 = (MATCH) FunctionQuery(redis(id,influence,1001)), product of:
    4.0 = 4.0
    1.0 = boost
    0.59909695 = queryNorm
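As a sanity check, the figures in that explain output can be recombined by hand. The sketch below only multiplies and adds the values printed above. The class and method names are mine, and this is arithmetic on the listed numbers rather than anything Solr itself runs.

```java
// Sketch: recombine the figures from the explain output above.
final class ExplainArithmetic {

    // Text-match part: queryWeight * fieldWeight.
    static float termScore(float idf, float queryNorm, float tf, float fieldNorm) {
        float queryWeight = idf * queryNorm;
        float fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Boost-function part: function value * boost * queryNorm.
    static float functionScore(float value, float boost, float queryNorm) {
        return value * boost * queryNorm;
    }

    public static void main(String[] args) {
        float idf = 1.3364723f;
        float queryNorm = 0.59909695f;
        float term = termScore(idf, queryNorm, 1.0f, 1.0f);   // roughly 1.07
        float func = functionScore(4.0f, 1.0f, queryNorm);    // roughly 2.40
        System.out.println(term + " + " + func + " = " + (term + func));
    }
}
```

Note that the raw function value (4.0, the value fetched from Redis) is scaled by queryNorm before being added, which is why it contributes about 2.4 to the final score rather than 4.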
Lucene multi-point spatial search
Posted by Kelvin on 14 Apr 2012 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming
This post describes a method of augmenting the lucene-spatial contrib package to support multi-point searches. It is quite similar to the method described at http://www.supermind.org/blog/548/multiple-latitudelongitude-pairs-for-a-single-solrlucene-doc, with some minor modifications.
The problem is as follows:
A company (mapped as a Lucene doc) has an address associated with it. It also has a list of store locations, which each have an address. Given a lat/long point, return a list of companies which have either a store location or an address within x miles from that point. There should be the ability to search on just company addresses, store locations, or both. EDIT: There is also the need to sort by distance and return distance from the point, not just filter by distance.
This problem requires that you index a "primary" lat/long pair, and multiple "secondary" lat/long pairs, and be able to search only primary lat/long, only secondary lat/long or both.
This excludes the possibility of using SOLR-2155 or LUCENE-3795 as-is. I'm sure it would have been possible to patch either to do so.
Also, SOLR-2155 depended on Solr, and I needed a pure Lucene 3.5 solution. And MultiValueSource, which SOLR-2155 uses, does not appear to be supported in Lucene 3.5.
The SOLR-2155 implementation is also pretty inefficient: it creates a List object for every single doc in the index in order to support multi-point search.
The general outline of the method is:
1. Search store locations index and collect company IDs and distances
2. Augment DistanceFilter with store location distances
3. Add a BooleanQuery with company IDs. This is to include companies in the final result-set whose address does not match, but have one or more store locations which do
4. Search company index
5. Return results
The algorithm in detail:
1. Index the company address with the company document, i.e. the document containing company fields such as name.
2. In a separate index (or in the same index but in a different document "type"), index the store locations, adding the company ID as a field.
3. Given a lat/long point to search, first search the store locations index. Collect a unique list of company doc-ids:distance in a LinkedHashMap, checking for duplicates. Note that this is the lucene doc-id of the store location's corresponding company, NOT the company ID field value. This will be used to augment the distancefilter in the next stage.
Hint: you'll need to use TermDocs to get this, like so:
for (int i = 0; i < locationHits.docs.scoreDocs.length; i++) {
int locationDocId = locationHits.docs.scoreDocs[i].doc;
int companyId = companyIds[locationDocId];
double distance = locationHits.distanceFilter.getDistance(locationDocId);
if(companyDistances.containsKey(companyId)) continue;
Term t = new Term("id", Integer.toString(companyId));
TermDocs td = companyReader.termDocs(t);
if (td.next()) {
int companyDocId = td.doc();
companyDistances.put(companyDocId, distance);
}
td.close();
}
Since the search returns results sorted by distance (using lucene-spatial's DistanceFilter) and a LinkedHashMap preserves insertion order, you're guaranteed a list of company doc ids in ascending order of distance.
In this same pass, also collect a list of company IDs. This will be used to build the BooleanQuery used in the company search.
4. Set company DistanceFilter's distances. Note: in Lucene 3.5, I added a one-line patch to DistanceFilter so that setDistances() calls putAll() instead of replacing the map.
dq.getDistanceFilter().setDistances(companyDistances);
5. Build BooleanQuery including company IDs
for(Integer id: companyIds) bq.add(new TermQuery(new Term("id", Integer.toString(id))), BooleanClause.Occur.SHOULD);
bq.add(distanceQuery, BooleanClause.Occur.SHOULD);
6. Search and return results
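The overall flow can also be sketched without Lucene at all. The following is a simplified in-memory illustration, not the Lucene implementation described above: plain lists stand in for the two indexes, a haversine formula stands in for lucene-spatial's distance calculation, both searches are folded into one pass, and the Place class, method names and sample coordinates are all made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of the multi-point search flow (no Lucene involved).
final class MultiPointSearchSketch {

    // A company address or a store location; companyId links stores to companies.
    static final class Place {
        final int companyId;
        final double lat, lon;
        Place(int companyId, double lat, double lon) {
            this.companyId = companyId;
            this.lat = lat;
            this.lon = lon;
        }
    }

    // Great-circle distance in miles (haversine), standing in for DistanceFilter.
    static double miles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.pow(Math.sin(dLon / 2), 2);
        return 3958.8 * 2 * Math.asin(Math.sqrt(a));
    }

    // Returns companyId -> nearest matching distance, in ascending distance order.
    static Map<Integer, Double> search(List<Place> addresses, List<Place> stores,
                                       double lat, double lon, double radiusMiles) {
        List<Place> all = new ArrayList<>(stores);
        all.addAll(addresses);
        // Collect every address and store location within the radius.
        List<double[]> hits = new ArrayList<>(); // each entry is {companyId, distance}
        for (Place p : all) {
            double d = miles(lat, lon, p.lat, p.lon);
            if (d <= radiusMiles) hits.add(new double[] {p.companyId, d});
        }
        // Sort by distance, then keep only the first (nearest) hit per company.
        hits.sort((double[] x, double[] y) -> Double.compare(x[1], y[1]));
        Map<Integer, Double> result = new LinkedHashMap<>();
        for (double[] h : hits) result.putIfAbsent((int) h[0], h[1]);
        return result;
    }

    public static void main(String[] args) {
        List<Place> addresses = new ArrayList<>();
        addresses.add(new Place(1, 41.00, -75.0)); // company 1 HQ: far from the query point
        addresses.add(new Place(2, 40.05, -75.0)); // company 2 HQ: a few miles away
        List<Place> stores = new ArrayList<>();
        stores.add(new Place(1, 40.01, -75.0));    // company 1 store: very close

        // Company 1 matches through its store, company 2 through its address.
        System.out.println(search(addresses, stores, 40.0, -75.0, 5.0).keySet());
    }
}
```

To search only company addresses or only store locations, pass an empty list for the other argument.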
Using contextual hints to improve Solr's autocomplete suggester
Posted by Kelvin on 03 Mar 2012 | Tagged as: Lucene / Solr / Elastic Search / Nutch
Context-less multi-term autocomplete is difficult.
Given the term "di", we can look at our index and rank terms starting with "di" by frequency and return the n most frequent terms. Solr's TSTLookup and FSTLookup do this very well.
However, given the term "walt di", we can no longer do what we did above for each term and not look silly, especially if the corpus in question is a list of US companies (hint: think Mickey Mouse). There's little excuse for suggesting "walt discovery" or "walt diners" when our corpus does not contain any documents with that combination of terms.
In the absence of a large number of historical user queries to augment the autocomplete, context is king when it comes to multi-term queries.
The simplest way I can think of doing this, if it is feasible memory-wise, is to map each term to the terms that immediately follow it. For example, given the field value "international business machines", mappings would be created for
international=>business
business=>machines
Out-of-order queries wouldn't be supported with this system, nor would term skips (e.g. international machines).
Here's a method fragment that does just this:
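for (int i = 0; i < reader.maxDoc(); ++i) {
// doc ids run up to maxDoc(), not numDocs(); skip deleted documents
if (reader.isDeleted(i)) continue;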
Fieldable fieldable = reader.document(i).getFieldable(field);
if(fieldable == null) continue;
String fieldVal = fieldable.stringValue();
if(fieldVal == null) continue;
TokenStream ts = a.tokenStream(field, new StringReader(fieldVal));
String prev = null;
while (ts.incrementToken()) {
CharTermAttribute attr = ts.getAttribute(CharTermAttribute.class);
String v = new String(attr.buffer(), 0, attr.length()).intern();
if (prev != null) {
map.get(prev).add(v);
}
prev = v;
}
}
Guava's Multimap is perfect for this, and Solr already has a Guava dependency, so we might as well make full use of it.
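Once the mapping is built, using it at query time is straightforward. Here's a small sketch that uses plain JDK collections instead of Guava and a whitespace split instead of a Lucene Analyzer. The class and method names, and the sample company names, are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch: term => following-terms map, used to complete the last partial term.
final class NextTermSuggester {
    private final Map<String, Set<String>> followers = new HashMap<>();

    // Record term => next-term pairs for one field value.
    void add(String fieldValue) {
        String prev = null;
        for (String term : fieldValue.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.isEmpty()) continue;
            if (prev != null) {
                followers.computeIfAbsent(prev, k -> new TreeSet<>()).add(term);
            }
            prev = term;
        }
    }

    // Complete the partial term using only terms seen right after previousTerm.
    List<String> suggest(String previousTerm, String prefix) {
        List<String> out = new ArrayList<>();
        Set<String> candidates = followers.get(previousTerm);
        if (candidates == null) return out;
        for (String t : candidates) {
            if (t.startsWith(prefix)) out.add(previousTerm + " " + t);
        }
        return out;
    }

    public static void main(String[] args) {
        NextTermSuggester s = new NextTermSuggester();
        s.add("Walt Disney Company");
        s.add("Discovery Communications");
        s.add("International Business Machines");
        System.out.println(s.suggest("walt", "di")); // prints [walt disney]
    }
}
```

Because "discovery" never follows "walt" in the corpus, it is never offered, which is exactly the context the frequency-only approach lacks.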
Solr autocomplete with document suggestions
Posted by Kelvin on 03 Mar 2012 | Tagged as: Lucene / Solr / Elastic Search / Nutch
Solr 3.5 comes with a nice autocomplete/typeahead component, the Suggester, which is built on top of Solr's SpellCheckComponent.
You provide it a query and a field, and the Suggester returns a list of suggestions based on the query. For example:
<response>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="ac">
        <int name="numFound">2</int>
        <int name="startOffset">0</int>
        <int name="endOffset">2</int>
        <arr name="suggestion">
          <str>acquire</str>
          <str>accommodate</str>
        </arr>
      </lst>
      <str name="collation">acquire</str>
    </lst>
  </lst>
</response>
Nice.
Now what if, as part of the autocomplete request, you needed a list of documents that contain the suggested terms for the given field? That's what I'm about to cover here.
TermDocs is your friend
The basic idea here is to call reader.termDocs() for each term, collect the document ids, and use that as the basis of a docslice. Here are relevant bits of code.
AND the doc ids for the various suggestions into a single docset.
NamedList suggestions = (NamedList) spellcheck.get("suggestions");
final SolrIndexReader reader = rb.req.getSearcher().getReader();
OpenBitSet docset = null;
for (int i = 0; i < suggestions.size(); ++i) {
String name = suggestions.getName(i);
if ("collation".equals(name)) continue;
NamedList query = (NamedList) suggestions.getVal(i);
Set<String> suggestion = (Set<String>) query.get("suggestion");
OpenBitSet docs = collectDocs(field, reader, suggestion);
if (docset == null) docset = docs;
else {
docset.and(docs);
}
}
collectDocs is implemented here:
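private OpenBitSet collectDocs(String field, SolrIndexReader reader, Collection<String> terms) throws IOException {
OpenBitSet docset = new OpenBitSet();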
TermDocs te = reader.termDocs();
for (String s : terms) {
Term t = new Term(field, s);
te.seek(t);
while (te.next()) {
docset.set(te.doc());
}
}
te.close();
return docset;
}
Now with the OpenBitSet of document ids matching the suggested terms, you can return a list of documents.
One problem is that you don't have document scores, since no search was actually performed. Ideally, you'd want to return the documents sorted by some field, and use the field value as the score.
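The set logic is easy to try out in isolation. The sketch below uses java.util.BitSet as a stand-in for Lucene's OpenBitSet and a hard-coded posting map in place of reader.termDocs(). The postings, the popularity array and the helper names are hypothetical. It also shows one possible way to order the resulting ids by a per-document field value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: union doc ids per query token, intersect across tokens, sort by a field.
final class SuggestionDocsSketch {

    // Union of the doc ids for every suggested term of one query token.
    static BitSet collectDocs(Map<String, int[]> postings, Collection<String> terms) {
        BitSet docs = new BitSet();
        for (String term : terms) {
            int[] ids = postings.get(term);
            if (ids == null) continue;
            for (int doc : ids) docs.set(doc);
        }
        return docs;
    }

    // Intersect the per-token sets, then order the survivors by field value, highest first.
    static List<Integer> matchAndSort(Map<String, int[]> postings,
                                      List<List<String>> suggestionsPerToken,
                                      final float[] fieldValues) {
        BitSet docset = null;
        for (List<String> suggestions : suggestionsPerToken) {
            BitSet docs = collectDocs(postings, suggestions);
            if (docset == null) docset = docs;
            else docset.and(docs);
        }
        List<Integer> ids = new ArrayList<>();
        if (docset == null) return ids;
        for (int d = docset.nextSetBit(0); d >= 0; d = docset.nextSetBit(d + 1)) {
            ids.add(d);
        }
        ids.sort((a, b) -> Float.compare(fieldValues[b], fieldValues[a]));
        return ids;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("acquire", new int[] {0, 2, 3});
        postings.put("accommodate", new int[] {1, 3});
        postings.put("company", new int[] {2, 3, 4});
        float[] popularity = {0.1f, 0.9f, 0.4f, 0.7f, 0.2f}; // one value per doc id

        List<List<String>> perToken = Arrays.asList(
                Arrays.asList("acquire", "accommodate"),  // suggestions for "ac"
                Arrays.asList("company"));                // suggestions for "comp"
        System.out.println(matchAndSort(postings, perToken, popularity)); // prints [3, 2]
    }
}
```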
Book review of Apache Solr 3 Enterprise Search Server
Posted by Kelvin on 28 Feb 2012 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming
Apache Solr 3 Enterprise Search Server published by Packt Publishing is the only Solr book available at the moment.
It's a fairly comprehensive book, and discusses many new Solr 3 features. Considering the breakneck pace of Solr development and the rate at which new features get introduced, you have to hand it to the authors for releasing a book which isn't outdated by the time it hits bookshelves.
Nonetheless, it does have shortcomings. I'll cover some of these shortly.
Firstly, the table of contents:
Chapter 1: Quick Starting Solr
Chapter 2: Schema and Text Analysis
Chapter 3: Indexing Data
Chapter 4: Searching
Chapter 5: Search Relevancy
Chapter 6: Faceting
Chapter 7: Search Components
Chapter 8: Deployment
Chapter 9: Integrating Solr
Chapter 10: Scaling Solr
Appendix: Search Quick Reference
A complete TOC with chapter sections is available here: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
The good points
The book does an overall excellent job of covering Solr basics such as the Lucene query syntax, scoring, schema.xml, DIH (dataimport handler), faceting and the various searchcomponents.
There are chapters dedicated to deploying, integrating and scaling Solr, which is nice. I found the Scaling Solr chapter in particular filled with common performance enhancement tips.
The DisMax query parser is covered in great detail, which is good because I've often found it to be a stumbling block for new Solr users.
The bad points
Not many, but here are a few gripes.
The 2 most important files a new Solr user needs to understand are schema.xml and solrconfig.xml. There should have been more emphasis placed on them early on. I don't even see solrconfig.xml anywhere in the TOC.
No mention of the Solr admin interface which is an absolute gem for a number of tasks, such as understanding tokenizers. In the text analysis section of Chapter 2, there really should be a walkthrough of Solr Admin's analyzer interface.
I think there could have been at least an attempt at describing the underlying data structure in which documents are stored (inverted index), as well as a basic introduction to the tf.idf scoring model. No mention of this at all in Chapter 5 Search Relevancy. One could argue that this is out of the scope of the book, but if a reader is to arrive at a deep understanding of what Lucene really is, understanding inverted indices and tf.idf is clearly a must.
Summary
All in all, Apache Solr 3 Enterprise Search Server is a book I'd heartily recommend to new or even moderately experienced users of Apache Solr.
It brings together information which is spread throughout the Lucene and Solr wiki and javadocs, making it a handy desk reference.
Apache Solr book review coming soon..
Posted by Kelvin on 27 Feb 2012 | Tagged as: Lucene / Solr / Elastic Search / Nutch
Just received my review copy of the only Apache Solr book on the market..
http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
My book review to follow shortly..
What's new in Solr 3.4.0
Posted by Kelvin on 06 Oct 2011 | Tagged as: Lucene / Solr / Elastic Search / Nutch
If you are already using Apache Solr 3.1, 3.2 or 3.3, it's strongly recommended you upgrade to 3.4.0 because of the index corruption bug on OS or computer crash or power loss (LUCENE-3418), now fixed in 3.4.0.
Solr 3.4.0 release highlights include
- Bug fixes and improvements from Apache Lucene 3.4.0, including a major bug (LUCENE-3418) whereby a Lucene index could easily become corrupted if the OS or computer crashed or lost power.
- SolrJ client can now parse grouped and range facets results (SOLR-2523).
- A new XsltUpdateRequestHandler allows posting XML that's transformed by a provided XSLT into a valid Solr document (SOLR-2630).
- Post-group faceting option (group.truncate) can now compute facet counts for only the highest ranking documents per-group (SOLR-2665).
- Add commitWithin update request parameter to all update handlers that were previously missing it. This tells Solr to commit the change within the specified amount of time (SOLR-2540).
- You can now specify NIOFSDirectory (SOLR-2670).
- New parameter hl.phraseLimit speeds up FastVectorHighlighter (LUCENE-3234).
- The query cache and filter cache can now be disabled per request. See this wiki page (SOLR-2429).
- Improved memory usage, build time, and performance of SynonymFilterFactory (LUCENE-3233).
- Added omitPositions to the schema, so you can omit position information while still indexing term frequencies (LUCENE-2048).
- Various fixes for multi-threaded DataImportHandler.
See the release notes for a more complete list of all the new features, improvements, and bugfixes.
As usual, the download is available here: http://www.apache.org/dyn/closer.cgi/lucene/solr/
Introducing SolrTutorial.com
Posted by Kelvin on 02 Oct 2011 | Tagged as: Lucene / Solr / Elastic Search / Nutch
Just launched a Solr tutorial website, a site styled after my LuceneTutorial.com but tailored towards Solr users.
It also includes high-level overviews of Solr for non-programmers, such as Solr for Managers and Solr for SysAdmins.
HOWTO: Collect WebDriver HTTP Request and Response Headers
Posted by Kelvin on 22 Jun 2011 | Tagged as: crawling, Lucene / Solr / Elastic Search / Nutch, programming
WebDriver is a fantastic Java API for web application testing. It has recently been merged into the Selenium project to provide a friendlier API for programmatic simulation of web browser actions. Its unique property is that it executes web pages in real web browsers such as Firefox, Chrome and IE, and then provides programmatic access to the DOM.
The problem with WebDriver, though, as reported here, is that because the underlying browser implementation does the actual fetching (as opposed to, say, Commons HttpClient), it's currently not possible to obtain the HTTP request and response headers, which is kind of a PITA.
I present here a method of obtaining HTTP request and response headers via an embedded proxy, derived from the Proxoid project.
ProxyLight from Proxoid
ProxyLight is the lightweight standalone proxy from the Proxoid project. It's released under the Apache Public License.
The original code only provided request filtering, and performed no response filtering, forwarding data directly from the web server to the requesting client.
I made some modifications to intercept and parse HTTP response headers.
Get my version here (released under APL): http://downloads.supermind.org/proxylight-20110622.zip
Using ProxyLight from WebDriver
The modified ProxyLight allows you to process both request and response.
This has the added benefit of allowing you to write a RequestFilter which ignores images, or URLs from certain domains. Sweet!
What your WebDriver code has to do then, is:
- Ensure the ProxyLight server is started
- Add Request and Response Filters to the ProxyLight server
- Maintain a cache of requests and responses which you can then retrieve
- Ensure the native browser uses our ProxyLight server
Here's a sample class to get you started
import com.mba.proxylight.ProxyLight;
import com.mba.proxylight.Response;
import com.mba.proxylight.ResponseFilter;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;
import java.util.LinkedHashMap;
import java.util.Map;
public class SampleWebDriver {
protected int localProxyPort = 5368;
protected ProxyLight proxy;
// LRU response table. Note: this is not thread-safe.
// Use ConcurrentLinkedHashMap instead: http://code.google.com/p/concurrentlinkedhashmap/
private LinkedHashMap<String, Response> responseTable = new LinkedHashMap<String, Response>() {
protected boolean removeEldestEntry(Map.Entry eldest) {
return size() > 100;
}
};
public Response fetch(String url) {
if (proxy == null) {
initProxy();
}
FirefoxProfile profile = new FirefoxProfile();
/**
* Get the native browser to use our proxy
*/
profile.setPreference("network.proxy.type", 1);
profile.setPreference("network.proxy.http", "localhost");
profile.setPreference("network.proxy.http_port", localProxyPort);
FirefoxDriver driver = new FirefoxDriver(profile);
// Now fetch the URL
driver.get(url);
Response proxyResponse = responseTable.remove(driver.getCurrentUrl());
return proxyResponse;
}
private void initProxy() {
proxy = new ProxyLight();
this.proxy.setPort(localProxyPort);
// this response filter adds the intercepted response to the cache
this.proxy.getResponseFilters().add(new ResponseFilter() {
public void filter(Response response) {
responseTable.put(response.getRequest().getUrl(), response);
}
});
// add request filters here if needed
// now start the proxy
try {
this.proxy.start();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
SampleWebDriver driver = new SampleWebDriver();
Response res = driver.fetch("http://www.lucenetutorial.com");
System.out.println(res.getHeaders());
}
}
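One caveat on the response table: a LinkedHashMap created with the default constructor evicts in insertion order, not least-recently-used order. If true LRU eviction matters, the three-argument constructor with accessOrder set to true provides it, and Collections.synchronizedMap adds basic thread safety without an extra dependency. A small sketch, using String values in place of ProxyLight's Response and a made-up class name:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a bounded, synchronized map with least-recently-used eviction.
final class LruTableSketch {

    static <K, V> Map<K, V> lruMap(final int maxEntries) {
        Map<K, V> map = new LinkedHashMap<K, V>(16, 0.75f, true) { // true = access order
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
        return Collections.synchronizedMap(map);
    }

    public static void main(String[] args) {
        Map<String, String> table = lruMap(2);
        table.put("http://a/", "response A");
        table.put("http://b/", "response B");
        table.get("http://a/");                 // touch A so B becomes the eldest
        table.put("http://c/", "response C");   // evicts B
        System.out.println(table.keySet());     // prints [http://a/, http://c/]
    }
}
```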
