Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Book review of Apache Solr 3.1 Cookbook

Posted by Kelvin on 30 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

I recently got a chance to review Apache Solr 3.1 Cookbook by Rafal Kuc, published by PacktPub. Now, to give a bit of context: I help folks implement and customize Solr professionally, so I know a fair bit about how Solr works, and am also quite familiar with the code internals. I was, therefore, pleasantly […]

Simplistic noun-phrase chunking with POS tags in Java

Posted by Kelvin on 16 Jun 2012 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

I needed to extract noun phrases from text. This is generally done using Part-of-Speech (POS) tags. OpenNLP has both a POS tagger and a noun-phrase chunker. However, it's really, really, really slow! I decided to look into alternatives, and chanced upon QTag. QTag is a "freely available, language independent POS-Tagger. It […]
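
To make the idea of chunking on POS tags concrete, here's a minimal sketch. It is not the code from the post and it doesn't call QTag or OpenNLP: the class name NounPhraseChunker is made up for illustration, and it assumes the tokens have already been tagged with Penn Treebank-style tags (DT, JJ*, NN*), which may not match QTag's tagset. It simply treats any run of determiners/adjectives followed by one or more nouns as a noun phrase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of QTag or OpenNLP.
final class NounPhraseChunker {

    // Assumption: Penn Treebank-style tags. Adjust for your tagger's tagset.
    private static boolean isNoun(String tag) {
        return tag.startsWith("NN");
    }

    private static boolean isModifier(String tag) {
        return tag.equals("DT") || tag.startsWith("JJ");
    }

    /**
     * tokens[i] is the i-th word and tags[i] its POS tag.
     * Returns every span matching (DT|JJ)* NN+, joined with spaces.
     */
    static List<String> chunk(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> phrases = new ArrayList<>();
        int start = -1;    // first index of the current candidate phrase, -1 if none
        int lastNoun = -1; // index of the most recent noun in the candidate, -1 if none
        // Run one step past the end so a phrase ending on the last token is flushed.
        for (int i = 0; i <= tokens.length; i++) {
            String tag = i < tokens.length ? tags[i] : "";
            boolean noun = isNoun(tag);
            // A phrase ends once we have seen a noun and hit anything that is not a noun.
            if (start >= 0 && lastNoun >= 0 && !noun) {
                phrases.add(String.join(" ", Arrays.copyOfRange(tokens, start, lastNoun + 1)));
                start = -1;
                lastNoun = -1;
            }
            if (noun) {
                if (start < 0) start = i;
                lastNoun = i;
            } else if (isModifier(tag)) {
                if (start < 0) start = i;
            } else {
                // Any other tag discards a dangling run of modifiers.
                start = -1;
                lastNoun = -1;
            }
        }
        return phrases;
    }
}
```

Calling chunk with the tokens of "The quick fox jumps over the lazy dog" and the tags DT JJ NN VBZ IN DT JJ NN pulls out the two phrases around "fox" and "dog". It's deliberately simplistic: no prepositional attachment, no conjunctions, no proper handling of pronouns.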

Separating relevance signals from document content in Solr or Lucene

Posted by Kelvin on 16 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

Full-text search has traditionally been about the indexing and ranking of a corpus of unstructured text content. The vector space model (VSM) and its cousins, in addition to structural ranking algorithms such as PageRank, have been the authoritative ways of ranking documents. However, with the recent proliferation of personalization, analytics, social networks and the like, […]

ElasticSearch 0.19 extension points

Posted by Kelvin on 14 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

A list of the extension points exposed by ElasticSearch (as of 0.19.4):

Analysis plugins – use different kinds of analyzers
River plugins – A river is an external datasource which ES indexes
Transport plugins – Different means of exposing ES API, e.g. Thrift, memcached
Site plugins – for running various ES-related webapps, like the ES […]

Connecting Redis to ElasticSearch for custom scoring with nativescripts

Posted by Kelvin on 14 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

After connecting Redis and MongoDB to Solr, I figured it'd be interesting to do the same with ElasticSearch. Here's the result of my experiments: We'll be implementing this using AbstractSearchScript, which is roughly ElasticSearch's version of Solr's FunctionQuery. ES' NativeScriptFactory corresponds loosely to Solr's ValueSourceParser, and AbstractSearchScript to ValueSource.

public class RedisNativeScriptFactory implements NativeScriptFactory { […]

Getting around protected read-only MS Word/LibreOffice odt documents

Posted by Kelvin on 13 Jun 2012 | Tagged as: Ubuntu

I recently received an agreement in MS Word format which I wanted to fill out, but couldn't because it was protected/read-only. Opening the document in LibreOffice Writer didn't significantly improve the situation: whole sections of the document were still marked read-only. Here's what I did to get around it: 1. Open .doc in LibreOffice Writer, […]

Using MongoDB from within Solr for boosting documents

Posted by Kelvin on 09 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

Previously, I blogged about connecting Redis to Solr for relevance boosting via a custom FunctionQuery. Now, I'll talk about doing the same with MongoDB. In solrconfig.xml, declare your ValueSourceParser.

<valueSourceParser name="mongo" class="org.supermind.solr.mongodb.MongoDBValueSourceParser">
  <str name="host">localhost</str>
  <str name="dbName">solr</str>
  <str name="collectionName">electronics</str>
  <str name="key">userId</str>
  <str name="idField">id</str>
</valueSourceParser>

The host, dbName and collectionName parameters are self-explanatory. The key parameter is […]

Connecting Redis to Solr for boosting documents

Posted by Kelvin on 07 Jun 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

There are a number of instances in Solr where it's desirable to retrieve data from an external datastore for boosting purposes instead of trying to contort Solr with multiple queries, joins etc. Here's a trivial example: Jobs are stored as documents in Solr. Users of the application can rank a job from 1-10. We need […]

Split wav/flac/ape files with cue

Posted by Kelvin on 07 May 2012 | Tagged as: Ubuntu

If you ever need to split a disc image which has been burned as a single wav/flac/ape file with a corresponding cue file, this will help you out. Split2flac handles all the tedium of splitting, renaming (according to a renaming pattern of your choosing), converting to FLAC/M4A/MP3/OGG_VORBIS/WAV, as well as adding ID3 tags.

cd /usr/local/bin […]

Lucene multi-point spatial search

Posted by Kelvin on 14 Apr 2012 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

This post describes a method of augmenting the lucene-spatial contrib package to support multi-point searches. It is quite similar to the method described at http://www.supermind.org/blog/548/multiple-latitudelongitude-pairs-for-a-single-solrlucene-doc, with some minor modifications. The problem is as follows: A company (mapped as a Lucene doc) has an address associated with it. It also has a list of store locations, which […]
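
For a feel of what "multi-point" means here, below is a rough, library-free sketch. It is not the lucene-spatial approach from the post; it only illustrates one plausible matching rule, namely that a doc with several lat/lon points matches when any one of them falls within the search radius. That rule, the class name MultiPointDistance and its method names are all assumptions made for illustration. Distances use the standard haversine formula with a mean Earth radius of 6371 km.

```java
// Hypothetical helper for illustration only; not part of lucene-spatial.
final class MultiPointDistance {

    private static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    /** Great-circle distance in km between two lat/lon pairs given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 to guard against floating-point overshoot before asin.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** points is an array of {lat, lon} pairs; returns the distance to the nearest one. */
    static double minDistanceKm(double[][] points, double lat, double lon) {
        double best = Double.POSITIVE_INFINITY; // stays infinite if there are no points
        for (double[] p : points) {
            best = Math.min(best, haversineKm(p[0], p[1], lat, lon));
        }
        return best;
    }

    /** Assumed matching rule: the doc matches if ANY of its points is within the radius. */
    static boolean anyWithin(double[][] points, double lat, double lon, double radiusKm) {
        return minDistanceKm(points, lat, lon) <= radiusKm;
    }
}
```

A brute-force scan like this is fine for a handful of points per doc, but a real index would filter candidates first (e.g. with a bounding box) before computing exact distances.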
