Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Installing mosh on Dreamhost

Posted by Kelvin on 26 Mar 2013 | Tagged as: programming

Here's a gist which helps you install mosh on Dreamhost: https://gist.github.com/andrewgiessel/4486779

Generating HMAC MD5/SHA1/SHA256 etc in Java

Posted by Kelvin on 26 Nov 2012 | Tagged as: programming

There are a number of examples online which show how to generate HMAC MD5 digests in Java. Unfortunately, most of them don't generate digests which match the digest examples provided on the HMAC Wikipedia page.

HMAC_MD5("key", "The quick brown fox jumps over the lazy dog") = 0x80070713463e7749b90c2dc24911e275
HMAC_SHA1("key", "The quick brown fox jumps over the […]
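
As a minimal sketch of one way to get digests that line up with the Wikipedia examples, the snippet below uses the JDK's javax.crypto.Mac. The class and method names (HmacSketch, hmacHex) are made up for illustration and are not necessarily what the full post uses. Two details that commonly cause mismatches are handled explicitly: the key and message are encoded as UTF-8, and each byte is rendered as two hex digits so leading zeros are not lost.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper class, for illustration only.
final class HmacSketch {

    // algorithm is a standard JCA name such as "HmacMD5", "HmacSHA1" or "HmacSHA256".
    static String hmacHex(String algorithm, String key, String message) {
        try {
            Mac mac = Mac.getInstance(algorithm);
            mac.init(new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), algorithm));
            byte[] digest = mac.doFinal(message.getBytes(StandardCharsets.UTF_8));

            // Two lowercase hex digits per byte; masking with 0xff avoids sign extension.
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC algorithm unavailable: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        String message = "The quick brown fox jumps over the lazy dog";
        for (String algorithm : new String[] {"HmacMD5", "HmacSHA1", "HmacSHA256"}) {
            System.out.println(algorithm + " = 0x" + hmacHex(algorithm, "key", message));
        }
    }
}
```

For HmacMD5 with the key "key", this should print the same 0x80070713463e7749b90c2dc24911e275 value quoted above.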

Interesting PHP and apache/nginx links

Posted by Kelvin on 25 Nov 2012 | Tagged as: programming, PHP

http://code.google.com/p/rolling-curl/
A more efficient implementation of curl_multi()

https://github.com/krakjoe/pthreads
http://docs.php.net/manual/en/book.pthreads.php
Posix threads in PHP. Whoa!

http://www.underhanded.org/blog/2010/05/05
Installing Apache Worker over prefork.

http://www.wikivs.com/wiki/Apache_vs_nginx
I stumbled on this page when researching the pros/cons of Apache + mod_php vs nginx + php5-fpm

http://barry.wordpress.com/2008/04/28/load-balancer-update/
Nice posting about wordpress.com's use of nginx for load balancing.

Java port of Quicksilver-style Live Search

Posted by Kelvin on 19 Nov 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, programming

Here's a straight Java port of the Quicksilver algorithm, found here: http://orderedlist.com/blog/articles/live-search-with-quicksilver-style-for-jquery/

quicksilver.js contains the actual algorithm in JavaScript. The port uses the same input strings as the demo page at http://static.railstips.org/orderedlist/demos/quicksilverjs/jquery.html

import java.io.IOException;
import java.util.TreeSet;

public class Quicksilver {
  public static void main(String[] args) throws IOException {
    for (ScoreDoc doc : getScores("DGHTD")) System.out.println(doc);
    System.out.println("============================================");
    […]

The easiest way of converting a MySQL DB from latin1 to UTF8

Posted by Kelvin on 16 Nov 2012 | Tagged as: programming

There are *numerous* pages online describing how to fix those awful junk characters in a latin1 column caused by Unicode characters. After spending over 2 hours trying out different methods, I found one that's dead simple and actually works:

Export:
mysqldump -u $user -p --opt --quote-names --skip-set-charset \
    --default-character-set=latin1 $dbname > dump.sql

Import:
mysql -u […]

A lightweight jQuery tooltip plugin that looks good

Posted by Kelvin on 14 Nov 2012 | Tagged as: programming

I checked out a whole bunch of jQuery tooltip plugins for a new website I just created, and just wanted to say that the best, IMHO, was Tipsy. qTip and qTip2 are obviously very full-featured and beautiful, but overkill for my needs: the qTip 1.0.0-rc3 download weighed in at 38KB minified, and 83KB uncompressed. […]

Apache Solr vs ElasticSearch – the website

Posted by Kelvin on 14 Nov 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

Just spent the day hacking together a website that does a blow-by-blow examination of Solr vs ElasticSearch. Hopefully it'll address any questions people might have about whether to use Solr or ES. Let me know what you think!

The anatomy of a Lucene Tokenizer

Posted by Kelvin on 12 Nov 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

A term is the unit of search in Lucene. A Lucene document consists of a set of terms. Tokenization means splitting up a string into tokens, or terms. A Lucene Tokenizer is what Lucene (and, correspondingly, Solr) uses to tokenize text. To implement a custom Tokenizer, you extend org.apache.lucene.analysis.Tokenizer. The only method you need […]
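
To make the idea of tokenization concrete, here is a plain-Java sketch that splits text on whitespace. It deliberately does not extend org.apache.lucene.analysis.Tokenizer, so it runs without Lucene on the classpath; the class name TokenizerSketch and its accessors are hypothetical. It only mirrors the pull-style contract of Lucene's incrementToken(): each call advances to the next token and returns false once the input is exhausted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; a real implementation would extend Lucene's Tokenizer.
final class TokenizerSketch {
    private final String input;
    private int pos = 0;      // current read position in the input
    private String term;      // text of the current token
    private int startOffset;  // inclusive start of the current token
    private int endOffset;    // exclusive end of the current token

    TokenizerSketch(String input) {
        this.input = input;
    }

    // Advances to the next whitespace-delimited token; returns false at end of input.
    boolean incrementToken() {
        int n = input.length();
        while (pos < n && Character.isWhitespace(input.charAt(pos))) {
            pos++; // skip leading whitespace
        }
        if (pos >= n) {
            return false;
        }
        startOffset = pos;
        while (pos < n && !Character.isWhitespace(input.charAt(pos))) {
            pos++; // consume the token
        }
        endOffset = pos;
        term = input.substring(startOffset, endOffset);
        return true;
    }

    String term() { return term; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }

    // Convenience wrapper: collects every term in order.
    static List<String> tokenize(String text) {
        TokenizerSketch tokenizer = new TokenizerSketch(text);
        List<String> terms = new ArrayList<>();
        while (tokenizer.incrementToken()) {
            terms.add(tokenizer.term());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A term is the unit of search"));
    }
}
```

In Lucene itself the token text and offsets are exposed through attributes rather than plain getters, so treat the accessors here as a simplification.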

Tokenizing second-level and top-level domain for a URL in Lucene and Solr

Posted by Kelvin on 12 Nov 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

In my previous post, I described how to extract second- and top-level domains from a URL in Java. Now, I'll build a Lucene Tokenizer out of it, and a Solr TokenizerFactory class. DomainTokenizer doesn't do anything really fancy. It first returns the hostname as the first token, then the second-level domain as the second token, […]

Extracting second-level domains and top-level domains (TLD) from a URL in Java

Posted by Kelvin on 12 Nov 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

It turns out that extracting second- and top-level domains is not a simple task, the primary difficulty being that in addition to the usual suspects (.com .org .net etc), there are the country suffixes (.uk .it .de etc) which need to be accounted for. Regex alone has no way of handling this. http://publicsuffix.org/list/ contains a […]
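
The suffix-lookup idea can be sketched in plain Java as follows. This is only an illustration under loud assumptions: the SUFFIXES set is a tiny hand-picked stand-in for the real list at publicsuffix.org, the wildcard and exception rules of that list are ignored, and the class and method names (DomainSketch, topLevelDomain, secondLevelDomain) are hypothetical rather than taken from the full post.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not the code from the original post.
final class DomainSketch {

    // ASSUMPTION: a tiny sample of suffixes; real code would load the full public suffix list.
    private static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("com", "org", "net", "uk", "co.uk", "it", "de"));

    static String host(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    // Longest known suffix of the host (e.g. "co.uk"), or null if none matches.
    static String topLevelDomain(String host) {
        String candidate = host;
        while (true) {
            if (SUFFIXES.contains(candidate)) {
                return candidate;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null;
            }
            candidate = candidate.substring(dot + 1); // drop the leftmost label and retry
        }
    }

    // The suffix plus the one label to its left (e.g. "example.co.uk"), or null.
    static String secondLevelDomain(String host) {
        String suffix = topLevelDomain(host);
        if (suffix == null || suffix.length() == host.length()) {
            return null; // unknown suffix, or the host is itself a bare suffix
        }
        String prefix = host.substring(0, host.length() - suffix.length() - 1);
        return prefix.substring(prefix.lastIndexOf('.') + 1) + "." + suffix;
    }

    public static void main(String[] args) {
        String host = host("http://www.example.co.uk/index.html");
        System.out.println(topLevelDomain(host));
        System.out.println(secondLevelDomain(host));
    }
}
```

Because candidates are tried from the full hostname downwards, a multi-label suffix such as co.uk is matched before the shorter uk, which is the case a single regex struggles with.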
