Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about programming

My favorite open-source license of all…

Posted by Kelvin on 30 Mar 2013 | Tagged as: programming

WTFPL – Do What the Fuck You Want to Public License

http://www.wtfpl.net/

Installing mosh on Dreamhost

Posted by Kelvin on 26 Mar 2013 | Tagged as: programming

Here's a gist which helps you install mosh on Dreamhost: https://gist.github.com/andrewgiessel/4486779

Generating HMAC MD5/SHA1/SHA256 etc in Java

Posted by Kelvin on 26 Nov 2012 | Tagged as: programming

There are a number of examples online which show how to generate HMAC MD5 digests in Java.

Unfortunately, most of them don't generate digests which match the digest examples provided on the HMAC wikipedia page.

HMAC_MD5("key", "The quick brown fox jumps over the lazy dog") = 0x80070713463e7749b90c2dc24911e275
HMAC_SHA1("key", "The quick brown fox jumps over the lazy dog") = 0xde7c9b85b8b78aa6bc8a7a36f70a90701c9db4d9
HMAC_SHA256("key", "The quick brown fox jumps over the lazy dog") = 0xf7bc83f430538424b13298e6aa6fb143ef4d59a14946175997479dbc2d1a3cd8

Here's a Java class which does produce matching digests. The trick is to call getBytes("ASCII") on the message instead of relying on the platform-default UTF-8. Code courtesy of StackOverflow: http://stackoverflow.com/a/8396600

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.io.UnsupportedEncodingException;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
 
public class HMAC {
 
  public static void main(String[] args) throws Exception {
    System.out.println(hmacDigest("The quick brown fox jumps over the lazy dog", "key", "HmacSHA1"));
  }
 
 
  public static String hmacDigest(String msg, String keyString, String algo) {
    String digest = null;
    try {
      SecretKeySpec key = new SecretKeySpec((keyString).getBytes("UTF-8"), algo);
      Mac mac = Mac.getInstance(algo);
      mac.init(key);
 
      byte[] bytes = mac.doFinal(msg.getBytes("ASCII"));
 
      StringBuilder hash = new StringBuilder();
      for (int i = 0; i < bytes.length; i++) {
        String hex = Integer.toHexString(0xFF & bytes[i]);
        if (hex.length() == 1) {
          hash.append('0');
        }
        hash.append(hex);
      }
      digest = hash.toString();
    } catch (UnsupportedEncodingException e) {
      throw new RuntimeException(e); // don't swallow: silently returning null hides config errors
    } catch (InvalidKeyException e) {
      throw new RuntimeException(e);
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);
    }
    return digest;
  }
}
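To sanity-check the output against the Wikipedia test vectors quoted above, here's a compact, self-contained variant of the same digest logic. The class and method names are mine, and String.format("%02x", …) does the zero-padded hex encoding that the loop above does by hand:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class HmacCheck {
  static String hmac(String algo, String key, String msg) throws Exception {
    Mac mac = Mac.getInstance(algo);
    mac.init(new SecretKeySpec(key.getBytes(StandardCharsets.US_ASCII), algo));
    StringBuilder sb = new StringBuilder();
    for (byte b : mac.doFinal(msg.getBytes(StandardCharsets.US_ASCII)))
      sb.append(String.format("%02x", b)); // zero-padded lowercase hex
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    String msg = "The quick brown fox jumps over the lazy dog";
    System.out.println(hmac("HmacMD5", "key", msg));    // 80070713463e7749b90c2dc24911e275
    System.out.println(hmac("HmacSHA1", "key", msg));   // de7c9b85b8b78aa6bc8a7a36f70a90701c9db4d9
    System.out.println(hmac("HmacSHA256", "key", msg)); // f7bc83f430538424b13298e6aa6fb143ef4d59a14946175997479dbc2d1a3cd8
  }
}
```

All three lines should match the Wikipedia digests exactly.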

Interesting PHP and apache/nginx links

Posted by Kelvin on 25 Nov 2012 | Tagged as: programming, PHP

http://code.google.com/p/rolling-curl/
A more efficient implementation of curl_multi()

https://github.com/krakjoe/pthreads
http://docs.php.net/manual/en/book.pthreads.php
POSIX threads in PHP. Whoa!

http://www.underhanded.org/blog/2010/05/05
Installing Apache Worker over prefork.

http://www.wikivs.com/wiki/Apache_vs_nginx
I stumbled on this page while researching the pros and cons of Apache + mod_php vs nginx + php5-fpm.

http://barry.wordpress.com/2008/04/28/load-balancer-update/
Nice post about WordPress.com's use of nginx for load balancing.

Java port of Quicksilver-style Live Search

Posted by Kelvin on 19 Nov 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, programming

Here's a straight Java port of the quicksilver algo, found here: http://orderedlist.com/blog/articles/live-search-with-quicksilver-style-for-jquery/

quicksilver.js contains the actual algorithm in javascript.

It uses the same input strings as the demo page at http://static.railstips.org/orderedlist/demos/quicksilverjs/jquery.html

import java.io.IOException;
import java.util.TreeSet;
 
public class Quicksilver {
  public static void main(String[] args) throws IOException {
    for (ScoreDoc doc : getScores("DGHTD")) System.out.println(doc);
    System.out.println("============================================");
    for (ScoreDoc doc : getScores("Web")) System.out.println(doc);
    System.out.println("============================================");
    for (ScoreDoc doc : getScores("jhn nnmkr")) System.out.println(doc);
    System.out.println("============================================");
    for (ScoreDoc doc : getScores("wp")) System.out.println(doc);
  }
 
  public static TreeSet<ScoreDoc> getScores(String term) {
    term = term.toLowerCase();
    TreeSet<ScoreDoc> scores = new TreeSet<ScoreDoc>();
    for (int i = 0; i < cache.length; i++) {
      float score = score(cache[i], term, 0);
      if (score > 0) {
        scores.add(new ScoreDoc(score, i));
      }
    }
    return scores;
  }
 
  public static float score(String str, String abbreviation, int offset) {
 
    if (abbreviation.length() == 0) return 0.9f;
    if (abbreviation.length() > str.length()) return 0.0f;
 
    for (int i = abbreviation.length(); i > 0; i--) {
      String sub_abbreviation = abbreviation.substring(0, i);
      int index = str.indexOf(sub_abbreviation);
 
 
      if (index < 0) continue;
      if (index + abbreviation.length() > str.length() + offset) continue;
 
      String next_string = str.substring(index + sub_abbreviation.length());
      String next_abbreviation = null;
 
      if (i >= abbreviation.length())
        next_abbreviation = "";
      else
        next_abbreviation = abbreviation.substring(i);
 
      float remaining_score = score(next_string, next_abbreviation, offset + index);
 
      if (remaining_score > 0) {
        float score = str.length() - next_string.length();
 
        if (index != 0) {
          int j = 0;
 
          char c = str.charAt(index - 1);
          if (c == 32 || c == 9) {
            for (j = (index - 2); j >= 0; j--) {
              c = str.charAt(j);
              score -= ((c == 32 || c == 9) ? 1 : 0.15);
            }
 
            // XXX maybe not port str heuristic
            //
            //          } else if ([[NSCharacterSet uppercaseLetterCharacterSet] characterIsMember:[self characterAtIndex:matchedRange.location]]) {
            //            for (j = matchedRange.location-1; j >= (int) searchRange.location; j--) {
            //              if ([[NSCharacterSet uppercaseLetterCharacterSet] characterIsMember:[self characterAtIndex:j]])
            //                score--;
            //              else
            //                score -= 0.15;
            //            }
          } else {
            score -= index;
          }
        }
 
        score += remaining_score * next_string.length();
        score /= str.length();
        return score;
      }
    }
    return 0.0f;
  }
 
  public static class ScoreDoc implements Comparable<ScoreDoc> {
 
    public float score;
    public int doc;
    public String term;
    public ScoreDoc(float score, int doc) {
      this.score = score;
      this.doc = doc;
      this.term = cache[doc];
    }
 
    public int compareTo(ScoreDoc o) {
      // sort descending by score; tie-break on doc so the TreeSet
      // doesn't silently discard entries with equal scores
      if (o.score < score) return -1;
      if (o.score > score) return 1;
      return doc - o.doc;
    }
 
    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (o == null || getClass() != o.getClass()) return false;
 
      ScoreDoc scoreDoc = (ScoreDoc) o;
 
      if (doc != scoreDoc.doc) return false;
      if (Float.compare(scoreDoc.score, score) != 0) return false;
 
      return true;
    }
 
    @Override
    public int hashCode() {
      int result = (score != +0.0f ? Float.floatToIntBits(score) : 0);
      result = 31 * result + doc;
      return result;
    }
 
    @Override public String toString() {
      final StringBuilder sb = new StringBuilder();
      sb.append("ScoreDoc");
      sb.append("{score=").append(score);
      sb.append(", doc=").append(doc);
      sb.append(", term='").append(term).append('\'');
      sb.append('}');
      return sb.toString();
    }
  }
 
  public static String[] cache = new String[]{
      "The Well-Designed Web",
      "Welcome John Nunemaker",
      "Sidebar Creative: The Next Steps",
      "The Web/Desktop Divide",
      "2007 in Review",
      "Don't Complicate the Solution",
      "Blog to Business",
      "Single Line CSS",
      "Comments Work Again",
      "The iPhone Effect",
      "Greek Blogger Camp",
      "FeedBurner FeedSmith",
      "Download Counter Update 1.3",
      "Branding Reworked",
      "Productivity and Fascination",
      "Passing the Torch",
      "Goodbye Austin",
      "Ordered Shirts",
      "Sidebar Creative",
      "Building the Modern Web",
      "Open for Business",
      "The Art and Science of CSS",
      "WP Tiger Administration v3.0",
      "Cleaning House",
      "Tiger Admin 3.0 Beta Testing",
      "Rails and MVC",
      "Updates and More",
      "FeedBurner Plugin v2.1 Released",
      "The Global Health Crisis",
      "WP FeedBurner v2.1 Beta",
      "Web Development and Information Technology",
      "On Becoming a Dad",
      "Tiger Admin and Shuttle",
      "Staying Small in a Big Place: Part 1",
      "WaSP eduTF Interview",
      "Planned Parenthood",
      "IE7 and Clearing Floats",
      "SXSWi 2006: Dan Gilbert - How To Do Exactly the Right Thing at All Possible Times",
      "SXSWi 2006: Traditional Design and New Technology",
      "SXSWi 2006: Almost There",
      "HOWTO: Animated Live Search",
      "Leaving Solo",
      "Tagged for Four Things",
      "Automotive Interface",
      "Another FeedBurner Plugin Update",
      "WP Tiger Admin 2.0",
      "WordPress FeedBurner Plugin for 2.0",
      "SXSWi 2006",
      "Statistical AJAX",
      "Semantics and Design",
      "Download Counter Update",
      "Best Buy, Worst Experience",
      "A Realign, or Whatever",
      "Stop with the Jargon",
      "10K+ for Tiger Plugin",
      "Flock and Integration",
      "Only the Beginning",
      "A Tip of the Hat",
      "3 Years",
      "Pepper: Download Counter",
      "Launch: Notre Dame College of Arts and Letters",
      "Innovation, Progress, and Imagination",
      "This Thing Here",
      "Ode",
      "Web Developer Opening",
      "WordPress Administration Design: Tiger",
      "SAJAX ColdFusion POST Request Method",
      "An Underscore Solution",
      "Google and the Underscore",
      "The Hand Off",
      "WordPress Upgrade and RSS",
      "WordPress FeedBurner Plugin",
      "Documentation Process",
      "WordPress Underscore Plugin",
      "CMS Release",
      "Two Suggestions for iTunes",
      "Call for Good Music",
      "A Change of Platform",
      "Point/Counterpoint: The Wrapper Div",
      "IE7 List, As Requested",
      "I'm a Switcher",
      "Breadcrumb Trails",
      "Output Code",
      "Bending the Matrix",
      "Children's Resource Group",
      "Do You Freelance?",
      "Project Management Software",
      "I Can't Stand It!",
      "Shiver Me Timbers!",
      "NDWG V1.0",
      "Dealing with IE5/Mac",
      "To All",
      "A Natural Progression",
      "Finishing the Basement",
      "Where Do You Live?",
      "The Recursion Project",
      "Clearing Floats: The FnE Method",
      "Ordered Zen",
      "Comment RSS",
      "Wordpress Code",
      "Writing Lean CSS",
      "v3.0 CMYK",
      "A Clean Slate",
      "Working for the Irish",
      "Excuse the Mess",
      "A Little Help",
      "Design Revisions",
      "Aloha",
      "FTOS Round 2",
      "I Love Storms",
      "One Gig?",
      "AD:TECH 2004 Chicago",
      "Thanks and Response",
      "OrderedList.com v2.0",
      "Skuzzy Standards",
      "Simple List",
      "Anger Management",
      "A Practical Start to Web Standards",
      "Irony and Progress",
      "The Familiar Chirping of Crickets",
      "Results of FTOS Round 1",
      "Figure This Out, Steve",
      "Increasing Developer Productivity",
      "One Down",
      "Content Management Your Way",
      "We Have Liftoff",
      "The Great Divide",
      "What's in a Name?",
      "Just How Important is Validation?"};
 
  static{
    for (int i = 0, n = cache.length; i < n; i++) {
      cache[i] = cache[i].toLowerCase();
    }
  }
}

The easiest way of converting a MySQL DB from latin1 to UTF8

Posted by Kelvin on 16 Nov 2012 | Tagged as: programming

There are *numerous* pages online describing how to fix those awful junk characters that appear in a latin1 column when it actually holds unicode data.

After spending over 2 hours trying out different methods, I found one that's dead simple and actually works:

Export:

mysqldump -u $user -p --opt --quote-names --skip-set-charset \
--default-character-set=latin1 $dbname > dump.sql

Import:

mysql -u $user -p --default-character-set=utf8 $dbname < dump.sql

Thank you Gareth!

http://www.garethsprice.com/blog/2011/fix-mysql-latin1-utf-character-encoding/
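As I understand it, the reason this works is that the column contains UTF-8 bytes that MySQL has labelled latin1: dumping as latin1 copies the bytes through unconverted, and importing as utf8 relabels them correctly. The same round-trip, sketched in Java (the "café" example is mine, for illustration):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
  public static void main(String[] args) {
    String original = "café";
    // UTF-8 bytes misread as latin1: the classic junk-character corruption
    String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                                StandardCharsets.ISO_8859_1);
    System.out.println(garbled); // cafÃ©
    // reversing the misinterpretation recovers the text, which is what the
    // latin1 dump + utf8 import does at the database level
    String fixed = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                              StandardCharsets.UTF_8);
    System.out.println(fixed); // café
  }
}
```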

A lightweight jquery tooltip plugin that looks good

Posted by Kelvin on 14 Nov 2012 | Tagged as: programming

I checked out a whole bunch of jquery tooltip plugins for a new website I just created, and just wanted to say that the best, IMHO, was Tipsy.

qTip and qTip2 are obviously very full-featured and beautiful, but overkill for my needs – the qTip 1.0.0-rc3 download weighed in at 38KB minified and 83KB uncompressed. Holy smokes!! All that for a tooltip jquery lib?!

Tipsy is lightweight, and elegant, and the default css doesn't look half bad, unlike some of the other lightweight tooltip plugins I considered.

Simplistic noun-phrase chunking with POS tags in Java

Posted by Kelvin on 16 Jun 2012 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

I needed to extract Noun-Phrases from text. The way this is generally done is using Part-of-Speech (POS) tags. OpenNLP has both a POS-tagger and a Noun-Phrase chunker. However, it's really really really slow!

I decided to look into alternatives, and chanced upon QTag.

QTag is a "freely available, language independent POS-Tagger. It is implemented in Java, and has been successfully tested on Mac OS X, Linux, and Windows."

It's waaay faster than OpenNLP for POS-tagging, though I haven't done any benchmarks as to accuracy.

Here's my really simplistic but adequate implementation of noun-phrase chunking using QTag.

  private static Qtag qt; // static, since it's lazily initialized from the static method below
  public static List<String> chunkQtag(String str) throws IOException {
    List<String> result = new ArrayList<String>();
    if (qt == null) {
      qt = new Qtag("lib/english");
      qt.setOutputFormat(2);
    }
 
    String[] split = str.split("\n");
    for (String line : split) {
      String s = qt.tagLine(line, true);
      String lastTag = null;
      String lastToken = null;
      StringBuilder accum = new StringBuilder();
      for (String token : s.split("\n")) {
        String[] s2 = token.split("\t");
        if (s2.length < 2) continue;
        String tag = s2[1];
 
        if (tag.equals("JJ")
            || tag.startsWith("NN")
            || tag.startsWith("??")
            || (lastTag != null && lastTag.startsWith("NN") && s2[0].equalsIgnoreCase("of"))
            || (lastToken != null && lastToken.equalsIgnoreCase("of") && s2[0].equalsIgnoreCase("the"))
            ) {
          accum.append(s2[0]).append("-");
        } else {
          if (accum.length() > 0) {
            accum.deleteCharAt(accum.length() - 1);
            result.add(accum.toString());
            accum = new StringBuilder();
          }
        }
        lastTag = tag;
        lastToken = s2[0];
      }
      if (accum.length() > 0) {
        accum.deleteCharAt(accum.length() - 1);
        result.add(accum.toString());
      }
    }
    return result;
  }

The method returns a list of noun phrases.
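QTag isn't needed to see how the grouping works: the same accept-or-flush rule can be exercised on pre-tagged token/tag pairs. The tags below are assumed for illustration (and I've left out the QTag-specific "??" tag):

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkDemo {
  // same grouping rule as chunkQtag, applied to {token, tag} pairs
  static List<String> chunk(String[][] tagged) {
    List<String> result = new ArrayList<String>();
    StringBuilder accum = new StringBuilder();
    String lastTag = null, lastToken = null;
    for (String[] t : tagged) {
      String token = t[0], tag = t[1];
      if (tag.equals("JJ") || tag.startsWith("NN")
          || (lastTag != null && lastTag.startsWith("NN") && token.equalsIgnoreCase("of"))
          || (lastToken != null && lastToken.equalsIgnoreCase("of") && token.equalsIgnoreCase("the"))) {
        accum.append(token).append("-");      // extend the current noun phrase
      } else if (accum.length() > 0) {
        accum.deleteCharAt(accum.length() - 1); // drop trailing "-" and flush
        result.add(accum.toString());
        accum = new StringBuilder();
      }
      lastTag = tag;
      lastToken = token;
    }
    if (accum.length() > 0) {
      accum.deleteCharAt(accum.length() - 1);
      result.add(accum.toString());
    }
    return result;
  }

  public static void main(String[] args) {
    String[][] tagged = {
        {"The", "DT"}, {"quick", "JJ"}, {"brown", "JJ"}, {"fox", "NN"},
        {"jumps", "VBZ"}, {"over", "IN"}, {"the", "DT"}, {"lazy", "JJ"}, {"dog", "NN"}
    };
    System.out.println(chunk(tagged)); // [quick-brown-fox, lazy-dog]
  }
}
```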

Lucene multi-point spatial search

Posted by Kelvin on 14 Apr 2012 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

This post describes a method of augmenting the lucene-spatial contrib package to support multi-point searches. It is quite similar to the method described at http://www.supermind.org/blog/548/multiple-latitudelongitude-pairs-for-a-single-solrlucene-doc with some minor modifications.

The problem is as follows:

A company (mapped as a Lucene doc) has an address associated with it. It also has a list of store locations, which each have an address. Given a lat/long point, return a list of companies which have either a store location or an address within x miles from that point. There should be the ability to search on just company addresses, store locations, or both. EDIT: There is also the need to sort by distance and return distance from the point, not just filter by distance.

This problem requires that you index a "primary" lat/long pair, and multiple "secondary" lat/long pairs, and be able to search only primary lat/long, only secondary lat/long or both.

This excludes the possibility of using SOLR-2155 or LUCENE-3795 as-is, though I'm sure it would have been possible to patch either to do so.

Also, SOLR-2155 depended on Solr, and I needed a pure Lucene 3.5 solution. And MultiValueSource, which SOLR-2155 uses, does not appear to be supported in Lucene 3.5.

The SOLR-2155 implementation is also pretty inefficient: it creates a List object for every single doc in the index in order to support multi-point search.

The general outline of the method is:

1. Search store locations index and collect company IDs and distances
2. Augment DistanceFilter with store location distances
3. Add a BooleanQuery with company IDs. This is to include companies in the final result-set whose address does not match, but have one or more store locations which do
4. Search company index
5. Return results

The algorithm in detail:

1. Index the company address with the company document, i.e. the document containing company fields such as name etc.

2. In a separate index (or in the same index but in a different document "type"), index the store locations, adding the company ID as a field.

3. Given a lat/long point to search, first search the store locations index. Collect a unique list of company doc-ids:distance in a LinkedHashMap, checking for duplicates. Note that this is the lucene doc-id of the store location's corresponding company, NOT the company ID field value. This will be used to augment the distancefilter in the next stage.

Hint: you'll need to use TermDocs to get this, like so:

    for (int i = 0; i < locationHits.docs.totalHits; ++i) {
      int locationDocId = locationHits.docs.scoreDocs[i].doc;
      int companyId = companyIds[locationDocId];
      double distance = locationHits.distanceFilter.getDistance(locationDocId);
      // resolve the company's Lucene doc-id from its "id" field value
      Term t = new Term("id", Integer.toString(companyId));
      TermDocs td = companyReader.termDocs(t);
      if (td.next()) {
        int companyDocId = td.doc();
        // hits arrive sorted by distance, so the first entry per company is the nearest;
        // note the map is keyed by the company's Lucene doc-id, so dedup on that key too
        if (!companyDistances.containsKey(companyDocId)) {
          companyDistances.put(companyDocId, distance);
        }
      }
      td.close();
    }

Since the search returns results sorted by distance (using lucene-spatial's DistanceFilter), you're assured to have a list of company doc ids in ascending order of distance.

In this same pass, also collect a list of company IDs. This will be used to build the BooleanQuery used in the company search.
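The dedup in step 3 leans on that ordering: because hits arrive nearest-first, keeping only the first occurrence per company doc-id records each company's nearest store. A Lucene-free sketch of just that bookkeeping, with hypothetical doc-ids and distances:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NearestStore {
  // hits: {companyDocId, distance} pairs, already sorted ascending by distance
  static Map<Integer, Double> nearest(double[][] hits) {
    Map<Integer, Double> companyDistances = new LinkedHashMap<Integer, Double>();
    for (double[] h : hits) {
      int docId = (int) h[0];
      if (!companyDistances.containsKey(docId)) {
        companyDistances.put(docId, h[1]); // first hit per company = nearest store
      }
    }
    return companyDistances;
  }

  public static void main(String[] args) {
    // company 7 has stores at 0.5 and 2.0 miles; company 3 at 1.2 and 4.4 miles
    double[][] hits = { {7, 0.5}, {3, 1.2}, {7, 2.0}, {3, 4.4} };
    System.out.println(nearest(hits)); // {7=0.5, 3=1.2}
  }
}
```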

4. Set company DistanceFilter's distances. Note: in Lucene 3.5, I added a one-line patch to DistanceFilter so that setDistances() calls putAll() instead of replacing the map.

final DistanceQueryBuilder dq = new DistanceQueryBuilder(centerLat, centerLng, milesF, "lat", "lng", CartesianTierPlotter.DEFALT_FIELD_PREFIX, true, 0, 20);
dq.getDistanceFilter().setDistances(companyDistances);

5. Build BooleanQuery including company IDs

    BooleanQuery bq = new BooleanQuery();
    for(Integer id: companyIds) bq.add(new TermQuery(new Term("id", Integer.toString(id))), BooleanClause.Occur.SHOULD);
    bq.add(distanceQuery, BooleanClause.Occur.SHOULD);

6. Search and return results

Non-blocking/NIO HTTP requests in Java with Jetty's HttpClient

Posted by Kelvin on 05 Mar 2012 | Tagged as: programming, crawling

Jetty 6 and 7 contain an HttpClient class that makes it uber-easy to issue non-blocking HTTP requests in Java. Here is a code snippet to get you started.

Initialize the HttpClient object.

    HttpClient client = new HttpClient();
    client.setConnectorType(HttpClient.CONNECTOR_SELECT_CHANNEL);
    client.setMaxConnectionsPerAddress(200); // max 200 concurrent connections to every address
    client.setTimeout(30000); // 30 seconds timeout; if no server reply, the request expires
    client.start();

Create a ContentExchange object which encapsulates the HTTP request/response interaction.

    ContentExchange exchange = new ContentExchange() {
      @Override protected void onResponseComplete() throws IOException {
        System.out.println(getResponseContent());
      }
    };
 
    exchange.setAddress(new Address("supermind.org", 80));
    exchange.setURL("http://www.supermind.org/index.html");
    client.send(exchange);

We override the onResponseComplete() method to print the response body to console.

By default, an asynchronous request is performed. To run the request synchronously, all you need to do is add the following line:

exchange.waitForDone();
