Thoughts on Lucene, Solr and ElasticSearch 

Posts about programming

Simplistic noun-phrase chunking with POS tags in Java

Posted by Kelvin on 16 Jun 2012 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming

I needed to extract noun phrases from text. The way this is generally done is with Part-of-Speech (POS) tags. OpenNLP has both a POS-tagger and a noun-phrase chunker. However, it's really really really slow!

I decided to look into alternatives, and chanced upon QTag.

QTag is a "freely available, language independent POS-Tagger. It is implemented in Java, and has been successfully tested on Mac OS X, Linux, and Windows."

It's waaay faster than OpenNLP for POS-tagging, though I haven't done any benchmarks on accuracy.

Here's my really simplistic but adequate implementation of noun-phrase chunking using QTag.

  private static Qtag qt; // must be static, since chunkQtag() is static
  public static List<String> chunkQtag(String str) throws IOException {
    List<String> result = new ArrayList<String>();
    if (qt == null) {
      // lazily initialize the tagger with the English resource bundle
      qt = new Qtag("lib/english");
      qt.setOutputFormat(2);
    }

    String[] split = str.split("\n");
    for (String line : split) {
      String s = qt.tagLine(line, true);
      String lastTag = null;
      String lastToken = null;
      StringBuilder accum = new StringBuilder();
      for (String token : s.split("\n")) {
        String[] s2 = token.split("\t");
        if (s2.length < 2) continue;
        String tag = s2[1];

        if (tag.equals("JJ")
            || tag.startsWith("NN")
            || tag.startsWith("??")
            || (lastTag != null && lastTag.startsWith("NN") && s2[0].equalsIgnoreCase("of"))
            || (lastToken != null && lastToken.equalsIgnoreCase("of") && s2[0].equalsIgnoreCase("the"))
            ) {
          accum.append(s2[0]).append("-");
        } else {
          if (accum.length() > 0) {
            accum.deleteCharAt(accum.length() - 1);
            result.add(accum.toString());
            accum = new StringBuilder();
          }
        }
        lastTag = tag;
        lastToken = s2[0];
      }
      if (accum.length() > 0) {
        accum.deleteCharAt(accum.length() - 1);
        result.add(accum.toString());
      }
    }
    return result;
  }
 

The method returns a list of noun phrases.
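
Hypothetical usage (the exact phrases you get back depend on QTag's tagging, but for a simple sentence it's something like this):

List<String> phrases = chunkQtag("The quick brown fox jumps over the lazy dog.");
System.out.println(phrases); // something like: [quick-brown-fox, lazy-dog]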

Lucene multi-point spatial search

Posted by Kelvin on 14 Apr 2012 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming

This post describes a method of augmenting the lucene-spatial contrib package to support multi-point searches. It is quite similar to the method described at http://www.supermind.org/blog/548/multiple-latitudelongitude-pairs-for-a-single-solrlucene-doc, with some minor modifications.

The problem is as follows:

A company (mapped as a Lucene doc) has an address associated with it. It also has a list of store locations, which each have an address. Given a lat/long point, return a list of companies which have either a store location or an address within x miles from that point. There should be the ability to search on just company addresses, store locations, or both. EDIT: There is also the need to sort by distance and return distance from the point, not just filter by distance.

This problem requires that you index a "primary" lat/long pair, and multiple "secondary" lat/long pairs, and be able to search only primary lat/long, only secondary lat/long or both.

This excludes the possibility of using SOLR-2155 or LUCENE-3795 as-is, though I'm sure either could have been patched to support it.

Also, SOLR-2155 depended on Solr, and I needed a pure Lucene 3.5 solution. And MultiValueSource, which SOLR-2155 uses, does not appear to be supported in Lucene 3.5.

The SOLR-2155 implementation is also pretty inefficient: it creates a List object for every single doc in the index in order to support multi-point search.

The general outline of the method is:

1. Search store locations index and collect company IDs and distances
2. Augment DistanceFilter with store location distances
3. Add a BooleanQuery with company IDs. This is to include companies in the final result-set whose address does not match but which have one or more store locations that do
4. Search company index
5. Return results

The algorithm in detail:

1. Index the company address with the company document, i.e. the document containing company fields such as name, etc.

2. In a separate index (or in the same index but in a different document "type"), index the store locations, adding the company ID as a field.
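
Here's a minimal sketch of the two document types using the Lucene 3.x field API (field and variable names are illustrative, and I've elided the CartesianTierPlotter tier fields that lucene-spatial adds for each lat/lng pair):

// company document (one per company)
Document company = new Document();
company.add(new Field("id", companyId, Field.Store.YES, Field.Index.NOT_ANALYZED));
company.add(new Field("name", companyName, Field.Store.YES, Field.Index.ANALYZED));
company.add(new NumericField("lat", Field.Store.YES, true).setDoubleValue(addressLat));
company.add(new NumericField("lng", Field.Store.YES, true).setDoubleValue(addressLng));

// store location document (one per store), in a separate index
Document store = new Document();
store.add(new Field("companyId", companyId, Field.Store.YES, Field.Index.NOT_ANALYZED));
store.add(new NumericField("lat", Field.Store.YES, true).setDoubleValue(storeLat));
store.add(new NumericField("lng", Field.Store.YES, true).setDoubleValue(storeLng));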

3. Given a lat/long point to search, first search the store locations index. Collect a unique mapping of company doc-id to distance in a LinkedHashMap, checking for duplicates. Note that the key is the Lucene doc-id of the store location's corresponding company, NOT the company ID field value. This will be used to augment the DistanceFilter in the next stage.

Hint: you'll need to use TermDocs to get this, like so:

// companyDistances is the LinkedHashMap of company doc-id -> distance;
// companyIds is assumed to be a per-document array of the company ID field
// values for the store-location index (e.g. loaded via FieldCache)
for (int i = 0; i < locationHits.docs.totalHits; ++i) {
  int locationDocId = locationHits.docs.scoreDocs[i].doc;
  int companyId = companyIds[locationDocId];
  double distance = locationHits.distanceFilter.getDistance(locationDocId);
  // resolve the company ID field value to the company's Lucene doc-id
  Term t = new Term("id", Integer.toString(companyId));
  TermDocs td = companyReader.termDocs(t);
  if (td.next()) {
    int companyDocId = td.doc();
    // hits arrive in ascending distance order, so keep only the nearest
    if (!companyDistances.containsKey(companyDocId)) {
      companyDistances.put(companyDocId, distance);
    }
  }
  td.close();
}
 

Since the search returns results sorted by distance (using lucene-spatial's DistanceFilter), you're guaranteed that the company doc-ids are collected in ascending order of distance.

In this same pass, also collect a list of company IDs. These will be used to build the BooleanQuery for the company search.

4. Set company DistanceFilter's distances. Note: in Lucene 3.5, I added a one-line patch to DistanceFilter so that setDistances() calls putAll() instead of replacing the map.

final DistanceQueryBuilder dq = new DistanceQueryBuilder(centerLat, centerLng, milesF, "lat", "lng", CartesianTierPlotter.DEFALT_FIELD_PREFIX, true, 0, 20);
dq.getDistanceFilter().setDistances(companyDistances);
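
For reference, the patched method is roughly:

// DistanceFilter.setDistances() after the one-line patch;
// the stock Lucene 3.5 version assigns the incoming map outright
public void setDistances(Map<Integer, Double> distances) {
  this.distances.putAll(distances);
}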
 

5. Build a BooleanQuery including the company IDs

    BooleanQuery bq = new BooleanQuery();
    for (Integer id : companyIds) {
      bq.add(new TermQuery(new Term("id", Integer.toString(id))), BooleanClause.Occur.SHOULD);
    }
    bq.add(distanceQuery, BooleanClause.Occur.SHOULD);
 

6. Search and return results
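
Putting it together, the final company search looks something like this sketch (searcher setup elided; the SortField name is arbitrary since the DistanceFieldComparatorSource supplies the ordering):

Sort sort = new Sort(new SortField("distance",
    new DistanceFieldComparatorSource(dq.getDistanceFilter())));
TopDocs hits = companySearcher.search(bq, null, 100, sort);

for (ScoreDoc sd : hits.scoreDocs) {
  // distance from the point, whether it came via the address or a store location
  Double distance = dq.getDistanceFilter().getDistance(sd.doc);
  // ... load the company doc and attach the distance to the result
}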

Non-blocking/NIO HTTP requests in Java with Jetty's HttpClient

Posted by Kelvin on 05 Mar 2012 | Tagged as: crawling, programming

Jetty 6 and 7 contain an HttpClient class that makes it uber-easy to issue non-blocking HTTP requests in Java. Here is a code snippet to get you started.

Initialize the HttpClient object.

    HttpClient client = new HttpClient();
    client.setConnectorType(HttpClient.CONNECTOR_SELECT_CHANNEL);
    client.setMaxConnectionsPerAddress(200); // max 200 concurrent connections to every address
    client.setTimeout(30000); // 30 seconds timeout; if no server reply, the request expires
    client.start();
 

Create a ContentExchange object which encapsulates the HTTP request/response interaction.

    ContentExchange exchange = new ContentExchange() {
      @Override protected void onResponseComplete() throws IOException {
        System.out.println(getResponseContent());
      }
    };

    exchange.setAddress(new Address("supermind.org", 80));
    exchange.setURL("http://www.supermind.org/index.html");
    client.send(exchange);
 

We override the onResponseComplete() method to print the response body to console.

By default, the request is performed asynchronously. To block until the request completes, all you need to do is add the following line:

exchange.waitForDone();
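
In the Jetty 7 API, waitForDone() also returns the exchange's final status, which is worth checking:

int status = exchange.waitForDone();
if (status != HttpExchange.STATUS_COMPLETED) {
  // the exchange expired, was cancelled, or threw an exception
}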
 

Book review of Apache Solr 3 Enterprise Search Server

Posted by Kelvin on 28 Feb 2012 | Tagged as: Lucene / Solr / Elastic Search / Nutch, programming

Apache Solr 3 Enterprise Search Server published by Packt Publishing is the only Solr book available at the moment.

It's a fairly comprehensive book, and discusses many new Solr 3 features. Considering the breakneck pace of Solr development and the rate at which new features get introduced, you have to hand it to the authors for releasing a book which isn't outdated by the time it hits bookshelves.

Nonetheless, it does have shortcomings. I'll cover some of these shortly.

Firstly, the table of contents:

Chapter 1: Quick Starting Solr
Chapter 2: Schema and Text Analysis
Chapter 3: Indexing Data
Chapter 4: Searching
Chapter 5: Search Relevancy
Chapter 6: Faceting
Chapter 7: Search Components
Chapter 8: Deployment
Chapter 9: Integrating Solr
Chapter 10: Scaling Solr
Appendix: Search Quick Reference

A complete TOC with chapter sections is available here: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book

The good points

The book does an overall excellent job of covering Solr basics such as the Lucene query syntax, scoring, schema.xml, DIH (the DataImportHandler), faceting and the various search components.

There are chapters dedicated to deploying, integrating and scaling Solr, which is nice. I found the Scaling Solr chapter in particular to be filled with common performance-enhancement tips.

The DisMax query parser is covered in great detail, which is good because I've often found it to be a stumbling block for new Solr users.

The bad points

Not many, but here are a few gripes.

The two most important files a new Solr user needs to understand are schema.xml and solrconfig.xml, and more emphasis should have been placed on them early on. I don't even see solrconfig.xml anywhere in the TOC.

There is no mention of the Solr admin interface, which is an absolute gem for a number of tasks, such as understanding tokenizers. In the text analysis section of Chapter 2, there really should be a walkthrough of Solr Admin's analyzer interface.

I think there could at least have been an attempt at describing the underlying data structure in which documents are stored (the inverted index), as well as a basic introduction to the tf.idf scoring model. There is no mention of this at all in Chapter 5, Search Relevancy. One could argue that this is out of scope for the book, but if a reader is to arrive at a deep understanding of what Lucene really is, understanding inverted indices and tf.idf is clearly a must.

Summary

All in all, Apache Solr 3 Enterprise Search Server is a book I'd heartily recommend to new or even moderately experienced users of Apache Solr.

It brings together information which is spread throughout the Lucene and Solr wiki and javadocs, making it a handy desk reference.

Download KhanAcademy videos with a PHP crawler

Posted by Kelvin on 08 Oct 2011 | Tagged as: PHP, programming

At the moment (October 2011), there's no simple way to download all the videos in a playlist from KhanAcademy.org.

This simple PHP crawler script changes that. :-)

It downloads the videos (from archive.org) to a subfolder, numbering and naming them with their respective titles (not the gibberish titles archive.org has assigned them). Additionally, because it uses wget --continue, the crawler supports auto-resume, so even if your computer crashes in the middle of a crawl, you don't need to start all over again.

Usage

Usage is like this, assuming the script is named downkhan.php:

php downkhan.php {folder} {urls.txt}
php downkhan.php history history.txt
 

where folder is the subdirectory to save the videos in, and urls.txt is a list of URLs obtained by running a regex on http://www.khanacademy.org/#browse.

Regex

The regex used was

href="(.*?)".*?><span.*?>(.*?)</span>
 

URLs

Here are a few lines of a urls.txt file:

http://www.khanacademy.org/video/scale-of-earth-and--sun?playlist=Cosmology+and+Astronomy|Scale of Earth and  Sun
http://www.khanacademy.org/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy|Scale of Solar System
http://www.khanacademy.org/video/scale-of-distance-to-closest-stars?playlist=Cosmology+and+Astronomy|Scale of Distance to Closest Stars
 

Here's a list of what I've created so far:

http://www.supermind.org/code/history.txt
http://www.supermind.org/code/biology.txt
http://www.supermind.org/code/finance.txt
http://www.supermind.org/code/cosmology.txt
http://www.supermind.org/code/healthcare.txt
http://www.supermind.org/code/linearalgebra.txt
http://www.supermind.org/code/statistics.txt

Script code

And here's the script:

<?php
$args = $_SERVER['argv'];
$folder = $args[1];
$file = $args[2];

$arr = explode("\n", trim(file_get_contents(getcwd()."/".$file)));
$urls = array();
foreach($arr as $k) {
  $split = explode("|", $k);
  $urls[$split[0]] = $split[1];
}

mkdir($folder);
chdir($folder);
$counter = 0;

foreach($urls as $url=>$title) {
  $counter++;

  echo "Fetching $url\n";
  $html = '';
  while(!$html) $html = fetch_url($url);
  $vid = get_match("/<a href=\"(http:\/\/www.archive.org.*?)\"/", $html);
  $outfile = "$counter. $title.mp4";

  `wget --continue "$vid" -O "$outfile"`;
}

function get_match($pattern, $s) {
  preg_match($pattern, $s, $matches);
  if($matches) {
    return $matches[1];
  } else return NULL;
}

function fetch_url($url)
{
    $curl_handle = curl_init(); // initialize curl handle
    curl_setopt($curl_handle, CURLOPT_URL, $url); // set url to post to
    curl_setopt($curl_handle, CURLOPT_FAILONERROR, 1);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($curl_handle, CURLOPT_TIMEOUT, 20); // overall timeout in seconds
    curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, 1); // allow redirects
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1); // return into a variable
    curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('Accept: */*', 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows)'));
    $result = curl_exec($curl_handle); // run the whole process
    if ($result === false) {
        echo 'Curl error: ' . curl_error($curl_handle);
    }
    curl_close($curl_handle);
    return $result;
}

function rel2abs($rel, $base)
{
    /* return if already absolute URL */
    if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;

    /* queries and anchors */
    if ($rel[0] == '#' || $rel[0] == '?') return $base . $rel;

    /* parse base URL and convert to local variables:
 $scheme, $host, $path */

    extract(parse_url($base));

    /* remove non-directory element from path */
    $path = preg_replace('#/[^/]*$#', '', $path);

    /* destroy path if relative url points to root */
    if ($rel[0] == '/') $path = '';

    /* dirty absolute URL */
    $abs = "$host$path/$rel";

    /* replace '//' or '/./' or '/foo/../' with '/' */
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {
    }

    /* absolute URL is ready! */
    return $scheme . '://' . $abs;
}
 

Painless CRUD in PHP via AjaxCrud

Posted by Kelvin on 08 Oct 2011 | Tagged as: PHP, programming

I recently discovered an Ajax CRUD library which makes CRUD operations positively painless: AjaxCRUD

Its features include:

- displaying list in an inline-editable table
- generates a create form
- all operations (add,edit,delete) handled via ajax
- supports 1:many relations
- only 1 class to include!!

I highly recommend you try it out!

Here is the example code:

# the code for the class
include ('ajaxCRUD.class.php');

# this one line of code is how you implement the class
$tblCustomer = new ajaxCRUD("Customer",
                             "tblCustomer", "pkCustomerID");

# don't show the primary key in the table
$tblCustomer->omitPrimaryKey();

# my db fields all have prefixes;
# display headers as reasonable titles
$tblCustomer->displayAs("fldFName", "First");
$tblCustomer->displayAs("fldLName", "Last");
$tblCustomer->displayAs("fldPaysBy", "Pays By");
$tblCustomer->displayAs("fldDescription", "Customer Info");

# set the height for my textarea
$tblCustomer->setTextareaHeight('fldDescription', 100);

# define allowable fields for my dropdown fields
# (this can also be done for a pk/fk relationship)
$values = array("Cash", "Credit Card", "Paypal");
$tblCustomer->defineAllowableValues("fldPaysBy", $values);

# add the filter box (above the table)
$tblCustomer->addAjaxFilterBox("fldFName");

# actually show the table
$tblCustomer->showTable();
 

HOWTO: Collect WebDriver HTTP Request and Response Headers

Posted by Kelvin on 22 Jun 2011 | Tagged as: crawling, Lucene / Solr / Elastic Search / Nutch, programming

WebDriver is a fantastic Java API for web application testing. It has recently been merged into the Selenium project to provide a friendlier API for programmatic simulation of web browser actions. Its unique property is that it executes web pages in real browsers such as Firefox, Chrome, IE etc., and then gives you programmatic access to the resulting DOM model.

The problem with WebDriver, though, as reported here, is that because the underlying browser does the actual fetching (as opposed to, say, Commons HttpClient), it's currently not possible to obtain the HTTP request and response headers, which is kind of a PITA.

I present here a method of obtaining HTTP request and response headers via an embedded proxy, derived from the Proxoid project.

ProxyLight from Proxoid

ProxyLight is the lightweight standalone proxy from the Proxoid project. It's released under the Apache Public License.

The original code only provided request filtering, and performed no response filtering, forwarding data directly from the web server to the requesting client.

I made some modifications to intercept and parse HTTP response headers.

Get my version here (released under APL): http://downloads.supermind.org/proxylight-20110622.zip

Using ProxyLight from WebDriver

The modified ProxyLight allows you to process both request and response.

This has the added benefit of allowing you to write a RequestFilter which ignores images, or URLs from certain domains. Sweet!

What your WebDriver code has to do then, is:

  1. Ensure the ProxyLight server is started
  2. Add Request and Response Filters to the ProxyLight server
  3. Maintain a cache of intercepted requests and responses which you can then retrieve
  4. Ensure the native browser uses our ProxyLight server

Here's a sample class to get you started

package org.supermind.webdriver;

import com.mba.proxylight.ProxyLight;
import com.mba.proxylight.Response;
import com.mba.proxylight.ResponseFilter;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;

import java.util.LinkedHashMap;
import java.util.Map;

public class SampleWebDriver {
  protected int localProxyPort = 5368;
  protected ProxyLight proxy;

  // LRU response table. Note: this is not thread-safe.
  // Use ConcurrentLinkedHashMap instead: http://code.google.com/p/concurrentlinkedhashmap/
  private LinkedHashMap<String, Response> responseTable = new LinkedHashMap<String, Response>() {
    protected boolean removeEldestEntry(Map.Entry eldest) {
      return size() > 100;
    }
  };

  public Response fetch(String url) {
    if (proxy == null) {
      initProxy();
    }
    FirefoxProfile profile = new FirefoxProfile();

    /**
     * Get the native browser to use our proxy
     */

    profile.setPreference("network.proxy.type", 1);
    profile.setPreference("network.proxy.http", "localhost");
    profile.setPreference("network.proxy.http_port", localProxyPort);

    FirefoxDriver driver = new FirefoxDriver(profile);

    // Now fetch the URL
    driver.get(url);

    Response proxyResponse = responseTable.remove(driver.getCurrentUrl());

    return proxyResponse;
  }

  private void initProxy() {
    proxy = new ProxyLight();

    this.proxy.setPort(localProxyPort);

    // this response filter adds the intercepted response to the cache
    this.proxy.getResponseFilters().add(new ResponseFilter() {
      public void filter(Response response) {
        responseTable.put(response.getRequest().getUrl(), response);
      }
    });

    // add request filters here if needed
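    // A request filter could, for example, skip images. The sketch below is an
    // assumption: it mirrors the ResponseFilter contract used above, so check
    // the actual RequestFilter interface in the ProxyLight source before using it.
    //
    // this.proxy.getRequestFilters().add(new RequestFilter() {
    //   public boolean filter(Request request) {
    //     return request.getUrl().matches(".*\\.(gif|jpe?g|png)$");
    //   }
    // });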

    // now start the proxy
    try {
      this.proxy.start();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public static void main(String[] args) {
    SampleWebDriver driver = new SampleWebDriver();
    Response res = driver.fetch("http://www.lucenetutorial.com");
    System.out.println(res.getHeaders());
  }
}

 

Solr 3.2 released!

Posted by Kelvin on 22 Jun 2011 | Tagged as: crawling, Lucene / Solr / Elastic Search / Nutch, programming

I'm a little slow off the block here, but I just wanted to mention that Solr 3.2 has been released!

Get your download here: http://www.apache.org/dyn/closer.cgi/lucene/solr

Solr 3.2 release highlights include:

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
  • TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses, instead of Explanation.toString
  • Improvements to the UIMA and Carrot2 integrations

I had personally been looking forward to the overwrite request param addition to the JSON update format, so I'm delighted about this release.

Great work guys!

Classical learning curves for some editors

Posted by Kelvin on 20 Jun 2011 | Tagged as: programming

PHP function to send an email with file attachment

Posted by Kelvin on 11 Jun 2011 | Tagged as: PHP, programming

Courtesy of http://www.finalwebsites.com/forums/topic/php-e-mail-attachment-script

function mail_attachment($filename, $path, $mailto, $from_mail, $from_name, $replyto, $subject, $message) {
    $file = $path.$filename;
    $file_size = filesize($file);
    $handle = fopen($file, "r");
    $content = fread($handle, $file_size);
    fclose($handle);
    $content = chunk_split(base64_encode($content));
    $uid = md5(uniqid(time()));
    $name = basename($file);
    $header = "From: ".$from_name." <".$from_mail.">\r\n";
    $header .= "Reply-To: ".$replyto."\r\n";
    $header .= "MIME-Version: 1.0\r\n";
    $header .= "Content-Type: multipart/mixed; boundary=\"".$uid."\"\r\n\r\n";
    $header .= "This is a multi-part message in MIME format.\r\n";
    $header .= "--".$uid."\r\n";
    $header .= "Content-type:text/plain; charset=iso-8859-1\r\n";
    $header .= "Content-Transfer-Encoding: 7bit\r\n\r\n";
    $header .= $message."\r\n\r\n";
    $header .= "--".$uid."\r\n";
    $header .= "Content-Type: application/octet-stream; name=\"".$filename."\"\r\n"; // use different content types here
    $header .= "Content-Transfer-Encoding: base64\r\n";
    $header .= "Content-Disposition: attachment; filename=\"".$filename."\"\r\n\r\n";
    $header .= $content."\r\n\r\n";
    $header .= "--".$uid."--";
    if (mail($mailto, $subject, "", $header)) {
        echo "mail send ... OK"; // or use booleans here
    } else {
        echo "mail send ... ERROR!";
    }
}
 

