Blog 

Ehud on forces driving commercial programming language creation

Posted by Kelvin on 01 Jan 2007 | Tagged as: programming

Just discovered this gem on LtU here:

Ehud Lamm - Re: Growing a Language
5/25/2004; 3:02:35 PM (reads: 118, responses: 0)
(in response to a previous post: I’ll now turn the tables back: is it possible that the “commercial” languages (VB, Java, C#) gained popularity because they were created to provide programmers what they wanted rather than what [...]

Spam, moderation and life..

Posted by Kelvin on 01 Jan 2007 | Tagged as: programming

Just spent the last 2 hours cleaning up my spam-filled moderation queue, and have enabled Spam Karma (http://unknowngenius.com/blog/wordpress/spam-karma/).
Apologies to anyone whose comments never made it public!

PHP + Lucene integration

Posted by Kelvin on 01 Jan 2007 | Tagged as: Lucene / Nutch, programming

I’ve had very positive experiences integrating a PHP front-end with a Lucene back-end, not with any fancy Java-in-PHP wizardry or Zend Lucene, but plain-old JSON-over-REST.
I’ve done some simple load tests, and it clearly outperforms Zend Lucene (though I can’t offer you any numbers to back this claim up). Zend Lucene seems to suck bad at [...]

A simple API-friendly crawler

Posted by Kelvin on 01 Dec 2006 | Tagged as: Lucene / Nutch, programming

Alright. I know I’ve blogged about this before. Well, I’m revisiting it.
My sense is that there’s a real need for a simple crawler which is easy to use as an API and doesn’t attempt to be everything to everyone.
Yes, Nutch is cool, but I’m so tired of fiddling around with configuration files, [...]

Search and crawling internship

Posted by Kelvin on 31 Oct 2006 | Tagged as: Lucene / Nutch, programming

I’m looking for a competent Java programmer who wants to get into the world of search engines and crawlers. The internship will involve a mixture of (some) training and (mostly) hands-on projects with real-world clients.
In particular, my area of expertise is in vertical search (real estate, jobs, classifieds), so more than likely, that will be [...]

Normalized Google Distance

Posted by Kelvin on 30 Oct 2006 | Tagged as: programming

http://blog.outer-court.com/archive/2005-01-27-n48.html has an interesting article on Normalized Google Distance.
In short, it uses Google page counts to estimate the semantic distance/similarity between two words.
I unknowingly used this in a recent project where we were attempting to detect the sentiment of blogs. For example, is a blog post positively or negatively slanted towards a movie?
The general idea was [...]
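
For reference, here is a minimal sketch of the Normalized Google Distance formula itself, not the sentiment project's code. It assumes you already have the page counts in hand: fx and fy for each word alone, fxy for pages containing both, and n for the total number of indexed pages. The function name and the numbers in the usage lines are made up for illustration.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Return the NGD of two terms from their page counts.

    fx, fy -- number of pages containing each term on its own
    fxy    -- number of pages containing both terms
    n      -- total number of pages indexed (an assumed constant)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # The terms never co-occur (or one is unseen): treat as infinitely far apart.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Illustrative counts only, not real search results.
identical = normalized_google_distance(1000, 1000, 1000, 10**6)
related = normalized_google_distance(1000, 1000, 10, 10**6)
```

A distance of 0 means the two words always appear together; larger values mean they co-occur less often than their individual counts would suggest.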

Nutch 0.8, Map & Reduce, here I come!

Posted by Kelvin on 09 Aug 2006 | Tagged as: Lucene / Nutch, programming

Finally taking the plunge to Nutch 0.8 after exclusively working with 0.7 for over a year (and something like 5 projects).
From initial experiences, it appears that using M&R does obfuscate the code somewhat for a developer who wants to build an app off the Nutch infrastructure instead of using it out-of-box. For example, trying [...]

Lucene scoring for dummies

Posted by Kelvin on 08 Mar 2006 | Tagged as: Lucene / Nutch

The factors involved in Lucene’s scoring algorithm are as follows:
1. tf = term frequency in document = measure of how often a term appears in the document
2. idf = inverse document frequency = measure of how often the term appears across the index
3. coord = number of terms in the query that were found in [...]
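
As a rough illustration of how these factors combine, here is a small Python sketch. The tf, idf and coord formulas follow what Lucene's classic DefaultSimilarity documents (square root of the term frequency, 1 + ln(numDocs / (docFreq + 1)), and matched terms over total query terms). The simple_score function is a deliberately simplified assumption of mine: it leaves out boosts, length norms and query normalization, so it will not reproduce real Lucene scores.

```python
import math


def tf(freq):
    # Term frequency factor: grows with the square root of the raw count.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms across the index score higher.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def coord(overlap, max_overlap):
    # Fraction of the query terms that were found in the document.
    return overlap / max_overlap


def simple_score(query_terms, doc_term_freqs, doc_freqs, num_docs):
    """Simplified score: coord * sum(tf * idf^2) over the matched terms.

    Hypothetical helper for illustration; ignores boosts, norms and queryNorm.
    """
    matched = [t for t in query_terms if doc_term_freqs.get(t, 0) > 0]
    if not matched:
        return 0.0
    total = sum(tf(doc_term_freqs[t]) * idf(doc_freqs[t], num_docs) ** 2
                for t in matched)
    return coord(len(matched), len(query_terms)) * total


# Made-up index statistics: 10 documents, "lucene" appears in 9 of them.
score = simple_score(["lucene", "nutch"], {"lucene": 4},
                     {"lucene": 9, "nutch": 3}, 10)
```

In this example only one of the two query terms matches, so coord halves whatever the matching term contributes.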

OC and focused crawling

Posted by Kelvin on 26 Feb 2006 | Tagged as: Lucene / Nutch

I’ve had the good fortune to get paid to work on OC (Our Crawler). Features I’ve been developing have been for focused crawling purposes.
Specifically:

Ranking content by relevance to a supplied query and crawling the most relevant links first, with the possibility of specifying a score threshold
Checkpointing the crawl output (which is a Nutch segment) [...]
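
To make the "most relevant links first" idea concrete, here is a minimal sketch of a crawl frontier with a score threshold. This is not OC's code: the Frontier class and its methods are hypothetical names, and the scores are assumed to come from whatever relevance function you plug in.

```python
import heapq


class Frontier:
    """Priority queue of URLs, highest relevance score first (illustrative only)."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold  # links scoring below this are dropped
        self._heap = []
        self._seen = set()
        self._counter = 0           # tie-breaker so equal scores pop in insertion order

    def add(self, url, score):
        """Queue a URL; return False if it is a duplicate or below the threshold."""
        if score < self.threshold or url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score to pop the best link first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Return (url, score) for the most relevant queued link."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Made-up URLs and scores.
frontier = Frontier(threshold=0.5)
frontier.add("http://example.com/a", 0.9)
frontier.add("http://example.com/b", 0.2)   # below threshold, ignored
frontier.add("http://example.com/c", 0.7)
first = frontier.pop()
```

A crawl loop would pop a link, fetch it, score its outlinks against the query, and add them back to the frontier until it is empty.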

The next few months for OC

Posted by Kelvin on 28 Jan 2006 | Tagged as: Lucene / Nutch

I had a chat with Mike from Atlassian recently, and have arrived at the conclusion that the future of OC lies in being a crawler API, much like what Lucene does for searching. I suppose it will lie somewhere between Nutch (full-blown whole-web crawler) and Commons HTTPClient.
Some directions I will explore include:

Introducing checkpointing to recover [...]


07/04/08 | Kelvin Tan | Lucene Vertical Search Consultant