Blog 

Finally, Firefox is complete!

Posted by Kelvin on 23 Jan 2006 | Tagged as: work

I stumbled upon a Firefox extension (http://nextplease.mozdev.org/) that does what I’ve always missed from Opera: the ability to go to the next page in Google using keyboard shortcuts alone!
YES!!
The idea is, you use the space bar to scroll down a search result page, and when you hit the bottom, just hit Ctrl+Space (I [...]

Getting a US work visa

Posted by Kelvin on 22 Jan 2006 | Tagged as: work

I’ve recently obtained my work visa to stay in the US, so I thought I’d write a bit about my experiences.
The major routes to getting a green card/work visa in the US are:

Green card lottery
Investor’s green card
Work visa
Marriage visa
Extraordinary Alien visa

I will only talk about getting a work visa, but I have done a fair amount [...]

Crawling Basics

Posted by Kelvin on 27 Oct 2005 | Tagged as: Lucene / Nutch, work

Here’s my attempt at providing an introduction to crawling, in the hope that it’ll clarify some of my own thoughts on the topic.
The most basic tasks of an HTTP crawler are:

Download URLs via the HTTP protocol.
Extract URLs from downloaded pages to be added to the download queue.

That’s actually very simple, and crawlers are very simple [...]
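
The two tasks listed above can be sketched in a few lines of Java. This is only a rough illustration, not code from Nutch or OC: the class name MiniCrawler, the regex-based link extraction, and the pluggable fetch function are all assumptions made for the example. A real crawler would use an HTML parser, resolve relative links, and respect robots.txt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not part of Nutch or OC.
class MiniCrawler {
    // Naive pattern that only matches absolute http(s) links in href attributes.
    private static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

    // Task 1: download a URL via HTTP. Returns null if the download fails.
    static String httpFetch(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
                return body.toString();
            }
        } catch (IOException e) {
            return null;
        }
    }

    // Task 2: extract URLs from a downloaded page.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    // Ties the two tasks together with a FIFO download queue.
    // Returns the URLs visited, in the order they were dequeued.
    static Set<String> crawl(String seed, Function<String, String> fetch, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        queue.add(seed);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.remove();
            if (!visited.add(url)) {
                continue; // already visited
            }
            String body = fetch.apply(url);
            if (body == null) {
                continue; // download failed, nothing to extract
            }
            for (String link : extractUrls(body)) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

Calling MiniCrawler.crawl(seedUrl, MiniCrawler::httpFetch, 50) would crawl over the network; passing a different fetch function (say, a lookup into an in-memory map) makes the queue logic easy to try out offline.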

Practical introduction to Nutch MapReduce

Posted by Kelvin on 28 Sep 2005 | Tagged as: Lucene / Nutch, work

Some terminology first:

Mapper
Performs the map() function. The name will make sense when you look at it as “mapping” a function/operation to elements in a list.

Reducer
Performs the reduce() function. Merges multiple input values with the same key to produce a single output value.

OutputFormat/InputFormat
Classes which tell Nutch how to process input files and what [...]
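
To make the Mapper and Reducer definitions above a little more concrete, here is a word-count sketch in plain Java. It deliberately does not use the Nutch MapReduce classes: MapReduceSketch and its map, reduce, and run methods are hypothetical names made up for this example, and everything runs in a single process with in-memory lists.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process illustration; not the Nutch MapReduce API.
class MapReduceSketch {
    // map(): applied to each input element, emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // reduce(): merges all values that share a key into one output value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map over every line, groups the emitted values by key,
    // then runs reduce once per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

The grouping step in run() stands in for what a real framework does between the map and reduce phases; in a distributed setting that step also moves data between machines.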

Hello World for MapReduce

Posted by Kelvin on 28 Sep 2005 | Tagged as: Lucene / Nutch, work

Here’s a Hello World tutorial as part of my attempts to grok MapReduce.
HelloWorld.java :
import org.apache.nutch.mapred.JobClient;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.util.NutchConf;
import java.io.File;

public class HelloWorld {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("HelloWorld");
            System.exit(-1);
        }

[...]

OC and Nutch MapReduce

Posted by Kelvin on 15 Sep 2005 | Tagged as: Lucene / Nutch, programming, work

(or what’s next for OC)…
I’ve received a couple of emails asking about the future of OC: whether it will be incorporated into the Nutch codebase, how it relates to the upcoming MapReduce merge into trunk, etc.
My thoughts are:

When MapReduce is merged into trunk, I’ll make appropriate changes to OC to support MapReduce.
This MapReduce-compatible OC will be offered to the Nutch codebase. As of [...]

Inside Our Crawler

Posted by Kelvin on 25 Aug 2005 | Tagged as: Lucene / Nutch, programming, work

Inspired by http://wiki.apache.org/nutch/DissectingTheNutchCrawler
I guess the contents of this post will eventually make it to javadocs. Note that Our Crawler (OC) is really not so different from Nutch Crawler (NC). This document highlights the main differences, as well as important classes.
CrawlTool
CrawlTool is the point of entry to OC. It doesn’t do very much really, just calls [...]

Our Crawler Todo List

Posted by Kelvin on 25 Aug 2005 | Tagged as: Lucene / Nutch, programming

In the order in which these are jumping off my brain (read: no order whatsoever):

If-Modified-Since
OC already has the basic infrastructure in place to implement conditional downloading of pages based on the If-Modified-Since HTTP header. This just needs to be implemented in the Http and HttpResponse classes.

Re-use socket connections even for redirects
Right now, redirects are always [...]
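
For the If-Modified-Since item above, a conditional download might look roughly like the following. This sketch uses the JDK's java.net.HttpURLConnection rather than OC's own Http and HttpResponse classes, and the ConditionalFetch class and its method names are assumptions made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for illustration; not OC's Http/HttpResponse classes.
class ConditionalFetch {
    // A 304 status means the server's copy has not changed since the given date.
    static boolean isNotModified(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    // Returns the page bytes, or null when the server answers 304 Not Modified,
    // in which case the previously stored copy can be kept.
    static byte[] fetchIfModified(String url, long lastFetchedMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // Sends the If-Modified-Since request header with this timestamp.
        conn.setIfModifiedSince(lastFetchedMillis);
        try {
            if (isNotModified(conn.getResponseCode())) {
                return null;
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

The caller would need to store the time of the last successful fetch per URL and pass it in as lastFetchedMillis; a null return then means "skip re-parsing this page".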

Limitations of OC

Posted by Kelvin on 19 Aug 2005 | Tagged as: Lucene / Nutch, programming, work

Follow-up post to some Reflections on modifying the Nutch crawler.
The current implementation of Our Crawler (OC) has the following limitations:

No support for distributed crawling.
I neglected to mention that by building the fetchlist offline, the Nutch Crawler (NC) has an easier job splitting the crawling amongst different crawl servers. Furthermore, because the database of fetched [...]

Reflections on modifying the Nutch crawler

Posted by Kelvin on 16 Aug 2005 | Tagged as: Lucene / Nutch, programming, work

The code for my modifications to the Nutch-based fetcher is at alpha quality, meaning that it compiles, and I’ve tested it on smallish (<100 pages) sites. There are some unit tests written, but not as extensive as I’d like.
Some thoughts:

Nutch crawler (NC) uses an offline fetchlist building strategy, whilst our crawler (OC) supports both online [...]


07/04/08 | Kelvin Tan | Lucene Vertical Search Consultant