Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about work

TokyoCabinet HDB slowdown

Posted by Kelvin on 10 Oct 2009 | Tagged as: work, programming

http://www.dmo.ca/blog/benchmarking-hash-databases-on-large-data/ reported that with a large number of records, puts become increasingly slow.

I experienced a similar phenomenon, and just stumbled upon http://parand.com/say/index.php/2009/04/09/tokyo-cabinet-observations/ , where I realized my problem was with bnum being too small (default of 128k).

According to the docs, bnum is the

number of elements of the bucket array. If it is not more than 0, the default value is specified. The default value is 131071 (128K). Suggested size of the bucket array is about from 0.5 to 4 times of the number of all records to be stored.

So, when you're dealing with a large number of records with the Tokyo Cabinet HDB, don't forget to increase the size of bnum accordingly.
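For illustration, here's a minimal sketch of tuning bnum up front, assuming the Tokyo Cabinet Java binding's HDB class (the parameters mirror the C API's tchdbtune: bnum, apow, fpow, opts; check your binding's javadocs for the exact signature). The key point is that tune() must be called before the database file is created:

import tokyocabinet.HDB; // Tokyo Cabinet Java binding (assumed to be on the classpath)

public class TuneBnumExample {
  public static void main(String[] args) {
    HDB hdb = new HDB();

    // bnum should be roughly 0.5x to 4x the number of records you expect;
    // here we assume ~10 million records. -1 keeps the defaults for
    // apow/fpow, and 0 means no extra options.
    hdb.tune(10000000, -1, -1, 0);

    if (!hdb.open("casket.tch", HDB.OWRITER | HDB.OCREAT)) {
      System.err.println("open error: " + hdb.errmsg(hdb.ecode()));
      return;
    }
    hdb.put("key", "value");
    hdb.close();
  }
}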

Finally, FireFox is complete!

Posted by Kelvin on 23 Jan 2006 | Tagged as: work

I stumbled upon a FireFox extension (http://nextplease.mozdev.org/) that does what I've always missed from Opera – the ability to go to the next page in Google using keyboard shortcuts alone!

YES!!

The idea is, you use the space bar to scroll down a search result page, and when you hit the bottom, just hit Ctrl+Space (I mapped this key combination using NextPlease) and voila! The next page loads. Uber cool.

Getting a US work visa

Posted by Kelvin on 22 Jan 2006 | Tagged as: work

I've recently obtained my work visa to stay in the US, so I thought I'd write a bit about my experiences.

The major routes to getting a green card/work visa in the US are:

  • Green card lottery
  • Investor's green card
  • Work visa
  • Marriage visa
  • Extraordinary Alien visa

I will only talk about getting a work visa, but I have done a fair amount of research into obtaining the others, so feel free to contact me if you want some input on them (except extraordinary alien.. but if you're that special, you probably don't need any help).

Work visa

Assuming you're applying for a new job (i.e. not a transfer), the visa you want is the H1B. If you're Singaporean, Chilean or Australian, you're in luck. There are separate programmes (and, correspondingly, quotas) for nationals of these countries (Singapore: 5,400; Australia: 10,500). The Singaporean one is called the H1-B1, and the Australian one is called the E-3. The advantages are not only a dramatically lower cost and not having to compete with other foreigners, but also a faster processing time. In my case (Singaporean), my application was approved on the day itself, and I collected it 3 business days later.

If you're working for a multi-national company, company transfers are a really easy way to get visas. There are also the J-class visas for students (I think); those are worth exploring too.

Obtaining a work visa is a 2-step process:

  1. Getting a job offer (the difficult part)
  2. Applying for the visa

When you actually apply for the visa makes a difference. The visa quotas get renewed on 1st October every year. In other words, it's kinda silly to try to apply for a work visa in August or September; much better to wait a couple of months. However, in the case of Singapore at least, I know the quota was NOT reached last year at all. So it's better to check with your US embassy regarding quotas.

The city where you look for a job makes a big difference too. From my experience, big cities like New York and San Francisco are very open to helping their potential employees apply for work visas (i.e. it is generally not a disadvantage). In contrast, I had a couple of job possibilities in Boston turned down, even though I was a good fit for the job, just because of the visa thing.

The visa interview process should be a fairly simple one. Be prepared to explain why the company needs to bring in foreign talent, i.e. how you or your skills are special enough to justify hiring a foreigner.

In general, if you're looking to move to NYC to find work and are part of the open-source community, get in touch with me and I might be able to do some legwork for you (no promises!).

Crawling Basics

Posted by Kelvin on 27 Oct 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch, crawling

Here's my attempt at providing an introduction to crawling, in the hope that it'll clarify some of my own thoughts on the topic.

The most basic tasks of a HTTP crawler are:

  1. Download urls via the HTTP protocol.
  2. Extract urls from downloaded pages to be added to the download queue.

That's actually very simple, and crawlers are simple applications at heart. Two factors complicate crawler development:

  • they deal with large datasets, and require care when designing the appropriate data structures and algorithms for acceptable performance.
  • they require either multi-threaded or non-blocking IO implementations, and both are non-trivial to do well.

To be a bit more intelligent/useful, crawlers usually also provide some way of

  1. restricting the scope of the crawl
  2. prioritizing the download queue
  3. storing the downloaded pages for easy retrieval, preferably in compressed form
  4. logging and status reporting
  5. normalizing the urls
  6. being compliant with HTTP specifications, e.g. using robots.txt
  7. being polite to servers
  8. recrawling urls

Let's go through the interesting ones..

  • restricting the scope of the crawl: generally accomplished via 2 methods – duplicate url detection (removing urls that have already been downloaded or are already in the queue) and user-defined crawl scopes. The 1st is performed automatically by the crawler, and the 2nd by the user. Ways in which the user can limit the crawl scope include: limit urls to same hosts as seed urls, limit by depth from seed, regex, etc.

    A note about restricting the scope of a crawl by content, for whole-web focused crawlers: content-based sites tend to cluster (i.e. link to each other), and crawling an entire cluster is easy. However, clusters not linked from the seed urls are difficult to get to, and moving up out of a cluster can be difficult too: given two clusters A and B with link hubs A1 and B1, both A1 and B1 might link to some common site, but that common site may not link back to either of them.

  • prioritizing the download queue: important for both whole-web and focused crawlers, but for different reasons.

    Whole-web crawlers
    The last thing a whole-web crawler wants to do is crawl pages indiscriminately. It generally wants to crawl pages that are both authoritative and have a high incoming and outgoing link count (hubs) as early as possible. It also needs to distribute requests to be polite to its servers.

    Focused crawlers
    Focused crawlers can be further broken down into

    • whole-web focused crawlers which are interested in a subset of the whole-web, usually partitioned by language or content. Queue prioritization would be similar to whole-web.
    • Site mirroring focused crawlers (SMFC) which are only interested in their seed urls (and possibly all pages within that domain/host). SMFCs can radically decrease overall crawl time by being smart about prioritizing their urls, especially when the number of hosts to crawl is small (<100). Btw, I suspect that the majority of developers who use Nutch actually belong to this category.

  • url normalization: ensures that different textual forms of the same url don't sneak past duplicate detection or the crawl scope. Includes remembering redirects encountered, for example.
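As an illustration, here's a minimal (and deliberately incomplete) normalizer sketch using java.net.URI; real crawlers handle many more cases (session IDs, trailing slashes, equivalent index pages, and so on):

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class SimpleUrlNormalizer {

  /** Returns a canonical form of the url, or null if it can't be parsed. */
  public static String normalize(String url) {
    try {
      URI uri = new URI(url.trim()).normalize(); // resolves "." and ".." path segments
      String scheme = uri.getScheme() == null ? "http" : uri.getScheme().toLowerCase(Locale.ROOT);
      String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
      int port = uri.getPort();
      if (port == 80 && scheme.equals("http")) port = -1; // drop the default port
      String path = (uri.getPath() == null || uri.getPath().length() == 0) ? "/" : uri.getPath();
      // Drop the fragment (#...), which never reaches the server, but keep the
      // query string, which usually identifies a distinct page.
      return new URI(scheme, null, host, port, path, uri.getQuery(), null).toString();
    } catch (URISyntaxException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // prints http://example.com/b/index.html
    System.out.println(normalize("HTTP://Example.COM:80/a/../b/index.html#section"));
  }
}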

More advanced crawlers also

  • checkpoint at regular intervals so interrupted crawls can be recovered
  • detect and avoid bot traps
  • throttle bandwidth usage according to bandwidth availability and server response time
  • allow crawling speed to be scaled (linearly) by adding more crawl boxes

Ignoring the advanced features, the high-level modules are:

  • HTTP request and response module
  • Link extractor (can be regex or html parser)
  • download queue (we'll call this the fetchlist from now on), plus a record of which urls are currently being downloaded
  • database of downloaded urls
  • database of downloaded pages
  • link graph/database
  • url normalizer
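To show how these modules hang together, here's a highly simplified single-threaded sketch. The class and interface names are hypothetical stand-ins for the modules above, and it skips politeness, robots.txt, error handling and persistence:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class TinyCrawler {

  private final Queue<String> fetchlist = new ArrayDeque<String>(); // download queue
  private final Set<String> fetchedUrls = new HashSet<String>();    // database of downloaded urls

  // Hypothetical stand-ins for the HTTP module, link extractor and page database.
  interface Http { byte[] get(String url); }
  interface LinkExtractor { Iterable<String> extractLinks(String baseUrl, byte[] page); }
  interface PageStore { void save(String url, byte[] page); }

  public void crawl(Iterable<String> seeds, Http http, LinkExtractor extractor, PageStore store) {
    for (String seed : seeds) fetchlist.add(seed);
    while (!fetchlist.isEmpty()) {
      String url = fetchlist.poll();
      // Duplicate-url detection; a real crawler would normalize the url first.
      if (!fetchedUrls.add(url)) continue;
      byte[] page = http.get(url);       // HTTP request and response module
      store.save(url, page);             // database of downloaded pages
      for (String link : extractor.extractLinks(url, page)) {
        if (inScope(link)) fetchlist.add(link);
      }
    }
  }

  // User-defined crawl scope (limit by host, depth, regex, ...) goes here.
  private boolean inScope(String url) { return true; }
}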

That's all for now. In a follow-up post, I'll walk through the typical application flow of a crawler, and the interactions between the modules/data structures.

Practical introduction to Nutch MapReduce

Posted by Kelvin on 28 Sep 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch, crawling

Some terminology first:

Mapper
Performs the map() function. The name will make sense when you look at it as "mapping" a function/operation to elements in a list.
Reducer
Performs the reduce() function. Merges multiple input values with the same key to produce a single output value.
OutputFormat/InputFormat
Classes which tell Nutch how to process input files and what output format the MapReduce job is to produce.

Currently implemented classes: MapFile (output only), Text and SequenceFile. What this basically means is that, out of the box, Nutch supports SequenceFiles and text files as input formats for MapReduce jobs.

Partitioner
Splits up input into n different partitions, where n is the number of map tasks.
Combiner
Javadoc says: "Implements partial value reduction during mapping." There is no special interface for a Combiner. Rather, it is a Reducer which is configured to be used during the mapping phase.
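To make the terminology concrete, here's a tiny plain-Java sketch (not using the Nutch mapred API at all) of what the two phases mean: map applies a function to every input record to emit intermediate keys/values, and reduce merges all values that share a key into one output value:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceConcept {
  public static void main(String[] args) {
    List<String> records = Arrays.asList("hello world", "hello mapreduce");

    // "Map" phase: apply a function to each input record, emitting one
    // intermediate key (a word, with an implicit count of 1) per word.
    List<String> intermediateKeys = new ArrayList<String>();
    for (String record : records) {
      intermediateKeys.addAll(Arrays.asList(record.split("\\s+")));
    }

    // "Reduce" phase: merge all intermediate values sharing the same key
    // into a single output value per key -- here, by summing the counts.
    Map<String, Integer> output = new TreeMap<String, Integer>();
    for (String key : intermediateKeys) {
      Integer count = output.get(key);
      output.put(key, count == null ? 1 : count + 1);
    }

    System.out.println(output); // prints {hello=2, mapreduce=1, world=1}
  }
}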

To complete a MapReduce job successfully, Nutch requires at least

  • a directory containing input files (in the format determined by JobConf.setInputFormat, which defaults to newline-terminated text files)
  • a mapper or reducer (even though technically the job will still complete even if none is provided, as the HelloWorld example shows)
  • output format

To map or to reduce?
Rule of thumb: when performing an operation on a single input key/value pair, use a Mapper; when you need multiple inputs, or are merging files, use a Reducer.

Technically, the implementation of MapReduce in Nutch allows for the map function to be performed by only the Reducer, although this would seem to be a rather inappropriate use of a Reducer.

Hello World for MapReduce

Posted by Kelvin on 28 Sep 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch, crawling

Here's a Hello World tutorial as part of my attempts to grok MapReduce.

HelloWorld.java :

import org.apache.nutch.mapred.JobClient;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.util.NutchConf;
import java.io.File;

public class HelloWorld {

  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.out.println("HelloWorld ");
      System.exit(-1);
    }

    NutchConf defaults = NutchConf.get();
    JobConf job = new JobConf(defaults);
    job.setInputDir(new File(args[0]));             // directory containing the input file(s)
    job.setOutputFormat(ConsoleOutputFormat.class); // print each value to System.out (see below)
    JobClient.runJob(job);
  }
}

and ConsoleOutputFormat.java (for printing to System.out):

import org.apache.nutch.fs.NutchFileSystem;
import org.apache.nutch.io.Writable;
import org.apache.nutch.io.WritableComparable;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.mapred.OutputFormat;
import org.apache.nutch.mapred.RecordWriter;
import org.apache.nutch.mapred.Reporter;

import java.io.IOException;

public class ConsoleOutputFormat implements OutputFormat {
  public RecordWriter getRecordWriter(NutchFileSystem fs, JobConf job, String name) throws IOException {
    return new RecordWriter() {
      public void write(WritableComparable key, Writable value) {
        System.out.println(value);
      }

      public void close(Reporter reporter) {

      }
    };
  }
}

Now create a new directory someplace, and in it create a new file (say, foo.txt). Fire up your text editor, and in foo.txt enter:

Hello World

Important: there MUST be a newline at the end of the file, following the words "Hello World". It is also important (for the purposes of this tutorial) that you create a new directory and that this is the only file in it.

Now, run the HelloWorld application, providing the location of the directory where foo.txt resides, for example

java HelloWorld /tmp/mapreduce/

Note: it's best to run the application directly from your IDE, so you won't have to worry about adding the necessary libs to the classpath.

After some output from Nutch about parsing the config files, you should see the text Hello World.

Congratulations! You've run your first MapReduce program!

OC and Nutch MapReduce

Posted by Kelvin on 15 Sep 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch, crawling

(or what's next for OC)…

I've received a couple of emails about the future of OC vis-à-vis incorporation into the Nutch codebase, the upcoming MapReduce merge into trunk, etc.

My thoughts are:

  1. When MapReduce is merged into trunk, I'll make appropriate changes to OC to support MapReduce.
  2. This MapReduce-compatible OC will be offered to the Nutch codebase. As of today, I've removed most usages of JDK1.5 features, so the other thing that needs to be removed is Spring Framework dependencies.
  3. I _might_ keep a version of OC which is more experimental and uses Spring, and can operate on a more standalone basis. The advantages (more autonomy over what I can do and not, Spring support) will have to be balanced against the disadvantages (duplication).

27092005 edit
I'm browsing the MapReduce sources now, and I've found that the complexity of the Fetcher has increased rather significantly. I'm leaning towards maintaining a simple non-mapred fetcher for folks who don't need the MapReduce scalability.

Inside Our Crawler

Posted by Kelvin on 25 Aug 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Inspired by http://wiki.apache.org/nutch/DissectingTheNutchCrawler

I guess the contents of this post will eventually make it to javadocs. Note that Our Crawler (OC) is really not so different from Nutch Crawler (NC). This document highlights the main differences, as well as important classes.

CrawlTool
CrawlTool is the point of entry to OC. It doesn't do very much really, just calls Fetcher.run().

Fetcher
The Fetcher doesn't do very much either. It starts the FetcherThreads and provides them SeedURLs from a CrawlSeedSource. Its two other main responsibilities are distributing URLs amongst threads and reporting FetcherStatus. In a minor design quirk, the PostFetchProcessor is also found in Fetcher and not in FetcherThread instances, thus imposing the requirement that PostFetchProcessors be thread-safe. The rationale behind this is to be able to write to a single Nutch segment, instead of requiring an x-way post-fetch segment merge, where x is the number of fetcher threads. (I haven't put much thought into this. If post-fetch processing is substantial, then it makes more sense to make this a per-thread thing.)

FetcherThread
A FetcherThread has the following high-level responsibilities:

  1. Maintain the FetchList and the db of fetched urls for a set of urls (as determined by the Fetcher's url distribution strategy). FetcherThread is also responsible for avoiding duplicate URLs (whether fetched, parsed or output)
  2. Delegate the actual URL downloading to Http and HttpResponse classes
  3. Process the outcome of downloading, which primarily has the following steps:
    1. Parse the HTML
    2. Extract links from parsed page and run each through FetchListScope, adding to relevant thread's link queue (for subsequent adding into fetchlist) if link is allowed
    3. Run fetch output through PostFetchScope, passing to PostFetchProcessor if allowed

FetcherThread periodically moves items from its link queue to its fetchlist. The link queue is the thread-safe holding area where other fetcher threads add URLs to be fetched.

A note about the PostFetchProcessor: I know it sounds weird :-).
Initially I called it something like PageOutputter, which leaves an equally bad taste in my mouth. I wanted to capture the idea of something that happens _after_ a url/page is downloaded, whether saving to a Nutch segment, sending an email to someone every 100 downloaded pages or triggering a backup.

FetchList
Now on to the fetchlist system, one of the biggest differences between NC and OC. The main actors are FetchList and HostQueue. Unlike Nutch, where a FetchList is a sequence of URLs, our FetchList is a sequence of HostQueues, which are in turn sequences of URLs with the same host. The FetchList also manages the server politeness policy (unlike Nutch, where this is done in the Http class).

Different implementations of FetchList may choose different strategies of how to go about prioritizing certain hosts/URLs over others. This process can even be randomized (in which case, it somewhat simulates the Nutch fetchlist, although Nutch's fetchlist is deterministic and OC's DefaultFetchList is NOT).
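To make that concrete, here's a rough sketch of the idea (hypothetical, much simplified, and not the actual OC classes): a fetchlist as a collection of per-host queues, with the politeness delay enforced per host:

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** One queue of URLs per host, plus the earliest time that host may be hit again. */
class HostQueue {
  final Queue<String> urls = new ArrayDeque<String>();
  long nextFetchTime = 0; // politeness: don't fetch from this host before this timestamp
}

/** A fetchlist as a collection of HostQueues rather than a flat list of URLs. */
class SimpleFetchList {
  private final Map<String, HostQueue> hosts = new HashMap<String, HostQueue>();
  private final long serverDelayMillis;

  SimpleFetchList(long serverDelayMillis) { this.serverDelayMillis = serverDelayMillis; }

  synchronized void add(String host, String url) {
    HostQueue q = hosts.get(host);
    if (q == null) hosts.put(host, q = new HostQueue());
    q.urls.add(url);
  }

  /** Returns the next URL whose host is polite to hit right now, or null if none is ready. */
  synchronized String next() {
    long now = System.currentTimeMillis();
    for (HostQueue q : hosts.values()) {
      if (!q.urls.isEmpty() && q.nextFetchTime <= now) {
        q.nextFetchTime = now + serverDelayMillis; // enforce the per-host delay
        return q.urls.poll();
      }
    }
    return null; // every non-empty host is still inside its politeness window
  }
}

A smarter implementation would order the hosts (for example, by number of unfetched URLs) rather than iterating a HashMap in arbitrary order; that ordering is exactly where the different prioritization strategies plug in.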

LastModifiedDB and FetchedURLs
These are both new to OC, even though Nutch's WebDB serves as roughly their equivalent. The javadocs should be sufficient for understanding the roles of these 2 classes. One thing to note about FetchedURLs: to save space, the current implementation does _not_ save the actual URL, but rather a 64-bit checksum.
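For illustration, a checksum-based "have we fetched this?" check might look something like the sketch below. This is not the actual FetchedURLs implementation; it simply packs the first 8 bytes of an MD5 digest into a long as an example of a 64-bit checksum (collisions are possible, but very rare):

import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Tracks fetched URLs by a 64-bit checksum instead of storing the URLs themselves. */
public class FetchedUrlChecksums {

  private final Set<Long> checksums = new HashSet<Long>();

  /** Returns true if the url has not been seen before (and records it). */
  public boolean addIfNew(String url) {
    return checksums.add(Long.valueOf(checksum64(url)));
  }

  /** First 8 bytes of the MD5 digest of the url, packed into a long. */
  static long checksum64(String url) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
      long value = 0;
      for (int i = 0; i < 8; i++) {
        value = (value << 8) | (digest[i] & 0xFF);
      }
      return value;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // MD5 is always available in the JDK
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
  }
}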

Scopes and Filters
A Scope consists of zero or more filters, where given an input, each filter replies ALLOW, REJECT or ABSTAIN. Self-explanatory. When all filters in a scope abstain, then the scope's allowByDefault value kicks in (also used when a scope has no filters).

The different scopes in use are: FetchListScope, ParseScope and PostFetchScope.
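A sketch of that contract (hypothetical names, but following the ALLOW/REJECT/ABSTAIN semantics described above; here the first non-ABSTAIN answer wins, though the actual combination rule may differ):

import java.util.ArrayList;
import java.util.List;

public class ScopeExample {

  enum FilterDecision { ALLOW, REJECT, ABSTAIN }

  interface Filter {
    FilterDecision check(String url);
  }

  /** A scope is zero or more filters plus a default for the case where every filter abstains. */
  static class Scope {
    private final List<Filter> filters = new ArrayList<Filter>();
    private final boolean allowByDefault;

    Scope(boolean allowByDefault) { this.allowByDefault = allowByDefault; }

    Scope add(Filter f) { filters.add(f); return this; }

    boolean isAllowed(String url) {
      for (Filter f : filters) {
        FilterDecision d = f.check(url);
        if (d == FilterDecision.ALLOW) return true;
        if (d == FilterDecision.REJECT) return false;
        // ABSTAIN: fall through to the next filter
      }
      return allowByDefault; // no filters, or every filter abstained
    }
  }

  public static void main(String[] args) {
    // A FetchListScope-style scope that only allows urls on example.com.
    Scope fetchListScope = new Scope(false).add(new Filter() {
      public FilterDecision check(String url) {
        return url.startsWith("http://example.com/") ? FilterDecision.ALLOW : FilterDecision.ABSTAIN;
      }
    });
    System.out.println(fetchListScope.isAllowed("http://example.com/page")); // true
    System.out.println(fetchListScope.isAllowed("http://elsewhere.org/"));   // false (default)
  }
}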

Limitations of OC

Posted by Kelvin on 19 Aug 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Follow-up post to some Reflections on modifying the Nutch crawler.

The current implementation of Our Crawler (OC) has the following limitations:

  1. No support for distributed crawling.
    I neglected to mention that by building the fetchlist offline, the Nutch Crawler (NC) has an easier job splitting the crawl amongst different crawl servers. Furthermore, because the database of fetched URLs (WebDB) is also modified offline, it's really easy to check if a URL has already been fetched.

    Since both fetchlist building and the fetched-URL DB are handled online in OC, the multiple-crawl-server scenario complicates things somewhat. My justification for omitting this, in the initial phase at least, is that for most constrained crawls a single crawl server (with multiple threads) is the most probable use case, and also most likely sufficient.

  2. No load-balancing of URLs amongst threads.
    Currently URL distribution is excessively simple (url.getHost().hashCode() % numberOfThreads), and the only guarantee the Fetcher makes is that every URL from the same host will go to the same thread (an important guarantee, since each fetcher thread maintains its own fetchlist and database of fetched URLs). This also means that it's pointless using 20 threads to crawl 3 hosts (check out the previous post if you're not clear why). A small sketch of this scheme follows the list.

    However, when some hosts are significantly larger than others, it's highly probable that the number of pages each thread has to fetch is uneven, resulting in sub-optimal fetch times. This calls for a more sophisticated method of assigning URLs to threads that still maintains the thread-host contract.

  3. No bot-trap detection.
    With NC, since the depth of the crawl (determined by the number of iterations the fetcher is run) is limited, bot-traps (intentional or otherwise) aren't a major concern.

    When letting OC loose on a site, however, bot-traps can be a big problem because OC will continue to run as long as there are still items in the fetchlist.

  4. Disk writes in multiple threads may nullify SequenceFile performance gains.
    Each thread maintains its own database of fetched URLs (via a MapFile). When using a lot of threads, it's quite likely that multiple threads will be writing to disk at once. To be honest, I don't know enough about this to be sure it's a problem, but depending on hardware and disk utilization, I think it's possible that the disk heads end up jumping around to satisfy the simultaneous writes. If so, SequenceFile's premise of fast sequential write access is no longer valid.
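As mentioned in point 2, here's a small sketch of the current host-hash distribution scheme. It shows why the host-to-thread contract holds (the mapping depends only on the host string) and why it does nothing for load balancing:

import java.net.MalformedURLException;
import java.net.URL;

public class UrlDistribution {

  /** Every URL from the same host maps to the same thread index, because the
   *  result depends only on the host string. A single huge host still lands
   *  entirely on one thread, so this does not balance load. */
  static int threadFor(URL url, int numberOfThreads) {
    // take the absolute value, since hashCode() can be negative
    return Math.abs(url.getHost().hashCode() % numberOfThreads);
  }

  public static void main(String[] args) throws MalformedURLException {
    int threads = 20;
    System.out.println(threadFor(new URL("http://example.com/a"), threads));
    System.out.println(threadFor(new URL("http://example.com/b"), threads)); // same thread as above
    System.out.println(threadFor(new URL("http://another-host.org/"), threads));
  }
}

With this scheme and only 3 hosts, at most 3 of the 20 threads can ever receive work, which is the point made above.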

That's all I can think of right now. I'll update this list as I go along.

Reflections on modifying the Nutch crawler

Posted by Kelvin on 16 Aug 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch

The code for my modifications to the Nutch-based fetcher is at alpha quality, meaning that it compiles, and I've tested it on smallish (<100 pages) sites. There are some unit tests written, but not as extensive as I'd like.

Some thoughts:

  1. Nutch crawler (NC) uses an offline fetchlist building strategy, whilst our crawler (OC) supports both online as well as offline fetchlist building. Fetchlist building is basically the process of determining which URL to crawl next. In Nutch, the fetchlist building basically goes like this:
    1. Inject seed urls into database of urls
    2. Generate fetchlist
    3. Fetch urls in fetchlist
    4. Parse fetched pages for links, adding them into database of urls selectively (depending on url filter rules)
    5. Rank fetched pages (based on PageRank)
    6. Generate fetchlist using top x pages
    7. Repeat steps 3 – 6

    Since step 5 is both time-consuming and computationally-intensive, NC fetchlist building has to be an offline step. (NB: Breadth-first crawling can possibly be just as effective as PageRank-based crawling).

    In crawls where the number of fetched pages is limited and the order in which the pages are fetched doesn't matter, an offline fetchlist building strategy offers little or no advantage over an online one.

  2. Crawlers should be polite and not hammer a web server repeatedly with requests, so the crawler has to implement some kind of server politeness policy. The accepted norm is to wait a period of time between consecutive requests to the same server. To maximize crawl speed (and be polite), the crawler should try to download from as many different hosts at the same time as possible.

    Both of Nutch's current Http plugins (protocol-http and protocol-httpclient) are implemented in such a way that optimum crawl times can only be achieved when

    • the order in which the URLs are downloaded is sufficiently random
    • the set of hosts is sufficiently large
    • the set of hosts is well distributed

    In NC, this currently comprises:
    1. Generating an MD5 hash of each URL, then sorting fetchlist entries by this hash. Since a single-character change in the url results in a wildly different hash, this approach does a good job of randomizing the URL order (a small sketch of this follows at the end of this post). The other 2 requirements (a large and well-distributed set of hosts) are satisfied by the very definition of the WWW.
    2. Enforcing a fixed delay between consecutive requests to the same webserver (this is the fetcher.server.delay config variable).

    In the case of constrained crawling, when either the size or the distribution of hosts requirement is not satisfied, it's not difficult to see how crawl time is severely hampered: if there are too few hosts, or if the distribution of URLs is skewed towards a couple of large hosts, the crawl time will always be grossly sub-optimal.

    OC's default fetchlist prioritizes hosts based on the number of unfetched pages they contain. This satisfies the prerequisite for optimal crawl speed, and minimizes the chance of being left with "backed-up" urls from large, slow-to-fetch hosts, where the number of active fetching threads dwindles to the number of outstanding hosts. OC also waits a period of time between consecutive requests to the same host, and it's possible to vary this time depending on the time taken to download a URL (a slow server can mean an overloaded server, or it can mean a slow network).

    Note that OC's fetchlist is an interface, which means that Nutch's offline fetchlist building can be replicated by:
    1. Seed crawl from file (same as nutch)
    2. When crawl finishes, run Nutch fetchlist building steps
    3. Implement a simple fetchlist which reads entries from a Nutch fetchlist (.fl)
    4. Repeat steps 2 and 3.

    Actually, the current codebase already contains the necessary classes to do this.

  3. Java needs a library of external memory algorithms and data structures (I'm working on this now). Examples from Nutch are SequenceFile, MapFile and ArrayFile, as well as the external mergesort algorithm. Other examples are B+-tree, insertion sort for nearly sorted files, buffer tree, heap file, etc. The list goes on.
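To illustrate the randomization step from point 2, here's a small sketch of sorting urls by the hex form of their MD5 hashes; urls from the same host end up scattered across the fetchlist:

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class Md5SortedFetchlist {

  /** MD5 hash of the url as a 32-character lowercase hex string. */
  static String md5Hex(String url) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
      return String.format("%032x", new BigInteger(1, digest));
    } catch (Exception e) {
      throw new IllegalStateException(e); // MD5 and UTF-8 are always available
    }
  }

  public static void main(String[] args) {
    List<String> urls = new ArrayList<String>(Arrays.asList(
        "http://example.com/page1",
        "http://example.com/page2",
        "http://example.com/page3",
        "http://other-host.org/index"));

    // A one-character difference produces a completely different hash, so
    // consecutive urls from the same host land far apart in the sorted list.
    Collections.sort(urls, new Comparator<String>() {
      public int compare(String a, String b) {
        return md5Hex(a).compareTo(md5Hex(b));
      }
    });

    for (String url : urls) {
      System.out.println(url);
    }
  }
}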
