Mapping neighborhoods to street addresses via geocoding
Posted by Kelvin on 19 Apr 2010 | Tagged as: Lucene / Solr / Nutch, programming, Ubuntu
As far as I know, none of the geocoders consistently provide neighborhood data given a street address. Useful information when consulting the gods at google proves elusive too.
Here's a step-by-step guide to obtaining neighborhood names for your street addresses (on Ubuntu).
0. Geocode your addresses if necessary using Yahoo, MapQuest or Google geocoders. (this means converting addresses into latitude and longitude).
1. Install PostGIS.
sudo apt-get install postgresql-8.3-postgis
2. Complete the postgis install
sudo -u postgres createdb mydb
sudo -u postgres createlang plpgsql mydb
cd /usr/share/postgresql-8.3-postgis/
sudo -u postgres psql -d mydb -f lwpostgis.sql
sudo -u postgres psql -d mydb -f spatial_ref_sys.sql
3. Download and import Zillow neighborhood data. For this example, we'll be using California data.
cd /tmp
wget http://www.zillow.com/static/shp/ZillowNeighborhoods-CA.zip
unzip ZillowNeighborhoods-CA.zip
shp2pgsql ZillowNeighborhoods-CA public.neighborhood > ca.sql
sudo -u postgres psql -d mydb -f ca.sql
4. Connect to psql and run a query.
sudo -u postgres psql -d mydb
select name,city from public.neighborhood where ST_Within(makepoint(-122.4773980,37.7871760), the_geom)=true ;
If you've done everything right, this should be returned from the SQL:
name | city
----------------+---------------
Inner Richmond | San Francisco
(1 row)
Voila!g
Average length of a URL
Posted by Kelvin on 06 Nov 2009 | Tagged as: crawling, Lucene / Solr / Nutch, programming
Aug 16 update: I ran a more comprehensive analysis with a more complete dataset. Find out the new figures for the average length of a URL
I've always been curious what the average length of a URL is, mostly when approximating memory requirements of storing URLs in RAM.
Well, I did a dump of the DMOZ URLs, sorted and uniq-ed the list of URLs.
Ended up with 4074300 unique URLs weighing in at 139406406 bytes, which approximates to 34 characters per URL.
Idea: 2-stage recovery of corrupt Solr/Lucene indexes
Posted by Kelvin on 09 Sep 2009 | Tagged as: Lucene / Solr / Nutch, programming
I was recently onsite with a client who happened to have a corrupt Solr/Lucene index. The CheckIndex tool (lucene 2.4+) diagnosed the problem, and gave the option of fixing it.
Except... fixing the index in this case meant losing the corrupt segment, which also happened to be the one containing over 90% of documents.
Because Solr has the concept of a doc uid (which Lucene doesn't have), what I did was write a tool for them to dump out the uids in that corrupted segment into a text file, so after recovering the index, they were able to reindex the docs that were lost in that segment.
Using Hadoop IPC/RPC for distributed applications
Posted by Kelvin on 02 Jun 2008 | Tagged as: Lucene / Solr / Nutch, programming
Hadoop is growing to be a pretty large framework - release 0.17.0 has 483 classes!
Previously, I'd written about Hadoop SequenceFile. SequenceFile is part of the org.apache.hadoop.io package, the other notable useful classes in that package being ArrayFile and MapFile which are persistent array and dictionary data structures respectively.
About Hadoop IPC
Here, I'm going to introduce the Hadoop IPC, or Inter-Process Communication subsystem. Hadoop IPC is the underlying mechanism is used whenever there is a need for out-of-process applications to communicate with one another.
Hadoop IPC
1. uses binary serialization using java.io.DataOutputStream and java.io.DataInputStream, unlike SOAP or XML-RPC.
2. is a simple low-fuss RPC mechanism.
3. is unicast does not support multi-cast.
Why use Hadoop IPC over RMI or java.io.Serialization? Here's what Doug has to say:
Why didn't I use Serialization when we first started Hadoop? Because it looked big-and-hairy and I thought we needed something lean-and-mean, where we had precise control over exactly how objects are written and read, since that is central to Hadoop. With Serialization you can get some control, but you have to fight for it.
The logic for not using RMI was similar. Effective, high-performance inter-process communications are critical to Hadoop. I felt like we'd need to precisely control how things like connections, timeouts and buffers are handled, and RMI gives you little control over those.
Code!!!
I'm going to jump straight into some code examples to illustrate how easy Hadoop IPC can be.
Essentially, all unicast RPC calls involve a client and a server.
To create a server,
Configuration conf = new Configuration(); Server server = RPC.getServer(this, "localhost", 16000, conf); // start a server on localhost:16000 server.start();
To create a client,
Configuration conf = new Configuration();
InetSocketAddress addr = new InetSocketAddress("localhost", 16000); // the server's inetsocketaddress
ClientProtocol client = (ClientProtocol) RPC.waitForProxy(ClientProtocol.class,
ClientProtocol.versionID, addr, conf);
In this example, ClientProtocol is an interface implemented by the server class. ClientProtocol.java looks like this:
interface ClientProtocol extends org.apache.hadoop.ipc.VersionedProtocol {
public static final long versionID = 1L;
HeartbeatResponse heartbeat();
}
ClientProtocol defines a single method, heartbeat() which returns a HeartbeatResponse object. heartbeat() is the remote method called by the client which periodically lets the server know that it (the client) is still alive. The server then returns a HeartbeatResponse object which provides the client with some information.
Here's what HeartbeatResponse.java looks like:
public class HeartbeatResponse implements org.apache.hadoop.io.Writable {
String status;
public void write(DataOutput out) throws IOException {
UTF8.writeString(out, status);
}
public void readFields(DataInput in) throws IOException {
this.status = UTF8.readString(in);
}
}
Summary
So, to summarize, you'll need
1. A server which implements an interface (which itself extends VersionedProtocol)
2. 1 or more clients which call the interface methods.
3. Any method arguments or objects returned by methods must implement org.apache.hadoop.io.Writable
Comments Off
Is Nutch appropriate for aggregation-type vertical search?
Posted by Kelvin on 24 Sep 2007 | Tagged as: crawling, Lucene / Solr / Nutch, programming
I get pinged all the time by people who tell me they want to build a vertical search engine with Nutch. The part I can't figure out, though, is why Nutch?
What's vertical anyway?
So let's start from basics. Vertical search engines typically fall into 2 categories:
- Whole-web search engines which selectively crawl the Internet for webpages related to a certain topic/industry/etc.
- Aggregation-type search engines which mine other websites and databases, aggregating data and repackaging it into a format which is easier to search.
Now, imagine a biotech company comes to me to develop a search engine for everything related to biotechnology and genetics. You'd have to crawl as many websites as you can, and only include the ones related to biotechnology in the search index.
How would I implement the crawler? Probably use Nutch for the crawling and modify it to only extract links from a page if the page contents are relevant to biotechnology. I'd probably need to write some kind of relevancy scoring function which uses a mixture of keywords, ontology and some kind of similarity detection based on sites we know a priori to be relevant.
Now, second scenario. Imagine someone comes to me and want to develop a job search engine for a certain country. This would involve indexing all jobs posted in the 4 major job websites, refreshing this database on a daily basis, checking for new jobs, deleting expired jobs etc.
How would I implement this second crawler? Use Nutch? No way! Ahhhh, now we're getting to the crux of this post..
The ubiquity of Lucene ... and therefore Nutch
Nutch is one of two open-source Java crawlers out there, the other being Heritrix from the good guys at the Internet Archive. Its rode on Lucene as the default choice for full-text search API. Everyone who wants to build a vertical search engine in Java these days knows they're going to use Lucene as the search API, and naturally look to Nutch for the crawling side of things. And that's when their project runs into a brick wall...
To Nutch or not to Nutch
Nutch (and Hadoop) is a very very cool project with ambitious and praiseworthy goals. They're really trying to build an open-source version of Google (not sure if that actually is the explicitly declared aims).
Before jumping into any library or framework, you want to be sure you know what needs to be accomplished. I think this is the step many people skip: they have no idea what crawling is all about, so they try to learn what crawling is by observing what a crawler does. Enter Nutch.
The trouble is, observing/using Nutch isn't necessarily the best way to learn about crawling. The best way to learn about crawling is to build a simple crawler.
In fact, if you sit down and think what a 4 job-site crawler really needs to do, its not difficult to see that its functionality is modest and humble - in fact, I can write its algorithm out here:
for each site:
if there is a way to list all jobs in the site, then
page through this list, extracting job detail urls to the detail url database
else if exists browseable categories like industry or geographical location, then
page through these categories, extracting job detail urls to the detail url database
else
continue
for each url in the detail url database:
download the url
extract data into a database table according to predefined regex patterns
Won't be difficult to hack up something quick to do this, especially with the help of Commons HttpClient. You'll probably also want to make this app multi-threaded.
Other things you'll want to consider, is how many simultaneous threads to hit a server with, if you want to save the HTML content of pages vs just keeping the extracted data, how to deal with errors, etc.
All in all, I think you'll find that its not altogether overwhelming, and there's actually alot to be said for the complete control you have over the crawling and post-crawl extraction processes. Compare this to Nutch, where you'll need to fiddle with various configuration files (nutch-site.xml, urlfilters, etc), where calling apps from an API perspective is difficult, you'll have to work with the various file I/O structures to reach the content (SegmentFile, MapFile etc), various issues may prevent all urls from being fetched (retry.max being a common one), if you want custom crawl logic, you'll have to patch/fork the codebase (ugh!) etc.
The other thing that Nutch offers is an out-of-box search solution, but I personally have never found a compelling reason to use it - its difficult to add custom fields, adding OR phrase capability requires patching codebase, etc. In fact, I find it much much simpler to come up with my own SearchServlet.
Even if you decide not to come up with a homegrown solution, and you want to go with Nutch. Well, here's one other thing you need to know before jumping into Nutch.
To map-reduce, or not?
From Nutch 0.7 to Nutch 0.8, there was a pretty big jump in the code complexity with the inclusion of the map-reduce infrastructure. Map-reduce subsequently got factored out, together with some of the core distributed I/O classes into Hadoop.
For a simple example to illustrate my point, just take a look at the core crawler class, org.apache.nutch.fetcher.Fetcher, from the 0.7 branch, to the current 0.9 branch.
The 0.7 Fetcher is simple and easy to understand. I can't say the same of the 0.9 Fetcher. Even after having worked abit with the 0.9 fetcher and map-reduce, I still find myself having to do mental gymnastics to figure out what's going on. BUT THAT'S OK, because writing massively distributable, scaleable yet reliable applications is very very hard, and map-reduce makes this possible and comparatively easy. The question to ask though, is, does your search engine project to crawl and search those 4 job sites fall into this category? If not, you'd want to seriously consider against using the latest 0.8x release of Nutch, and tend to 0.7 instead. Of course, the biggest problem with this, is that 0.7 is not being actively maintained (to my knowledge).
Conclusion
Perhaps someone will read this post and think I'm slighting Nutch, so let me make this really clear: _for what its designed to do_, that is, whole-web crawling, Nutch does a good job of it; if what is needed is to page through search result pages and extract data into a database, Nutch is simply overkill.
Comments Off
Exploring Hadoop SequenceFile
Posted by Kelvin on 03 Jan 2007 | Tagged as: Lucene / Solr / Nutch
Hadoop's SequenceFile is at the heart of the Hadoop io package. Both MapFile (disk-backed Map) and ArrayFile (disk-backed Array) are built on top of SequenceFile.
So what exactly is SequenceFile? Its class javadoc tells us: Support for flat files of binary key/value pairs.- not very helpful.
Let's dig through the code and find out more:
- supports key/value pair, where key and value are any arbitrary classes which implement org.apache.hadoop.io.Writable
- contains 3 inner classes: Reader, Writer and Sorter
- Sorts are performed as external merge-sorts
- designed to be modified in batch, i.e. does NOT support appends or incremental updates. Modifications/appends involving creating a new SequenceFile, copying from the old->new, adding/changing values along the way
- compresses values using java.util.zip.Deflater after version 3
- from code comments: Inserts a globally unique 16-byte value every few entries, so that one can seek into the middle of a file and then synchronize with record starts and ends by scanning for this value
You might also be interested in a recent post on using Hadoop IPC/RPC for distributed applications.
Comments Off
PHP + Lucene integration
Posted by Kelvin on 01 Jan 2007 | Tagged as: Lucene / Solr / Nutch, programming
I've had very positive experiences integrating PHP front-end with a Lucene back-end, not with any fancy Java-in-PHP wizardry or Zend Lucene, but plain-old JSON-over-REST.
I've done some simple load tests, and it clearly outperforms Zend Lucene (though I don't can't offer you any numbers to back this claim up). Zend Lucene seems to suck bad at large (1GB+) indexes.
In terms of implementation, a servlet takes a query parameter, performs the search, then returns the results as JSON objects. There is a well defined API, and the solution offers the full expressiveness of Lucene query mechanism.
With a little creativity, you can even perform stuff like facet counting (see the left sidebar).
By abstracting the search presentation layer (JSON in this case), it would be a matter of minutes to add a XML or HTML display on top of it.
Comments Off
A simple API-friendly crawler
Posted by Kelvin on 01 Dec 2006 | Tagged as: crawling, Lucene / Solr / Nutch, programming
Alright. I know I've blogged about this before. Well, I'm revisiting it again.
My sense is that there's a real need for a simple crawler which is easy to use as an API and doesn't attempt to be everything to everyone.
Yes, Nutch is cool, but I'm so tired of fiddling around with configuration files, the proprietary fileformats, and the filesystem-dependence of plugins. Also, crawl progress reporting is poor unless you're intending to be parsing log files.
Here are some thoughts on what a simple crawler might look like:
Download all pages in a site
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setOutput(new SaveToDisk(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.setScope(new HostScope(url));
c.start();
Download all urls from a file (depth 1 crawl)
SimpleCrawler c = new SimpleCrawler();
c.setMaxConnectionsPerHost(5);
c.setIntervalBetweenConsecutiveRequests(1000);
c.addURLs(new File(file));
c.setLinkExtractor(null);
c.setOutput(new DirectoryPerDomain(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.start();
Page through a search results page via regex
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setLinkExtractor(new RegexLinkExtractor(regex));
c.setOutput(new SaveToDisk(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.start();
Save to nutch segment for compatibility
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setOutput(new NutchSegmentOutput(segmentdir));
c.setProgressListener(new StdOutProgressListener());
c.start();
I'm basically trying to find the sweet-spot between Commons HttpClient, and a full-blown crawler app like Nutch.
Thoughts?
Search and crawling internship
Posted by Kelvin on 31 Oct 2006 | Tagged as: crawling, Lucene / Solr / Nutch, programming
I'm looking for a competent Java programmer who wants to get into the world of search engines and crawlers. The internship will involve a mixture of (some) training and (mostly) hands-on projects with real-world clients.
In particular, my area of expertise is in vertical search (real estate, jobs, classifieds), so more than likely, that will be the bulk of the exposure.
You'll be working mostly with Lucene and Nutch, but more than just using them, you'll be delving into their innards and pushing the limits of these libraries.
Upon graduation, if you're interested, you'll be invited to stay on and work alongside me on contracting projects, some of which might involve on-site work, perhaps overseas.
The background to this is that I'm basically tired of turning down consulting work because of my limited time, and want to start building a search/crawling consulting practice.
So if you're interested, send your resume along to internship AT supermind.org, with a brief note on your background and interest in this field.
Comments Off
Nutch 0.8, Map & Reduce, here I come!
Posted by Kelvin on 09 Aug 2006 | Tagged as: crawling, Lucene / Solr / Nutch, programming
Finally taking the plunge to Nutch 0.8 after exclusively working with 0.7 for over a year (and something like 5 projects).
From initial experiences, it appears that using M&R does obfuscate the code somewhat for a developer who wants to build an app off the Nutch infrastructure instead of using it out-of-box. For example, trying to decipher what's going on in org.apache.nutch.indexer.Indexer is pretty difficult, compared to its 0.7 counterpart (IndexSegment).
Some serious documentation needs to be done around the implementation details of M&R. I'll keep posting about my learnings..
