Scrapy, being based on Twisted, introduces a host of obstacles to writing self-contained unit tests easily and efficiently:
1. You can't call reactor.run() multiple times
2. You can't stop the reactor multiple times, so you can't blindly call "crawler.signals.connect(reactor.stop, signal=signals.spider_closed)"
3. The reactor runs in its own thread, so failed assertions never reach the main unittest thread; they get thrown as AssertionErrors, but unittest never learns about them
To get around these hurdles, I created a BaseScrapyTestCase class that uses tl.testing's ThreadAwareTestCase and the following workarounds.
You'll use it like so:
1. Call run_reactor() at the end of each test method.
2. Place your assertions in their own function, which gets called in a ThreadJoiner so that unittest sees any assertion failures.
3. If you're testing multiple spiders, just call queue_spider() for each, and run_reactor() at the end.
4. BaseScrapyTestCase keeps track of the crawlers created, and makes sure to only attach a reactor.stop signal to the last one.
Let me know if you come up with a better/more elegant way of testing scrapy spiders!
Jetty 6/7 contain an HttpClient class that makes it uber-easy to issue non-blocking HTTP requests in Java. Here is a code snippet to get you started.
Initialize the HttpClient object.
// 30 seconds timeout; if no server reply, the request expires
Create a ContentExchange object which encapsulates the HTTP request/response interaction.
We override the onResponseComplete() method to print the response body to console.
By default, the request is performed asynchronously. To run the request synchronously, all you need to do is add the following line, which blocks until the exchange completes:
exchange.waitForDone();
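Putting the pieces together, here is a hedged sketch against the Jetty 6-era client API (org.mortbay.jetty.client); class and method names shifted between Jetty versions, so verify against the version you're using:

```java
import java.io.IOException;
import org.mortbay.jetty.client.ContentExchange;
import org.mortbay.jetty.client.HttpClient;

public class AsyncHttpExample {
    public static void main(String[] args) throws Exception {
        // Initialize the HttpClient object
        HttpClient client = new HttpClient();
        client.setConnectorType(HttpClient.CONNECTOR_SELECT_CHANNEL);
        client.setTimeout(30000); // 30 seconds timeout; if no server reply, the request expires
        client.start();

        // ContentExchange encapsulates the HTTP request/response interaction;
        // onResponseComplete() is overridden to print the response body to console
        ContentExchange exchange = new ContentExchange() {
            protected void onResponseComplete() throws IOException {
                super.onResponseComplete();
                System.out.println(getResponseContent());
            }
        };
        exchange.setMethod("GET");
        exchange.setURL("http://example.com/");
        client.send(exchange);

        // For a synchronous request, block until the exchange finishes:
        exchange.waitForDone();
    }
}
```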
WebDriver is a fantastic Java API for web application testing. It has recently been merged into the Selenium project to provide a friendlier API for programmatically simulating browser actions. Its unique property is that it executes web pages in real browsers such as Firefox, Chrome, IE etc, and then gives you programmatic access to the resulting DOM model.
The problem with WebDriver, though, as reported here, is that because the underlying browser implementation does the actual fetching (as opposed to, say, Commons HttpClient), it's currently not possible to obtain the HTTP request and response headers, which is kind of a PITA.
I present here a method of obtaining HTTP request and response headers via an embedded proxy, derived from the Proxoid project.
ProxyLight from Proxoid
ProxyLight is the lightweight standalone proxy from the Proxoid project. It's released under the Apache Public License.
The original code only provided request filtering, and performed no response filtering, forwarding data directly from the web server to the requesting client.
I made some modifications to intercept and parse HTTP response headers.
Get my version here (released under APL): http://downloads.supermind.org/proxylight-20110622.zip
Using ProxyLight from WebDriver
The modified ProxyLight allows you to process both request and response.
This has the added benefit of allowing you to write a RequestFilter which ignores images, or URLs from certain domains. Sweet!
What your WebDriver code has to do then, is:
- Ensure the ProxyLight server is started
- Add Request and Response Filters to the ProxyLight server
- Maintain a cache of intercepted requests and responses which you can then retrieve
- Ensure the native browser uses our ProxyLight server
Here's a sample class to get you started
// LRU response table. Note: this is not thread-safe.
// Use ConcurrentLinkedHashMap instead: http://code.google.com/p/concurrentlinkedhashmap/
/* Get the native browser to use our proxy */
FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("network.proxy.type", 1);
profile.setPreference("network.proxy.http", "localhost");
profile.setPreference("network.proxy.http_port", proxyPort); // whatever port ProxyLight listens on
// Now fetch the URL
// this response filter adds the intercepted response to the cache
// add request filters here if needed
// now start the proxy
I'm a little slow off the block here, but I just wanted to mention that Solr 3.2 has been released!
Get your download here: http://www.apache.org/dyn/closer.cgi/lucene/solr
Solr 3.2 release highlights include
- Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
- TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
- DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
- Improvements to the UIMA and Carrot2 integrations
I had personally been looking forward to the overwrite request param addition to JSON update format, so I'm delighted about this release.
Great work guys!
Just so no-one forgets, here's a recap of the Fallacies of Distributed Computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn’t change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
The problem is described here:
I successfully tracked the problem to the "Connection:" header. It seems that
if the "Connection: keep-alive" request header is not sent the server will
respond with data which is not chunked. It will still reply with a
"Transfer-Encoding: chunked" response header though.
I don't think this behavior is normal and it is not a cURL problem. I'll
consider the case closed but if somebody wants to make something about it I
can send additional info and test it further.
The workaround is simple: have curl use HTTP version 1.0 instead of 1.1.
In PHP, add this (where $ch is your curl handle):
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0);
There's a static Hashtable in java.net.URL (urlStreamHandlers) which is hit on every constructor call. Well, it turns out that when you're running a crawler with, say, 50 threads, this becomes a major bottleneck.
Of the 70 threads I had running, 48 were blocked on the java.net.URL constructor. I was using the URL class for resolving relative URLs to absolute ones.
Since I had previously written a URL parser to parse out the parts of a URL, I went ahead and implemented my own URL resolution function.
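My resolution function itself isn't reproduced here, but as an illustrative sketch, the stdlib java.net.URI class performs the same RFC 3986 resolution without ever touching java.net.URL's synchronized handler table:

```java
import java.net.URI;

public class UrlResolver {
    // Resolve a possibly-relative URL reference against a base URL using
    // java.net.URI, which avoids the static Hashtable lock in java.net.URL.
    public static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // "../c.html" against "/a/b.html" normalizes to "/c.html"
        System.out.println(resolve("http://example.com/a/b.html", "../c.html"));
    }
}
```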
Before:
Status: 12.407448 pages/s, 207.06316 kb/s, 2136.143 bytes/page
After, having increased the number of threads to 100 (which would not have made much difference with the java.net.URL implementation):
Status: 43.9947 pages/s, 557.29156 kb/s, 1621.4071 bytes/page
Aug 16 update: I ran a more comprehensive analysis with a more complete dataset; see that post for the new figures on the average length of a URL.
I've always been curious what the average length of a URL is, mostly when approximating memory requirements of storing URLs in RAM.
Well, I did a dump of the DMOZ URLs, sorted and uniq-ed the list of URLs.
Ended up with 4074300 unique URLs weighing in at 139406406 bytes, which approximates to 34 characters per URL.
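To put that figure to use for memory sizing (my own back-of-envelope arithmetic; the 40-byte per-String overhead is a JVM-dependent assumption, not a measured value):

```java
public class UrlMemoryEstimate {
    public static void main(String[] args) {
        long totalBytes = 139406406L; // size of the sorted, uniq-ed DMOZ dump
        long urlCount = 4074300L;     // unique URLs in the dump
        long avgChars = totalBytes / urlCount; // ~34 characters per URL
        System.out.println(avgChars);

        // Rough RAM needed to hold 10 million URLs as Java Strings:
        // 2 bytes per char, plus an assumed ~40 bytes of per-object overhead.
        long perUrl = avgChars * 2 + 40;
        System.out.println(perUrl * 10000000L / (1024 * 1024) + " MB");
    }
}
```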
What's vertical anyway?
So let's start from basics. Vertical search engines typically fall into 2 categories:
- Whole-web search engines which selectively crawl the Internet for webpages related to a certain topic/industry/etc.
- Aggregation-type search engines which mine other websites and databases, aggregating data and repackaging it into a format which is easier to search.
Now, imagine a biotech company comes to me to develop a search engine for everything related to biotechnology and genetics. You'd have to crawl as many websites as you can, and only include the ones related to biotechnology in the search index.
How would I implement the crawler? Probably use Nutch for the crawling and modify it to only extract links from a page if the page contents are relevant to biotechnology. I'd probably need to write some kind of relevancy scoring function which uses a mixture of keywords, ontology and some kind of similarity detection based on sites we know a priori to be relevant.
Now, second scenario. Imagine someone comes to me and wants to develop a job search engine for a certain country. This would involve indexing all jobs posted on the 4 major job websites, refreshing this database on a daily basis, checking for new jobs, deleting expired jobs etc.
How would I implement this second crawler? Use Nutch? No way! Ahhhh, now we're getting to the crux of this post..
The ubiquity of Lucene … and therefore Nutch
Nutch is one of two open-source Java crawlers out there, the other being Heritrix from the good guys at the Internet Archive. Nutch rode on Lucene's position as the default choice for a full-text search API. Everyone who wants to build a vertical search engine in Java these days knows they're going to use Lucene as the search API, and they naturally look to Nutch for the crawling side of things. And that's when their project runs into a brick wall…
To Nutch or not to Nutch
Nutch (and Hadoop) is a very, very cool project with ambitious and praiseworthy goals. They're really trying to build an open-source version of Google (though I'm not sure if that is actually the explicitly declared aim).
Before jumping into any library or framework, you want to be sure you know what needs to be accomplished. I think this is the step many people skip: they have no idea what crawling is all about, so they try to learn what crawling is by observing what a crawler does. Enter Nutch.
The trouble is, observing/using Nutch isn't necessarily the best way to learn about crawling. The best way to learn about crawling is to build a simple crawler.
In fact, if you sit down and think about what a 4-job-site crawler really needs to do, it's not difficult to see that its functionality is modest and humble – in fact, I can write its algorithm out here:
for each site:
    if there is a way to list all jobs in the site, then
        page through this list, extracting job detail urls to the detail url database
    else if there exist browseable categories like industry or geographical location, then
        page through these categories, extracting job detail urls to the detail url database
    else
        continue
for each url in the detail url database:
    download the url
    extract data into a database table according to predefined regex patterns
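The two phases above can be sketched in plain Java. Everything here (the Fetcher stub, the URLs, the regexes) is hypothetical scaffolding rather than any real site's markup; a production version would fetch over HTTP (e.g. with Commons HttpClient) and persist to a real database:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JobCrawlerSketch {
    // Stand-in for an HTTP fetch; a real crawler would use Commons HttpClient.
    interface Fetcher { String fetch(String url); }

    // Phase 1: page through a listing, collecting job detail URLs.
    static List<String> collectDetailUrls(Fetcher f, String listingUrl, Pattern detailLink) {
        List<String> urls = new ArrayList<>();
        Matcher m = detailLink.matcher(f.fetch(listingUrl));
        while (m.find()) urls.add(m.group(1));
        return urls;
    }

    // Phase 2: download a detail page and extract fields by predefined regex.
    static Map<String, String> extract(Fetcher f, String url, Pattern titlePattern) {
        Map<String, String> row = new HashMap<>();
        Matcher m = titlePattern.matcher(f.fetch(url));
        if (m.find()) row.put("title", m.group(1));
        return row;
    }

    public static void main(String[] args) {
        // A fake two-page "web" so the sketch runs without a network.
        Map<String, String> fakeWeb = new HashMap<>();
        fakeWeb.put("http://jobs.example/list",
            "<a href=\"http://jobs.example/job/1\">x</a>");
        fakeWeb.put("http://jobs.example/job/1", "<h1>Java Developer</h1>");
        Fetcher f = fakeWeb::get;

        for (String u : collectDetailUrls(f, "http://jobs.example/list",
                Pattern.compile("href=\"(http://jobs\\.example/job/\\d+)\""))) {
            System.out.println(extract(f, u, Pattern.compile("<h1>(.*?)</h1>")));
        }
    }
}
```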
It won't be difficult to hack up something quick to do this, especially with the help of Commons HttpClient. You'll probably also want to make this app multi-threaded.
Other things you'll want to consider are how many simultaneous threads to hit a server with, whether you want to save the HTML content of pages or just keep the extracted data, how to deal with errors, etc.
All in all, I think you'll find that it's not altogether overwhelming, and there's actually a lot to be said for the complete control you have over the crawling and post-crawl extraction processes. Compare this to Nutch, where you'll need to fiddle with various configuration files (nutch-site.xml, urlfilters, etc), where calling the app from an API perspective is difficult, where you'll have to work with various file I/O structures to reach the content (SegmentFile, MapFile etc), where various issues may prevent all urls from being fetched (retry.max being a common one), and where, if you want custom crawl logic, you'll have to patch/fork the codebase (ugh!).
The other thing that Nutch offers is an out-of-box search solution, but I personally have never found a compelling reason to use it – it's difficult to add custom fields, adding OR phrase capability requires patching the codebase, etc. In fact, I find it much, much simpler to come up with my own SearchServlet.
Even if you decide not to come up with a homegrown solution and you want to go with Nutch, well, here's one other thing you need to know before jumping in.
To map-reduce, or not?
From Nutch 0.7 to Nutch 0.8, there was a pretty big jump in the code complexity with the inclusion of the map-reduce infrastructure. Map-reduce subsequently got factored out, together with some of the core distributed I/O classes into Hadoop.
The 0.7 Fetcher is simple and easy to understand. I can't say the same of the 0.9 Fetcher. Even after having worked a bit with the 0.9 fetcher and map-reduce, I still find myself having to do mental gymnastics to figure out what's going on. BUT THAT'S OK, because writing massively distributable, scalable yet reliable applications is very, very hard, and map-reduce makes this possible and comparatively easy. The question to ask, though, is: does your search engine project to crawl and search those 4 job sites fall into this category? If not, you'd want to seriously consider avoiding the latest 0.8x release of Nutch and leaning towards 0.7 instead. Of course, the biggest problem with this is that 0.7 is not being actively maintained (to my knowledge).
Perhaps someone will read this post and think I'm slighting Nutch, so let me make this really clear: _for what it's designed to do_, that is, whole-web crawling, Nutch does a good job of it; if what is needed is to page through search result pages and extract data into a database, Nutch is simply overkill.
Alright. I know I've blogged about this before. Well, I'm revisiting it again.
My sense is that there's a real need for a simple crawler which is easy to use as an API and doesn't attempt to be everything to everyone.
Yes, Nutch is cool, but I'm so tired of fiddling around with configuration files, the proprietary file formats, and the filesystem-dependence of plugins. Also, crawl progress reporting is poor unless you intend to parse log files.
Here are some thoughts on what a simple crawler might look like:
Download all pages in a site
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setOutput(new SaveToDisk(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.setScope(new HostScope(url));
c.start();
Download all urls from a file (depth 1 crawl)
SimpleCrawler c = new SimpleCrawler();
c.setMaxConnectionsPerHost(5);
c.setIntervalBetweenConsecutiveRequests(1000);
c.addURLs(new File(file));
c.setLinkExtractor(null);
c.setOutput(new DirectoryPerDomain(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.start();
Page through a search results page via regex
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setLinkExtractor(new RegexLinkExtractor(regex));
c.setOutput(new SaveToDisk(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.start();
Save to nutch segment for compatibility
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setOutput(new NutchSegmentOutput(segmentdir));
c.setProgressListener(new StdOutProgressListener());
c.start();
I'm basically trying to find the sweet-spot between Commons HttpClient, and a full-blown crawler app like Nutch.