What's new in Solr 3.4.0
Posted by Kelvin on 06 Oct 2011 | Tagged as: Lucene / Solr / Nutch
If you are already using Apache Solr 3.1, 3.2 or 3.3, it's strongly recommended you upgrade to 3.4.0 because of the index corruption bug on OS or computer crash or power loss (LUCENE-3418), now fixed in 3.4.0.
Solr 3.4.0 release highlights include:
- Bug fixes and improvements from Apache Lucene 3.4.0, including a major bug (LUCENE-3418) whereby a Lucene index could easily become corrupted if the OS or computer crashed or lost power.
- SolrJ client can now parse grouped and range facets results (SOLR-2523).
- A new XsltUpdateRequestHandler allows posting XML that's transformed by a provided XSLT into a valid Solr document (SOLR-2630).
- Post-group faceting option (group.truncate) can now compute facet counts for only the highest ranking documents per-group (SOLR-2665).
- Add commitWithin update request parameter to all update handlers that were previously missing it. This tells Solr to commit the change within the specified amount of time (SOLR-2540).
- You can now specify NIOFSDirectory (SOLR-2670).
- New parameter hl.phraseLimit speeds up FastVectorHighlighter (LUCENE-3234).
- The query cache and filter cache can now be disabled per request. See this wiki page (SOLR-2429).
- Improved memory usage, build time, and performance of SynonymFilterFactory (LUCENE-3233).
- Added omitPositions to the schema, so you can omit position information while still indexing term frequencies (LUCENE-2048).
- Various fixes for multi-threaded DataImportHandler.
See the release notes for a more complete list of all the new features, improvements, and bugfixes.
As usual, the download is available here: http://www.apache.org/dyn/closer.cgi/lucene/solr/
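As a rough illustration of the per-request cache control (SOLR-2429): a filter query can be prefixed with the {!cache=false} local parameter so that it is not cached for that request. The sketch below only builds the request URL. The class name, base URL and field name are assumptions made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a select URL whose filter query bypasses the filter cache.
class UncachedFilterUrl {

    // URL-encodes a parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Prefixes the filter with {!cache=false} so it is not cached for this request.
    static String selectUrl(String solrBase, String query, String filter) {
        return solrBase + "/select?q=" + enc(query)
                + "&fq=" + enc("{!cache=false}" + filter);
    }

    public static void main(String[] args) {
        // Base URL and field name are illustrative only.
        System.out.println(selectUrl("http://localhost:8983/solr", "*:*", "inStock:true"));
    }
}
```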
Introducing SolrTutorial.com
Posted by Kelvin on 02 Oct 2011 | Tagged as: Lucene / Solr / Nutch
Just launched a Solr tutorial website, a site styled after my LuceneTutorial.com but tailored towards Solr users.
It also includes high-level overviews of Solr for non-programmers, such as Solr for Managers and Solr for SysAdmins.
HOWTO: Collect WebDriver HTTP Request and Response Headers
Posted by Kelvin on 22 Jun 2011 | Tagged as: crawling, Lucene / Solr / Nutch, programming
WebDriver is a fantastic Java API for web application testing. It has recently been merged into the Selenium project to provide a friendlier API for programmatic simulation of web browser actions. Its unique property is that it executes web pages in real web browsers such as Firefox, Chrome and IE, and then gives you programmatic access to the DOM.
The problem with WebDriver, though, as reported here, is that because the underlying browser implementation does the actual fetching (as opposed to, say, Commons HttpClient), it's currently not possible to obtain the HTTP request and response headers, which is kind of a PITA.
I present here a method of obtaining HTTP request and response headers via an embedded proxy, derived from the Proxoid project.
ProxyLight from Proxoid
ProxyLight is the lightweight standalone proxy from the Proxoid project. It's released under the Apache Public License.
The original code only provided request filtering, and performed no response filtering, forwarding data directly from the web server to the requesting client.
I made some modifications to intercept and parse HTTP response headers.
Get my version here (released under APL): http://downloads.supermind.org/proxylight-20110622.zip
Using ProxyLight from WebDriver
The modified ProxyLight allows you to process both request and response.
This has the added benefit of allowing you to write a RequestFilter which ignores images, or URLs from certain domains. Sweet!
What your WebDriver code has to do then, is:
- Ensure the ProxyLight server is started
- Add Request and Response Filters to the ProxyLight server
- Maintain a cache of intercepted requests and responses which you can then retrieve
- Ensure the native browser uses our ProxyLight server
Here's a sample class to get you started:
import com.mba.proxylight.ProxyLight;
import com.mba.proxylight.Response;
import com.mba.proxylight.ResponseFilter;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;
import java.util.LinkedHashMap;
import java.util.Map;

public class SampleWebDriver {
    protected int localProxyPort = 5368;
    protected ProxyLight proxy;

    // Bounded response table: drops the oldest entry once it holds more than 100.
    // Note: this is not thread-safe.
    // Use ConcurrentLinkedHashMap instead: http://code.google.com/p/concurrentlinkedhashmap/
    private LinkedHashMap<String, Response> responseTable = new LinkedHashMap<String, Response>() {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Response> eldest) {
            return size() > 100;
        }
    };

    public Response fetch(String url) {
        if (proxy == null) {
            initProxy();
        }
        FirefoxProfile profile = new FirefoxProfile();
        // Get the native browser to use our proxy
        profile.setPreference("network.proxy.type", 1);
        profile.setPreference("network.proxy.http", "localhost");
        profile.setPreference("network.proxy.http_port", localProxyPort);
        FirefoxDriver driver = new FirefoxDriver(profile);
        try {
            // Now fetch the URL
            driver.get(url);
            // The response filter below has cached the intercepted response by URL
            return responseTable.remove(driver.getCurrentUrl());
        } finally {
            // Close the browser so repeated fetches don't leak Firefox instances
            driver.quit();
        }
    }

    private void initProxy() {
        proxy = new ProxyLight();
        this.proxy.setPort(localProxyPort);
        // this response filter adds the intercepted response to the cache
        this.proxy.getResponseFilters().add(new ResponseFilter() {
            public void filter(Response response) {
                responseTable.put(response.getRequest().getUrl(), response);
            }
        });
        // add request filters here if needed

        // now start the proxy
        try {
            this.proxy.start();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        SampleWebDriver driver = new SampleWebDriver();
        Response res = driver.fetch("http://www.lucenetutorial.com");
        System.out.println(res.getHeaders());
    }
}
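The sample class notes that its response table is not thread-safe. One stdlib-only alternative, sketched here on the assumption that a single coarse lock is acceptable for your load, is an access-ordered LinkedHashMap wrapped in Collections.synchronizedMap. The class name LruTable and the capacity are illustrative and not part of ProxyLight or WebDriver.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: a small thread-safe LRU map built from JDK classes only.
class LruTable {

    static <K, V> Map<K, V> create(final int maxEntries) {
        // accessOrder=true keeps the least recently accessed entry first,
        // so removeEldestEntry evicts in LRU order.
        LinkedHashMap<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
        // Every call goes through one lock; simple, but it serializes access.
        return Collections.synchronizedMap(lru);
    }

    public static void main(String[] args) {
        Map<String, String> table = create(2);
        table.put("a", "1");
        table.put("b", "2");
        table.get("a");      // touch "a" so "b" becomes the eldest entry
        table.put("c", "3"); // evicts "b"
        System.out.println(table.keySet());
    }
}
```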
Solr 3.2 released!
Posted by Kelvin on 22 Jun 2011 | Tagged as: crawling, Lucene / Solr / Nutch, programming
I'm a little slow off the blocks here, but I just wanted to mention that Solr 3.2 has been released!
Get your download here: http://www.apache.org/dyn/closer.cgi/lucene/solr
Solr 3.2 release highlights include
- Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
- TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
- DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
- Improvements to the UIMA and Carrot2 integrations
I had personally been looking forward to the overwrite request param addition to JSON update format, so I'm delighted about this release.
Great work guys!
Recap: The Fallacies of Distributed Computing
Posted by Kelvin on 01 Mar 2011 | Tagged as: crawling, Lucene / Solr / Nutch, programming
Just so no one forgets, here's a recap of the Fallacies of Distributed Computing:
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn’t change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Solandra - Solr running on Cassandra
Posted by Kelvin on 21 Oct 2010 | Tagged as: Lucene / Solr / Nutch
Courtesy of Nick Lothian:
http://nicklothian.com/blog/2009/10/27/solr-cassandra-solandra/
What's new in Solr 1.4.1
Posted by Kelvin on 20 Oct 2010 | Tagged as: Lucene / Solr / Nutch
Solr 1.4.1 is a bug-fix release. No new features.
Here's the list of bugs that were fixed.
* SOLR-1934: Upgrade to Apache Lucene 2.9.3 to obtain several bug
fixes from the previous 2.9.1. See the Lucene 2.9.3 release notes
for details. (hossman, Mark Miller)
* SOLR-1432: Make the new ValueSource.getValues(context,reader) delegate
to the original ValueSource.getValues(reader) so custom sources
will work. (yonik)
* SOLR-1572: FastLRUCache correctly implemented the LRU policy only
for the first 2B accesses. (yonik)
* SOLR-1595: StreamingUpdateSolrServer used the platform default character
set when streaming updates, rather than using UTF-8 as the HTTP headers
indicated, leading to an encoding mismatch. (hossman, yonik)
* SOLR-1660: CapitalizationFilter crashes if you use the maxWordCountOption
(Robert Muir via shalin)
* SOLR-1662: Added Javadocs in BufferedTokenStream and fixed incorrect cloning
in TestBufferedTokenStream (Robert Muir, Uwe Schindler via shalin)
* SOLR-1711: SolrJ - StreamingUpdateSolrServer had a race condition that
could halt the streaming of documents. The original patch to fix this
(never officially released) introduced another hanging bug due to
connections not being released. (Attila Babo, Erik Hetzner via yonik)
* SOLR-1748, SOLR-1747, SOLR-1746, SOLR-1745, SOLR-1744: Streams and Readers
retrieved from ContentStreams are not closed in various places, resulting
in file descriptor leaks.
(Christoff Brill, Mark Miller)
* SOLR-1580: Solr Configuration ignores 'mergeFactor' parameter, always
uses Lucene default. (Lance Norskog via Mark Miller)
* SOLR-1777: fieldTypes with sortMissingLast=true or sortMissingFirst=true can
result in incorrectly sorted results. (yonik)
* SOLR-1797: fix ConcurrentModificationException and potential memory
leaks in ResourceLoader. (yonik)
* SOLR-1798: Small memory leak (~100 bytes) in fastLRUCache for every
commit. (yonik)
* SOLR-1522: Show proper message if script tag is missing for DIH
ScriptTransformer (noble)
* SOLR-1538: Reordering of object allocations in ConcurrentLRUCache to eliminate
(an extremely small) potential for deadlock.
(gabriele renzi via hossman)
* SOLR-1558: QueryElevationComponent only works if the uniqueKey field is
implemented using StrField. In previous versions of Solr no warning or
error would be generated if you attempted to use QueryElevationComponent,
it would just fail in unexpected ways. This has been changed so that it
will fail with a clear error message on initialization. (hossman)
* SOLR-1563: Binary fields, including trie-based numeric fields, caused null
pointer exceptions in the luke request handler. (yonik)
* SOLR-1579: Fixes to XML escaping in stats.jsp
(David Bowen and hossman)
* SOLR-1582: copyField was ignored for BinaryField types (gsingers)
* SOLR-1596: A rollback operation followed by the shutdown of Solr
or the close of a core resulted in a warning:
"SEVERE: SolrIndexWriter was not closed prior to finalize()" although
there were no other consequences. (yonik)
* SOLR-1651: Fixed Incorrect dataimport handler package name in SolrResourceLoader
(Akshay Ukey via shalin)
* SOLR-1936: The JSON response format needed to escape unicode code point
U+2028 - 'LINE SEPARATOR' (Robert Hofstra, yonik)
* SOLR-1852: Fix WordDelimiterFilterFactory bug where position increments
were not being applied properly to subwords. (Peter Wolanin via Robert Muir)
* SOLR-1706: fixed WordDelimiterFilter for certain combinations of options
where it would output incorrect tokens. (Robert Muir, Chris Male)
* SOLR-1948: PatternTokenizerFactory should use parent's args (koji)
* SOLR-1870: Indexing documents using the 'javabin' format no longer
fails with a ClassCastException when SolrInputDocuments contain field
values which are Collections or other classes that implement
Iterable. (noble, hossman)
* SOLR-1769: Solr 1.4 Replication - Repeater throwing NullPointerException (noble)
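To make the SOLR-1936 entry above a little more concrete: U+2028 is legal inside a JSON string, but older JavaScript engines treat it as a line terminator, so JSON responses evaluated as script can break unless the character is escaped. The following is a minimal sketch of that one escape, not Solr's actual JSON writer; the class and method names are made up, and a real writer escapes other characters as well.

```java
// Hypothetical helper: escapes only the LINE SEPARATOR character (U+2028).
class JsonLineSeparatorEscaper {

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u2028') {
                // Emit the six-character JSON escape instead of the raw character.
                out.append("\\u2028");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("first line\u2028second line"));
    }
}
```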
How to write a custom Solr FunctionQuery
Posted by Kelvin on 03 Sep 2010 | Tagged as: Lucene / Solr / Nutch, programming
Solr FunctionQueries allow you to modify the ranking of a search query in Solr by applying functions to the results.
There is a list of out-of-the-box FunctionQueries available here: http://wiki.apache.org/solr/FunctionQuery
In order to write a custom Solr FunctionQuery, you'll need to do 3 things:
1. Subclass org.apache.solr.search.ValueSourceParser. Here's a stub ValueSourceParser.
public class MyValueSourceParser extends ValueSourceParser {
    public void init(NamedList namedList) {
    }

    public ValueSource parse(FunctionQParser fqp) throws ParseException {
        return new MyValueSource();
    }
}
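2. In solrconfig.xml, register your new ValueSourceParser directly under the <config> tag, using a <valueSourceParser> element whose name attribute is the function name and whose class attribute points to your parser class.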
3. Subclass org.apache.solr.search.ValueSource and instantiate it in your ValueSourceParser.parse() method.
Let's take a look at 2 ValueSource implementations to see what they do, starting with the simplest:
org.apache.solr.search.function.ConstValueSource
Example SolrQuerySyntax: _val_:1.5
It simply returns a float value.
public class ConstValueSource extends ValueSource {
    final float constant;

    public ConstValueSource(float constant) {
        this.constant = constant;
    }

    public DocValues getValues(Map context, IndexReader reader) throws IOException {
        return new DocValues() {
            public float floatVal(int doc) {
                return constant;
            }
            public int intVal(int doc) {
                return (int)floatVal(doc);
            }
            public long longVal(int doc) {
                return (long)floatVal(doc);
            }
            public double doubleVal(int doc) {
                return (double)floatVal(doc);
            }
            public String strVal(int doc) {
                return Float.toString(floatVal(doc));
            }
            public String toString(int doc) {
                return description();
            }
        };
    }

    // commented out some boilerplate stuff
}
As you can see, the important method is DocValues getValues(Map context, IndexReader reader). The gist of the method is to return a DocValues object which returns a value given a document id.
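Stripped of the Solr types, the pattern is just "give me an object that maps a document id to a value". The sketch below models that with a stand-in interface so it runs without Solr on the classpath; FloatPerDoc and constant() are hypothetical names and are not part of the Solr API.

```java
// Self-contained model of the getValues() pattern; none of these types come from Solr.
class PerDocValueSketch {

    // Stand-in for DocValues: returns a value for a given document id.
    interface FloatPerDoc {
        float floatVal(int doc);
    }

    // Stand-in for ConstValueSource.getValues(): the doc id is ignored.
    static FloatPerDoc constant(final float value) {
        return new FloatPerDoc() {
            public float floatVal(int doc) {
                return value;
            }
        };
    }

    public static void main(String[] args) {
        FloatPerDoc vals = constant(1.5f);
        // Every document gets the same value.
        System.out.println(vals.floatVal(0) + " " + vals.floatVal(42));
    }
}
```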
org.apache.solr.search.function.OrdFieldSource
ord(myfield) returns the ordinal of the indexed field value within the indexed list of terms for that field in lucene index order (lexicographically ordered by unicode value), starting at 1. In other words, for a given field, all values are ordered lexicographically; this function then returns the offset of a particular value in that ordering.
Example SolrQuerySyntax: _val_:"ord(myIndexedField)"
public class OrdFieldSource extends ValueSource {
    protected String field;

    public OrdFieldSource(String field) {
        this.field = field;
    }

    public DocValues getValues(Map context, IndexReader reader) throws IOException {
        return new StringIndexDocValues(this, reader, field) {
            protected String toTerm(String readableValue) {
                return readableValue;
            }
            public float floatVal(int doc) {
                return (float)order[doc];
            }
            public int intVal(int doc) {
                return order[doc];
            }
            public long longVal(int doc) {
                return (long)order[doc];
            }
            public double doubleVal(int doc) {
                return (double)order[doc];
            }
            public String strVal(int doc) {
                // the string value of the ordinal, not the string itself
                return Integer.toString(order[doc]);
            }
            public String toString(int doc) {
                return description() + '=' + intVal(doc);
            }
        };
    }
}
OrdFieldSource is almost identical to ConstValueSource. The main differences are that it returns the ordinal rather than a constant value, and that it uses StringIndexDocValues, which provides the ordering of values.
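To make "ordinal" concrete, here is a small stand-alone sketch that computes, for each document, the 1-based position of its field value among the sorted distinct values. It is not how Lucene builds its index internally. It assumes every document has exactly one non-null value, and it uses Java's String ordering as an approximation of index order. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of what ord(myfield) returns per document.
class OrdinalSketch {

    // valueByDoc[doc] is the field value of document 'doc'.
    static int[] ordinals(String[] valueByDoc) {
        // Sorted, de-duplicated list of all terms in the field.
        List<String> sortedTerms = new ArrayList<String>(
                new TreeSet<String>(Arrays.asList(valueByDoc)));
        int[] ords = new int[valueByDoc.length];
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            // Position in the sorted term list, shifted so ordinals start at 1.
            ords[doc] = Collections.binarySearch(sortedTerms, valueByDoc[doc]) + 1;
        }
        return ords;
    }

    public static void main(String[] args) {
        String[] values = {"banana", "apple", "cherry", "apple"};
        System.out.println(Arrays.toString(ordinals(values))); // prints [2, 1, 3, 1]
    }
}
```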
Our own ValueSource
We now have a pretty good idea what a ValueSource subclass has to do:
return some value for a given doc id.
This can be based on the value of a field in the index (like OrdFieldSource), or nothing to do with the index at all (like ConstValueSource).
Here's one that performs the opposite of MaxFloatFunction/max() - MinFloatFunction/min():
public class MinFloatFunction extends ValueSource {
    protected final ValueSource source;
    protected final float fval;

    public MinFloatFunction(ValueSource source, float fval) {
        this.source = source;
        this.fval = fval;
    }

    public DocValues getValues(Map context, IndexReader reader) throws IOException {
        final DocValues vals = source.getValues(context, reader);
        return new DocValues() {
            public float floatVal(int doc) {
                // cap the wrapped source's value at fval
                float v = vals.floatVal(doc);
                return v > fval ? fval : v;
            }
            public int intVal(int doc) {
                return (int)floatVal(doc);
            }
            public long longVal(int doc) {
                return (long)floatVal(doc);
            }
            public double doubleVal(int doc) {
                return (double)floatVal(doc);
            }
            public String strVal(int doc) {
                return Float.toString(floatVal(doc));
            }
            public String toString(int doc) {
                return "min(" + vals.toString(doc) + "," + fval + ")";
            }
        };
    }

    @Override
    public void createWeight(Map context, Searcher searcher) throws IOException {
        source.createWeight(context, searcher);
    }

    // boilerplate methods omitted
}
And the corresponding ValueSourceParser:
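// The class name here is illustrative
public class MinValueSourceParser extends ValueSourceParser {
    public void init(NamedList namedList) {
    }

    public ValueSource parse(FunctionQParser fqp) throws ParseException {
        // first argument: the wrapped value source; second argument: the cap
        ValueSource source = fqp.parseValueSource();
        float val = fqp.parseFloat();
        return new MinFloatFunction(source, val);
    }
}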
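The wrapping idea behind MinFloatFunction can be tried without Solr. The sketch below uses a stand-in interface in place of DocValues and an array in place of an index, so FloatPerDoc, fromArray and min are hypothetical names; only the clamping logic mirrors the class above.

```java
// Self-contained model of a value source that wraps another one; no Solr types involved.
class MinClampSketch {

    // Stand-in for DocValues.
    interface FloatPerDoc {
        float floatVal(int doc);
    }

    // Stand-in for a field-backed source: the value of doc i is values[i].
    static FloatPerDoc fromArray(final float[] values) {
        return new FloatPerDoc() {
            public float floatVal(int doc) {
                return values[doc];
            }
        };
    }

    // Wraps 'source' and caps every value at 'cap', like min(source, cap).
    static FloatPerDoc min(final FloatPerDoc source, final float cap) {
        return new FloatPerDoc() {
            public float floatVal(int doc) {
                float v = source.floatVal(doc);
                return v > cap ? cap : v;
            }
        };
    }

    public static void main(String[] args) {
        FloatPerDoc capped = min(fromArray(new float[] {0.5f, 3f, 10f}), 3f);
        for (int doc = 0; doc < 3; doc++) {
            System.out.println(doc + " -> " + capped.floatVal(doc));
        }
    }
}
```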
Dynamic facet population with Solr DataImportHandler
Posted by Kelvin on 02 Aug 2010 | Tagged as: Lucene / Solr / Nutch, programming
Here's what I'm trying to do:
Given this mysql table:
CREATE TABLE tag (
    `id` integer AUTO_INCREMENT NOT NULL PRIMARY KEY,
    `name` varchar(100) NOT NULL UNIQUE,
    `category` varchar(100)
);
INSERT INTO tag (name,category) VALUES ('good','foo');
INSERT INTO tag (name,category) VALUES ('awe-inspiring','foo');
INSERT INTO tag (name,category) VALUES ('mediocre','bar');
INSERT INTO tag (name,category) VALUES ('terrible','car');
and this Solr schema:
<field name="tag-foo" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="tag-bar" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="tag-car" type="string" indexed="true" stored="true" multiValued="true"/>
I want to populate these tag fields via DataImportHandler.
The dumb (but straightforward) way to do it is to use sub-entities, but this is terribly expensive since you use one extra SQL query per category.
Solution
My general approach was to concatenate the rows into a single row, then use RegexTransformer and a custom dataimport Transformer to split out the values.
Here's how I did it:
My dataimporthandler xml:
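<!-- the entity name below is illustrative; the transformer chain matches the verbose output further down -->
<entity name="tag" transformer="RegexTransformer,org.supermind.solr.TagFacetsTransformer"
        query="select group_concat(concat(t.category,'=',t.name) separator '#') as tagfacets from tag t,booktag bt where bt.id='${book.id}' and t.category is not null">
    <field column="tagfacets" splitBy="#"/>
</entity>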
You'll see that a temporary field tagfacets is used. This will be deleted later on in TagFacetsTransformer.
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;
import java.util.List;
import java.util.Map;

public class TagFacetsTransformer extends Transformer {
    public Object transformRow(Map<String, Object> row, Context context) {
        Object tf = row.get("tagfacets");
        if (tf != null) {
            if (tf instanceof List) {
                List list = (List) tf;
                for (Object o : list) {
                    String[] arr = ((String) o).split("=");
                    if (arr.length == 2) row.put("tag-" + arr[0], arr[1]);
                }
            } else {
                String[] arr = ((String) tf).split("=");
                if (arr.length == 2) row.put("tag-" + arr[0], arr[1]);
            }
            row.remove("tagfacets");
        }
        return row;
    }
}
Here's the output via DIH's verbose output (with my own data):
<str>---------------------------------------------</str>
<lst name="transformer:RegexTransformer">
<str>---------------------------------------------</str>
<arr name="tagfacets">
<str>lang=ruby</str>
<str>framework=ruby-on-rails</str>
</arr>
<str>---------------------------------------------</str>
<lst name="transformer:org.supermind.solr.TagFacetsTransformer">
<str>---------------------------------------------</str>
<str name="tag-framework">ruby-on-rails</str>
<str name="tag-lang">ruby</str>
<str>---------------------------------------------</str>
</lst>
</lst>
</lst>
You can see the step-by-step transformation of the input value.
Pretty nifty, eh?
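One thing to watch for: the transformer stores each value with row.put, so if a book has two tags in the same category (say 'good' and 'awe-inspiring', both in 'foo'), the second replaces the first. Since the tag fields are multiValued, a possible variation is to collect the values into a list per field. The stand-alone sketch below shows just that grouping step on the concatenated string; TagFacetGrouper is a hypothetical name, and wiring the resulting lists back into the DIH row is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups "category=name" pairs into one value list per tag field.
class TagFacetGrouper {

    // Input looks like "foo=good#foo=awe-inspiring#bar=mediocre".
    static Map<String, List<String>> group(String concatenated) {
        Map<String, List<String>> byField = new LinkedHashMap<String, List<String>>();
        if (concatenated == null || concatenated.isEmpty()) {
            return byField;
        }
        for (String pair : concatenated.split("#")) {
            String[] arr = pair.split("=");
            if (arr.length != 2) {
                continue; // skip malformed pairs, as the transformer does
            }
            String field = "tag-" + arr[0];
            List<String> values = byField.get(field);
            if (values == null) {
                values = new ArrayList<String>();
                byField.put(field, values);
            }
            values.add(arr[1]); // append instead of overwrite
        }
        return byField;
    }

    public static void main(String[] args) {
        // prints {tag-foo=[good, awe-inspiring], tag-bar=[mediocre]}
        System.out.println(group("foo=good#foo=awe-inspiring#bar=mediocre"));
    }
}
```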
Upgrading to Lucene 3.0
Posted by Kelvin on 28 Apr 2010 | Tagged as: Lucene / Solr / Nutch, programming
Recently upgraded a 3-year-old app from Lucene 2.1-dev to 3.0.1.
Some random thoughts on the evolution of the Lucene API over the past 3 years:
I miss Hits
Sigh. Hits has been deprecated for a while now, but with 3.0 it's gone. And I have to say it's a pain that it is.
Where I used to pass the Hits object around, now I need to pass TopDocs AND Searcher in order to get to documents.
Instead of
Document doc = hits.doc(i);
it's now
Document doc = searcher.doc(topdocs.scoreDocs[i].doc);
Much more verbose with zero benefit to me as a programmer.
Nice number indexing via NumericField
Where I previously had to pad numbers for lexicographic searching, there's now a proper NumericField and NumericRangeFilter.
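For anyone who never had to do the padding trick: terms are compared as strings, so "9" sorts after "10" unless every number is padded to the same width. A tiny sketch of the old workaround follows. The width of 10 digits is an arbitrary choice for the example, and it only handles non-negative numbers.

```java
import java.util.Locale;

// Illustration of zero-padding numbers so that string order matches numeric order.
class PaddedNumberSketch {

    // Pads a non-negative number to a fixed width of 10 digits.
    static String pad(long n) {
        return String.format(Locale.ROOT, "%010d", n);
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10" because '9' > '1'.
        System.out.println("9".compareTo("10") > 0);
        // Padded: "0000000009" sorts before "0000000010".
        System.out.println(pad(9).compareTo(pad(10)) < 0);
    }
}
```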
Lockless commits
What more can I say? Yay!!
What has not changed...
Perhaps somewhat more important than what has changed is what has remained the same: 95% of the API and the query language.
3 years is a mighty long time and Lucene has experienced explosive growth during this period. The overall sanity of change is a clear sign of Lucene's committers' dedication to simplicity and a hat-tip to Doug's original architecture and vision.
