Dynamic facet population with Solr DataImportHandler
Posted by Kelvin on 02 Aug 2010 at 02:03 pm | Tagged as: Lucene / Solr / Nutch, programming
Here's what I'm trying to do:
Given this mysql table:
`id` integer AUTO_INCREMENT NOT NULL PRIMARY KEY,
`name` varchar(100) NOT NULL UNIQUE,
`category` varchar(100)
);
INSERT INTO tag (name,category) VALUES ('good','foo');
INSERT INTO tag (name,category) VALUES ('awe-inspiring','foo');
INSERT INTO tag (name,category) VALUES ('mediocre','bar');
INSERT INTO tag (name,category) VALUES ('terrible','car');
and this solr schema
<field name="tag-bar" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="tag-car" type="string" indexed="true" stored="true" multiValued="true"/>
to populate these tag fields via DataImportHandler.
The dumb (but straightforward) way to do it is to use sub-entities, but this is terribly expensive since you use one extra SQL query per category.
Solution
My general approach was to concatenate the rows into a single row, then use RegexTransformer and a custom dataimport Transformer to split out the values.
Here's how I did it:
My dataimporthandler xml:
query="select group_concat(concat(t.category,'=',t.name) separator '#') as tagfacets from tag t,booktag bt where bt.id='${book.id}' and t.category is not null">
<field column="tagfacets" splitBy="#"/>
</entity>
You'll see that a temporary field tagfacets is used. This will be deleted later on in TagFacetsTransformer.
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;
import java.util.List;
import java.util.Map;
public class TagFacetsTransformer extends Transformer {
public Object transformRow(Map<String, Object> row, Context context) {
Object tf = row.get("tagfacets");
if (tf != null) {
if (tf instanceof List) {
List list = (List) tf;
for (Object o : list) {
String[] arr = ((String) o).split("=");
if (arr.length == 2) row.put("tag-" + arr[0], arr[1]);
}
} else {
String[] arr = ((String) tf).split("=");
if (arr.length == 2) row.put("tag-" + arr[0], arr[1]);
}
row.remove("tagfacets");
}
return row;
}
}
Here's the output via DIH's verbose output (with my own data):
<str>---------------------------------------------</str>
<lst name="transformer:RegexTransformer">
<str>---------------------------------------------</str>
<arr name="tagfacets">
<str>lang=ruby</str>
<str>framework=ruby-on-rails</str>
</arr>
<str>---------------------------------------------</str>
<lst name="transformer:org.supermind.solr.TagFacetsTransformer">
<str>---------------------------------------------</str>
<str name="tag-framework">ruby-on-rails</str>
<str name="tag-lang">ruby</str>
<str>---------------------------------------------</str>
</lst>
</lst>
</lst>
You can see the step-by-step transformation of the input value.
Pretty nifty, eh?
-
Kevin
-
Kelvin
-
Heath Aldrich
-
Kelvin
