Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data


Writing custom Solr Analysis Token Filters

Posted by Kelvin on 02 Nov 2019 | Tagged as: programming

Introduction to Solr analysis

Solr filters are declared in Solr's schema as part of analyzer chains.

In Lucene/Solr/Elasticsearch, analyzers are used to process text, breaking text up into terms or tokens, which are then indexed.

An analyzer chain in Solr/Elasticsearch consists of:
1) zero or more character filters
2) a tokenizer
3) zero or more token filters

Token filters operate on tokens/terms, as opposed to character filters, which operate on the raw character stream before the text is tokenized.

Examples of token filters are the LowerCaseFilter, which lowercases terms; the StopFilter, which removes stopwords; and the TrimFilter, which trims whitespace from terms.
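
For example, here's what a (hypothetical) field type wiring all three stages together might look like in the schema, in the order the stages actually run. The field type name is made up; the factories are all stock Solr ones:

<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>   <!-- character filter: operates on raw characters, before tokenization -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>    <!-- tokenizer: breaks the text into tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>           <!-- token filters: operate on the emitted tokens -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>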

Basic implementation of a TokenFilter

Without further ado, here is the simplest possible TokenFilter and TokenFilterFactory implementation. (I included some javadoc from TokenFilter.java to provide a bit more background context)

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class MyTokenFilterFactory extends TokenFilterFactory {
    public MyTokenFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new MyTokenFilter(input);
    }
}

final class MyTokenFilter extends TokenFilter {
    public MyTokenFilter(TokenStream input) {
        super(input);
    }

    /**
     * Consumers (i.e., {@link IndexWriter}) use this method to advance the stream to
     * the next token. Implementing classes must implement this method and update
     * the appropriate {@link AttributeImpl}s with the attributes of the next
     * token.
     *
     * The producer must make no assumptions about the attributes after the method
     * has been returned: the caller may arbitrarily change it. If the producer
     * needs to preserve the state for subsequent calls, it can use
     * {@link #captureState} to create a copy of the current attribute state.
     *
     * This method is called for every token of a document, so an efficient
     * implementation is crucial for good performance. To avoid calls to
     * {@link #addAttribute(Class)} and {@link #getAttribute(Class)},
     * references to all {@link AttributeImpl}s that this stream uses should be
     * retrieved during instantiation.
     *
     * To ensure that filters and consumers know which attributes are available,
     * the attributes must be added during instantiation. Filters and consumers
     * are not required to check for availability of attributes in
     * {@link #incrementToken()}.
     *
     * @return false for end of stream; true otherwise
     */
    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        return true;
    }
}

Some notes:
1. MyTokenFilterFactory exists solely to create MyTokenFilter. Configuration arguments can be obtained via its constructor (see the sketch after these notes).
2. MyTokenFilter must be final.
3. MyTokenFilter implements a single method called incrementToken().
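
To illustrate note 1, here's a hedged sketch (not part of the minimal class above) of how the factory could read a hypothetical "replacement" init argument. get() is inherited from the factory base class and removes the key from the args map, which is what makes the leftover-args check work:

    // Hypothetical usage in the schema:
    // <filter class="com.mycompany.solr.MyTokenFilterFactory" replacement="Howdy"/>
    private final String replacement;

    public MyTokenFilterFactory(Map<String, String> args) {
        super(args);
        replacement = get(args, "replacement", "Hello");   // read the arg, with a default
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }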

Now, this token filter is a no-op implementation and not very useful. Here is a (slightly) modified version which replaces every term with "Hello".

  1. final class MyTokenFilter extends TokenFilter {
  2.     private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  3.     public MyTokenFilter(TokenStream input) {
  4.         super(input);
  5.     }
  6.  
  7.     @Override
  8.     public boolean incrementToken() throws IOException {
  9.         if (!input.incrementToken()) {
  10.             return false;
  11.         }
  12.         termAtt.setEmpty().append("Hello");
  13.         return true;
  14.     }
  15. }

Some notes:
1. Line 2 shows the TokenFilter pattern of adding attributes at constructor time. See AttributeImpl for the list of available attributes.
2. In this filter, we use CharTermAttribute. On each incrementToken() invocation, it holds the term just produced by the upstream tokenizer or filter.
3. To modify the term that will be returned, call termAtt.setEmpty().append(some-string).
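
If you want to sanity-check the filter outside Solr, here's a minimal sketch (assuming Lucene 8.x packages) that runs a few whitespace-separated words through it:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Tokenizer tokenizer = new WhitespaceTokenizer();
tokenizer.setReader(new StringReader("the quick brown fox"));

TokenStream stream = new MyTokenFilter(tokenizer);
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

stream.reset();                          // mandatory before consuming the stream
while (stream.incrementToken()) {
    System.out.println(term.toString()); // prints "Hello" four times
}
stream.end();
stream.close();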

Deploying to Solr

Compile these classes into a .jar file and place it in your core's lib folder. You can then use this filter in your field types like so:

<analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="com.mycompany.solr.MyTokenFilterFactory"/>
</analyzer>

Linode has terrible customer service

Posted by Kelvin on 01 May 2016 | Tagged as: programming

I recently ran a small webapp on Linode to test out response latency from their California datacenter. When I no longer needed the app, I powered down the node, thinking that the hourly billing as advertised on this page: https://blog.linode.com/2014/04/09/introducing-hourly-billing/ would mean that I wouldn't get charged for a powered down node.

Unfortunately, I didn't read the fine print and it turns out a powered down node is billed exactly the same as one that's powered up. Wow.

So I ended up getting billed for 3 months' worth of service for about 2 weeks' worth of actual usage.

I tried to explain to Linode's customer service folks about the miscommunication and asked if they would refund the last month's bill. Yes, of course I understood where they were coming from (that the powered down node was still using resources), but they didn't seem to care where I was coming from. Not even after I requested to speak to a customer service manager. Clearly they value short-term gains over my long-term value as a client.

After 8 emails of back and forth, I finally gave up and paid the last month and cancelled my account. Forever.

I will never use Linode again, nor recommend it to any of my clients. Not when there are so many respectable alternatives, e.g. http://digitalocean.com/

Congratulations, Linode. You just lost a client forever, and earned yourself some negative publicity. Was it really worth it?

Power browsing proggit + HN + lobste.rs + dzone news

Posted by Kelvin on 20 Jan 2016 | Tagged as: programming

Disclaimer: this uses Erudite, a tool I wrote in Django.

Here's how I speed-read programming-related news. Open https://erudite.supermind.org/news/headlines/#tab_Programming in your browser.

Press ` (the backtick key) to page the entire row forward, and shift+` to page it back.

Press 1, 2, 3 or 4 to page each respective column. shift+1 pages back on column 1, shift+2 on column 2, etc.

When you see a link you like, use the arrow keys to navigate to the link, then ctrl+enter to open it in a background tab.

Using Erudite is a *ton* faster than reading through Hacker News or proggit (though you can't easily read the comments).

Pure awesomeness.

Erudite – a text-only, keyboard-friendly news reader

Posted by Kelvin on 12 Jan 2016 | Tagged as: programming

Something I've been working on for a bit: https://erudite.supermind.org

A keyboard-friendly, text-only news reader. Somewhat mobile-friendly. Hit '?' for keyboard shortcuts.

Monier-Williams Sanskrit-English-IAST search engine

Posted by Kelvin on 17 Sep 2015 | Tagged as: Python, programming, Lucene / Solr / Elasticsearch / Nutch

I just launched a search application for the Monier-Williams dictionary, which is the definitive Sanskrit-English dictionary.

See it in action here: http://sanskrit.supermind.org

The app is built in Python and uses the Whoosh search engine. I chose Whoosh instead of Solr or ElasticSearch because I wanted to try building a search app which didn't depend on Java.

Features include:
– full-text search in Devanagari, English, IAST, ascii and HK
– results link to page scans
– more frequently occurring word senses are boosted higher in search results
– visually displays the MW level or depth of a word with list indentation

An HTML5 ElasticSearch Query DSL Builder

Posted by Kelvin on 16 Sep 2015 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch


TL;DR: I parsed ElasticSearch source and generated an HTML app that allows you to build ElasticSearch queries using its JSON Query DSL. You can see it in action here: http://supermind.org/elasticsearch/query-dsl-builder.html


I really like ElasticSearch's JSON-based Query DSL – it lets you create fairly complex search queries in a relatively painless fashion.

I do not, however, fancy the query DSL documentation. I've often found it inadequate, inconsistent with the source, and at times downright confusing.

Browsing the source, I realised that ES parses JSON queries in a fairly regular fashion, which would lend itself well to regex-based parsing of the Java source in order to generate documentation of the JSON 'schema'.

I did the parsing in Java, and the actual query builder UI is in HTML and JavaScript. The Java phase outputs a JSON data model of the query DSL, which the HTML app then uses to dynamically build the HTML forms etc.

Because of the consistent naming conventions of the objects, I was also able to embed links to documentation and github source within the page itself. Very useful!

You can see the result in action here: http://supermind.org/elasticsearch/query-dsl-builder.html

PS: I first did this for ES version 1.2.1, and then subsequently for 1.4.3 and now 1.7.2. The approach seems to work consistently across versions, with minor changes required in the Java backend parsing between version bumps. Hopefully this remains the case when we go to ES 2.x.

Definitive guide to routing Android and Genymotion traffic through a socks proxy

Posted by Kelvin on 16 Nov 2014 | Tagged as: programming, android

If you only need to route traffic on Android through a ssh tunnel (not proxy), just use http://code.google.com/p/sshtunnel/
If all you need to do is to inspect network traffic, you can use Wireshark on Genymotion.

If, however, you're on Genymotion and/or need to get Android traffic through a proxy, especially if you're trying to conduct MITM attacks or inspect network traffic, read on. This article also assumes you're on either a Mac or Linux computer. Sorry, Windows users.

My requirements for this setup were:
1. inspect, save and playback network traffic
2. capture *all* incoming and outgoing TCP traffic from android devices, not just HTTP connections
3. write code to mangle and manipulate traffic as needed

I tried pretty much every technique I knew about and found on the internet, and found only *one* method that was consistently reliable and worked on Genymotion: a combination of ssh, redsocks, iptables and a python proxy. To use this setup with a physical Android device, I connect it to the same wireless network, and then use ConnectBot to ssh-tunnel into the host computer.

Genymotion

First let's talk about getting this working on Genymotion.

Genymotion runs the Android emulator in a virtualbox instance. Contrary to what many report online, setting the proxy via Settings > Wifi > Manage really isn't enough. It doesn't force all network connections through a particular proxy. Interestingly, setting the proxy at the host operating system level didn't work either (in Ubuntu 12.04 at least). Virtualbox seemed to be ignoring this setting and connecting directly. Running tsocks genymotion didn't work either, and neither did using connectbot or proxydroid with global proxy enabled.

Having said that, you should try the Wifi proxy approach first. It's way way easier and may be sufficient for your needs!

In the Genymotion Android emulator
Settings -> Wifi -> Long-press on the active network
Select "Modify Network"
Select "Show Advanced Options"
Select "Proxy Settings -> Manual"
Hostname: 10.0.3.2
Port: 8888
Press Save

(this assumes Charles is listening in on port 8888). The 10.0.3.2 is Genymotion's special IP for the host IP. Now, if you use the Android browser, requests *should* be visible in Charles. If this is not happening, then something is wrong. If Android browser requests are visible in Charles but not the traffic you're interested in, then read on…

As this link explains really well, virtualbox operates at a lower level than socks, so it's actually not possible to route virtualbox connections directly through a socks proxy. We'll get around this by using iptables and redsocks.

Our final application stack will look like this:

TCP requests from applications > iptables > redsocks > python proxy > ssh tunnel > internet

I'm almost positive it's possible to eliminate one or more steps with the right iptables wizardry, but I had already wasted so much time getting this working, I was just relieved to have a working solution.

First, set up your SSH tunnel with dynamic port forwarding (SOCKS forwarding) on port 7777. (I assume there is an SSH server you can tunnel into. If you don't have one, try the Amazon EC2 free tier.)

ssh username@host -D 7777

Then set up iptables to route all TCP connections to port 5555. Remember: you have to do the SSH connect *before* changing iptables. Once the ssh connection is live, it stays alive (assuming you have keep-alive configured).

Put the following into a script called redsocks-iptables.sh

sudo iptables -t nat -N REDSOCKS
 
# Ignore LANs and some other reserved addresses.
sudo iptables -t nat -A REDSOCKS -d 0.0.0.0/8 -j RETURN
sudo iptables -t nat -A REDSOCKS -d 10.0.0.0/8 -j RETURN
sudo iptables -t nat -A REDSOCKS -d 127.0.0.0/8 -j RETURN
sudo iptables -t nat -A REDSOCKS -d 169.254.0.0/16 -j RETURN
sudo iptables -t nat -A REDSOCKS -d 172.16.0.0/12 -j RETURN
sudo iptables -t nat -A REDSOCKS -d 192.168.0.0/16 -j RETURN
sudo iptables -t nat -A REDSOCKS -d 224.0.0.0/4 -j RETURN
sudo iptables -t nat -A REDSOCKS -d 240.0.0.0/4 -j RETURN
 
# Anything else should be redirected to port 5555
sudo iptables -t nat -A REDSOCKS -p tcp -j REDIRECT --to-ports 5555
 
# Any tcp connection made by 'user' should be redirected, put your username here.
sudo iptables -t nat -A OUTPUT -p tcp -m owner --uid-owner user -j REDSOCKS

This iptables script basically redirects all tcp connections to port 5555.

And put this in redsocks-iptables-reset.sh

sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -F
sudo iptables -X

Use this to reset your iptables to the original state (assuming you don't already have any rules in there).

Now we use redsocks to route these tcp connections to a socks proxy. You may have to install some dependencies such as libevent.

git clone https://github.com/darkk/redsocks.git
cd redsocks
make

Save this as redsocks.conf

base {
    log_debug = on;
    log_info = on;
    log = "file:/tmp/reddi.log";
    daemon = on;
    redirector = iptables;
}

redsocks {
    local_ip = 127.0.0.1;
    local_port = 5555;
    ip = 127.0.0.1;
    port = 6666;
    type = socks5;
}

And run redsocks like so:

redsocks -c redsocks.conf

Now we need to set up the python proxy. This listens on port 6666 (the port redsocks forwards requests to) and forwards requests out to port 7777 (the ssh socks port). Download it from http://code.activestate.com/recipes/502293-hex-dump-port-forwarding-network-proxy-server/

Rename it to hexdump.py, and run it like so:

python ./hexdump.py 6666:localhost:7777

Now you're all set up. When you make *any* request from your Genymotion Android device, it should appear as coming from your ssh server, and the hexdump of the data appears in the console from which you ran hexdump.py, like so:

2014-11-16 01:04:51,340 INFO client 127.0.0.1:38340 -> server localhost:7777 (37 bytes)
-> 0000   17 03 01 00 20 EF F3 1C 57 4C 02 8C AA 56 13 53    .... ...WL...V.S
-> 0010   A7 7E E0 7B B9 55 CC 32 6B 5D 42 AC B8 01 52 6B    .~.{.U.2k]B...Rk
-> 0020   2D 80 2A 44 08                                     -.*D.

Physical Android devices

To inspect traffic from a physical Android device, I had to go 2 steps further:

1. Set up an SSH server locally (on the machine where redsocks is running), NOT on the destination SSH server
2. Install connectbot and route all network through host (remember to enable the Global proxy option)

Yes, unfortunately, root is required on the device to use the 'Global proxy' option. I read someplace about a non-root workaround using an app called Drony, which uses an Android VPN to route traffic to a proxy of your choice, but I didn't get this to work. Maybe you can.

Manipulating traffic

The Python script is pretty easy to decipher. The 2 important functions are:

  def dataReceived(self, recvd):
    logger.info("client %s -> server %s (%d bytes)\n%s" % (
      self.clientName, self.serverName, len(recvd), hexdump('->', recvd)))
    if hasattr(self, "dest"):
      self.dest.write(recvd)
    else:
      logger.debug("caching data until remote connection is open")
      self.clientFactory.writeCache.append(recvd)
 
  def write(self, buf):
    logger.info("client %s <= server %s (%d bytes)\n%s" % (
      self.clientName, self.serverName, len(buf), hexdump('<=', buf)))
    self.transport.write(buf)

Here you can intercept the data, inspect it, write it out to file, or modify it as you wish.

Notes

The really nice thing about using the python proxy is how easy it is to write data to file and to write rules to manipulate both incoming and outgoing traffic. Beats wireshark any day of the week. As I'm a little more familiar with Java, I initially got this working using Netty's HexDumpProxy example, but manipulating data in Netty was a gigantic PITA, not to mention the painful edit-compile-run cycle.

The major downside of this approach is not being able to (easily) see which server the data you're manipulating is destined for/coming from: the client and servers are both proxies. There's probably a way around this. If you figure it out, let me know!

If you need to decrypt and inspect SSL connections (I didn't), you can add Charles into the mix, after the python proxy and before the SSH tunnel. Just point the python proxy's outgoing connection at the port the Charles socks proxy is listening on, and set Charles' upstream socks proxy to the SSH tunnel port (in our case 7777). You'll find this setting in Charles under Proxy > External Proxy Settings.

If at times one of the links in the chain breaks down, like the ssh connection dying, you may need to kill redsocks, reset iptables and start over. I didn't have to do this too often.

[solved] Tomcat 6 UTF-8 encoding issue

Posted by Kelvin on 08 Oct 2013 | Tagged as: programming

If, after following all the instructions in the Tomcat docs for enabling UTF-8 support (http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8), you still run into UTF-8 issues and your webapp involves reading and displaying the contents of files, give this a whirl.

In catalina.sh, either at the top of the file or after the long comments, insert this:

export CATALINA_OPTS="$CATALINA_OPTS -Dfile.encoding=UTF-8"

Guava Tables

Posted by Kelvin on 13 Sep 2013 | Tagged as: programming

Just discovered Guava's Table data structure. Whoa..!

https://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained

import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;

// Vertex is a placeholder for your own node type.
Table<Vertex, Vertex, Double> weightedGraph = HashBasedTable.create();
weightedGraph.put(v1, v2, 4.0);   // values must be Doubles, not ints
weightedGraph.put(v1, v3, 20.0);
weightedGraph.put(v2, v3, 5.0);

weightedGraph.row(v1); // returns a Map mapping v2 to 4.0, v3 to 20.0
weightedGraph.column(v3); // returns a Map mapping v1 to 20.0, v2 to 5.0

High-level overview of Latent Semantic Analysis / LSA

Posted by Kelvin on 09 Sep 2013 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

I've just spent the last couple days wrapping my head around implementing Latent Semantic Analysis, and after wading through a number of research papers and quite a bit of linear algebra, I've finally emerged on the other end, and thought I'd write something about it to lock the knowledge in. I'll do my best to keep it non-technical, yet accurate.

Step One – Build the term-document matrix

Input : documents
Output : term-document matrix

Latent Semantic Analysis has the same starting point as most Information Retrieval algorithms : the term-document matrix. Specifically, columns are documents, and rows are terms. If a document contains a term, then the value of that row-column is 1, otherwise 0.

If you start with a corpus of documents, or a database table or something, then you'll need to index this corpus into this matrix. Meaning, lowercasing, removing stopwords, maybe stemming etc. The typical Lucene/Solr analyzer chain, basically.
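
Here's a rough sketch of this step in plain Java. The method name and the tokenized-document input are illustrative only; in practice you'd reuse whatever analysis chain you already have:

import java.util.List;

// Builds a binary term-document matrix: rows are terms (in vocabulary order),
// columns are documents; a cell is 1.0 if the term occurs in that document.
static double[][] buildTermDocumentMatrix(List<List<String>> tokenizedDocs, List<String> vocabulary) {
    double[][] matrix = new double[vocabulary.size()][tokenizedDocs.size()];
    for (int d = 0; d < tokenizedDocs.size(); d++) {
        for (String term : tokenizedDocs.get(d)) {
            int row = vocabulary.indexOf(term);   // assumes terms are already lowercased, stopworded, etc.
            if (row >= 0) {
                matrix[row][d] = 1.0;
            }
        }
    }
    return matrix;
}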

Step Two – Decompose the matrix

Input : term-document matrix
Output : 3 matrices, U, S and V

Apply Singular Value Decomposition (SVD) to the matrix. This is the computationally expensive step of the whole operation.

SVD is a fairly technical concept and quite an involved process (if you're doing it by hand). If you do a bit of googling, you're going to find all kinds of mathematical terms related to this, like matrix decomposition, eigenvalues, eigenvectors, PCA (principal component analysis), random projection etc.

The 5 second explanation of this step is that the original term-document matrix gets broken down into 3 simpler matrices: a term-term matrix (also known as U, or the left matrix), a diagonal matrix of the singular values (also known as S), and a document-document matrix (also known as V, or the right matrix).

Something which usually also happens in the SVD step for LSA, and which is important, is rank reduction. In this context, rank reduction means that the original term-document matrix gets somehow "factorized" into its constituent factors, and the k most significant factors or features are retained, where k is some number greater than zero and less than the original size of the term-document matrix. For example, a rank 3 reduction means that the 3 most significant factors are retained. This is important for you to know because most LSA/LSI applications will ask you to specify the value of k, meaning the application wants to know how many features you want to retain.

So what's actually happening in this SVD rank reduction, is basically an approximation of the original term-document matrix, allowing you to compare features in a fast and efficient manner. Smaller k values generally run faster and use less memory, but are less accurate. Larger k values are more "true" to the original matrix, but require longer to compute. Note: this statement may not be true of the stochastic SVD implementations (involving random projection or some other method), where an increase in k doesn't lead to a linear increase in running time, but more like a log(n) increase in running time.
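
As a concrete (and hedged) illustration, here's roughly what this step looks like using Apache Commons Math's SingularValueDecomposition; the library choice and the variable names are mine, not something LSA prescribes:

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

RealMatrix termDoc = new Array2DRowRealMatrix(matrix);       // the term-document matrix from Step One
SingularValueDecomposition svd = new SingularValueDecomposition(termDoc);

// Rank reduction: keep only the k most significant factors (singular values come out largest-first).
int k = 100;
RealMatrix Uk = svd.getU().getSubMatrix(0, termDoc.getRowDimension() - 1, 0, k - 1);     // term-factor matrix
RealMatrix Sk = svd.getS().getSubMatrix(0, k - 1, 0, k - 1);                              // k largest singular values
RealMatrix Vk = svd.getV().getSubMatrix(0, termDoc.getColumnDimension() - 1, 0, k - 1);   // document-factor matrix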

Step Three – Build query vector

Input : query string
Output : query vector

From here, we're on our downhill stretch. The query string is analyzed the same way the documents were (lowercased, stopworded, etc.), turned into a term vector over the same vocabulary, and then folded into the reduced k-dimensional space (by multiplying it with the U matrix and the inverse of S) so it can be compared against the document vectors.
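
Continuing the hedged Commons Math sketch from Step Two (queryTermVector is an illustrative double[] over the same vocabulary ordering, with 1.0 in the rows of the query's terms):

import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.RealVector;

// Raw query vector in the original term space.
RealVector q = new ArrayRealVector(queryTermVector);

// Fold the query into the reduced k-dimensional space: qHat = Sk^-1 * Uk^T * q
RealMatrix SkInverse = new LUDecomposition(Sk).getSolver().getInverse();
RealVector qHat = SkInverse.multiply(Uk.transpose()).operate(q);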

Step Four – Compute cosine distance

Input : query vector, document matrix
Output : document scores

To obtain how similar each document is to the query, aka the doc score, we have to go through each document vector in the matrix and calculate its cosine distance to the query vector.
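
In the same sketch, the scoring loop looks something like this; each row of Vk holds one document's coordinates in the reduced space:

for (int d = 0; d < Vk.getRowDimension(); d++) {
    RealVector docVec = Vk.getRowVector(d);
    // Cosine similarity between the folded query and the document vector.
    double score = qHat.dotProduct(docVec) / (qHat.getNorm() * docVec.getNorm());
    System.out.println("doc " + d + " -> " + score);
}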

Voila!!
