Posted by Kelvin on 01 May 2016 | Tagged as: programming
I recently ran a small webapp on Linode to test out response latency from their California datacenter. When I no longer needed the app, I powered down the node, thinking that the hourly billing as advertised on this page: https://blog.linode.com/2014/04/09/introducing-hourly-billing/ would mean that I wouldn't get charged for a powered down node.
Unfortunately, I didn't read the fine print and it turns out a powered down node is billed exactly the same as one that's powered up. Wow.
So I ended up getting billed for 3 months' worth of service for about 2 weeks' worth of actual usage.
I tried to explain the miscommunication to Linode's customer service folks and asked if they would refund the last month's bill. Yes, of course I understood where they were coming from (that the powered down node was still using resources), but they didn't seem to care where I was coming from. Not even after I requested to speak to a customer service manager. Clearly they value short-term gains over my long-term value as a client.
After 8 emails of back and forth, I finally gave up and paid the last month and cancelled my account. Forever.
I will never use Linode again, nor recommend it to any of my clients. Not when there are so many respectable alternatives, e.g. http://digitalocean.com/
Congratulations, Linode. You just lost a client forever, and earned yourself some negative publicity. Was it really worth it?
Posted by Kelvin on 20 Jan 2016 | Tagged as: programming
Disclaimer: this uses Erudite, a tool I wrote in Django.
Here's how I speed-read programming-related news. Open https://erudite.supermind.org/news/headlines/#tab_Programming in your browser.
Press ` (backtick key) to page the entire row, shift+` to page-prev the entire row.
Press 1 2 3 4 to page each respective column. shift+1 to previous page on column 1, shift+2 on column 2 etc.
When you see a link you like, →, then use the arrow keys to navigate to the link, then ctrl+enter to open it in a background tab.
Using Erudite is a *ton* faster than reading through Hacker News or proggit (though you can't easily read the comments).
Posted by Kelvin on 12 Jan 2016 | Tagged as: programming
I just launched a search application for the Monier-Williams dictionary, which is the definitive Sanskrit-English dictionary.
See it in action here: http://sanskrit.supermind.org
The app is built in Python and uses the Whoosh search engine. I chose Whoosh instead of Solr or ElasticSearch because I wanted to try building a search app which didn't depend on Java.
– full-text search in Devanagari, English, IAST, ASCII and HK
– results link to page scans
– more frequently occurring word senses are boosted higher in search results
– visually displays the MW level or depth of a word with list indentation
TL;DR: I parsed the ElasticSearch source and generated an HTML app that allows you to build ElasticSearch queries using its JSON Query DSL. You can see it in action here: http://supermind.org/elasticsearch/query-dsl-builder.html
I really like ElasticSearch's JSON-based Query DSL – it lets you create fairly complex search queries in a relatively painless fashion.
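To give a feel for what the JSON Query DSL looks like, here is a minimal sketch that builds a `bool` query as a plain Python dict and serialises it. The field names (`title`, `tags`, `year`) are hypothetical; only the `bool`, `match`, `term` and `range` structure comes from Elasticsearch.

```python
import json

# Hypothetical index fields ("title", "tags", "year"); only the query
# structure (bool / match / term / range) is Elasticsearch's.
query = {
    "query": {
        "bool": {
            # documents must match this full-text clause
            "must": [
                {"match": {"title": "search engine"}},
            ],
            # optional clause that raises the score when it matches
            "should": [
                {"term": {"tags": "lucene"}},
            ],
            # documents matching this clause are excluded
            "must_not": [
                {"range": {"year": {"lt": 2010}}},
            ],
        }
    },
    "size": 10,
}

# The serialised string is what gets sent as the search request body.
body = json.dumps(query, indent=2)
print(body)
```

Nesting clauses like this is what makes complex queries manageable: each clause is just another dict you can build up programmatically.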
I do not, however, fancy the query DSL documentation. I've often found it inadequate, inconsistent with the source, and at times downright confusing.
Browsing the source, I realised that ES parses JSON queries in a fairly regular fashion, which would lend itself well to regex-based parsing of the Java source in order to generate documentation of the JSON 'schema'.
Because of the consistent naming conventions of the objects, I was also able to embed links to documentation and github source within the page itself. Very useful!
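The regex idea can be sketched roughly like this. The Java fragment below is a made-up snippet written in the style of an ES 1.x query parser, not copied from the Elasticsearch source, and `extract_field_names` is a hypothetical helper, not the actual generator behind the builder.

```python
import re

# Hypothetical Java fragment in the style of an ES 1.x query parser;
# it is NOT copied from the Elasticsearch source.
JAVA_SNIPPET = '''
if ("query".equals(currentFieldName)) {
    queryText = parser.text();
} else if ("boost".equals(currentFieldName)) {
    boost = parser.floatValue();
} else if ("minimum_should_match".equals(currentFieldName) || "minimumShouldMatch".equals(currentFieldName)) {
    minimumShouldMatch = parser.textOrNull();
}
'''

# Matches string literals that are compared against the current JSON field name.
FIELD_NAME = re.compile(r'"([A-Za-z_]+)"\.equals\(currentFieldName\)')


def extract_field_names(java_source):
    """Return the JSON field names a parser accepts, in order of appearance."""
    return FIELD_NAME.findall(java_source)


fields = extract_field_names(JAVA_SNIPPET)
print(fields)
```

Running the same pattern over every parser class would yield a per-query list of accepted keys, which is the raw material for a generated 'schema'.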
You can see the result in action here: http://supermind.org/elasticsearch/query-dsl-builder.html
PS: I first did this for ES version 1.2.1, and then subsequently for 1.4.3 and now 1.7.2. The approach seems to work consistently across versions, with minor changes required in the Java backend parsing between version bumps. Hopefully this remains the case when we go to ES 2.x.
If you only need to route traffic on Android through an SSH tunnel (not a proxy), just use http://code.google.com/p/sshtunnel/
If all you need to do is to inspect network traffic, you can use Wireshark on Genymotion.
If, however, you're on Genymotion and/or need to get Android traffic through a proxy, especially if you're trying to conduct MITM attacks or need to inspect network traffic, read on. This article also assumes you're on either a Mac or Linux computer. Sorry, Windows users.
My requirements for this setup were:
1. inspect, save and playback network traffic
2. capture *all* incoming and outgoing TCP traffic from android devices, not just HTTP connections
3. write code to mangle and manipulate traffic as needed
I tried pretty much every technique I knew about or found on the internet, and found only *one* method that was consistently reliable and worked on Genymotion: a combination of ssh, redsocks, iptables and a python proxy. To use this setup for a physical Android device, I connect it to the same wireless network, and then use ConnectBot to ssh-tunnel into the host computer.
First let's talk about getting this working on Genymotion.
Genymotion runs the Android emulator in a virtualbox instance. Contrary to what many report online, setting the proxy via Settings > Wifi > Manage really isn't enough. It doesn't force all network connections through a particular proxy. Interestingly, setting the proxy at host operating system level didn't work either (in Ubuntu 12.04 at least). Virtualbox seemed to be ignoring this setting and connecting directly. Running tsocks genymotion didn't work either. Neither did using connectbot or proxydroid with global proxy enabled.
Having said that, you should try the Wifi proxy approach first. It's way way easier and may be sufficient for your needs!
In the Genymotion Android emulator
Settings -> Wifi -> Long-press on active network
Select "Modify Network"
Select "Show Advanced Options"
Select "Proxy Settings -> Manual"
Set the proxy hostname to 10.0.3.2 and the proxy port to 8888 (this assumes Charles is listening on port 8888). 10.0.3.2 is Genymotion's special IP for the host machine. Now, if you use the Android browser, requests *should* be visible in Charles. If this is not happening, then something is wrong. If Android browser requests are visible in Charles but not the traffic you're interested in, then read on…
As this link explains really well, virtualbox operates at a lower level than socks, so it's not actually possible to route virtualbox connections directly through a socks proxy. We'll get around this by using iptables and redsocks.
Our final application stack will look like this:
TCP requests from applications > iptables > redsocks > python proxy > ssh tunnel > internet
I'm almost positive it's possible to eliminate one or more steps with the right iptables wizardry, but I had already wasted so much time getting this working, I was just relieved to have a working solution.
First, set up your SSH tunnel with dynamic port forwarding (SOCKS forwarding) on port 7777. (I assume there is an SSH server you can tunnel into. If you don't have one, try the Amazon EC2 free pricing tier.)
Then set up iptables to route all TCP connections to port 5555. Remember: you have to do the SSH connect *before* changing iptables. Once the ssh connection is live, it stays alive (assuming you have keep-alive configured).
Put the following into a script called redsocks-iptables.sh
This iptables script basically redirects all tcp connections to port 5555.
And put this in redsocks-iptables-reset.sh
Use this to reset your iptables to the original state (assuming you don't already have any rules in there).
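The two scripts themselves are not reproduced here. As a rough sketch of what they need to do, the Python below only builds and prints the iptables commands so you can review them before running them as root. The chain name `REDSOCKS` and the list of bypassed address ranges are assumptions borrowed from the usual redsocks recipe, not the original scripts; only port 5555 comes from the text.

```python
REDSOCKS_PORT = 5555  # local port that redsocks listens on, per the text

# Destinations that bypass the redirect (loopback, private and reserved
# ranges). ASSUMPTION: taken from the common redsocks recipe, not from
# the original redsocks-iptables.sh.
BYPASS = ["0.0.0.0/8", "10.0.0.0/8", "127.0.0.0/8", "169.254.0.0/16",
          "172.16.0.0/12", "192.168.0.0/16", "224.0.0.0/4", "240.0.0.0/4"]


def redirect_rules(port=REDSOCKS_PORT, chain="REDSOCKS"):
    """Commands that send all other outgoing TCP traffic to the local port."""
    rules = [["iptables", "-t", "nat", "-N", chain]]
    for net in BYPASS:
        rules.append(["iptables", "-t", "nat", "-A", chain,
                      "-d", net, "-j", "RETURN"])
    rules.append(["iptables", "-t", "nat", "-A", chain, "-p", "tcp",
                  "-j", "REDIRECT", "--to-ports", str(port)])
    rules.append(["iptables", "-t", "nat", "-A", "OUTPUT", "-p", "tcp",
                  "-j", chain])
    return rules


def reset_rules():
    """Flush the nat table and delete its user-defined chains.

    Only safe if you had no nat rules of your own to begin with.
    """
    return [["iptables", "-t", "nat", "-F"],
            ["iptables", "-t", "nat", "-X"]]


for argv in redirect_rules() + reset_rules():
    print(" ".join(argv))
```

Because the loopback range is bypassed, the hops between redsocks, the python proxy and the local SSH SOCKS port are not redirected back into redsocks.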
Now we use redsocks to route these tcp connections to a socks proxy. You may have to install some dependencies such as libevent.
Save this as redsocks.conf
And run redsocks like so:
Now we need to set up the python proxy. This listens on port 6666 (the port redsocks forwards requests to) and forwards requests out to 7777 (the ssh socks port). Download it from http://code.activestate.com/recipes/502293-hex-dump-port-forwarding-network-proxy-server/
Rename it to hexdump.py, and run it like so:
Now you're all set up. When you make *any* request from your Genymotion android device, it should appear as coming from your ssh server. And the hexdump of the data appears in the console from which you ran hexdump.py like so.
Physical Android devices
To inspect traffic from a physical Android device, I had to go 2 steps further:
1. Set up an ssh server locally (on the machine where redsocks is running); this is separate from the destination ssh server
2. Install connectbot and route all network through host (remember to enable the Global proxy option)
Yes, unfortunately, root is required on the device to use the 'Global proxy' option. I read someplace about a non-root workaround using an app called Drony which uses an Android VPN to route traffic to a proxy of your choice, but I didn't get this to work. Maybe you can.
The Python script is pretty easy to decipher. The 2 important functions are:
Here you can intercept the data, inspect it, write it out to file, or modify it as you wish.
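The recipe's own functions are not reproduced here. As a minimal sketch of the same idea, the code below is a plain TCP forwarder with one hook per direction. All names (`hexdump`, `on_outgoing`, `on_incoming`, `pump`, `serve`) are hypothetical and are not taken from the ActiveState recipe; the ports 6666 and 7777 are the ones used in this article.

```python
import socket
import threading


def hexdump(data, width=16):
    """Format bytes as offset / hex / printable-ASCII lines."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("%02x" % b for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("%04x  %-*s  %s" % (offset, width * 3 - 1, hex_part, text))
    return "\n".join(lines)


def on_outgoing(data):
    """Hook for client -> server bytes; return the bytes to forward."""
    print(hexdump(data))
    return data  # modify or log to file here


def on_incoming(data):
    """Hook for server -> client bytes; return the bytes to forward."""
    print(hexdump(data))
    return data  # modify or log to file here


def pump(src, dst, hook):
    """Copy bytes from src to dst, passing each chunk through hook."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(hook(data))
    except OSError:
        pass  # the other direction closed the sockets
    finally:
        src.close()
        dst.close()


def serve(listen_port=6666, target_host="127.0.0.1", target_port=7777):
    """Accept connections on listen_port and relay them to the target."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", listen_port))
    server.listen(5)
    while True:
        client, _ = server.accept()
        upstream = socket.create_connection((target_host, target_port))
        threading.Thread(target=pump, args=(client, upstream, on_outgoing),
                         daemon=True).start()
        threading.Thread(target=pump, args=(upstream, client, on_incoming),
                         daemon=True).start()
```

Calling `serve()` starts the relay. Since each hook returns the bytes that get forwarded, rewriting traffic is just a matter of returning something other than `data`.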
The really nice thing about using the python proxy is how easy it is to write data to file and to write rules to manipulate both incoming and outgoing traffic. Beats wireshark any day of the week. As I'm a little more familiar with Java, I initially got this working using Netty example's HexDumpProxy, but manipulating data in Netty was a gigantic PITA, not to mention the painful edit-compile-run cycle.
The major downside of this approach is not being able to (easily) see which server the data you're manipulating is destined for/coming from: the client and servers are both proxies. There's probably a way around this. If you figure it out, let me know!
If you need to decrypt and inspect SSL connections (I didn't), you can add Charles into the mix, after the python proxy and before the SSH tunnel. Just set the python proxy's outgoing port to the port the Charles socks proxy is listening on, and set Charles' upstream socks proxy to the SSH tunnel port (in our case 7777). You'll find this setting in Charles under Proxy > External Proxy Settings.
If at times one of the links in the chain breaks down, like the ssh connection dying, you may need to kill redsocks, reset iptables and start over. I didn't have to do this too often.
Posted by Kelvin on 08 Oct 2013 | Tagged as: programming
If you've followed all the instructions in the Tomcat docs for enabling UTF-8 support (http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8) and you still run into UTF-8 issues, and your webapp involves reading and displaying the contents of files, give this a whirl.
In catalina.sh, either at the top of the file or after the long comments, insert this:
Posted by Kelvin on 13 Sep 2013 | Tagged as: programming
Just discovered Guava's Table data structure. Whoa..!
weightedGraph.column(v3); // returns a Map mapping v1 to 20, v2 to 5
I've just spent the last couple days wrapping my head around implementing Latent Semantic Analysis, and after wading through a number of research papers and quite a bit of linear algebra, I've finally emerged on the other end, and thought I'd write something about it to lock the knowledge in. I'll do my best to keep it non-technical, yet accurate.
Step One – Build the term-document matrix
Input : documents
Output : term-document matrix
Latent Semantic Analysis has the same starting point as most Information Retrieval algorithms : the term-document matrix. Specifically, columns are documents, and rows are terms. If a document contains a term, then the value of that row-column is 1, otherwise 0.
If you start with a corpus of documents, or a database table or something, then you'll need to index this corpus into this matrix. That means lowercasing, removing stopwords, maybe stemming, etc. The typical Lucene/Solr analyzer chain, basically.
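Here is a minimal sketch of building the binary term-document matrix in plain Python. The stopword list and the sample documents are made up for illustration, and stemming is left out (which is why "cat" and "cats" end up as separate rows).

```python
import re

# Tiny illustrative stopword list; a real analyzer chain would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "on", "in"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords (no stemming)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def term_document_matrix(documents):
    """Return (terms, matrix): rows are terms, columns are documents, cells 0/1."""
    doc_terms = [set(tokenize(d)) for d in documents]
    terms = sorted(set().union(*doc_terms))
    matrix = [[1 if term in dt else 0 for dt in doc_terms] for term in terms]
    return terms, matrix


docs = ["The cat sat on the mat", "The dog sat", "Cats and dogs"]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(term, row)
```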
Step Two – Decompose the matrix
Input : term-document matrix
Output : 3 matrices, U, S and V
Apply Singular Value Decomposition (SVD) to the matrix. This is the computationally expensive step of the whole operation.
SVD is a fairly technical concept and quite an involved process (if you're doing it by hand). If you do a bit of googling, you're going to find all kinds of mathematical terms related to this, like matrix decomposition, eigenvalues, eigenvectors, PCA (principal component analysis), random projection etc.
The 5 second explanation of this step is that the original term-document matrix gets broken down into 3 simpler matrices: a term-term matrix (also known as U, or the left matrix), a matrix comprising the singular values (also known as S), and a document-document matrix (also known as V, or the right matrix).
Something which usually also happens in the SVD step for LSA, and which is important, is rank reduction. In this context, rank reduction means that the original term-document matrix gets somehow "factorized" into its constituent factors, and the k most significant factors or features are retained, where k is some number greater than zero and less than the original size of the term-document matrix. For example, a rank 3 reduction means that the 3 most significant factors are retained. This is important for you to know because most LSA/LSI applications will ask you to specify the value of k, meaning the application wants to know how many features you want to retain.
What's actually happening in this SVD rank reduction is basically an approximation of the original term-document matrix, allowing you to compare features in a fast and efficient manner. Smaller k values generally run faster and use less memory, but are less accurate. Larger k values are more "true" to the original matrix, but require longer to compute. Note: this statement may not be true of the stochastic SVD implementations (involving random projection or some other method), where an increase in k doesn't lead to a linear increase in running time, but more like a log(n) increase in running time.
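In symbols, the decomposition and its rank reduction can be written as below. The names A (the term-document matrix), m (number of terms), n (number of documents) and the sigmas (singular values) are notation introduced here for illustration; U, S, V and k are as in the text.

```latex
% Full decomposition: A is m x n (m terms, n documents)
A = U S V^{T}, \qquad
S = \operatorname{diag}(\sigma_1, \sigma_2, \dots), \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0

% Rank-k reduction: keep only the k largest singular values,
% with U_k of size m x k, S_k of size k x k, V_k of size n x k
A \approx A_k = U_k S_k V_k^{T}
```

Among all matrices of rank k, A_k is the closest to A in the least-squares (Frobenius norm) sense, which is why keeping the k largest singular values counts as keeping the "most significant" factors.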
Step Three – Build query vector
Input : query string
Output : query vector
From here, we're on our downhill stretch. The query string needs to be expressed in terms that allow for searching.
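One commonly used way to do this is "folding in": build a term vector q for the query (1 for each term the query contains, 0 otherwise) and project it into the reduced space. The post does not spell out its exact formula, so treat the following as the standard textbook formulation rather than this implementation's. U_k and S_k denote the rank-k truncated versions of U and S from Step Two.

```latex
% q: m-dimensional query term vector (m = number of terms)
% \hat{q}: k-dimensional query vector in the reduced space
\hat{q} = S_k^{-1} U_k^{T} q

% Document j is represented the same way: folding in column j of A_k
% gives the j-th column of V_k^T
\hat{d}_j = S_k^{-1} U_k^{T} (A_k e_j) = V_k^{T} e_j
```

Because queries and documents now live in the same k-dimensional space, they can be compared directly.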
Step Four – Compute cosine similarity
Input : query vector, document matrix
Output : document scores
To obtain how similar each document is to the query, aka the doc score, we have to go through each document vector in the matrix and calculate its cosine similarity to the query vector. The higher the score, the closer the match.
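The scoring loop can be sketched in plain Python as below. It uses cosine similarity (one minus the cosine distance), so higher scores mean closer matches. The function names and the toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_documents(query_vector, document_vectors):
    """Return (score, document index) pairs, best match first."""
    scores = [(cosine_similarity(query_vector, d), i)
              for i, d in enumerate(document_vectors)]
    return sorted(scores, reverse=True)


# Toy 2-dimensional vectors standing in for reduced-space representations.
query = [1.0, 0.0]
documents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = score_documents(query, documents)
print(ranked)
```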
Posted by Kelvin on 30 Mar 2013 | Tagged as: programming