I've just spent the last couple days wrapping my head around implementing Latent Semantic Analysis, and after wading through a number of research papers and quite a bit of linear algebra, I've finally emerged on the other end, and thought I'd write something about it to lock the knowledge in. I'll do my best to keep it non-technical, yet accurate.
Step One – Build the term-document matrix
Input : documents
Output : term-document matrix
Latent Semantic Analysis has the same starting point as most Information Retrieval algorithms : the term-document matrix. Specifically, columns are documents, and rows are terms. If a document contains a term, then the value of that row-column is 1, otherwise 0.
If you start with a corpus of documents, or a database table or something, then you'll need to index this corpus into this matrix. Meaning, lowercasing, removing stopwords, maybe stemming etc. The typical Lucene/Solr analyzer chain, basically.
Step Two – Decompose the matrix
Input : term-document matrix
Output : 3 matrices, U, S and V
Apply Singular Value Decomposition (SVD) to the matrix. This is the computationally expensive step of the whole operation.
SVD is a fairly technical concept and quite an involved process (if you doing it by hand). If you do a bit of googling, you're going to find all kinds of mathematical terms related to this, like matrix decomposition, eigenvalues, eigenvectors, PCA (principal component analysis), random projection etc.
The 5 second explanation of this step is that the original term-document matrix gets broken down into 3 simpler matrices: a term-term matrix (also known as U, or the left matrix), a matrix comprising of the singular values (also known as S), and a document-document matrix (also known as V, or the right matrix).
Something which usually also happens in the SVD step for LSA, and which is important, is rank reduction. In this context, rank reduction means that the original term-document matrix gets somehow "factorized" into its constituent factors, and the k most significant factors or features are retained, where k is some number greater than zero and less than the original size of the term-document matrix. For example, a rank 3 reduction means that the 3 most significant factors are retained. This is important for you to know because most LSA/LSI applications will ask you to specify the value of k, meaning the application wants to know how many features you want to retain.
So what's actually happening in this SVD rank reduction, is basically an approximation of the original term-document matrix, allowing you to compare features in a fast and efficient manner. Smaller k values generally run faster and use less memory, but are less accurate. Larger k values are more "true" to the original matrix, but require longer to compute. Note: this statement may not be true of the stochastic SVD implementations (involving random projection or some other method), where an increase in k doesn't lead to a linear increase in running time, but more like a log(n) increase in running time.
Step Three – Build query vector
Input : query string
Output : query vector
From here, we're on our downhill stretch. The query string needs to be expressed in terms that allow for searching.
Step Four – Compute cosine distance
Input : query vector, document matrix
Output : document scores
To obtain how similar each document is to the query, aka the doc score, we have to go through each document vector in the matrix and calculate its cosine distance to the query vector.