I’ve had lots of requests for a Ruby version to follow up my Latent Semantic Analysis in Python article, so I’ve rewritten the code and article for Ruby. This time I wrote the LSA implementation from scratch, test-driven, so it has some subtle differences from the Python version.
What is LSA?
Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it looks at all the documents as a whole and the terms within them to identify relationships.
An example of LSA: using a search engine, search for “ruby”.
Documents are returned which do not contain the search term “ruby” but do contain terms like “rails”.
LSA has identified a latent relationship: “ruby” is semantically close to “rails”.
How does it work?
Given a set of text documents, each word in those documents represents a point in the semantic space. LSA uses a mathematical technique called Singular Value Decomposition to take the documents/words represented as a matrix and produce a reduced approximation of that matrix. In doing this it reduces the overall noise in the semantic space, bringing words closer together. Hence, after applying LSA, some words share similar points in the semantic space: they are semantically similar.
These groups of semantically similar words form concepts and those concepts in turn relate to documents.
Term a <----------->
Term b <-----------> Concept d <---------> Document e
Term c <----------->
Background Reading on LSA
There are some very good papers which describe LSA in detail:
An introduction to LSA: http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
Creating your own LSA space: http://www.andrew.cmu.edu/user/jquesada/pdf/bookSpacesRev1.pdf
Latent Semantic Analysis: http://en.wikipedia.org/wiki/Latent_semantic_indexing
And on SVD
SVD recommendation system in ruby: http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/
MIT video lectures on SVD: http://ocw.mit.edu/OcwWeb/Mathematics/18-06Spring-2005/VideoLectures/index.htm
LSA Algorithm
This is an implementation of LSA in Ruby (v1.8.7). I made use of the linalg project, which, while not as mature as Python’s SciPy, does the job. Installing linalg can be rather challenging as it relies on LAPACK, which in turn relies on some Fortran code.
1 Create the term-document matrix
We take a set of documents and map them to a vector space matrix. Each column represents a document and each row represents a term.
             Document n   Document n+1
term n     [      1             0     ]
term n+1   [      0             1     ]
Read more about the Vector Space Model at Wikipedia.
We use the DMatrix class from linalg to represent our term-document matrix. It is like the Matrix class in the Ruby standard library, but with some powerful matrix functions built in.
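For illustration, here is a minimal sketch of building such a matrix with DMatrix. The two-document corpus is made up purely for the example:

require 'linalg'

# A hypothetical two-document corpus, purely for illustration.
documents = ["cat cat hat", "fine pet"]

# Rows are terms, columns are documents.
terms = documents.map { |doc| doc.split }.flatten.uniq.sort

rows = terms.map do |term|
  documents.map { |doc| doc.split.select { |word| word == term }.size.to_f }
end

term_document_matrix = Linalg::DMatrix[*rows]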
2 tf-idf Transform
tf => Term frequency
idf => Inverse document frequency.
Apply the tf-idf transform to the term-document matrix. This generally tends to improve results with LSA.
Github source file: tf_idf_transform.rb
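The original listing is not reproduced here, but the transform amounts to the following sketch, where tf is a term’s count scaled by the document’s total word count and idf is the log of the number of documents over the number of documents containing the term. The method name and structure are illustrative; see the Github source for the real implementation:

require 'linalg'

# A sketch of tf-idf over a term-document DMatrix
# (rows = terms, columns = documents).
def tf_idf_transform(matrix)
  documents = matrix.num_columns
  terms     = matrix.num_rows

  # Total number of words in each document (column sums).
  words_per_document = (0...documents).map do |col|
    (0...terms).inject(0.0) { |sum, row| sum + matrix[row, col] }
  end

  # Number of documents in which each term occurs.
  documents_with_term = (0...terms).map do |row|
    (0...documents).select { |col| matrix[row, col] > 0 }.size
  end

  rows = (0...terms).map do |row|
    (0...documents).map do |col|
      tf = matrix[row, col] / words_per_document[col]
      tf * Math.log(documents.to_f / documents_with_term[row])
    end
  end

  Linalg::DMatrix[*rows]
end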
3 Latent Semantic Transform
Github source file: lsa_transform.rb
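The original listing is not reproduced here; the sketch below shows roughly what the transform does, assuming linalg returns the factors in U, Sigma, Vt order. The reduce_dimensions helper is defined under stage 2 below:

# A sketch of the whole LSA transform; the number of coefficients to
# remove is a tuning choice (see stage 2).
def lsa_transform(matrix, dimensions_to_remove = 2)
  # Stage 1: decompose the term-document matrix.
  u, sigma, vt = matrix.singular_value_decomposition

  # Stage 2: zero out the smallest singular values.
  sigma_reduced = reduce_dimensions(sigma, dimensions_to_remove)

  # Stage 3: recombine into a rank-reduced approximation.
  u * sigma_reduced * vt
end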
The LSA transform consists of 3 stages. Let’s take a closer look at each of them:
1. Singular Value Decomposition
SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition
Linalg’s DMatrix has a singular_value_decomposition method, which saves us a lot of work!
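Assuming the factors come back in U, Sigma, Vt order, this stage is a one-liner:

# matrix is the tf-idf weighted term-document matrix from step 2.
u, sigma, vt = matrix.singular_value_decomposition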
2. Reduce the dimensions of Sigma
We generally delete the smallest coefficients in the diagonal matrix Sigma to produce Sigma’.
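A sketch of the reduction, assuming sigma comes back from linalg as a diagonal matrix with the singular values in descending order, so the smallest coefficients sit at the end of the diagonal:

# Zero the dimensions_to_remove smallest diagonal coefficients of Sigma,
# producing Sigma'. (Illustrative; see lsa_transform.rb for the real code.)
def reduce_dimensions(sigma, dimensions_to_remove)
  keep = sigma.num_rows - dimensions_to_remove
  rows = (0...sigma.num_rows).map do |row|
    (0...sigma.num_columns).map do |col|
      row == col && row < keep ? sigma[row, col] : 0.0
    end
  end
  Linalg::DMatrix[*rows]
end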
Reducing the dimensions of Sigma combines some dimensions such that they depend on more than one term. The number of coefficients deleted can depend on the corpus used. The number of dimensions kept should be large enough to fit the real structure in the data, but small enough that noise or unimportant details are not modelled.
The real difficulty and weakness of LSA is knowing how many dimensions to remove. There is no exact method for finding the right number. Commonly used approaches are based on the L2-norm or Frobenius norm of the matrix.
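As an illustration of the norm-based approach: the squared Frobenius norm of a matrix equals the sum of its squared singular values, so one heuristic is to keep enough singular values to preserve a chosen share of it. The 90% threshold here is an arbitrary example:

# How many singular values to keep to cover `share` of the squared
# Frobenius norm of the original matrix.
def dimensions_to_keep(singular_values, share = 0.9)
  total   = singular_values.inject(0.0) { |sum, s| sum + s * s }
  running = 0.0
  singular_values.each_with_index do |s, i|
    running += s * s
    return i + 1 if running / total >= share
  end
  singular_values.size
end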
3. Calculate the Product with New Sigma’
Finally we calculate:
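Using the factors from stage 1 and the reduced Sigma' from stage 2:

# U * Sigma' * Vt gives the rank-reduced approximation of the
# original term-document matrix.
lsa_matrix = u * sigma_reduced * vt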
LSA In Action – rSemantic Gem
First install the rsemantic gem from Github
gem sources -a http://gems.github.com
sudo gem install josephwilk-rsemantic
Then we can run an example:
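The original example listing is not reproduced here, but the sketch below shows the general shape of driving the gem. The four documents are reconstructed from the stemmed terms in the matrices that follow, and the require path, class and option names are assumptions; check the rSemantic README for the exact API:

require 'rubygems'
require 'rsemantic' # require path assumed

# Documents reconstructed to match the stemmed terms
# (hat, dog, cat, poni, disabl, ...) in the matrices below.
documents = ["The cat in the hat disabled",
             "A cat is a fine pet ponies.",
             "Dogs and cats make good pets.",
             "I haven't got a hat."]

# Class and option names assumed; consult the rSemantic documentation.
search = Semantic::Search.new(documents, :transforms => [:TFIDF, :LSA])
puts search.search("cat")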
We start with our model’s term frequency matrix, which is generated by creating a Vector Space Search with four documents (D1-D4):
D1 D2 D3 D4
hat [ +1.00 +0.00 +0.00 +1.00 ]
dog [ +0.00 +0.00 +1.00 +0.00 ]
cat [ +1.00 +1.00 +1.00 +0.00 ]
make [ +0.00 +0.00 +1.00 +0.00 ]
good [ +0.00 +0.00 +1.00 +0.00 ]
pet [ +0.00 +1.00 +1.00 +0.00 ]
fine [ +0.00 +1.00 +0.00 +0.00 ]
poni [ +0.00 +1.00 +0.00 +0.00 ]
disabl [ +1.00 +0.00 +0.00 +0.00 ]
Apply tf-idf transform:
D1 D2 D3 D4
hat [ +0.23 +0.00 +0.00 +0.69 ]
dog [ +0.00 +0.00 +0.28 +0.00 ]
cat [ +0.10 +0.07 +0.06 +0.00 ]
make [ +0.00 +0.00 +0.28 +0.00 ]
good [ +0.00 +0.00 +0.28 +0.00 ]
pet [ +0.00 +0.17 +0.14 +0.00 ]
fine [ +0.00 +0.35 +0.00 +0.00 ]
poni [ +0.00 +0.35 +0.00 +0.00 ]
disabl [ +0.46 +0.00 +0.00 +0.00 ]
Perform SVD and reduce Sigma’s dimensions (removing the 2 smallest coefficients):
D1 D2 D3 D4
hat [ +0.34 -0.01 -0.01 +0.63 ]
dog [ +0.01 -0.00 +0.28 -0.01 ]
cat [ +0.03 +0.08 +0.06 +0.04 ]
make [ +0.01 -0.00 +0.28 -0.01 ]
good [ +0.01 -0.00 +0.28 -0.01 ]
pet [ +0.01 +0.17 +0.14 -0.01 ]
fine [ +0.02 +0.35 -0.00 -0.01 ]
poni [ +0.02 +0.35 -0.00 -0.01 ]
disabl [ +0.11 +0.02 +0.02 +0.19 ]
Note that the word ‘disabl’, despite not occurring in D4, now has a weighting in that document.
Dependencies
Linalg is dependent on LAPACK which can be a challenge to install.
This article on setting up linalg on Mac OS X helped: http://www.commonmediainc.com/2008/03/24/building-lapack-and-rubys-linalg-on-mac-os-x/
Problems
LSA assumes terms follow a Normal distribution, whereas in practice a Poisson distribution has actually been observed in term frequencies.
Source Code
The source code initially shown here has gone through a lot of refactoring and has become much more than just LSA. You can get the full project source code from Github:
git clone git://github.com/josephwilk/rsemantic.git
And you can read about rSemantic on Github: http://josephwilk.github.com/rsemantic