Joseph Wilk

Things with code, creativity and computation.

Latent Semantic Analysis in Ruby

I’ve had lots of requests for a Ruby version to follow up my Latent Semantic Analysis in Python article, so I’ve rewritten the code and article for Ruby. I wrote LSA from scratch this time, test driven, so it has some subtle differences from the Python version.

What is LSA?

Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it looks at all the documents as a whole, and the terms within them, to identify relationships.

An example of LSA: using a search engine, search for “ruby”.

Documents are returned which do not contain the search term “ruby” but do contain terms like “rails”.

LSA has identified a latent relationship: “ruby” is semantically close to “rails”.

How does it work?

Given a set of documents, each word in those documents represents a point in the semantic space. LSA uses a mathematical technique called Singular Value Decomposition (SVD) to take the documents/words represented as a matrix and produce a reduced approximation of it. Doing so reduces the overall noise in the semantic space, bringing words closer together. Hence, after applying LSA, some words share similar points in the semantic space: they are semantically similar.
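
In matrix terms (this is just the standard SVD formulation, written with the same names the code later in this article uses): the term-document matrix is factored into three matrices, and the approximation is rebuilt from a reduced sigma:

      matrix       = u * sigma * vt         # the exact factorisation
      matrix_prime = u * sigma_prime * vt   # sigma_prime: sigma with its smallest values zeroed

matrix_prime is the noise-reduced approximation that brings semantically similar words together.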

These groups of semantically similar words form concepts and those concepts in turn relate to documents.

Term a <----------->
Term b <-----------> Concept d <---------> Document e
Term c <----------->

Background Reading on LSA

There are some very good papers which describe LSA in detail, and others covering the SVD technique it builds on.

LSA Algorithm

This is an implementation of LSA in Ruby (v1.8.7). I made use of the linalg project, which, while not as mature as Python’s SciPy, does the job. Installing linalg can be rather challenging as it relies on LAPACK, which in turn relies on some Fortran code.

1 Create the term-document matrix

We take a set of documents and map them to a vector space matrix. Each column represents a document and each row represents a term.

         Document n  Document n+1
term n    [   1           0 ]
term n+1  [   0           1 ]

Read more about the Vector Space Model at Wikipedia.

We use the DMatrix class from Linalg to represent our term-document matrix. This is like the Ruby standard library Matrix class but with some powerful matrix functions built in.
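
Here is a minimal sketch of building such a matrix by hand (toy data for illustration; I am assuming Linalg’s bracket constructor, which takes rows, and a num_rows accessor alongside the num_columns used later):

      require 'rubygems'
      require 'linalg'

      # Rows are terms, columns are documents -- the same layout as above.
      #                                        D1   D2
      term_document_matrix = Linalg::DMatrix[[1.0, 0.0],  # term: cat
                                             [0.0, 1.0],  # term: dog
                                             [1.0, 1.0]]  # term: pet

      puts term_document_matrix.num_rows     #=> 3 (terms)
      puts term_document_matrix.num_columns  #=> 2 (documents)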

2 tf-idf Transform

tf => Term frequency

idf => Inverse document frequency.

Apply the tf-idf transform to the term-document matrix. This generally helps improve results with LSA.
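
The variant implemented below normalises each raw term count by the document’s total term count, then scales it by the log of how rare the term is across the corpus:

      weight(term, doc) = (count of term in doc / total term count of doc)
                          * log(number of documents / number of documents containing term)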

Github source file: tf_idf_transform.rb

      def self.transform(matrix)
        number_of_documents = matrix.num_columns
        @@number_of_documents_with_term = []

        matrix.columns.each_with_index do |document, column_index|
          # Total number of term occurrences in this document.
          document_term_total = document.rows.inject(0.0) {|word_sum, word_count| word_sum + word_count.to_f }

          document.rows.each_with_index do |term_weight, row_index|
            unless term_weight.to_f == 0.0
              # tf-idf: term frequency within the document, scaled by the
              # log of the inverse document frequency.
              matrix[row_index, column_index] = (term_weight / document_term_total) *
                Math.log((number_of_documents / number_of_documents_with_term(row_index, matrix).to_f).abs)
            end
          end
        end
        matrix
      end

      # Helper not shown in the original listing; reconstructed to match the
      # call above. Counts (and memoizes) how many documents contain the
      # term in the given row at least once.
      def self.number_of_documents_with_term(row_index, matrix)
        @@number_of_documents_with_term[row_index] ||=
          (0...matrix.num_columns).inject(0) do |count, column_index|
            matrix[row_index, column_index].to_f == 0.0 ? count : count + 1
          end
      end
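
Calling it would look something like this (assuming the method lives in a TfIdfTransform module named after its source file; I have not verified that against the current rsemantic layout):

      weighted_matrix = TfIdfTransform.transform(term_document_matrix)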

3 Latent Semantic Transform

Github source file: lsa_transform.rb

      def transform(matrix, number_of_dimensions_to_reduce = 1)
        columns = matrix.num_columns

        if number_of_dimensions_to_reduce <= columns # it's a valid reduction
          u, sigma, vt = matrix.singular_value_decomposition

          sigma_prime = reduce_dimensions(number_of_dimensions_to_reduce, sigma)

          matrix_prime = u * sigma_prime * vt
        else
          raise Exception, "dimension reduction cannot be greater than %s" % columns
        end

        matrix_prime
      end
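
In the worked example later in this article the two smallest coefficients are removed, which corresponds to a call like this (tf_idf_matrix is just a stand-in name for the matrix produced by the previous step):

      matrix_prime = transform(tf_idf_matrix, 2)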

The LSA transform consists of three stages. Let’s take a closer look at each:

1. Singular Value Decomposition

Read more about SVD at Wikipedia: http://en.wikipedia.org/wiki/Singular_value_decomposition

Linalg’s DMatrix has a singular_value_decomposition method, which saves us a lot of work!

 u, sigma, vt = matrix.singular_value_decomposition

2. Reduce the dimensions of Sigma

We generally delete the smallest coefficients in the diagonal matrix Sigma to produce Sigma’.

        def reduce_dimensions(number_of_dimensions_to_reduce, matrix)
          for diagonal_index in dimensions_to_be_reduced(matrix, number_of_dimensions_to_reduce)
            matrix[diagonal_index, diagonal_index] = 0
          end
          matrix
        end

        # Helper reconstructed to match the call above: SVD returns singular
        # values in descending order, so the smallest are the last entries.
        def dimensions_to_be_reduced(matrix, number_of_dimensions_to_reduce)
          (matrix.num_columns - number_of_dimensions_to_reduce)...matrix.num_columns
        end

The reduction of the dimensions of Sigma combines some dimensions such that they depend on more than one term. The number of coefficients deleted can depend on the corpus used. It should be large enough to fit the real structure in the data, but small enough that noise or unimportant details are not modelled.

The real difficulty and weakness of LSA is knowing how many dimensions to remove. There is no exact method for finding the right number. Common approaches use the L2-norm or Frobenius norm to judge how much of the original matrix the reduced approximation retains.
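
As an illustration only (this helper is my own sketch, not part of rsemantic): one such heuristic keeps enough of the leading singular values to retain a fixed share of the squared Frobenius norm and treats the rest as deletable noise:

      # Sketch of a Frobenius-norm heuristic: how many of the (descending)
      # singular values to keep so that their squared sum retains, e.g., 90%
      # of the total. Everything after that index is a candidate for removal.
      def dimensions_to_keep(singular_values, energy_to_retain = 0.9)
        total_energy = singular_values.inject(0.0) { |sum, s| sum + s * s }
        running_energy = 0.0
        singular_values.each_with_index do |s, index|
          running_energy += s * s
          return index + 1 if running_energy >= energy_to_retain * total_energy
        end
        singular_values.length
      end

      dimensions_to_keep([2.1, 1.5, 0.4, 0.1]) #=> 2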

3. Calculate the Product with New Sigma’

Finally we calculate:

 matrix_prime = u * sigma_prime * vt

LSA In Action – rSemantic Gem

First install the rsemantic gem from Github:

gem sources -a http://gems.github.com
sudo gem install josephwilk-rsemantic

Then we can run an example:

require 'rubygems'
require 'semantic'

D1 = "The cat in the hat disabled"
D2 = "A cat is a fine pet ponies."
D3 = "Dogs and cats make good pets"
D4 = "I haven't got a hat."

#verbose mode so it shows us all the different stages
search = Semantic::Search.new([D1, D2, D3, D4], :verbose => true)

We start with our term-document frequency matrix, which is generated by creating a Vector Space Search with four documents (D1-D4):

           D1    D2    D3    D4
hat     [ +1.00 +0.00 +0.00 +1.00 ]
dog     [ +0.00 +0.00 +1.00 +0.00 ]
cat     [ +1.00 +1.00 +1.00 +0.00 ]
make    [ +0.00 +0.00 +1.00 +0.00 ]
good    [ +0.00 +0.00 +1.00 +0.00 ]
pet     [ +0.00 +1.00 +1.00 +0.00 ]
fine    [ +0.00 +1.00 +0.00 +0.00 ]
poni    [ +0.00 +1.00 +0.00 +0.00 ]
disabl  [ +1.00 +0.00 +0.00 +0.00 ]

Apply tf-idf transform:

           D1    D2    D3    D4
hat     [ +0.23 +0.00 +0.00 +0.69 ]
dog     [ +0.00 +0.00 +0.28 +0.00 ]
cat     [ +0.10 +0.07 +0.06 +0.00 ]
make    [ +0.00 +0.00 +0.28 +0.00 ]
good    [ +0.00 +0.00 +0.28 +0.00 ]
pet     [ +0.00 +0.17 +0.14 +0.00 ]
fine    [ +0.00 +0.35 +0.00 +0.00 ]
poni    [ +0.00 +0.35 +0.00 +0.00 ]
disabl  [ +0.46 +0.00 +0.00 +0.00 ]
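
As a quick sanity check of the transform (using the formula given earlier, with Math.log being the natural log; D1 reduces to the three terms hat, cat and disabl after stop word removal and stemming): the weight of ‘hat’ in D1 is (1/3) * log(4/2) ≈ 0.23, and the weight of ‘disabl’ in D1 is (1/3) * log(4/1) ≈ 0.46, matching the first column above.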

Perform SVD and reduce Sigma’s dimensions (removing the 2 smallest coefficients):

           D1    D2    D3    D4
hat     [ +0.34 -0.01 -0.01 +0.63 ]
dog     [ +0.01 -0.00 +0.28 -0.01 ]
cat     [ +0.03 +0.08 +0.06 +0.04 ]
make    [ +0.01 -0.00 +0.28 -0.01 ]
good    [ +0.01 -0.00 +0.28 -0.01 ]
pet     [ +0.01 +0.17 +0.14 -0.01 ]
fine    [ +0.02 +0.35 -0.00 -0.01 ]
poni    [ +0.02 +0.35 -0.00 -0.01 ]
disabl  [ +0.11 +0.02 +0.02 +0.19 ]

Note that the word ‘disabl’, despite not appearing in D4, now has a weighting in that document. LSA has inferred this latent link through co-occurrence: ‘disabl’ appears alongside ‘hat’ in D1, and ‘hat’ also appears in D4.

Dependencies

Linalg is dependent on LAPACK which can be a challenge to install.

This article on setting up linalg on Mac OS X helped: http://www.commonmediainc.com/2008/03/24/building-lapack-and-rubys-linalg-on-mac-os-x/

Problems

LSA assumes term frequencies follow a Normal distribution, where in practice a Poisson distribution has actually been observed.

Source Code

The source code initially shown here has gone through a lot of refactoring and has become much more than just LSA. You can get the full project with source code from github:

git clone git://github.com/josephwilk/rsemantic.git

And you can read about rSemantic on Github: http://josephwilk.github.com/rsemantic
