Saturday, January 18, 2014

Term-Document matrices and Latent Semantic Analysis

Latent Semantic Analysis, or LSA, is a technique used in natural language processing to find similar words or documents based on a term-document matrix.

Term-Document matrix
A term-document matrix is a matrix which tells you how often a given word is found in a given document. Here is an example taken from the paper "Indexing by Latent Semantic Analysis" by Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman:

           D1  D2  D3  D4  D5  D6  D7  D8  D9
human       1   0   0   1   0   0   0   0   0
interface   1   0   1   0   0   0   0   0   0
computer    1   1   0   0   0   0   0   0   0
user        0   1   1   0   1   0   0   0   0
system      0   1   1   2   0   0   0   0   0
response    0   1   0   0   1   0   0   0   0
time        0   1   0   0   1   0   0   0   0
EPS         0   0   1   1   0   0   0   0   0
survey      0   1   0   0   0   0   0   0   1
trees       0   0   0   0   0   1   1   1   0
graph       0   0   0   0   0   0   1   1   1
minors      0   0   0   0   0   0   0   1   1

In this example, there are 9 nameless documents (they could be websites, Wikipedia articles, anything) and 12 selected terms. Each number is the number of times the given term is found in the given document. For example the term "human" is found once in D1 but never in D2.

Term-document matrices are neat because the meaning of both the words and the documents can be characterized by using a row or a column from the matrix. For example the row belonging to the term "human" is [1, 0, 0, 1, 0, 0, 0, 0, 0]. This says something about its meaning because it occurs in D1 and D4: other words which are found in the same documents are likely to be somehow related or similar to it. On the other hand the column belonging to the document D1 is [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]. This says something about its meaning as well because it contains only the words "human", "interface" and "computer": other documents which contain the same words are likely to be somehow related or similar to it.

This means that by comparing their respective rows or columns, that is, their vectors, you can determine how similar two words or documents are. You can even create a search engine which searches the documents by keywords, by treating the keyword list as a tiny document and comparing its vector with the vectors of the documents.

How can you compare vectors? You can use a vector similarity function such as cosine similarity which, given two vectors, gives you a number between -1 and 1 depending on how similar they are.
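Here is a minimal sketch of cosine similarity in Python 2.7 with NumPy (the function name cosine_similarity is my own choice for this example; the cosine is the dot product of the two vectors divided by the product of their lengths):

import numpy

def cosine_similarity(v, w):
    # cos(angle) = (v . w) / (|v| * |w|)
    return numpy.dot(v, w) / (numpy.linalg.norm(v) * numpy.linalg.norm(w))

# Two rows from the term-document matrix above.
human     = numpy.array([1, 0, 0, 1, 0, 0, 0, 0, 0], dtype=float)
interface = numpy.array([1, 0, 1, 0, 0, 0, 0, 0, 0], dtype=float)

print cosine_similarity(human, interface)  # prints 0.5 (they share D1)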

Unfortunately comparing these raw vectors does not always work well, because words in isolation do not discriminate meaning well enough. This is because of synonymous words (different words which mean the same thing, such as "couch" and "sofa") and polysemous words (the same word meaning different things, such as "bank" in "bank account" and "river bank"). This means that in the above matrix, the words "human" and "user", which should be similar, have the following vectors:
[1, 0, 0, 1, 0, 0, 0, 0, 0]
[0, 1, 1, 0, 1, 0, 0, 0, 0]

These vectors do not hint at a similarity between the two words; in fact their cosine similarity is 0 since they share no documents. However, although they appear in different documents, they co-occur with other words which are shared between these documents. For example both "human" and "user" occur in documents which also contain the words "computer" and "interface". This indirect co-occurrence can be harnessed by using LSA.

Latent Semantic Analysis
LSA uses linear algebra to bring out these latent (hidden) co-occurrences. In particular, it uses Singular Value Decomposition (SVD). We will not go into how this operation is computed but will instead assume that there is a library which does it for us; in particular we will be using Python's NumPy. SVD basically decomposes a matrix into 3 matrices which, when multiplied together, give the original matrix again. These matrices are called U, S and V. Note that NumPy returns the singular values of S as a flat array, which is why the code below turns it into a diagonal matrix with numpy.diag, and it returns the third matrix already transposed, which is why multiplying U, S and V in that order reproduces the original. Here is the Python 2.7 code with NumPy which does this:

import numpy

# The term-document matrix from above (rows are terms, columns are documents).
a = numpy.matrix("""
1 0 0 1 0 0 0 0 0;
1 0 1 0 0 0 0 0 0;
1 1 0 0 0 0 0 0 0;
0 1 1 0 1 0 0 0 0;
0 1 1 2 0 0 0 0 0;
0 1 0 0 1 0 0 0 0;
0 1 0 0 1 0 0 0 0;
0 0 1 1 0 0 0 0 0;
0 1 0 0 0 0 0 0 1;
0 0 0 0 0 1 1 1 0;
0 0 0 0 0 0 1 1 1;
0 0 0 0 0 0 0 1 1
""")

# Decompose: s is a flat array of singular values sorted in
# decreasing order, and the third matrix is already transposed.
(U, s, V) = numpy.linalg.svd(a, full_matrices=False)
# Put the singular values on the diagonal of a square matrix S.
S = numpy.diag(s)

print U
print S
print V

U
-0.221350778443 -0.113179617367 0.28895815444 -0.414750740379 -0.106275120934 -0.340983323615 -0.522657771461 0.0604501376207 0.406677508728
-0.197645401447 -0.0720877787583 0.135039638689 -0.552239583656 0.281768938949 0.495878011111 0.0704234411738 0.00994003720777 0.108930265566
-0.24047022609 0.0431519520879 -0.164429078921 -0.594961818064 -0.106755285123 -0.254955129767 0.302240236 -0.0623280149762 -0.492444363678
-0.403598863494 0.0570702584462 -0.337803537502 0.0991137294933 0.331733717586 0.384831917013 -0.00287217529118 0.000390504202226 -0.0123293478854
-0.644481152473 -0.167301205681 0.361148151433 0.333461601349 -0.158954979122 -0.206522587934 0.165828574516 -0.0342720233321 -0.270696289374
-0.265037470035 0.107159573274 -0.425998496887 0.073812192192 0.0803193764074 -0.169676388586 -0.282915726531 0.0161465471957 0.053874688728
-0.265037470035 0.107159573274 -0.425998496887 0.073812192192 0.0803193764074 -0.169676388586 -0.282915726531 0.0161465471957 0.053874688728
-0.300828163915 -0.141270468264 0.330308434522 0.188091917879 0.114784622474 0.272155276471 -0.0329941101555 0.0189980144259 0.165339169935
-0.205917861257 0.273647431063 -0.177597017072 -0.0323519366242 -0.537150003306 0.0809439782143 0.466897525101 0.0362988295337 0.579426105711
-0.0127461830383 0.490161792453 0.231120154886 0.0248019985275 0.594169515589 -0.392125064311 0.28831746071 -0.254567945176 0.225424068667
-0.0361358490222 0.62278523454 0.22308636259 0.000700072121447 -0.0682529381996 0.11490895384 -0.159575476506 0.681125438043 -0.231961226249
-0.0317563289336 0.450508919351 0.141115163889 -0.00872947061057 -0.30049511003 0.277343396711 -0.339495286197 -0.678417878879 -0.182534975926

S
3.34088375213 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 2.54170100004 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 2.35394351766 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.64453229237 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 1.50483155049 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 1.30638195024 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.845903082647 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.560134422839 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.36367684004

V
-0.197392802296 -0.605990268919 -0.462917508082 -0.542114416925 -0.279469108426 -0.00381521297476 -0.0146314675059 -0.0241368353337 -0.0819573680281
-0.0559135177721 0.165592877548 -0.127312061589 -0.231755228874 0.106774717006 0.192847936262 0.437874882598 0.6151218992 0.529937071643
0.110269729184 -0.497326493627 0.207605952936 0.569921445337 -0.505449906656 0.0981842398306 0.192955571817 0.252903978748 0.0792731465334
-0.949785023586 -0.0286488989491 0.0416091951387 0.267714037747 0.15003543258 0.015081490733 0.0155071875251 0.0101990092357 -0.0245549055501
0.0456785564271 -0.206327277661 0.37833623285 -0.20560471144 0.327194409395 0.394841213554 0.349485347525 0.149798472318 -0.601992994659
-0.0765935584562 -0.256475221191 0.72439964169 -0.368860900846 0.034813049761 -0.300161116158 -0.212201424262 9.74341694268e-05 0.36221897331
-0.17731829729 0.432984244623 0.236889703269 -0.264799522759 -0.672303529825 0.340839827427 0.152194721647 -0.249145920279 -0.0380341888592
0.0143932590528 -0.0493053257482 -0.00882550204839 0.019466943894 0.0583495626425 -0.454476523485 0.761527011149 -0.449642756709 0.0696375496788
0.0636922895993 -0.242782899677 -0.0240768748334 0.0842069016903 0.262375876232 0.619847193574 -0.0179751825326 -0.51989049808 0.453506754839
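
Before moving on, you can verify that the decomposition worked; numpy.allclose compares two matrices element-wise within a small tolerance:

# Prints True: multiplying U, S and V back together reproduces
# the original matrix up to floating point rounding errors.
print numpy.allclose(U*S*V, a)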

Rank reduction
If U, S and V are multiplied together in that order, you get the original matrix again (approximately, due to rounding errors). Notice that the second matrix S is a diagonal matrix, that is, every element in the matrix is zero except the diagonal. I will not go into the mathematics of this, but if you replace the smallest values in the diagonal of S with zeros and multiply the three matrices together again, you get a matrix which is similar to the original matrix but with an important difference: the rows of similar words become more similar to each other. The more of the smallest diagonal values are replaced with zeros, the stronger this effect becomes. This only works up to a point of course; there has to be a balance between how many values are replaced with zeros and how much information is left in the matrix. Conveniently, the values in the diagonal are sorted in decreasing order. If we replace the smallest 7 values with zeros, keeping only the largest 2, we get the following:

# Zero out the 7 smallest singular values, keeping only the largest 2.
S[2][2] = 0.0
S[3][3] = 0.0
S[4][4] = 0.0
S[5][5] = 0.0
S[6][6] = 0.0
S[7][7] = 0.0
S[8][8] = 0.0

# Multiply the three matrices back together.
A = U*S*V
print A

0.162057973899 0.400498283106 0.378954540321 0.467566261179 0.175953674216 -0.05265494658 -0.115142842816 -0.159101981799 -0.0918382678755
0.140585289239 0.369800771629 0.328996029685 0.400427224981 0.164972474341 -0.0328154503866 -0.0705685702014 -0.0967682651277 -0.0429807318561
0.152449476913 0.505004444164 0.357936583955 0.410106780092 0.236231733239 0.0242165157003 0.0597805100865 0.086857300988 0.123966320774
0.258049326846 0.841123434512 0.60571994881 0.697357170797 0.392317949475 0.0331180051639 0.0832449070523 0.121772385778 0.187379725005
0.448789754476 1.23436483383 1.0508614968 1.26579559131 0.556331394289 -0.0737899841222 -0.154693831118 -0.209598161026 -0.0488795415146
0.159554277475 0.581681899927 0.375218968496 0.416897679847 0.276540515564 0.0559037446194 0.132218498596 0.188911459228 0.216907605531
0.159554277475 0.581681899927 0.375218968496 0.416897679847 0.276540515564 0.0559037446194 0.132218498596 0.188911459228 0.216907605531
0.218462783401 0.549580580647 0.510960471265 0.628058018099 0.242536067696 -0.0654109751045 -0.142521455703 -0.196611863571 -0.107913297073
0.096906385715 0.532064379223 0.229913654058 0.211753629516 0.266525126236 0.13675618206 0.314620778341 0.444440582128 0.424969482179
-0.0612538812664 0.232108208041 -0.138898404449 -0.265645889929 0.144925494403 0.240421047963 0.546147168984 0.767374200391 0.663709334488
-0.0646770216648 0.335281153514 -0.14564054552 -0.301406070826 0.202756409842 0.305726121021 0.694893368967 0.976611213875 0.848749689135
-0.0430820430074 0.25390566477 -0.0966669539764 -0.207858206667 0.15191339999 0.221227031407 0.502944876315 0.706911627157 0.615504399537

Now if we take the first and fourth rows which belong to "human" and "user", we get these:
0.162057973899 0.400498283106 0.378954540321 0.467566261179 0.175953674216 -0.05265494658 -0.115142842816 -0.159101981799 -0.0918382678755
0.258049326846 0.841123434512 0.60571994881 0.697357170797 0.392317949475 0.0331180051639 0.0832449070523 0.121772385778 0.187379725005
These rows are much more similar than they were before, and they are more similar to each other than to the other rows in the matrix.
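
You can quantify this with the cosine_similarity sketch from earlier (row 0 is "human" and row 3 is "user" in the reconstructed matrix A):

# Compare the rank-reduced rows for "human" and "user"; the result
# is noticeably higher than the 0.0 we would get from the raw rows.
human = numpy.asarray(A)[0]
user = numpy.asarray(A)[3]
print cosine_similarity(human, user)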

This is called a rank reduction because the effect of replacing the values with zeros is that you get a matrix with a smaller "rank". The rank of a matrix is the number of linearly independent rows (or columns) in it, and it is equal to the number of non-zero singular values; by keeping only 2 of them we have forced the rank down to 2.
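
NumPy can confirm this with numpy.linalg.matrix_rank:

# Prints 2: only two non-zero singular values remain.
print numpy.linalg.matrix_rank(A)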

Dimension reduction
The problem with a rank reduction is that the matrix suddenly becomes full of non-zero numbers where it used to be full of zeros. This means that using the matrix requires a lot of processing, which is slow. So a slight modification is used instead, which has the effect of making the matrix smaller whilst still exposing its latent co-occurrences. After zeroing out the values, rather than multiplying all 3 matrices, only U and S or S and V are multiplied together, depending on whether you want to compare rows (words) or columns (documents).

# Zero out the 7 smallest singular values as before.
S[2][2] = 0.0
S[3][3] = 0.0
S[4][4] = 0.0
S[5][5] = 0.0
S[6][6] = 0.0
S[7][7] = 0.0
S[8][8] = 0.0

# Multiply only U and S, giving one row per word.
A = U*S
print A

-0.739507219222 -0.287668746646 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-0.660310310378 -0.183225579361 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-0.803383071215 0.109679359776 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-1.34837688543 0.145055532965 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-2.15313661085 -0.425229641788 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-0.885459377345 0.272367594554 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-0.885459377345 0.272367594554 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-1.00503192501 -0.359067290462 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-0.687947636947 0.695529949191 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-0.0425835158144 1.24584471806 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-0.120725670868 1.58293385344 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-0.106094203362 1.14505897084 0.0 0.0 0.0 0.0 0.0 0.0 0.0

The columns of zeros on the right of the matrix correspond to the values which were zeroed out in the diagonal of S. We can simply remove these columns.
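
A sketch of one way to do that with slicing:

# Keep only the first 2 columns, which correspond to the
# 2 singular values that were not zeroed out.
A = A[:, 0:2]
print A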

-0.739507219222 -0.287668746646
-0.660310310378 -0.183225579361
-0.803383071215 0.109679359776
-1.34837688543 0.145055532965
-2.15313661085 -0.425229641788
-0.885459377345 0.272367594554
-0.885459377345 0.272367594554
-1.00503192501 -0.359067290462
-0.687947636947 0.695529949191
-0.0425835158144 1.24584471806
-0.120725670868 1.58293385344
-0.106094203362 1.14505897084

Now we have only two columns to compare. The rows corresponding to "human" and "user" are:
-0.739507219222 -0.287668746646
-1.34837688543 0.145055532965

This is called a dimension reduction because the vectors being compared become smaller, which allows for faster computation. The same thing could have been done to compare documents by multiplying S and V instead and then removing the bottom rows rather than the columns on the right.
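
For completeness, here is a sketch of the document side, under the same assumptions as the code above (S already has its 7 smallest values zeroed):

# Multiply S and V so that each column is a document vector,
# then keep only the first 2 rows (the rest are all zeros).
B = (S*V)[0:2, :]

# Each column of B is now a 2-dimensional document vector; for example
# column 0 represents D1 and can be compared with the other columns
# using cosine similarity.
print cosine_similarity(numpy.asarray(B)[:, 0], numpy.asarray(B)[:, 1])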