Tuesday, March 18, 2014

Random Indexing for word space context vectors

Last time I described what a term-document matrix is and how to make its use faster using Latent Semantic Analysis. Today I shall explain how to use Random Indexing as an alternative to LSA. You can read more about it here.

Unlike LSA, random indexing is a very easy and very fast online process. It is online in the sense that you don't need to construct the complete co-occurrence matrix before you can use it; you can use it while you are still counting co-occurrences. It also works just as well as LSA. Interested yet?

First of all, how do you construct a co-occurrence matrix? You start off with a matrix full of zeros, with rows for the words in context w1, w2, w3, and w4 and columns for the neighbouring words n1, n2, n3, and n4:
   n1 n2 n3 n4
w1 0  0  0  0
w2 0  0  0  0
w3 0  0  0  0
w4 0  0  0  0
You go through the corpus and you encounter word w2 and near it is word n4. You record this encounter by incrementing the matrix element for w2-n4:
   n1 n2 n3 n4
w1 0  0  0  0
w2 0  0  0  1
w3 0  0  0  0
w4 0  0  0  0
You do this for every neighbouring word next to every word in context and finally you have your co-occurrence matrix.
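
To make this concrete, here is a minimal Python sketch of that counting process (the function name and toy corpus are mine, purely for illustration), counting every neighbour within a fixed window around each word:

from collections import defaultdict

def cooccurrence_matrix(corpus, window=2):
    # corpus: a list of tokenised sentences (lists of words).
    # Returns matrix[word][neighbour] = co-occurrence count.
    matrix = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    matrix[word][sentence[j]] += 1
    return matrix

corpus = [["the", "cat", "sat", "on", "the", "mat"]]
print(dict(cooccurrence_matrix(corpus)["cat"]))
# {'the': 1, 'sat': 1, 'on': 1}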

Now this very same process can be viewed differently: instead of incrementing single numbers, it can be seen as vector addition. Forget the names given to the columns:
w1 0  0  0  0
w2 0  0  0  0
w3 0  0  0  0
w4 0  0  0  0
Instead, assign a vector to each neighbour word, where each vector consists of zeros and a single 1, and no two vectors have their 1 in the same position. These could be the vectors assigned to our 4 neighbour words:
n1: (1 0 0 0)
n2: (0 1 0 0)
n3: (0 0 1 0)
n4: (0 0 0 1)
Now when you encounter neighbour word n4 next to word in context w2, you just take w2's row, which is itself a vector, and add n4's vector to it:
w2: (0 0 0 0) +
n4: (0 0 0 1)
    =========
    (0 0 0 1)
Which gives you the updated matrix:
w1 0  0  0  0
w2 0  0  0  1
w3 0  0  0  0
w4 0  0  0  0
Do this for every neighbour next to every word in context and you have your co-occurrence matrix.
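
The same counting, phrased as one-hot vector addition, might look like this sketch in Python (the variable names are illustrative, matching the toy matrices above):

import numpy as np

neighbours = ["n1", "n2", "n3", "n4"]
words = ["w1", "w2", "w3", "w4"]

# One one-hot index vector per neighbour word: a single 1, zeros elsewhere.
index_vectors = {n: np.eye(len(neighbours))[i] for i, n in enumerate(neighbours)}

# Every word in context starts with a row of zeros.
rows = {w: np.zeros(len(neighbours)) for w in words}

# Seeing neighbour n4 next to word in context w2 is just a vector addition.
rows["w2"] += index_vectors["n4"]
print(rows["w2"])   # [0. 0. 0. 1.]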

All random indexing does is change the neighbour words' vectors. Instead of the vectors being as long as the number of neighbour words, they are shortened to a fraction of that length. Consequently, the number of columns in the matrix is also reduced. "But how can you put a 1 in a different position for every neighbour word?" I hear you ask. You don't. You still keep the vectors mostly zero, but now you randomly place a few 1s and -1s in them:
w1 0  0  0
w2 0  0  0
w3 0  0  0
w4 0  0  0

n1: ( 1  0 -1)
n2: ( 0  1 -1)
n3: ( 0 -1  1)
n4: (-1  1  0)
If we encounter word in context w2 with neighbour word n4, then the matrix will become:
w1  0  0  0
w2 -1  1  0
w3  0  0  0
w4  0  0  0
If word in context w2 also has word n2 as a neighbour, then the matrix will become:
w1  0  0  0
w2 -1  2 -1
w3  0  0  0
w4  0  0  0
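
Here is a sketch of the random indexing version in Python; the dimensionality of 3 and the 2 non-zero entries per vector are picked arbitrarily to match the toy example, and in practice both would be much larger:

import numpy as np

def random_index_vector(dim, n_nonzero, rng):
    # A mostly-zero vector with a few +1s and -1s at random positions.
    v = np.zeros(dim)
    positions = rng.choice(dim, size=n_nonzero, replace=False)
    v[positions] = rng.choice([1.0, -1.0], size=n_nonzero)
    return v

rng = np.random.default_rng(0)
index_vectors = {n: random_index_vector(dim=3, n_nonzero=2, rng=rng)
                 for n in ["n1", "n2", "n3", "n4"]}
rows = {w: np.zeros(3) for w in ["w1", "w2", "w3", "w4"]}

# The update rule is unchanged: each encounter adds the neighbour's vector.
rows["w2"] += index_vectors["n4"]
rows["w2"] += index_vectors["n2"]
print(rows["w2"])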

Now it will no longer be possible to tell how many times each neighbour word co-occurs with each word in context, but that's OK. What matters is that each time you encounter a neighbour word, it leaves its mark on the word in context's row, and that the more often it is encountered, the more pronounced its mark becomes. This means that you can still find similar words by comparing their rows; the difference is that the rows are shorter and hence faster to compare.
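
For example, rows could be compared with cosine similarity; a minimal sketch, using the w2 row from the toy example:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the rows point in the same direction, 0.0 means unrelated.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return np.dot(a, b) / denom if denom else 0.0

print(cosine_similarity(np.array([-1.0, 2.0, -1.0]),
                        np.array([-1.0, 1.0, 0.0])))   # about 0.87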

Notice the online nature of this process. At any point in the corpus traversal the matrix is already in its compressed form, and there are no complex transformations that need to be made on it. You can even add words to or remove words from the matrix without any further processing or extra data. You just have to find a compromise between how many columns you remove from the matrix and how distinctive the rows remain. The shorter the rows, the less individual a mark each neighbour word leaves when its vector is added.
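
For instance, if the rows are kept in a dictionary that creates a zero vector on first access (a hypothetical setup, continuing the sketch above), a brand-new word costs nothing to add:

from collections import defaultdict
import numpy as np

rows = defaultdict(lambda: np.zeros(3))
rows["w5"] += np.array([1.0, 0.0, -1.0])   # a new word, no reprocessing needed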
