Tuesday, March 18, 2014

Random Indexing for word space context vectors

Last time I described what a term-document matrix is and how to make its use faster using Latent Semantic Analysis. Today I shall be explaining how to use Random Indexing as an alternative to LSA. You can read more about it here.

Unlike LSA, random indexing is a very easy and very fast online process. It is online in the sense that you don't need to construct the complete co-occurrence matrix before you can use it; you can use it as you are counting co-occurrences. It also works just as well as LSA. Interested yet?

First of all, how do you construct a co-occurrence matrix? You start off with a matrix full of zeros, with rows for the words in context w1, w2, w3, and w4 and columns for the neighbouring words n1, n2, n3, and n4:
   n1 n2 n3 n4
w1 0  0  0  0
w2 0  0  0  0
w3 0  0  0  0
w4 0  0  0  0
You go through the corpus and you encounter word w2 and near it is word n4. You record this encounter by incrementing the matrix element for w2-n4:
   n1 n2 n3 n4
w1 0  0  0  0
w2 0  0  0  1
w3 0  0  0  0
w4 0  0  0  0
You do this for every neighbouring word next to every word in context and finally you have your co-occurrence matrix.
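Here is a minimal Python sketch of this counting process, using a toy vocabulary and invented helper names just for illustration:

import numpy as np

# Toy vocabulary: 4 words in context (rows) and 4 neighbour words (columns).
context_words = ["w1", "w2", "w3", "w4"]
neighbour_words = ["n1", "n2", "n3", "n4"]

# Start with a matrix full of zeros.
matrix = np.zeros((len(context_words), len(neighbour_words)), dtype=int)

def record_cooccurrence(word, neighbour):
    # Increment the cell for this word-in-context / neighbour pair.
    row = context_words.index(word)
    col = neighbour_words.index(neighbour)
    matrix[row, col] += 1

# Encountering word in context w2 with neighbour n4:
record_cooccurrence("w2", "n4")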

Now this very same process can be viewed in a different way: instead of incrementing single numbers, it can be seen as vector addition. Forget the names given to the columns:
w1 0  0  0  0
w2 0  0  0  0
w3 0  0  0  0
w4 0  0  0  0
Instead assign a vector to each neighbour word where each vector consists of just zeros and a single 1. No two vectors can have a 1 in the same position. So these could be the vectors assigned to our 4 neighbour words:
n1: (1 0 0 0)
n2: (0 1 0 0)
n3: (0 0 1 0)
n4: (0 0 0 1)
Now when you encounter neighbour word n4 next to word in context w2, you just take w2's row, which is a vector, and add n4's vector to it:
w2: (0 0 0 0) +
n4: (0 0 0 1)
    =========
    (0 0 0 1)
Which gives you the updated matrix:
w1 0  0  0  0
w2 0  0  0  1
w3 0  0  0  0
w4 0  0  0  0
Do this for every neighbour next to every word in context and you have your co-occurrence matrix.
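The same update, seen as vector addition, might look like the following sketch (again with invented names; each neighbour word gets a one-hot vector and each word in context's row is just the running sum of its neighbours' vectors):

import numpy as np

context_words = ["w1", "w2", "w3", "w4"]
neighbour_words = ["n1", "n2", "n3", "n4"]

# One-hot vector for each neighbour word: all zeros except a single 1,
# with no two neighbour words sharing the position of their 1.
index_vectors = {n: np.eye(len(neighbour_words), dtype=int)[i]
                 for i, n in enumerate(neighbour_words)}

# Each word in context starts off as a row of zeros.
rows = {w: np.zeros(len(neighbour_words), dtype=int) for w in context_words}

# Encountering neighbour n4 next to word in context w2
# is just adding n4's vector to w2's row.
rows["w2"] += index_vectors["n4"]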

All random indexing does is change the neighbour words' vectors. Instead of the vectors being as long as the number of neighbour words, they are shortened to a fraction of that length. Consequently the number of columns in the matrix is also reduced. "But how can you put a 1 in a different position for every neighbour word?" I hear you ask. You don't. What you do is keep the vectors mostly zero, but now you randomly put a few 1s and -1s in each of them:
w1 0  0  0
w2 0  0  0
w3 0  0  0
w4 0  0  0

n1: ( 1  0 -1)
n2: ( 0  1 -1)
n3: ( 0 -1  1)
n4: (-1  1  0)
If we encounter word in context w2 with neighbour word n4, then the matrix will become:
w1  0  0  0
w2 -1  1  0
w3  0  0  0
w4  0  0  0
If word in context w2 also has word n2 as a neighbour, then the matrix will become:
w1  0  0  0
w2 -1  2 -1
w3  0  0  0
w4  0  0  0

Now it will not be possible to know how many times each neighbour word co-occurs with each word in context, but that's OK. What matters is that each time you encounter a neighbour word, it leaves its mark on the word in context's row, and the more often it is encountered, the more pronounced its mark becomes. This means that you can still find similar words by comparing their rows. The difference is that the rows are shorter and hence faster to compare.
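Put together, a minimal random indexing sketch might look like this. The dimensionality, the number of non-zero entries per index vector, and the helper names are assumptions made for the example, not values prescribed by the method:

import numpy as np

rng = np.random.default_rng(0)
DIM = 3        # reduced number of columns (in practice a few hundred or more)
NON_ZERO = 2   # how many randomly placed 1s and -1s each index vector gets

def make_index_vector():
    # Mostly zeros, with a few 1s and -1s in random positions.
    v = np.zeros(DIM, dtype=int)
    positions = rng.choice(DIM, size=NON_ZERO, replace=False)
    v[positions] = rng.choice([1, -1], size=NON_ZERO)
    return v

index_vectors = {}   # neighbour word -> its fixed random index vector
rows = {}            # word in context -> its accumulated row

def record_cooccurrence(word, neighbour):
    if neighbour not in index_vectors:
        index_vectors[neighbour] = make_index_vector()
    if word not in rows:
        rows[word] = np.zeros(DIM, dtype=int)
    rows[word] += index_vectors[neighbour]

def similarity(word_a, word_b):
    # Compare two words in context by the cosine of their rows.
    a, b = rows[word_a], rows[word_b]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

record_cooccurrence("w2", "n4")
record_cooccurrence("w2", "n2")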

Notice the online nature of this process. At any point in the corpus traversal, the matrix is always in its compressed form and there are no complex transformations that need to be made on it. You can even add more words to or remove words from the matrix without any further processing or extra data. You just have to find a compromise between how many columns to remove from the matrix and the individuality of the rows. The shorter the rows, the less of an individual mark each neighbour word leaves when its vector is added.

Wednesday, March 12, 2014

Summary of research paper "Unsupervised Lexical Substitution with a Word Space Model" by Dario Pucci, Marco Baroni, Franco Cutugno, and Alessandro Lenci

This is a summary of the research paper http://www.evalita.it/sites/evalita.fbk.eu/files/proceedings2009/Lexical%20Substitution/LS_UNINA_UNIPI_UNITN.pdf. This paper describes a way to automatically choose a word that can substitute for another word in a sentence, using minimal resources.

Overview
The system described in this paper was used for the EVALITA 2009 lexical substitution task, for which a manually constructed evaluation set in Italian was provided.

In order to determine how well a word can replace another in a context, two types of vectors are used: a contextualized vector and an out-of-context vector. The contextualized vector is a composition of out-of-context vectors.

First, out-of-context vectors are collected from the corpus for each word using the usual method of counting how often each word co-occurs with the word being represented by the vector. The counts are then weighted to give less importance to words that co-occur with many words.

In order to find a substitutable word for a target word in a sentence, the out-of-context vectors of the content words in the sentence are collected, weighted, and summed together. This sum is the contextualized vector. The word whose out-of-context vector is most similar to this contextualized vector is chosen as the best substitute.

Nitty gritty
A lemmatized and part-of-speech-tagged Italian corpus of 2 billion tokens was constructed from which to extract the out-of-context vectors. The 20,000 most frequent nouns, verbs, adjectives and adverbs were considered content words. Out-of-context vectors for these content words were extracted by counting how often they appear with other content words in the same sentence. The counts were weighted using log likelihood ratios, and the co-occurrence matrix of counts was compressed to half its size using random indexing.

To create a contextualized vector, the context sentence is lemmatized and part-of-speech tagged. The sentence is then split into a left and a right side with respect to the target word, and each side is filtered of all non-content words. The words in the sentence sub-parts that are within a set number of content words from the target word are used to construct the contextualized vector: their out-of-context vectors are normalized, multiplied by "1/(4*d)", where "d" is the distance of the word in the sentence sub-part from the target word (the closest content word has a distance of 1), and summed together.
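As a rough sketch of this contextualization step, assuming the out-of-context vectors are kept in a dictionary of numpy arrays (the function name, the window parameter, and the variable names are invented for the example):

import numpy as np

def contextualized_vector(left_words, right_words, vectors, window=5):
    # left_words / right_words: content-word lemmas to the left and right of the
    # target word, ordered from nearest to farthest.
    # vectors: lemma -> out-of-context vector (numpy array).
    total = None
    for side in (left_words, right_words):
        for d, lemma in enumerate(side[:window], start=1):
            v = vectors[lemma]
            v = v / np.linalg.norm(v)   # normalize
            v = v * (1.0 / (4 * d))     # weight by 1/(4*d), d = distance from target
            total = v if total is None else total + v
    return total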

Finally, the cosine measure is used to find the most substitutable words by comparing the contextualized vector with the out-of-context vectors of all the words that have the same part-of-speech tag as the target word.
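The final ranking could then be sketched like this, computing the cosine between the contextualized vector and the out-of-context vector of every candidate word with the target's part-of-speech tag (again, the names are illustrative assumptions):

import numpy as np

def rank_substitutes(context_vec, candidates, vectors):
    # candidates: lemmas sharing the target word's part-of-speech tag.
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scored = [(lemma, cosine(context_vec, vectors[lemma])) for lemma in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)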

Evaluation
The system was evaluated using an evaluation set consisting of sentences, each with a target word and substitutable words suggested by human annotators. Each suggestion included the number of annotators that suggested it. When comparing the system's best suggested substitutable word with the annotators' most popular suggestion, the system matched on 10.84% of the evaluation items (those that had a most popular suggested word).