Wednesday, December 26, 2012

Summary of research paper "Automatic Retrieval and Clustering of Similar Words" by Dekang Lin

This is a summary of the research paper http://webdocs.cs.ualberta.ca/~lindek/papers/acl98.pdf. This paper describes a way to measure the similarity between words using their contexts. It also describes a way to evaluate the similarities by comparing with existing thesauruses.

The way similarity between two words is calculated is by checking how the words are used in a corpus. To check how a word is used, the grammatical relations between the word and other words in the same sentences it is found are used. For example, these are the grammatical relations for the word "dog" in the sentence "I have a brown dog":

(dog obj-of have): it is an object of the verb "have"
(dog adj-mod brown): it is modified by the adjective "brown"
(dog det a): it takes the determiner "a"

The way the relations are represented is called "dependency triples". By counting these dependency triples for every word in a corpus, the frequencies of each relation can be used to determine if two words are used in the same way and by extension if they mean the same thing. This is called the Distributional Hypothesis which says that words that words that appear in similar contexts are semantically similar. These dependency triples are collected using a broad coverage dependency parser.

In order to describe how these frequencies are used in calculations, the following notation is used: ||w,r,w'|| which is the frequency of the dependency triple (w r w'), that is, that word "w" is related to word "w'" with the relation "r". For example, ||cell, subj-of, absorb|| is the number of times the word "cell" is used as a subject of the word "absorb" in the corpus. Asterisks are used a wild cards, that is, to say that that it doesn't matter what is instead of the asterisks. For example, ||cell, subj-of, *|| is the number of times the word "cell" is used as a subject of some other word, ||cell, *, absorb|| is the number of times the word "cell" is somehow related to the word "absorb", ||*, subj-of, absorb|| is the number of times a word is used as a subject of the word "absorb" and ||*, *, *|| is total number of dependency triples of any kind.

In order to determine how similar two words are based on the frequency of their relations, a formula from another paper by the same author called Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity (http://aclweb.org/anthology-new/P/P97/P97-1009.pdf) is used which says that

similarity(A, B) = log P(common(A, B)) / log P(describe(A, B))

This means that the similarity between two words A and B is the amount of information needed to describe what is in common between both words divided by the amount of information needed to describe both words. The less the words have information in common, the less similar they are.

The amount of information needed to describe a word is the sum of the information in each of its dependency triples. The amount of information needed to describe a dependency triple (w, r, w') is -log(actual probability that a randomly selected dependency triple is (w, r, w')) - -log(expected probability that a randomly selected dependency triple is (w, r, w')).

The actual probability of (w, r, w') is ||w, r, w'|| / ||*, *, *||.

The expected probability of (w, r, w') can be estimated using the probability that a randomly selected relation is "r", that a randomly selected word is "w" given that the relation is "r" and that another randomly selected word is "w'" given that the relation is "r". These probabilities are assumed to be independent and are simply multiplied together to find the probability of the whole dependency triple. This probability is P(r)P(w|r)P(w'|r) which equals (||*, r, *|| / ||*, *, *||) × (||w, r, *|| / ||*, r, *||) × (||*, r, w'|| / ||*, r, *||).

So the amount of information needed to describe a dependency triple is I(w,r,w') = log((||w,r,w'|| × ||*,r,*||) / (||w,r,*|| × ||*,r,w'||))

We can now finally define the function which gives the similarity between two words.

Let "T(w)" be a function which when given a word "w" will return the set of pairs "(r, w')" which make I(w,r,w') positive.

The similarity between 2 words, "w1" and "w2" is given by the function sim(w1, w2) = sum(I(w1, r, w) + I(w2, r, w) for (r,w) in (T(w1) ∩ T(w2))) / (sum(I(w1, r, w) for (r,w) in T(w1)) + sum(I(w2, r, w) for (r,w) in T(w2)))

With our function sim(w1, w2) we can now start measuring how similar two words are based on their relations to other words.

A 64 million word corpus was parsed using a broad coverage dependency parser which extracted 56.5 million dependency triples. For each noun, verb, adjective and adverb, the top 200 most similar words were found. The resulting thesaurus of similar words consists of these 200 most similar words for each word sorted in descending order of their similarity together with their similarity, for example:

brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04 ...

In order to determine the quality of the similarity measure, its values were compared to existing thesauruses Roget and WordNet in order to see if the similar words determined by the measure are the same as the similar words in the thesauruses. Other similarity measures were also used in order to compare them all together.

These other similarity measures were:

Cosine similarity:
sim_cos(w1, w2) = |T(w1) ∩ T(w2)| / sqrt(|T(w1)| × |T(w2)|)

Hindle similarity which is the same as that proposed in a paper called NOUN CLASSIFICATION FROM PREDICATE ARGUMENT STRUCTURES except that it does not use dependency triples which have negative information (the measure only uses dependency triples whose relation is of subject of and object of):
sim_hindle(w1, w2) = sum(min(I(w1, r, w), I(w2, r, w)) for (r,w) in (T(w1) ∩ T(w2)) if r in {subj-of, obj-of})

and Hindle_r which is the same as Hindle except that all dependency relations are used:
sim_hindle_r(w1, w2) = sum(min(I(w1, r, w), I(w2, r, w)) for (r,w) in (T(w1) ∩ T(w2)))

In order to compare the similarities given by these measures with the thesauruses, the thesauruses needed to have a quantified similarity associated between their similar words. For this reason a similarity measure which gives the similarity between two words according to the thesauruses were developed.

For WordNet, the similarity measures use super classes of the different senses of words. Every word in WordNet belongs to some category of words called a class, here is an example:

Next to each class is the probability that a randomly selected word belongs to that class, which is done by counting the number of words in each class. In order to quantify the similarity between words in WordNet, the most specific class which is common between the two words is used, which in the case of "hill" and "coast" would be "geological formation". By finding the probability of this class for every different sense (meaning) of the two words, converting the probabilities of those classes into information and taking the maximum of the informations, the similarity of the two words would be computed. Formally, this is as follows:
sim_WN(w1, w2) = max(2 log(P(most_specific_common_class(s1, s2))) / (log(P(s1)) + log(P(s2))) for s1 in senses(w1) for s2 in senses(w2))

For Roget, the Roget categories of the words are used. The more words are in common of both categories, the more similar the words are. This is measured as such:
sim_Roget(w1, w2) = 2|RogetCategory(w1) ∩ RogetCategory(w2)| / (|RogetCategory(w1)| + |RogetCategory(w2)|)

Once these measures were defined, the thesauruses were transformed into a form similar to the generated ones. Here is an example:

brief (noun) according to "sim" thesaurus: affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04

brief (noun) according to "sim_WN" thesaurus: outline 0.96, instrument 0.84, summary 0.84, affidavit 0.80, deposition 0.80, law 0.77, survey 0.74, sketch 0.74, resume 0.74, argument 0.74

brief (noun) according to "sim_Roget" thesaurus: recital 0.77, saga 0.77, autobiography 0.77, anecdote 0.77, novel 0.77, novelist 0.77, tradition 0.70, historian 0.70, tale 0.64

In order to determine the similarity between each set of similar words, the following formula was used:
entries_similarity(entries1, entries2) = sum(entry1.similarity × entry2.similarity for entry1 in entries1 for entry2 in entries2 if entry1.word == entry2.word) / sqrt(sum(entry1.similarity^2 for entry1 in entries1) × sum(entry2.similarity^2 for entry2 in entries2))

For example, the similarity between the entries for "brief" using "sim" thesaurus and using "sim_WN" thesaurus is:
sum(0.13 × 0.80, 0.05 × 0.80) / sqrt(sum(0.13^2, 0.05^2, 0.05^2, 0.05^2, 0.05^2, 0.05^2, 0.05^2, 0.04^2, 0.04^2, 0.04^2) × sum(0.96^2, 0.84^2, 0.84^2, 0.80^2, 0.80^2, 0.77^2, 0.74^2, 0.74^2, 0.74^2, 0.74^2)) = 0.297215762

The thesauruses were compared and only words which occurred more than 100 times and which were found in all thesauruses were used to evaluate. The average similarity between entries in the thesauruses was calculated.

When compared to WordNet entries:
sim 0.212199
Hindle 0.204179
cosine 0.199402
Roget 0.178397
Hindle_r 0.164716

When compared to Roget entries:
WordNet 0.178397
sim 0.149045
Hindle 0.14663
cosine 0.135697
Hindle_r 0.115489

It can be seen that the "sim" thesuarus is always better than the other generated thesauruses, but WordNet is closer to Roget than the "sim" thesaurus.

Wednesday, December 12, 2012

An equation for enumerating integers

In my last post I described Cantor's way of proving that some infinite lists are bigger than others. I said that his is done by showing that a number of infinite lists, which are called countably infinite, can be paired up with every natural number from 0 onward. I shall now describe an equation that gives the integer that is paired up with a particular natural number.

The integers are (..., -2, -1, 0, 1, 2, ...)
We can pair up integers with natural numbers by pairing all the even natural numbers with positive integers and all the odd natural numbers with negative integers, like this:
0:0, 1:-1, 2:1, 3:-2, 4:2, ...

An equation which maps the natural numbers to the integers in this way is as follows:
y = -(x+1)/2*(x mod 2) + x/2*((x+1) mod 2)

"mod" is an operator called a modulo operator which gives the resulting remainder for a number divided by another. For example 4 mod 2 gives 0 and 5 mod 2 gives 1.

"x mod 2" will give 1 when x is odd and 0 when x is even.
"(x+1) mod 2" will give 1 when x is even and 0 when x is odd.

These are used in order to multiply them with terms that will deal with either the odd natural numbers or the even natural numbers.

"-(x+1)/2" will work when x is odd and will return -1 for 1, -2 for 3, -3 for 4, etc.
"x/2" will work when x is even and will return 1 for 2, 2 for 4, 3 for 6, etc.

These are used to pair up the negative integers with the odd natural numbers and the positive integers with the even natural numbers.

"-(x+1)/2*(x mod 2)" will give the negative integers for odd natural numbers and 0 for even natural numbers, as "-(x+1)/2" will be multiplied by 1 for odd naturals and by 0 for even naturals.
"x/2*((x+1) mod 2)" will give the positive integers for even natural numbers and 0 for odd natural numbers, as "x/2" will be multiplied by 1 for even naturals and by 0 for odd naturals.

By adding them together, since both terms will alternate between 0 and their signed integers, the sum will be an alternating sequence of positive and negative numbers for every natural number x. When x is 0, the result of the sum will also be 0 as both terms will be multiplied by 0.

The equation in Python 3 is as follows:
for x in range(20):
 print(x, ":", -(x+1)//2*(x % 2) + x//2*((x+1) % 2))

And here is the output:
0 : 0
1 : -1
2 : 1
3 : -2
4 : 2
5 : -3
6 : 3
7 : -4
8 : 4
9 : -5
10 : 5
11 : -6
12 : 6
13 : -7
14 : 7
15 : -8
16 : 8
17 : -9
18 : 9
19 : -10