Saturday, January 12, 2013

Summary of research paper "Discovering Word Senses from Text" by Patrick Pantel and Dekang Lin

This is a summary of the research paper http://webdocs.cs.ualberta.ca/~lindek/papers/kdd02.pdf. The paper describes a way to group similar words together according to the different meanings of each word, using nothing but the contexts in which the words appear. You should also check Pantel's PhD thesis on the algorithm used in this paper at http://www.patrickpantel.com/download/papers/2003/cbc.pdf. You might also be interested in this Java implementation of the algorithm at https://github.com/fozziethebeat/S-Space/blob/master/src/main/java/edu/ucla/sspace/clustering/ClusteringByCommittee.java.

In order to judge the similarity between two words, the grammatical relations between those words and other words in a corpus are used. For example, US state names can be recognized as similar because they all share the following contexts:
___ appellate court
campaign in ___
___ capital
governor of ___
___ driver's license
illegal in ___
primary in ___
___'s sales tax
senator for ___

So, for example, by counting the number of times the verb "campaign" is grammatically related to each word in a corpus, we can find words which are similar because they have similar counts for the same contexts. A grammatically related word (together with its relation) is called a feature of the word. By finding all the different features of each word in a corpus, we can give each word a list of its features together with the number of times each feature was encountered as a context of the word. For example, we can count the number of times the word "campaign" appears as a feature of "California", and likewise for "capital", "governor", etc. All these features grouped together are called the feature vector of the word.
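
As a rough illustration (my own, not from the paper), here is how such feature counts might be accumulated from dependency triples produced by a parser; the triple format and the relation names are made up for the example:
from collections import defaultdict

def build_feature_vectors(triples):
    # accumulate f(w,c): how many times each (relation, related word) context
    # appears as a feature of each word
    vectors = defaultdict(lambda: defaultdict(int))
    for word, relation, context_word in triples:
        vectors[word][(relation, context_word)] += 1
    return vectors

# example: a few hypothetical dependency triples
triples = [
    ("California", "object of preposition in", "campaign"),
    ("California", "possessed by", "governor"),
    ("Texas", "object of preposition in", "campaign"),
]
vectors = build_feature_vectors(triples)
# vectors["California"] == {("object of preposition in", "campaign"): 1, ("possessed by", "governor"): 1}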

However, in this paper, it is not the counts, or frequencies, that are associated with the features. Using raw counts to judge the importance of a feature is misleading, as the feature might simply be common in general and therefore not useful for finding similar words. So instead the pointwise mutual information is used. It compares the probability of the word and the context occurring together with the probability that they would occur together by chance (if they were independent). In this paper it is calculated as:

mi(w,c) = log( p(w,c) / (p(w) × p(c)) )
where "w" is the word, "c" is a context of "w", "p(w,c)" is the probability of finding "c" as a context of "w", "p(w)" is the probability of finding the word "w" in some context and "p(c)" is the probability of finding "c" as a context of some word. These probabilities are calculated as follows:

p(w,c) = f(w,c) / N
p(w) = sum(f(w,c) for c in contexts) / N
p(c) = sum(f(w,c) for w in words) / N
N = sum(sum(f(w,c) for w in words) for c in contexts)
where "f(w,c)" is the number of times "c" is a context of "w" and "N" is the total number of word-context combinations.

However, pointwise mutual information is biased towards infrequent words and contexts, so it is multiplied by a discounting factor to compensate:

df(w,c) = f(w,c) / (f(w,c) + 1) × min(f(w), f(c)) / (min(f(w), f(c)) + 1)
where
f(w) = sum(f(w,c) for c in contexts)
f(c) = sum(f(w,c) for w in words)
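To make this concrete, here is a small sketch (my own, not from the paper) that turns the raw counts f(w,c) into discounted pointwise mutual information values, assuming the word → {feature: count} dictionaries built in the earlier sketch:
import math
from collections import defaultdict

def discounted_pmi(vectors):
    # vectors: word -> {feature: f(w,c)}
    f_w = {w: sum(feats.values()) for w, feats in vectors.items()}  # f(w)
    f_c = defaultdict(int)                                          # f(c)
    for feats in vectors.values():
        for c, f in feats.items():
            f_c[c] += f
    N = sum(f_w.values())  # total word-context co-occurrence count
    pmi = {}
    for w, feats in vectors.items():
        pmi[w] = {}
        for c, f in feats.items():
            mi = math.log((f / N) / ((f_w[w] / N) * (f_c[c] / N)))
            discount = (f / (f + 1)) * (min(f_w[w], f_c[c]) / (min(f_w[w], f_c[c]) + 1))
            pmi[w][c] = mi * discount
    return pmi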

In order to measure the similarity between two words, the cosine coefficient is used:

sim(a,b) = sum(mi(a,c) × mi(b,c) for c in contexts) / sqrt(sum(mi(a,c)^2 for c in contexts) × sum(mi(b,c)^2 for c in contexts))
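A direct translation of this formula into code, over the sparse PMI dictionaries from the sketch above, could look like this:
import math

def sim(a, b, pmi):
    # cosine coefficient between the discounted PMI vectors of words a and b
    va, vb = pmi[a], pmi[b]
    dot = sum(value * vb.get(c, 0.0) for c, value in va.items())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)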

Words with more than one meaning, or sense, are called polysemous words. In order to cluster the words according to their senses, using a different cluster for each sense, three phases are used.

In the first phase, the top 10 most similar words of each word are found. In order to speed up the search, the feature vectors are indexed by their features, so that all the feature vectors containing a given feature can be looked up directly. The index is then used to find feature vectors that share features with the word whose similar words we are looking for; in this way, all the words which share nothing in common with the word of interest are skipped. To speed things up further, only some of the word's features are considered: features with a small value (pointwise mutual information) are ignored when looking for similar words, so only words which share the high-valued features are retrieved from the index. For example, if we are looking for words similar to a word with the following feature vector:
{
  ("object of", "carry") : 5.118,
  ("object of", "bring") : 3.364,
  ("object of", "look at") : 2.753,
  ("object of", "have") : 1.353
}
and we only consider features with a value higher than 3.0, then we will use the index to find all the words whose feature vectors contain the features
{
  ("object of", "carry") : 5.118,
  ("object of", "bring") : 3.364
}
This narrows down the number of words whose similarity needs to be checked (using the sim function) and so speeds up the search greatly. The top 10 most similar words are then retrieved. Ignoring the words which do not share any high-valued features did not seem to affect the results.
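
A minimal sketch of this lookup (again my own illustration, reusing the sim function sketched above; the 3.0 cutoff is just the value from the example):
from collections import defaultdict

def build_feature_index(pmi):
    # inverted index: feature -> set of words whose vectors contain that feature
    index = defaultdict(set)
    for word, features in pmi.items():
        for feature in features:
            index[feature].add(word)
    return index

def top_similar_words(word, pmi, index, k=10, min_feature_value=3.0):
    # only compare against words that share at least one high-valued feature with "word"
    candidates = set()
    for feature, value in pmi[word].items():
        if value > min_feature_value:
            candidates |= index[feature]
    candidates.discard(word)
    return sorted(candidates, key=lambda other: sim(word, other, pmi), reverse=True)[:k]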

In the second phase, a clustering algorithm called "clustering by committee" is applied to the words using the similarity lists of the first phase. The clustering algorithm tries to assign words to small, tight clusters called "committees"; words which don't fit into any committee are called "residues". The point of these small clusters is to obtain centroids from which to cluster all the words, just like the randomly chosen centroids in k-means clustering, except that here the centroids are chosen intelligently. Each centroid originates from a committee of words which all share a particular word sense, so in effect the committee decides whether or not to allow a word into the cluster. Each committee is chosen so that it is composed of words which do not have multiple meanings as used in the corpus. This makes the centroid of its feature vectors tend to have only features which belong to one particular word sense rather than to several senses, which leads to purer sense clusters. For example, when clustering US state names, state names which are also names of cities (such as New York) are not included in the committee; only state names which mean nothing else are included, which avoids pulling city names into the state name cluster when the full clustering is performed. This is the clustering algorithm:
def clustering_by_committee(words, simlists, committeeSimilarityThreshold, residueSimilarityThreshold):
    highest_scoring_clusters = []
    #STEP 1.1
    for word in words:
        #STEP 1.2
        #use average link hierarchical agglomerative clustering to cluster the word's similarity list into similarity clusters (merging stops once the similarity between clusters falls below a minimum threshold)
        clusters = averagelink_cluster(simlists[word])
        #STEP 1.3
        scores = {}
        for cluster in clusters:
            scores[cluster] = size(cluster) * average_pairwise_similarity(cluster)
        #STEP 1.4
        highest_scoring_clusters.append(arg_max(scores)) #keep only the highest scoring cluster for this word

    #STEP 2.1
    sort_descending(highest_scoring_clusters) #sort highest_scoring_clusters in descending order of score

    #STEP 3.1
    committees = []
    #STEP 3.2
    for cluster in highest_scoring_clusters:
        #STEP 3.3
        #the centroid is computed by averaging the frequencies (not the pointwise mutual information) of the features of each word in the cluster. these average frequencies are then used to compute the pointwise mutual information values that make up the centroid vector (just as the feature vectors were constructed).
        centroid = compute_centroid(cluster)
        #STEP 3.4
        #a committee must not be very similar to any existing committee
        if all(similarity(centroid, committee) < committeeSimilarityThreshold for committee in committees):
            committees.append(cluster)

    #STEP 4.1
    if committees == []:
        return committees

    residues = []
    #STEP 5.1
    for word in words:
        #STEP 5.2
        #if a word is not sufficiently close to any committee then it is a residue and more committees are needed to cover it. otherwise the word is "covered" by a committee.
        if all(similarity(word, committee) < residueSimilarityThreshold for committee in committees):
            residues.append(word)

    #STEP 6.1
    if residues == []:
        return committees
    #STEP 6.2
    else:
        return committees + clustering_by_committee(residues, simlists, committeeSimilarityThreshold, residueSimilarityThreshold)
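The pseudocode leaves averagelink_cluster abstract. Here is a minimal sketch of average link agglomerative clustering (my own illustration: it takes an explicit similarity function and stopping threshold, whereas the pseudocode above passes only the similarity list):
def averagelink_cluster(word_list, sim, min_similarity=0.25):
    # start with one cluster per word and repeatedly merge the two most similar clusters,
    # where cluster similarity is the average pairwise similarity between their members;
    # stop when no pair of clusters is similar enough to merge
    clusters = [[w] for w in word_list]

    def avg_sim(c1, c2):
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]))
        if avg_sim(clusters[i], clusters[j]) < min_similarity:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters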
In the third phase, the committees discovered in the second phase are used to actually cluster the words by their different senses. The words are not added to the committees themselves, but to clusters, each of which is represented by its own committee. The similarity of a word to a cluster is the similarity between the word's feature vector and the centroid of the cluster's committee (which stays the same regardless of which words are added to the cluster). If hard clustering is used then each word is only added to the most similar cluster, but if soft clustering is used, as is the case here, then the same word can be added to more than one cluster, one for each of its senses. Here is the algorithm which assigns a word to clusters:
def assign_word_to_clusters(word, committeeClusters, minWordSimilarity, maxClusterSimilarity):
    clusters = [] #the clusters this word has been assigned to so far
    similar_clusters = top_200_most_similar(committeeClusters, word) #find the top 200 clusters whose committee's centroid is most similar to the word
    while similar_clusters != []:
        best_cluster = get_most_similar_cluster(similar_clusters, word)
        if similarity(word, best_cluster) < minWordSimilarity:
            break #no remaining cluster is similar enough to assign the word to
        #a word cannot be assigned to a cluster whose committee's centroid is too similar to the centroid of a cluster the word was already assigned to
        elif all(similarity(best_cluster, cluster) < maxClusterSimilarity for cluster in clusters):
            best_cluster.add(word)
            remove_common_features(word, best_cluster) #remove the features from the word's feature vector which are also found in best_cluster's committee's centroid
            clusters.append(best_cluster)
        similar_clusters.remove(best_cluster)
    return clusters
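remove_common_features is also left abstract; one plausible reading (my own sketch, which passes the PMI dictionaries explicitly and assumes the cluster exposes its committee's centroid as an attribute) is to simply drop the overlapping features:
def remove_common_features(word, cluster, pmi):
    # drop the features of the word's vector that also appear in the centroid of the
    # cluster's committee, so the remaining features can reveal the word's other senses
    centroid = cluster.committee_centroid  # assumed attribute holding the centroid's feature dict
    for feature in list(pmi[word]):
        if feature in centroid:
            del pmi[word][feature]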
The fact that the features which are also found in the cluster's committee's centroid are removed from the word before it can be assigned to another cluster is important, because it allows the word to be assigned to the clusters of its less common senses. For example, the word "plant" would first be assigned to the "factory" sense cluster in a newspaper corpus, because that is what the word means most of the time in such a corpus. After it is assigned to the factory sense cluster, the features which have to do with this sense are removed (or their values are set to 0) from the feature vector of "plant". Without these features, the remaining features belong to a different sense, such as the living plant sense, and so the word can now easily be assigned to the living plant sense cluster because those features are no longer overshadowed.

So this system outputs a list of clusters to which each sense of every word belongs. In order to evaluate it, the sense clusters are compared to WordNet synonym groups called synsets. This is how the comparison is made. First, the similarity measure proposed by Lin, which measures the similarity between two WordNet synsets, is used (this is described in another post here):
sim_synset(s1, s2) = 2 × log P(most_specific_common_synset(s1, s2)) / (log P(s1) + log P(s2))
The similarity between a word and a synset is the maximum similarity between the synset and one of the synsets in which the word belongs:
sim_synset_word(s, w) = max(sim_synset(s, t) for t in synsets_of(w))
Let "top_words(c, k)" be the top "k" most similar words in a sense cluster "c" to the committee which the cluster was based on. The similarity between a synset and a sense cluster is the average similarity between the synset and "top_words(c, k)".
sim_synset_cluster(s, c, k) = sum(sim_synset_word(s, w) for w in top_words(c, k)) / k
We can check if a word belongs to a correct sense cluster by checking the similarity of the sense cluster to all the synsets of the word. If the maximum similarity found is greater than a threshold, then it is considered to be a correct sense cluster.
is_correct_sense_cluster(w, c, k, threshold) = max(sim_synset_cluster(s, c, k) for s in synsets_of(w)) >= threshold
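As a rough sketch of these measures using NLTK's WordNet interface (my own illustration, restricted to noun synsets; the paper does not use NLTK, and top_words is assumed to be a helper available from the clustering step):
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content used by Lin's synset similarity

def sim_synset_word(s, w):
    # maximum Lin similarity between synset s and any noun synset of word w
    return max((s.lin_similarity(t, brown_ic) for t in wn.synsets(w, pos=wn.NOUN)), default=0.0)

def sim_synset_cluster(s, c, k):
    words = top_words(c, k)  # assumed helper: top k words of cluster c closest to its committee
    return sum(sim_synset_word(s, w) for w in words) / k

def is_correct_sense_cluster(w, c, k, threshold):
    return max(sim_synset_cluster(s, c, k) for s in wn.synsets(w, pos=wn.NOUN)) >= threshold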
"k" was set to 4 and the threshold was varied in the experiments. By checking for which synset is most similar to the cluster, we can find the WordNet sense that corresponds to the cluster. If more than one cluster corresponds to the same WordNet sense, then only one of them is considered correct. The precision of the system's assignment of a particular word is the percentage of clusters the word is assigned to that are a correct sense. The precision of the system in general is the average precision of all words. The recall of the system's assignment of a particular word is the percentage of correct clusters the word is assigned to out of all the actual correct senses of the word which are found in the corpus. Since the corpus was not sense tagged this is a tricky situation. So the next best thing was to use several clustering methods to extract as many correct sense clusters (as defined above) for the words as possible. The union of all these correct senses are used as the number of different senses each word has in the corpus. The overall recall of the system is the average recall of every word. The precision and recall scores are combined into an F measure:
F = 2 × precision × recall / (precision + recall)
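A tiny sketch of how the per-word scores combine (my own illustration; the argument names are made up):
def word_scores(assigned_correct, assigned_total, correct_senses_in_corpus):
    # assigned_correct: number of the word's assigned clusters judged to be correct senses
    # assigned_total: total number of clusters the word was assigned to
    # correct_senses_in_corpus: number of correct senses of the word found by any clustering method
    precision = assigned_correct / assigned_total
    recall = assigned_correct / correct_senses_in_corpus
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# e.g. a word assigned to 3 clusters, 2 of them correct, out of 4 known corpus senses:
# word_scores(2, 3, 4) -> approximately (0.667, 0.5, 0.571)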
In order to do the evaluation, 144 million words were parsed using Minipar to extract the grammatical relations between words. The words used for evaluation were those found in WordNet whose total pointwise mutual information with all their features was greater than 250; 13403 words met these criteria, and the average number of features per word was 740.8. As already mentioned, the threshold used in "is_correct_sense_cluster" was varied during the experiments, and the F measure was computed at each threshold for different clustering algorithms. The other clustering algorithms were modified to just produce cluster centroids, and each word was then assigned to every centroid with similarity greater than 0.18, so that the same word could be associated with a different cluster for each of its senses. The system described here, "CBC" (Clustering By Committee), outperformed all the other clustering methods at every threshold.