Tuesday, June 26, 2012

Summary of research paper "Simple English Wikipedia: A New Text Simplification Task" by William Coster and David Kauchak

This is a summary of the research paper http://www.aclweb.org/anthology-new/P/P11/P11-2117.pdf. The paper describes a way to automatically create a parallel corpus that pairs sentences with simpler versions of the same sentences, which is useful for training text simplification systems. The sentences come from articles in Wikipedia and Simple English Wikipedia. Simple English Wikipedia is a version of Wikipedia which is edited with the aim of being understandable by children and English learners, encouraging the use of simple vocabulary and grammar. By pairing up equivalent sentences in Wikipedia and Simple English Wikipedia, it was possible to build a parallel corpus of 137,000 sentence pairs.

In order to build such a corpus, both Wikipedia sites were downloaded, the articles were paired by title, all pairs that contained an unusable article (e.g. stubs, disambiguation pages, etc.) were removed, the equivalent paragraphs in each article pair were paired up, and finally the equivalent sentences in each paragraph pair were paired up.
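
As a rough sketch of the article pairing step (the function and dictionary names are my own, and the test for stubs and disambiguation pages is left abstract):
def pair_articles(wiki_articles, simple_articles, is_unusable):
    #wiki_articles and simple_articles map article titles to article text;
    #is_unusable is a predicate that flags stubs, disambiguation pages, etc.
    pairs = []
    for title in simple_articles:
        if title not in wiki_articles:
            continue #no Wikipedia article with the same title
        (wiki_text, simple_text) = (wiki_articles[title], simple_articles[title])
        if is_unusable(wiki_text) or is_unusable(simple_text):
            continue #drop pairs containing an unusable article
        pairs.append((wiki_text, simple_text))
    return pairs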

To align paragraphs within an article pair, a single paragraph in Simple English Wikipedia was allowed to correspond to multiple paragraphs in Wikipedia, but not the other way round. The words in each paragraph of the Simple English Wikipedia article were used as a set of keywords to search for corresponding paragraphs in the Wikipedia article: every paragraph was represented as a TF-IDF vector, and only those Wikipedia paragraphs whose vectors had a cosine similarity of 0.5 or more to the Simple English Wikipedia paragraph were kept. In this way, each paragraph in the Simple English Wikipedia article is paired up with a list of similar paragraphs from the Wikipedia article.
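
As an illustration of this step, here is a minimal sketch using scikit-learn's TF-IDF vectorizer and cosine similarity; the library choice, the function name and the exact handling of the threshold are mine, since the paper does not specify an implementation:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_paragraphs(simple_paragraphs, wiki_paragraphs, threshold=0.5):
    #Represent every paragraph as a TF-IDF vector over a shared vocabulary
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(simple_paragraphs + wiki_paragraphs)
    simple_vectors = vectors[:len(simple_paragraphs)]
    wiki_vectors = vectors[len(simple_paragraphs):]
    similarities = cosine_similarity(simple_vectors, wiki_vectors)
    #For each Simple English Wikipedia paragraph, keep the Wikipedia paragraphs
    #whose cosine similarity to it is at least the threshold (0.5 in the paper)
    matches = []
    for i in range(len(simple_paragraphs)):
        matched = [wiki_paragraphs[j] for j in range(len(wiki_paragraphs)) if similarities[i, j] >= threshold]
        matches.append(matched)
    return matches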

Once the paragraph pairs were found in each article pair, the next step was to pair up the sentences in each paragraph pair so that each sentence pair consists of a Wikipedia sentence and its simplified counterpart from Simple English Wikipedia. Such a pairing of sentences is called an "alignment". This was harder than pairing paragraphs because multiple sentences in Simple English Wikipedia can now correspond to multiple sentences in Wikipedia. Instead, the sentences in each paragraph pair are paired up in many different ways, and the alignment which gives the maximum sum of similarities over its sentence pairs is used. The similarity of two sentences in a pair is again found using TF-IDF with cosine similarity. The paragraph pairing was important in order to limit the number of different ways in which the sentences could be aligned.

Following some clarifications via email with the author, here is how the sentence alignment algorithm works:

Let "n" be the number of Wikipedia sentences in a paragraph pair
Let "m" be the number of Simple English Wikipedia sentences in a paragraph pair
Let "sim(i,j)" be a function which measures the similarity between the "i"th Wikipedia sentence and the "j"th Simple English Wikipedia sentence (using TF-IDF with cosine similarity) in a paragraph pair

I shall now introduce a new function "align(n,m)" which is not mentioned in the paper but which will help explain the whole sentence alignment concept.

The function "align(n,m)" gives the best sentence alignment such that the sum of the similarity between each sentence pair is maximized. In order to align the sentences, 6 operations are successively applied to the sentences in the paired paragraphs. The operations are:
  • Skip the next sentence in the Wikipedia paragraph (it does not correspond to any of the sentences in the Simple English Wikipedia paragraph)
  • Skip the next sentence in the Simple English Wikipedia paragraph (it does not correspond to any of the sentences in the Wikipedia paragraph)
  • Pair the next Wikipedia sentence with the next Simple English Wikipedia sentence
  • Pair the next Wikipedia sentence with the next two Simple English Wikipedia sentences
  • Pair the next two Wikipedia sentences with the next Simple English Wikipedia sentence
  • Pair the next two Wikipedia sentences with the next two Simple English Wikipedia sentences, crossed over (for example, if the Wikipedia sentences are [a, b] and the Simple English Wikipedia sentences are [x, y] then the pairing would be [(a,y), (b,x)])

These operations are applied successively using dynamic programming in order to find the alignment with the maximum sum of similarities.

We shall first see the function "a(n,m)" which calculates this maximal similarity and then we shall see "align(n,m)". "a(n,m)" will also apply the operations using dynamic programming, but it will only calculate the sum of the similarities of the sentence pairs rather than the actual pairs. Starting with arguments equal to the number of sentences in each paragraph of the paragraph pair, "a(n,m)" will recursively call itself on fewer and fewer sentences until it reaches the base case of having no sentences left to align. In accordance with the paper, the "i"th sentence is not 0 based like in arrays but 1 based. So the first sentence is in position 1, the second is in position 2, etc.

This is "a(i,j)" in Python without using dynamic programming (for simplicity):

def a(i, j):
    if i == 0 or j == 0: #Base case: if there are no more sentences to align in the paragraphs then the similarity is 0
        return 0
    
    else: #Recursive case: if there are sentences to align then check what the maximum similarity achieved will be after applying each operation and return the maximum of the similarities
        alternatives = []
        
        #Max similarity after skipping next Simple English Wikipedia sentence
        alternatives.append(a(i, j-1) - skip_penalty)
        
        #Max similarity after skipping next Wikipedia sentence
        alternatives.append(a(i-1, j) - skip_penalty)
        
        #Max similarity after pairing next Wikipedia sentence with next Simple English Wikipedia sentence
        alternatives.append(a(i-1, j-1) + sim(i, j))
        
        if j >= 2: #If there is more than one Simple English Wikipedia sentence then try pairing up two such sentences
            #Max similarity after pairing next Wikipedia sentence with next two Simple English Wikipedia sentences
            alternatives.append(a(i-1, j-2) + sim(i, j-1) + sim(i, j))
            
        if i >= 2: #If there is more than one Wikipedia sentence then try pairing up two such sentences
            #Max similarity after pairing next two Wikipedia sentences with next Simple English Wikipedia sentence
            alternatives.append(a(i-2, j-1) + sim(i-1, j) + sim(i, j))
        
        if i >= 2 and j >= 2: #If both paragraphs have more than one sentence then try pairing up two of each
            #Max similarity after pairing next two Wikipedia sentences with next two Simple English Wikipedia sentences crossed up
            alternatives.append(a(i-2, j-2) + sim(i-1, j) + sim(i, j-1))

        return max(alternatives) #Return the maximum of the similarities

"skip_penalty" is a constant (the authors set it to 0.0001) which is used to reduce the total similarity every time a sentence is skipped. This is to discourage skipping too many sentences.

Now "align(n,m)" would be very similar to "a(n,m)". Every "sim(x,y)" there is in "a(n,m)" represents a pairing of the "x"th Wikipedia sentence with the "y"th Simple English Wikipedia sentence. So in "align(n,m)", every time we compute "sim(x,y)" we would add the pair "(x,y)" to our alignment. We still need to know what the maximum similarity of the alignment will be after applying each operation so "align(n,m)" needs to return both the similarity and the alignment.

This is "align(i,j)" in Python without using dynamic programming (for simplicity):

def align(i, j):
    if i == 0 or j == 0: #Base case: if there are no more sentences to align in the paragraphs then the similarity is 0
        return (0, [])
    
    else: #Recursive case: if there are sentences to align then check what the maximum similarity achieved will be after applying each operation and return the maximum of the similarities
        alternatives = []
        
        #Best alignment after skipping next Simple English Wikipedia sentence
        (similarity, alignment) = align(i, j-1)
        alternatives.append((similarity - skip_penalty, alignment))
        
        #Best alignment after skipping next Wikipedia sentence
        (similarity, alignment) = align(i-1, j)
        alternatives.append((similarity - skip_penalty, alignment))
        
        #Best alignment after pairing next Wikipedia sentence with next Simple English Wikipedia sentence
        (similarity, alignment) = align(i-1, j-1)
        alternatives.append((similarity + sim(i, j), alignment + [(i, j)]))
        
        if j >= 2: #If there is more than one Simple English Wikipedia sentence then try pairing up two such sentences
            #Best alignment after pairing next Wikipedia sentence with next two Simple English Wikipedia sentences
            (similarity, alignment) = align(i-1, j-2)
            alternatives.append((similarity + sim(i, j-1) + sim(i, j), alignment + [(i, j-1), (i, j)]))
            
        if i >= 2: #If there is more than one Wikipedia sentence then try pairing up two such sentences
            #Best alignment after pairing next two Wikipedia sentences with next Simple English Wikipedia sentence
            (similarity, alignment) = align(i-2, j-1)
            alternatives.append((similarity + sim(i-1, j) + sim(i, j), alignment + [(i-1, j), (i, j)]))
        
        if i >= 2 and j >= 2: #If both paragraphs have more than one sentence then try pairing up two of each
            #Best alignment after pairing next two Wikipedia sentences with next two Simple English Wikipedia sentences crossed up
            (similarity, alignment) = align(i-2, j-2)
            alternatives.append((similarity + sim(i-1, j) + sim(i, j-1), alignment + [(i-1, j), (i, j-1)]))

        return max(alternatives, key=lambda p:p[0]) #Return the maximum of the similarities (the similarity is the first element of each similarity/alignment pair)

And here is the same algorithm using dynamic programming:
def align(i, j):
    memo = dict()
    
    for x in range(0, i+1):
        memo[(x, 0)] = (0.0, [])
    for y in range(0, j+1):
        memo[(0, y)] = (0.0, [])

    for y in range(1, j+1):
        for x in range(1, i+1):
            alternatives = []
             
            #Best alignment after skipping next Simple English Wikipedia sentence
            (similarity, alignment) = memo[(x, y-1)]
            alternatives.append((similarity - skip_penalty, alignment))
             
            #Best alignment after skipping next Wikipedia sentence
            (similarity, alignment) = memo[(x-1, y)]
            alternatives.append((similarity - skip_penalty, alignment))
             
            #Best alignment after pairing next Wikipedia sentence with next Simple English Wikipedia sentence
            (similarity, alignment) = memo[(x-1, y-1)]
            alternatives.append((similarity + sim(x, y), alignment + [(x, y)]))
             
            if y >= 2: #If there is more than one Simple English Wikipedia sentence then try pairing up two such sentences
                #Best alignment after pairing next Wikipedia sentence with next two Simple English Wikipedia sentences
                (similarity, alignment) = memo[(x-1, y-2)]
                alternatives.append((similarity + sim(x, y-1) + sim(x, y), alignment + [(x, y-1), (x, y)]))
                 
            if x >= 2: #If there is more than one Wikipedia sentence then try pairing up two such sentences
                #Best alignment after pairing next two Wikipedia sentences with next Simple English Wikipedia sentence
                (similarity, alignment) = memo[(x-2, y-1)]
                alternatives.append((similarity + sim(x-1, y) + sim(x, y), alignment + [(x-1, y), (x, y)]))
             
            if x >= 2 and y >= 2: #If both paragraphs have more than one sentence then try pairing up two of each
                #Best alignment after pairing next two Wikipedia sentences with next two Simple English Wikipedia sentences crossed up
                (similarity, alignment) = memo[(x-2, y-2)]
                alternatives.append((similarity + sim(x-1, y) + sim(x, y-1), alignment + [(x-1, y), (x, y-1)]))

            memo[(x,y)] = max(alternatives, key=lambda p:p[0]) #Store the best alternative for this subproblem (the similarity is the first element of each similarity/alignment pair)
    return memo[(i,j)]

This function will return the best alignment and its total similarity. The alignment will be a list of sentence position pairs and not the sentence pairs themselves.
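
As a usage sketch, where "wiki_sents" and "simple_sents" are the sentence lists of one paragraph pair and "tfidf_sim" is some function (not shown) returning the TF-IDF cosine similarity of two sentence strings, none of which are names from the paper:
skip_penalty = 0.0001

def sim(i, j):
    #The alignment code uses 1-based positions, hence the -1
    return tfidf_sim(wiki_sents[i - 1], simple_sents[j - 1])

(total_similarity, alignment) = align(len(wiki_sents), len(simple_sents))
sentence_pairs = [(wiki_sents[i - 1], simple_sents[j - 1]) for (i, j) in alignment]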

Once the sentences in each paragraph pair were aligned, the sentence pairs were filtered: any pair with a similarity below 0.5 was discarded. Overall, the process went from 10,000 article pairs, to 75,000 paragraph pairs, to 137,000 sentence pairs. Out of a sample of 100 sentence pairs, human judges determined that 91 of them were correct sentence simplifications. The corpus was also used to train Moses, a phrase-based translation system, which simplified test sentences with a statistically significant improvement.

The final parallel corpus is available at http://www.cs.pomona.edu/~dkauchak/simplification/.

Saturday, June 23, 2012

Summary of research paper "Readability Assessment for Text Simplification" by Sandra Aluisio, Lucia Specia, Caroline Gasperin and Carolina Scarton

This is a summary of the research paper http://aclweb.org/anthology-new/W/W10/W10-1001.pdf. This paper is about a program which classifies text into one of three reading difficulty levels: "rudimentary", "basic" and "advanced". These difficulty levels are defined by INAF (Portuguese document at http://www.ibope.com.br/ipm/relatorios/relatorio_inaf_2009.pdf), the National Indicator of Functional Literacy. The aim is to use this program to help authors detect parts of their writing which could be simplified. The program is designed to work with Portuguese rather than English.

The following features are extracted from the Portuguese text being assessed, using various tools and resources described in the paper (the list is copied directly from the paper):
  1. Number of words
  2. Number of sentences
  3. Number of paragraphs
  4. Number of verbs
  5. Number of nouns
  6. Number of adjectives
  7. Number of adverbs
  8. Number of pronouns
  9. Average number of words per sentence
  10. Average number of sentences per paragraph
  11. Average number of syllables per word
  12. Flesch index for Portuguese
  13. Incidence of content words
  14. Incidence of functional words
  15. Raw Frequency of content words
  16. Minimal frequency of content words
  17. Average number of verb hypernyms
  18. Incidence of NPs
  19. Number of NP modifiers
  20. Number of words before the main verb
  21. Number of high level constituents
  22. Number of personal pronouns
  23. Type-token ratio
  24. Pronoun-NP ratio
  25. Number of “e” (and)
  26. Number of “ou” (or)
  27. Number of “se” (if)
  28. Number of negations
  29. Number of logic operators
  30. Number of connectives
  31. Number of positive additive connectives
  32. Number of negative additive connectives
  33. Number of positive temporal connectives
  34. Number of negative temporal connectives
  35. Number of positive causal connectives
  36. Number of negative causal connectives
  37. Number of positive logic connectives
  38. Number of negative logic connectives
  39. Verb ambiguity ratio
  40. Noun ambiguity ratio
  41. Adverb ambiguity ratio
  42. Adjective ambiguity ratio
  43. Incidence of clauses
  44. Incidence of adverbial phrases
  45. Incidence of apposition
  46. Incidence of passive voice
  47. Incidence of relative clauses
  48. Incidence of coordination
  49. Incidence of subordination
  50. Out-of-vocabulary words
  51. LM probability of unigrams
  52. LM perplexity of unigrams
  53. LM perplexity of unigrams, without line break
  54. LM probability of bigrams
  55. LM perplexity of bigrams
  56. LM perplexity of bigrams, without line break
  57. LM probability of trigrams
  58. LM perplexity of trigrams
  59. LM perplexity of trigrams, without line break

The features from 1 to 42 are the "Coh-Metrix-PORT" feature group and were derived from the Coh-Metrix-PORT tool, which is a Portuguese adaptation of Coh-Metrix.

The features from 43 to 49 are the "Syntactic" feature group and were added to analyse syntactic constructions which are useful for automatic simplification.

The features from 50 to 59 are the "Language model" feature group. They are derived from n-gram language model probabilities and perplexities computed for the input text against n-gram statistics of Portuguese in general, together with a count of words which are not in the system's vocabulary.
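
To give an idea of how such features can be computed, here is a toy unigram sketch with add-one smoothing; the authors would have used a proper language modelling toolkit trained on a large Portuguese corpus, so treat this purely as an illustration of the probability, perplexity and out-of-vocabulary ideas:
import math
from collections import Counter

def unigram_lm_features(training_tokens, text_tokens):
    counts = Counter(training_tokens)
    total = sum(counts.values())
    vocabulary_size = len(counts) + 1 #+1 leaves room for unseen words
    log_prob = 0.0
    out_of_vocabulary = 0
    for token in text_tokens:
        if token not in counts:
            out_of_vocabulary += 1
        #Add-one smoothing so that unseen words do not get probability 0
        p = float(counts[token] + 1) / (total + vocabulary_size)
        log_prob += math.log(p)
    perplexity = math.exp(-log_prob / len(text_tokens))
    return {"log probability": log_prob, "perplexity": perplexity, "out-of-vocabulary words": out_of_vocabulary}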

The features from 1 to 3 and 9 to 11 are also called the "Basic" feature group because they require no linguistic knowledge. Feature 12 is also put into a feature group on its own called "Flesch".
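
The "Basic" group is the easiest to reproduce. Here is a rough sketch with deliberately naive word, sentence and paragraph splitting (a real implementation would use a proper tokenizer, and features 11 and 12 would additionally need a syllable counter):
def basic_features(text):
    paragraphs = [p for p in text.split("\n\n") if p.strip() != ""]
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip() != ""]
    words = text.split()
    return {
        "number of words": len(words),
        "number of sentences": len(sentences),
        "number of paragraphs": len(paragraphs),
        "words per sentence": float(len(words)) / len(sentences),
        "sentences per paragraph": float(len(sentences)) / len(paragraphs),
    }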

Experiments were performed to check whether machine learning techniques can be used to detect the INAF reading difficulty levels and to discover which features are the most useful for complexity detection.

Three types of machine learning algorithms in Weka were evaluated:
  • "Standard classification" (SMO): A standard classifier (support vector machines) which assumes that there is no relation between the difficulty levels
  • "Ordinal classification" (OrdinalClassClassifier): An ordinal classifier which assumes that the difficulty levels are ordered
  • "Regression" (SMO-reg): A regressor which assumes that the difficulty levels are continuous

Manually simplified corpora were used as training data for the algorithms. The corpora were simplified into each of the difficulty levels. Using these simplified corpora it was possible to check if the system could predict that the original corpora are of "advanced" difficulty level, the lightly simplified corpora are of "basic" difficulty level and the heavily simplified corpora are of "rudimentary" difficulty level.

In order to determine the importance of each feature described, the absolute Pearson correlation between each feature's values and the texts' difficulty levels was computed (a small sketch of this computation follows the list below). The results for the best features, copied from the paper, are (1=highly correlated, 0=not correlated):
  • Words per sentence: 0.693
  • Incidence of apposition: 0.688
  • Incidence of clauses: 0.614
  • Flesch index: 0.580
  • Words before main verb: 0.516
  • Sentences per paragraph: 0.509
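
Here is the sketch of the feature-ranking computation promised above, assuming each feature's values and the texts' difficulty levels (encoded as numbers) are available as parallel lists; all names are illustrative:
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    covariance = sum((x - mean_x) * (y - mean_y) for (x, y) in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)

def rank_features(feature_values, levels):
    #feature_values maps a feature name to its list of per-text values
    scores = [(name, abs(pearson(values, levels))) for (name, values) in feature_values.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)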

In order to determine which machine learning algorithm is best at classifying texts into the correct difficulty levels using the features mentioned earlier, the Pearson correlation between the predicted and the true difficulty levels was used. The experiments were also run using the sub-groups of features mentioned above to see how important each group is. Here are the results, summarized from the paper:

Standard classification (Pearson correlation by feature group)
  • All: 0.84
  • Language model: 0.25
  • Basic: 0.76
  • Syntactic: 0.82
  • Coh-Metrix-PORT: 0.79
  • Flesch: 0.52

Ordinal classification (Pearson correlation by feature group)
  • All: 0.83
  • Language model: 0.49
  • Basic: 0.73
  • Syntactic: 0.81
  • Coh-Metrix-PORT: 0.8
  • Flesch: 0.56

Regression (Pearson correlation by feature group)
  • All: 0.8502
  • Language model: 0.6245
  • Basic: 0.7266
  • Syntactic: 0.8063
  • Coh-Metrix-PORT: 0.8051
  • Flesch: 0.5772

It is clear that using all the features together is better than using any sub-group of them; among the sub-groups, the syntactic features, followed by the Coh-Metrix-PORT features, are the most useful, while the language model feature group is the worst.

The standard classification algorithm was chosen as the best because, although it is the simplest of the three, its results are comparable to those of the other algorithms.

Sunday, June 17, 2012

Rummy and the most probable sequence in a probability tree

Rummy
Lately I have been trying to program a game of Rummy, a card game where you repeatedly replace one of your cards with the card on the deck until you have grouped your cards into sequences of 3 or more cards each. The challenge in programming such a game seemed to be writing the code which determines which of your cards should be discarded and replaced with the one on the deck.

The way this was done was by giving each card in the deck a probability of being available. Cards in the player's hand are given a probability of 1, and cards which are lost in the graveyard or taken by the opponent are given a probability of 0. The remaining cards are given a probability depending on how likely it is that the opponent holds them, given the cards they have taken.

Once each card has a probability of being obtained, we can calculate the probability of every legal group of cards and of every possible arrangement of groups. The arrangement with the maximum probability, built from cards in the player's hand and cards which are likely to be available, is the one to pursue. Any cards in the player's hand which are not part of that most likely arrangement can be discarded and replaced.

Examples of legal sequences of cards which can be used as a group in Rummy are
  • ace of diamonds, ace of hearts, ace of spades
  • two of hearts, two of clubs, two of diamonds
  • two of hearts, two of clubs, two of diamonds, two of spades
  • ace of diamonds, two of diamonds, three of diamonds
  • two of spades, three of spades, four of spades
  • two of spades, three of spades, four of spades, five of spades, six of spades
  • etc.

A number of groups needs to be arranged such that the total number of cards equals 10 and no two groups share a card. The probability of a card group is the product of the probabilities of the cards in the group, and the aim is to find the arrangement of card groups with the maximum probability of being obtained. Obviously, since only cards that are already in the player's hand have a probability of 1, the more cards used which are already in the player's hand, the better.
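
For example, the probability of a candidate group is simply the product of its cards' availability probabilities; the card representation and the probability table here are hypothetical:
def group_prob(group, card_prob):
    #card_prob maps a card, e.g. ("two", "hearts"), to its estimated availability
    prob = 1.0
    for card in group:
        prob *= card_prob[card]
    return prob
A group made up only of cards already in hand (probability 1 each) keeps a probability of 1, while a group needing an unlikely card is penalized by that card's probability.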

Since a group cannot contain any cards that are in another group, we need to treat the choosing of groups as a probability tree, where each node in the tree is a new legal group. As we choose more groups, there are fewer groups left to choose from for the next choice. Since there are so many possible groups to choose from, an algorithm which quickly finds the most probable sequence in the probability tree was needed.

The probability tree would look something like this:

[Figure: probability tree of card group choices, omitted]
Most probable sequence in a probability tree
I'll be expressing the algorithm using Python. I'll be using this Node class to represent a node in the probability tree:
class Node:
    def __init__(self, prob, children):
        self.prob = prob
        self.children = children
Where "prob" is the probability of choosing that node and "children" is a list of child nodes to follow next. Obviously in practice both of these properties can be dynamically generated with callback functions.

After having a tree, we need to represent a root path in the tree, which is a list of nodes from the root to one of the leaves. We also need to calculate the probability of such a path, which is done using this function:
def path_prob(path):
    prob = 1.0
    for node in path:
        prob *= node.prob
    return prob
The probability of nodes in sequence in a probability tree is the product of each node's probability.

We'll start from a simple algorithm, one which generates the paths from the root to every leaf node.
def all_root_paths_nonacc(node):
    if node.children == []:
        yield [node]
    else:
        for child in node.children:
            for subpath in all_root_paths_nonacc(child):
                yield [node] + subpath
It's a naturally recursive generator function which doesn't use an accumulator. It starts from the root and attaches the root node to the front of every path from its children to a leaf. When a leaf is reached, a list consisting of only that leaf is yielded.

Once we have this function, we can generate every root path in our probability tree and then find which path has the greatest probability using some code such as:
best_path = max(all_root_paths_nonacc(root), key=path_prob)

However this would be inefficient. We need to be able to choose the most probable path during the recursion, in such a way that we can stop recursing mid-way through a path in order to avoid wasting time. I'll explain how this works in a minute. First we need to know what path we are building during the recursion, and to do that we need to use an accumulator, like this:
def all_root_paths(node, prev_path=[]):
    curr_path = prev_path + [node]
    if node.children == []:
        yield curr_path
    else:
        for child in node.children:
            for path in all_root_paths(child, curr_path):
                yield path
This is a modified version of the previous algorithm. It keeps an extra accumulator parameter which starts off as an empty list and has the current node appended to it every time a new recursive call is made. That way, at any point in the recursion, you can know what path is being built from the root node to the current node. When a leaf is reached, the path is yielded.

This algorithm still returns all the root paths in the probability tree which we then have to choose from. Now we shall use an algorithm which returns the most probable path without returning any other path.
def naive_max_root_path(node, prev_path=[], curr_max_path=None):
    curr_path = prev_path + [node]
    if node.children == []:
        if curr_max_path == None or path_prob(curr_path) > path_prob(curr_max_path):
            return curr_path
        else:
            return curr_max_path
    else:
        for child in node.children:
            curr_max_path = naive_max_root_path(child, curr_path, curr_max_path)
        return curr_max_path
Again this is a modification of the previous algorithm. This time we keep two accumulators, one with the path that has been built up to now and one with the path which is the most probable up to now. Every time a full path is built (by reaching a leaf) a choice is made between returning the newly built path and the best current path. The returned path is then used as the current best path. Obviously if the current best path is "None" (null), then it means that we are still generating the first path and so this should be considered the best path.

Even though only the best path is returned, the algorithm still checks every possible path, so it will not be faster than the previous algorithms. We shall now use to our advantage the fact that when a probability is multiplied by another probability, the product is smaller than or equal to both. This means that as we multiply the probabilities of the nodes in a path from first to last, the product never grows: it either stays the same (if the probabilities are all 1) or becomes smaller. So if at any point the probability of a partially built path becomes smaller than the probability of the best path found so far, we can immediately stop building that path and try some other path, since its probability will only get smaller.
def quick_max_root_path(node, prev_path=[], curr_max_path=None):
    curr_path = prev_path + [node]
    if node.children == []:
        if curr_max_path == None or path_prob(curr_path) > path_prob(curr_max_path):
            return curr_path
        else:
            return curr_max_path
    else:
        for child in node.children:
            if curr_max_path == None or path_prob(curr_path + [child]) > path_prob(curr_max_path):
                curr_max_path = quick_max_root_path(child, curr_path, curr_max_path)
        return curr_max_path
As you can see, the only difference between this and the previous algorithm is that a decision is also made in the recursive case, while the path is still being built. This algorithm can probably be made even faster by avoiding the calls to "path_prob" and passing the probabilities along as parameters too: with every recursive call, the probability of the current path is then the probability of the previous path times the probability of the new node. A sketch of such a version follows.
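
Here is a possible sketch of such a version, which threads the probabilities through as parameters and returns a (path, probability) pair so that "path_prob" is never called:
def quick_max_root_path_prob(node, prev_path=[], prev_prob=1.0, curr_max_path=None, curr_max_prob=0.0):
    curr_path = prev_path + [node]
    curr_prob = prev_prob * node.prob #probability of the path built so far
    if node.children == []:
        if curr_max_path == None or curr_prob > curr_max_prob:
            return (curr_path, curr_prob)
        else:
            return (curr_max_path, curr_max_prob)
    else:
        for child in node.children:
            if curr_max_path == None or curr_prob * child.prob > curr_max_prob:
                (curr_max_path, curr_max_prob) = quick_max_root_path_prob(child, curr_path, curr_prob, curr_max_path, curr_max_prob)
        return (curr_max_path, curr_max_prob)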

Many times, and especially in the Rummy case, there is more than one best sequence in a probability tree, all with equal probability. In such cases we might want an algorithm which returns all such paths, not just the first one found. Here is a naive version of the algorithm which checks every possible root path:
def naive_max_root_paths(node, prev_path=[], curr_max_paths=[]):
    curr_path = prev_path + [node]
    if node.children == []:
        if curr_max_paths == [] or path_prob(curr_path) > path_prob(curr_max_paths[0]):
            return [curr_path]
        elif path_prob(curr_path) == path_prob(curr_max_paths[0]):
            return curr_max_paths + [curr_path]
        else:
            return curr_max_paths
    else:
        for child in node.children:
            curr_max_paths = naive_max_root_paths(child, curr_path, curr_max_paths)
        return curr_max_paths
Obviously we now need to return a list, so the parameter which keeps the current best path needs to be a list of paths. As can be seen, there are now three cases to consider in the base case:
  • If the current path is better than the best paths (all paths in "curr_max_paths" have an equal probability), return a list consisting of only the current path.
  • If the current path is equal in probability to the best paths, return a list of the best paths together with the current path.
  • If the best paths are better than the current path, return the best paths only.
So the only difference between this and "naive_max_root_path" is that a path of equal probability to the best is also considered, and appended to the list of best paths if so.

We now use the previous trick to make the algorithm quick by abandoning, mid-way through, any path which is worse than the best. However, in the recursive case we now also need to keep paths which are of equal probability to the best.
def quick_max_root_paths(node, prev_path=[], curr_max_paths=[]):
    curr_path = prev_path + [node]
    if node.children == []:
        if curr_max_paths == [] or path_prob(curr_path) > path_prob(curr_max_paths[0]):
            return [curr_path]
        elif path_prob(curr_path) == path_prob(curr_max_paths[0]):
            return curr_max_paths + [curr_path]
        else:
            return curr_max_paths
    else:
        for child in node.children:
            if curr_max_paths == [] or path_prob(curr_path + [child]) >= path_prob(curr_max_paths[0]):
                curr_max_paths = quick_max_root_paths(child, curr_path, curr_max_paths)
        return curr_max_paths

Conclusion
As already explained, a path in the probability tree corresponds to a choice of card groups in Rummy, and we need to find the arrangement of card groups which is the most likely to be available and which the player should therefore aim for. Now that we can find the most probable arrangement of card groups, we know which cards can be discarded; and since, using the last algorithm, we can find all the arrangements that share the best probability, we can even narrow the discard down to the card which is needed the least across all of the alternative best arrangements.
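
As a final sketch (with illustrative names only), assuming each best path returned by "quick_max_root_paths" has been mapped back to the list of card groups it represents, the discard could be chosen as the card in hand that is needed by the fewest of those arrangements:
def choose_discard(hand, best_arrangements):
    #Each arrangement is a list of card groups and each group a list of cards
    usage = dict((card, 0) for card in hand)
    for arrangement in best_arrangements:
        needed = set(card for group in arrangement for card in group)
        for card in hand:
            if card in needed:
                usage[card] += 1
    return min(hand, key=lambda card: usage[card])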