Tuesday, May 30, 2017

Neural language models and how to make them in Tensorflow 1.0

In this blog post I will explain the basics you need to know in order to create a neural language model in Tensorflow 1.0. This tutorial will cover how to create a training set from raw text, how to use LSTMs, how to work with variable length sentences, what teacher forcing is, and how to train the neural network with early stopping.

Introduction

Components and vocabulary

A neural language model is a neural network that, given a prefix of a sentence, will output the probability that a word will be the next word in the sentence. For example, given the prefix "the dog", the neural network will tell you that "barked" has a high probability of being the next word. This is done by using a recurrent neural network (RNN), such as a long short term memory (LSTM), to turn the prefix into a single vector (that represents the prefix) which is then passed to a softmax layer that gives a probability distribution over every word in the known vocabulary.
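
To make the last step concrete, here is a minimal sketch of how a softmax turns raw scores into a probability distribution. The three-word vocabulary and the scores are made up for illustration; in the real model the scores are computed from the prefix vector.

```python
import math

def softmax(scores):
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores for the prefix "the dog".
toy_vocab = ['barked', 'meowed', 'the']
toy_scores = [2.0, 0.5, -1.0]
toy_probs = softmax(toy_scores)

for (token, prob) in zip(toy_vocab, toy_probs):
    print(token, round(prob, 3))
```

The probabilities always add up to 1 and the word with the highest score gets the highest probability.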

So that there is still a prefix before the first word (which lets us predict the first word of the sentence), we will make use of an artificial "<BEG>" word that represents the beginning of a sentence and is placed before every sentence. Likewise we will have an artificial "<END>" word that represents the end of a sentence in order to be able to predict when the sentence ends.



It is important to note that the words are presented to the neural network as vectors and not as actual strings. The vectors are called word "embeddings" and are derived from a matrix where each row stands for a different word in the vocabulary. This matrix is trained just like the neural network's weights in order to provide the best vector representation for each word.
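
As a rough illustration of the lookup, the sketch below replaces each word with a row of a small matrix. The four-word vocabulary and the vector values are invented for the example; in the real model the matrix is a trained Tensorflow variable.

```python
# Hypothetical 4-word vocabulary with 3-dimensional embeddings (made-up values).
toy_token_to_index = {'<edge>': 0, 'the': 1, 'dog': 2, 'barked': 3}
toy_embeddings = [
    [0.0, 0.0, 0.0],   # row 0: edge token
    [0.1, -0.3, 0.5],  # row 1: 'the'
    [0.7, 0.2, -0.1],  # row 2: 'dog'
    [-0.4, 0.9, 0.3],  # row 3: 'barked'
]

def embed(tokens):
    # Replace each token with the matrix row selected by its index.
    return [toy_embeddings[toy_token_to_index[token]] for token in tokens]

toy_vectors = embed(['the', 'dog'])
print(toy_vectors)
```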

Training

The probabilities are not given explicitly during training. Instead we just expose the neural network to complete sentences taken from a corpus (we will use the Brown corpus from NLTK) and let it learn the probabilities indirectly. Before training, each word will have an approximately equal probability given any prefix. The training procedure works by inputting a prefix of a sentence at one end of the neural network and then forcing it to increase the probability of the given next word in the sentence by a little bit. Since the probabilities of all the words in the output must add up to 1, increasing the probability of the given next word will also decrease the probability of all the other words by a little bit. By repeatedly increasing the probability of the correct word, the distribution will accumulate at the correct word.



Of course given a prefix there will be several words that can follow and not just one. If each one is increased just a bit in turn repeatedly, all the known correct next words will increase together and all the other words will decrease, forcing the probability distribution to share the peak among all the correct words. If the correct words occur at different frequencies given the prefix, the most frequent word will get the most probability (due to increasing its probability more often) whilst the rest of the correct words will get a fraction of that probability in proportion to their relative frequencies.
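
The following toy sketch shows this effect under simplified assumptions: there is a single fixed prefix, the "network" is just one score per candidate word, and the words 'a', 'b' and 'c' and the learning rate are made up. Each update nudges the probability of the observed next word up a little, which pushes the others down.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical setup: after some fixed prefix, 'a' follows three times as
# often as 'b', and 'c' never follows it.
next_words = ['a', 'b', 'c']
observed = ['a', 'a', 'a', 'b']
logits = [0.0, 0.0, 0.0]  # equal probabilities before training
learning_rate = 0.05

for _ in range(2000):
    for word in observed:
        probs = softmax(logits)
        target = next_words.index(word)
        # The gradient of log P(word) with respect to each score is
        # (one-hot - probs), so this step raises the observed word and
        # lowers all the others.
        for j in range(len(logits)):
            logits[j] += learning_rate * ((1.0 if j == target else 0.0) - probs[j])

final_probs = softmax(logits)
print([round(p, 2) for p in final_probs])
```

The final probabilities should end up close to the relative frequencies of the observed words (roughly three quarters for 'a' and one quarter for 'b'), with the never-observed 'c' close to zero.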

The most common way to train neural language models is not actually to use a training set of prefixes and next words, but to use full sentences. Given a sequence of vectors as input, an RNN will output another sequence of vectors. The nth output vector represents the prefix of the input sequence up to the nth input.



This means that we can present the neural network with a sentence, which gets turned into a sequence of vectors by the RNN, each of which gets turned into a probability distribution by the softmax. The training objective is to make the neural network predict which is the next word in the sentence after every prefix (including the end of sentence token at the end). This is done with all the sentences in the corpus so that the neural network is forced to extract some kind of pattern in the language of the sentences and be able to predict the next word in any sentence prefix.
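
Here is a small sketch of which prefix is paired with which next word when a full sentence is used. The function name and the '<edge>' marker are just for illustration; the same token is used for both the beginning and the end of the sentence, as in the code later on.

```python
def make_training_pairs(sentence, edge='<edge>'):
    # The input starts with the edge token; the target ends with it.
    inputs = [edge] + sentence
    targets = sentence + [edge]
    # Position n of the target is the word that follows the first n+1 input tokens.
    return [(inputs[:n + 1], targets[n]) for n in range(len(inputs))]

pairs = make_training_pairs(['the', 'dog', 'barked'])
for (prefix, next_word) in pairs:
    print(prefix, '->', next_word)
```

A sentence of three words gives four prefixes, the last of which is the whole sentence followed by the end of sentence token.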

Teacher forcing

If we always provide a correct prefix during training, then we are using a training method called "teacher forcing", where the neural network is only trained to deal with correct prefixes. This is the simplest method (and the method we will be using in this blog post) but it also introduces a bias to the neural network, as after training it might not always be exposed to correct prefixes. Let's say that the neural network is going to be used to generate sentences by, for example, picking the most probable word given the prefix and then adding the chosen word to the end of the prefix. We can repeatedly do this until we choose the end of sentence token, at which point we should have a complete sentence. The problem with teacher forcing is that if the neural network makes one mistake during the generation process and picks a nonsense word as the most probable next word, then the rest of the sentence will probably also be garbage as it was never trained on sentences with mistakes.

One way to deal with this is to include in the training sentences not only correct prefixes but also prefixes with some of the words replaced by words chosen by the still-in-training neural network, and to force it to still give a higher probability to the correct next word. This is called scheduled sampling. Another way to deal with this is to take the training prefixes and some generated prefixes (from the still-in-training neural net) and take their vector representation from the RNN. Generative adversarial training will then be used to make the RNN represent both groups of vectors similarly. This forces the RNN to be fault tolerant to prefixes with errors and to be able to represent them in a way that can lead to correct next words. This is called professor forcing.
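
To give a rough idea of the scheduled sampling approach, the sketch below builds the prefix that the network would see during training. It is only an illustration and is not used in the rest of this post: the function names, the "sample_prob" parameter and the stand-in predictor are all assumptions made for the example.

```python
import random

def scheduled_sampling_prefix(correct_prefix, predict_next, sample_prob, rng):
    # Build the prefix the network will see. With probability sample_prob a
    # position is filled with the model's own prediction instead of the
    # correct word. The training target stays the correct next word.
    seen_prefix = []
    for correct_word in correct_prefix:
        if rng.random() < sample_prob:
            seen_prefix.append(predict_next(seen_prefix))
        else:
            seen_prefix.append(correct_word)
    return seen_prefix

# Hypothetical stand-in for the still-in-training network: always predicts 'the'.
def toy_predict_next(prefix):
    return 'the'

rng = random.Random(0)
correct = ['the', 'dog', 'barked']
teacher_forced = scheduled_sampling_prefix(correct, toy_predict_next, 0.0, rng)
fully_sampled = scheduled_sampling_prefix(correct, toy_predict_next, 1.0, rng)
print(teacher_forced)
print(fully_sampled)
```

A sampling probability of 0 is the same as teacher forcing, and raising it exposes the network to more of its own mistakes.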

Full code listing

This is the full code that you can execute to get a Tensorflow neural language model:

from __future__ import absolute_import, division, print_function, unicode_literals
from builtins import ascii, bytes, chr, dict, filter, hex, input, int, map, next, oct, open, pow, range, round, str, super, zip

import tensorflow as tf
import numpy as np
import random
import timeit
import collections
import nltk

TRAINING_SESSION = True

rand = random.Random(0)
embed_size      = 256
state_size      = 256
max_epochs      = 100
minibatch_size  = 20
min_token_freq  = 3

run_start = timeit.default_timer()

print('Loading raw data...')

all_seqs = [ [ token.lower() for token in seq ] for seq in nltk.corpus.brown.sents() if 5 <= len(seq) <= 20 ]
rand.shuffle(all_seqs)
all_seqs = all_seqs[:20000]

trainingset_size = round(0.9*len(all_seqs))
validationset_size = len(all_seqs) - trainingset_size
train_seqs = all_seqs[:trainingset_size]
val_seqs = all_seqs[-validationset_size:]

print('Training set size:', trainingset_size)
print('Validation set size:', validationset_size)

all_tokens = (token for seq in train_seqs for token in seq)
token_freqs = collections.Counter(all_tokens)
vocab = sorted(token_freqs.keys(), key=lambda token:(-token_freqs[token], token))
while token_freqs[vocab[-1]] < min_token_freq:
    vocab.pop()
vocab_size = len(vocab) + 2 # + edge and unknown tokens

token_to_index = { token: i+2 for (i, token) in enumerate(vocab) }
index_to_token = { i+2: token for (i, token) in enumerate(vocab) }
edge_index = 0
unknown_index = 1

print('Vocabulary size:', vocab_size)

def parse(seqs):
    indexes = list()
    lens = list()
    for seq in seqs:
        indexes_ = [ token_to_index.get(token, unknown_index) for token in seq ]
        indexes.append(indexes_)
        lens.append(len(indexes_)+1) #add 1 due to edge token
        
    maxlen = max(lens)
    
    in_mat  = np.zeros((len(indexes), maxlen))
    out_mat = np.zeros((len(indexes), maxlen))
    for (row, indexes_) in enumerate(indexes):
        in_mat [row,:len(indexes_)+1] = [edge_index]+indexes_
        out_mat[row,:len(indexes_)+1] = indexes_+[edge_index]
    return (in_mat, out_mat, np.array(lens))
    
(train_seqs_in, train_seqs_out, train_seqs_len) = parse(train_seqs)
(val_seqs_in,   val_seqs_out,   val_seqs_len)   = parse(val_seqs)

print('Training set max length:', train_seqs_in.shape[1]-1)
print('Validation set max length:', val_seqs_in.shape[1]-1)

################################################################
print()
print('Training...')

#Full correct sequence of token indexes with start token but without end token.
seq_in = tf.placeholder(tf.int32, shape=[None, None], name='seq_in') #[seq, token]

#Length of sequences in seq_in.
seq_len = tf.placeholder(tf.int32, shape=[None], name='seq_len') #[seq]
tf.assert_equal(tf.shape(seq_in)[0], tf.shape(seq_len)[0])

#Full correct sequence of token indexes without start token but with end token.
seq_target = tf.placeholder(tf.int32, shape=[None, None], name='seq_target') #[seq, token]
tf.assert_equal(tf.shape(seq_in), tf.shape(seq_target))

batch_size = tf.shape(seq_in)[0] #Number of sequences to process at once.
num_steps = tf.shape(seq_in)[1] #Number of tokens in generated sequence.

#Mask of which positions in the matrix of sequences are actual labels as opposed to padding.
token_mask = tf.cast(tf.sequence_mask(seq_len, num_steps), tf.float32) #[seq, token]

with tf.variable_scope('prefix_encoder'):
    #Encode each sequence prefix into a vector.
    
    #Embedding matrix for token vocabulary.
    embeddings = tf.get_variable('embeddings', [ vocab_size, embed_size ], tf.float32, tf.contrib.layers.xavier_initializer()) #[vocabulary token, token feature]
        
    #3D tensor of tokens in sequences replaced with their corresponding embedding vector.
    embedded = tf.nn.embedding_lookup(embeddings, seq_in) #[seq, token, token feature]
    
    #Use an LSTM to encode the generated prefix.
    init_state = tf.contrib.rnn.LSTMStateTuple(c=tf.zeros([ batch_size, state_size ]), h=tf.zeros([ batch_size, state_size ]))
    cell = tf.contrib.rnn.BasicLSTMCell(state_size)
    prefix_vectors = tf.nn.dynamic_rnn(cell, embedded, sequence_length=seq_len, initial_state=init_state, scope='rnn')[0] #[seq, prefix, prefix feature]
    
with tf.variable_scope('softmax'):
    #Output a probability distribution over the token vocabulary (including the end token).
    
    W = tf.get_variable('W', [ state_size, vocab_size ], tf.float32, tf.contrib.layers.xavier_initializer())
    b = tf.get_variable('b', [ vocab_size ], tf.float32, tf.zeros_initializer())
    logits = tf.reshape(tf.matmul(tf.reshape(prefix_vectors, [ -1, state_size ]), W) + b, [ batch_size, num_steps, vocab_size ])
    predictions = tf.nn.softmax(logits) #[seq, prefix, token]

losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_target, logits=logits) * token_mask
total_loss = tf.reduce_sum(losses)
train_step = tf.train.AdamOptimizer().minimize(total_loss)
saver = tf.train.Saver()

sess = tf.Session()

if TRAINING_SESSION:
    sess.run(tf.global_variables_initializer())

    print('epoch', 'val loss', 'duration', sep='\t')

    epoch_start = timeit.default_timer()
    
    validation_loss = 0
    for i in range(len(val_seqs)//minibatch_size):
        minibatch_validation_loss = sess.run(total_loss, feed_dict={
                                                                seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
                                                                seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
                                                                seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
                                                            })
        validation_loss += minibatch_validation_loss
    
    print(0, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')
    last_validation_loss = validation_loss

    saver.save(sess, './model')

    trainingset_indexes = list(range(len(train_seqs)))
    for epoch in range(1, max_epochs+1):
        epoch_start = timeit.default_timer()
        
        rand.shuffle(trainingset_indexes)
        for i in range(len(trainingset_indexes)//minibatch_size):
            minibatch_indexes = trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]
            sess.run(train_step, feed_dict={
                                            seq_in:     train_seqs_in [minibatch_indexes],
                                            seq_len:    train_seqs_len[minibatch_indexes],
                                            seq_target: train_seqs_out[minibatch_indexes],
                                        })
            
        validation_loss = 0
        for i in range(len(val_seqs)//minibatch_size):
            minibatch_validation_loss = sess.run(total_loss, feed_dict={
                                                                    seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
                                                                    seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
                                                                    seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
                                                                })
            validation_loss += minibatch_validation_loss

        if validation_loss > last_validation_loss:
            break
        last_validation_loss = validation_loss
        
        saver.save(sess, './model')
        
        print(epoch, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')

    print(epoch, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')

################################################################
print()
print('Evaluating...')

saver.restore(sess, tf.train.latest_checkpoint('.'))

def seq_prob(seq):
    seq_indexes = [ token_to_index.get(token, unknown_index) for token in seq ]
    outputs = sess.run(predictions, feed_dict={
                                        seq_in:  [ [ edge_index ] + seq_indexes ],
                                        seq_len: [ 1+len(seq) ],
                                    })[0]
    probs = outputs[np.arange(len(outputs)), seq_indexes+[ edge_index ]]
    return np.prod(probs)

print('P(the dog barked.) =', seq_prob(['the', 'dog', 'barked', '.']))
print('P(the cat barked.) =', seq_prob(['the', 'cat', 'barked', '.']))
print()

def next_tokens(prefix):
    prefix_indexes = [ token_to_index.get(token, unknown_index) for token in prefix ]
    probs = sess.run(predictions, feed_dict={
                                        seq_in:  [ [ edge_index ] + prefix_indexes ],
                                        seq_len: [ 1+len(prefix) ],
                                    })[0][-1]
    token_probs = list(zip(probs, ['<end>', '<unk>']+vocab))
    return token_probs

print('the dog ...', sorted(next_tokens(['the', 'dog']), reverse=True)[:5])
print()

def greedy_gen():
    prefix = []
    for _ in range(100):
        probs = sorted(next_tokens(prefix), reverse=True)
        (_, next_token) = probs[0]
        if next_token == '<unk>':
            #Skip the unknown token by taking the second most probable token instead.
            (_, next_token) = probs[1]
        if next_token == '<end>':
            break
        prefix.append(next_token)
    return prefix

print('Greedy generation:', ' '.join(greedy_gen()))
print()

def random_gen():
    prefix = []
    for _ in range(100):
        probs = next_tokens(prefix)
        (unk_prob, _) = probs[unknown_index]
                
        r = rand.random() * (1.0 - unk_prob)
        total = 0.0
        for (prob, token) in probs:
            if token != '<unk>':
                total += prob
                if total >= r:
                    break
        if token == '<end>':
            break
        else:
            prefix.append(token)
    return prefix

print('Random generation:', ' '.join(random_gen()))
print()

Code explanation

This section explains snippets of the code.

Preprocessing

The first thing we need to do is create the training and validation sets. We will use the Brown corpus from NLTK as data. Since the purpose of this tutorial is to quickly train a neural language model on a normal computer, we will work with a subset of the corpus so that training will be manageable. We will only take sentences that are between 5 and 20 tokens long and only use a random sample of 20000 sentences from this pool. From this we will take a random 10% to be used for the validation set and the rest will be used for the training set.

all_seqs = [ [ token.lower() for token in seq ] for seq in nltk.corpus.brown.sents() if 5 <= len(seq) <= 20 ]
rand.shuffle(all_seqs)
all_seqs = all_seqs[:20000]

trainingset_size = round(0.9*len(all_seqs))
validationset_size = len(all_seqs) - trainingset_size
train_seqs = all_seqs[:trainingset_size]
val_seqs = all_seqs[-validationset_size:]

Next we need to get the vocabulary from the training set. This will consist of all the words that occur frequently in the training set sentences, with the rare words being replaced by an "unknown" token. This will allow the neural network to be able to work with out-of-vocabulary words as they will be represented as "<unk>" and the network will have seen this token in the training sentences. Each vocabulary word will be represented by an index, with "0" representing the beginning and end token, "1" representing the unknown token, "2" representing the most frequent vocabulary word, "3" the second most frequent word, and so on.

all_tokens = (token for seq in train_seqs for token in seq)
token_freqs = collections.Counter(all_tokens)
vocab = sorted(token_freqs.keys(), key=lambda token:(-token_freqs[token], token))
while token_freqs[vocab[-1]] < min_token_freq:
    vocab.pop()
vocab_size = len(vocab) + 2 # + edge and unknown tokens

token_to_index = { token: i+2 for (i, token) in enumerate(vocab) }
index_to_token = { i+2: token for (i, token) in enumerate(vocab) }
edge_index = 0
unknown_index = 1
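
To see what this produces, here is the same procedure run on a tiny made-up corpus with a minimum frequency of 2. The "toy_" names are only for this example.

```python
import collections

toy_seqs = [['the', 'dog', 'barked'], ['the', 'cat', 'sat'], ['the', 'dog', 'sat']]
toy_min_freq = 2

toy_freqs = collections.Counter(token for seq in toy_seqs for token in seq)
# Sort by descending frequency, breaking ties alphabetically.
toy_vocab = sorted(toy_freqs.keys(), key=lambda token: (-toy_freqs[token], token))
# Drop rare tokens from the end of the sorted list.
while toy_freqs[toy_vocab[-1]] < toy_min_freq:
    toy_vocab.pop()

toy_token_to_index = {token: i + 2 for (i, token) in enumerate(toy_vocab)}
toy_indexes = [toy_token_to_index.get(token, 1) for token in ['the', 'cat', 'sat']]
print(toy_vocab)
print(toy_indexes)
```

The word "cat" only occurs once, so it is left out of the vocabulary and gets mapped to the unknown index 1.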

Finally we need to turn all the sentences in the training and validation set into a matrix of indexes where each row represents a sentence. Since different sentences have different lengths, we will make the matrix as wide as the longest sentence and use 0 to pad each sentence into uniform length (pads are added to the end of the sentences). To know where a sentence ends we will also keep track of the sentence lengths. For training we will need to use an input matrix and a target matrix. The input matrix contains sentences that start with the start-of-sentence token in every row whilst the target matrix contains sentences that end with the end-of-sentence token in every row. The former is used to pass as input to the neural net during training whilst the latter is used to tell which is the correct output of each sentence prefix.

def parse(seqs):
    indexes = list()
    lens = list()
    for seq in seqs:
        indexes_ = [ token_to_index.get(token, unknown_index) for token in seq ]
        indexes.append(indexes_)
        lens.append(len(indexes_)+1) #add 1 due to edge token
        
    maxlen = max(lens)
    
    in_mat  = np.zeros((len(indexes), maxlen))
    out_mat = np.zeros((len(indexes), maxlen))
    for (row, indexes_) in enumerate(indexes):
        in_mat [row,:len(indexes_)+1] = [edge_index]+indexes_
        out_mat[row,:len(indexes_)+1] = indexes_+[edge_index]
    return (in_mat, out_mat, np.array(lens))
    
(train_seqs_in, train_seqs_out, train_seqs_len) = parse(train_seqs)
(val_seqs_in,   val_seqs_out,   val_seqs_len)   = parse(val_seqs)
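
The padding and shifting can be seen on a small example. This sketch uses plain lists instead of NumPy matrices and starts from sentences that are already lists of indexes; "toy_parse" is a simplified stand-in for the function above.

```python
def toy_parse(index_seqs, edge_index=0):
    lens = [len(seq) + 1 for seq in index_seqs]  # +1 for the edge token
    maxlen = max(lens)
    in_rows = []
    out_rows = []
    for seq in index_seqs:
        pad = [0] * (maxlen - len(seq) - 1)
        in_rows.append([edge_index] + seq + pad)   # starts with the edge token
        out_rows.append(seq + [edge_index] + pad)  # ends with the edge token
    return (in_rows, out_rows, lens)

(toy_in, toy_out, toy_lens) = toy_parse([[2, 3, 4], [5, 6]])
print(toy_in)
print(toy_out)
print(toy_lens)
```

Note that the pad value and the edge token are both 0, which is why the lengths are needed to tell them apart.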

Model definition

The Tensorflow neural network model we shall implement will accept a batch of sentences at once. This means that the input will be a matrix of integers where each row is a sentence and each integer is a word index. Since both the number of sentences and the sentence length are variable we will use "None" as a dimension size. We will then get the size using the "tf.shape" function. The sentences on their own are not enough as an input as we also need to provide the sentence lengths as a vector. The length of this vector needs to be as long as the number of rows in the sequences (one for each sentence). These lengths will be used to generate a mask matrix of 0s and 1s where a "1" indicates the presence of a token in the sequence matrix whilst a "0" indicates the presence of a pad token. This is generated by the "tf.sequence_mask" function.

#Full correct sequence of token indexes with start token but without end token.
seq_in = tf.placeholder(tf.int32, shape=[None, None], name='seq_in') #[seq, token]

#Length of sequences in seq_in.
seq_len = tf.placeholder(tf.int32, shape=[None], name='seq_len') #[seq]
tf.assert_equal(tf.shape(seq_in)[0], tf.shape(seq_len)[0])

#Full correct sequence of token indexes without start token but with end token.
seq_target = tf.placeholder(tf.int32, shape=[None, None], name='seq_target') #[seq, token]
tf.assert_equal(tf.shape(seq_in), tf.shape(seq_target))

batch_size = tf.shape(seq_in)[0] #Number of sequences to process at once.
num_steps = tf.shape(seq_in)[1] #Number of tokens in generated sequence.

#Mask of which positions in the matrix of sequences are actual labels as opposed to padding.
token_mask = tf.cast(tf.sequence_mask(seq_len, num_steps), tf.float32) #[seq, token]
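
As an illustration of what the mask looks like, this plain Python sketch computes the same kind of matrix for two sentences of lengths 4 and 3. It is a re-implementation for explanation purposes only and not the Tensorflow function itself.

```python
def sequence_mask(lengths, num_steps):
    # 1.0 where the position holds a real token, 0.0 where it holds padding.
    return [[1.0 if step < length else 0.0 for step in range(num_steps)]
            for length in lengths]

toy_mask = sequence_mask([4, 3], 4)
print(toy_mask)
```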

Now that the inputs are handled we need to process the prefixes of the sequences into prefix vectors using an LSTM. We first convert the token indexes into token vectors. This is done using an embedding matrix where each row is the vector of the token with that row's index: the first row is for the edge token, the second for the unknown token, the third for the most frequent word in the vocabulary, and so on. This is then passed to an LSTM that is initialized with zero-vectors for both the cell state and the hidden state. We pass in the sequence lengths to keep the RNN from interpreting pad values.

with tf.variable_scope('prefix_encoder'):
    #Encode each sequence prefix into a vector.
    
    #Embedding matrix for token vocabulary.
    embeddings = tf.get_variable('embeddings', [ vocab_size, embed_size ], tf.float32, tf.contrib.layers.xavier_initializer()) #[vocabulary token, token feature]
        
    #3D tensor of tokens in sequences replaced with their corresponding embedding vector.
    embedded = tf.nn.embedding_lookup(embeddings, seq_in) #[seq, token, token feature]
    
    #Use an LSTM to encode the generated prefix.
    init_state = tf.contrib.rnn.LSTMStateTuple(c=tf.zeros([ batch_size, state_size ]), h=tf.zeros([ batch_size, state_size ]))
    cell = tf.contrib.rnn.BasicLSTMCell(state_size)
    prefix_vectors = tf.nn.dynamic_rnn(cell, embedded, sequence_length=seq_len, initial_state=init_state, scope='rnn')[0] #[seq, prefix, prefix feature]

Next we need to take the prefix vectors and derive probabilities for the next word in every prefix. Since the prefix vectors are a 3D tensor and the weights matrix is a 2D tensor we have to first reshape the prefix vectors into a 2D tensor before multiplying them together. Following this we will then reshape the resulting 2D tensor back into a 3D tensor and apply softmax to it.

with tf.variable_scope('softmax'):
    #Output a probability distribution over the token vocabulary (including the end token).
    
    W = tf.get_variable('W', [ state_size, vocab_size ], tf.float32, tf.contrib.layers.xavier_initializer())
    b = tf.get_variable('b', [ vocab_size ], tf.float32, tf.zeros_initializer())
    logits = tf.reshape(tf.matmul(tf.reshape(prefix_vectors, [ -1, state_size ]), W) + b, [ batch_size, num_steps, vocab_size ])
    predictions = tf.nn.softmax(logits) #[seq, prefix, token]

Training

Now comes the part that has to do with training the model. We need to add the loss function, which is masked to ignore padding tokens. We will take the sum of the crossentropy over all the non-padding tokens and apply the Adam optimizer to it. Crossentropy is the negative logarithm of the probability that the softmax gives to the correct next word, so it gets closer to zero as that probability gets closer to 1.0. Minimizing it therefore maximizes the correct probability, which is done by using Adam, an optimization technique that works using gradient descent but with the learning rate being adapted for every weight. We would also like to be able to save our model weights during training in order to reuse them later. Finally we create a Tensorflow session variable and use it to initialize all the model weights.

losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_target, logits=logits) * token_mask
total_loss = tf.reduce_sum(losses)
train_step = tf.train.AdamOptimizer().minimize(total_loss)
saver = tf.train.Saver()

sess = tf.Session()
    
sess.run(tf.global_variables_initializer())
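
To show what the masked loss computes, here is a plain Python sketch of the same calculation on a single made-up sentence with a two-word vocabulary. It works on probabilities rather than logits, and the function name is only for this example.

```python
import math

def masked_sum_crossentropy(probs, targets, mask):
    # probs[s][t] is the predicted distribution after prefix t of sentence s.
    total = 0.0
    for (sent_probs, sent_targets, sent_mask) in zip(probs, targets, mask):
        for (dist, target, keep) in zip(sent_probs, sent_targets, sent_mask):
            # Negative log probability of the correct next word,
            # zeroed out wherever the position is padding.
            total += -math.log(dist[target]) * keep
    return total

toy_probs = [[[0.5, 0.5], [0.25, 0.75]]]  # one sentence, two positions
toy_targets = [[1, 0]]
toy_mask = [[1.0, 0.0]]                   # the second position is padding
toy_loss = masked_sum_crossentropy(toy_probs, toy_targets, toy_mask)
print(toy_loss)
```

Only the first position contributes to the loss, so the result is the negative logarithm of 0.5.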

The next thing we should do is measure how well the randomly initialised neural net performs in terms of the sum crossentropy. In order to avoid running out of memory, instead of putting all the validation set data as input to the neural net we will break it up into minibatches of a fixed size and find the total loss of all the minibatches together. This final loss value will be placed in a variable called "last_validation_loss" in order to be able to track how the loss is progressing as training goes on. The last line is to save the weights of the neural network in the same folder as the code and give the files (there will be several files with different extensions) the file name "model".

validation_loss = 0
for i in range(len(val_seqs)//minibatch_size):
    minibatch_validation_loss = sess.run(total_loss, feed_dict={
                                                            seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
                                                            seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
                                                            seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
                                                        })
    validation_loss += minibatch_validation_loss

last_validation_loss = validation_loss

saver.save(sess, './model')

Next we'll do the same thing but on the training set and we'll run the "train_step" operation instead of the "total_loss" operation in order to actually optimize the weights into a smaller "total_loss" value. It is more beneficial to take randomly sampled minibatches instead of just breaking the training set into deterministically chosen groups, so we use an array of indexes called "trainingset_indexes" to determine which training pairs will make it to the next minibatch. We randomly shuffle these indexes and then break them into fixed size groups. The indexes in each group are used to choose the training pairs that make up the next minibatch. Following this we will again calculate the new loss value on the validation set to see how we're progressing. If the new loss is worse than the previous loss then we stop the training. Otherwise we save the model weights and continue training. This is called early stopping. There is a hard limit set to the number of epochs to run in order to avoid training for too long.

trainingset_indexes = list(range(len(train_seqs)))
for epoch in range(1, max_epochs+1):
    rand.shuffle(trainingset_indexes)
    for i in range(len(trainingset_indexes)//minibatch_size):
        minibatch_indexes = trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]
        sess.run(train_step, feed_dict={
                                        seq_in:     train_seqs_in [minibatch_indexes],
                                        seq_len:    train_seqs_len[minibatch_indexes],
                                        seq_target: train_seqs_out[minibatch_indexes],
                                    })
        
    validation_loss = 0
    for i in range(len(val_seqs)//minibatch_size):
        minibatch_validation_loss = sess.run(total_loss, feed_dict={
                                                                seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
                                                                seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
                                                                seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
                                                            })
        validation_loss += minibatch_validation_loss

    if validation_loss > last_validation_loss:
        break
    last_validation_loss = validation_loss
    
    saver.save(sess, './model')
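
The early stopping logic on its own can be sketched without any neural network. In this illustration the function "val_loss_per_epoch" stands in for "train for one epoch and then measure the validation loss", and the loss values are made up.

```python
def train_with_early_stopping(initial_val_loss, val_loss_per_epoch, max_epochs):
    last_val_loss = initial_val_loss
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        val_loss = val_loss_per_epoch(epoch)
        if val_loss > last_val_loss:
            break  # validation loss got worse: keep the previously saved weights
        last_val_loss = val_loss
        best_epoch = epoch  # this is where the weights would be saved
    return (best_epoch, last_val_loss)

# Hypothetical validation losses after each epoch.
toy_losses = {1: 90.0, 2: 80.0, 3: 75.0, 4: 78.0, 5: 70.0}
(best_epoch, best_loss) = train_with_early_stopping(100.0, lambda epoch: toy_losses[epoch], 5)
print(best_epoch, best_loss)
```

Training stops as soon as the loss goes up once, so in this example the fifth epoch is never reached even though it would have given a lower loss. This is the price of using the simplest form of early stopping.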

Application

We can now use the trained neural network. We first restore the last saved model weights which are the ones that gave the best validation loss and then we will define two functions: one for getting the probability of a whole sequence and the other for getting the next token after a prefix. "seq_prob" works by getting every prefix's softmax output, getting the probability of each token in the sequence after each prefix and then multiplying them together. "next_tokens" works by passing a prefix to the neural network and only getting the softmax output of the last (longest) prefix. The probabilities and corresponding tokens are returned.

saver.restore(sess, tf.train.latest_checkpoint('.'))

def seq_prob(seq):
    seq_indexes = [ token_to_index.get(token, unknown_index) for token in seq ]
    outputs = sess.run(predictions, feed_dict={
                                        seq_in:  [ [ edge_index ] + seq_indexes ],
                                        seq_len: [ 1+len(seq) ],
                                    })[0]
    probs = outputs[np.arange(len(outputs)), seq_indexes+[ edge_index ]]
    return np.prod(probs)

print('P(the dog barked.) =', seq_prob(['the', 'dog', 'barked', '.']))
print('P(the cat barked.) =', seq_prob(['the', 'cat', 'barked', '.']))
print()

def next_tokens(prefix):
    prefix_indexes = [ token_to_index.get(token, unknown_index) for token in prefix ]
    probs = sess.run(predictions, feed_dict={
                                        seq_in:  [ [ edge_index ] + prefix_indexes ],
                                        seq_len: [ 1+len(prefix) ],
                                    })[0][-1]
    token_probs = list(zip(probs, ['<end>', '<unk>']+vocab))
    return token_probs

print('the dog ...', sorted(next_tokens(['the', 'dog']), reverse=True)[:5])
print()

These are the outputs I got:

P(the dog barked.) = 1.71368e-08
P(the cat barked.) = 6.16375e-10

the dog ... [(0.097657956, 'was'), (0.089791521, '<unk>'), (0.058101207, 'is'), (0.055007596, 'had'), (0.02786131, 'could')]
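
Sentence probabilities get very small very quickly, and for long sequences the product of many small numbers can underflow to zero. A common workaround, shown here as a sketch that is not part of the code above, is to add up log probabilities instead of multiplying probabilities. The token probabilities are made up.

```python
import math

def sequence_log_prob(token_probs):
    # The sum of logs equals the log of the product, but does not underflow.
    return sum(math.log(p) for p in token_probs)

# Hypothetical sequence of 200 tokens, each predicted with probability 0.01.
toy_token_probs = [0.01] * 200

toy_product = 1.0
for p in toy_token_probs:
    toy_product *= p

toy_log_prob = sequence_log_prob(toy_token_probs)
print(toy_product)
print(toy_log_prob)
```

The plain product comes out as exactly zero here, whilst the log probability is still a usable number for comparing sentences.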

We can extend "next_tokens" to generate whole sentences using one of two ways: generating the most probable sentence or generating a randomly sampled sentence. For the first we are going to use greedy search which chooses the most probable word given a prefix and adds it to the prefix. This will not give the most probable sentence but it should be close (use beam search for a better selection method). For the second function we want to choose words at random but based on their probability such that rare words are rarely chosen. This is called sampling sentences (the probability of sampling a particular sentence is equal to the probability of the sentence). We did this using roulette selection. For both functions we left out the unknown token during generation and gave a hard maximum word limit of 100 words.

def greedy_gen():
    prefix = []
    for _ in range(100):
        probs = sorted(next_tokens(prefix), reverse=True)
        (_, next_token) = probs[0]
        if next_token == '<unk>':
            #Skip the unknown token by taking the second most probable token instead.
            (_, next_token) = probs[1]
        if next_token == '<end>':
            break
        prefix.append(next_token)
    return prefix

print('Greedy generation:', ' '.join(greedy_gen()))
print()

def random_gen():
    prefix = []
    for _ in range(100):
        probs = next_tokens(prefix)
        (unk_prob, _) = probs[unknown_index]
                
        r = rand.random() * (1.0 - unk_prob)
        total = 0.0
        for (prob, token) in probs:
            if token != '<unk>':
                total += prob
                if total >= r:
                    break
        if token == '<end>':
            break
        else:
            prefix.append(token)
    return prefix

print('Random generation:', ' '.join(random_gen()))
print()

These are the outputs I got:

Greedy generation: `` i don't know '' .

Random generation: miss the road place , title comes to the seeds of others to many and details under the dominant home
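
Beam search was mentioned above as a better selection method than greedy search. The following is a minimal sketch of the idea, written against a made-up toy model rather than the trained network: "next_token_probs" is assumed to be a function that takes a prefix and returns a dictionary of token probabilities, which is not the same format that "next_tokens" returns.

```python
import math

def beam_search(next_token_probs, beam_width, max_len, end_token='<end>'):
    # Each beam entry is (log probability, prefix, finished flag).
    beam = [(0.0, [], False)]
    for _ in range(max_len):
        candidates = []
        for (log_prob, prefix, finished) in beam:
            if finished:
                candidates.append((log_prob, prefix, True))
                continue
            for (token, prob) in next_token_probs(prefix).items():
                if prob <= 0.0:
                    continue
                if token == end_token:
                    candidates.append((log_prob + math.log(prob), prefix, True))
                else:
                    candidates.append((log_prob + math.log(prob), prefix + [token], False))
        # Keep only the beam_width most probable candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
        if all(finished for (_, _, finished) in beam):
            break
    return beam[0]

# Hypothetical toy model in which the greedy first choice leads to a worse sentence.
def toy_next_token_probs(prefix):
    if prefix == []:
        return {'a': 0.6, 'b': 0.4}
    if prefix == ['a']:
        return {'x': 0.55, 'y': 0.45}
    if prefix == ['b']:
        return {'z': 1.0}
    return {'<end>': 1.0}

greedy_result = beam_search(toy_next_token_probs, 1, 10)
beam_result = beam_search(toy_next_token_probs, 2, 10)
print(greedy_result[1], math.exp(greedy_result[0]))
print(beam_result[1], math.exp(beam_result[0]))
```

With a beam width of 1 this is the same as greedy search and commits to "a" straight away, whilst a beam width of 2 keeps "b" around for long enough to find the more probable sentence "b z". If the length limit is reached first, the returned prefix may be an unfinished sentence.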