Thursday, December 27, 2018

Language models and the unknown token: Why perplexity gets worse as the corpus size increases

Unless you're making use of a character-based language model, you're going to have a finite vocabulary of words that you can handle, so you need a way to deal with out-of-vocabulary words. One common way to handle such words is to replace them all with the unknown token, a pseudo word that stands in for every out-of-vocabulary word. The unknown token has several advantages: infrequent words in the training corpus will probably either not be learned by the neural network or not be used when generating sentences, so ignoring them saves a lot of model size, as the embedding layer and softmax take up a lot of parameters. Plus, all the infrequent words put together occur very frequently, so the language model will definitely learn to use the unknown token. It also makes it possible to handle new words at test time (words not found even once in the training corpus), as they will simply be replaced by the unknown token.

The problem with this approach is what happens when measuring the probability or perplexity of a sentence based on the probabilities of individual words. If you're comparing language models to see which one makes the best predictions, you usually run them all on the same corpus to see how well they predict its words. The higher the probabilities assigned to those sentences, the better the language model; this is usually measured using perplexity. But see what happens when you vary the vocabulary size: you will find that smaller vocabularies lead to better perplexities, even though this makes no sense.

It turns out that if you just multiply all the probabilities of individual words as-is, including that of the unknown token, then your probability will be sensitive to the vocabulary size. Let's say that you have a vocabulary size of 0, that is, you consider every word to be out-of-vocabulary, so all of them get replaced by the unknown token. This means that during training, the language model will learn that after every unknown token there is (probably) another unknown token, unless it's the end of the sentence. This will make the language model give very high probabilities to the unknown token, and high word probabilities mean high sentence probabilities, which mean good perplexities.

Now if we add another word to the vocabulary, we'll be introducing some uncertainty into the language model, as it now has to decide between using the unknown token or the known word. Even a perfect language model cannot correctly assign 100% of the probability to one or the other, since the same prefix of words can be followed by either of the two. This means that the probabilities will be split between the two words, leading to an overall decrease in probabilities and hence a worse perplexity. Adding more words to the vocabulary makes this even worse, which means that language models with smaller vocabularies have a huge unfair advantage over language models that actually do their job and correctly predict the right word.

We can't do away with the unknown token, but we can strip away its power. Assuming that all the language models are being evaluated on the same corpus, different vocabularies will have different words being turned into the unknown token. Let's say that your language model has 1000 different words in its vocabulary, but the corpus you're evaluating it on contains 500 different words that are out-of-vocabulary. In reality your language model is predicting one of 1500 different words; it's just that 500 of those words are treated as a single word with a single probability. But really there should be 500 separate probabilities for those out-of-vocabulary words, not just one. If we avoid merging all those probabilities into one, then all the language models will have a fair comparison, as they will all have the same effective vocabulary and the same amount of uncertainty about which word should come next. The question is how to distribute that single unknown token probability among the 500 out-of-vocabulary words. The simplest solution is to assume a uniform distribution and give each word the same slice of probability from the whole. So if the unknown token has a probability of $p$, then each out-of-vocabulary word gets a probability of $\frac{p}{500}$.

Now every time you encounter the unknown token in the evaluation corpus you know that the token is being used in place of one of those 500 words. But you don't know which one it is. Not a problem, just divide the probability by 500 and it's as if all words in the corpus are in the vocabulary. Do this to every unknown token probability and now you have a fair measure of perplexity. Let's see an example.

Let's say that we want to find the probability of the following sentence:
the loud dog barked at the loud man

and let's say that the language model we're using to do that has the following vocabulary:
the at dog man

this means that the sentence is now changed to:
the UNK dog UNK at the UNK man

Now the naive way to get the probability of the sentence is as follows:

$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})
$$

But now, with the new way, we'll divide each of the unknown token's probabilities by 2, the number of different out-of-vocabulary words ('loud' and 'barked'):

$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the})/2 \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog})/2 \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the})/2 \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})
$$

Of course we can leave the re-weighting till the end by dividing the first equation by the number of different out-of-vocabulary words raised to the power of the number of unknown tokens, like this:

$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})/ 2^3
$$

Now the sentence probability goes up as the vocabulary size increases!
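To make this concrete, here is a minimal sketch of the adjustment in Python (my own illustration, not code from any library); it assumes you already have the per-token probabilities from your language model and that out-of-vocabulary tokens are marked as 'UNK':

import math

def adjusted_log_prob(words, word_probs, num_oov_types):
    # words: the tokens of the sentence, with out-of-vocabulary tokens replaced by 'UNK'
    # word_probs: p(w_t | w_1 .. w_{t-1}) for each token, as given by the language model
    total = 0.0
    for (word, p) in zip(words, word_probs):
        if word == 'UNK':
            p = p / num_oov_types  # spread the UNK probability uniformly over the OOV words
        total += math.log(p)
    return total

# The sentence from above with made-up probabilities and 2 out-of-vocabulary word types.
words = ['the', 'UNK', 'dog', 'UNK', 'at', 'the', 'UNK', 'man']
probs = [0.2, 0.4, 0.1, 0.4, 0.2, 0.3, 0.4, 0.1]
log_p = adjusted_log_prob(words, probs, num_oov_types=2)
print(math.exp(-log_p / len(words)))  # perplexity of the sentence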

Thursday, November 29, 2018

Explainable AI (XAI): Sensitivity Analysis

Imagine you have a neural network that can classify images well. At test time, would it be enough to just know what the class of the image is? Or would you also want to know why the image was classified the way it was? This is one of the goals of explainable AI and in this post we'll see what sensitivity analysis is.

Sensitivity analysis is a way to measure the importance of different parts of an input to one part of an output. In other words, you want to know which pixels were most important to give a high probability for a particular class. The way sensitivity analysis measures this is by measuring how sensitive the output is to each pixel, that is, which pixels, when changed, will change the output the most. Most pixels should leave the output unchanged when they themselves are changed, but some pixels would be critical to the particular output that the neural network has given. To measure this, we simply find the following:

$\left|\frac{\partial f(x)_i}{\partial x_j}\right|$

that is, the sensitivity of output $i$ to input $j$ is the absolute value of the gradient of the output with respect to the input. In Tensorflow we can compute this as follows:

sensitivity = tf.abs(tf.gradients([outputs[0, output_index]], [images])[0][0])

where 'output_index' is a placeholder that is a scalar of type tf.int64, 'outputs' is a softmax giving a probability for each class (the first index is the batch index and the second is the class), and 'images' holds the pixels of the images (the first index is the batch index and the second is the image as a flat vector). This code also assumes that only the first image in the batch is to be analysed. This is because tf.gradients differentiates a single scalar (if given several outputs it sums them together), so we can only find the sensitivity of a single output of a single image at a time.
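For reference, here is a minimal end-to-end sketch (assuming Tensorflow 1.x graph mode; the little network here is untrained and just stands in for whatever model you actually use):

import numpy as np
import tensorflow as tf

# A small fully connected network over flattened 28 by 28 images.
images = tf.placeholder(tf.float32, [None, 784], name='images')
hidden = tf.layers.dense(images, 128, activation=tf.tanh)
outputs = tf.nn.softmax(tf.layers.dense(hidden, 10))

output_index = tf.placeholder(tf.int64, [], name='output_index')

# Sensitivity of one class probability of the first image to every input pixel.
sensitivity = tf.abs(tf.gradients([outputs[0, output_index]], [images])[0][0])

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    image = np.random.rand(1, 784).astype(np.float32)  # stand-in for a real MNIST image
    heatmap = session.run(sensitivity, {images: image, output_index: 3})
    print(heatmap.reshape(28, 28))  # reshape to 28 by 28 to display as a heat map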

Here are some examples I got when I tried it on a simple fully connected two layer neural network trained to classify MNIST handwritten digits.



These heat maps show which pixels have the highest sensitivity score for the digit each image was classified as. We can see how the empty space in the middle is important for classifying the zero. The four and the nine could easily be confused for each other were it not for the little bit in the top left corner. The seven could be confused for a nine if we completed the top left loop, and the five could be confused with a six or an eight if we drew a little diagonal line.

Wednesday, October 24, 2018

Mentally estimating square roots

What's $\sqrt{10}$? I'd reckon it's about $3 + \frac{10-9}{2 \times 3} = 3.166$. The actual answer is 3.162. What about $\sqrt{18}$? Should be about $4 + \frac{18-16}{2 \times 4} = 4.250$. The actual answer is 4.242. I found out about this calculation from this video but there was no explanation for why it works. Here's an explanation.

So the method here is as follows.
  1. Let the number you want to find the square root of be $a^2$.
  2. Let the largest square number which is less than or equal to $a^2$ be $b^2$, such that $b^2 \leq a^2 < (b+1)^2$. For example, if $a^2$ is 10 then $b^2$ is 9; if $a^2$ is 18 then $b^2$ is 16.
  3. The square root of $a^2$ is approximately $b + \frac{a^2 - b^2}{2 b}$.

This method is easy to carry out mentally but why does it work? The trick here is that the graph of the square root function grows so slowly that we can approximate the curve between two adjacent square numbers as a line.



We can use the line to approximate the square root of any number between two square numbers. The first thing we need to know is the gradient of the line. The vertical distance between two adjacent square numbers on the square root curve is 1, since the two square numbers are the squares of two consecutive numbers. The horizontal distance changes and becomes larger as the adjacent square numbers become larger but we can calculate it as follows:

$$(b+1)^2 - b^2 = b^2 + 2b + 1 - b^2 = 2b + 1$$

So the horizontal distance is twice the square root of the smaller square number plus one. Therefore the gradient of the line is $\frac{1}{2b+1}$. Once we know by how much the line grows vertically for every horizontal unit, we can then determine how much higher than $b$ the point on the line will be at $a^2$ by multiplying the gradient by $a^2-b^2$, as shown below:



Since the difference in height is less than 1, it is going to be the part of the square root that comes after the decimal point, with the whole number part being $b$.

It might be hard to mentally divide by an odd number in $\frac{a^2-b^2}{2b+1}$ so we further approximate it as $\frac{a^2-b^2}{2b}$ instead. And that's why this method works.
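Here is a tiny Python sketch of the method, just to check it numerically (not part of the original explanation):

import math

def approx_sqrt(n):
    b = 1
    while (b + 1) * (b + 1) <= n:  # find the largest b such that b*b <= n
        b += 1
    return b + (n - b * b) / (2 * b)

for n in [10, 18, 40, 99]:
    print(n, approx_sqrt(n), math.sqrt(n))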

Saturday, September 29, 2018

Comparing numpy scalars directly is time consuming, use .tolist() before a comparison

This is something that I found out about recently when going through the elements of a numpy array in order to do some checks on each number. Turns out you shouldn't just do this

for x in nparr:
    if x == 0:
        pass  # something something

as that uses a lot more time than doing this

for x in nparr.tolist():
    if x == 0:
        pass  # something something

This is because a for loop iterating over a numpy array does not yield a sequence of plain Python numbers but a sequence of numpy scalars, so every comparison is a numpy comparison between a numpy scalar and a constant, which carries a lot of overhead. Converting the array into a list before the for loop yields a sequence of plain Python numbers instead.

Here is some profiling I've done using cProfile to check different ways to do an 'if' on a numpy array element:

import cProfile
import numpy as np

runs = 1000000

print('Comparing numpy to numpy')
x = np.array(1.0, np.float32)
y = np.array(1.0, np.float32)
cProfile.run('''
for _ in range(runs):
    if x == y:
        pass
''')
print()

print('Comparing numpy to constant')
x = np.array(1.0, np.float32)
cProfile.run('''
for _ in range(runs):
    if x == 1.0:
        pass
''')
print()

print('Comparing constant to constant')
x = 1.0
cProfile.run('''
for _ in range(runs):
    if x == 1.0:
        pass
''')
print()

print('Comparing numpy.tolist() to constant')
x = np.array(1.0, np.float32)
cProfile.run('''
for _ in range(runs):
    if x.tolist() == 1.0:
        pass
''')
print()

print('Comparing numpy to numpy.array(constant)')
x = np.array(1.0, np.float32)
cProfile.run('''
for _ in range(runs):
    if x == np.array(1.0, np.float32):
        pass
''')
print()

print('Comparing numpy.tolist() to numpy.tolist()')
x = np.array(1.0, np.float32)
y = np.array(1.0, np.float32)
cProfile.run('''
for _ in range(runs):
    if x.tolist() == y.tolist():
        pass
''')
print()

Here are the results in order of speed:

Comparing constant to constant: 0.088 seconds
Comparing numpy.tolist() to constant: 0.288 seconds
Comparing numpy.tolist() to numpy.tolist(): 0.508 seconds
Comparing numpy to numpy: 0.684 seconds
Comparing numpy to constant: 1.192 seconds
Comparing numpy to numpy.array(constant): 1.203 seconds

It turns out that it is always faster to first convert your numpy scalars into plain Python numbers via .tolist() than to do anything with them as numpy scalars.

Thursday, August 30, 2018

The Maclaurin series and Taylor series (approximating complicated functions with simple polynomials)

The Maclaurin series

Imagine you had a function $f(x)$ that you knew was a polynomial, but whose details were unknown and you could only apply operations to the function without being able to read it. How could you find the coefficients of this polynomial? We know that for coefficients $a_i$:

$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + ...$

If we find $f(0)$ then we can find $a_0$.

$f(0) = a_0 + a_1 0 + a_2 0^2 + a_3 0^3 + a_4 0^4 + ... = a_0 + 0 + 0 + ... = a_0$

That was easy, but how can we find $a_1$? We need an operation that gets rid of $a_0$ and also the $x$ in the term $a_1 x$. That operation turns out to be differentiation with respect to $x$:

$f'(x) = a_1 + 2 a_2 x + 3 a_3 x^2 + 4 a_4 x^3 + ...$

Great! Now we can find $a_1$ by replacing $x$ with 0:

$f'(0) = a_1 + 2 a_2 0 + 3 a_3 0^2 + 4 a_4 0^3 + ... = a_1$

We can find $a_2$ by repeating these two steps:

$f''(x) = 2 a_2 + 2 \cdot 3 a_3 x + 3 \cdot 4 a_4 x^2 + ...$
$f''(0) = 2 a_2 + 2 \cdot 3 a_3 0 + 3 \cdot 4 a_4 0^2 + ... = 2 a_2$

What we found is twice $a_2$, which means that we need to divide by 2 to get $a_2$. Differentiation multiplies each coefficient by a constant, and these constants get bigger the more times we differentiate. You might notice the pattern: $a_0$ was multiplied by 1, $a_1$ was also multiplied by 1, $a_2$ was multiplied by 2, $a_3$ will be multiplied by 6, $a_4$ by 24, and so on. These are factorials, products of consecutive whole numbers ($3! = 1 \times 2 \times 3 = 6$). This means that after $i$ rounds of differentiation and substitution by zero, we need to divide by $i!$.

In general we can find the $i$th coefficient in an unknown polynomial function by doing the following:

$a_i = \frac{f^{(i)}(0)}{i!}$

That's great. Now to test it. Let's see if a complex function is actually a polynomial in disguise. Take something simple such as $f(x) = e^x$. This doesn't look like a polynomial, but it may be represented by a polynomial with an infinite number of terms. Let's find what are the coefficients of the hidden polynomial in $e^x$.

$f(x) = e^x = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + ...$
$a_0 = \frac{f(0)}{0!} = \frac{e^0}{1} = 1$

OK, so $a_0$ is 1. Let's find the rest of the coefficients.

$a_1 = \frac{f'(0)}{1!} = \frac{e^0}{1} = 1$
$a_2 = \frac{f''(0)}{2!} = \frac{e^0}{2} = \frac{1}{2}$
$a_3 = \frac{f'''(0)}{3!} = \frac{e^0}{6} = \frac{1}{6}$
$...$

So the first few terms of the polynomial hidden within $e^x$ are:
$f(x) = 1 + x + \frac{1}{2} x^2 + \frac{1}{6} x^3 + ...$

Does this partial polynomial look anything like $e^x$ when plotted as a graph?



Pretty good within a boundary! Note how the curve has a perfect fit at $x = 0$ and that it gets worse as we move away from there. Adding more terms to the polynomial will enlarge the area around $x = 0$ that is close to the curve but it will always be radiating out from there.
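As a quick numerical check (my own addition, using the same four-term polynomial as above):

import math

def maclaurin_exp(x, num_terms=4):
    # 1 + x + x^2/2! + x^3/3! + ...
    return sum(x**i / math.factorial(i) for i in range(num_terms))

for x in [0.0, 0.5, 1.0, 2.0, 4.0]:
    print(x, maclaurin_exp(x), math.exp(x))  # the fit gets worse away from x = 0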

Let's try for $f(x) = cos(x)$ now.

$a_0 = \frac{f(0)}{0!} = \frac{cos(0)}{1} = 1$
$a_1 = \frac{f'(0)}{1!} = \frac{-sin(0)}{1} = 0$
$a_2 = \frac{f''(0)}{2!} = \frac{-cos(0)}{2} = -\frac{1}{2}$
$a_3 = \frac{f'''(0)}{3!} = \frac{sin(0)}{6} = 0$
$...$

So the first few terms of the polynomial hidden within $cos(x)$ are
$f(x) = 1 - \frac{1}{2} x^2 + ...$



The Taylor series

As you can see, this "polynomialisation" of functions is a neat way to approximate a function we might not know how to implement exactly but know how to differentiate and how to find its value when $x$ is 0. But what if we don't know what $f(0)$ is such as in $ln(x)$ or $\frac{1}{x}$? A slight modification to our method allows us to use any value of $x$ and not just 0. Let's call this value $b$. By slightly modifying the polynomial we expect to be hiding inside the function, we can make the polynomial act the same way when $f(b)$ is used instead of $f(0)$:

$f(x) = a_0 + a_1 (x - b) + a_2 (x - b)^2 + a_3 (x - b)^3 + a_4 (x - b)^4 + ...$
$f(b) = a_0 + a_1 (b - b) + a_2 (b - b)^2 + a_3 (b - b)^3 + a_4 (b - b)^4 + ...$
$f(b) = a_0$

$f'(x) = a_1 + 2 a_2 (x - b) + 3 a_3 (x - b)^2 + 4 a_4 (x - b)^3 + ...$
$f'(b) = a_1$

$a_i = \frac{f^{(i)}(b)}{i!}$

The catch here is that we are now finding coefficients to the polynomial $a_0 + a_1 (x - b) + a_2 (x - b)^2 + ...$ and not of $a_0 + a_1 x + a_2 x^2 + ...$, but that's OK. Let's try this on $ln(x)$ with $b = 1$.

$a_0 = \frac{f(1)}{0!} = \frac{ln(1)}{1} = 0$
$a_1 = \frac{f'(1)}{1!} = \frac{\frac{1}{1}}{1} = 1$
$a_2 = \frac{f''(1)}{2!} = \frac{-\frac{1}{1^2}}{2} = -\frac{1}{2}$
$a_3 = \frac{f'''(1)}{3!} = \frac{\frac{2}{1^3}}{6} = \frac{1}{3}$
$...$

So the first few terms of the polynomial hidden within $ln(x)$ are
$f(x) = (x - 1) - \frac{1}{2} (x - 1)^2 + \frac{1}{3} (x - 1)^3 + ...$



Adding more terms will approximate the original function better and better but what if we didn't have to? Remember how I said in the previous section that the polynomial approximates the original function best close to $x = 0$. Well now we can approximate it best around any point $b$ and not just around 0. This means that if our function has multiple known values, such as $cos(x)$ which is known to be 1 at $x = 0$, 0 at $x = \frac{\pi}{2}$, -1 at $x = \pi$, etc., then we can use several short polynomials centered at different points in the function instead of one large polynomial that approximates it well over a large interval. Let's try approximating $cos(x)$ around $x = \pi$, which means that we'll set $b$ to $\pi$.

$a_0 = \frac{f(\pi)}{0!} = \frac{cos(\pi)}{1} = -1$
$a_1 = \frac{f'(\pi)}{1!} = \frac{-sin(\pi)}{1} = 0$
$a_2 = \frac{f''(\pi)}{2!} = \frac{-cos(\pi)}{2} = \frac{1}{2}$
$a_3 = \frac{f'''(\pi)}{3!} = \frac{sin(\pi)}{6} = 0$
$...$

So the first few terms of the polynomial hidden within $cos(x)$ which best approximates the area around $x = \pi$ are
$f(x) = -1 + \frac{1}{2} (x - \pi)^2 + ...$



This is useful when implementing mathematical functions on a computer. You keep several simple polynomials defined at different points in the domain of the function and then pick the closest one to the $x$ you need to evaluate. You can then compute an approximation that isn't too bad without requiring a lot of computational time.
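As a sketch of that idea (my own illustration, not production code), here is $cos(x)$ approximated by a short Taylor polynomial centred at the nearest multiple of $\frac{\pi}{2}$, where the values of $cos$ and its derivatives are known exactly:

import math

def taylor_cos(x, num_terms=6):
    k = round(x / (math.pi / 2))
    b = k * (math.pi / 2)  # the centre point closest to x
    cos_b = [1.0, 0.0, -1.0, 0.0][k % 4]  # exact value of cos at b
    sin_b = [0.0, 1.0, 0.0, -1.0][k % 4]  # exact value of sin at b
    derivs = [cos_b, -sin_b, -cos_b, sin_b]  # cos and its derivatives at b, repeating every 4
    return sum(derivs[i % 4] / math.factorial(i) * (x - b)**i for i in range(num_terms))

for x in [0.1, 1.0, 2.0, 3.0]:
    print(x, taylor_cos(x), math.cos(x))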

Sunday, July 29, 2018

Hyperparameter tuning using Scikit-Optimize

One of my favourite academic discoveries this year was Scikit-Optimize, a nifty little Python automatic hyperparameter tuning library that comes with a lot of features I found missing in other similar libraries.

As explained in an earlier blog post, automatic hyperparameter tuning is about finding the right hyperparameters for a machine learning algorithm automatically. Usually this is done manually using human experience, but even a simple Monte Carlo search (random guesses) can result in better performance than human tweaking (see here). So automatic methods were developed that explore the space of hyperparameters and their resulting performance after training the machine learning model, and then try to home in on the best performing hyperparameters. Of course, each time you want to evaluate a new hyperparameter combination you need to retrain and evaluate your machine learning model, which might take a very long time, so it's important that a good hyperparameter combination is found with as few evaluations as possible. To do this we'll use Bayesian optimization, a process where a separate, simpler model is trained to predict the resulting performance over the whole hyperparameter space. We query this trained model for the hyperparameters it predicts will give the best performance and actually evaluate them by training our machine learning model with them. The actual resulting performance is then used to update the hyperparameter space model so that it makes better predictions, and then we get a new promising hyperparameter combination from it. This is repeated a set number of times. The most common hyperparameter space model is a Gaussian process, which maps continuous numerical hyperparameters to a single number, the predicted performance. This is not very good when your hyperparameters contain categorical data such as a choice of activation function. There is a paper that suggests that random forests are much better at mapping general hyperparameter combinations to predicted performance.

Now that we've got the theory out of the way, let's see how to use the library. We'll apply it to a gradient descent algorithm that needs to minimize the square function $x^2$. For this we'll need 3 hyperparameters: the range from which the initial value is selected randomly, the learning rate, and the number of epochs to run. So we have two continuous values and one discrete integer value.

import random

def cost(x):
    return x**2

def cost_grad(x):
    return 2*x

def graddesc(learning_rate, max_init_x, num_epochs):
    x = random.uniform(-max_init_x, max_init_x)
    for _ in range(num_epochs):
        x = x - learning_rate*cost_grad(x)
    return x

Now we need to define the skopt optimizer:

import skopt

opt = skopt.Optimizer(
            [
                skopt.space.Real(0.0, 10.0, name='max_init_x'),
                skopt.space.Real(1e-10, 1.0, 'log-uniform', name='learning_rate'),
                skopt.space.Integer(1, 20, name='num_epochs'),
            ],
            n_initial_points=5,
            base_estimator='RF',
            acq_func='EI',
            acq_optimizer='auto',
            random_state=0,
        )

The above code is specifying 3 hyperparameters:
  • the maximum initial value, which is a real number (continuous) and can be between 0 and 10
  • the learning rate, which is also a real number but on a logarithmic scale (so that you are equally likely to try very large and very small values) and can be between 1e-10 and 1
  • the number of epochs, which is an integer (whole number) and can be between 1 and 20
It is also saying that the hyperparameter space model should be initialized from 5 random hyperparameter combinations (these are evaluated first so that the model has something to learn from before it can make its first suggestion); that this model should be a random forest ('RF'); that the acquisition function (the function that decides which hyperparameter combination is the most promising to try next according to the hyperparameter space model) is the expected improvement ('EI'); that the acquisition optimizer (the optimizer that finds the next promising hyperparameter combination) is chosen automatically; and that the random state is set to a fixed number (zero) so that you get the same random values each time you run it.

Next we will use the optimizer to find good hyperparameter combinations.

best_cost = 1e100
best_hyperparams = None
for trial in range(5 + 20):
    [max_init_x, learning_rate, num_epochs] = opt.ask()
    [max_init_x, learning_rate, num_epochs] = [max_init_x.tolist(), learning_rate.tolist(), num_epochs.tolist()]
    next_hyperparams = [max_init_x, learning_rate, num_epochs]
    next_cost = cost(graddesc(learning_rate, max_init_x, num_epochs))
    if next_cost < best_cost:
        best_cost = next_cost
        best_hyperparams = next_hyperparams
    print(trial+1, next_cost, next_hyperparams)
    opt.tell(next_hyperparams, next_cost)
print(best_hyperparams)

The nice thing about this library is that you can use an 'ask/tell' system where you ask the library to give you the next hyperparameters to try and then you do something with them in order to get the actual performance value and finally you tell the library what this performance value is. This lets you do some nifty things such as ask for another value if the hyperparameters resulted in an invalid state in the machine learning model or even to save your progress and continue later.

In the above code we're running a for loop for the number of times we want to evaluate different hyperparameter combinations: the 5 random evaluations we specified before to initialize the hyperparameter space model plus another 20 evaluations to actually optimize the hyperparameters. Now skopt does something funny in that it returns not plain Python values for the hyperparameters but numpy scalars, so we convert each numpy scalar back into a plain Python float or int using ".tolist()". We ask for the next hyperparameters to try, convert them to plain Python values, get their resulting cost after running gradient descent, keep track of the best hyperparameters found so far, and tell the library what the given hyperparameters' resulting performance was. At the end we print the best hyperparameter combination found.

Some extra stuff:
  • You can ask for categorical hyperparameters using "skopt.space.Categorical(['option1', 'option2'], name='options')" which will return one of the values in the list when calling "ask".
  • You can ask for a different hyperparameter combination in case of an invalid one by changing "ask" to give you several hyperparameter suggestions rather than just one and then trying each one of them until one works by using "opt.ask(num_hyperpars)" (you can also incrementally ask for more values and always take the last one).
  • You can save and continue by saving all the hyperparameters evaluated and their corresponding performance values in a text file, and later re-supplying the saved hyperparameters and their performance using "tell" for each one (a sketch of this follows below). This is much faster than actually evaluating them on the machine learning model, so re-supplying known values finishes quickly. Just be careful to also call "ask" before each "tell" in order to get exactly the same behaviour from the optimizer, or else the next "tell" will give different values from what it would have given had you not loaded the previous ones manually.
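Here is a rough sketch of that save/continue idea (my own illustration; the log file format and function names are made up):

import json

def log_result(log_path, hyperparams, cost_value):
    # Append one evaluated hyperparameter combination and its cost to a text file.
    with open(log_path, 'a') as f:
        f.write(json.dumps({'params': hyperparams, 'cost': cost_value}) + '\n')

def replay_log(opt, log_path):
    # Re-supply previously evaluated combinations to a freshly created optimizer.
    with open(log_path, 'r') as f:
        for line in f:
            record = json.loads(line)
            opt.ask()  # keep the optimizer's internal state in step with the tells
            opt.tell(record['params'], record['cost'])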

Thursday, June 28, 2018

Saving/Loading a Tensorflow model using HDF5 (h5py)

The normal way to save the parameters of a neural network in Tensorflow is to create a tf.train.Saver() object and then call the object's "save" and "restore" methods. But this can be a bit cumbersome and you might prefer to have more control over what gets saved and loaded and how. The standard file format for saving large tensors (such as the parameters of a neural network) is HDF5.

Saving is easy. Here is the code you'll need:
import h5py
import tensorflow as tf

with h5py.File('model.hdf5', 'w') as f:
    for var in tf.trainable_variables():
        key = var.name.replace('/', ' ')
        value = session.run(var)
        f.create_dataset(key, data=value)

Notice that we need to use the session in order to get a parameter value.

If you're using variable scopes then your variable names will have slashes in them, and here we're replacing the slashes with spaces. The reason is that the HDF5 format treats keys as directory paths where folder names are separated by slashes, which means that you would need to traverse the keys recursively (one folder name at a time) to reach the data if you do not know the full name at the start. Replacing the slashes simplifies the code for loading the model later.

Notice also that you can filter the variables to save as you like as well as save extra stuff. I like to save the Tensorflow version in the file in order to be able to check for incompatible variable names in contrib modules (RNNs had some variable names changed in version 1.2).

Now comes the loading part. Loading is a tiny bit more involved because it requires that your neural network code include extra nodes for assigning values to the variables. All you need to do whilst constructing your Tensorflow graph is to include the following code:
param_setters = dict()
for var in tf.trainable_variables():
    placeholder = tf.placeholder(var.dtype, var.shape, var.name.split(':')[0]+'_setter')
    param_setters[var.name] = (tf.assign(var, placeholder), placeholder)

What this code does is it creates separate placeholder and assign nodes for each variable in your graph. In order to modify a variable you need to run the corresponding assign node in a session and pass the value through the corresponding placeholder. All the corresponding assign nodes and placeholders are kept in a dictionary called param_setters. We're also naming the placeholder the same as the variable but with '_setter' at the end.

Notice that param_setters is a dictionary mapping variable names to a tuple consisting of the assign node and the placeholder.

Now we can load the HDF5 file as follows:
import numpy as np

with h5py.File('model.hdf5', 'r') as f:
    for (name, val) in f.items():
        name = name.replace(' ', '/')
        val = np.array(val)
        session.run(param_setters[name][0], { param_setters[name][1]: val })

What's happening here is that we're loading each parameter from the file and replacing the spaces in names back into slashes. We then run the corresponding assign node for the given parameter name in param_setters and set it to the loaded value.

Wednesday, May 23, 2018

Fancy indexing in Tensorflow: Getting a different element from every row in a matrix

Let's say you have the following 4 by 3 matrix:
M = np.array([[  1,  2,  3 ],
              [  4,  5,  6 ],
              [  7,  8,  9 ],
              [ 10, 11, 12 ]])
When in Tensorflow we'd also have this line:
M = tf.constant(M)

Let's say that you want to get the first element of every row. In both Numpy and Tensorflow you can just do this:
x = M[:, 0]
which means 'get every row and from every row get the element at index 0'. "x" is now equal to:
np.array([1, 4, 7, 10])

Now let's say that instead of the first element of every row, you wanted the third element of the first row, the second element of the second row, the first element of the third row, and the second element of the fourth row. In Numpy, this is how you do that:
idx = [2,1,0,1]
x = M[[0,1,2,3], idx]
or equivalently:
x = M[np.arange(M.shape[0]), idx]
This is just breaking up the 'coordinates' of the desired elements into separate lists for each axis. "x" is now equal to:
np.array([3, 5, 7, 11])

Unfortunately this sort of fancy indexing isn't available in Tensorflow. Instead we have the function "tf.gather_nd" which lets us provide a list of 'coordinates'. Unlike Numpy, tf.gather_nd does not take separate lists per axis but expects a single list with one coordinate per item like this:
idx = tf.constant([[0,2],[1,1],[2,0],[3,1]], tf.int32)
x = tf.gather_nd(M, idx)

This is a bit inconvenient as we usually have a single vector of indexes rather than a list of coordinates. It would be better to be able to just use a "range" like we did with Numpy. We can create the range and then join it to the vector of indexes sideways using "tf.stack", like this:
idx = tf.constant([2,1,0,1], tf.int32)
x = tf.gather_nd(M, tf.stack([tf.range(M.shape[0]), idx], axis=1))

Kind of bulky but at least it's possible. I miss Theano's Numpy-like interface.
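For completeness, here is the whole thing as a runnable snippet (a sketch assuming Tensorflow 1.x graph mode):

import numpy as np
import tensorflow as tf

M = tf.constant(np.array([[  1,  2,  3 ],
                          [  4,  5,  6 ],
                          [  7,  8,  9 ],
                          [ 10, 11, 12 ]]))
idx = tf.constant([2, 1, 0, 1], tf.int32)
x = tf.gather_nd(M, tf.stack([tf.range(tf.shape(M)[0]), idx], axis=1))

with tf.Session() as session:
    print(session.run(x))  # [ 3  5  7 11 ]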

Monday, April 30, 2018

Why Bahdanau's neural attention requires two layers

The neural attention model was first described for the task of machine translation in Bahdanau's Neural machine translation by jointly learning to align and translate. The source sentence is translated into the target sentence by attending to different words in the source sentence as the target sentence is generated one word at a time. The attended source sentence is defined as follows:

$$
\begin{align}
c_i &= \sum_j \alpha_{ij} s_j \\
\alpha_{ij} &= \frac{e^{z_{ij}}}{\sum_k e^{z_{ik}}} \\
z_{ij} &= W \tanh(V (s_j ++ p_{i-1}))
\end{align}
$$
where $c_i$ is the attended source sentence taken from the weighted sum of the source sentence vectors $s_j$ and the weights $\alpha_{ij}$, $p_{i-1}$ is the prefix vector produced by the 'decoder' RNN that remembers what has been generated thus far, $++$ means concatenation, and $W$ and $V$ are two weight matrices.

So the question we're asking is, why do we need to use two layers to produce $z_{ij}$? Can we do with just one layer?

In reality, what happens when you use a single layer is that the attention distribution remains the same across time steps: although different words in the source sentence receive different amounts of attention, as the target sentence gets generated these words keep receiving the same attention they did throughout the whole generation process. The reason for this is that softmax, the function that produces $\alpha_{ij}$, is shift invariant, that is, it does not change if you add the same number to each of its inputs.

Let's say $z$ is defined as [ 1, 2, 3 ]. Then $\alpha$ will be
$$
\begin{matrix}
(\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3})
\end{matrix}
$$
but if we add the constant k to each of the three numbers then the result will still be the same
$$
\begin{matrix}
& (\frac{e^{1+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{2+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{3+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) \\
=& (\frac{e^1e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^2e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^3e^k}{e^1e^k + e^2e^k + e^3e^k}) \\
=& (\frac{e^1e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^2e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^3e^k}{e^k(e^1 + e^2 + e^3)}) \\
=& (\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3})
\end{matrix}
$$

This proves that adding the same constant to every number in $z$ will leave the softmax unaltered. Softmax is shift invariant.

Now let's say that $z$ is determined by a single neural layer such that $z_{ij} = W (s_j ++ p_{i-1}) = W_0 s_j + W_1 p_{i-1}$. We can draw a matrix of all possible $z$ where the columns are the source vectors $s$ and the rows are the decoder prefix states $p$.
$$
\begin{matrix}
(W_0 s_0 + W_1 p_0) & (W_0 s_1 + W_1 p_0) & (W_0 s_2 + W_1 p_0) & \dots \\
(W_0 s_0 + W_1 p_1) & (W_0 s_1 + W_1 p_1) & (W_0 s_2 + W_1 p_1) & \dots \\
(W_0 s_0 + W_1 p_2) & (W_0 s_1 + W_1 p_2) & (W_0 s_2 + W_1 p_2) & \dots \\
\dots & \dots & \dots & \dots
\end{matrix}
$$

Given a single row, the $p$s are always the same, which means that the only source of variation between $z$s of the same prefix $p$ is from the source vectors $s$. This makes sense.

Now take the first two rows. What is the result of subtracting the second from the first?

$$
\begin{matrix}
(W_0 s_0 + W_1 p_0 - W_0 s_0 - W_1 p_1) & (W_0 s_1 + W_1 p_0 - W_0 s_1 - W_1 p_1) & (W_0 s_2 + W_1 p_0 - W_0 s_2 - W_1 p_1) & \dots \\
(W_1(p_0-p_1)) & (W_1(p_0-p_1)) & (W_1(p_0-p_1)) & \dots
\end{matrix}
$$

Notice how all the columns have the same difference, which means that the second row can be rewritten as the first row with the same constant added to every entry:

$$
\begin{matrix}
(W_0 s_0 + W_1 p_0 + W_1(p_1-p_0)) & (W_0 s_1 + W_1 p_0 + W_1(p_1-p_0)) & (W_0 s_2 + W_1 p_0 + W_1(p_1-p_0)) & \dots
\end{matrix}
$$

We know that adding the same constant to every $z$ leaves the softmax unaltered, which means that every time step in the decoder RNN leads to the same attention distribution. The individual attention values will differ across source words, but they will not change throughout the whole generation process. Using two layers with a non-linear activation function in the middle disrupts this, as the difference between the $z$s of two consecutive time steps is no longer the same constant for every source position.
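Here is a small numpy check of this argument (my own illustration, using random weights rather than trained ones): with a single linear layer the attention weights come out identical for every decoder state, whereas the two layer version changes them:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.RandomState(0)
S = rng.randn(5, 8)   # 5 source sentence vectors s_j
P = rng.randn(3, 8)   # 3 decoder prefix vectors p_i
W0 = rng.randn(8)
W1 = rng.randn(8)

# Single layer: z_ij = W0.s_j + W1.p_i gives the same attention for every p_i.
for p in P:
    print(softmax(S.dot(W0) + p.dot(W1)))

# Two layers with a tanh in the middle: the attention now depends on p_i.
V = rng.randn(8, 16)
W = rng.randn(8)
for p in P:
    z = np.array([W.dot(np.tanh(V.dot(np.concatenate([s, p])))) for s in S])
    print(softmax(z))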

Thursday, March 29, 2018

Word Mover's Distance (a text semantic similarity measure)

Word Mover's Distance is a way to measure the similarity between the meaning of two sentences. It's basically a bag of words method (ignores the order of the words in sentences) but it's been found to work really well, including for evaluating automatic caption generation. You can find code to perform k-nearest neighbours classification of sentiment written by the creator of WMD here, but reading the code is a challenge so here's how WMD works.

This measure makes use of an existing distance measure called Earth Mover's Distance. Given a set of heaps of dirt and a set of holes, and both the heaps and the holes have a point as a location (a vector) and a certain mass or capacity (a weight), what is the least amount of work needed to move all the dirt into the holes or to fill all the holes completely (depending on whether there is more total mass or total capacity)? By work we mean the total distance travelled multiplied by the mass carried. This is what EMD calculates. It measures the distance between a set of vectors paired with weights and another set of vectors paired with weights, without requiring that both sets have the same number of vectors. It is a linear programming problem and there is a Python library for it called "pyemd" that can be installed via pip.
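As a toy illustration of that formulation (my own example with made-up locations and weights), this is roughly what a call to pyemd looks like: both sets of weights are expressed over the union of all the points, and a matrix of pairwise distances between the points is supplied:

import numpy as np
from pyemd import emd

# Three points in 2D; the first 'sentence' puts mass on the first two points,
# the second 'sentence' puts mass on the last two.
locations = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
weights_a = np.array([0.5, 0.5, 0.0])
weights_b = np.array([0.0, 0.5, 0.5])

# Pairwise Euclidean distances between all the points.
distances = np.sqrt(((locations[:, None, :] - locations[None, :, :])**2).sum(-1))
print(emd(weights_a, weights_b, distances))  # the minimum total work needed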

Now back to Word Mover's Distance. WMD uses EMD between two sentences. It assumes some kind of pre-computed word embedding vectors such as word2vec, which are vectors that encode the meaning of words. The words in the sentences are the heaps and holes and these word vectors are their locations. For weights, we use the normalized frequency of the words in the sentence, that is, the number of times the word was found in the sentence divided by the total number of words in the sentence. We also ignore any words that are stop words (non-informative common words like "the" and "of") and any words that are not found in the list of word vectors. Once we have collected the set of vectors and weights of the two sentences we then just put them in the EMD solver to get the distance.

The Python library "gensim" implements WMD given a word2vec dataset and two lists of words (the sentences) without stop words. You can find a tutorial on how to use it here.
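A minimal sketch of the gensim route (the word2vec file path here is a placeholder for whatever pre-trained vectors you have):

from gensim.models import KeyedVectors

# Load pre-trained word2vec vectors (placeholder path).
vectors = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)

sentence_1 = ['dog', 'barked', 'at', 'man']        # stop words already removed
sentence_2 = ['hound', 'growled', 'at', 'person']
print(vectors.wmdistance(sentence_1, sentence_2))  # smaller distance means more similar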

Monday, February 26, 2018

Finding the Greatest Common Divisor (GCD) / Highest Common Factor (HCF) of two numbers

The greatest common divisor or highest common factor of two numbers can be found using a very simple algorithm described since the time of Euclid, known as the Euclidean algorithm. The nice thing about it is that it can be computed visually by drawing squares in a rectangle as follows.


Let's say that you have a line segment of length 6 units and a smaller line segment of length 2 units.




We can check if the smaller line's length is a divisor of the larger line's length if we can place several of the smaller lines next to each other and reach the exact length of the larger line.




Likewise we can check if a number is a common divisor of two other numbers by checking if a square can fill exactly a rectangle.


In the above diagram, the big blue rectangle has sides of 6 units by 4 units and the small red squares have sides 2 units by 2 units. Since the red square can exactly fill the blue rectangle by placing copies of it next to each other, 2 is a common divisor of 4 and 6.

The greatest common divisor is the side of the largest square that can exactly fill the rectangle. The largest square that could possibly do this has its side equal to the smallest side of the rectangle (anything bigger would not fit in the rectangle at all), whilst the smallest square has its side equal to 1 unit (assuming that sides are always whole number lengths). Let's take the above rectangle as an example and use the Euclidean algorithm to find the largest square that fits a 6 by 4 rectangle.

The Euclidean algorithm takes advantage of the fact that the greatest common divisor of two numbers is also a divisor of the difference of the two numbers. If two numbers "a" and "b" have "c" as a common divisor, then they can be rewritten as "c×p" and "c×q". The difference "a-b" is then equal to "c×p - c×q", which is equal to "c×(p - q)", which therefore has "c" as a divisor since "c" is one of its factors. Notice that "c" can be any common divisor of "a" and "b", which means that "a-b" also has the greatest common divisor of "a" and "b" as a divisor.

This means that the largest square to exactly fill the original 6 by 4 rectangle will also be the largest square to exactly fill a 2 by 4 rectangle, since 2 is the difference between 4 and 6.

This piece of information allows us to substitute the difficult problem of finding the largest square that exactly fills a rectangle with an easier problem of finding the same square but on a smaller rectangle. The smaller the rectangle, the easier it is to find the largest square to exactly fill it and the Euclidean algorithm allows us to progressively chop off parts of the rectangle until all that remains is a small square which would be the square that exactly fills the original rectangle.

So the algorithm goes as follows:

  1. If the rectangle has both sides equal then it itself is the square that is the largest square that exactly fills the rectangle and we are done.
  2. Otherwise put a square inside the rectangle of length equal to the shortest side of the rectangle.
  3. Chop off the part of the rectangle that is covered by the square and repeat.

The 6 by 4 rectangle is trivial so let's try this algorithm on a slightly harder problem of 217 by 124. This time we'll be using exact pixels to represent the rectangle lengths.

This is a 217 by 124 rectangle.



This is the overlap it has with a square of length equal to its shortest side.




We now chop off that part of the rectangle and find the largest square which fits the remaining 93 by 124 rectangle.




This is the new square's overlap.




This is the remaining 93 by 31 rectangle.




This is the new square's overlap.




This is the remaining 62 by 31 rectangle.




This is the new square's overlap.




This is the remaining 31 by 31 rectangle.




We have finally reached a square of side 31, which means that 31 is the greatest common divisor of 217 and 124.


Of course we don't really need to visualise anything. We can do this all numerically by just repeatedly subtracting the smallest number from the biggest and replacing the biggest with the difference. Here's a Python function that does that:

def gcd(a, b):
    while a != b:
        if a < b:
            b = b - a
        else:
            a = a - b
    return a

We can make this shorter by using tuple unpacking:

def gcd(a, b):
    while a != b:
        (a, b) = (min(a,b), max(a,b)-min(a,b))
    return a

If one of the numbers is much bigger than the other, this program will waste a lot of time repeatedly subtracting the same number until the bigger number becomes smaller than the other. Fortunately this repeated subtraction can be done in a single operation by dividing and taking the remainder, which speeds the algorithm up considerably. This is the modulo operation: the remainder is what's left when you can't subtract the second number from the first any more. With modulo, the subtractions will not stop when both numbers are equal, since modulo does one more subtraction and makes the first number zero. Instead we keep looping until the remainder is zero, at which point the last number used to compute the remainder is the greatest common divisor:

def gcd(a, b):
    while b > 0:
        (a, b) = (min(a,b), max(a,b)%min(a,b))
    return a
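As a quick sanity check (my own addition), the result agrees with Python's built-in implementation:

import math

print(gcd(217, 124), math.gcd(217, 124))  # both give 31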

Tuesday, January 30, 2018

A quick guide to running deep learning applications on Amazon AWS virtual server supercomputer

Deep learning is serious business and if you want to work on sizeable problems you're going to need more hardware than you probably have at home. Here is how to use Amazon Web Services to upload your code and datasets to an online server which will run them for you and then let you download the results. It's basically a virtual computer that you use over the Internet by sending commands.

First of all, when I got started I was initially following this video on how to use AWS:
https://youtu.be/vTAMSm06baA?t=1476

AWS Educate

If you're a student then you can take advantage of AWS Educate where if you are accepted you will receive 100 free hours of runtime to get started. 100 hours in deep learning will not last long but it will allow you to experiment and get your bearings with AWS without worrying about the bill too much. It will also be a few days before you get a reply. Here's the link:
https://aws.amazon.com/education/awseducate/

Get the applications you need

If you're on Windows, then before creating your server you should install WinSCP and PuTTY in order to be able to upload your data and run your programs. If you're not on Windows then you can use the terminal with ssh to do this.

Create an account

Start by visiting this link and making an account:
https://aws.amazon.com/

You need to provide your credit card details before starting. You will be charged automatically every month, so make sure to keep an eye on your usage, as you pay per hour for the time your server is running.

Enter the AWS console

Next go into the AWS console which is where you get to manage everything that has to do with virtual servers:
https://console.aws.amazon.com/

Note that there is the name of a city at the top, such as "Ohio". This is the region where your servers will be located. Amazon has servers all around the world and you might prefer one region over another, mostly to reduce latency. Different regions also have different prices, so that might take priority in your decision. Ohio and N. Virginia seem to be the cheapest. See here for more information:
https://www.concurrencylabs.com/blog/choose-your-aws-region-wisely/

Keep in mind that each region has its own interface, so if you reserve a server in one region you will only see and be able to configure it when that region is selected. There is no single list of all your servers across all regions, so make sure you remember which regions you use.

After having chosen a region, next go to Services and click on EC2:
https://console.aws.amazon.com/ec2/

Create a virtual server

Click on the big blue button called "Launch Instance" in order to create your virtual server. You can now see a list of AMIs which are preconfigured virtual servers that are specialized for some kind of task, such as deep learning. You're going to copy an instance of one of these and then upload your files to it. Look for the AMI called "Deep Learning AMI (Ubuntu)" which contains a bunch of deep learning libraries together with CUDA drivers. Click on "Select".

This is where you choose the computing power you want. The better it is, the more expensive. If you just want to see how it works then choose a free one which says "Free tier eligible". If you want to get down to business then choose one of the "GPU Compute" instances such as "p2.xlarge" which has 12GB of GPU memory. The "p2.xlarge" costs about 90 cents per hour (which starts from the moment you create the instance, not when you start running your programs so it also includes the time it takes to upload your data).

If this is your first time creating a powerful instance then you will first need to ask Amazon to let you use it (this is something they do to avoid people hogging all the resources). Under "Regarding" choose "Service Limit Increase", under "Limit Type" choose "EC2 Instances", and under "Region" choose the region you selected previously. You also have to say something about how you'll use it. After being granted access you can then continue from where we left off.

After ticking the check box next to your selected computer power, click on "Next: Configure Instance Details".

Leave this next step with default settings. Click on "Next: Add Storage".

This is where you choose your storage space. You will need at least 50GB of space in order to store the deep learning AMI but you will need additional space for your datasets and results. Keep in mind that you have to pay per GB per month for storage. If you need frequent access to the hard drive (such as loading minibatches from disk) then you'll need to use an SSD drive which costs about 10 cents per GB per month. Otherwise you can use a magnetic drive which costs about 5 cents per GB per month.

Next click on "Next: Add Tags". This is where you make up some informative tags for your virtual server in order to differentiate it from other virtual servers. You can do something like "Name=Deep learning". If you only have one server then this is not important. Click on "Next: Configure Security Group".

This is where you create some firewall rules to make sure that only your computer has access to the virtual server, even if someone else knows the password. It might be the case that this doesn't work for you and you won't be able to connect at all, even from your own IP address, in which case choose "Anywhere" as a source, which will not apply any restrictions based on IP. Click "Review and Launch".

As soon as you click on "Launch" at the bottom the clock starts ticking and you will start being billed. If you've got the AWS Educate package then you will have 100 free hours but they will start being used up. You can stop an instance any time you want but you will be billed for one hour as soon as it starts, even if you stop it immediately.

If this is your first time using a server in the selected region (the place you want your server to be in) then you will be asked to create a cryptographic private key which is a file that will be used instead of a password to connect to the virtual server. If you're on Windows then you need to use a program that comes with PuTTY called PuTTYgen which converts the .pem file that Amazon sends you to a .ppk file that PuTTY can use. Follow the section called "To convert your private key" in the following link to do this:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console

Connecting to your virtual server

Now that we have our server ready we need to connect to it, upload our stuff, and get it to run. We're also on the clock so we need to hurry. Start by visiting your list of server instances, which can be found in the side bar when you're in the EC2 dashboard:
https://console.aws.amazon.com/ec2/v2/home#Instances:sort=instanceId

Your server should be running. You can stop it by right clicking it and, under "Instance state", choosing "Stop". This stops the hourly charges, but you will still be charged for at least one hour every time you start it.

Clicking on a server and then clicking on "Connect" will give you information to access the server using PuTTY or ssh.

If you're using Windows, you connect to your server using PuTTY, which you can find out how using this guide:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console
After connecting using PuTTY you can transfer the configuration to WinSCP by opening WinSCP, going on Tools, and choosing "Import sites". Now you can connect using WinSCP in order to upload and download files as well.

If you're not using Windows then you should use the terminal and follow this guide instead:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html
You can upload stuff from Linux by using FileZilla or by using the normal directory explorer and clicking CTRL+L to enter an FTP location.

Upload all the files you need including datasets and scripts. You can zip the files before uploading them and then unzip them on the server using the "unzip" Linux command.

DO NOT INSTALL LIBRARIES WITH PIP YET! See the next section first.

Using your virtual server

Finally we are close to start running our scripts. But we still need to do two more things first.

First of all, look at the message at the top of the PuTTY or terminal window which tells you how to activate different Anaconda environments. These are Python environments with different libraries available that are connected to CUDA so that you will be able to run on the GPU. In PuTTY or the terminal enter "source activate <environment name>". Remember this line as you'll need it again later.

Now you can install anything you want using pip.

Secondly as soon as you close PuTTY or terminal all running processes will stop (but the instance will still be running so you'll still be paying). What you need to do is to use the application called "screen" which will keep everything running on its own. See this link:
https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/

Now you'll need to activate the environment again because screen creates a new session which is disconnected from the previous one.

Hooray!

You're done! You can now start using your server. When you're finished, make sure to stop it. You can even terminate it, but that completely deletes the server with all its data, so only do that when you're really done, otherwise you'll waste money re-uploading and re-running everything.

You can see your current bill by clicking on Services and going to Billing. This will show you your current month's bill as well as previous bills. You can even get a daily breakdown and forecasts.
https://console.aws.amazon.com/billing/home