Saturday, October 21, 2017

Hyperparameter tuning using Hyperopt

One of the most tedious but important tasks in machine learning is tuning the hyperparameters of your learning algorithm, such as the learning rate and the initial parameters in gradient descent. For example, you might want to check how your gradient descent algorithm performs when the learning rate is 1.0, 0.1, or 0.01 and when the initial parameters are chosen randomly between -10.0 and 10.0 or between -1.0 and 1.0.

The problem is that, unless you have several supercomputers at your disposal, testing a learning algorithm is a slow process, and there are many different ways to configure your hyperparameters. In the above example you'd have to try six different configurations: (1.0, (-10.0, 10.0)), (1.0, (-1.0, 1.0)), (0.1, (-10.0, 10.0)), etc. An exhaustive search (a grid search, sketched below) would take too long when each test is slow, and in practice the search space is much bigger than six configurations. We could test a random sample of the configurations, but ideally the sample would be chosen intelligently rather than randomly: we could start by trying some randomly chosen configurations and then home in on the more promising hyperparameter choices.
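To make the grid search concrete, here is a minimal sketch of it. The "evaluate" function is a hypothetical stand-in for training the model and measuring its loss with the given hyperparameters (its body below is a dummy, purely for illustration):

import itertools

# Hypothetical stand-in for training the model and measuring its loss.
def evaluate(learning_rate, max_init_x):
    return abs(learning_rate - 0.1) + max_init_x  # dummy loss for illustration

# Enumerate all six configurations and keep the one with the smallest loss.
best = min(
    itertools.product([ 1.0, 0.1, 0.01 ], [ 10.0, 1.0 ]),
    key=lambda config: evaluate(*config)
)
print(best)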

This is where the Python library Hyperopt comes in. It's a simple library that searches a space of hyperparameters in a rational way, so that you'll end up with a better result after trying 100 configurations than if you had randomly sampled 100 configurations and picked the best performing one. It does this using a Tree-structured Parzen Estimator (TPE).

Let's start with a simple gradient descent example that finds the minimum of "y = x^2". We'll write a function that takes a learning rate and a range within which to initialize "x" (we'll only ask for the maximum of this range and assume that its negative is the minimum). We'll then apply gradient descent to the initial "x" value for 10 epochs, and finally return the value of "x" that was found near the minimum.

import random

# The function to minimize: y = x^2.
def loss(x):
    return x**2

# Its derivative: dy/dx = 2x.
def loss_grad(x):
    return 2*x

def graddesc(learning_rate, max_init_x):
    # Initialize x uniformly at random within the given range.
    x = random.uniform(-max_init_x, max_init_x)
    # Take 10 gradient descent steps.
    for _ in range(10):
        x = x - learning_rate*loss_grad(x)
    return x

Great, now we want to find the best hyperparameters to use. In practice we'd have a validation set on which our machine learning model has to perform well. Once we've found the hyperparameters that give the best performance on the validation set, we'd measure the resulting model's performance on a separate test set, and it is this performance that is used to judge the quality of the learning algorithm. However, for this blog post we'll instead focus on the simpler mathematical optimization problem of finding the minimum of "y = x^2".

This is how you use Hyperopt to find the best hyperparameter combination. First you create an objective function, which takes a tuple of chosen hyperparameters and returns a dictionary that Hyperopt reads to assess their fitness. You then pass this function, together with the hyperparameter choices you want to allow, to the Hyperopt function "fmin", which finds the hyperparameters that give the smallest loss.

import hyperopt

# The objective function: evaluate a hyperparameter combination and
# return the resulting loss in a dictionary that Hyperopt understands.
def hyperopt_objective(hyperparams):
    (learning_rate, max_init_x) = hyperparams
    l = loss(graddesc(learning_rate, max_init_x))
    return {
            'loss':   l,
            'status': hyperopt.STATUS_OK,
        }

best = hyperopt.fmin(
        hyperopt_objective,
        space     = [
                hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),
                hyperopt.hp.choice('max_init_x',    [ 10.0, 1.0 ]),
            ],
        algo      = hyperopt.tpe.suggest,  # the Tree-structured Parzen Estimator
        max_evals = 10                     # number of hyperparameter combinations to try
    )
print(best)

The output of this program is:
{'learning_rate': 1, 'max_init_x': 1}
This says that the best loss is obtained when the learning rate is the item at index 1 of its list (0.1) and the maximum initial value is the item at index 1 of its list (1.0). The "space" parameter of fmin says how to construct a hyperparameter combination. Here we specified a list of two things: a learning rate that can be 1.0, 0.1, or 0.01, and a maximum initial value that can be 10.0 or 1.0. We used "hp.choice" to let fmin choose among a list; we could instead have used "hp.uniform" to allow any number within a range. I prefer a list of human-friendly numbers over allowing any number, so I use the choice function. We have also said that we want exactly 10 evaluations of the objective function in order to find the best hyperparameter combination.
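As a side note, here is a sketch (not part of the original example) of what an "hp.uniform" based space might look like. It also uses Hyperopt's "space_eval" helper to map the index-based result of fmin back to actual values; the bounds 0.01 and 1.0 are illustrative assumptions:

# Sample the learning rate from a continuous range instead of a fixed list.
space = [
        hyperopt.hp.uniform('learning_rate', 0.01, 1.0),
        hyperopt.hp.choice('max_init_x', [ 10.0, 1.0 ]),
    ]

best = hyperopt.fmin(
        hyperopt_objective,
        space     = space,
        algo      = hyperopt.tpe.suggest,
        max_evals = 10
    )

# space_eval maps the index-based result back to the actual values.
print(hyperopt.space_eval(space, best))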

Although this is how we are expected to use the library, it is not very user friendly in general. For example, no feedback is given during the search, so if each evaluation of a hyperparameter combination takes hours to complete, we could end up waiting for several days without seeing anything, just waiting for the function to return a value. The return value also requires further processing, as it consists of list indexes rather than the actual values. We can improve this by adding some extra code to the objective function:

eval_num = 0
best_loss = None
best_hyperparams = None
def hyperopt_objective(hyperparams):
    global eval_num
    global best_loss
    global best_hyperparams

    eval_num += 1
    (learning_rate, max_init_x) = hyperparams
    l = loss(graddesc(learning_rate, max_init_x))
    # Print progress after every evaluation.
    print(eval_num, l, hyperparams)

    # Keep track of the best hyperparameters found so far.
    if best_loss is None or l < best_loss:
        best_loss = l
        best_hyperparams = hyperparams

    return {
            'loss':   l,
            'status': hyperopt.STATUS_OK,
        }

We can now see each hyperparameter combination as it is being evaluated. This is the output we'll see (the exact numbers will vary between runs since the initial value of "x" is random):
1 1.6868761146697238 (1.0, 10.0)
2 0.34976768426779775 (0.01, 1.0)
3 0.006508209785146999 (0.1, 1.0)
4 1.5999357079405185 (0.01, 10.0)
5 0.2646974732349057 (0.01, 1.0)
6 0.5182259594937579 (1.0, 10.0)
7 53.61565213613977 (1.0, 10.0)
8 1.8239879002601682 (1.0, 10.0)
9 0.15820396975495435 (0.01, 1.0)
10 0.784445725853568 (1.0, 1.0)

Also, the variable best_hyperparams will contain the tuple with the best hyperparameters found; printing it will show "(0.1, 1.0)". We can even save the best hyperparameters found so far to a file, so that we can stop the search early if we run out of patience.
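For example, we could use a small helper like the following sketch (the helper name and file name are just examples, not part of Hyperopt), calling it inside the "if" block of the objective function so that every improvement is recorded as it happens:

# Append each new best result to a file (the file name is just an
# example) so that nothing is lost if the search is stopped early.
def record_best(eval_num, l, hyperparams):
    with open('best_hyperparams.txt', 'a') as f:
        print(eval_num, l, hyperparams, file=f)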

Here is the full code in one place:

import random
import hyperopt

# The function to minimize: y = x^2.
def loss(x):
    return x**2

# Its derivative: dy/dx = 2x.
def loss_grad(x):
    return 2*x

def graddesc(learning_rate, max_init_x):
    # Initialize x uniformly at random within the given range.
    x = random.uniform(-max_init_x, max_init_x)
    # Take 10 gradient descent steps.
    for _ in range(10):
        x = x - learning_rate*loss_grad(x)
    return x

eval_num = 0
best_loss = None
best_hyperparams = None
def hyperopt_objective(hyperparams):
    global eval_num
    global best_loss
    global best_hyperparams

    eval_num += 1
    (learning_rate, max_init_x) = hyperparams
    l = loss(graddesc(learning_rate, max_init_x))
    # Print progress after every evaluation.
    print(eval_num, l, hyperparams)

    # Keep track of the best hyperparameters found so far.
    if best_loss is None or l < best_loss:
        best_loss = l
        best_hyperparams = hyperparams

    return {
            'loss':   l,
            'status': hyperopt.STATUS_OK,
        }

hyperopt.fmin(
        hyperopt_objective,
        space     = [
                hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),
                hyperopt.hp.choice('max_init_x',    [ 10.0, 1.0 ]),
            ],
        algo      = hyperopt.tpe.suggest,  # the Tree-structured Parzen Estimator
        max_evals = 10                     # number of hyperparameter combinations to try
    )

print(best_hyperparams)

To find out more about Hyperopt, see the documentation and the GitHub repository.