Sunday, August 28, 2016

Using dropout in neural networks to avoid overfitting

Imagine you want to train a neural network to recognise objects in photos. You build a training set of labelled photos and train on it, and the training brings the error down until the network recognises the objects in the training set with 100% accuracy. But when you then try the network on some new photos for actual use, it gets all of them wrong. This is a pesky problem in training neural networks called overfitting, that is, the network learns a set of weights that works for the provided training set but for nothing beyond it. It is a problem of generalisation that is present in every machine learning technique that has to learn from a finite training set, and it is related to something called the No Free Lunch theorem.

The general solutions are to make the training set larger by adding a wider variety of examples and to keep the model as simple as possible, since a complicated model is more likely to learn something complicated but uninteresting rather than something useful. There should also be a separate validation set, made of extra data that was not used for training, which is used to check how well the learned model performs on new data.
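For instance, a hold-out validation set can be as simple as setting aside a random portion of the labelled data before training. Here is a rough numpy sketch, where the arrays are just dummy stand-ins for real photos and labels:

import numpy as np

# Dummy stand-ins for a real labelled photo data set.
photos = np.random.randn(1000, 32, 32, 3)
labels = np.random.randint(0, 10, size=1000)

# Keep 10% of the data aside; train only on the rest and use the held-out
# part to check how the network performs on data it has never seen.
indices = np.random.permutation(len(photos))
split = int(0.9 * len(photos))
train_photos, train_labels = photos[indices[:split]], labels[indices[:split]]
valid_photos, valid_labels = photos[indices[split:]], labels[indices[split:]]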

There is also a very interesting solution specific to neural networks called dropout. Dropout is a technique that was published in a paper titled Improving neural networks by preventing co-adaptation of feature detectors, and it works by randomly replacing activations in certain layers with zeros, using a different random choice for every training set entry. This means that, given a particular layer, for the first training set entry you randomly choose half of its neurons and replace their computed values with zeros before passing them on to the next layer. For the second training set entry you randomly choose a different set of neurons and repeat, as illustrated below.
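Here is a rough numpy illustration of that idea; the activation values are made up, and the point is that each training entry gets its own random mask:

import numpy as np

activations = np.array([0.7, 1.2, 0.3, 0.9])   # made-up outputs of one layer for one training entry

# First training entry: zero out a random half of the neurons (on average).
mask_1 = np.random.binomial(n=1, p=0.5, size=activations.shape)
dropped_1 = activations * mask_1   # e.g. [0.7, 0.0, 0.3, 0.0]

# Second training entry: a different random half gets zeroed.
mask_2 = np.random.binomial(n=1, p=0.5, size=activations.shape)
dropped_2 = activations * mask_2   # e.g. [0.0, 1.2, 0.3, 0.9]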

There are several hypotheses for why this should work:
  • Dropout is a form of regularisation which prevents neurons from overly relying on a small set of previous neurons. An overly important single neuron would mean that the output of the network is relying on a single piece of evidence rather than looking at all the evidence and deciding from there. That single piece of evidence that the neuron learned to detect might be consistently present in the training set data but it might not be in other data which results in overfitting. With dropout no single neuron is guaranteed to be there when using the network on a training set entry so the network learns to avoid relying on a single neuron.
  • Dropout forces each neuron to learn a function that is independently interpretable. This means that a single neuron learns to detect the presence of, say, stripes, whilst another neuron learns to detect the presence of fur, and so on. Without dropout the network will tend to have multiple neurons that detect a single thing as a co-adapted group, which means that all of the neurons have to be activated when used on new data, making them less reliable. With dropout no pair of neurons is guaranteed to be there together when using the network on a training set entry so the neurons learn to avoid relying on each other.
    • If you're thinking that this contradicts the first point, keep in mind that these independent functions will each be learned by several neurons, since no single neuron is guaranteed to be there. This means that several versions of the same function might be learned, adding to the function's reliability.
  • Dropout adds noise to the activations, which forces the network to be able to handle incomplete input data. Since the activations get corrupted by dropout, the network has to learn to deal with the error, which makes it more reliable.
  • As a continuation of the previous point, the corruption is like new training data. This is an example of an artificial expansion of the training set, which is used explicitly when training with images by, for example, cropping the images, adding noise pixels, blurring, and so on.
  • Apart from expanding the training set, dropout also simulates an ensemble of networks. An ensemble of networks is when you train several neural networks on the same training set and then use them all together on new data. The networks will likely give different outputs as they will not all learn the same thing, but the outputs of all the networks together are used as votes to determine what the majority of the networks decided the output should be. Dropout simulates this by creating a new neural network for each training set entry, which is done by using different neurons for each entry. These networks will have some shared weights, but this is good as it means that they occupy less memory and will give their output faster (since you're computing just one network rather than many smaller ones). In fact the output of the whole network will be the average of all the simulated smaller ones (see the sketch after this list).
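As a rough numerical check of that last point, here is a small numpy sketch (the sizes and values are made up) showing that, for a single linear layer, averaging the next layer's input over many randomly thinned versions of the layer approaches what the full layer computes; with non-linear activations in between, the two are only approximately equal:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)            # made-up activations of one layer
W = rng.normal(size=(5, 10))       # made-up weights of the next layer
keep_prob = 0.5

# Average the next layer's input over many randomly "thinned" networks
# (using the inverted dropout scaling described further down).
samples = [W @ (x * rng.binomial(n=1, p=keep_prob, size=x.shape) / keep_prob)
           for _ in range(100000)]
ensemble_average = np.mean(samples, axis=0)

full_network = W @ x               # the full layer, as used at test time
print(np.max(np.abs(ensemble_average - full_network)))   # prints a small number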

Which of these hypotheses is true will probably depend on what the network is learning, and there are probably many more possible reasons why dropout works.

Moving on, how do you actually implement dropout? What we do is create a vector mask of ones and zeros, with half of the positions (chosen uniformly at random) set to one, and multiply it element-wise by the vector of layer activations we want to apply dropout to. The resulting vector is what gets passed as input to the next layer. The problem is that we want to use all the neurons at test time (when we have finished training). If we do that, the neurons of the next layer will receive inputs that are twice as large as during training, since their weights adapted to only half the neurons being present, so we would have to halve the activations at test time. This is not a very neat solution, as we would like to avoid any extra processing at test time. Instead, we use "inverted dropout", where we double the activations of the neurons that are not dropped at training time. The weights then adapt to inputs that are twice as large, which matches what they will see at test time when twice the number of neurons are present at their normal activation amounts. You can find this explanation in these lecture notes on page 44.

Here is some Python code using numpy:
import numpy as np

def apply_dropout(layer, dropout_prob=0.5):
    # Each neuron is kept with probability 1 - dropout_prob and zeroed otherwise.
    mask = np.random.binomial(n=1, p=1-dropout_prob, size=layer.shape)
    dropped_layer = layer * mask
    # Inverted dropout: scale up the surviving activations at training time.
    rescaled_dropped_layer = dropped_layer/(1-dropout_prob)
    return rescaled_dropped_layer

Just use the output of this function as input to the next layer. At test time you can either avoid using it altogether or use a dropout_prob of 0.0, which will have no effect. For an implementation in Theano, you can see how it is applied in the function _dropout_from_layer in this source code.
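For example, a training-time forward pass for a small two-layer network might use it like this, reusing the apply_dropout function above (the layer sizes, weights and sigmoid activation are made-up placeholders for whatever your network uses):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up weights for a toy network: 784 inputs -> 100 hidden -> 10 outputs.
W1 = np.random.randn(100, 784) * 0.01
W2 = np.random.randn(10, 100) * 0.01

def forward(x, training=True):
    hidden = sigmoid(W1 @ x)
    if training:
        hidden = apply_dropout(hidden, dropout_prob=0.5)   # dropout only during training
    return W2 @ hidden

# Training time: forward(x, training=True); test time: forward(x, training=False).
output = forward(np.random.randn(784), training=True)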

As an extra piece of information, when using recurrent neural networks you'll also want to use dropout; however, be careful, as the paper titled Recurrent Neural Network Regularization explains that you shouldn't apply dropout to the recurrent connections but only to the feedforward ones (that is, before or after the recurrent layer), as sketched below.
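To make that concrete, here is a rough sketch of a single recurrent layer that reuses the apply_dropout function above (the weight names and the tanh step are made up, not the paper's exact setup): dropout is applied to the input and output connections but never to the hidden-to-hidden connection:

import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, training=True, dropout_prob=0.5):
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x in xs:
        # Feedforward (input) connection: dropout is fine here.
        x_in = apply_dropout(x, dropout_prob) if training else x
        # Recurrent (hidden-to-hidden) connection: no dropout applied to h.
        h = np.tanh(W_xh @ x_in + W_hh @ h)
        # Feedforward (output) connection: dropout is fine here too.
        h_out = apply_dropout(h, dropout_prob) if training else h
        outputs.append(W_hy @ h_out)
    return outputs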