
Final glue – loss functions and optimizers
A network that transforms input data into output is not enough to start training. We also need to define a learning objective: a function that accepts two arguments, the network's output and the desired output, and returns a single number expressing how close the network's prediction is to the desired result. This function is called the loss function, and its output is the loss value. Using the loss value, we calculate gradients of the network's parameters and adjust them to decrease the loss, which pushes our model toward better results in the future. Both of these pieces, the loss function and the method of tweaking a network's parameters by their gradients, are so common and exist in so many forms that they form a significant part of the PyTorch library. Let's start with loss functions.
Loss functions
Loss functions reside in the nn package and are implemented as nn.Module subclasses. Usually, they accept two arguments: the output from the network (the prediction) and the desired output (the ground-truth data, which is also called the label of the data sample). At the time of writing, PyTorch 0.4 contains 17 different loss functions. The most commonly used are listed below, followed by a short usage sketch:
- nn.MSELoss: The mean square error between arguments, which is the standard loss for regression problems
- nn.BCELoss and nn.BCEWithLogitsLoss: Binary cross-entropy losses. The first version expects a single probability value (usually the output of a Sigmoid layer), while the second assumes raw scores as input and applies Sigmoid itself. The second way is usually more numerically stable and efficient. These losses (as their names suggest) are frequently used in binary classification problems
- nn.CrossEntropyLoss and nn.NLLLoss: The famous "maximum likelihood" criteria used in multi-class classification problems. The first version expects raw scores for each class and applies LogSoftmax internally, while the second expects log probabilities as input
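As a quick illustration, here is a minimal sketch of calling two of these losses directly on hand-made tensors (the values below are hypothetical, chosen only to show the expected shapes):

import torch
import torch.nn as nn

# MSELoss: both arguments are tensors of the same shape
mse = nn.MSELoss()
pred = torch.tensor([2.0, 3.0])
target = torch.tensor([2.5, 2.0])
print(mse(pred, target))    # tensor(0.6250): mean of 0.5^2 and 1.0^2

# CrossEntropyLoss: raw scores of shape (batch, classes)
# and integer class indices of shape (batch,)
ce = nn.CrossEntropyLoss()
raw_scores = torch.tensor([[0.1, 2.0, 0.3]])
label = torch.tensor([1])
print(ce(raw_scores, label))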
There are other loss functions available, and you are always free to write your own Module subclass to compare output and target.
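As an illustration, here is a minimal sketch of such a custom loss written as a Module subclass; the mean absolute error below is a hypothetical example chosen only to show the pattern:

import torch
import torch.nn as nn

class MAELoss(nn.Module):
    # A hypothetical custom loss: mean absolute error between
    # prediction and target tensors of the same shape
    def forward(self, prediction, target):
        return torch.abs(prediction - target).mean()

Now let's look at the second piece of the optimization process.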
Optimizers
The responsibility of the basic optimizer is to take the gradients of model parameters and change these parameters in order to decrease the loss value. By decreasing the loss value, we push our model toward the desired outputs, which gives us hope of better model performance in the future. "Change parameters" may sound simple, but there are lots of details here, and optimizer procedures are still a hot research topic. In the torch.optim package, PyTorch provides lots of popular optimizer implementations, of which the most widely known are as follows:
- SGD: A vanilla stochastic gradient descent algorithm with an optional momentum extension
- RMSprop: An optimizer proposed by G. Hinton
- Adagrad: An adaptive gradients optimizer
All optimizers expose a unified interface, which makes it easy to experiment with different optimization methods (sometimes the optimization method really can make a difference in convergence dynamics and the final result). On construction, you need to pass an iterable of tensors, which will be modified during the optimization process. The usual practice is to pass the result of the parameters() call of the upper-level nn.Module instance, which returns an iterable of all leaf tensors with gradients.
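For example, constructing an optimizer might look like the following minimal sketch, assuming net is an nn.Module instance defined earlier (the learning rate and momentum values are hypothetical):

import torch.optim as optim

# Pass the network's parameters to the optimizer on construction;
# lr and momentum are hyperparameters chosen for the problem at hand
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)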
Now, let's discuss the common blueprint of a training loop:
for batch_samples, batch_labels in iterate_batches(data, batch_size=32):  # 1
    batch_samples_t = torch.tensor(batch_samples)   # 2
    batch_labels_t = torch.tensor(batch_labels)     # 3
    out_t = net(batch_samples_t)                    # 4
    loss_t = loss_function(out_t, batch_labels_t)   # 5
    loss_t.backward()                               # 6
    optimizer.step()                                # 7
    optimizer.zero_grad()                           # 8
Usually, you iterate over your data again and again (one iteration over the full set of examples is called an epoch). Data is usually too large to fit into CPU or GPU memory at once, so it is split into batches of equal size. Every batch includes data samples and target labels, and both of them have to be tensors (lines 2 and 3). You pass the data samples to your network (line 4) and feed its output and the target labels to the loss function (line 5). The result of the loss function shows the "badness" of the network's result relative to the target labels. As the input to the network and the network's weights are tensors, all transformations of your network are nothing more than a graph of operations with intermediate tensor instances. The same is true for the loss function: its result is also a tensor with a single loss value. Every tensor in this computation graph remembers its parents, so to calculate gradients for the whole network, all you need to do is call the backward() function on the loss function's result (line 6).
The result of this call will be the unrolling of the graph of the performed computations and the calculation of gradients for every leaf tensor with requires_grad=True. Usually, such tensors are our model's parameters, such as the weights and biases of feed-forward networks and convolution filters. Every time a gradient is calculated, it is accumulated in the tensor.grad field, so one tensor can participate in a transformation multiple times and its gradients will be properly summed up. For example, a single cell of an RNN (which stands for recurrent neural network; we'll talk about them in Chapter 12, Chatbots Training with RL) could be applied to multiple input items.
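To make this accumulation concrete, here is a minimal sketch (with hypothetical values) in which one tensor participates in the graph twice and its two gradient contributions are summed in the grad field:

import torch

v = torch.tensor([1.0, 2.0], requires_grad=True)
# v is used twice in the graph, with factors 2 and 3
s = (v * 2).sum() + (v * 3).sum()
s.backward()
print(v.grad)    # tensor([5., 5.]): contributions 2 and 3 summed up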
After the loss.backward() call is finished, we have the gradients accumulated, and now it's time for the optimizer to do its job: it takes all the gradients from the parameters we passed to it on construction and applies them. All this is done with the step() method (line 7).
The last, but not least, piece of the training loop is our responsibility to zero the gradients of the parameters. This can be done by calling zero_grad() on our network, but, for our convenience, the optimizer also exposes such a call, which does the same thing (line 8). Sometimes zero_grad() is placed at the beginning of the training loop, but it doesn't matter much.
The preceding scheme is a very flexible way to perform optimization and can fulfill the requirements of even sophisticated research. For example, you can have two optimizers tweaking the parameters of different models on the same data (and this is a real-life scenario from GAN training).
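That two-optimizer pattern might look like the following sketch, where generator and discriminator are assumed to be two separately defined nn.Module instances (the names and learning rates are hypothetical):

import torch.optim as optim

# Each optimizer owns the parameters of one model only,
# so each step() call updates just that model
gen_opt = optim.Adam(generator.parameters(), lr=1e-4)
disc_opt = optim.Adam(discriminator.parameters(), lr=1e-4)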
So, we are done with the essential functionality of PyTorch required to train NNs. This chapter ends with a practical medium-sized example demonstrating all the concepts we've learned, but before we get to it, we need to discuss one important topic that is essential for an NN practitioner: monitoring of the learning process.