Calculating the Derivative of the Error with respect to Convolution Filter Weights in a CNN

Over the last month, I have been exploring the world of deep learning. Deep learning based algorithms have been around for a long time, but have recently shot into prominence after many false starts. Researchers are able to apply these algorithms to solve many previously intractable problems in a range of areas such as image processing and natural language processing to achieve human-like performance. Much of this progress is a result of vast quantities of data now available to train these networks and the realization that graphical processing units (GPU) can be used to parallelize much of the computation involved in training deep networks.

Given the ability of deep learning networks to automatically deduce the structure of complex problems and the huge investment of money and brainpower being poured into the field by the tech industry, algorithms powered by deep learning will play an ever larger role in our lives. Thus, some knowledge of how deep learning works will become vital to people in a range of industries. Depending on your level of technical sophistication (and pain tolerance 🙂 ) there are many levels at which you can learn about deep learning. A non-technical person could benefit from understanding how performance of deep learning systems is measured, difference between rules based, supervised and un-supervised learning, how deep learning systems are trained and so on. A more technical person with some programming knowledge could install and play with one of the many deep learning libraries such as Caffe, Torch, TensorFlow, CNTK etc., that are available for free. If you are like me, you are not satisfied with understanding at a middling level. You want to understand how things *really* work. Fortunately, as I have discovered over the last month (and perhaps contrary to what you may have been led to believe), much of the mathematics that powers deep learning systems (particularly multilayered neural networks) is not that complex and within reach of someone with a college level knowledge of calculus and familiarity with matrix algebra and non-linear optimization techniques such as gradient descent. Understanding this math and implementing some model networks yourself will give you valuable insights on how neural networks work and the subtleties involved with training these networks and setting various hyper parameters.

At the most fundamental level, training a neural network is akin to solving a non-linear optimization problem where gradient descent is used to optimize parameters to minimize an objective function. I won’t go into details of how this is set up, as there are a number of excellent tutorials available on the web. I’ll point you to a few that I found particularly helpful.

  • CNN’s for Visual Recognition (http://cs231n.github.io/) is excellent. Module 1 is particularly helpful. Read all parts of module 1, and make sure you follow how the minimal 2D toy data example works (I wrote Matlab code corresponding to the provided python code for the minimal 2D toy example during my own learning and will attach it to this post at some point).
  • For Matlab implementation of the major deep learning algorithms (Convolutional Neural Networks, Auto Encoders, Deep Belief Networks etc), see https://github.com/rasmusbergpalm/DeepLearnToolbox. I can’t thank the author of this tool-box enough. Going through the code and following how neural networks are constructed, the final error is computed during the forward pass and how various gradients are calculated and propagated during the backwards pass was critical to get a solid understanding of deep learning systems.

Training of neural networks proceeds by calculating the derivative of the output error wrt (with respect to) the input to a layer and the weights of that layer and then propagating the  derivatives back through the preceding layers. The functional form of different operations applied in a layer (matrix multiply, sigmoid, pooling etc) are typically chosen so that calculating the derivatives is simple and efficient. The convolutional layers of a CNN are bit of an exception. There are many subtleties associated with how the derivatives wrt convolution filter weights are calculated and applied during gradient descent. The purpose of this post is to demystify how these derivatives are calculated and used. I’ll divide the post in two parts. In the first part, we’ll calculate the derivative of the output error wrt the weights analytically and compare the results with the gradient computed numerically. It will be gratifying to see that the two agree precisely. In the second post, we’ll consider a more realistic scenario where our input and output will consist of a batch of multi-dimensional planes and we’ll use gradient descent to calculate the optimal weights to minimize the error in the output for different values of a learning parameter. This process closely follows the calculation performed in a single convolutional layer of a real CNN.

The figures below demonstrate how the convolutional layer is structured and the notation used in the rest of this post. For clarity, it is good to keep track of the dimensions of various matrices during the forward and backward pass. To help the reader along, I specify the dimensions wherever applicable.

 

Note that I have stated without proof the rather remarkable result that the derivative of the error wrt weights of the convolution filter is the convolution of the flipped input and the output. For a detailed proof, please refer to:

http://jefkine.com/general/2016/09/05/backpropagation-in-convolutional-neural-networks/

This is an important result as the convolution operation can be efficiently implemented on GPUs. One could argue that if the derivative of the error had turned out to be some complicated expression that couldn’t be efficiently parallelized, many applications of deep learning in vision related applications may not be possible.

Part 1

We’ll first consider a simplified example where the number of planes in the input and output is 1 and there is a single data sample in our batch. We’ll use the term “image” interchangeably with a data sample (underscoring the fact that the biggest application of CNNs is in image/vision processing). We’ll calculate the derivative of the error wrt filter weights using the analytical expression shown above and compare it with the derivative computed numerically. Doing so will clarify many of the subtleties involved with CNNs related to zero padding, output image dimensions, different convolution “modes” (full, valid) etc. The procedure and the matlab code is shown below. Note that we have dropped the input and output indicies (i, j) as the number of input and output planes is 1.

Part 2

Let’s now consider a more realistic scenario where we have a number of input and output planes and more than one data sample in our batch. Our initial estimate of the weights will be the true weight values with some added noise. We’ll run a gradient descent optimization to see if we can arrive at the correct weight values using  the derivative  of the output error wrt the weights calculated using the analytical expression and updating the weights in each iteration. We’ll apply a scale factor to the weight gradient before using it to update the weights. This scale factor will serve as our learning parameter. We’ll record the norm of the difference between the current weight values and the true values and plot the progression of this norm for different values of the learning parameter.

The Matlab code and the weight error plot is shown below.

 

Plot of the norm of the difference between the current estimate and the true values of filter weights. Plots are shown for three values of the “learning parameter” (the step size used in the weight update step of the gradient descent)

As we can see, the gradient descent is able to drive the norm of the weight error to nearly 0. As expected, the rate of convergence is higher for larger values of the learning rate. I haven’t shown it here, but too large a value of the learning rate causes the optimization to become unstable. Many methods are available to adjust the size of the learning rate during the gradient descent optimization as the change in the parameters during an iteration becomes smaller.

That’s it! Hope this post will help you better understand how gradients are calculated and used during backpropagation in a CNN.

Be the first to comment

Leave a Reply

Your email address will not be published.


*