Understanding Data Parallelism in Machine Learning

In this post, I’ll show how to modify the spiral dataset example presented in Karpathy’s post (http://cs231n.github.io/neural-networks-case-study/) to run in a data-parallel mode. Data parallelism is a popular technique for speeding up training when a mini-batch is too large to fit on a single GPU. Under data parallelism, a mini-batch is split into smaller batches, each small enough to fit in the memory of one of the GPUs on the network. Each GPU holds an identical copy of the network parameters and runs the forward and backward pass on its share of the data. At the end of the backward pass, each GPU sends its computed gradients to a parameter server. The parameter server aggregates the gradients and computes the updates to the network parameters using some variant of Stochastic Gradient Descent. The updated parameters are then sent back to each GPU and the process repeats for a fresh mini-batch. This process is shown in the diagram below.
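The key property that makes this work is that averaging the per-GPU gradients reproduces the gradient of the full mini-batch. Here is a minimal sketch of one data-parallel step using a toy least-squares model (the model, batch sizes and learning rate are illustrative, not part of the spiral example):

```python
import numpy as np

# Toy linear least-squares model: loss = 0.5 * ||X @ w - y||^2 / N,
# whose mean gradient over a batch is X.T @ (X @ w - y) / N.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))   # full mini-batch of 8 examples
y = rng.standard_normal(8)
w = rng.standard_normal(3)

def grad(Xb, yb, w):
    """Mean gradient of the least-squares loss over a batch."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient, as a single large-memory GPU would compute it.
full_grad = grad(X, y, w)

# Data parallelism: split the mini-batch into two equal shards (one per
# "GPU"), compute each shard's gradient, average at the parameter server.
shards = [(X[:4], y[:4]), (X[4:], y[4:])]
avg_grad = np.mean([grad(Xb, yb, w) for Xb, yb in shards], axis=0)

# Averaging per-shard gradients reproduces the full-batch gradient.
assert np.allclose(avg_grad, full_grad)

# Parameter server applies a plain SGD step and broadcasts w back.
lr = 0.1
w = w - lr * avg_grad
```

The assertion holds because the shards are equal-sized; with unequal shards the parameter server would need a weighted average instead.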


Using data parallelism and appropriately adjusting the learning rate, a large number of GPUs can be used to process extremely large mini-batches (~8000 images) in parallel. For example, Goyal et al. (https://arxiv.org/pdf/1706.02677.pdf) show that with a mini-batch size of 8192 spread over 256 GPUs, ResNet-50 can be trained on ImageNet in 1 hour while maintaining the same level of accuracy as the 256-image mini-batch baseline.

I decided to use the spiral dataset example because the neural network used to classify this dataset, while simple, still retains the essential elements of neural networks – two weight layers, a ReLU activation layer and a softmax classifier. The example also demonstrates the effectiveness of neural networks in handling highly non-linear data. This ability arises from the ReLU non-linearity in the activation layer; as shown here, removing the non-linearity causes the classification accuracy to drop by almost half. I won’t go into the details of the network architecture, as Karpathy’s post referenced above does an excellent job of describing the problem, the network architecture, the calculation of gradients and the details of the forward and backward steps. A movie showing how the decision boundaries adjust to the contours of the data as training progresses is shown below.
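For reference, the forward pass of that network can be sketched in a few lines. The dimensions follow Karpathy's example (2-D inputs, 100 hidden units, 3 classes), but the random inputs and labels below are stand-ins for the spiral data:

```python
import numpy as np

# Forward pass of the spiral-example network: one hidden layer with a
# ReLU non-linearity, followed by a softmax classifier.
rng = np.random.default_rng(0)
N, D, H, K = 300, 2, 100, 3          # examples, input dim, hidden units, classes
X = rng.standard_normal((N, D))      # stand-in for the spiral points
y = rng.integers(0, K, size=N)       # stand-in for the class labels

W1 = 0.01 * rng.standard_normal((D, H)); b1 = np.zeros(H)
W2 = 0.01 * rng.standard_normal((H, K)); b2 = np.zeros(K)

hidden = np.maximum(0, X @ W1 + b1)  # first weight layer + ReLU
scores = hidden @ W2 + b2            # second weight layer: class scores

# Softmax (shifted for numerical stability) and mean cross-entropy loss.
exp = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)
loss = -np.mean(np.log(probs[np.arange(N), y]))
print(round(loss, 3))  # close to ln(3) ~ 1.099 for untrained weights
```

With small random weights, every class gets probability near 1/3, so the initial loss is close to ln(3) – a useful sanity check before training starts.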

This post makes the following contributions:

  1. It shows how the training data can be split into two batches so that the forward and backward steps for each batch can be executed on two “nodes” (a node stands in for a GPU in a real networked implementation).
  2. It shows the implementation of a “parameter server” that aggregates the gradients from the nodes and, once gradients from all nodes have been received, computes the updates to the network parameters using a simplified stochastic gradient descent (constant learning rate, no momentum).
  3. The major data-transfer operations (sending batch data to each node, gradients from the nodes to the parameter server, and updated weights from the parameter server back to each node) are split into separate steps, making it easy to see exactly what and how much data is sent in each step.
  4. Python threading APIs are used to execute each node on a separate thread. Since the amount of computation in this toy example is relatively small, threading doesn’t result in a significant speed-up on its own. However, when each node is artificially slowed down by inserting a sleep, the multi-threaded implementation takes half the time of the non-threaded implementation.
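The threading effect in point 4 can be demonstrated in isolation. In this sketch each node's forward/backward pass is replaced by a sleep (the delay and node count are illustrative); with two threads the sleeps overlap, so the threaded version takes roughly half the wall-clock time:

```python
import threading
import time

def node_step(delay):
    """Stand-in for a node's forward/backward pass, slowed by a sleep."""
    time.sleep(delay)

delay, n_nodes = 0.2, 2

# Sequential: nodes run one after another on the main thread.
t0 = time.perf_counter()
for _ in range(n_nodes):
    node_step(delay)
sequential = time.perf_counter() - t0

# Threaded: each node runs on its own thread, so the sleeps overlap.
t0 = time.perf_counter()
threads = [threading.Thread(target=node_step, args=(delay,))
           for _ in range(n_nodes)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

print(f"sequential {sequential:.2f}s, threaded {threaded:.2f}s")
```

Note that this works because `time.sleep` (like real network and GPU transfers) releases Python's GIL; CPU-bound NumPy-light code would not overlap this way, which is why the unslowed toy example sees little speed-up.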

Without Multi-Threading

With Multi-Threading

The Code

I have written extensive comments in the code, so it should be easy to follow. Where relevant, I have also noted more advanced techniques used in real systems that would have been overkill here.
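As a condensed sketch of what the full code does, the snippet below generates the spiral data, splits each full batch across two “nodes”, and has a “parameter server” average the gradients and apply plain SGD. The hyperparameters follow Karpathy's example; the single-threaded loop and two-node split are simplifications of the full implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Spiral dataset, following the structure of Karpathy's example.
N, D, K = 100, 2, 3                       # points per class, input dim, classes
X = np.zeros((N * K, D)); y = np.zeros(N * K, dtype=int)
for k in range(K):
    ix = range(N * k, N * (k + 1))
    r = np.linspace(0.0, 1, N)                                   # radius
    t = np.linspace(k * 4, (k + 1) * 4, N) + rng.standard_normal(N) * 0.2
    X[ix] = np.column_stack([r * np.sin(t), r * np.cos(t)])
    y[ix] = k
perm = rng.permutation(len(X)); X, y = X[perm], y[perm]          # shuffle

H, lr, reg = 100, 1.0, 1e-3               # hidden units, learning rate, L2
params = {
    "W1": 0.01 * rng.standard_normal((D, H)), "b1": np.zeros(H),
    "W2": 0.01 * rng.standard_normal((H, K)), "b2": np.zeros(K),
}

def node_grads(Xb, yb, p):
    """One node's forward + backward pass on its shard: loss and gradients."""
    hidden = np.maximum(0, Xb @ p["W1"] + p["b1"])
    scores = hidden @ p["W2"] + p["b2"]
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    n = len(yb)
    loss = -np.mean(np.log(probs[np.arange(n), yb]))  # data loss only
    dscores = probs.copy()
    dscores[np.arange(n), yb] -= 1
    dscores /= n
    dhidden = dscores @ p["W2"].T
    dhidden[hidden <= 0] = 0                          # backprop through ReLU
    return loss, {
        "W2": hidden.T @ dscores + reg * p["W2"], "b2": dscores.sum(0),
        "W1": Xb.T @ dhidden + reg * p["W1"], "b1": dhidden.sum(0),
    }

# Split the full batch into two shards, one per node.
half = len(X) // 2
shards = [(X[:half], y[:half]), (X[half:], y[half:])]

losses = []
for step in range(200):
    results = [node_grads(Xb, yb, params) for Xb, yb in shards]
    losses.append(np.mean([loss for loss, _ in results]))
    # Parameter server: average the nodes' gradients, apply one SGD step,
    # then "broadcast" the updated parameters back (shared dict here).
    for name in params:
        params[name] -= lr * np.mean([g[name] for _, g in results], axis=0)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the two shards are equal-sized, the averaged gradient matches what a single node would compute on the full batch, so training behaves exactly like the non-parallel version.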
