The purpose of this post is to provide additional insights, along with Matlab and CNTK implementations, for the two-layer network used to classify a spiral dataset, as described in the following lecture of Stanford’s “Convolutional Neural Networks for Visual Recognition” class.
Using a neural network to classify a spiral dataset is a striking illustration of how effective NNs are at handling inherently non-linear problems. No linear classifier can do a good job on spiral data, where the boundary between two classes is a curve. A simple two-layer network with ReLU activation on the hidden layer automatically learns the decision boundaries and achieves 99% classification accuracy on such a dataset.
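For concreteness, here is a minimal numpy sketch of the forward pass of such a two-layer network; the function and variable names are illustrative only, and the actual Matlab and CNTK implementations are in the appendix:

import numpy as np

# X: N x 2 array of input points; W1: 2 x H, b1: 1 x H, W2: H x 3, b2: 1 x 3
def forward(X, W1, b1, W2, b2):
    hidden = np.maximum(0, X.dot(W1) + b1)   # ReLU on the hidden layer
    scores = hidden.dot(W2) + b2             # raw scores for the 3 classes
    return hidden, scores

def predict(X, W1, b1, W2, b2):
    _, scores = forward(X, W1, b1, W2, b2)
    return np.argmax(scores, axis=1)         # predicted class = highest-scoring column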
The problem description on the course website does a great job describing the problem, the architecture of the neural network used to solve it, and the calculations in the forward and backward passes. This post makes the following contributions:
- Shows a movie of the decision boundaries as they are learnt by the network
- Provides Matlab and CNTK implementations of the forward and backward passes (see appendix)
- Provides some additional insights related to variation of the classification accuracy with the number of hidden layer neurons, step size, and most importantly, the choice of the activation function
- Shows how to use custom loss and training functions in CNTK to achieve the same results as Matlab
Input Data
Our input data consists of 2D points (2×1 vectors) distributed along three spiral-shaped arms. Clearly, a classifier with linear decision boundaries will be inadequate for such a dataset.
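If you want to regenerate such a dataset yourself, a numpy sketch along the lines of the generator used in the course notes looks like the following; the number of points per arm (100) and the noise level (0.2) are my assumptions rather than the exact values used:

import numpy as np

N, D, K = 100, 2, 3                     # points per arm, dimensionality, number of classes
X = np.zeros((N * K, D))                # the 2D input points
y = np.zeros(N * K, dtype='uint8')      # the class label of each point
for j in range(K):
    ix = range(N * j, N * (j + 1))
    r = np.linspace(0.0, 1.0, N)                                        # radius
    t = np.linspace(j * 4, (j + 1) * 4, N) + np.random.randn(N) * 0.2   # angle, with noise
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]                         # polar to Cartesian
    y[ix] = j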
Movie Showing the Evolution of the Decision Boundaries
The movie shows the decision boundaries as they are learnt by the network. The boundaries start off linear, but the network quickly learns to bend them to fit the contours of the data. This is reflected in the very sharp drop in the loss function at the beginning of the optimization.
You may be interested in the technique used to draw the decision boundaries and make the movie. At every “key frame” iteration, I ask the network to classify every point on the image frame (whose size is chosen to span our input data) by running the forward pass of the network, and each class is then drawn in a different color. The frequency of the key frames depends on how long you want the movie to be: if you have 2000 iterations and want a 6-second movie playing at 15 FPS, you need 90 frames, which works out to roughly 1 key frame every 20 iterations. The Matlab code to make the movie is included in the zip.
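My movie-making code is in Matlab (in the zip), but the same idea is easy to sketch in Python with matplotlib. Here, predict is any function that maps an array of 2D points to class labels (for example, the predict helper sketched earlier), and the grid limits and resolution are placeholder choices:

import numpy as np
import matplotlib.pyplot as plt

def save_key_frame(predict, X, y, iteration, lim=1.5, res=300):
    # grid of points spanning the input data
    xx, yy = np.meshgrid(np.linspace(-lim, lim, res), np.linspace(-lim, lim, res))
    grid = np.c_[xx.ravel(), yy.ravel()]
    # classify every grid point by running the forward pass of the current network
    Z = predict(grid).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.4)         # decision regions, one color per class
    plt.scatter(X[:, 0], X[:, 1], c=y, s=20)   # overlay the training points
    plt.savefig('frame_%04d.png' % iteration)  # the saved frames are stitched into the movie
    plt.clf()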
Variation of Classification Accuracy and Rate of Convergence with Step Size
The plots below show the loss curves for various step sizes in the Matlab implementation and the loss curve for Step Size = 1 using a Stochastic Gradient Descent learner for the CNTK implementation.
As expected, the loss function decays more slowly for smaller step sizes. Somewhat surprisingly, the classification accuracy also differs slightly across step sizes. This could be because there are multiple local minima at the bottom of the loss-function valley. Since I’m using a constant step size throughout the optimization (instead of an adaptive technique where the step size is adjusted based on the first and second moments of the gradients), the optimization may be settling into a slightly different minimum for each step size.
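To make the distinction concrete, here is a rough numpy sketch contrasting the constant-step-size update used in this post with an Adam-style adaptive update; the Adam hyperparameters shown are the usual published defaults, not values tuned for this problem:

import numpy as np

def sgd_update(w, dw, step_size):
    # constant step size: the same scaling is applied to every parameter at every iteration
    return w - step_size * dw

def adam_update(w, dw, m, v, t, step_size=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # running estimates of the first (m) and second (v) moments of the gradient
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw ** 2
    m_hat = m / (1 - beta1 ** t)               # bias correction, t = iteration count from 1
    v_hat = v / (1 - beta2 ** t)
    return w - step_size * m_hat / (np.sqrt(v_hat) + eps), m, v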
The plots below show the evolution of the loss function for the Matlab and CNTK implementation (left plot: Matlab, right plot: CNTK) for step size = 1.
CNTK achieves a slightly lower total loss because CNTK’s loss function comprises only the softmax-with-cross-entropy component, not the L2 weight-regularization component. Later, we’ll see how to customize CNTK to use our own loss function, which adds the L2 regularization component to softmax with cross entropy. We’ll also replace CNTK’s Stochastic Gradient Descent implementation with our own.
|                              | CNTK   | Matlab |
| ---------------------------- | ------ | ------ |
| Final Value of Loss Function | 0.1155 | 0.2462 |
Changing the Number of Neurons in the Hidden Layer
Next, let’s consider the effect of changing the number of neurons in the hidden layer. Classification accuracy is reported for both the Matlab and the CNTK implementations.
| Number of Neurons in Hidden Layer | Classification Accuracy (Matlab) | Classification Accuracy (CNTK) |
| --------------------------------- | -------------------------------- | ------------------------------ |
| 100                               | 99%                              | 99%                            |
| 50                                | 98.6%                            | 98.4%                          |
| 7                                 | 94.3%                            | 97.1%                          |
| 5                                 | 73.3%                            | 93.4%                          |
| 3                                 | 56.4%                            | 65.4%                          |
One can reduce the number of neurons in the hidden layer to 7 before the classification accuracy starts to drop significantly!
Classification Accuracy for Various Activation Functions
| Activation Function     | RELU | Leaky RELU | Exponential RELU | Tanh  | Sigmoid | None (Linear Activation) |
| ----------------------- | ---- | ---------- | ---------------- | ----- | ------- | ------------------------ |
| Classification Accuracy | 98%  | 98%        | 94.7%            | 97.3% | 89%     | 49%                      |
We can see that activation functions with a sharp change at zero, such as ReLU, are the most effective. Among the smooth non-linearities, a zero-centered function such as tanh is more effective than a non-centered function such as sigmoid. Not using an activation function at all significantly degrades classification accuracy, since the network then collapses to a linear classifier.
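For reference, here is how the compared activations can be written in numpy; “Exponential RELU” refers to the exponential linear unit (ELU), and the leaky slope of 0.01 and alpha of 1 are common defaults rather than the exact values used in my experiments:

import numpy as np

def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.where(x > 0, x, 0.01 * x)
def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1))
def tanh(x):       return np.tanh(x)
def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def linear(x):     return x   # "no activation": the two layers collapse into a single linear map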
Using Custom Loss and Training Functions
CNTK provides built-in implementations of commonly used loss functions, such as softmax with cross entropy, and of learners such as Stochastic Gradient Descent, Adam, AdaGrad and some of their variants. However, some problems require a custom loss or training function. CNTK’s implementation of the softmax-with-cross-entropy loss doesn’t include the contribution from weight regularization, although the gradient of the weight regularization is included in the gradient calculation and parameter update. As mentioned earlier, this is why the final loss computed by CNTK is slightly lower than the loss computed by our Matlab implementation. One can ask CNTK to use a custom loss function as follows:
import cntk as C

# Define our custom loss function
def cross_entropy_with_softmax_plus_regularization(model, labels, l2_regularization_weight):
    w_norm = C.Constant(0)
    for p in model.parameters:
        w_norm = C.plus(w_norm, 0.5 * C.reduce_sum(C.square(p)))
    return (C.reduce_log_sum_exp(model.output)
            - C.reduce_log_sum_exp(C.times_transpose(labels, model.output))
            + l2_regularization_weight * w_norm)

# z, label, l2_regularization_weight, eval_error and learner come from the rest of the training script (see appendix)
my_loss = cross_entropy_with_softmax_plus_regularization(z, label, l2_regularization_weight)

# plug in our loss function into the trainer
trainer = C.Trainer(z, (my_loss, eval_error), [learner])
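As a sanity check on the formula: for a one-hot label vector y and score vector s, C.reduce_log_sum_exp(s) evaluates to log of the sum of exp(s_j), while C.reduce_log_sum_exp(C.times_transpose(y, s)) reduces to the score of the correct class, so their difference is exactly the softmax-with-cross-entropy loss (the negative log of the softmax probability of the correct class). The only change relative to the built-in criterion is therefore the 0.5 · l2_regularization_weight · sum of squared weights accumulated in w_norm.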
Using this loss function, the final value of the total loss is very similar for the CNTK and Matlab implementations:
|                                  | CNTK   | Matlab |
| -------------------------------- | ------ | ------ |
| Final Value of the Loss Function | 0.2540 | 0.2462 |
Similarly, you can also insert your own parameter-update logic by defining a custom learner and plugging it into the trainer, as follows:
def my_gradient_update(parameters, gradients):
    # ideally, these should be parameters.. don't know of a way to do this yet.
    learning_rate = 1
    numSamples = 300
    # we need to divide by numSamples as the gradients are not scaled by the mini-batch size
    gradient_scale_factor = learning_rate / numSamples
    update_funcs = []
    for p, g in zip(parameters, gradients):
        update_funcs.append(C.assign(p, p - gradient_scale_factor * g))
    return C.combine(update_funcs)

learner = C.universal(my_gradient_update, z.parameters)
trainer = C.Trainer(z, (my_loss, eval_error), [learner])
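Written out, the update applied to each parameter is p ← p − (learning_rate / numSamples) · g. Dividing by numSamples is needed because the gradients handed to the learner are summed, not averaged, over the minibatch; with the whole 300-point dataset presented as a single minibatch, this matches the plain constant-step-size gradient-descent update used in the Matlab implementation.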
Appendix
Matlab Code:
CNTK Code: