A Neural Network in Python, Part 2: activation functions, bias, SGD, etc.

This is Part 2 of A Neural Network in Python, which was a very simple neural network to learn the XOR function. This part builds on that example to demonstrate more activation functions, learning a simple math function, adding a bias, improvements to the initial random weights, stochastic gradient descent, a mean square error loss function, and graphical visualisation. Phew! I won’t go into much theory, if any, but I will provide links to resources where you can find out more. What this program does is give you an example you can tinker with, to see what effect those various improvements have. In particular, it’s a lot faster!

Main variables:

Wh & Wz are the weight matrices, of dimension previous layer size * next layer size.
X is the input vector of evenly spaced values from which to compute Y.
Y is the vector of corresponding target values, Y = f(X).
H is the vector of hidden layer values, H = activate(X.Wh).
Z is the vector of learned values for f(X), Z = activate(H.Wz).
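To make those dimensions concrete, here is a small shape check. It is only a sketch: the sample count of 4 and the all-ones and all-zeros arrays are placeholders I chose for illustration, while the layer sizes (2, 5, 1) match the program below.

```python
import numpy as np

# Layer sizes as in the program below: 2 inputs (bias + x), 5 hidden, 1 output.
# n_samples = 4 is an arbitrary illustrative choice.
n_samples, n_in, n_hidden, n_out = 4, 2, 5, 1

X = np.ones((n_samples, n_in))       # one row per sample: [bias, x]
Wh = np.zeros((n_in, n_hidden))      # previous layer size * next layer size
Wz = np.zeros((n_hidden, n_out))

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

H = sigmoid(X @ Wh)                  # hidden layer: one row per sample
Z = sigmoid(H @ Wz)                  # output layer: one value per sample
print(H.shape, Z.shape)
```

Each matrix product consumes the "previous layer" dimension and leaves the "next layer" dimension, which is why the weight shapes have to chain together.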

#   A Very Simple Neural Network in Python 3 with Numpy, Part 2
#   Alan Richmond @ Python3.codes
import numpy as np
import matplotlib.pyplot as plt
import math, time

epochs = 3000
batchSize = 4
activation = 'sigmoid'
#activation = 'tanh'
#activation = 'ReLU'

def f(x): return np.sin(x)

minx, maxx = 0, 6.28
miny, maxy = -1, 1
numx = int(maxx * 5 + 1)
inputLayerSize, hiddenLayerSize, outputLayerSize = 2, 5, 1

funcs = {'sigmoid':  (lambda x: 1/(1 + np.exp(-x)),
                      lambda x: x * (1 - x),  (0,  1), .45),
            'tanh':  (lambda x: np.tanh(x),
                      lambda x: 1 - x**2,     (0, -1), 0.005),
            'ReLU':  (lambda x: x * (x > 0),
                      lambda x: x > 0,        (0, maxx), 0.0005),
        }
(activate, activatePrime, (mina, maxa), L) = funcs[activation]

X = x = np.linspace(minx, maxx, num=numx)
X.shape = (numx, 1)
Y = y = f(X)
Y = (Y - miny)*(maxa - mina)/(maxy - miny) + mina   # normalise into activation

# add a bias unit to the input layer
X = np.concatenate((np.atleast_2d(np.ones(X.shape[0])).T, X), axis=1)

# Random initial weights
r0 = math.sqrt(2.0/(inputLayerSize))
r1 = math.sqrt(2.0/(hiddenLayerSize))
Wh = np.random.uniform(size=(inputLayerSize, hiddenLayerSize),low=-r0,high=r0)
Wz = np.random.uniform(size=(hiddenLayerSize,outputLayerSize),low=-r1,high=r1)

def next_batch(X, Y):
    for i in np.arange(0, X.shape[0], batchSize):
        yield (X[i:i + batchSize], Y[i:i + batchSize])

start = time.time()
lossHistory = []

for i in range(epochs):         # Training:
    epochLoss = []

    for (Xb, Yb) in next_batch(X, Y):

        H = activate(np.dot(Xb, Wh))            # hidden layer results
        Z = activate(np.dot(H,  Wz))            # output layer results
        E = Yb - Z                              # how much we missed (error)
        epochLoss.append(np.sum(E**2))

        dZ = E * activatePrime(Z)               # delta Z
        dH = dZ.dot(Wz.T) * activatePrime(H)    # delta H
        Wz += H.T.dot(dZ) * L                   # update output layer weights
        Wh += Xb.T.dot(dH) * L                  # update hidden layer weights

    mse = np.average(epochLoss)
    lossHistory.append(mse)

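# Test on inputs shifted by half a step, so they differ from the training inputs
X[:, 1] += maxx/(numx-1)/2
H = activate(np.dot(X, Wh))
Z = activate(np.dot(H, Wz))
# de-normalise the outputs back into the range of f
Z = ((miny - maxy) * Z - maxa * miny + maxy * mina)/(mina - maxa)
Y = y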

end = time.time()

plt.figure(figsize=(12, 9))
plt.subplot(311)
plt.plot(lossHistory)
plt.subplot(312)
plt.plot(H, '-*')
plt.subplot(313)
plt.plot(x, Y, 'ro')    # training data
plt.plot(X[:, 1], Z, 'bo')   # learned
plt.show()

print('[', inputLayerSize, hiddenLayerSize, outputLayerSize, ']',
      'Activation:', activation, 'Iterations:', epochs,
      'Learning rate:', L, 'Final loss:', mse, 'Time:', end - start)

Walkthrough

We import some libraries: numpy, pyplot, math, time.
Set the hyperparameters, including choice of activation function.
Instead of the XOR in Part 1, we’re going to learn the sine function.
Set min and max values and number of x points for graph plotting etc.
Define the activation functions (logistic sigmoid, tanh, ReLU), their derivatives and their parameters. mina and maxa are used for normalising. They are stored in a dictionary for convenience, using lambda expressions.
Create evenly spaced values for X and x. X is going to get an extra value for bias, so the copy x is for the graph plotting. Similarly for Y = y = f(X).
Y is normalised to match the activation function’s output range.
Add a bias unit to X.
The weights are randomly initialised according to best practice recommendations.
We’re not going to update the weights from all the input data at once; rather, we’ll split the data into batches and update the weights after each batch, and this will be much faster.
Start the timer and prepare an empty loss history.
Training: on each epoch, grab each batch of training data in turn and process it just as in Part 1, except for accumulating some loss data, and applying the learning factor L to the weight updates.
We re-use the X vector to test the results, except we shift it along by half a step so as to not test the same values that were used for training, otherwise the results could have simply been memorised. The notation X[:,1] selects the second column, avoiding the first column which has the bias in it.
Apply the learned weights to the shifted test data (same as the forward propagation) and de-normalise the results.
Plot graphs and print some stats. The first graph shows how the error decreased over time. The second shows the values in the hidden layer, giving some insight into how the final output is calculated (as a weighted sum of the hidden layer values, passed through the activation function). The final graph shows the target function (red dots) and the learned function (blue dots). If you run it several times, you’ll notice that the learned function often ends up quite close to the target function, but may sometimes drift away, and sometimes even ‘give up’ trying to match the target!
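The normalise and de-normalise steps in the walkthrough are inverses of each other, which is easy to check in isolation. This is a minimal sketch using the sigmoid's range (0, 1); the function names normalise and denormalise are my own labels for illustration, and the two expressions are copied from the program.

```python
import numpy as np

miny, maxy = -1, 1      # range of the target function, sin(x)
mina, maxa = 0, 1       # output range of the sigmoid

def normalise(y):
    # map [miny, maxy] onto [mina, maxa], as done for Y before training
    return (y - miny) * (maxa - mina) / (maxy - miny) + mina

def denormalise(z):
    # inverse mapping, as applied to Z after training
    return ((miny - maxy) * z - maxa * miny + maxy * mina) / (mina - maxa)

y = np.sin(np.linspace(0, 6.28, 5))
y_norm = normalise(y)
y_back = denormalise(y_norm)
print(np.allclose(y_back, y))
```

If you change the target function in the experiments, this is why miny and maxy have to change too: targets outside the activation's output range can never be matched.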

Experiments

Try changing the number of epochs and the batch size.
Try selecting different activation functions by commenting out or uncommenting.
Try different functions. Don’t forget to change the min and max values on the following lines! Try extending the function domain further left or right.
Try changing the hidden layer size. The other 2 layers need to stay fixed.
Try varying the learning rate L – the last value in the funcs dictionary.
Try removing the bias unit (comment out the ‘concatenate’ instruction). Don’t forget to reduce the input layer size to 1.
Is there a better way to initialise the weights?
Keep an eye on the graphs and printed stats. Try to minimise the final error and the time taken.
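One thing worth knowing before you tinker with the funcs dictionary: the derivative lambdas there take the activation's output (H or Z), not its input. For the sigmoid that is x * (1 - x), and for tanh it is 1 - x**2. The sketch below compares both against a finite-difference estimate; the step size h and the test points are arbitrary choices of mine.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

a = np.linspace(-3, 3, 7)    # pre-activation values (inputs to the activation)
h = 1e-6                     # step for the finite-difference estimate

# Sigmoid: derivative written in terms of the output s = sigmoid(a)
s = sigmoid(a)
numeric_sigmoid = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
sigmoid_ok = np.allclose(s * (1 - s), numeric_sigmoid, atol=1e-6)

# tanh: derivative written in terms of the output t = tanh(a)
t = np.tanh(a)
numeric_tanh = (np.tanh(a + h) - np.tanh(a - h)) / (2 * h)
tanh_ok = np.allclose(1 - t**2, numeric_tanh, atol=1e-6)

print(sigmoid_ok, tanh_ok)
```

So if you add an activation function of your own, its derivative entry should follow the same convention, or the backpropagation lines will compute the wrong deltas.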

Bias Nodes

If a neural network does not have a bias node in a given layer, then whenever the feature values are 0 it cannot produce output in the next layer that differs from 0 on the linear scale (or, after the activation function, from the value that 0 is transformed to). See Why are bias nodes used in neural networks?
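Here is a small demonstration of that, assuming a sigmoid activation and random weights (the seed and the layer width of 3 are arbitrary). With an input of 0 and no bias, every unit outputs sigmoid(0) = 0.5 whatever the weights are; with a bias column of ones the weights can move the output.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(0)
x = np.array([[0.0]])                        # a single sample whose feature is 0

# Without a bias: 0 times any weight is 0, so the output is stuck at sigmoid(0)
W = rng.uniform(-1, 1, size=(1, 3))
no_bias = sigmoid(x @ W)

# With a bias column of ones, added the same way as in the program above
xb = np.concatenate((np.ones((1, 1)), x), axis=1)
Wb = rng.uniform(-1, 1, size=(2, 3))
with_bias = sigmoid(xb @ Wb)

print(no_bias, with_bias)
```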

Activation Functions

Activation functions are generally used to provide non-linearity. Without that, adding layers adds nothing that couldn’t be done with just one layer, because applying one linear function after another still gives a linear function. So, for example, the XOR function we saw in the last part couldn’t be learned. See What is the role of the activation function in a neural network? (Quora)
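You can see the collapse numerically. In this sketch (random matrices, arbitrary sizes and seed, all my own choices) two layers with no activation give exactly the same outputs as a single layer whose weights are the product of the two matrices, while putting tanh in between breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))      # 6 samples, 2 features
W1 = rng.normal(size=(2, 5))     # first "layer"
W2 = rng.normal(size=(5, 1))     # second "layer"

two_layers = (X @ W1) @ W2       # two linear layers, no activation
one_layer = X @ (W1 @ W2)        # one linear layer with merged weights
collapsed = np.allclose(two_layers, one_layer)

nonlinear = np.tanh(X @ W1) @ W2 # same weights, tanh on the hidden layer
still_same = np.allclose(nonlinear, one_layer)

print(collapsed, still_same)
```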

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a simple modification to the standard gradient descent algorithm that computes the gradient and updates our weight matrix W on small batches of training data, rather than the entire training set itself. Computing the cost and gradient for the entire training set can be very slow. Also, batch optimization methods don’t give an easy way to incorporate new data in an ‘online’ setting. SGD addresses both of these issues by following the negative gradient of the objective after seeing only a single or a few training examples. See Stochastic Gradient Descent (SGD) with Python and Optimization: Stochastic Gradient Descent.
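To see the idea without the neural network around it, here is a minimal sketch of mini-batch SGD fitting the straight line y = 1 + 3x. Everything in it is illustrative: the helper shuffled_batches, the learning rate and the epoch count are my own choices, and unlike next_batch in the program above it shuffles the data on every epoch, which is a common variation.

```python
import numpy as np

def shuffled_batches(X, Y, batch_size, rng):
    """Yield mini-batches of (X, Y) in a fresh random order."""
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[idx]

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20).reshape(-1, 1)
X = np.concatenate((np.ones_like(x), x), axis=1)   # bias column + input
Y = 1 + 3 * x                                      # target line: y = 1 + 3x
W = np.zeros((2, 1))                               # [intercept, slope]
lr = 0.5                                           # learning rate

for epoch in range(300):
    for Xb, Yb in shuffled_batches(X, Y, 4, rng):
        E = Yb - Xb @ W                    # error on this batch only
        W += lr * Xb.T @ E / len(Xb)       # step along the negative gradient

print(W.ravel())                           # should end up close to [1, 3]
```

The weights are updated five times per epoch here (20 samples in batches of 4) instead of once, which is where the speed-up comes from.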

Initial Weights

The initial weights need to be different from each other in order for the learning process to gain traction, but they should not be too far from zero, because then the gradient will be very shallow (think of the sigmoid curve, far from the origin) and learning will be very slow. Research has found that small random weights drawn from a normal distribution with standard deviation proportional to sqrt(2/fan-in) work well, where fan-in is the number of inputs feeding into a unit. The program above uses a uniform distribution between -sqrt(2/fan-in) and +sqrt(2/fan-in) instead. See What are good initial weights in a neural network?
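Both points can be illustrated in a few lines. This is a sketch, not a recommendation: it compares the sigmoid's slope at 0 and at 10, then draws weights for the hidden-to-output layer (fan-in 5) both from a normal distribution with standard deviation sqrt(2/fan-in) and from the uniform range the program uses. The names scale, W_normal and W_uniform are mine.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def sigmoid_slope(a):
    s = sigmoid(a)
    return s * (1 - s)

slope_near_zero = sigmoid_slope(0.0)     # 0.25, the steepest point of the curve
slope_far_out = sigmoid_slope(10.0)      # about 4.5e-5, almost flat

rng = np.random.default_rng(0)
fan_in, fan_out = 5, 1                   # hidden layer -> output layer
scale = np.sqrt(2.0 / fan_in)

# Normal distribution with standard deviation sqrt(2/fan-in)
W_normal = rng.normal(0.0, scale, size=(fan_in, fan_out))
# Uniform distribution over [-sqrt(2/fan-in), +sqrt(2/fan-in)], as in the program
W_uniform = rng.uniform(-scale, scale, size=(fan_in, fan_out))

print(slope_near_zero, slope_far_out)
```

A weight that pushes a unit out to where the slope is 4.5e-5 gets weight updates thousands of times smaller than one near the origin, which is the "very slow learning" described above.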