
# Introduction

Introduction to Neural Networks – Part 1 can be found here; and Part 2 here.

So far these posts have only looked at single-layer networks and how to train them. However, most useful networks need additional layers of nodes to be accurate. The methods discussed so far can all be used when training such networks, and the process is a natural extension of what has been done already: get the partial derivatives of the weights and biases and adjust them until the network error is minimised.

The problem is that obtaining the partial derivatives in the early layers of the network can be tricky. The solution is a technique known as backpropagation. As the name implies, it involves computing the partial derivatives in the final layer and then working back layer by layer using the chain rule.

Once backpropagation and the efficiency measures of the last post are implemented, the range of problems that can be solved opens up significantly. Aside from the example given later on, these techniques can be used to recognise handwritten digits or characters, compress large-scale data, and serve purposes such as security, credit checking and predicting stock performance [1]. Indeed, in many cases the network is surprisingly simple, needing just one or two additional layers.

# Non-Linear Classification

Consider classifying the following sets that cannot be separated linearly.

Although the two sets are entangled, it is possible to classify them by drawing around them and shading the bordering regions. For humans this is easy to do, but with neural networks some experimentation is required, and it cannot be done with a single-layer network. Using the training algorithm introduced in the first post, and stopping once the error is close to a minimum, it is possible to produce a colour map of the data set using the final values of the weights and biases. The following is the result of training a 2-1 network:

The colour mapping is based on an RGB coding where an output value of $1$ gives red, rgb(1,0,0), and $0$ gives blue, rgb(0,0,1). This involves adding the following lines to the code:

#create grid of x and y values and put them into a matrix with 2 columns
#this helps to vectorise functions
dr <- 0.05;x <- seq(min(X),max(X),dr); y <- seq(min(Y),max(Y),dr)
r <- matrix(c(rep(x,each=length(y)),rep(y,length(x))),ncol=2)

#get shade using network weights and biases
#(assuming the trained 2-1 network's weight vector w and bias b)
shade <- sigmoid(r%*%w+b)
#colour in points - as this is vectorised code it will run quickly
points(r[,1],r[,2],col=rgb(shade,0,1-shade),pch=15)


As one would expect, the areas away from the centre are classified properly, while points around the centre are often misclassified, appearing in various shades of purple. Classifying this dataset with a neural network will require adding additional layers.

Problems that can be tackled by single-layer networks often have simpler and more efficient alternatives. Neural networks become really useful when additional “hidden layers” of neurons are added between the input and output neurons to form a “deep” network. Generally, adding hidden layers can dramatically improve error rates and more accurately reflect the nature of the problem at hand, as is shown in the later non-linear classification example. Furthermore, for problems such as character identification there is often a limit to how accurately a single-layer network can perform, particularly in terms of accuracy on a separate test set after training.

The process of extending the network is quite canonical. In the example below, which shows a neural network that simulates the XOR function, the 2 input bits feed into 2 “hidden” perceptrons using the weights displayed (and with no bias), which then link up using their respective weights to give the final result. The operation $1+0 \equiv 1$ mod $2$ is illustrated, and the activation function used for each neuron is the step function.
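The figure's exact weights are not reproduced here, but the construction can be sketched in R with one standard choice of parameters (an assumption — note the thresholds below play the role of biases, whereas the diagram folds everything into its displayed weights). The first hidden neuron behaves like OR, the second like AND, and the output combines them into XOR:

```r
#step activation: fires once the weighted sum reaches zero
step_fn <- function(x) as.numeric(x >= 0)

#2-2-1 network computing XOR with one standard (assumed) set of weights
xor_net <- function(x1, x2) {
  h1 <- step_fn(1*x1 + 1*x2 - 0.5)   #acts as OR
  h2 <- step_fn(1*x1 + 1*x2 - 1.5)   #acts as AND
  step_fn(1*h1 - 2*h2 - 0.5)         #OR but not AND, i.e. XOR
}

xor_net(1, 0)  #1, matching 1 + 0 = 1 mod 2
```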

# Backpropagation

When training networks that have additional layers it is still possible to find algebraic forms for the partial derivatives of the weights and biases, but for the first layer of weights and biases such expressions would be complex, cumbersome and time-consuming to compute.

One method that can tackle the issue is to estimate the derivatives numerically from first principles, tweaking each weight and bias in turn, measuring the change in the error, and then applying gradient descent. This finite-difference approach becomes computationally expensive as networks get larger, since the whole network must be re-evaluated for every parameter. A more efficient approach is backpropagation. The general idea is to first calculate the derivatives of the weights and biases in the final layer using the algebraic approach, and then to work back a layer at a time using the chain rule until the whole network is covered.
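The two approaches agree, which is worth verifying once. The sketch below (with illustrative values, not taken from the post) estimates the derivative of the error of a single sigmoid neuron by central differences and checks it against the algebraic formula used throughout this article:

```r
#logistic activation
sigmoid <- function(x) 1/(1+exp(-x))

#squared error of a single sigmoid neuron as a function of one weight w;
#the inputs, other weight, bias and target are illustrative values
loss <- function(w) {
  o <- sigmoid(w*1 + (-1)*1 + 1)   #x = (1,1), second weight -1, bias 1
  0.5*(o - 1)^2                    #target a = 1
}

#numerical derivative by central differences
eps <- 1e-5
w <- 0
num_grad <- (loss(w + eps) - loss(w - eps))/(2*eps)

#algebraic derivative (o - a)*o*(1 - o)*x1 at the same point
o <- sigmoid(0)
ana_grad <- (o - 1)*o*(1 - o)*1   #-0.125

abs(num_grad - ana_grad) < 1e-6   #TRUE
```

The drawback is that the numerical version needs two full forward passes per parameter, which is what makes backpropagation the better choice for larger networks.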

# Numerical Example of Backpropagation

Consider the following 2-2-2 network with weights and biases as illustrated. The task is to find the values of the network and the partial derivatives when $i_1=1$ and $i_2=1$.

## Feeding Forward

First let's feed forward and get the values of the hidden-layer and output neurons. For example, to get the value of $h_1$, first compute the dot product plus bias
 $\vec{x}.\vec{w}+b=1*0+(-1)*1+1=0$ 

and apply the logistic activation function to get

$h_1=\frac{1}{1+\exp(0)}=0.5$

Using this procedure, $h_2=\sigma(-2)=0.1192$, and it can be repeated for $o_1$ and $o_2$ using the values $h_1, h_2$, the second layer of weights and the output biases.

In code this is achieved with the following lines

sigmoid <- function(x) 1/(1+exp(-x)) #logistic activation
hidden <- sigmoid(weights_1%*%input+bias_1)
result <- sigmoid(weights_2%*%hidden+bias_2)


where weights_1 is the $2 \times 2$ matrix matrix(c(w1,w2,w3,w4),nrow=2,byrow=F) and weights_2 is the $2 \times 2$ matrix matrix(c(w5,w6,w7,w8),nrow=2,byrow=F).
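As a check, the feedforward step can be run with concrete numbers. Only the first hidden neuron's parameters, $(0, -1)$ with bias $1$, are stated in the text; the second row below is an assumed choice consistent with $h_2=\sigma(-2)$ (the figure's exact values may differ):

```r
sigmoid <- function(x) 1/(1+exp(-x))

#row 1 taken from the worked example; row 2 is an assumption
#chosen so that the pre-activation of h2 is -2
weights_1 <- matrix(c(0, -1, -1, -2), nrow=2, byrow=TRUE)
bias_1 <- c(1, 1)
input <- c(1, 1)

hidden <- sigmoid(weights_1 %*% input + bias_1)
round(hidden, 4)  #h1 = 0.5, h2 = 0.1192, matching the text
```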

## Partial Derivatives in the Output Layer

Now let's get the partial derivative of the error function with respect to the weights linking the hidden layer to the output. This part can be done with some basic calculus.

The error is:

 $E=\frac{1}{2} \sum_{i=1}^2 (o_i-a_i)^2$ 

so using the Chain Rule

 $\frac{\partial E}{\partial {w_5}} = (o_1-a_1)*o_1*(1-o_1)*h_1\\ \\ = (0.1653-1)*0.1653*(1-0.1653)*0.5 \\= -0.05758$ 

and similar expressions can be found for the other weights as well as the biases and hidden layer values.

In fact this is easy to evaluate quickly in R, as the vector of partial derivatives of the biases is $( \vec{o}-\vec{a} )*\vec{o}*( 1-\vec{o} )$, where $*$ is the elementwise (Hadamard) product of the vectors, and the matrix of partial derivatives of the weights turns out to be the outer product of $( \vec{o}-\vec{a} )*\vec{o}*( 1-\vec{o} )$ and $\vec{h}$.

In code this is easy to achieve in R using

#delta in the output layer
delta_2 <- (result-actual)*result*(1-result)
dE_W2 <- outer(as.vector(delta_2),as.vector(hidden))
dE_b2 <- delta_2
#chain rule back to the hidden layer: dE/dh = t(W2) %*% delta
dE_h <- t(weights_2)%*%delta_2


## Backpropagating One Layer at a Time

The trick with backpropagation is to use this information and the chain rule to find the partial derivatives in the previous layer.

For instance
 $\frac{\partial E}{\partial w_1}= \frac{\partial E}{\partial h_1} \frac{\partial h_1}{\partial w_1}$ 

The partial derivative $\frac{\partial E}{\partial h_1}$ should already have been found in the previous section:
 $\frac{\partial E}{\partial h_1} = \frac{\partial E_1}{\partial h_1}+\frac{\partial E_2}{\partial h_1} = (o_1-a_1)*o_1*(1-o_1)*w_5+(o_2-a_2)*o_2*(1-o_2)*w_6$

$= (0.1653-1)*0.1653*(1-0.1653)*1+(0.2247-0)*0.2247*(1-0.2247)*(-1) = -0.1543$.
and $\frac{\partial h_1}{\partial w_1}$ can be obtained by noting that $h_1=\sigma(\vec{w}.\vec{x}+b)$, so

$\frac{\partial h_1}{\partial w_1} = h_1*(1-h_1)*i_1 = 0.5*(1-0.5)*1 = 0.25$

giving

$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial h_1} \frac{\partial h_1}{\partial w_1} = -0.1543*0.25 = -0.038575$.

 

Similarly using $\frac{\partial h_1}{\partial b_1} = h_1*(1-h_1) = 0.5*(1-0.5) = 0.25$
 $\frac{\partial E}{\partial b_1}=\frac{\partial E}{\partial h_1} \frac{\partial h_1}{\partial b_1} = -0.1543*0.25 = -0.038575$. 
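These worked numbers can be verified with a few lines of R using only the values quoted above:

```r
#values taken directly from the worked example
o <- c(0.1653, 0.2247)   #outputs
a <- c(1, 0)             #targets
w <- c(1, -1)            #w5 and w6

#dE/dh1, summing the contributions from both outputs
dE_h1 <- sum((o - a)*o*(1 - o)*w)     #approx -0.1543

h1 <- 0.5; i1 <- 1
dE_w1 <- dE_h1 * h1*(1 - h1) * i1     #approx -0.0386
dE_b1 <- dE_h1 * h1*(1 - h1)          #same value here, as i1 = 1
```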

 

In code this is easy to achieve in R using

#delta in the hidden layer
delta_1 <- hidden*(1-hidden)*dE_h
dE_W1 <- outer(as.vector(delta_1),as.vector(input))
dE_b1 <- delta_1


In general this process is repeated, going back one layer each time, until the derivatives for the first layer of weights and biases are found.
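A minimal sketch of that repetition, assuming a list of weight matrices W and a list of bias vectors b for an arbitrary stack of sigmoid layers (the names here are illustrative, not from the post's code):

```r
sigmoid <- function(x) 1/(1+exp(-x))

#backpropagation through an arbitrary number of sigmoid layers;
#returns the partial derivatives of every weight matrix and bias vector
backprop <- function(W, b, input, target) {
  L <- length(W)
  #feed forward, storing each layer's activations
  act <- list(matrix(input, ncol=1))
  for (l in 1:L)
    act[[l+1]] <- sigmoid(W[[l]] %*% act[[l]] + b[[l]])
  #delta in the final layer, from the squared-error derivative
  out <- act[[L+1]]
  delta <- (out - target)*out*(1 - out)
  dW <- vector("list", L); db <- vector("list", L)
  #work back one layer at a time using the chain rule
  for (l in L:1) {
    dW[[l]] <- delta %*% t(act[[l]])
    db[[l]] <- delta
    if (l > 1)
      delta <- (t(W[[l]]) %*% delta) * act[[l]]*(1 - act[[l]])
  }
  list(dW = dW, db = db)
}
```

Gradient descent then subtracts a learning rate times each of these from the corresponding weight matrix and bias vector.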

# Getting back to the non-linear classifier…

The following uses a network with 7 hidden layers of 30 neurons each. The result is just what a human might have drawn if asked to place a red region around the red points and a blue region around the blue points.

# Neural Network Libraries in R

As many readers may already have found out, R has three libraries that can implement neural networks: neuralnet, nnet and mlp. This is, of course, a relief considering how tricky it is to write optimal matrix operations for the feedforward and backpropagation steps of each network. These libraries are also optimised, which helps with much tougher problems involving complex networks. Indeed, when generating the network that classifies the last problem, the neuralnet library was used with the following code on the datasets $X$ and $Y$, as this time around an ad-hoc solver with 7 hidden layers would have been time-consuming to write.

#put input variables in an N x 2 matrix
tr_in <- matrix(c(X,Y),nrow=N,ncol=2)
#target values
tr_out <- class-1
#combine inputs and target in a data frame for the neuralnet function
#(columns are named X1, X2 and tr_out)
tr_in <- data.frame(tr_in,tr_out)

#build formula which tells neural net to train a network
#with 2 inputs and 1 output
form.in<-as.formula('tr_out~X1+X2')
#train the network with 7 hidden layers of 30 neurons each
mod2<-neuralnet(form.in,data=tr_in,hidden=rep(30,7),
lifesign="full",stepmax=100000,threshold=1e-4,
linear.output=TRUE)


# Conclusion

The problem and numerical example above illustrate what is needed to build more useful neural networks. As mentioned, when using neural networks in practice it is recommended to use a preferred library in R, with the previous examples as a basis for understanding what the library is doing.

There will be more discussion of this in the next post, but suffice to say the three libraries can sometimes perform very differently on the same sets of problems. Anyone emulating the regression or classification examples so far will probably find very little difference in performance between libraries. However, in practice neural networks are often used on datasets in the hundreds of thousands, if not millions, where tuning a network can be very hard and time-consuming and where each library behaves very differently. Opinions vary on which library is best; the next post will use the mlp library, as it proved the most versatile for more complex problems, but that is not to say the others are any worse.

As noted, all of the previous examples can be tackled using much more efficient alternative methods. The next article will look more closely at some complex problems where alternative solutions are not inherently obvious and/or where alternative methods are not much more efficient and discuss how to decide when and how to use a neural network.

### References

Authored by:
Liam Murray

Liam Murray is a data driven individual with a passion for Mathematics, Machine Learning, Data Mining and Business Analytics. Most recently, Liam has focused on Big Data Analytics – leveraging Hadoop and statistically driven languages such as R and Python to solve complex business problems. Previously, Liam spent more than six years within the finance industry working on power, renewables & PFI infrastructure sector projects with a focus on the financing of projects as well as the ongoing monitoring of existing assets. As a result, Liam has an acute awareness of the needs & challenges associated with supporting the advanced analytics requirements of an organization.