
# Introduction

Introduction to Neural Networks – Part 1 can be found here.

In practice, training neural networks can be a difficult and time-consuming process of trial and error. Learning rates have to be tweaked, an activation function needs to be chosen and an optimisation method that suits the training dataset needs to be selected. While understanding linear algebra and statistics can lead to an understanding of machine learning, mastering neural networks requires patience and the ability to experiment with solutions.

The last post introduced the concept of neural networks and how to train them. While Python and R have packages that can work with matrices and vectors very efficiently, there can still be performance issues when training on datasets, especially as weight matrices grow in size. This post covers key tactics that are often used to make training neural networks more efficient.

As these methods are implemented in many programming packages, the commonly held view is that they can be used as a “black box”. However, in practice, when designing complex neural networks, designers often incorporate a great deal of “domain” knowledge into the heuristic design and need to understand how these techniques work.

# Issues Faced Training Networks

## Testing the Output

Problems neural networks tackle can often be reduced to predictive questions of the variety “given the input $\vec{x}$, what will the output $\vec{a}$ be?”. The regression problem shown later on is an example of a predictive problem with one solution. However, with many artificial intelligence problems there is an additional issue. Let's say a network is correctly tuned to identify the ages of 20,000 people in a training set based on their photos. Can the network then correctly guess the ages of people from photos not in the training dataset? To tackle this issue, a “test” set of vectors $\vec{x}$ and their respective outputs $\vec{a}$ is usually prepared to see how good the network really is, and the problem is not viewed as solved until the network can accurately identify outputs in the test set after being trained.

## Online and Batch Training

One way to train a network is to go through the training set updating the weights at each data point for what is known as one “epoch”, covering the whole training set. This is known as online training. An alternative is to run a training epoch without changing the weights or biases, then apply the update rule once using the net sum of gradient values from the epoch. This is known as batch training. Which method works better varies from problem to problem: some networks converge more quickly using online training and others using batch training.
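The difference between the two schemes can be sketched in R on a toy one-weight model $y = wx$ with squared error. The data, learning rate and epoch counts here are illustrative assumptions, not values from the text:

```r
# Toy data for a one-weight model y = w * x (the true weight is 3)
set.seed(1)
X <- runif(100)
Y <- 3 * X + rnorm(100, sd = 0.01)
lr <- 0.1

# Online training: update w at every data point
w_online <- 0
for (epoch in 1:50) {
  for (i in seq_along(X)) {
    grad <- (w_online * X[i] - Y[i]) * X[i]   # dE/dw for a single point
    w_online <- w_online - lr * grad
  }
}

# Batch training: average the gradient over the whole epoch, then update once
w_batch <- 0
for (epoch in 1:500) {
  grad <- mean((w_batch * X - Y) * X)         # average dE/dw over the epoch
  w_batch <- w_batch - lr * grad
}

print(c(online = w_online, batch = w_batch))  # both should approach 3
```

Note that online training takes many small, noisy steps per epoch while batch training takes one accurate step, which is why their convergence speeds differ from problem to problem.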

Given that the standard error function $E$ takes the form of an average, one shortcut that can work is to batch update using a random sample of a fraction of the training set. Depending on how the data is distributed and the size of the subset used, under the law of large numbers the average gradient of the subset should be roughly the same as that of the whole set. At each epoch the sample is drawn randomly to ensure all inputs in the training set are sampled at some epoch. Whether this method works will depend on how the data is distributed, but the performance uplift can be significant: with a dataset of tens of thousands of entries, sampling a subset of 100 entries is often sufficient to adjust the weights and biases accurately.
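This sampling idea can be sketched in R using `sample()` on the same toy one-weight model as above; the dataset size of 20,000 and subset size of 100 follow the text, while the data itself, the learning rate and the epoch count are assumptions:

```r
# Mini-batch updates: a random subset stands in for the full gradient
set.seed(2)
n <- 20000
X <- runif(n)
Y <- 3 * X + rnorm(n, sd = 0.1)   # true weight is 3

batch_size <- 100
lr <- 0.1
w <- 0
for (epoch in 1:2000) {
  idx <- sample(n, batch_size)                  # fresh random sample each epoch
  grad <- mean((w * X[idx] - Y[idx]) * X[idx])  # average gradient on the subset
  w <- w - lr * grad
}
print(w)  # should be close to 3, despite each step seeing only 0.5% of the data
```

Each update touches 100 points instead of 20,000, so each epoch is roughly 200 times cheaper while the sampled gradient still points in approximately the right direction.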

## Stopping Conditions

An issue often encountered with some neural networks is that the error cannot go below a certain level. If that is the case then it is worth specifying that the algorithm stops when the magnitude of the gradient of the error with respect to the weights and biases falls below a chosen threshold, as illustrated in the regression example later on.

## Fast-Converging Methods – Resilient Propagation (RPROP)

One of the fastest methods available to minimise error in a neural network is known as the Resilient Propagation method, or RPROP. It was proposed in a 1993 paper by Martin Riedmiller and Heinrich Braun [1]. The essential idea is that instead of adjusting weights and biases by an amount proportional to the partial derivatives, each weight and bias is given its own step size, delta, which determines how much it changes at the next step. While the sign of the partial derivative stays the same, the delta grows at each step. If the sign changes, the step has gone too far: the weight is returned to its previous value, the delta is reduced, and the process repeats.

To implement it, at each iteration backpropagation is used to obtain the partial derivatives. But instead of applying gradient descent, weights and biases are modified using the following pseudocode:

$\textrm{if } \left(\frac{\partial E}{\partial w_{ij}}(t-1)\cdot\frac{\partial E}{\partial w_{ij}}(t) > 0\right) \textrm{ then } \{$
$\quad \Delta_{ij}(t)=\textrm{minimum}(\Delta_{ij}(t-1)\cdot\eta^+,\,\Delta_{\textrm{max}})$
$\quad \Delta w_{ij}(t)=-\textrm{sign}\left(\frac{\partial E}{\partial w_{ij}}(t)\right)\cdot\Delta_{ij}(t)$
$\quad w_{ij}(t+1)=w_{ij}(t)+\Delta w_{ij}(t)$
$\}$

$\textrm{else if } \left(\frac{\partial E}{\partial w_{ij}}(t-1)\cdot\frac{\partial E}{\partial w_{ij}}(t) < 0\right) \textrm{ then } \{$
$\quad \Delta_{ij}(t)=\textrm{maximum}(\Delta_{ij}(t-1)\cdot\eta^-,\,\Delta_{\textrm{min}})$
$\quad w_{ij}(t+1)=w_{ij}(t)-\Delta w_{ij}(t-1)$
$\quad \frac{\partial E}{\partial w_{ij}}(t) := 0$
$\}$

$\textrm{else if } \left(\frac{\partial E}{\partial w_{ij}}(t-1)\cdot\frac{\partial E}{\partial w_{ij}}(t) = 0\right) \textrm{ then } \{$
$\quad \Delta w_{ij}(t)=-\textrm{sign}\left(\frac{\partial E}{\partial w_{ij}}(t)\right)\cdot\Delta_{ij}(t)$
$\quad w_{ij}(t+1)=w_{ij}(t)+\Delta w_{ij}(t)$
$\}$

The above procedure can be implemented in R or Python, and the results when using additional layers can be astounding. While gradient descent relies on tweaking the learning rate to suit the problem being solved, the optimal parameter values for this algorithm tend to be fairly consistent across problems. An initial value of $\Delta_{ij}=0.1$ with $\Delta_{\textrm{min}}=10^{-6}$, $\Delta_{\textrm{max}}=50$, $\eta^+=1.2$ and $\eta^-=0.5$ tends to give very good results whatever the problem.
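A minimal vectorised sketch of the pseudocode in R is shown below. To keep it self-contained it minimises a simple quadratic $f(w)=\sum_i (w_i - \textrm{target}_i)^2$ rather than a network error; `grad_fn`, `target` and the iteration count are assumptions, while the $\eta$ and $\Delta$ constants follow the text:

```r
# RPROP sketch: one adaptive step size per parameter, sign-based updates
rprop <- function(grad_fn, w, iters = 200,
                  delta0 = 0.1, delta_min = 1e-6, delta_max = 50,
                  eta_plus = 1.2, eta_minus = 0.5) {
  delta     <- rep(delta0, length(w))  # per-parameter step sizes
  g_prev    <- rep(0, length(w))       # previous gradient (0 forces the "= 0" branch)
  step_prev <- rep(0, length(w))       # previous step, needed for backtracking
  for (t in 1:iters) {
    g <- grad_fn(w)
    s <- g_prev * g                    # sign of the product of successive gradients
    up   <- s > 0                      # same sign: grow the step
    down <- s < 0                      # sign flipped: shrink the step and backtrack
    same <- s == 0
    delta[up]   <- pmin(delta[up] * eta_plus, delta_max)
    delta[down] <- pmax(delta[down] * eta_minus, delta_min)
    step <- numeric(length(w))
    move <- up | same
    step[move] <- -sign(g[move]) * delta[move]
    step[down] <- -step_prev[down]     # undo the previous step
    w <- w + step
    g[down] <- 0                       # so those entries take the "= 0" branch next
    g_prev <- g
    step_prev <- step
  }
  w
}

# Example: recover a target vector from gradients alone
target <- c(2, -5, 0.5)
w_opt <- rprop(function(w) 2 * (w - target), c(0, 0, 0))
print(w_opt)  # should be very close to target
```

Note that only the sign of each partial derivative is used; the magnitude of the step comes entirely from the adaptive deltas, which is why the method is so insensitive to the scale of the error surface.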

# Regression Example

Consider the two sets of data points $X$ and $Y$ displayed below. It is plausible that there is a linear relationship between $Y$ and $X$ of the form $Y=\alpha+\beta X$.

Then a 1-1 neural network can be set up with $X$ as input and $Y' = \alpha + \beta X$ as output and the identity mapping as activation function. Gradient descent is used to minimise the error $\frac{1}{2n}\sum_{i=1}^n{(Y_i'-Y_i)^2}$ with respect to the weight $\beta$ and bias $\alpha$.
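Differentiating $E=\frac{1}{2n}\sum_{i=1}^n{(Y_i'-Y_i)^2}$ with $Y_i'=\alpha+\beta X_i$ gives the two gradient components used in the update rule:

$\frac{\partial E}{\partial \alpha} = \frac{1}{n}\sum_{i=1}^n{(Y_i'-Y_i)} \qquad \frac{\partial E}{\partial \beta} = \frac{1}{n}\sum_{i=1}^n{X_i(Y_i'-Y_i)}$

These are the `mean(alpha+beta*X-Y)` and `mean(X*(alpha+beta*X-Y))` terms that appear in the code below.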

One thing that is notable about this problem is that regression has an exact solution, which can be obtained in R with the following code:

```r
fit <- lm(Y ~ X)
abline(coefficients(fit)[1], coefficients(fit)[2], col = "red", lwd = 2)
```


This allows for a comparison to be made between the neural network solution and the exact solution.

As the error will never reach $0$ this time, it is better to have the algorithm stop when the squared magnitude of the error gradient falls below a threshold $\epsilon$. The following code shows how to do that using the total_derr() function:

```r
# initialise weight and bias
alpha <- runif(1, min = -1, max = 1); beta <- runif(1, min = -1, max = 1)

# error function
err <- function() { return(mean(1/2 * (alpha + beta*X - Y)**2)) }
# squared magnitude of the error gradient as a function of alpha and beta
total_derr <- function() {
  return((mean(X * (alpha + beta*X - Y)))**2 + (mean(alpha + beta*X - Y))**2)
}

print(total_derr())
EPS <- 1e-4
learning_rate <- 1e-4
count <- 1
while (total_derr() > EPS) {
  # use new_alpha variable so beta can update in the next line
  # using the alpha value before the update
  new_alpha <- alpha - learning_rate * mean(alpha + beta*X - Y)
  # update beta
  beta <- beta - learning_rate * mean(X * (alpha + beta*X - Y))
  # update alpha to new_alpha
  alpha <- new_alpha
  # print only every 1000 iterations; it is a very fast algorithm
  # and printing on every iteration would slow it down
  if (count %% 1000 == 0) print(total_derr())
  count <- count + 1
}

# finally draw the regression line
abline(alpha, beta, col = "blue", lwd = 2)
```


With $\epsilon=10^{-4}$ the following lines are obtained, where the solution determined by the network (blue line) is close to the exact solution (red line).

But with $\epsilon=10^{-8}$ there is no discernible difference.

# Conclusion

As with the previous post, the example above has an acceptable and much more efficient alternative algorithm. When using single-layer networks the techniques mentioned above are not particularly necessary, but when adding additional layers they become absolutely essential. As mentioned before, the examples given so far are there to motivate an understanding of neural networks. The next post will introduce the method of backpropagation, which is very useful for tuning networks efficiently as layers are added; implemented alongside stochastic gradient descent, it brings areas of artificial intelligence such as facial recognition very much within reach.

### References

[1] M. Riedmiller and H. Braun, “A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm”, Proceedings of the IEEE International Conference on Neural Networks, 1993.

Authored by:
Liam Murray

Liam Murray is a data driven individual with a passion for Mathematics, Machine Learning, Data Mining and Business Analytics. Most recently, Liam has focused on Big Data Analytics – leveraging Hadoop and statistically driven languages such as R and Python to solve complex business problems. Previously, Liam spent more than six years within the finance industry working on power, renewables & PFI infrastructure sector projects with a focus on the financing of projects as well as the ongoing monitoring of existing assets. As a result, Liam has an acute awareness of the needs & challenges associated with supporting the advanced analytics requirements of an organization.