Articles

Introduction

When creating inventions or algorithms, it is not unusual to look for inspiration in nature. The original camera design is loosely based on the inner workings of the eye, Velcro was inspired by burdock burrs’ ability to latch onto fur and fabric, water filters are based on biological membranes, and the nose of the Japanese Bullet Train mimics a kingfisher’s beak so as to minimise noise [1].

Artificial neural networks, a key area of artificial intelligence, do just that by drawing inspiration from how neurons enable humans to think. This approach yields algorithms, in areas such as speech, voice and facial recognition, that learn as they go along. This is vital, for instance, when developing software that can recognise someone’s speech even in situations where the algorithm has never been exposed to that person’s voice (or indeed accent) before.

While the idea is inspired by the physiology of neurons, the similarities are not very deep: how neurons in the brain actually learn is much more complex than how artificial neural networks learn. Nevertheless, artificial neural networks have been known not only to match what a human can do but to beat it. While this is normal for mathematical problems, it is no mean feat for tasks such as recognising characters, where traditional approaches normally fail.

Linear Classification Example

Consider a basic classification problem where the two sets below are linearly separable. A line that separates these two sets is sought.
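The code later in this article assumes global vectors X, Y and class holding the training data. A minimal sketch that generates two such linearly separable clouds (the means, spread and seed are illustrative assumptions, not part of the original example):

```r
# Hypothetical training set: two point clouds, labelled 0 and 1.
# X, Y and class are the global names the training code below expects.
set.seed(42)                                    # reproducible draws
n <- 20                                         # points per class
X <- c(rnorm(n, mean = 1), rnorm(n, mean = 5))  # x-coordinates
Y <- c(rnorm(n, mean = 1), rnorm(n, mean = 5))  # y-coordinates
class <- c(rep(0, n), rep(1, n))                # class label per point
plot(X, Y, col = class + 1, pch = 19)           # visualise the two sets
```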

A Neural Network Solution

Consider the following:

Two input values $x_1$ and $x_2$ are fed into the node returning the output $y = f(x_1w_1+x_2w_2+b)$. An algorithm is sought to select values for $w_1, w_2$ and $b$ so that the output $y$ gives the correct class for each coordinate $(x_1, x_2)$ in the above example.

This is known as a single-layer feedforward neural network, and more specifically a “2-1” network, as 2 input nodes feed into 1 output node. The function $f$ is often referred to as an “activation function”, and the form $f$ takes can have a huge bearing on how the problem at hand is solved. A more succinct way to represent this is $y=f(\vec{x}\cdot\vec{w}+b)$. The vector $\vec{w}$ represents the “weights” of the network and $b$ is known as the “bias”, while each node in a neural network is often referred to as a “neuron”, with output nodes referred to as “perceptrons”.
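To make the forward pass concrete, a single neuron can be evaluated directly. A minimal sketch using the step activation (all numbers here are illustrative):

```r
# A neuron computes f(x1*w1 + x2*w2 + b) for an activation function f.
step <- function(x) ifelse(x > 0, 1, 0)        # binary step activation

neuron <- function(x, w, b, f) f(sum(x * w) + b)

neuron(c(1, 2), c(0.5, -0.25), -0.1, step)     # 0.5 - 0.5 - 0.1 < 0, so 0
neuron(c(3, 1), c(0.5, -0.25), -0.1, step)     # 1.5 - 0.25 - 0.1 > 0, so 1
```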

Training the Network

The first step in solving this problem is to “train” the network, akin to how a human brain learns from experience. The set of inputs $(x_1,x_2)$ and their associated class $c$ (valued at either $0$ or $1$) is known as the “training set”.

A key step in achieving this is to define an error function, in this case the function
 $E=\frac{1}{2m}\sum_{i=1}^m(o_i-a_i)^2$ 
where $o_i$ and $a_i$ are the output and target values of the $i\textrm{-th}$ coordinate respectively.

The goal is to get the values of the outputs $o_i$ as close as possible to the actual values $a_i$. Clearly minimising $E$ will do this, which can be done using gradient descent. So a suitable algorithm would be to generate random weights and bias and use the update rule:
 $w_i \mapsto w_i - \eta \frac{\partial E}{\partial w_i}$ 
 $b \mapsto b - \eta \frac{\partial E}{\partial b}$ 
for each datapoint until the error $E$ is within acceptable limits. The factor $\eta$ is known as the “learning rate” and experimenting to find a suitable learning rate is the key challenge of tuning the network.
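For a differentiable activation $f$, writing $z_i = \vec{x}_i\cdot\vec{w}+b$ so that $o_i=f(z_i)$, the chain rule gives the partial derivatives used in this update rule:

 $\frac{\partial E}{\partial w_j} = \frac{1}{m}\sum_{i=1}^m (o_i-a_i)f'(z_i)x_{ij}, \qquad \frac{\partial E}{\partial b} = \frac{1}{m}\sum_{i=1}^m (o_i-a_i)f'(z_i)$ 

where $x_{ij}$ is the $j\textrm{-th}$ coordinate of the $i\textrm{-th}$ input. This also shows why a non-differentiable activation such as the step function needs a substitute rule: its derivative is zero wherever it is defined.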

Coding the Network

Firstly code is presented using the step activation function where $f(x)=1$ only if $x$ is positive and $0$ otherwise. Essentially this can be interpreted as “only fire the neuron when the weighted output is positive”. The step function is not differentiable so instead of getting partial derivatives the following update rule is used:
 $w_1 \mapsto w_1-\eta(x_1 w_1+x_2 w_2+b-c)x_1$ 
 $w_2 \mapsto w_2-\eta(x_1 w_1+x_2 w_2+b-c)x_2$ 
 $b \mapsto b-\eta(x_1 w_1+x_2 w_2+b-c)$ 

In R the code for training the weights therefore is:

train_w <- function(w,b){
  #put training coordinates into matrix to vectorise code
  r <- matrix(c(X,Y),nrow=length(X),byrow=F)
  #evaluate output for whole training set using binary step function
  result <- as.vector(ifelse(r%*%w+b>0,1,0))
  #apply update rule - example of batch training
  w <- w-learning_rate*colSums(matrix(rep(result-class,2),
                               nrow=length(X))*matrix(c(X,Y),nrow=length(X)))
  return(w)
}

while the code to train the bias is:

train_b <- function(w,b){
  #put training coordinates into matrix to vectorise code
  r <- matrix(c(X,Y),nrow=length(X),byrow=F)
  #evaluate output for whole training set using binary step function
  result <- as.vector(ifelse(r%*%w+b>0,1,0))
  #apply update rule - the bias update sums the errors over the batch
  b <- b-learning_rate*sum(result-class)
  return(b)
}

and the code to get the error is:

error <- function(w,b){
  #put training coordinates into matrix to vectorise code
  r <- matrix(c(X,Y),nrow=length(X),byrow=F)
  #evaluate output for whole training set using binary step function
  result <- as.vector(ifelse(r%*%w+b>0,1,0))
  #evaluate error based on output (result) and actual (class)
  err <- sum((result-class)^2)/(2*length(X))
  return(err)
}

and the main loop to train the network is:

learning_rate <- 1
#randomly initialise weights and bias
w <- runif(2,min=-1,max=1)
b <- runif(1,min=-1,max=1)

EPS <- 1e-2
cat(sprintf("Initial:\t%f\n",error(w,b)))
epoch <- 1
while(error(w,b) > EPS){
  #train weights and save value of w so bias can be trained
  new_w <- train_w(w,b)
  b <- train_b(w,b) #train bias
  w <- new_w #update variable w
  #show error at each epoch
  cat(sprintf("Epoch %d:\t%f\n",epoch,error(w,b)))
  epoch <- epoch+1
}

abline(-w[1]/w[2],-b/w[2]) #draw separating line


Running the above program in R, which draws a separating line using the final values of the weights and bias, gives the following:
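Once the network is trained, classifying a new point is just one evaluation of the neuron. A minimal sketch (the weight and bias values below are placeholders standing in for whatever the training loop produced):

```r
# Classify a point with trained weights w and bias b via the step rule.
classify <- function(x1, x2, w, b) ifelse(x1*w[1] + x2*w[2] + b > 0, 1, 0)

# Illustrative trained values giving the separating line x1 + x2 = 5.
w <- c(1, 1)
b <- -5
classify(1, 1, w, b)   # below the line: returns 0
classify(4, 4, w, b)   # above the line: returns 1
```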

Role of Activation Function

The activation function is a critical feature of neural networks. Popular choices are the binary step function and the logistic function $f(x)=\frac{1}{1+\exp(-x)},$ also known as the sigmoid function. With the step function, small changes in the weights can often lead to large changes in the output, which is not ideal. This is why the sigmoid function, which is better behaved, is often chosen instead, and there are many other activation functions which offer similar features [2].

A bit of differentiation gives that the derivative of $\sigma$ is $\sigma(1-\sigma)$. Writing $\sigma$ for $\sigma(x_1 w_1+x_2 w_2+b)$ and $c$ for the target class, the update rule therefore becomes:
 $w_1 \mapsto w_1-\eta(\sigma-c)\sigma(1-\sigma)x_1$ 
 $w_2 \mapsto w_2-\eta(\sigma-c)\sigma(1-\sigma)x_2$ 
 $b \mapsto b-\eta(\sigma-c)\sigma(1-\sigma)$ 
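The derivative identity can be verified directly from the definition:

 $\sigma'(x) = \frac{\exp(-x)}{(1+\exp(-x))^2} = \frac{1}{1+\exp(-x)}\cdot\frac{\exp(-x)}{1+\exp(-x)} = \sigma(x)\big(1-\sigma(x)\big)$ 

This compact form is convenient for gradient descent, since the derivative is expressed entirely in terms of values of $\sigma$ that have already been computed.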

and implementing it in R involves changing the following lines

result <- as.vector(ifelse(r%*%w+b>0,1,0))
w <- w-learning_rate*colSums(matrix(rep(result-class,2),
                             nrow=length(X))*matrix(c(X,Y),nrow=length(X)))


to

result <- as.vector(sigmoid(r%*%w+b))
w <- w-learning_rate*colSums(matrix(rep((result-class)*result*(1-result),2),
                             nrow=length(X))*matrix(c(X,Y),nrow=length(X)))


wherever those two lines appear. Note that the bias update in train_b changes analogously, to b <- b-learning_rate*sum((result-class)*result*(1-result)).

This uses the sigmoid function definition:

sig <- function(x){1/(1+exp(-x))}
sigmoid <- Vectorize(sig)
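A quick sanity check of this definition and of the derivative identity used in the update rule (the test point and tolerance are arbitrary choices):

```r
sig <- function(x){ 1/(1+exp(-x)) }   # logistic function
sigmoid <- Vectorize(sig)

sigmoid(0)                            # returns 0.5, the curve's midpoint
# Compare a central-difference estimate of sigma'(1) with sigma(1)*(1-sigma(1))
h <- 1e-6
numeric_deriv <- (sig(1 + h) - sig(1 - h)) / (2 * h)
abs(numeric_deriv - sig(1)*(1 - sig(1))) < 1e-8   # TRUE
```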


In this case the step function solution usually converges very quickly, while the sigmoid perceptron is slower, especially if the data is linearly separable but the boundary between the sets is quite narrow. However, with more complex problems and networks it is often the case that the step function causes some runs not to converge, while the sigmoid does not encounter such issues.

Conclusion

While artificial neural networks are inspired by physiology, the similarities are not very deep. Neuroscience researchers often point out that there is more to how brains learn than activation rules, and biological neural networks have much more complex mechanisms in place than merely adding up weighted inputs [3].

One notable point about the problem illustrated is that Support Vector Machines offer an acceptable and efficient alternative [4]. For many problems tackled by single-layer networks, efficient dedicated algorithms already exist. Nevertheless, mastering the above example and other examples of single-layer networks is a key step in understanding more complex, and more useful, networks.

References

Authored by:
Liam Murray

Liam Murray is a data driven individual with a passion for Mathematics, Machine Learning, Data Mining and Business Analytics. Most recently, Liam has focused on Big Data Analytics – leveraging Hadoop and statistically driven languages such as R and Python to solve complex business problems. Previously, Liam spent more than six years within the finance industry working on power, renewables & PFI infrastructure sector projects with a focus on the financing of projects as well as the ongoing monitoring of existing assets. As a result, Liam has an acute awareness of the needs & challenges associated with supporting the advanced analytics requirements of an organization.