# Introduction to R

Producing basic statistical analysis can be quite complex with many of the programming languages used in mathematical programming such as C++, Java, VBA or Python.

For this reason a number of data-centric languages, such as R, SAS and MATLAB, have been designed to simplify this process. R has gained popularity amongst data scientists because it is an open source project (unlike SAS and MATLAB), is easy to work with and offers a large number of libraries dedicated to data exploration.

For example, doubling all the elements of an array x of unknown length in Java requires a loop such as for(int i=0;i<x.length;i++) x[i]*=2; while in R simply writing x <- x*2 suffices, without needing to know the length of the array. This elegance makes R very useful in data science, where one can master R without necessarily having a prior programming background.
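As a minimal sketch of this vectorised behaviour (the sample values are made up for illustration), arithmetic on an R vector is applied to every element, whatever its length:

```r
# Hypothetical sample data, chosen only for illustration
x <- c(1.5, 4, 10)

# Arithmetic is applied element by element; no loop or length is needed
x <- x * 2

print(x)
```

The same line works unchanged whether x holds three values or three million.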

# Data Visualisation in R

R’s elegance also applies to data visualisations, where graphics can be created with a few simple lines of code. During the developmental phases of projects this is useful for quick visual checks of data. For instance, the following three lines of code are all that are needed to show UK population estimates (in millions) from 2005 to 2014.

``````
x <- 2005:2014
y <- c(59.79,60.18,60.62,61.07,61.57,62.04,62.51,63.02,63.5,63.9)
plot(x,y)
``````

This shows how useful plot() is when checking data and models for inconsistencies. However, when finishing off a project for presentation to a wider audience, a bit more effort is needed to make the output presentable.

# File Types

R supports many file types including pdf, png, bmp and jpeg. General tips for choosing which file type to use can be found online. Typically pdf is best for printing and file sharing, as it is usually supported by Linux, Windows and Mac, while png is usually best for web display, as image quality and finer details are often preserved.

The code below creates a png file with a plot of x against y, assuming these vectors have already been defined in R. This is the main template used to generate the images contained in this post.

``````
png(filename = "test.png", width = 400, height = 400) # size in pixels
plot(x,y)
dev.off() # close the device so the file is written
``````

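Using other device drivers is straightforward, e.g. calling pdf() in place of png() in this example gives a pdf document. Note that pdf() measures width and height in inches rather than pixels, so the size arguments need adjusting. Executing dev.off() closes the current device, which writes the plot to file and leaves R ready to start a fresh plot.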
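A minimal sketch of the pdf variant follows. It assumes a 6 by 6 inch page is acceptable, since pdf() takes its width and height in inches (unlike png(), whose default unit is pixels). The output path and the name out_file are chosen here for illustration only.

```r
# Hypothetical output location; tempdir() avoids cluttering the working directory
out_file <- file.path(tempdir(), "test.pdf")

x <- 2005:2014
y <- c(59.79, 60.18, 60.62, 61.07, 61.57, 62.04, 62.51, 63.02, 63.5, 63.9)

pdf(file = out_file, width = 6, height = 6)  # 6 x 6 inches
plot(x, y)
dev.off()  # closes the device and finalises the file

# Confirm that the file was written
print(file.exists(out_file))
```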

# Basic Plotting Techniques

Typing help(plot.default) in an R terminal reveals the specification of plot(). There are numerous plotting functions dedicated to refining plots, as well as a par() function which allows additional tweaking of graphical parameters, e.g. par(mar=c(5.1,4.1,4.1,2.1)) sets the margins of the graph to their default sizes.
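When experimenting with par(), it can help to keep the previous settings so they can be restored afterwards. The sketch below relies on the fact that par() returns the old values of any parameters it changes; the name old_par and the margin values are chosen here for illustration.

```r
# Changing a parameter returns the previous settings, which can be kept
old_par <- par(mar = c(6, 6, 4.1, 2.1))

# Query the margins now in force (bottom, left, top, right)
print(par("mar"))

# Restore the earlier settings once the customised plot is finished
par(old_par)
print(par("mar"))
```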

In general, plot() (or another high-level function that starts a new graph) has to be called before low-level functions that add to an existing graph. For instance, to use the function lines() the following code sequence is required:

``````
plot(x,y,type="n") #nothing plotted
lines(x,y)
``````

However, barplot() also starts a new graph by itself, so plot() is not required before it. Executing barplot(y) gives a basic bar chart, which will be polished shortly.

# Improving Appearance

The following summarises some of the key functions and parameters that can be used to improve the look of plots. R’s plotting functions tend to have many common parameters, e.g. the parameter xlab sets the text label under the x-axis when using both barplot() and plot().

| Command | Description | Example |
| --- | --- | --- |
| text() | Places text in the graph | text(1, 1, "(1, 1)") writes "(1, 1)" at the coordinate (1, 1) |
| axis() | Draws an axis and modifies how labels, values and ticks are displayed. Best called after setting axes=FALSE in the main plotting function being used | axis(1) draws a default x-axis. axis(2, pos=0) draws the y-axis going through the point x=0, rather than at the left-most part of the graph |
| cex | Magnifies text | plot(x, y, main="Sample", cex.main=2) draws a graph with a heading double its normal size. Similar effects can be achieved with cex.lab, cex.sub and cex.axis |
| col | Sets colour | plot(x, y, col="blue"). A full specification of character strings used for colours is available from the Columbia University website |
| main, xlab, ylab, sub | Respectively set the heading, x-axis label, y-axis label and sub-heading below the x-axis | plot(x, y, main="A Graph", xlab="X values", ylab="Y values") |
| xaxp, yaxp | Set ticks on x-axis and y-axis | xaxp=c(0, 20, 10) marks ticks at x=0, 2, …, 20 |
| xlim, ylim | Set limits of x and y values. This allows for the creation of whitespace | xlim=c(0, 1) ensures the x-axis is drawn from x=0 to x=1 |
| mtext() | Writes text in the margin. Can be used to replace the axis labels | mtext(side = 1, text = "X", line = 4) writes in the margin below the x-axis (side=2 for the y-axis, 3 for top, 4 for right side). The argument line sets how many margin lines out from the plot edge the text is placed. Setting xlab="" in plot() in effect makes mtext() a replacement for xlab, except with more flexibility with text positioning |

The parameter mar modifies margin spacing. A detailed tutorial on margin spacing is found on the R-bloggers website. Exact understanding isn’t necessary; it’s enough to play around with the mar vector, knowing that par(mar=c(5.1,4.1,4.1,2.1)) sets the default margins at the image’s bottom, left, top and right, respectively.

Putting this all together, an improved bar chart is achieved with the following code:

``````
x <- 2005:2014
y <- c(59.79,60.18,60.62,61.07,61.57,62.04,62.51,63.02,63.5,63.9)
par(mar=c(6,6,4.1,2.1))
mp <- barplot(y, space=0.5, col="red", main="UK Population 2005-2014",
              xlab="", ylab="Population in millions",
              yaxp=c(59,64,5), ylim=c(59,64), xpd=FALSE,
              cex.main=2, cex.lab=2)
axis(1,at=mp,labels=x,las=2)
mtext(side = 1, text = "Year", line = 4,cex=2)
``````

More can be done with these tools. Suppose net profit (or EBITDA) on a household item follows the formula EBITDA = P(40-P)/10-30, where P is the price charged. Then the following code gives a fairly sharp visualisation of the net profit curve for one item.

``````
P <- seq(0,40,0.1)
lwdth <- 3
EBITDA <- (40-P)*P/10-30
par(mar=c(2,6,4.1,2.1))
plot(P,EBITDA,cex.main=2,cex.lab=2,type="l",col="blue",
     main="Net Profit against Price",xlab="",ylab="Net Profit",
     axes=FALSE,ylim=c(-30,30),lwd=lwdth)
points(20,10,pch=19,cex=2)
axis(1,pos=0,xaxp=c(0,40,8),labels=c("",as.character(seq(5,40,5))),at=seq(0,40,5),lwd=lwdth)
axis(2,pos=0,yaxp=c(-30,30,12),las=1,lwd=lwdth)
text(11,25,"maximum, where ",cex=1.5)
text(29,25,expression(paste(frac(paste(partialdiff,EBITDA),paste(partialdiff,P))," = 0")),cex=1.5)
arrows(20,22,20,11.5,lwd=lwdth)
mtext("Price",side=1,line=0,cex=2)
``````

Additional functions used here include points() and arrows(), which are self-explanatory, and the parameters lwd (line width) and las (the orientation of axis labels). The partial derivative signs and equation were produced using the expression() function, which is documented under help(plotmath).
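The maximum marked at (20, 10) in the plot can be cross-checked. Setting the derivative of EBITDA = P(40-P)/10-30 to zero gives (40-2P)/10 = 0, so P = 20 and EBITDA = 20*20/10-30 = 10. The sketch below confirms this numerically with optimize(); the function name profit is chosen here for illustration.

```r
# Net profit as a function of price, as defined in the text
profit <- function(P) (40 - P) * P / 10 - 30

# Search the price range used in the plot for the maximising price
opt <- optimize(profit, interval = c(0, 40), maximum = TRUE)

print(opt$maximum)    # price at the maximum, approximately 20
print(opt$objective)  # net profit at that price, approximately 10
```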

### Histograms and Q-Q plots

R provides the functions hist() and qqplot() which can be used to check how good the function rnorm() is at generating Gaussian numbers.

Firstly using the hist function:

``````
x <- rnorm(10000)
par(mar=c(5.1,5,4.1,2.1))
hist(x,main="Normal Distribution",xlab="Gaussian Value",ylab="Frequency",col="red",cex.lab=2,cex.main=2,xlim=c(-4,4))
``````

Visually, this appears similar enough to a Gaussian distribution. A more sensitive check is a Q-Q plot against the standard normal distribution:

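``````
x <- rnorm(1000)
par(mar=c(5.1,5,4.1,2.1))
# qnorm() expects probabilities between 0 and 1; ppoints() supplies evenly spaced ones
theoretical <- qnorm(ppoints(length(x)))
# theoretical quantiles go on the x-axis, sample quantiles on the y-axis
qqplot(theoretical,x,main="Q-Q plot of rnorm(1000)",xlab="",ylab="Sample Quantiles",cex.lab=2,cex.main=2)
mtext("Theoretical Quantiles",side=1,line=4,cex=2)
qqline(x)
``````

The scattered points fall almost on a straight line, showing that the rnorm() random generator is producing suitably normally distributed numbers.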
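For the specific case of comparing a sample against the normal distribution, base R also provides qqnorm(), which computes the theoretical quantiles itself. The sketch below is an alternative to the code above, not a replacement for it; the seed is arbitrary and the correlation is only a rough numerical summary of how straight the line is.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch repeatable
x <- rnorm(1000)

# qqnorm() plots sample quantiles (y-axis) against normal quantiles (x-axis)
qq <- qqnorm(x, main = "Q-Q plot of rnorm(1000)")
qqline(x)

# Correlation between the two sets of quantiles; values near 1 mean a near-straight line
print(cor(qq$x, qq$y))
```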

### Heat Maps and 3D Plots

Sometimes it’s useful to get an overview of data to spot patterns. The following code takes demographic data for a website’s traffic over one week and displays it using the image() function, with image.plot() from the fields package adding the legend:

``````
# install.packages("fields") # run once if the package is not yet installed
library(fields)
# 6 age groups (rows) by 7 days (columns)
data_matrix <- matrix(c(17,30,25,15,8,5,
                        25,45,33,23,12,7,
                        15,27,22,13,7,4,
                        17,24,25,10,4,4,
                        12,20,20,9,4,2,
                        16,22,18,7,5,2,
                        14,20,16,5,5,4),ncol=7)
par(oma=c(1.5,1.5,1.5,1.5))
image(1:ncol(data_matrix), 1:nrow(data_matrix), t(data_matrix), col = heat.colors(12), axes = FALSE,xlab="",ylab="")
# write each value in the centre of its cell
for (x in 1:ncol(data_matrix)) {
  for (y in 1:nrow(data_matrix)) {
    text(x, y, data_matrix[y,x])
  }
}
mtext(text=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"), side=1, line=0.3, at=1:7,cex=1.25,las=2)
mtext(text=c("0-24","25-34","35-44","45-54","55-64","65+"), side=2, line=0.3, las=1,at=1:6,cex=1.25)
image.plot(data_matrix,col = heat.colors(12), legend.only=TRUE,legend.mar=1)
``````

This gives a heatmap, which is a particularly useful visualisation for large datasets, where spotting patterns is not feasible from simply looking at the numbers.
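Because matrix() fills column by column, each typed line of six values in the heatmap code becomes one column (one day) of a matrix with six rows (age groups). A quick way to check this, sketched here with dimnames added purely for illustration, is to label the matrix and look up a known cell:

```r
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
ages <- c("0-24", "25-34", "35-44", "45-54", "55-64", "65+")

# Same values as the heatmap, with row and column names attached
data_matrix <- matrix(c(17, 30, 25, 15, 8, 5,
                        25, 45, 33, 23, 12, 7,
                        15, 27, 22, 13, 7, 4,
                        17, 24, 25, 10, 4, 4,
                        12, 20, 20, 9, 4, 2,
                        16, 22, 18, 7, 5, 2,
                        14, 20, 16, 5, 5, 4),
                      ncol = 7, dimnames = list(ages, days))

print(dim(data_matrix))                # 6 rows (age groups), 7 columns (days)
print(data_matrix["25-34", "Monday"])  # second value of the second typed line: 45
```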

In addition, R provides support for 3D graphs. The following code illustrates the sinc function, where sinc(x,y)=sin(r)/r, r=sqrt(x^2+y^2):

``````
x <- seq(-20, 20, length=60)
y <- x
f <- function(x, y) {r <- sqrt(x^2+y^2); sin(r)/r}
z <- outer(x, y, f)
z[is.na(z)] <- 1 # sin(r)/r is 0/0 at r=0, where its limit is 1
op <- par(bg = "white")
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75, ticktype = "detailed",
      xlab = "", ylab = "", zlab = "", box=FALSE)
par(op) # restore the previous graphical settings
``````

This can be varied easily, for example by using sin(r) or the wave function z=sin(x)+cos(y).

# Conclusion

The functions described above cover most of what typical R users need to produce appealing and presentable data visualisations. Results may vary from one machine and browser to another, but with a little experimentation and the use of R’s help() function, a surprising level of quality can be achieved with very little complex coding required.

Authored by: Liam Murray

Liam Murray is a data driven individual with a passion for Mathematics, Machine Learning, Data Mining and Business Analytics. Most recently, Liam has focused on Big Data Analytics – leveraging Hadoop and statistically driven languages such as R and Python to solve complex business problems. Previously, Liam spent more than six years within the finance industry working on power, renewables & PFI infrastructure sector projects with a focus on the financing of projects as well as the ongoing monitoring of existing assets. As a result, Liam has an acute awareness of the needs & challenges associated with supporting the advanced analytics requirements of an organization.