Data Visualisations in R

Introduction to R

Producing basic statistical analysis can be quite complex with many of the programming languages used in mathematical programming such as C++, Java, VBA or Python.

For this reason a number of data-centric languages, such as R, SAS and Matlab, have been designed to simplify this process. R has gained popularity amongst data scientists because it is an open source project (unlike SAS and Matlab), is easy to work with and has a large number of libraries dedicated to data exploration.

For example, doubling all the elements of an array x of unknown length in Java requires a loop such as for(int i=0;i<x.length;i++) x[i]*=2; whereas in R simply writing x <- x*2 suffices, without needing to know the length of the array. This elegance makes R very useful in data science, where one can master R without necessarily having a prior programming background.
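As a small illustration of this vectorised style, the sketch below doubles a vector and applies two other element-wise operations. The values in x and the variable names are made up for the example.

```r
# Illustrative values only; any numeric vector works, whatever its length
x <- c(1.5, 2, 4, 8)

doubled   <- x * 2      # element-wise multiplication, no loop needed
shifted   <- x + 1      # the scalar 1 is recycled across every element
above_two <- x[x > 2]   # logical indexing keeps the elements greater than 2

print(doubled)
print(shifted)
print(above_two)
```

The same expressions work unchanged on a vector of ten elements or ten million, which is what makes quick data checks so concise.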

Data Visualisation in R

R’s elegance also applies to data visualisations, where graphics can be created with a few simple lines of code. During the developmental phases of projects this is useful for quick visual checks of data. For instance, the following three lines of code are all that are needed to plot UK population estimates for 2005-2014[1].

x <- 2005:2014
y <- c(59.79,60.18,60.62,61.07,61.57,62.04,62.51,63.02,63.5,63.9)
plot(x,y)

Basic Plot

This shows how useful plot() is for checking data and models for inconsistencies. However, when a project is being finished off for presentation and reporting to a wider audience, a little more effort is needed to make the graphics presentable.

File Types

R supports many file types including pdf, png, bmp and jpeg. General tips for choosing which file type to use can be found online[2]. Typically pdf is best for printing and file sharing, as it is usually supported by Linux, Windows and Mac, while png is usually best for web display, as image quality and finer details are often preserved.

The code below creates a png file with a plot of x against y, assuming these vectors have already been defined in R. This is the main template used to generate the images contained in this post.

png(file = "test.png",width = 400, height = 400)
plot(x,y)
dev.off()

Using other device drivers is straightforward, e.g. using pdf() in place of png() in this example gives a pdf document (note that pdf() takes its width and height in inches rather than pixels). Executing dev.off() closes the device, which writes the plot to file; the next plotting call then starts from a blank canvas.
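A minimal sketch of the pdf variant is given below. The output path comes from tempfile() purely so that the example does not overwrite anything; the name out_file and the 6 inch page size are assumptions for illustration.

```r
# Same data as the earlier example
x <- 2005:2014
y <- c(59.79,60.18,60.62,61.07,61.57,62.04,62.51,63.02,63.5,63.9)

# tempfile() avoids overwriting an existing file; substitute any path you like
out_file <- tempfile(fileext = ".pdf")

pdf(file = out_file, width = 6, height = 6)  # pdf() sizes are in inches, not pixels
plot(x, y)
dev.off()                                    # closes the device and finalises the file

file.exists(out_file)
```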

Basic Plotting Techniques

Typing help(plot.default) in an R terminal reveals the specification of plot(). There are numerous plotting functions dedicated to refining plots, and a par() function which allows additional tweaking of graphical parameters, e.g. par(mar=c(5.1,4.1,4.1,2.1)) sets the margins of the graph to their default sizes.

In general, a high-level function such as plot() has to be called before low-level functions such as lines(), points() or text(), which only add to an existing plot. For instance, to use the function lines() the following code sequence is required:

plot(x,y,type="n") #nothing plotted
lines(x,y)

Example of using the lines() function

However, one interesting exception to this is barplot(), which starts a new plot itself, so plot() is not required. Executing barplot(y) gives the following bar chart, which will be polished shortly.

Simple Bar Plot
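One detail worth knowing before polishing the chart is that barplot() also returns the x-coordinates of the bar midpoints. The sketch below stores them in a variable (called mp here) and uses them to place year labels under the bars.

```r
y <- c(59.79,60.18,60.62,61.07,61.57,62.04,62.51,63.02,63.5,63.9)

# barplot() draws the chart and also returns the x-coordinate of each bar's midpoint
mp <- barplot(y)

# The midpoints can then position tick marks and labels under the bars
axis(1, at = mp, labels = 2005:2014)
print(mp)
```

The midpoints are not simply 1, 2, 3, ... because the default bar width and spacing are taken into account, which is why they are captured rather than guessed.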

Improving Appearance

The following summarises some of the key functions and parameters that can be used to improve the look of plots. R’s plotting functions tend to have many parameters in common, e.g. the parameter xlab sets the text label under the x-axis in both barplot() and plot().

Command | Description | Example
--- | --- | ---
text() | Places text in the graph | text(1, 1, "(1, 1)") writes "(1, 1)" at the coordinate (1, 1)
axis() | Draws an axis and modifies how labels, values and ticks are displayed. Best called after setting axes=FALSE in the main plotting function being used | axis(1) draws a default x-axis. axis(2, pos=0) draws the y-axis going through the point x=0, rather than at the left-most part of the graph
cex | Magnifies text | plot(x, y, main="Sample", cex.main=2) draws a graph with a heading double its normal size. Similar effects can be achieved with cex.lab, cex.sub and cex.axis
col | Sets colour | plot(x, y, col="blue"). A full specification of character strings used for colors is available from the Columbia University website[3]
main, xlab, ylab, sub | Respectively set the heading, x-axis label, y-axis label and sub-heading below the x-axis | plot(x, y, main="A Graph", xlab="X values", ylab="Y values")
xaxp, yaxp | Set ticks on the x-axis and y-axis | xaxp=c(0, 20, 10) marks ticks at x=0, 2, …, 20
xlim, ylim | Set limits of x and y values. This allows for the creation of whitespace | xlim=c(0, 1) ensures the x-axis is drawn from x=0 to x=1
mtext() | Writes text in the margin. Can be used to replace the axis labels | mtext(side = 1, text = "X", line = 4) writes in the margin below the x-axis (side=2 for y-axis, 3 for top, 4 for right side). The argument line determines how far from the plot edge the text is placed. Setting xlab="" in plot() in effect makes mtext() a replacement for xlab, except with more flexibility with text positioning

The parameter mar modifies margin spacing. A detailed tutorial on margin spacing can be found on the R-bloggers website[4]. Exact understanding isn’t necessary; it’s enough to experiment with the mar vector, knowing that par(mar=c(5.1,4.1,4.1,2.1)) sets the default margins at the image’s bottom, left, top and right, respectively.
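When experimenting with margins it helps to be able to get back to where you started. The sketch below relies on par() returning the previous values of whatever it changes; the variable name old is just an illustrative choice.

```r
# par() returns the previous values of any parameters it changes
old <- par(mar = c(6, 6, 4.1, 2.1))   # bottom, left, top, right (in lines of text)

par("mar")                            # query the margins now in effect
plot(1:10)                            # this plot uses the wider bottom and left margins

par(old)                              # restore the margins saved in 'old'
par("mar")
```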

Putting this all together, an improved bar chart is achieved with the following code:

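x <- 2005:2014
y <- c(59.79,60.18,60.62,61.07,61.57,62.04,62.51,63.02,63.5,63.9)
par(mar=c(6,6,4.1,2.1)) # widen the bottom and left margins for the larger labels
# barplot() returns the bar midpoints, used below to position the year labels
mp <- barplot(y, space=0.5, col="red", main="UK Population 2005-2014",
              xlab="", ylab="Population in millions",
              yaxp=c(59,64,5), ylim=c(59,64), xpd=FALSE,
              cex.main=2, cex.lab=2)
axis(1, at=mp, labels=x, las=2) # years written vertically under each bar
mtext(side=1, text="Year", line=4, cex=2)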

UK Population

More can be done with this. Suppose net profit (or EBITDA) on a household item follows the formula EBITDA = P(40-P)/10-30, where P is the price charged. Then the following code gives a fairly sharp visualisation of the net profit curve for one item.

P <- seq(0,40,0.1)
lwdth <- 3
EBITDA <- (40-P)*P/10-30
par(mar=c(2,6,4.1,2.1))
plot(P,EBITDA,cex.main=2,cex.lab=2,type="l",col="blue",main="Net Profit against Price",xlab="",ylab="Net Profit",axes=FALSE, ylim=c(-30,30),lwd=lwdth)
points(20,10,pch=19,cex=2)
axis(1,pos=0,xaxp=c(0,40,8),labels=c("",as.character(seq(5,40,5))),at=seq(0,40,5),lwd=lwdth)
axis(2,pos=0,yaxp=c(-30,30,12),las=1,lwd=lwdth)
text(11,25,"maximum, where ",cex=1.5)
text(29,25,expression(paste(frac(paste(partialdiff,EBITDA),paste(partialdiff,P))," = 0")),cex=1.5)
arrows(20,22,20,11.5,lwd=lwdth)
mtext("Price",side=1,line=0,cex=2)

Net profit curve

Additional functions used here include points() and arrows(), which are self-explanatory, and the parameters lwd (line width) and las (controlling the orientation of the axis labels). The partial derivative signs and equation were produced using the expression() function, which is well documented by R[5].
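The maximum marked on the curve at (20, 10) follows from setting the derivative (40-2P)/10 to zero, giving P=20 and EBITDA=10. As a quick numerical cross-check, the sketch below uses optimize() from base R; the function name profit and the variable best are illustrative.

```r
# Profit formula from the text
profit <- function(P) (40 - P) * P / 10 - 30

# optimize() searches the interval [0, 40]; maximum = TRUE looks for a peak
best <- optimize(profit, interval = c(0, 40), maximum = TRUE)

best$maximum    # price at the peak, approximately 20
best$objective  # profit at the peak, approximately 10
```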

More Advanced Plots

Histograms and Q-Q plots

R provides the functions hist() and qqplot(), which can be used to check how well the function rnorm() generates normally distributed (Gaussian) numbers.

Firstly using the hist function:

x <- rnorm(10000)
par(mar=c(5.1,5,4.1,2.1))
hist(x,main="Normal Distribution",xlab="Gaussian Value",ylab="Frequency",col="red",cex.lab=2,cex.main=2,xlim=c(-4,4))

Gaussian Frequency Distribution

This appears visually to be similar enough to a Gaussian distribution. A more sensitive test is a Q-Q plot against the standard normal distribution:

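x <- rnorm(1000)
par(mar=c(5.1,5,4.1,2.1))
# qnorm() expects probabilities, so ppoints() supplies evenly spaced values in (0,1);
# theoretical quantiles go on the x-axis to match the axis labels and qqline()
qqplot(qnorm(ppoints(length(x))),x,main="Q-Q plot of rnorm(1000)",xlab="",ylab="Sample Quantiles",cex.lab=2,cex.main=2)
mtext("Theoretical Quantiles",side=1,line=4,cex=2)
qqline(x)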

Q-Q plot of rnorm() against standard normal

The scattered points almost fall in a straight line, showing that the rnorm() random generator is producing suitably normally distributed random numbers.
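The "almost a straight line" impression can also be put into a single number. The sketch below is one rough way of doing so, not a formal normality test: it correlates the ordered sample with the matching standard normal quantiles. The seed and the variable names are arbitrary choices for illustration.

```r
set.seed(42)                      # arbitrary seed, only to make the sketch repeatable
x <- rnorm(1000)

# Theoretical standard normal quantiles at the same probability points
theoretical <- qnorm(ppoints(length(x)))

# Correlation between ordered sample and theoretical quantiles; close to 1 for normal data
qq_cor <- cor(sort(x), theoretical)
print(qq_cor)
```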

Heat Maps and 3D Plots

Sometimes it’s useful to get an overview of data to spot patterns. The following code takes demographic data for a website’s traffic over one week and plots it using the image() function and the image.plot() function from the fields package:

install.packages("fields") # only needed once
library(fields)            # provides image.plot()
# Each line of values below is one day (Sunday first), with one value per age group.
# matrix() fills column-wise, giving 6 rows (age groups) and 7 columns (days)
data_matrix <- matrix(c(17,30,25,15,8,5,
                        25,45,33,23,12,7,
                        15,27,22,13,7,4,
                        17,24,25,10,4,4,
                        12,20,20,9,4,2,
                        16,22,18,7,5,2,
                        14,20,16,5,5,4),ncol=7)
par(oma=c(1.5,1.5,1.5,1.5))
image(1:ncol(data_matrix), 1:nrow(data_matrix), t(data_matrix), col = heat.colors(12), axes = FALSE,xlab="",ylab="")
# write each cell's value on top of its coloured tile
for (day in 1:ncol(data_matrix))
  for (age in 1:nrow(data_matrix))
    text(day, age, data_matrix[age,day])
mtext(text=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"), side=1, line=0.3, at=1:7,cex=1.25,las=2)
mtext(text=c("0-24","25-34","35-44","45-54","55-64","65+"), side=2, line=0.3, las=1,at=1:6,cex=1.25)
image.plot(data_matrix,col = heat.colors(12), legend.only=TRUE,legend.mar=1)

This gives a heatmap:

Heatmap of Web Traffic by Day and User Age

This is a particularly useful visualisation for large datasets, where spotting patterns is not feasible from simply looking at the numbers.
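Because matrix() fills column by column, it is easy to lose track of which dimension holds the days and which the age groups. The sketch below attaches names to the same 42 values so that cells can be looked up by label; the variable name traffic and the dimension names age and day are assumptions for illustration.

```r
# Same 42 values as above: each line of the listing is one day, Sunday first
traffic <- matrix(c(17,30,25,15,8,5,
                    25,45,33,23,12,7,
                    15,27,22,13,7,4,
                    17,24,25,10,4,4,
                    12,20,20,9,4,2,
                    16,22,18,7,5,2,
                    14,20,16,5,5,4), ncol = 7)

# matrix() fills column by column, so each day becomes a column and each age band a row
dimnames(traffic) <- list(
  age = c("0-24","25-34","35-44","45-54","55-64","65+"),
  day = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))

dim(traffic)                                    # 6 rows (age bands) by 7 columns (days)
traffic["25-34", "Monday"]                      # look up a single cell by label
which(traffic == max(traffic), arr.ind = TRUE)  # row and column of the busiest cell
```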

In addition, R provides support for 3D graphs. The following code illustrates the sinc function, where sinc(x,y)=sin(r)/r, r=sqrt(x^2+y^2):

x <- seq(-20, 20, length=60)
y <- x
f <- function(x, y) {r <- sqrt(x^2+y^2); sin(r)/r}
z <- outer(x, y, f)
z[is.na(z)] <- 1 # sin(0)/0 is NaN in R; the limit of sinc at r=0 is 1
op <- par(bg = "white")
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75, ticktype = "detailed",
      xlab = "", ylab = "", zlab = "", box=FALSE)
par(op) # restore the previous background setting

Sinc function in 3D

This can be varied easily using z=sin(r) or the wave function z=sin(x)+cos(y).

z=sin(r) and the wave function
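A sketch of how the two variants might be produced is given below, reusing the grid and viewing angles from the sinc plot. The names z_ripple and z_wave are illustrative and not taken from the original figures.

```r
x <- seq(-20, 20, length = 60)
y <- x

# Variant 1: z = sin(r), with r the distance from the origin
z_ripple <- outer(x, y, function(x, y) sin(sqrt(x^2 + y^2)))

# Variant 2: the wave function z = sin(x) + cos(y)
z_wave <- outer(x, y, function(x, y) sin(x) + cos(y))

# Same viewing options as the sinc plot above
persp(x, y, z_ripple, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75, box = FALSE)
persp(x, y, z_wave, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75, box = FALSE)
```

Only the function passed to outer() changes between the plots, so other surfaces can be tried by swapping in a different expression.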

Conclusion

The functions described above cover most of what typical R users need to produce appealing and presentable data visualisations. Results may vary from one machine and browser to another, but with a little experimentation and the use of R’s help() function, a surprising level of quality can be achieved with very little complex coding required.

  1. United Kingdom Population
  2. 10 tips for making your R graphics look their best, January 30, 2009
  3. Colors in R
  4. Setting graph margins in R using the par() function and lots of cow milk
  5. Mathematical Annotation in R

Authored by:
Liam Murray

Liam Murray is a data driven individual with a passion for Mathematics, Machine Learning, Data Mining and Business Analytics. Most recently, Liam has focused on Big Data Analytics – leveraging Hadoop and statistically driven languages such as R and Python to solve complex business problems. Previously, Liam spent more than six years within the finance industry working on power, renewables & PFI infrastructure sector projects with a focus on the financing of projects as well as the ongoing monitoring of existing assets. As a result, Liam has an acute awareness of the needs & challenges associated with supporting the advanced analytics requirements of an organization.
