# Introduction to R

Producing basic statistical analysis can be quite complex with many of the programming languages used in mathematical programming such as C++, Java, VBA or Python.

For this reason a number of data-centric languages have been designed to simplify this process, such as R, SAS and Matlab. R has gained popularity amongst data scientists as it is an open source project (unlike SAS and Matlab), is easy to work with and offers a large number of libraries dedicated to data exploration.

For example, doubling all the elements of an array **x** of unknown length in Java requires a loop such as **for(int i=0;i<x.length;i++) x[i]*=2;**, while in R simply writing **x <- x*2** suffices, without needing to know the length of the array. This elegance makes R very useful in data science, where one can master R without necessarily having a prior programming background.
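As a minimal sketch of this vectorised behaviour (the sample values are illustrative and not taken from any dataset), the same operation works on a vector of any length:

```r
# Illustrative values only; any numeric vector works
x <- c(1.5, 2, 10)
x <- x * 2            # doubles every element, no loop or length needed
print(x)

# Other arithmetic is vectorised in the same way
squares <- (1:5)^2
print(squares)
```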

# Data Visualisation in R

R’s elegance also applies to data visualisations, where graphics can be created with a few simple lines of code. During the developmental phases of projects this is useful for quick visual checks of data. For instance, the following three lines of code are all that are needed to show UK population estimates (in millions) from 2005 to 2014^{[1]}.

```
x <- 2005:2014
y <- c(59.79,60.18,60.62,61.07,61.57,62.04,62.51,63.02,63.5,63.9)
plot(x,y)
```

This shows how useful **plot()** is when checking data and models for inconsistencies. However, when finishing off a project for presentation or reporting to a wider audience, a bit more effort is needed to make the plot presentable.

# File Types

R supports many file types including pdf, png, bmp and jpeg. General tips for choosing which file type to use can be found online^{[2]}. Typically pdf is best for printing and file sharing, as it is usually supported by Linux, Windows and Mac, while png is usually best for web display, as image quality and finer details are often preserved.

The code below creates a png file with a plot of **x** against **y**, assuming these vectors have already been defined in R. This is the main template used to generate the images contained in this post.

```
png(file = "test.png",width = 400, height = 400)
plot(x,y)
dev.off()
```

Using other device drivers is straightforward, e.g. using **pdf()** in place of **png()** in this example gives a pdf document (note that **pdf()** takes its width and height in inches rather than pixels). Executing **dev.off()** closes the device, which writes the plot to file, so the next plot starts on a blank canvas.
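A minimal sketch of the pdf variant follows. The data, the file name and the use of a temporary directory are placeholders chosen for illustration:

```r
# Illustrative data; file name and location are placeholders
x <- 1:10
y <- x^2
out_file <- file.path(tempdir(), "test.pdf")

pdf(file = out_file, width = 5, height = 5)  # pdf() sizes are in inches
plot(x, y)
dev.off()                                    # closes the device and finalises the file

print(file.exists(out_file))
```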

# Basic Plotting Techniques

Typing **help(plot.default)** in an R terminal shows the specification of **plot()**. There are numerous plotting functions dedicated to refining plots, and a **par()** function which allows additional tweaking of parameters, e.g. **par(mar=c(5.1,4.1,4.1,2.1))** sets the margins of the graph to their default sizes.
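Since **par()** changes settings for the whole graphics device, it can help to save the previous values and restore them afterwards. The sketch below assumes a throwaway null device (**pdf(NULL)**) so that nothing is written to disk; the margin values are only an example:

```r
pdf(NULL)                               # null device, so the sketch writes no file
op <- par(mar = c(6, 6, 4.1, 2.1))      # par() returns the previous values
plot(1:10, main = "Wider bottom and left margins")
par(op)                                 # restore the earlier margins
restored_mar <- par("mar")
print(restored_mar)
dev.off()
```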

In general, a high-level function such as **plot()** has to be called before any low-level function that adds to an existing plot. For instance, to use the function **lines()** the following code sequence is required:

```
plot(x,y,type="n") #nothing plotted
lines(x,y)
```

One interesting exception to this is **barplot()**, which creates its own plot, so **plot()** is not required. Executing **barplot(y)** gives a basic bar chart, which will be polished shortly.

# Improving Appearance

The following summarises some of the key functions and parameters that can be used to improve the look of plots. R’s plotting functions tend to have many common parameters, e.g. the parameter **xlab** sets the text label under the x-axis when using both **barplot()** and **plot()**.

| Command | Description | Example |
| --- | --- | --- |
| text() | Places text in the graph | text(1, 1, "(1, 1)") writes "(1, 1)" at the coordinate (1, 1) |
| axis() | Draws an axis and modifies how labels, values and ticks are displayed. Best called after setting axes=FALSE in the main plotting function being used | axis(1) draws a default x-axis. axis(2, pos=0) draws the y-axis going through the point x=0, rather than at the left-most part of the graph |
| cex | Magnifies text | plot(x, y, main="Sample", cex.main=2) draws a graph with a heading double its normal size. Similar effects can be achieved with cex.lab, cex.sub and cex.axis |
| col | Sets colour | plot(x, y, col="blue"). A full specification of character strings used for colors is available from the Columbia University website^{[3]} |
| main, xlab, ylab, sub | Respectively set the heading, x-axis label, y-axis label and sub-heading below the x-axis | plot(x, y, main="A Graph", xlab="X values", ylab="Y values") |
| xaxp, yaxp | Set ticks on x-axis and y-axis | xaxp=c(0, 20, 10) marks ticks at x=0, 2, …, 20 |
| xlim, ylim | Set limits of x and y values. This allows for the creation of whitespace | xlim=c(0,1) ensures the x-axis is drawn from x=0 to x=1 |
| mtext() | Writes text in the margin. Can be used to replace the axis labels | mtext(side = 1, text = "X", line = 4) writes below the x-axis (side=2 for y-axis, 3 for top, 4 for right side). The argument line determines how far from the edge of the plot the text is placed. Setting xlab="" in plot() in effect makes mtext() a replacement for xlab, except with more flexibility with text positioning |
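Several of the entries above can be combined in a single plot. The following is a small sketch on made-up data (a parabola), drawn on a null device so that no file is written; swap **pdf(NULL)** for **png()** or **pdf()** with a file name to save the result:

```r
pdf(NULL)                                   # null device; nothing is written to disk
x <- seq(-5, 5, 0.5)                        # illustrative data
y <- x^2
plot(x, y, type = "l", col = "blue", axes = FALSE,
     xlim = c(-6, 6), ylim = c(0, 30),      # extra whitespace around the curve
     main = "A Parabola", cex.main = 2, xlab = "", ylab = "")
axis(1)                                     # default x-axis
axis(2, pos = 0, las = 1)                   # y-axis drawn through x = 0
text(0, 28, "y = x^2", pos = 4)             # label placed to the right of (0, 28)
mtext(side = 1, text = "X values", line = 3)
dev.off()
```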

The parameter **mar** modifies margin spacing. A detailed tutorial on margin spacing is found on the R-bloggers website^{[4]}. Exact understanding isn’t necessary; it’s enough to play around with the **mar** vector knowing that **par(mar=c(5.1,4.1,4.1,2.1))** sets the default margins at the image’s bottom, left, top and right, respectively.

Putting this all together an improved bar chart is achieved with the following code:

```
x <- 2005:2014
y <- c(59.79,60.18,60.62,61.07,61.57,62.04,62.51,63.02,63.5,63.9)
par(mar=c(6,6,4.1,2.1))
mp <- barplot(y, space=0.5, col="red", main="UK Population 2005-2014",
              xlab="", ylab="Population in millions",
              yaxp=c(59,64,5), ylim=c(59,64), xpd=FALSE,
              cex.main=2, cex.lab=2)
axis(1,at=mp,labels=x,las=2)
mtext(side = 1, text = "Year", line = 4,cex=2)
```
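One detail in the code above is worth spelling out: **barplot()** returns the x-coordinates of the bar midpoints, which is why its result is stored in **mp** and passed to **axis()**. A small sketch with made-up heights and labels, again on a null device:

```r
pdf(NULL)                                     # null device; nothing is written to disk
heights <- c(3, 5, 2)                         # illustrative values
mp <- barplot(heights, space = 0.5)           # returns the bar midpoints
print(mp)
axis(1, at = mp, labels = c("A", "B", "C"))   # labels centred under each bar
dev.off()
```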

Indeed more can be done with this – suppose net profit (or EBITDA) on a household item follows the formula **EBITDA = P(40-P)/10-30**, where **P** is the price charged. Then the following code gives a fairly sharp visualisation of the net profit curve for one item.

```
P <- seq(0,40,0.1)
lwdth <- 3
EBITDA <- (40-P)*P/10-30
par(mar=c(2,6,4.1,2.1))
plot(P,EBITDA,cex.main=2,cex.lab=2,type="l",col="blue",main="Net Profit against Price",xlab="",ylab="Net Profit",axes=FALSE, ylim=c(-30,30),lwd=lwdth)
points(20,10,pch=19,cex=2)
axis(1,pos=0,xaxp=c(0,40,8),labels=c("",as.character(seq(5,40,5))),at=seq(0,40,5),lwd=lwdth)
axis(2,pos=0,yaxp=c(-30,30,12),las=1,lwd=lwdth)
text(11,25,"maximum, where ",cex=1.5)
text(29,25,expression(paste(frac(paste(partialdiff,EBITDA),paste(partialdiff,P))," = 0")),cex=1.5)
arrows(20,22,20,11.5,lwd=lwdth)
mtext("Price",side=1,line=0,cex=2)
```

Additional functions used here include **points()** and **arrows()**, which are self-explanatory, and the parameters **lwd** (line width) and **las** (controlling the orientation of the axis labels). The partial derivative signs and equation were produced using the **expression()** function, which is well documented by R^{[5]}.
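The position of the maximum marked on the graph can be checked numerically. Setting the derivative (40-2P)/10 to zero gives P = 20 and EBITDA = 10; the sketch below confirms this with R's **optimize()** function (the function name **ebitda** is introduced here purely for illustration):

```r
# Net profit as a function of price, as defined in the text
ebitda <- function(P) (40 - P) * P / 10 - 30

# Search the price range 0 to 40 for the maximum
best <- optimize(ebitda, interval = c(0, 40), maximum = TRUE)
print(best$maximum)    # price at the maximum, close to 20
print(best$objective)  # net profit at the maximum, close to 10
```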

# More Advanced Plots

### Histograms and Q-Q plots

R provides the functions **hist()** and **qqplot()** which can be used to check how good the function **rnorm()** is at generating Gaussian numbers.

Firstly, using the **hist()** function:

```
x <- rnorm(10000)
par(mar=c(5.1,5,4.1,2.1))
hist(x,main="Normal Distribution",xlab="Gaussian Value",ylab="Frequency",col="red",cex.lab=2,cex.main=2,xlim=c(-4,4))
```

This appears visually to be similar enough to a Gaussian distribution. A more sensitive test is taking a qqplot against the standard normal distribution:

```
x <- rnorm(1000)
par(mar=c(5.1,5,4.1,2.1))
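# theoretical normal quantiles on the x-axis, sample on the y-axis; qnorm() needs probabilities in [0,1]
qqplot(qnorm(ppoints(length(x))),x,main="Q-Q plot of rnorm(1000)",xlab="",ylab="Sample Quantiles",cex.lab=2,cex.main=2)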
mtext("Theoretical Quantiles",side=1,line=4,cex=2)
qqline(x)
```

The scattered points fall almost on a straight line, suggesting that the **rnorm()** random generator is producing suitably normally distributed random numbers.
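The idea behind a Q-Q plot can also be sketched numerically by comparing a few sample quantiles with the corresponding standard normal quantiles. The seed and the choice of probabilities below are arbitrary assumptions made for illustration:

```r
set.seed(42)                           # arbitrary seed, for reproducibility only
x <- rnorm(10000)
p <- c(0.05, 0.25, 0.5, 0.75, 0.95)    # illustrative probabilities
sample_q <- quantile(x, probs = p)     # quantiles of the simulated data
theory_q <- qnorm(p)                   # quantiles of the standard normal
print(round(cbind(sample_q, theory_q), 2))
```

If the two columns agree closely, the points of the Q-Q plot lie near a straight line.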

### Heat Maps and 3D Plots

Sometimes it’s useful to get an overview of data to spot patterns. The following code takes demographic data for a website’s traffic over one week and plots it using the **image()** function and the **image.plot()** function from the fields package:

```
install.packages("fields")
require(fields)
data_matrix <- matrix(c(17,30,25,15,8,5,
25,45,33,23,12,7,
15,27,22,13,7,4,
17,24,25,10,4,4,
12,20,20,9,4,2,
16,22,18,7,5,2,
14,20,16,5,5,4),ncol=7);
par(oma=c(1.5,1.5,1.5,1.5))
image(1:ncol(data_matrix), 1:nrow(data_matrix), t(data_matrix), col = heat.colors(12), axes = FALSE,xlab="",ylab="")
for (x in 1:ncol(data_matrix))
  for (y in 1:nrow(data_matrix))
    text(x, y, data_matrix[y,x])
mtext(text=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"), side=1, line=0.3, at=1:7,cex=1.25,las=2)
mtext(text=c("0-24","25-34","35-44","45-54","55-64","65+"), side=2, line=0.3, las=1,at=1:6,cex=1.25)
image.plot(data_matrix,col = heat.colors(12), legend.only=TRUE,legend.mar=1)
```

This gives a heatmap:

This is a particularly useful visualisation to look at for large datasets, where spotting patterns in data is not feasible from simply looking at the numbers.
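One point in the heat map code that can be confusing: **matrix()** fills column by column by default, so each typed row of six numbers becomes a column (a day) and each row of **data_matrix** is an age group. A small sketch with made-up numbers:

```r
# Filled column by column: the typed "rows" become columns
m <- matrix(c(1, 2, 3,
              4, 5, 6), ncol = 2)
print(m)
print(dim(m))                            # 3 rows, 2 columns

# Filled row by row instead, using byrow = TRUE
m_byrow <- matrix(c(1, 2, 3,
                    4, 5, 6), nrow = 2, byrow = TRUE)
print(identical(m, t(m_byrow)))          # the two layouts are transposes of each other
```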

In addition, R provides support for 3D graphs. The following code illustrates the **sinc** function, where **sinc(x,y)=sin(r)/r**, **r=sqrt(x^2+y^2)**:

```
require(grDevices) # for trans3d
x <- seq(-20, 20, length=60)
y <- x
f <- function(x, y) {r <- sqrt(x^2+y^2); sin(r)/r}
z <- outer(x, y, f)
z[is.na(z)] <- 1
op <- par(bg = "white")
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
ltheta = 120, shade = 0.75, ticktype = "detailed",
xlab = "", ylab = "", zlab = "", box=FALSE)
```

This can be varied easily, for example by using **sin(r)** or the wave function **z=sin(x)+cos(y)**.
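As a sketch of the wave-function variation, the same **persp()** settings can be reused with a different surface. The grid range of -10 to 10 is an assumption chosen to show a few periods, and the null device means nothing is written to disk:

```r
pdf(NULL)                                          # null device; swap for png() to save
x <- seq(-10, 10, length = 60)
y <- x
z <- outer(x, y, function(x, y) sin(x) + cos(y))   # wave surface on the grid
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75, ticktype = "detailed",
      xlab = "", ylab = "", zlab = "", box = FALSE)
dev.off()
```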

# Conclusion

The functions described above cover most of what typical R users need to produce appealing and presentable data visualisations. Results may vary from one machine and browser to another, but with a little experimentation and the use of R’s **help()** function, a surprising level of quality can be achieved with very little complex coding required.

1. United Kingdom Population
2. 10 tips for making your R graphics look their best, January 30, 2009
3. Colors in R
4. Setting graph margins in R using the par() function and lots of cow milk
5. Mathematical Annotation in R

**Authored by:**

**Liam Murray**

Liam Murray is a data driven individual with a passion for Mathematics, Machine Learning, Data Mining and Business Analytics. Most recently, Liam has focused on Big Data Analytics – leveraging Hadoop and statistically driven languages such as R and Python to solve complex business problems. Previously, Liam spent more than six years within the finance industry working on power, renewables & PFI infrastructure sector projects with a focus on the financing of projects as well as the ongoing monitoring of existing assets. As a result, Liam has an acute awareness of the needs & challenges associated with supporting the advanced analytics requirements of an organization.

