R is good for a lot of things. I have used it for cluster analysis, uplift modeling, logistic regression, and so on. I have also used SAS for the same things. For me, however, what separates the two (other than the price tag) is the graphics in R. In this article I will demonstrate some very basic plotting options, which are probably trivial to an advanced user and only scratch the surface of R’s capabilities.
R has many functions for graphics and plotting. One I use a lot is qplot (quick plot) in the ggplot2 package, and it is the function I use in the next several plots. For convenience, I use the iris data set in these examples. This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
To follow along, you only need to load the ggplot2 package with library(ggplot2).
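Before plotting, it can help to confirm that the data set looks the way it is described. The lines below are a minimal sketch that uses only the library call and base R functions; iris ships with R, so nothing needs to be downloaded.

```r
library(ggplot2)  # the only add-on package the examples rely on

# iris is part of the datasets package that loads with R
str(iris)                   # 150 rows: four numeric measurements plus Species
table(iris$Species)         # 50 flowers for each of the three species
summary(iris$Sepal.Length)  # range of the variable used in the histograms
```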
First, I define a new plot, add a title, and store it as qp. Then I add a histogram layer to it with geom_histogram(), using a bin width of 0.5. I also give the bins a white border to make the histogram more pleasing to the eye; the data shown are the same with or without the colored edges.
qp = qplot(Sepal.Length, data = iris, main = "A Simple Histogram")
qp + geom_histogram(binwidth = 0.5, color = "white")
In this next step, I simply add a fill color, changing from the default to a light gray.
qp + geom_histogram(binwidth = 0.5, color = "white", fill = "lightgray")
Since the white borders do not show up very well, I change the color to black in the next step.
qp + geom_histogram(binwidth = 0.5, color = "black", fill = "gray")
Now, I change the border and fill colors to dark green and light green, respectively.
qp + geom_histogram(colour = "darkgreen", fill = "lightgreen", binwidth = 0.5)
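One detail worth knowing: when qplot is given a single numeric variable it already draws a default histogram layer of its own, so each qp + geom_histogram(...) call above stacks a second layer on top. Recent ggplot2 releases also mark qplot as deprecated (since version 3.4.0, to my knowledge). The following is a sketch of the same dark green and light green histogram built directly with ggplot(), which ends up with exactly one layer. The object name p is my own choice for illustration.

```r
library(ggplot2)

# Same data and styling as the last qplot example, built with ggplot() directly
p <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.5, colour = "darkgreen", fill = "lightgreen") +
  ggtitle("A Simple Histogram")

length(p$layers)  # number of layers in the plot object
print(p)
```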
In this next step, I want to demonstrate a gradient color fill based on the frequency count. The default base color is blue, with dark blue representing low frequency counts and lighter blues appearing as the frequency count increases. This is done by creating a new plot m with the function ggplot (also part of ggplot2). If you want to search the help for everything related to ggplot, just type ??ggplot. ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
m <- ggplot(iris, aes(x = Sepal.Length))
m + geom_histogram(aes(fill = ..count..),binwidth = 0.5)
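The ..count.. notation still runs, but newer ggplot2 versions (3.3.0 and later, as far as I know) offer after_stat(count) for the same purpose. The sketch below uses that form and also swaps the default blue ramp for custom endpoints with scale_fill_gradient(). The green endpoint colors and the names g and bins are my own choices, not part of the original example.

```r
library(ggplot2)

m <- ggplot(iris, aes(x = Sepal.Length))

# after_stat(count) refers to the bin count computed by the histogram stat;
# scale_fill_gradient() sets the colors for the low and high ends of the ramp
g <- m +
  geom_histogram(aes(fill = after_stat(count)), binwidth = 0.5) +
  scale_fill_gradient(low = "darkgreen", high = "lightgreen")

# layer_data() returns the computed bins without drawing anything
bins <- layer_data(g, 1)
sum(bins$count)  # total number of observations across all bins

print(g)
```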
Next, I add a dark blue border to the bins of the histogram.
m + geom_histogram(aes(fill = ..count..), colour = "darkblue", binwidth = 0.5)
Now, I change the fill color of the histogram from the blue gradient to a light blue.
m + geom_histogram(colour = "darkblue", fill = "lightblue", binwidth = 0.5)
This next plot is the same histogram, but with a scale transformation of the frequency count (y) axis. I used the square root of y for this transformation.
m + geom_histogram(color = "black", fill = "gray", binwidth = 0.5, boundary = 4) + coord_trans(y = "sqrt")
For the next set of plots, I use the variables Sepal.Length and Petal.Length with qplot. There are 150 observations for each variable in the iris data set. In this first example, I create a scatter plot of Sepal.Length versus Petal.Length, shown in Figure 1. Notice that there might be a linear relationship between these two variables. We will explore that later.
qplot(x=Sepal.Length, y=Petal.Length, data=iris)
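In the next scatterplot I change the size of the “dots” on the graph using size. The value is wrapped in I() so that qplot treats it as a fixed setting; without I(), qplot maps the value as if it were a variable and adds a legend.
qplot(x = Sepal.Length, y = Petal.Length, data = iris, size = I(2))
The next command just changes the point color from the default black to red, wrapped in I() for the same reason.
qplot(x = Sepal.Length, y = Petal.Length, data = iris, size = I(2), color = I("red"))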
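The difference between setting a constant and mapping a variable is easier to see with ggplot() and geom_point(). The sketch below is illustrative only: the object names are mine, and coloring by Species is just one example of a mapping, not something the plots above do.

```r
library(ggplot2)

base <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length))

# Setting: constants outside aes() apply to every point and produce no legend
fixed <- base + geom_point(size = 2, colour = "red")

# Mapping: a column inside aes() gets its own scale and a legend
by_species <- base + geom_point(aes(colour = Species), size = 2)

print(fixed)
print(by_species)
```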
In the next plot, I use the boxplot option in qplot to show the spread of Petal.Length at each distinct value of Sepal.Length, with the individual points jittered on top. This is just to show the boxplot feature.
qplot(factor(Sepal.Length), Petal.Length, data = iris, geom = c("boxplot", "jitter"))
The next plot is a dotplot of Sepal.Length and is similar to a histogram.
qplot(Sepal.Length, data = iris, geom = "dotplot")
Finally, I revisit the observation that there might be a linear relationship between Sepal.Length and Petal.Length. The easiest way to explore this is with the base plot command and least squares fitting. I first pull the variables of interest from the iris data set, place them in vectors, and plot them.
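The extraction and plotting step is not listed in the article. A minimal sketch follows, assuming the lowercase vector names that the nls call further down expects; the axis labels are my own addition.

```r
# Copy the two columns into plain numeric vectors.
# The lowercase names are assumed from the nls() call used later.
sepal.length <- iris$Sepal.Length
petal.length <- iris$Petal.Length

# Base-graphics scatter plot of the raw data
plot(sepal.length, petal.length,
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)")
```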
Next, I assign an initial guess for the coefficients in the least squares routine, namely the slope and y-intercept in the equation of a line, y=mx+b. Then I use nonlinear least squares (for generality) to fit the line to the data. Finally, I overlay the line on the scatterplot of the data.
p1 = 0.1
p2 = 0.2
fit = nls(petal.length ~ p1*(sepal.length) + p2, start = list(p1 = p1, p2 = p2))
summary(fit)
Formula: petal.length ~ p1 * (sepal.length) + p2
Parameters:
   Estimate Std. Error t value Pr(>|t|)
p1  1.85843    0.08586   21.65   <2e-16 ***
p2 -7.10144    0.50666  -14.02   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8678 on 148 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 3.809e-08
new = data.frame(sepal.length = seq(min(sepal.length), max(sepal.length), length.out = 200))
lines(new$sepal.length, predict(fit, newdata = new))
The fit using nonlinear least squares yields the line Petal.Length = 1.85843 × Sepal.Length − 7.10144.
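Because this model is linear in its two coefficients, ordinary least squares should reproduce the same line. The sketch below is a cross-check rather than part of the original workflow: it refits the model with nls, written with a positive slope term so that p1 matches the reported slope, and compares it with lm. The name ols and the use of abline for the overlay are my own choices.

```r
# Vectors named as in the fit above
sepal.length <- iris$Sepal.Length
petal.length <- iris$Petal.Length

# Nonlinear least squares with the same starting guesses
fit <- nls(petal.length ~ p1 * sepal.length + p2,
           start = list(p1 = 0.1, p2 = 0.2))

# Ordinary least squares on the same straight-line model
ols <- lm(petal.length ~ sepal.length)

coef(fit)  # slope p1 first, then intercept p2
coef(ols)  # intercept first, then slope

# Scatter plot with the fitted line overlaid
plot(sepal.length, petal.length)
abline(ols, col = "red")
```

If the two sets of coefficients agree, lm is the simpler tool for a straight line, and nls can be kept for models that are genuinely nonlinear in their parameters.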
As I previously stated, these are basic examples and show only a small portion of what you can do with plots using R. For more in-depth posts about using R, check out one of the R user groups on LinkedIn.
Jeffrey Strickland, Ph.D.
Jeffrey Strickland, Ph.D., is the author of “Predictive Analytics Using R” and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation, and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the financial and insurance industries for over 20 years. Jeff is a Certified Modeling and Simulation Professional (CMSP) and an Associate Systems Engineering Professional. He has published nearly 200 blogs on LinkedIn, is a frequently invited guest speaker, and is the author of 20 books, including:
- Discrete Event simulation using ExtendSim
- Crime Analysis and Mapping
- Missile Flight Simulation
- Mathematical modeling of Warfare and Combat Phenomenon
- Predictive Modeling and Analytics
- Using Math to Defeat the Enemy
- Verification and Validation for Modeling and Simulation
- Simulation Conceptual Modeling
- System Engineering Process and Practices
- Weird Scientist: the Creators of Quantum Physics
- Albert Einstein: No one expected me to lay a golden eggs
- The Men of Manhattan: the Creators of the Nuclear Era
- Fundamentals of Combat Modeling