Graphics R Tricky

R is good for a lot of stuff. I have used it for cluster analysis, uplift modeling, logistic regression, and so on. I have also used SAS for the same things. For me, however, what separates the two (other than price tag) is the graphics in R. In this article I will demonstrate some very basic plotting options, which are probably trivial to an advanced user and only scratch the surface of R’s capability.

R has many functions for graphics/plotting. One I use a lot is qplot (quick plot) in the ggplot2 package, and it is the function I use in the next several plots. For convenience, I use the iris data set in these examples. This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
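
Before plotting, it can help to look at the data itself. The following is a minimal sketch using only base R; iris ships with a standard R installation, and the variable name species_counts is just an illustrative choice.

```r
# iris is a built-in data frame: 4 numeric measurements plus a Species factor.
str(iris)

# Number of flowers recorded for each species.
species_counts <- table(iris$Species)
print(species_counts)

# Range and quartiles of the variable used in the histograms that follow.
print(summary(iris$Sepal.Length))
```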

To follow along, you only need to load the ggplot2 package with

library(ggplot2)

First, I define a new plot with a title and store it as qp. Then I add a histogram layer to qp with the function geom_histogram(), using a bin width of 0.5. In addition, I color the borders of the bins white to make the histogram more pleasing to the eye; it is a histogram with or without the borders colored.

qp=qplot(Sepal.Length, data=iris, main="A Simple Histogram")

qp+geom_histogram(binwidth = 0.5, color="white")

Figure 1. Histogram of Sepal.Length using qplot and default fill color
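
Recent ggplot2 releases mark qplot() as deprecated, so it may be useful to see the same histogram written with ggplot(). This is a sketch of an equivalent plot, not the code that produced Figure 1; the object names p_hist and bins are illustrative, and layer_data() is used only to inspect what was drawn.

```r
library(ggplot2)

# The histogram from Figure 1, built with ggplot() instead of qplot().
p_hist <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.5, colour = "white") +
  ggtitle("A Simple Histogram")

print(p_hist)

# Inspect the computed bins: one row per bar, with its frequency count.
bins <- layer_data(p_hist, 1)
print(bins[, c("xmin", "xmax", "count")])
```

Every flower should land in exactly one bin, so the counts should add up to the number of rows in iris.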

In this next step, I simply add a fill color, changing from the default to a light gray.

qp+geom_histogram(binwidth = 0.5, color="white", fill="lightgray")

Figure 2. Histogram of Sepal.Length with a light gray fill color

Since the white borders do not show up very well, I change the color to black in the next step.

qp+geom_histogram(binwidth = 0.5, color="black", fill="lightgray")

Figure 3. Histogram of Sepal.Length with black bin borders and a light gray fill color

Now, I change the border and fill colors to dark green and light green, respectively.

qp+geom_histogram(colour = "darkgreen", fill = "lightgreen", binwidth = 0.5)

Figure 4. Histogram of Sepal.Length with dark green bin borders and light green fill color

In this next step, I want to demonstrate a gradient color fill based on the frequency count. The default base color is blue, with dark blue representing low frequency counts and lighter blues representing higher counts. To do this, I create a new plot m using the function ggplot (also a part of ggplot2). If you want to read the documentation for ggplot, just type ?ggplot, or ??ggplot to search all the help pages that mention it. ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.

m <- ggplot(iris, aes(x = Sepal.Length))

m + geom_histogram(aes(fill = ..count..),binwidth = 0.5)

Figure 5. Histogram of Sepal.Length using the gradient fill color based on frequency count
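
The ..count.. notation still reflects how this was written at the time; newer ggplot2 releases (3.3.0 and later, to my knowledge) offer after_stat(count) for the same purpose. The sketch below uses that form and also sets the gradient endpoints explicitly with scale_fill_gradient(). The color names are my own illustrative choices, not the package defaults, and the object names p_grad and grad_bins are hypothetical.

```r
library(ggplot2)

# Fill each bar according to its computed frequency count.
p_grad <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(fill = after_stat(count)), binwidth = 0.5) +
  # Explicit endpoints: low counts dark, high counts light (illustrative colors).
  scale_fill_gradient(low = "darkblue", high = "lightblue")

print(p_grad)

# The fill column holds one color per bar, derived from the count column.
grad_bins <- layer_data(p_grad, 1)
print(grad_bins[, c("count", "fill")])
```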

Next, I add a dark blue border to the bins of the histogram.

m + geom_histogram(aes(fill = ..count..),colour = "darkblue", binwidth = 0.5)

Figure 6. Histogram of Sepal.Length as in Figure 5, but with dark blue bin borders

Now, I change the fill color of the histogram from the blue gradient to a light blue.

m + geom_histogram(colour = "darkblue", fill = "lightblue", binwidth = 0.5)

Figure 7. Histogram of Sepal.Length changing the gradient fill to a light blue fill

This next plot is the same histogram, but with a scale transformation of the frequency count (y) axis. I used the square root of y for this transformation.

m + geom_histogram(color="black", fill="gray", binwidth = 0.5, origin = 4) + coord_trans(y = "sqrt")

Figure 8. Histogram of Sepal.Length using a square-root-scaled frequency count (y) axis
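
To see what the square root transformation does to the bar heights, the counts can be tabulated by hand. This is a base R sketch that assumes 0.5-wide bins starting at 4 and closed on the right; ggplot2 may treat values that fall exactly on a bin edge differently, so take it as an approximation of the figure rather than its exact data. The names breaks, bin, and bin_table are illustrative.

```r
# 0.5-wide bins from 4 to 8 cover the whole range of Sepal.Length.
breaks <- seq(4, 8, by = 0.5)
bin <- cut(iris$Sepal.Length, breaks = breaks, include.lowest = TRUE)

# Frequency count per bin and its square root (the transformed bar height).
counts <- as.vector(table(bin))
bin_table <- data.frame(
  bin = levels(bin),
  count = counts,
  sqrt_count = round(sqrt(counts), 2)
)
print(bin_table)
```

The transformation compresses tall bars more than short ones, which makes the sparse bins at either end easier to see.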

For the next set of plots, I use the variables Sepal.Length and Petal.Length with qplot. There are 150 observations for each variable in the iris data set. In this first example, I create a scatter plot of Sepal.Length versus Petal.Length, shown in Figure 9. Notice that there might be a linear relationship between these two variables. We will explore that later.

qplot(x=Sepal.Length, y=Petal.Length, data=iris)

Figure 9. Sepal.Length versus Petal.Length plot using qplot

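In the next scatterplot I change the size of the “dots” on the graph using size. The value is wrapped in I() so that qplot treats it as a fixed setting rather than as a variable to map to a legend.

qplot(x=Sepal.Length, y=Petal.Length, data=iris, size=I(2))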

Figure 10. Scatterplot of Sepal.Length versus Petal.Length using the size option

The next command just changes the plot color from the default black to red. The constants are wrapped in I() so that qplot uses them as fixed settings.

qplot(x=Sepal.Length, y=Petal.Length, data=iris, size=I(2), color=I("red"))

Figure 11. Scatterplot of Sepal.Length versus Petal.Length using the color option
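
The same size and color choices can be sketched with ggplot(), where the difference between setting a constant and mapping a variable is explicit: constants go outside aes(), variables go inside it. This is an illustrative equivalent rather than the code behind Figure 11, and the names p_pts and pts are hypothetical.

```r
library(ggplot2)

# size and colour are fixed settings here, so they sit outside aes()
# and no legend is drawn for them.
p_pts <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(size = 2, colour = "red")

print(p_pts)

# Every point carries the same size and colour in the built layer.
pts <- layer_data(p_pts, 1)
print(unique(pts[, c("size", "colour")]))
```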

In the next plot, I use the boxplot option in qplot to show the “spread” of Petal.Length at each distinct value of Sepal.Length, with the jittered data points overlaid. This is just to show the boxplot feature.

qplot(factor(Sepal.Length), Petal.Length, data = iris, geom=c("boxplot", "jitter"))

Figure 12. Boxplots of Petal.Length for each value of Sepal.Length, with jittered points
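
The numbers behind each box can be approximated in base R with fivenum(), which returns the minimum, lower hinge, median, upper hinge, and maximum. This is a sketch under the assumption that those hinges are close enough to the quartiles ggplot2 draws; the two can differ slightly for small groups, and the name spread is illustrative.

```r
# One five-number summary of Petal.Length per distinct Sepal.Length value.
spread <- tapply(iris$Petal.Length, factor(iris$Sepal.Length), fivenum)

# Number of groups, which is also the number of boxes in the plot.
print(length(spread))

# Summary for the flowers whose Sepal.Length is exactly 5.
print(spread[["5"]])
```

Groups with only one or two flowers collapse to a flat or very short box, which is why some boxes in the plot look like single lines.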

The next plot is a dotplot of Sepal.Length and is similar to a histogram.

qplot(Sepal.Length, data = iris, geom = "dotplot")

Figure 13. Dotplot of Sepal.Length
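
A dotplot also bins the data, so the bin width affects how the dots stack. The sketch below writes the plot with ggplot() and sets binwidth explicitly; the value 0.1 is an arbitrary choice of mine that matches the measurement resolution of the data, not a setting taken from Figure 13, and the names p_dot and dots are hypothetical.

```r
library(ggplot2)

# One dot per flower, stacked within 0.1-wide bins of Sepal.Length.
p_dot <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_dotplot(binwidth = 0.1)

print(p_dot)

# Each row is one stack of dots; count is the number of dots in it.
dots <- layer_data(p_dot, 1)
print(head(dots[, c("x", "count")]))
```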

Finally, I revisit the observation that there might be a linear relationship between Sepal.Length and Petal.Length. The easiest way to do this is with the plot command and least squares fitting. I first pulled the variables of interest from the iris data set, placed them in vectors, and plotted them.

sepal.length=c(5.1,4.9,4.7,4.6,5,5.4,4.6,5,4.4,4.9,5.4,4.8,4.8,4.3,5.8,5.7,5.4,5.1,5.7,5.1,5.4,5.1,4.6,5.1,4.8,5,5,5.2,5.2,4.7,4.8,5.4,5.2,5.5,4.9,5,5.5,4.9,4.4,5.1,5,4.5,4.4,5,5.1,4.8,5.1,4.6,5.3,5,7,6.4,6.9,5.5,6.5,5.7,6.3,4.9,6.6,5.2,5,5.9,6,6.1,5.6,6.7,5.6,5.8,6.2,5.6,5.9,6.1,6.3,6.1,6.4,6.6,6.8,6.7,6,5.7,5.5,5.5,5.8,6,5.4,6,6.7,6.3,5.6,5.5,5.5,6.1,5.8,5,5.6,5.7,5.7,6.2,5.1,5.7,6.3,5.8,7.1,6.3,6.5,7.6,4.9,7.3,6.7,7.2,6.5,6.4,6.8,5.7,5.8,6.4,6.5,7.7,7.7,6,6.9,5.6,7.7,6.3,6.7,7.2,6.2,6.1,6.4,7.2,7.4,7.9,6.4,6.3,6.1,7.7,6.3,6.4,6,6.9,6.7,6.9,5.8,6.8,6.7,6.7,6.3,6.5,6.2,5.9)

petal.length=c(1.4,1.4,1.3,1.5,1.4,1.7,1.4,1.5,1.4,1.5,1.5,1.6,1.4,1.1,1.2,1.5,1.3,1.4,1.7,1.5,1.7,1.5,1,1.7,1.9,1.6,1.6,1.5,1.4,1.6,1.6,1.5,1.5,1.4,1.5,1.2,1.3,1.4,1.3,1.5,1.3,1.3,1.3,1.6,1.9,1.4,1.6,1.4,1.5,1.4,4.7,4.5,4.9,4,4.6,4.5,4.7,3.3,4.6,3.9,3.5,4.2,4,4.7,3.6,4.4,4.5,4.1,4.5,3.9,4.8,4,4.9,4.7,4.3,4.4,4.8,5,4.5,3.5,3.8,3.7,3.9,5.1,4.5,4.5,4.7,4.4,4.1,4,4.4,4.6,4,3.3,4.2,4.2,4.2,4.3,3,4.1,6,5.1,5.9,5.6,5.8,6.6,4.5,6.3,5.8,6.1,5.1,5.3,5.5,5,5.1,5.3,5.5,6.7,6.9,5,5.7,4.9,6.7,4.9,5.7,6,4.8,4.9,5.6,5.8,6.1,6.4,5.6,5.1,5.6,6.1,5.6,5.5,4.8,5.4,5.6,5.1,5.1,5.9,5.7,5.2,5,5.2,5.4,5.1)

plot(sepal.length,petal.length)
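
Since iris is already loaded in every R session, the two vectors can also be taken straight from the data frame instead of being typed out. This is a shorter sketch of the same step; the axis labels are my own additions.

```r
# Pull the two columns directly from the built-in data frame.
sepal.length <- iris$Sepal.Length
petal.length <- iris$Petal.Length

# Same scatterplot as above, with explicit axis labels.
plot(sepal.length, petal.length,
     xlab = "Sepal.Length (cm)", ylab = "Petal.Length (cm)")
```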

Next, I assigned an initial guess for the coefficients in the least squares routine, namely the slope and y-intercept in the equation of a line, y=mx+b. Then I used nonlinear least squares (for generality) to fit the line to the data. Finally, I overlaid the line on the scatterplot of the data.

p1 = 0.1

p2 = 0.2

fit = nls(petal.length ~ p1*(sepal.length)+p2, start=list(p1=p1,p2=p2))

summary(fit)

Formula: petal.length ~ p1 * (sepal.length) + p2

Parameters:
   Estimate Std. Error t value Pr(>|t|)
p1  1.85843    0.08586   21.65   <2e-16 ***
p2 -7.10144    0.50666  -14.02   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8678 on 148 degrees of freedom

Number of iterations to convergence: 1
Achieved convergence tolerance: 3.809e-08

new = data.frame(sepal.length = seq(min(sepal.length),max(sepal.length),len=200))

lines(new$sepal.length,predict(fit,newdata=new))

Figure 14. Scatterplot of Sepal.Length versus Petal.Length with fitted line
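
For readers who would rather stay in ggplot2, a plot like Figure 14 can be sketched by adding geom_smooth() with method = "lm" to a scatterplot. This swaps in an ordinary linear model for the nls() call used above, so it is an alternative route to the same straight line rather than the method behind the figure. The names p_fit and smooth_line are hypothetical.

```r
library(ggplot2)

# Scatterplot with a least squares line; se = FALSE hides the confidence band.
p_fit <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p_fit)

# Recover the slope of the drawn line from its first and last points.
smooth_line <- layer_data(p_fit, 2)
n <- nrow(smooth_line)
slope <- (smooth_line$y[n] - smooth_line$y[1]) /
  (smooth_line$x[n] - smooth_line$x[1])
print(slope)
```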

The fit using nonlinear least squares yields the line Petal.Length = 1.85843 Sepal.Length – 7.10144.
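
Because the model is a straight line, the coefficients can be cross-checked in two simpler ways: the closed-form least squares formulas (slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)) and the lm() function. This is a sketch for verification only, and the variable names x, y, m, b, and fit_lm are my own.

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Closed-form least squares estimates for the line y = m*x + b.
m <- cov(x, y) / var(x)
b <- mean(y) - m * mean(x)
print(c(slope = m, intercept = b))

# The same line fitted with lm(); coef() returns intercept first, then slope.
fit_lm <- lm(Petal.Length ~ Sepal.Length, data = iris)
print(coef(fit_lm))
```

Both approaches should agree with the nls() estimates reported above to the printed precision.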

As I previously stated, these are basic examples and show only a small portion of what you can do with plots using R. For more in-depth posts about using R, check out one of the R user groups on LinkedIn.


Authored by: Jeffrey Strickland, Ph.D.

Jeffrey Strickland, Ph.D., is the author of “Predictive Analytics Using R” and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the Financial and Insurance Industries for over 20 years. Jeff is a Certified Modeling and Simulation Professional (CMSP) and an Associate Systems Engineering Professional. He has published nearly 200 blogs on LinkedIn, is a frequently invited guest speaker, and is the author of 20 books including:

  • Discrete Event simulation using ExtendSim
  • Crime Analysis and Mapping
  • Missile Flight Simulation
  • Mathematical modeling of Warfare and Combat Phenomenon
  • Predictive Modeling and Analytics
  • Using Math to Defeat the Enemy
  • Verification and Validation for Modeling and Simulation
  • Simulation Conceptual Modeling
  • System Engineering Process and Practices
  • Weird Scientist: the Creators of Quantum Physics
  • Albert Einstein: No one expected me to lay a golden eggs
  • The Men of Manhattan: the Creators of the Nuclear Era
  • Fundamentals of Combat Modeling
