Principal Component Analysis
The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a multivariate data set. For example, in the case of the wine data set, we have 13 chemical concentrations describing wine samples from three different cultivars. We can carry out a principal component analysis to investigate whether we can capture most of the variation between samples using a smaller number of new variables (principal components), where each of these new variables is a linear combination of all or some of the 13 chemical concentrations.
To carry out a principal component analysis (PCA) on a multivariate data set, the first step is often to standardize the variables under study using the “scale()” function (see above). This is necessary when the input variables have very different variances, as the concentrations of the 13 chemicals do here (see above).
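As a quick illustration of what “scale()” does (on a small synthetic data frame, since the wine data is loaded elsewhere in this series), each standardized column ends up with mean 0 and standard deviation 1:

```r
# synthetic example: two variables with very different variances
set.seed(1)
dat <- data.frame(a = rnorm(50, mean = 10, sd = 100),
                  b = rnorm(50, mean = 0,  sd = 0.01))
standardized <- as.data.frame(scale(dat))
# every standardized column now has mean (approximately) 0 and sd 1
sapply(standardized, mean)   # effectively 0
sapply(standardized, sd)     # 1
```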
Once you have standardized your variables, you can carry out a principal component analysis using the “prcomp()” function in R.
For example, to standardize the concentrations of the 13 chemicals in the wine samples, and carry out a principal components analysis on the standardized concentrations, we type:
> standardizedconcentrations <- as.data.frame(scale(wine[2:14]))
> wine.pca <- prcomp(standardizedconcentrations)
You can get a summary of the principal component analysis results using the “summary()” function on the output of “prcomp()”:
> summary(wine.pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Standard deviation 2.169 1.580 1.203 0.9586 0.9237 0.8010 0.7423 0.5903 0.5375 0.5009
Proportion of Variance 0.362 0.192 0.111 0.0707 0.0656 0.0494 0.0424 0.0268 0.0222 0.0193
Cumulative Proportion 0.362 0.554 0.665 0.7360 0.8016 0.8510 0.8934 0.9202 0.9424 0.9617
PC11 PC12 PC13
Standard deviation 0.4752 0.4108 0.32152
Proportion of Variance 0.0174 0.0130 0.00795
Cumulative Proportion 0.9791 0.9920 1.00000
This gives us the standard deviation of each component, and the proportion of variance explained by each component. The standard deviation of the components is stored in a named element called “sdev” of the output variable made by “prcomp”:
> wine.pca$sdev
[1] 2.169297 1.580182 1.202527 0.958631 0.923704 0.801035 0.742313 0.590337
[9] 0.537476 0.500902 0.475172 0.410817 0.321524
The total variance explained by the components is the sum of the variances of the components:
> sum((wine.pca$sdev)^2)
[1] 13
In this case, we see that the total variance is 13, which is equal to the number of standardized variables (13 variables). This is because for standardized data, the variance of each standardized variable is 1. The total variance is equal to the sum of the variances of the individual variables, and since the variance of each standardized variable is 1, the total variance should be equal to the number of variables (13 here).
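This property holds for any standardized data set, not just the wine data. A small synthetic check (the variable names here are illustrative):

```r
# 5 standardized variables: the total PCA variance should equal 5
set.seed(2)
dat <- as.data.frame(matrix(rnorm(100 * 5), ncol = 5))
pca <- prcomp(scale(dat))
sum(pca$sdev^2)   # 5
```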
Deciding How Many Principal Components to Retain
In order to decide how many principal components should be retained, it is common to summarize the results of a principal components analysis by making a scree plot, which we can do in R using the “screeplot()” function:
> screeplot(wine.pca, type="lines")
The most obvious change in slope in the scree plot occurs at component 4, which is the “elbow” of the scree plot. Therefore, it could be argued on the basis of the scree plot that the first three components should be retained.
Another way of deciding how many components to retain is to use Kaiser’s criterion: that we should only retain principal components for which the variance is above 1 (when principal component analysis was applied to standardized data). We can check this by finding the variance of each of the principal components:
> (wine.pca$sdev)^2
[1] 4.705850 2.496974 1.446072 0.918974 0.853228 0.641657 0.551028 0.348497
[9] 0.288880 0.250903 0.225789 0.168770 0.103378
We see that the variance is above 1 for principal components 1, 2, and 3 (which have variances 4.71, 2.50, and 1.45, respectively). Therefore, using Kaiser’s criterion, we would retain the first three principal components.
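Kaiser’s criterion amounts to counting the components whose variance exceeds 1, which is easy to do programmatically. A sketch on synthetic data (the exact count depends on the data, so no particular value is claimed):

```r
# count the components retained under Kaiser's criterion
set.seed(3)
dat <- as.data.frame(matrix(rnorm(100 * 6), ncol = 6))
dat[, 6] <- dat[, 1] + rnorm(100, sd = 0.3)   # make two variables correlated
pca <- prcomp(scale(dat))
variances <- pca$sdev^2
sum(variances > 1)   # number of components with variance above 1
```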
A third way to decide how many principal components to retain is to decide to keep the number of components required to explain at least some minimum amount of the total variance. For example, if it is important to explain at least 80% of the variance, we would retain the first five principal components, as we can see from the output of “summary(wine.pca)” that the first five principal components explain 80.2% of the variance (while the first four components explain just 73.6%, so are not sufficient).
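The smallest number of components needed to reach a given proportion of variance can be read off the cumulative proportions directly, rather than by inspecting the summary by eye. A sketch, again on synthetic data:

```r
# smallest number of components explaining at least 80% of the variance
set.seed(4)
dat <- as.data.frame(matrix(rnorm(200 * 8), ncol = 8))
pca <- prcomp(scale(dat))
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)   # cumulative proportion
k <- which(cumvar >= 0.80)[1]
k
```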
Loadings for the Principal Components
The loadings for the principal components are stored in a named element “rotation” of the variable returned by “prcomp()”. This contains a matrix with the loadings of each principal component, where the first column in the matrix contains the loadings for the first principal component, the second column contains the loadings for the second principal component, and so on.
Therefore, to obtain the loadings for the first principal component in our analysis of the 13 chemical concentrations in wine samples, we type:
> wine.pca$rotation[,1]
V2 V3 V4 V5 V6 V7
-0.14432940 0.24518758 0.00205106 0.239320415 -0.14199204 -0.39466085
V8 V9 V10 V11 V12 V13
-0.42293430 0.29853310 -0.31342949 0.08861670 -0.29671456 -0.37616741
V14
-0.28675223
This means that the first principal component is a linear combination of the variables: -0.144*Z2 + 0.245*Z3 + 0.002*Z4 + 0.239*Z5 - 0.142*Z6 - 0.395*Z7 - 0.423*Z8 + 0.299*Z9 - 0.313*Z10 + 0.089*Z11 - 0.297*Z12 - 0.376*Z13 - 0.287*Z14, where Z2, Z3, Z4…Z14 are the standardized versions of the variables V2, V3, V4…V14 (that each have mean of 0 and variance of 1).
Note that the squares of the loadings sum to 1, as this is a constraint used in calculating the loadings:
> sum((wine.pca$rotation[,1])^2)
[1] 1
To calculate the values of the first principal component, we can define our own function to calculate a principal component given the loadings and the input variables’ values:
> calcpc <- function(variables, loadings)
{
   # make sure the input is a data frame
   variables <- as.data.frame(variables)
   # find the number of samples in the data set
   numsamples <- nrow(variables)
   # make a vector to store the component
   pc <- numeric(numsamples)
   # find the number of variables
   numvariables <- length(variables)
   # calculate the value of the component for each sample
   for (i in 1:numsamples)
   {
      valuei <- 0
      for (j in 1:numvariables)
      {
         valueij <- variables[i, j]
         loadingj <- loadings[j]
         valuei <- valuei + (valueij * loadingj)
      }
      pc[i] <- valuei
   }
   return(pc)
}
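As an aside (not part of the original tutorial), the double loop in calcpc() is equivalent to a single matrix product, which is the idiomatic way to compute component scores in R:

```r
# vectorized equivalent of calcpc(): scores as a matrix product
calcpc_vec <- function(variables, loadings) {
  as.matrix(variables) %*% loadings
}

# check on a small synthetic data set: the result matches prcomp()'s scores
set.seed(5)
dat <- as.data.frame(scale(matrix(rnorm(30 * 4), ncol = 4)))
pca <- prcomp(dat)
scores <- calcpc_vec(dat, pca$rotation[, 1])
all.equal(as.vector(scores), as.vector(pca$x[, 1]))   # TRUE
```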
We can then use the function to calculate the values of the first principal component for each sample in our wine data:
> calcpc(standardizedconcentrations, wine.pca$rotation[,1])
[1] -3.307421 -2.203250 -2.509661 -3.746497 -1.006071 -3.041674 -2.442201
[8] -2.053644 -2.503811 -2.745882 -3.469948 -1.749817 -2.107517 -3.448429
[15] -4.300652 -2.298704 -2.165846 -1.893630 -3.532022 -2.078659 -3.115614
[22] -1.083514 -2.528093 -1.640361 -1.756621 -0.987294 -1.770284 -1.231949
[29] -2.182251 -2.249763 -2.493187 -2.669880 -1.623998 -1.897339 -1.406421
[36] -1.898471 -1.380967 -1.119051 -1.497969 -2.522685 -2.580815 -0.666602
...
In fact, the values of the first principal component are stored in the variable wine.pca$x[,1] that was returned by the “prcomp()” function, so we can compare those values to the ones that we calculated, and they should agree:
> wine.pca$x[,1]
[1] -3.307421 -2.203250 -2.509661 -3.746497 -1.006071 -3.041674 -2.442201
[8] -2.053644 -2.503811 -2.745882 -3.469948 -1.749817 -2.107517 -3.448429
[15] -4.300652 -2.298704 -2.165846 -1.893630 -3.532022 -2.078659 -3.115614
[22] -1.083514 -2.528093 -1.640361 -1.756621 -0.987294 -1.770284 -1.231949
[29] -2.182251 -2.249763 -2.493187 -2.669880 -1.623998 -1.897339 -1.406421
[36] -1.898471 -1.380967 -1.119051 -1.497969 -2.522685 -2.580815 -0.666602
...
We see that they do agree.
The first principal component has the largest loadings (in absolute value) for V8 (-0.423), V7 (-0.395), V13 (-0.376), V10 (-0.313), V12 (-0.297), V14 (-0.287), V9 (0.299), V3 (0.245), and V5 (0.239). The loadings for V8, V7, V13, V10, V12 and V14 are negative, while those for V9, V3, and V5 are positive. Therefore, an interpretation of the first principal component is that it represents a contrast between the concentrations of V8, V7, V13, V10, V12, and V14, and the concentrations of V9, V3 and V5.
Similarly, we can obtain the loadings for the second principal component by typing:
> wine.pca$rotation[,2]
V2 V3 V4 V5 V6 V7
0.483651548 0.224930935 0.316068814 -0.010590502 0.299634003 0.065039512
V8 V9 V10 V11 V12 V13
-0.003359812 0.028779488 0.039301722 0.529995672 -0.279235148 -0.164496193
V14
0.36490283
This means that the second principal component is a linear combination of the variables: 0.484*Z2 + 0.225*Z3 + 0.316*Z4 - 0.011*Z5 + 0.300*Z6 + 0.065*Z7 - 0.003*Z8 + 0.029*Z9 + 0.039*Z10 + 0.530*Z11 - 0.279*Z12 - 0.164*Z13 + 0.365*Z14, where Z2, Z3…Z14 are the standardized versions of the variables V2, V3…V14 that each have mean 0 and variance 1.
Note that the squares of the loadings sum to 1, as above:
> sum((wine.pca$rotation[,2])^2)
[1] 1
The second principal component has the largest loadings (in absolute value) for V11 (0.530), V2 (0.484), V14 (0.365), V4 (0.316), V6 (0.300), V12 (-0.279), and V3 (0.225). The loadings for V11, V2, V14, V4, V6 and V3 are positive, while the loading for V12 is negative. Therefore, an interpretation of the second principal component is that it represents a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12. Note that the loadings for V11 (0.530) and V2 (0.484) are the largest, so the contrast is mainly between the concentrations of V11 and V2, and the concentration of V12.
Scatterplots of the Principal Components
The values of the principal components are stored in a named element “x” of the variable returned by “prcomp()”. This contains a matrix with the principal components, where the first column in the matrix contains the first principal component, the second column the second component, and so on.
Thus, in our example, “wine.pca$x[,1]” contains the first principal component, and “wine.pca$x[,2]” contains the second principal component.
We can make a scatterplot of the first two principal components, and label the data points with the cultivar that the wine samples come from, by typing:
> plot(wine.pca$x[,1],wine.pca$x[,2]) # make a scatterplot
> text(wine.pca$x[,1],wine.pca$x[,2], wine$V1, cex=0.7, pos=4, col="red") # add labels
The scatterplot shows the first principal component on the x-axis, and the second principal component on the y-axis. We can see from the scatterplot that wine samples of cultivar 1 have much lower values of the first principal component than wine samples of cultivar 3. Therefore, the first principal component separates wine samples of cultivars 1 from those of cultivar 3.
We can also see that wine samples of cultivar 2 have much higher values of the second principal component than wine samples of cultivars 1 and 3. Therefore, the second principal component separates samples of cultivar 2 from samples of cultivars 1 and 3.
Therefore, the first two principal components are reasonably useful for distinguishing wine samples of the three different cultivars.
Above, we interpreted the first principal component as a contrast between the concentrations of V8, V7, V13, V10, V12, and V14, and the concentrations of V9, V3 and V5. We can check whether this makes sense in terms of the concentrations of these chemicals in the different cultivars, by printing out the means of the standardized concentration variables in each cultivar, using the “printMeanAndSdByGroup()” function (see above):
> printMeanAndSdByGroup(standardizedconcentrations,wine[1])
[1] "Means:"
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 1 0.9166 -0.2915 0.3247 -0.7359 0.4619 0.8709 0.9542 -0.5774 0.5389
2 2 -0.8892 -0.3613 -0.4437 0.2225 -0.3635 -0.0579 0.0516 0.0145 0.0690
3 3 0.1886 0.8928 0.2572 0.5754 -0.0300 -0.9848 -1.2492 0.6882 -0.7641
V11 V12 V13 V14
1 0.2028 0.4576 0.7691 1.1712
2 -0.8504 0.4325 0.2446 -0.7220
3 1.0086 -1.2020 -1.3073 -0.3715
Does it make sense that the first principal component can separate cultivar 1 from cultivar 3? In cultivar 1, the mean values of V8 (0.954), V7 (0.871), V13 (0.769), V10 (0.539), V12 (0.458) and V14 (1.171) are very high compared to the mean values of V9 (-0.577), V3 (-0.292) and V5 (-0.736). In cultivar 3, the mean values of V8 (-1.249), V7 (-0.985), V13 (-1.307), V10 (-0.764), V12 (-1.202) and V14 (-0.372) are very low compared to the mean values of V9 (0.688), V3 (0.893) and V5 (0.575). Therefore, it does make sense that principal component 1 is a contrast between the concentrations of V8, V7, V13, V10, V12, and V14, and the concentrations of V9, V3 and V5; and that principal component 1 can separate cultivar 1 from cultivar 3.
Above, we interpreted the second principal component as a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12. In the light of the mean values of these variables in the different cultivars, does it make sense that the second principal component can separate cultivar 2 from cultivars 1 and 3? In cultivar 1, the mean values of V11 (0.203), V2 (0.917), V14 (1.171), V4 (0.325), V6 (0.462) and V3 (-0.292) are not very different from the mean value of V12 (0.458). In cultivar 3, the mean values of V11 (1.009), V2 (0.189), V14 (-0.372), V4 (0.257), V6 (-0.030) and V3 (0.893) are also not very different from the mean value of V12 (-1.202). In contrast, in cultivar 2, the mean values of V11 (-0.850), V2 (-0.889), V14 (-0.722), V4 (-0.444), V6 (-0.364) and V3 (-0.361) are much less than the mean value of V12 (0.432). Therefore, it makes sense that principal component 2 is a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12; and that principal component 2 can separate cultivar 2 from cultivars 1 and 3.
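The “printMeanAndSdByGroup()” helper is defined earlier in this series; if it is not available, similar per-group means of standardized variables can be obtained with base R’s aggregate(). A small synthetic sketch (the column and group names here are illustrative):

```r
# per-group means of standardized variables using base R's aggregate()
set.seed(6)
dat <- data.frame(cultivar = rep(1:3, each = 20),
                  x = rnorm(60), y = rnorm(60))
std <- as.data.frame(scale(dat[, c("x", "y")]))
aggregate(std, by = list(cultivar = dat$cultivar), FUN = mean)
```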
Next post: What is Multivariate Analysis? Part III: Linear Discriminant Analysis.
Authored by:
Jeffrey Strickland, Ph.D.
Jeffrey Strickland, Ph.D., is the author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation, and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the financial and insurance industries for over 20 years. Jeff is a Certified Modeling and Simulation Professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is a frequently invited guest speaker, and is the author of 20 books, including:
- Operations Research using Open-Source Tools
- Discrete Event Simulation using ExtendSim
- Crime Analysis and Mapping
- Missile Flight Simulation
- Mathematical Modeling of Warfare and Combat Phenomenon
- Predictive Modeling and Analytics
- Using Math to Defeat the Enemy
- Verification and Validation for Modeling and Simulation
- Simulation Conceptual Modeling
- System Engineering Process and Practices