What is Multivariate Analysis? Part II: Principal Component Analysis

Principal Component Analysis

The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a multivariate data set. For example, in the case of the wine data set, we have 13 chemical concentrations describing wine samples from three different cultivars. We can carry out a principal component analysis to investigate whether we can capture most of the variation between samples using a smaller number of new variables (principal components), where each of these new variables is a linear combination of all or some of the 13 chemical concentrations.

To carry out a principal component analysis (PCA) on a multivariate data set, the first step is often to standardize the variables under study using the “scale()” function (see above). This is necessary if the input variables have very different variances, which is true in this case as the concentrations of the 13 chemicals have very different variances (see above).

Once you have standardized your variables, you can carry out a principal component analysis using the “prcomp()” function in R.

For example, to standardize the concentrations of the 13 chemicals in the wine samples, and carry out a principal components analysis on the standardized concentrations, we type:

> standardizedconcentrations <- as.data.frame(scale(wine[2:14])) # standardize the 13 chemical concentrations
> wine.pca <- prcomp(standardizedconcentrations) # carry out the PCA

You can get a summary of the principal component analysis results using the “summary()” function on the output of “prcomp()”:

> summary(wine.pca)
Importance of components:

                         PC1   PC2   PC3    PC4    PC5    PC6    PC7    PC8    PC9   PC10
Standard deviation     2.169 1.580 1.203 0.9586 0.9237 0.8010 0.7423 0.5903 0.5375 0.5009
Proportion of Variance 0.362 0.192 0.111 0.0707 0.0656 0.0494 0.0424 0.0268 0.0222 0.0193
Cumulative Proportion  0.362 0.554 0.665 0.7360 0.8016 0.8510 0.8934 0.9202 0.9424 0.9617
                         PC11   PC12    PC13
Standard deviation     0.4752 0.4108 0.32152
Proportion of Variance 0.0174 0.0130 0.00795
Cumulative Proportion  0.9791 0.9920 1.00000

This gives us the standard deviation of each component, and the proportion of variance explained by each component. The standard deviation of the components is stored in a named element called “sdev” of the output variable made by “prcomp”:

> wine.pca$sdev
[1] 2.169297 1.580182 1.202527 0.958631 0.923704 0.801035 0.742313 0.590337
[9] 0.537476 0.500902 0.475172 0.410817 0.321524

The total variance explained by the components is the sum of the variances of the components:

> sum((wine.pca$sdev)^2)
[1] 13

In this case, we see that the total variance is 13, which is equal to the number of standardized variables (13 variables). This is because for standardized data, the variance of each standardized variable is 1. The total variance is equal to the sum of the variances of the individual variables, and since the variance of each standardized variable is 1, the total variance should be equal to the number of variables (13 here).
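The same accounting can be checked outside R. A minimal Python/NumPy sketch (using a made-up 178×13 matrix, not the wine data) standardizes each column and confirms that the total variance then equals the number of variables:

```python
import numpy as np

rng = np.random.default_rng(0)
# 13 made-up variables with very different variances, as in the wine data
X = rng.normal(size=(178, 13)) * rng.uniform(1, 50, size=13)

# Standardize each column to mean 0, variance 1 (what scale() does in R)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Each standardized column contributes exactly one unit of variance
total_variance = Z.var(axis=0, ddof=1).sum()
print(round(total_variance, 6))  # 13.0
```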

Deciding How Many Principal Components to Retain

In order to decide how many principal components should be retained, it is common to summarize the results of a principal components analysis by making a scree plot, which we can do in R using the “screeplot()” function:

> screeplot(wine.pca, type="lines")

[Figure PCA_Plot01: scree plot of the wine PCA]

The most obvious change in slope in the scree plot occurs at component 4, which is the “elbow” of the plot: the components before the elbow account for most of the variation, while those after it add relatively little. Therefore, it could be argued, on the basis of the scree plot, that the first three components should be retained.

Another way of deciding how many components to retain is to use Kaiser’s criterion: that we should only retain principal components for which the variance is above 1 (when principal component analysis was applied to standardized data). We can check this by finding the variance of each of the principal components:

> (wine.pca$sdev)^2
[1] 4.705850 2.496974 1.446072 0.918974 0.853228 0.641657 0.551028 0.348497
[9] 0.288880 0.250903 0.225789 0.168770 0.103378

We see that the variance is above 1 for principal components 1, 2, and 3 (which have variances 4.71, 2.50, and 1.45, respectively). Therefore, using Kaiser’s criterion, we would retain the first three principal components.
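Kaiser’s criterion is easy to apply programmatically. A small Python sketch, using the component variances printed above:

```python
# Component variances from (wine.pca$sdev)^2 above
variances = [4.705850, 2.496974, 1.446072, 0.918974, 0.853228, 0.641657,
             0.551028, 0.348497, 0.288880, 0.250903, 0.225789, 0.168770, 0.103378]

# Kaiser's criterion: keep components whose variance exceeds 1
retained = [i + 1 for i, v in enumerate(variances) if v > 1]
print(retained)  # [1, 2, 3]
```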

A third way to decide how many principal components to retain is to decide to keep the number of components required to explain at least some minimum amount of the total variance. For example, if it is important to explain at least 80% of the variance, we would retain the first five principal components, as we can see from the output of “summary(wine.pca)” that the first five principal components explain 80.2% of the variance (while the first four components explain just 73.6%, so are not sufficient).
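The minimum-variance-explained rule can likewise be sketched in a few lines of Python, again using the component variances from the output above:

```python
# Component variances from (wine.pca$sdev)^2 above; they sum to 13
variances = [4.705850, 2.496974, 1.446072, 0.918974, 0.853228, 0.641657,
             0.551028, 0.348497, 0.288880, 0.250903, 0.225789, 0.168770, 0.103378]

# Running cumulative proportion of the total variance
cumulative = []
running = 0.0
for v in variances:
    running += v
    cumulative.append(running / sum(variances))

# Smallest number of components whose cumulative proportion reaches 80%
n_components = next(k + 1 for k, c in enumerate(cumulative) if c >= 0.80)
print(n_components)  # 5
```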

Loadings for the Principal Components

The loadings for the principal components are stored in a named element “rotation” of the variable returned by “prcomp()”. This contains a matrix with the loadings of each principal component, where the first column in the matrix contains the loadings for the first principal component, the second column contains the loadings for the second principal component, and so on.

Therefore, to obtain the loadings for the first principal component in our analysis of the 13 chemical concentrations in wine samples, we type:

> wine.pca$rotation[,1]
         V2         V3         V4          V5          V6          V7
-0.14432940 0.24518758 0.00205106 0.239320415 -0.14199204 -0.39466085
         V8         V9         V10        V11         V12         V13
-0.42293430 0.29853310 -0.31342949 0.08861670 -0.29671456 -0.37616741
        V14
-0.28675223

This means that the first principal component is a linear combination of the variables: -0.144*Z2 + 0.245*Z3 + 0.002*Z4 + 0.239*Z5 – 0.142*Z6 – 0.395*Z7 – 0.423*Z8 + 0.299*Z9 -0.313*Z10 + 0.089*Z11 – 0.297*Z12 – 0.376*Z13 – 0.287*Z14, where Z2, Z3, Z4…Z14 are the standardized versions of the variables V2, V3, V4…V14 (that each have mean of 0 and variance of 1).

Note that the square of the loadings sum to 1, as this is a constraint used in calculating the loadings:

> sum((wine.pca$rotation[,1])^2)
[1] 1
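The unit-length constraint can also be verified by hand from the loadings printed above:

```python
# PC1 loadings from wine.pca$rotation[,1] above
loadings = [-0.14432940, 0.24518758, 0.00205106, 0.239320415, -0.14199204,
            -0.39466085, -0.42293430, 0.29853310, -0.31342949, 0.08861670,
            -0.29671456, -0.37616741, -0.28675223]

# Sum of squared loadings: each loading vector has unit length
sum_of_squares = sum(l * l for l in loadings)
print(round(sum_of_squares, 4))  # 1.0
```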

To calculate the values of the first principal component, we can define our own function to calculate a principal component given the loadings and the input variables’ values:

> calcpc <- function(variables, loadings)
  {
     # make sure the input is a data frame
     variables <- as.data.frame(variables)
     # find the number of samples in the data set
     numsamples <- nrow(variables)
     # make a vector to store the component
     pc <- numeric(numsamples)
     # find the number of variables
     numvariables <- length(variables)
     # calculate the value of the component for each sample
     for (i in 1:numsamples)
     {
        valuei <- 0
        for (j in 1:numvariables)
        {
           valueij <- variables[i, j]
           loadingj <- loadings[j]
           valuei <- valuei + (valueij * loadingj)
        }
        pc[i] <- valuei
     }
     return(pc)
  }
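The double loop in calcpc() is just a matrix-vector product: each sample’s score is the dot product of its standardized values with the loading vector. A Python/NumPy sketch of the same computation, using a small made-up standardized matrix and a hypothetical loading vector (not the wine data):

```python
import numpy as np

def calc_pc(variables, loadings):
    """Score each sample on one component: dot product of its values with the loadings."""
    return np.asarray(variables) @ np.asarray(loadings)

# Made-up example: 4 samples, 3 standardized variables
Z = np.array([[ 0.5, -1.2,  0.3],
              [-0.7,  0.4,  1.1],
              [ 1.3,  0.2, -0.9],
              [-1.1,  0.6, -0.5]])
w = np.array([0.6, -0.3, 0.742])  # hypothetical loading vector

print(calc_pc(Z, w))
```

The matrix product replaces both loops at once, which is why the values in wine.pca$x can be computed internally without any explicit iteration.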

We can then use the function to calculate the values of the first principal component for each sample in our wine data:

> calcpc(standardizedconcentrations, wine.pca$rotation[,1])
[1]  -3.307421 -2.203250 -2.509661 -3.746497 -1.006071 -3.041674 -2.442201
[8]  -2.053644 -2.503811 -2.745882 -3.469948 -1.749817 -2.107517 -3.448429
[15] -4.300652 -2.298704 -2.165846 -1.893630 -3.532022 -2.078659 -3.115614
[22] -1.083514 -2.528093 -1.640361 -1.756621 -0.987294 -1.770284 -1.231949
[29] -2.182251 -2.249763 -2.493187 -2.669880 -1.623998 -1.897339 -1.406421
[36] -1.898471 -1.380967 -1.119051 -1.497969 -2.522685 -2.580815 -0.666602
...

In fact, the values of the first principal component are stored in the variable wine.pca$x[,1] that was returned by the “prcomp()” function, so we can compare those values to the ones that we calculated, and they should agree:

> wine.pca$x[,1]
[1]  -3.307428 -2.203250 -2.509661 -3.746497 -1.006071 -3.041674 -2.442201
[8]  -2.053644 -2.503811 -2.745882 -3.469948 -1.749817 -2.107517 -3.448429
[15] -4.300652 -2.298704 -2.165846 -1.893630 -3.532022 -2.078659 -3.115614
[22] -1.083514 -2.528093 -1.640361 -1.756621 -0.987294 -1.770284 -1.231949
[29] -2.182251 -2.249763 -2.493187 -2.669880 -1.623998 -1.897339 -1.406421
[36] -1.898471 -1.380967 -1.119051 -1.497969 -2.522685 -2.580815 -0.666602
...

We see that they agree (any tiny differences are just rounding).

The first principal component has highest (in absolute value) loadings for V8 (-0.423), V7 (-0.395), V13 (-0.376), V10 (-0.313), V12 (-0.297), V14 (-0.287), V9 (0.299), V3 (0.245), and V5 (0.239). The loadings for V8, V7, V13, V10, V12 and V14 are negative, while those for V9, V3, and V5 are positive. Therefore, an interpretation of the first principal component is that it represents a contrast between the concentrations of V8, V7, V13, V10, V12, and V14, and the concentrations of V9, V3 and V5.

Similarly, we can obtain the loadings for the second principal component by typing:

> wine.pca$rotation[,2]
          V2          V3          V4           V5          V6          V7
  0.483651548 0.224930935 0.316068814 -0.010590502 0.299634003 0.065039512
          V8          V9         V10         V11          V12          V13
-0.003359812 0.028779488 0.039301722 0.529995672 -0.279235148 -0.164496193
         V14
  0.36490283

This means that the second principal component is a linear combination of the variables: 0.484*Z2 + 0.225*Z3 + 0.316*Z4 – 0.011*Z5 + 0.300*Z6 + 0.065*Z7 – 0.003*Z8 + 0.029*Z9 + 0.039*Z10 + 0.530*Z11 – 0.279*Z12 – 0.164*Z13 + 0.365*Z14, where Z2, Z3…Z14 are the standardized versions of variables V2, V3…V14 that each have mean 0 and variance 1.

Note that the square of the loadings sum to 1, as above:

> sum((wine.pca$rotation[,2])^2)
[1] 1

The second principal component has its largest loadings (in absolute value) for V11 (0.530), V2 (0.484), V14 (0.365), V4 (0.316), V6 (0.300), V12 (-0.279), and V3 (0.225). The loadings for V11, V2, V14, V4, V6 and V3 are positive, while the loading for V12 is negative. Therefore, an interpretation of the second principal component is that it represents a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12. Note that the loadings for V11 (0.530) and V2 (0.484) are the largest, so the contrast is mainly between the concentrations of V11 and V2, and the concentration of V12.

Scatterplots of the Principal Components

The values of the principal components are stored in a named element “x” of the variable returned by “prcomp()”. This contains a matrix with the principal components, where the first column in the matrix contains the first principal component, the second column the second component, and so on.

Thus, in our example, “wine.pca$x[,1]” contains the first principal component, and “wine.pca$x[,2]” contains the second principal component.

We can make a scatterplot of the first two principal components, and label the data points with the cultivar that the wine samples come from, by typing:

> plot(wine.pca$x[,1],wine.pca$x[,2]) # make a scatterplot
> text(wine.pca$x[,1],wine.pca$x[,2], wine$V1, cex=0.7, pos=4, col="red") # add labels

[Figure PCA_Plot2: scatterplot of the first two principal components, labeled by cultivar]

The scatterplot shows the first principal component on the x-axis, and the second principal component on the y-axis. We can see from the scatterplot that wine samples of cultivar 1 have much lower values of the first principal component than wine samples of cultivar 3. Therefore, the first principal component separates wine samples of cultivar 1 from those of cultivar 3.

We can also see that wine samples of cultivar 2 have much higher values of the second principal component than wine samples of cultivars 1 and 3. Therefore, the second principal component separates samples of cultivar 2 from samples of cultivars 1 and 3.

Therefore, the first two principal components are reasonably useful for distinguishing wine samples of the three different cultivars.

Above, we interpreted the first principal component as a contrast between the concentrations of V8, V7, V13, V10, V12, and V14, and the concentrations of V9, V3 and V5. We can check whether this makes sense in terms of the concentrations of these chemicals in the different cultivars, by printing out the means of the standardized concentration variables in each cultivar, using the “printMeanAndSdByGroup()” function (see above):

> printMeanAndSdByGroup(standardizedconcentrations, wine[1])
 [1] "Means:"
  V1      V2      V3      V4      V5      V6      V7      V8      V9     V10     V11     V12     V13     V14
1  1  0.9166 -0.2915  0.3247 -0.7359  0.4619  0.8709  0.9542 -0.5774  0.5389  0.2028  0.4576  0.7691  1.1712
2  2 -0.8892 -0.3613 -0.4437  0.2225 -0.3635 -0.0579  0.0516  0.0145  0.0690 -0.8504  0.4325  0.2446 -0.7220
3  3  0.1886  0.8928  0.2572  0.5754 -0.0300 -0.9848 -1.2492  0.6882 -0.7641  1.0086 -1.2020 -1.3073 -0.3715

Does it make sense that the first principal component can separate cultivar 1 from cultivar 3? In cultivar 1, the mean values of V8 (0.954), V7 (0.871), V13 (0.769), V10 (0.539), V12 (0.458) and V14 (1.171) are very high compared to the mean values of V9 (-0.577), V3 (-0.292) and V5 (-0.736). In cultivar 3, the mean values of V8 (-1.249), V7 (-0.985), V13 (-1.307), V10 (-0.764), V12 (-1.202) and V14 (-0.372) are very low compared to the mean values of V9 (0.688), V3 (0.893) and V5 (0.575). Therefore, it does make sense that principal component 1 is a contrast between the concentrations of V8, V7, V13, V10, V12, and V14, and the concentrations of V9, V3 and V5; and that principal component 1 can separate cultivar 1 from cultivar 3.

Above, we interpreted the second principal component as a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12. In the light of the mean values of these variables in the different cultivars, does it make sense that the second principal component can separate cultivar 2 from cultivars 1 and 3? In cultivar 1, the mean values of V11 (0.203), V2 (0.917), V14 (1.171), V4 (0.325), V6 (0.462) and V3 (-0.292) are not very different from the mean value of V12 (0.458). In cultivar 3, the mean values of V11 (1.009), V2 (0.189), V14 (-0.372), V4 (0.257), V6 (-0.030) and V3 (0.893) are also not very different from the mean value of V12 (-1.202). In contrast, in cultivar 2, the mean values of V11 (-0.850), V2 (-0.889), V14 (-0.722), V4 (-0.444), V6 (-0.364) and V3 (-0.361) are much less than the mean value of V12 (0.432). Therefore, it makes sense that principal component 2 is a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12; and that principal component 2 can separate cultivar 2 from cultivars 1 and 3.

Next post: What is Multivariate Analysis? Part III: Linear Discriminant Analysis.


Jeffrey Strickland

Authored by:
Jeffrey Strickland, Ph.D.

Jeffrey Strickland, Ph.D., is the Author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the Financial and Insurance Industries for over 20 years. Jeff is a Certified Modeling and Simulation professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is also a frequently invited guest speaker and the author of 20 books including:

  • Operations Research using Open-Source Tools
  • Discrete Event simulation using ExtendSim
  • Crime Analysis and Mapping
  • Missile Flight Simulation
  • Mathematical Modeling of Warfare and Combat Phenomenon
  • Predictive Modeling and Analytics
  • Using Math to Defeat the Enemy
  • Verification and Validation for Modeling and Simulation
  • Simulation Conceptual Modeling
  • System Engineering Process and Practices

