
# What is Multivariate Analysis? Part III

## Linear Discriminant Analysis

The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a multivariate data set. For example, in the wine data set, we have 13 chemical concentrations describing wine samples from three cultivars. By carrying out a principal component analysis, we found that most of the variation in the chemical concentrations between the samples can be captured using the first two principal components, where each of the principal components is a particular linear combination of the 13 chemical concentrations.

The purpose of linear discriminant analysis (LDA) is to find the linear combinations of the original variables (the 13 chemical concentrations here) that give the best possible separation between the groups (wine cultivars here) in our data set. Linear discriminant analysis is also known as “canonical discriminant analysis”, or simply “discriminant analysis”.

If we want to separate the wines by cultivar, the number of groups is G = 3 (the three cultivars), and the number of variables is p = 13 (the 13 chemical concentrations). The maximum number of useful discriminant functions that can separate the wines by cultivar is the minimum of G - 1 and p, which in this case is the minimum of 2 and 13, that is, 2. Thus, we can find at most 2 useful discriminant functions to separate the wines by cultivar, using the 13 chemical concentration variables.
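This little rule is easy to check for any data set. Although this article's examples are in R, the calculation can be sketched in a couple of lines of illustrative Python:

```python
# Maximum number of useful discriminant functions is min(G - 1, p),
# where G is the number of groups and p is the number of variables.
G = 3   # three wine cultivars
p = 13  # thirteen chemical concentrations
max_useful = min(G - 1, p)
print(max_useful)  # 2
```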

You can carry out a linear discriminant analysis using the “`lda()`” function from the R “`MASS`” package. To use this function, we first need to install the “MASS” R package (for instructions on how to install an R package, see How to install an R package).

For example, to carry out a linear discriminant analysis using the 13 chemical concentrations in the wine samples, we type:

`> library("MASS")  # load the MASS package`
`> wine.lda <- lda(wine$V1 ~ wine$V2 + wine$V3 + wine$V4 + wine$V5 +`
`               wine$V6 + wine$V7 + wine$V8 + wine$V9 + wine$V10 +`
`               wine$V11 + wine$V12 + wine$V13 + wine$V14)`

## Loadings for the Discriminant Functions

To get the values of the loadings of the discriminant functions for the wine data, we can type:

`> wine.lda`
`Coefficients of linear discriminants:`
`                  LD1           LD2`
`wine$V2  -0.403399781  0.8717930699`
`wine$V3   0.165254596  0.3053797325`
`wine$V4  -0.369075256  2.3458497486`
`wine$V5   0.154797889 -0.1463807654`
`wine$V6  -0.002163496 -0.0004627565`
`wine$V7   0.618052068 -0.0322128171`
`wine$V8  -1.661191235 -0.4919980543`
`wine$V9  -1.495818440 -1.6309537953`
`wine$V10  0.134092628 -0.3070875776`
`wine$V11  0.355055710  0.2532306865`
`wine$V12 -0.818036073 -1.5156344987`
`wine$V13 -1.157559376  0.0511839665`
`wine$V14 -0.002691206  0.0028529846`

This means that the first discriminant function is a linear combination of the variables: -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14, where V2, V3, … V14 are the concentrations of the 13 chemicals found in the wine samples. For convenience, the values of each discriminant function (e.g. the first discriminant function) are scaled so that their mean value is zero (see below).

Note that these loadings are calculated so that the within-group variance of each discriminant function for each group (cultivar) is equal to 1, as will be demonstrated below.

These scalings are also stored in the named element “scaling” of the variable returned by the `lda()` function. This element contains a matrix, in which the first column contains the loadings for the first discriminant function, the second column contains the loadings for the second discriminant function and so on. For example, to extract the loadings for the first discriminant function, we can type:

`> wine.lda$scaling[,1]`
`    wine$V2     wine$V3     wine$V4     wine$V5     wine$V6     wine$V7`
`-0.40339978  0.16525459 -0.36907526  0.15479789 -0.00216349  0.61805206`
`    wine$V8     wine$V9    wine$V10    wine$V11    wine$V12    wine$V13`
`-1.66119123 -1.49581844  0.13409262  0.35505571 -0.81803607 -1.15755937`
`   wine$V14`
`-0.002691206`

To calculate the values of the first discriminant function, we can define our own function “calclda()”:

`> calclda <- function(variables, loadings)`
`  {`
`     # make sure the variables are in a data frame`
`     variables <- as.data.frame(variables)`
`     # find the number of samples in the data set`
`     numsamples <- nrow(variables)`
`     # make a vector to store the discriminant function`
`     ld <- numeric(numsamples)`
`     # find the number of variables`
`     numvariables <- length(variables)`
`     # calculate the value of the discriminant function for each sample`
`     for (i in 1:numsamples)`
`     {`
`        valuei <- 0`
`        for (j in 1:numvariables)`
`        {`
`           valueij <- variables[i,j]`
`           loading <- loadings[j]`
`           valuei <- valuei + (valueij * loading)`
`        }`
`        ld[i] <- valuei`
`     }`
`     # standardize the discriminant function so that its mean value is 0:`
`     ld <- as.data.frame(scale(ld, center=TRUE, scale=FALSE))`
`     ld <- ld[[1]]  # convert the one-column data frame back to a vector`
`     return(ld)`
`  }`

The function `calclda()` simply calculates the value of a discriminant function for each sample in the data set. For example, for the first discriminant function, for each sample we calculate the value using the equation -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14. Furthermore, the “`scale()`” command is used within the `calclda()` function in order to standardize the value of a discriminant function (e.g. the first discriminant function) so that its mean value (over all the wine samples) is 0.
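Under the hood, the double loop in `calclda()` is just a dot product of each sample's variables with the loadings, followed by mean-centering. A compact sketch of the same idea in illustrative Python (the sample values and loadings below are toy numbers, not the wine data):

```python
def calc_lda_values(samples, loadings):
    """Value of one discriminant function per sample: the dot product of
    each sample's variables with the loadings, then centred to mean 0."""
    raw = [sum(x * w for x, w in zip(row, loadings)) for row in samples]
    mean = sum(raw) / len(raw)
    return [v - mean for v in raw]

# toy example: 3 samples, 2 variables, loadings (0.5, -1.0)
values = calc_lda_values([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.5, -1.0])
print(values)       # [1.0, 0.0, -1.0]
print(sum(values))  # 0.0 (the centred values always sum to zero)
```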

We can use the function `calclda()` to calculate the values of the first discriminant function for each sample in our wine data:

`> calclda(wine[2:14], wine.lda$scaling[,1])`
`   -4.7002440 -4.3019581 -3.4207195 -4.2057537 -1.5099817 -4.5186893`
`   -4.5273779 -4.1483478 -3.8608288 -3.3666244 -4.8058791 -3.4280765`
`   -3.6661025 -5.5882464 -5.5013145 -3.1847519 -3.2893699 -2.9980926`
`   -5.2464037 -3.1365311 -3.5774779 -1.6907714 -4.8351503 -3.0958896`
`   -3.3216472 -2.1448222 -3.9824285 -2.6859143 -3.5630946 -3.1730157`
`   -2.9962680 -3.5686624 -3.3850638 -3.5275375 -2.8519085 -2.7941199`
`...`

In fact, the values of the first linear discriminant function can be calculated using the “`predict()`” function in R, so we can compare those to the ones that we calculated, and they should agree:

`> wine.lda.values <- predict(wine.lda, wine[2:14])`
`> wine.lda.values$x[,1] # contains the values for the first discriminant function`
`           1           2           3           4           5           6`
` -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934`
`          7           8           9          10          11          12`
` -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646`
`         13          14          15          16          17          18`
` -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262`
`         19          20          21          22          23          24`
` -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961`
`        25          26          27          28          29          30`
` -3.32164716 -2.14482223 -3.98242850 -2.68591432 -3.56309464 -3.17301573`
`        31          32          33          34          35          36`
` -2.99626797 -3.56866244 -3.38506383 -3.52753750 -2.85190852 -2.79411996`
`  ...`

We see that they do agree.

It doesn’t matter whether the input variables for linear discriminant analysis are standardized or not, unlike for principal components analysis in which it is often necessary to standardize the input variables. However, using standardized variables in linear discriminant analysis makes it easier to interpret the loadings in a linear discriminant function.

In linear discriminant analysis, the standardized version of an input variable is defined so that it has mean zero and within-groups variance of 1. Thus, we can calculate the “group-standardized” variable by subtracting the mean from each value of the variable, and dividing by the within-groups standard deviation. To calculate the group-standardized version of a set of variables, we can use the function “`groupStandardize()`” below:

`> groupStandardize <- function(variables, groupvariable)`
`  {`
`     # find out how many variables we have`
`     variables <- as.data.frame(variables)`
`     numvariables <- length(variables)`
`     # find the variable names`
`     variablenames <- colnames(variables)`
`     # calculate the group-standardized version of each variable`
`     for (i in 1:numvariables)`
`     {`
`        variablei <- variables[i]`
`        variablei_name <- variablenames[i]`
`        variablei_Vw <- calcWithinGroupsVariance(variablei, groupvariable)`
`        variablei_mean <- mean(variablei[[1]])`
`        variablei_new <- (variablei - variablei_mean)/(sqrt(variablei_Vw))`
`        data_length <- nrow(variablei)`
`        if (i == 1) { variables_new <- data.frame(row.names=seq(1,data_length)) }`
`        variables_new[[variablei_name]] <- variablei_new`
`     }`
`     return(variables_new)`
`  }`

For example, we can use the “groupStandardize()” function to calculate the group-standardized versions of the chemical concentrations in wine samples:

`> groupstandardizedconcentrations <- groupStandardize(wine[2:14], wine[1])`

We can then use the lda() function to perform linear discriminant analysis on the group-standardized variables:

`> wine.lda2 <- lda(wine$V1 ~ groupstandardizedconcentrations$V2 +`
`     groupstandardizedconcentrations$V3 + groupstandardizedconcentrations$V4 +`
`     groupstandardizedconcentrations$V5 + groupstandardizedconcentrations$V6 +`
`     groupstandardizedconcentrations$V7 + groupstandardizedconcentrations$V8 +`
`     groupstandardizedconcentrations$V9 + groupstandardizedconcentrations$V10 +`
`     groupstandardizedconcentrations$V11 + groupstandardizedconcentrations$V12 +`
`     groupstandardizedconcentrations$V13 + groupstandardizedconcentrations$V14)`
`> wine.lda2`
` Coefficients of linear discriminants:`
`                                              LD1          LD2`
`  groupstandardizedconcentrations$V2  -0.20650463  0.446280119`
`  groupstandardizedconcentrations$V3   0.15568586  0.287697336`
`  groupstandardizedconcentrations$V4  -0.09486893  0.602988809`
`  groupstandardizedconcentrations$V5   0.43802089 -0.414203541`
`  groupstandardizedconcentrations$V6  -0.02907934 -0.006219863`
`  groupstandardizedconcentrations$V7   0.27030186 -0.014088108`
`  groupstandardizedconcentrations$V8  -0.87067265 -0.257868714`
`  groupstandardizedconcentrations$V9  -0.16325474 -0.178003512`
`  groupstandardizedconcentrations$V10  0.06653116 -0.152364015`
`  groupstandardizedconcentrations$V11  0.53670086  0.382782544`
`  groupstandardizedconcentrations$V12 -0.12801061 -0.237174509`
`  groupstandardizedconcentrations$V13 -0.46414916  0.020523349`
`  groupstandardizedconcentrations$V14 -0.46385409  0.491738050`

It makes sense to interpret the loadings calculated using the group-standardized variables rather than the loadings for the original (unstandardized) variables.

In the first discriminant function calculated for the group-standardized variables, the largest loadings (in absolute value) are given to V8 (-0.871), V11 (0.537), V13 (-0.464), V14 (-0.464), and V5 (0.438). The loadings for V8, V13 and V14 are negative, while those for V11 and V5 are positive. Therefore, the discriminant function seems to represent a contrast between the concentrations of V8, V13 and V14, and the concentrations of V11 and V5.

We saw above that the individual variables which gave the greatest separations between the groups were V8 (separation 233.93), V14 (207.92), V13 (189.97), V2 (135.08) and V11 (120.66). These were mostly the same variables that had the largest loadings in the linear discriminant function (loading for V8: -0.871, for V14: -0.464, for V13: -0.464, for V11: 0.537).

We found above that variables V8 and V11 have a negative between-groups covariance (-60.41) and a positive within-groups covariance (0.29). When the between-groups covariance and within-groups covariance for two variables have opposite signs, it indicates that a better separation between groups can be obtained by using a linear combination of those two variables than by using either variable on its own.

Thus, given that the two variables V8 and V11 have between-groups and within-groups covariances of opposite signs, and that these are two of the variables that gave the greatest separations between groups when used individually, it is not surprising that these are the two variables that have the largest loadings in the first discriminant function.

Note that although the loadings for the group-standardized variables are easier to interpret than the loadings for the unstandardized variables, the values of the discriminant function are the same regardless of whether we standardize the input variables or not. For example, for the wine data, we can calculate the values of the first discriminant function from the unstandardized and the group-standardized variables by typing:

`> wine.lda.values <- predict(wine.lda, wine[2:14])`
`> wine.lda.values$x[,1] # values for the first discriminant function, using the unstandardized data`
`           1           2           3           4          5          6`
` -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934`
`          7          8          9         10         11          12`
` -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646`
`         13         14         15         16         17          18`
` -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262`
`         19         20         21         22         23          24`
` -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961`
`  ...`

`> wine.lda.values2 <- predict(wine.lda2, groupstandardizedconcentrations)`
`> wine.lda.values2$x[,1] # values for the first discriminant function, using the standardized data`
`           1           2           3           4           5           6`
` -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934`
`           7           8           9          10          11          12`
` -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646`
`          13          14          15          16          17          18`
` -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262`
`          19          20          21          22          23          24`
` -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961`
`  ...`

We can see that although the loadings are different for the first discriminant functions calculated using unstandardized and group-standardized data, the actual values of the first discriminant function are the same.

## Separation Achieved by the Discriminant Functions

To calculate the separation achieved by each discriminant function, we first need to calculate the value of each discriminant function, by substituting the variables’ values into the linear combination for the discriminant function (e.g. -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14 for the first discriminant function), and then scaling the values of the discriminant function so that their mean is zero.

As mentioned above, we can do this using the “`predict()`” function in R. For example, to calculate the value of the discriminant functions for the wine data, we type:

`> wine.lda.values <- predict(wine.lda, wine[2:14])`

The returned variable has a named element “x” which is a matrix containing the linear discriminant functions: the first column of x contains the first discriminant function, the second column of x contains the second discriminant function, and so on (if there are more discriminant functions).

We can therefore calculate the separations achieved by the two linear discriminant functions for the wine data by using the “`calcSeparations()`” function (see above), which calculates the separation as the ratio of the between-groups variance to the within-groups variance:

`> calcSeparations(wine.lda.values$x, wine[1])`
`   "variable LD1 Vw= 1 Vb= 794.652200566216 separation= 794.652200566216"`
`   "variable LD2 Vw= 1 Vb= 361.241041493455 separation= 361.241041493455"`
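The separation statistic that `calcSeparations()` reports is simply the ratio of the between-groups variance to the within-groups variance of one variable. A sketch of that ratio in illustrative Python, using the usual ANOVA denominators G - 1 and n - G (an assumption; the R helper is defined earlier in this series) and toy data rather than the wine data:

```python
def separation(values, groups):
    """Separation of one variable: between-groups variance divided by
    within-groups variance (ANOVA-style denominators G - 1 and n - G)."""
    levels = sorted(set(groups))
    n, G = len(values), len(levels)
    overall = sum(values) / n
    ssb = ssw = 0.0
    for g in levels:
        gv = [v for v, lab in zip(values, groups) if lab == g]
        gm = sum(gv) / len(gv)
        ssb += len(gv) * (gm - overall) ** 2
        ssw += sum((v - gm) ** 2 for v in gv)
    vb = ssb / (G - 1)  # between-groups variance
    vw = ssw / (n - G)  # within-groups variance
    return vb / vw

# toy example: two tight groups far apart give a large separation
print(separation([0.0, 1.0, 10.0, 11.0], ["a", "a", "b", "b"]))  # 200.0
```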

As mentioned above, the loadings for each discriminant function are calculated in such a way that the within-group variance (Vw) for each group (wine cultivar here) is equal to 1, as we see in the output from `calcSeparations()` above.

The output from `calcSeparations()` tells us that the separation achieved by the first (best) discriminant function is 794.7, and the separation achieved by the second (second best) discriminant function is 361.2.

Therefore, the total separation is the sum of these: 794.652200566216 + 361.241041493455 = 1155.893, or 1155.89 rounded to two decimal places. The “percentage separation” achieved by the first discriminant function is therefore 794.652200566216*100/1155.893 = 68.75%, and the percentage separation achieved by the second discriminant function is 361.241041493455*100/1155.893 = 31.25%.
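The arithmetic above is easy to verify, for example with a few lines of illustrative Python using the separation values reported by `calcSeparations()`:

```python
sep1 = 794.652200566216  # separation achieved by LD1
sep2 = 361.241041493455  # separation achieved by LD2
total = sep1 + sep2

print(round(total, 2))               # 1155.89
print(round(100 * sep1 / total, 2))  # 68.75
print(round(100 * sep2 / total, 2))  # 31.25
```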

The “proportion of trace” that is printed when you type “`wine.lda`” (the variable returned by the `lda()` function) is the percentage separation achieved by each discriminant function. For example, for the wine data we get the same values as just calculated (68.75% and 31.25%):

`> wine.lda`
`  Proportion of trace:`
`     LD1    LD2`
`  0.6875 0.3125`

Therefore, the first discriminant function does achieve a good separation between the three groups (three cultivars), but the second discriminant function does improve the separation of the groups by quite a large amount, so it is worth using the second discriminant function as well. Therefore, to achieve a good separation of the groups (cultivars), it is necessary to use both of the first two discriminant functions.

We found above that the largest separation achieved for any of the individual variables (individual chemical concentrations) was 233.9 for V8, which is quite a lot less than 794.7, the separation achieved by the first discriminant function. Therefore, the effect of using more than one variable to calculate the discriminant function is that we can find a discriminant function that achieves a far greater separation between groups than achieved by any one variable alone.

The variable returned by the `lda()` function also has a named element “`svd`”, which contains the ratio of between- and within-group standard deviations for the linear discriminant variables, that is, the square root of the “`separation`” value that we calculated using `calcSeparations()` above. When we calculate the square of the value stored in “`svd`”, we should get the same value as found using `calcSeparations()`:

`> (wine.lda$svd)^2`
`   794.6522 361.2410`

## A Stacked Histogram of the LDA Values

A nice way of displaying the results of a linear discriminant analysis (LDA) is to make a stacked histogram of the values of the discriminant function for the samples from different groups (different wine cultivars in our example).

We can do this using the “`ldahist()`” function in R. For example, to make a stacked histogram of the first discriminant function’s values for wine samples of the three different wine cultivars, we type:

`> ldahist(data = wine.lda.values$x[,1], g=wine$V1)`

We can see from the histogram that cultivars 1 and 3 are well separated by the first discriminant function, since the values for cultivar 1 are between -6 and -1, while the values for cultivar 3 are between 2 and 6, and so there is no overlap in values.

However, the separation achieved by the linear discriminant function on the training set may be an overestimate. To get a more accurate idea of how well the first discriminant function separates the groups, we would need to see a stacked histogram of the values for the three cultivars using some unseen “test set”, that is, using a set of data that was not used to calculate the linear discriminant function.

We see that the first discriminant function separates cultivars 1 and 3 very well, but does not separate cultivars 1 and 2, or cultivars 2 and 3, so well.

We therefore investigate whether the second discriminant function separates those cultivars, by making a stacked histogram of the second discriminant function’s values:

`> ldahist(data = wine.lda.values$x[,2], g=wine$V1)`

We see that the second discriminant function separates cultivars 1 and 2 quite well, although there is a little overlap in their values. Furthermore, the second discriminant function also separates cultivars 2 and 3 quite well, although again there is a little overlap in their values, so it is not perfect.

Thus, we see that two discriminant functions are necessary to separate the cultivars, as was discussed above (see the discussion of percentage separation above).

## Scatterplots of the Discriminant Functions

We can obtain a scatterplot of the best two discriminant functions, with the data points labelled by cultivar, by typing:

`> plot(wine.lda.values$x[,1],wine.lda.values$x[,2]) # make a scatterplot`
`> text(wine.lda.values$x[,1],wine.lda.values$x[,2],wine$V1,cex=0.7,pos=4,col="red") # add labels`

From the scatterplot of the first two discriminant functions, we can see that the wines from the three cultivars are well separated. The first discriminant function (x-axis) separates cultivars 1 and 3 very well, but does not perfectly separate cultivars 1 and 2, or cultivars 2 and 3.

The second discriminant function (y-axis) achieves a fairly good separation of cultivars 1 and 2, and of cultivars 2 and 3, although it is not totally perfect.

To achieve a very good separation of the three cultivars, it would be best to use both the first and second discriminant functions together, since the first discriminant function can separate cultivars 1 and 3 very well, and the second discriminant function can separate cultivars 1 and 2, and cultivars 2 and 3, reasonably well.

## Allocation Rules and Misclassification Rate

We can calculate the mean values of the discriminant functions for each of the three cultivars using the “`printMeanAndSdByGroup()`” function (see above):

`> printMeanAndSdByGroup(wine.lda.values$x, wine[1])`
`   "Means:"`

`    V1         LD1       LD2`
`  1  1 -3.42248851  1.691674`
`  2  2 -0.07972623 -2.472656`
`  3  3  4.32473717  1.578120`

We find that the mean value of the first discriminant function is -3.42248851 for cultivar 1, -0.07972623 for cultivar 2, and 4.32473717 for cultivar 3. The mid-way point between the mean values for cultivars 1 and 2 is (-3.42248851 + (-0.07972623))/2 = -1.751107, and the mid-way point between the mean values for cultivars 2 and 3 is (-0.07972623 + 4.32473717)/2 = 2.122505.

Therefore, we can use the following allocation rule:

• if the first discriminant function is <= -1.751107, predict the sample to be from cultivar 1
• if the first discriminant function is > -1.751107 and <= 2.122505, predict the sample to be from cultivar 2
• if the first discriminant function is > 2.122505, predict the sample to be from cultivar 3
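The cut-off points and the allocation rule above can be sketched in illustrative Python, using the group means reported by `printMeanAndSdByGroup()`:

```python
# Group means of the first discriminant function for the three cultivars
m1, m2, m3 = -3.42248851, -0.07972623, 4.32473717

# Cut-off points are the mid-way points between adjacent group means
cut12 = (m1 + m2) / 2  # about -1.751107
cut23 = (m2 + m3) / 2  # about  2.122505

def allocate(ld1):
    """Allocation rule based on the value of the first discriminant function."""
    if ld1 <= cut12:
        return 1
    elif ld1 <= cut23:
        return 2
    return 3

print(round(cut12, 6), round(cut23, 6))            # -1.751107 2.122505
print(allocate(-4.70), allocate(0.0), allocate(4.32))  # 1 2 3
```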

We can examine the accuracy of this allocation rule by using the “calcAllocationRuleAccuracy()” function below:

`> calcAllocationRuleAccuracy <- function(ldavalue, groupvariable, cutoffpoints)`
`  {`
`     # find out how many values the group variable can take`
`     groupvariable2 <- as.factor(groupvariable[[1]])`
`     levels <- levels(groupvariable2)`
`     numlevels <- length(levels)`
`     # calculate the number of true positives and false negatives for each group`
`     for (i in 1:numlevels)`
`     {`
`        leveli <- levels[i]`
`        levelidata <- ldavalue[groupvariable2==leveli]`
`        # see how many of the samples from this group are classified in each group`
`        for (j in 1:numlevels)`
`        {`
`           levelj <- levels[j]`
`           if (j == 1)`
`           {`
`              cutoff1 <- cutoffpoints[1]`
`              cutoff2 <- "NA"`
`              results <- summary(levelidata <= cutoff1)`
`           }`
`           else if (j == numlevels)`
`           {`
`              cutoff1 <- cutoffpoints[(numlevels-1)]`
`              cutoff2 <- "NA"`
`              results <- summary(levelidata > cutoff1)`
`           }`
`           else`
`           {`
`              cutoff1 <- cutoffpoints[(j-1)]`
`              cutoff2 <- cutoffpoints[j]`
`              results <- summary(levelidata > cutoff1 & levelidata <= cutoff2)`
`           }`
`           trues <- results["TRUE"]`
`           trues <- trues[[1]]`
`           print(paste("Number of samples of group",leveli,"classified as group",levelj," : ",`
`                 trues,"(cutoffs:",cutoff1,",",cutoff2,")"))`
`        }`
`     }`
`  }`

For example, to calculate the accuracy for the wine data based on the allocation rule for the first discriminant function, we type:

`> calcAllocationRuleAccuracy(wine.lda.values$x[,1], wine[1], c(-1.751107, 2.122505))`
`   "Number of samples of group 1 classified as group 1 : 56 (cutoffs: -1.751107 , NA )"`
`   "Number of samples of group 1 classified as group 2 : 3 (cutoffs: -1.751107 , 2.122505 )"`
`   "Number of samples of group 1 classified as group 3 : NA (cutoffs: 2.122505 , NA )"`
`   "Number of samples of group 2 classified as group 1 : 5 (cutoffs: -1.751107 , NA )"`
`   "Number of samples of group 2 classified as group 2 : 65 (cutoffs: -1.751107 , 2.122505 )"`
`   "Number of samples of group 2 classified as group 3 : 1 (cutoffs: 2.122505 , NA )"`
`   "Number of samples of group 3 classified as group 1 : NA (cutoffs: -1.751107 , NA )"`
`   "Number of samples of group 3 classified as group 2 : NA (cutoffs: -1.751107 , 2.122505 )"`
`   "Number of samples of group 3 classified as group 3 : 48 (cutoffs: 2.122505 , NA )"`

This can be displayed in a “confusion matrix”:

|            | Allocated to group 1 | Allocated to group 2 | Allocated to group 3 |
|------------|----------------------|----------------------|----------------------|
| Is group 1 | 56                   | 3                    | 0                    |
| Is group 2 | 5                    | 65                   | 1                    |
| Is group 3 | 0                    | 0                    | 48                   |

There are 3+5+1=9 wine samples that are misclassified, out of (56+3+5+65+1+48=) 178 wine samples: 3 samples from cultivar 1 are predicted to be from cultivar 2, 5 samples from cultivar 2 are predicted to be from cultivar 1, and 1 sample from cultivar 2 is predicted to be from cultivar 3. Therefore, the misclassification rate is 9/178, or 5.1%. The misclassification rate is quite low, and therefore the accuracy of the allocation rule appears to be relatively high.
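The misclassification rate can be computed directly from the confusion matrix, for example with a few lines of illustrative Python:

```python
# Confusion matrix (rows: true cultivar, columns: allocated cultivar)
confusion = [
    [56, 3, 0],
    [5, 65, 1],
    [0, 0, 48],
]
total = sum(sum(row) for row in confusion)
misclassified = total - sum(confusion[i][i] for i in range(3))

print(total, misclassified)                       # 178 9
print(round(100 * misclassified / total, 1))      # 5.1
```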

However, this is probably an underestimate of the misclassification rate, as the allocation rule was based on this data (this is the “training set”). If we calculated the misclassification rate for a separate “test set”, consisting of data other than that used to make the allocation rule, we would probably get a higher estimate of the misclassification rate.

Authored by:
Jeffrey Strickland, Ph.D.

Jeffrey Strickland, Ph.D., is the Author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the Financial and Insurance Industries for over 20 years. Jeff is a Certified Modeling and Simulation professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is also a frequently invited guest speaker and the author of 20 books including:

• Operations Research using Open-Source Tools
• Discrete Event simulation using ExtendSim
• Crime Analysis and Mapping
• Missile Flight Simulation
• Mathematical Modeling of Warfare and Combat Phenomenon
• Predictive Modeling and Analytics
• Using Math to Defeat the Enemy
• Verification and Validation for Modeling and Simulation
• Simulation Conceptual Modeling
• System Engineering Process and Practices
