## Basic Ideas

## Use of Classification Trees

**Classification trees** are used to **predict** membership of cases or objects in the classes of a **categorical dependent variable** from their measurements on one or more predictor variables. Classification tree analysis is one of the main techniques used in **Data Mining**.

## Goal of Classification Trees

The goal of classification trees is to predict or explain responses on a categorical dependent variable, such as species of trees or customer segmentation classes. As such, the available techniques have much in common with the techniques used in the more traditional methods of Discriminant Analysis, Cluster Analysis, Nonparametric Statistics, and Nonlinear Estimation. The flexibility of *classification trees* makes them a very attractive analysis option. However, like all methods, I do not recommend their use to the exclusion of more traditional methods. Indeed, when the typically more stringent theoretical and distributional assumptions of more traditional methods are met, the traditional methods may be preferable. But as an exploratory technique, or as a technique of last resort when traditional methods fail, *classification trees* are, in the opinion of many researchers, unsurpassed.

## What are Classification Trees?

Imagine that we want to devise a system for sorting a collection of coins into different classes (perhaps pennies, nickels, dimes or quarters). Suppose that there is a measurement on which the coins differ, say diameter, which can be used to devise a hierarchical system for sorting coins. We might roll the coins on edge down a narrow track in which a slot the diameter of a dime is cut. If the coin falls through the slot it is classified as a dime, otherwise it continues down the track to where a slot the diameter of a penny is cut. If the coin falls through the slot it is classified as a penny, otherwise it continues down the track to where a slot the diameter of a nickel is cut, and so on. We have just constructed a classification tree. The decision process used by our classification tree provides an efficient method for sorting a pile of coins, and more generally, can be applied to a wide variety of classification problems.
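The coin sorter translates directly into nested if-then decisions. As a sketch, the function below uses the actual US Mint coin diameters in millimeters (the diameters and the function name are my additions, chosen only for illustration):

```r
# Classify a US coin by diameter (mm), mimicking the slotted-track sorter:
# the smallest slot (dime) comes first, then penny, then nickel;
# anything that passes every slot is a quarter.
classify_coin <- function(diameter) {
  if (diameter <= 17.91) "dime"         # dime diameter: 17.91 mm
  else if (diameter <= 19.05) "penny"   # penny diameter: 19.05 mm
  else if (diameter <= 21.21) "nickel"  # nickel diameter: 21.21 mm
  else "quarter"                        # quarter diameter: 24.26 mm
}

classify_coin(18.5)   # "penny"
classify_coin(24.26)  # "quarter"
```

Each coin is tested against the slots in order, and the first slot it fits determines its class, exactly as on the track.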

The study and use of classification trees are not widespread in the fields of probability and statistical pattern recognition (Ripley, 1996), but classification trees are widely used in applied fields as diverse as medicine (diagnosis), computer science (data structures), botany (classification), and psychology (decision theory). Classification trees readily lend themselves to being displayed graphically, helping to make them easier to interpret than they would be if only a strict numerical interpretation were possible.

## Classification-type problems

Classification-type problems are generally those where we attempt to predict values of a categorical dependent variable (class, group membership, etc.) from one or more continuous and/or categorical predictor variables. For example, we may be interested in predicting who will or will not drop our insurance coverage, or who will or will not renew a subscription. These would be examples of simple binary classification problems, where the categorical dependent variable can only assume two distinct and mutually exclusive values. In other cases, we might be interested in predicting which one of multiple different alternative consumer products (e.g., makes of cars) a person decides to purchase, or which type of failure occurs with different types of engines. In those cases there are multiple categories or classes for the categorical dependent variable. There are a number of methods for analyzing classification-type problems and for computing predicted classifications, either from simple continuous predictors (e.g., binomial or multinomial logit regression in Generalized Linear Models (GLZ)), from categorical predictors (e.g., Log-Linear analysis of multi-way frequency tables), or both (e.g., via ANCOVA-like designs in GLZ or General Discriminant Function Analysis (GDA)). The Chi-squared Automatic Interaction Detector (CHAID) also analyzes classification-type problems, and produces results that are similar (in nature) to those computed by Classification and Regression Trees (C&RT). Note that various neural network architectures are also applicable to solving classification-type problems.

## Hierarchical Nature of Classification Trees

Breiman et al. (1984) give a number of examples of the use of classification trees. As one example, when heart attack patients are admitted to a hospital, dozens of tests are often performed to obtain physiological measures such as heart rate, blood pressure, and so on. A wide variety of other information is also obtained, such as the patient’s age and medical history. Patients subsequently can be tracked to see if they survive the heart attack, say, at least 30 days. It would be useful in developing treatments for heart attack patients, and in advancing medical theory on heart failure, if measurements taken soon after hospital admission could be used to identify high-risk patients (those who are not likely to survive at least 30 days). One classification tree that Breiman et al. (1984) developed to address this problem was a simple, three-question decision tree. Verbally, the binary classification tree can be described by the statement, “If the patient’s minimum systolic blood pressure over the initial 24-hour period is greater than 91, then if the patient’s age is over 62.5 years, then if the patient displays sinus tachycardia, then and only then is the patient predicted not to survive for at least 30 days.” It is easy to conjure up the image of a decision “tree” from such a statement. A hierarchy of questions is asked, and the final decision that is made depends on the answers to all the previous questions. Similarly, the relationship of a leaf to the tree on which it grows can be described by the hierarchy of splits of branches (starting from the trunk) leading to the last branch from which the leaf hangs. The hierarchical nature of classification trees is one of their most basic features (but the analogy with trees in nature should not be taken too far; most decision trees are drawn downward on paper, so the more exact analogy in nature would be a decision root system leading to the root tips, hardly a poetic image).

The hierarchical nature of classification trees is illustrated by a comparison to the decision-making procedure employed in GDA. A traditional linear discriminant analysis of the heart attack data would produce a set of coefficients defining the single linear combination of blood pressure, patient age, and sinus tachycardia measurements that best differentiates low risk from high risk patients. A score for each patient on the linear discriminant function would be computed as a composite of each patient’s measurements on the three predictor variables, weighted by the respective discriminant function coefficients. The predicted classification of each patient as a low risk or a high risk patient would be made by simultaneously considering the patient’s scores on the three predictor variables. That is, suppose *P* (minimum systolic blood Pressure over the 24-hour period), *A* (Age in years), and *T* (presence of sinus Tachycardia: 0 = not present; 1 = present) are the predictor variables, *p*, *a*, and *t* are the corresponding linear discriminant function coefficients, and *c* is the “cut point” on the discriminant function for separating the two classes of heart attack patients. The decision equation for each patient would be of the form, “if *pP* + *aA* + *tT* − *c* is less than or equal to zero, the patient is low risk, else the patient is high risk.”

In comparison, the decision tree developed by Breiman et al. (1984) would have the following hierarchical form, where *p*, *a*, and *t* would be -91, -62.5, and 0, respectively: “If *p* + *P* is less than or equal to zero, the patient is low risk, else if *a* + *A* is less than or equal to zero, the patient is low risk, else if *t* + *T* is less than or equal to zero, the patient is low risk, else the patient is high risk.” Superficially, the Discriminant Analysis and classification tree decision processes might appear similar, because both involve coefficients and decision equations. But the difference between the simultaneous decisions of Discriminant Analysis and the hierarchical decisions of classification trees cannot be emphasized enough.
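The hierarchical rule can be written out directly in R. This is a sketch of the rule as stated above, with *p* = -91, *a* = -62.5, and *t* = 0; the function name is my own invention:

```r
# Hierarchical (tree-structured) decision rule for the heart-attack example.
# Each test is evaluated only if the previous test failed to classify
# the patient as low risk -- this is what makes the rule hierarchical.
classify_risk <- function(P, A, T) {
  if (-91 + P <= 0) "low risk"          # minimum systolic BP <= 91
  else if (-62.5 + A <= 0) "low risk"   # age <= 62.5 years
  else if (0 + T <= 0) "low risk"       # sinus tachycardia not present
  else "high risk"                      # all three conditions exceeded
}

classify_risk(P = 120, A = 70, T = 1)  # "high risk"
classify_risk(P = 85,  A = 70, T = 1)  # "low risk" -- first test decides
```

Note how *A* and *T* are never even consulted for the second patient: the decision depends on the answers to the previous questions, unlike the single weighted sum used by the discriminant function.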

The distinction between the two approaches can perhaps be made most clear by considering how each analysis would be performed in Regression. Because risk in the example of Breiman et al. (1984) is a dichotomous dependent variable, the GDA predictions could be reproduced by a simultaneous multiple regression of risk on the three predictor variables for all patients. The classification tree predictions could only be reproduced by three separate simple regression analyses, where risk is first regressed on *P* for all patients, then risk is regressed on *A* for patients not classified as low risk in the first regression, and finally, risk is regressed on *T* for patients not classified as low risk in the second regression. This clearly illustrates the simultaneous nature of Discriminant Analysis decisions as compared to the recursive, hierarchical nature of classification trees decisions, a characteristic of classification trees that has far-reaching implications.

## Classification Trees in R

Consider, for example, the widely referenced Iris data classification problem introduced by Fisher (1936). The data file **Irisdat** reports the lengths and widths of sepals and petals of three types of irises (*Setosa*, *Versicol*, and *Virginic*). The purpose of the analysis is to learn how we can discriminate between the three types of flowers, based on the four measures of width and length of petals and sepals. Discriminant function analysis will estimate several linear combinations of predictor variables for computing classification scores (or probabilities) that allow the user to determine the predicted classification for each observation. A classification tree will instead determine a set of logical if-then conditions (rather than linear equations) for predicting or classifying cases. Here is the R code I used to generate a classification tree analysis:
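A minimal sketch of that analysis with the **rpart** package, assuming R’s built-in iris data set with Species as the dependent variable (the fit is named irisfit to match the post() call below; note that with rpart’s defaults the first split may come out on petal length rather than petal width, since either variable separates *Setosa* perfectly):

```r
# Fit a classification tree to Fisher's iris data using rpart
library(rpart)

irisfit <- rpart(Species ~ Sepal.Length + Sepal.Width +
                   Petal.Length + Petal.Width,
                 data = iris, method = "class")

print(irisfit)                 # the fitted tree in text (if-then) form
plot(irisfit, uniform = TRUE)  # draw the tree graph
text(irisfit, use.n = TRUE)    # label nodes with classes and counts
```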

The interpretation of this tree is straightforward: if the petal width is less than or equal to 0.8, the respective flower would be classified as *Setosa*; if the petal width is greater than 0.8 and less than or equal to 1.75, then the respective flower would be classified as *Versicol*; else, it belongs to class *Virginic*.

post(irisfit, file = "irisplot")

The advantages of classification trees over traditional methods such as linear discriminant analysis, at least in some applications, can be illustrated using a simple, fictitious data set. To keep the presentation even-handed, other situations in which linear discriminant analysis would outperform classification trees are illustrated using a second data set.

Completely unrelated, but fun to do, I created a histogram depicting the distribution of Sepal Length. To do this I used ggplot, which requires the library ggplot2, and defined the plot in two parts. The first part is the ggplot basic definition:

sp<-ggplot(iris,aes(x=Sepal.Length))

This does not produce any output, since there are not yet any layers in the plot. I can now add the definition for the histogram in the following manner:

sp+geom_histogram(aes(fill = ..count..))

## Hurricane Classification

Suppose we have records of the Longitude and Latitude coordinates at which 37 storms reached hurricane strength for two classifications of hurricanes – *Baro* hurricanes and *Trop* hurricanes. The fictitious data shown below were presented for illustrative purposes by Elsner, Lehmiller, and Kimberlain (1996), who investigated the differences between baroclinic and tropical North Atlantic hurricanes.

DATA: Barotrop

| LONGITUD | LATITUDE | CLASS |
|---------:|---------:|:------|
| 59.00 | 17.00 | BARO |
| 59.50 | 21.00 | BARO |
| 60.00 | 12.00 | BARO |
| 60.50 | 16.00 | BARO |
| 61.00 | 13.00 | BARO |
| 61.00 | 15.00 | BARO |
| 61.50 | 17.00 | BARO |
| 61.50 | 19.00 | BARO |
| 62.00 | 14.00 | BARO |
| 63.00 | 15.00 | TROP |
| 63.50 | 19.00 | TROP |
| 64.00 | 12.00 | TROP |
| 64.50 | 16.00 | TROP |
| 65.00 | 12.00 | TROP |
| 65.00 | 15.00 | TROP |
| 65.00 | 17.00 | TROP |
| 65.50 | 16.00 | TROP |
| 65.50 | 19.00 | TROP |
| 65.50 | 21.00 | TROP |
| 66.00 | 13.00 | TROP |
| 66.00 | 14.00 | TROP |
| 66.00 | 17.00 | TROP |
| 66.50 | 17.00 | TROP |
| 66.50 | 18.00 | TROP |
| 66.50 | 21.00 | TROP |
| 67.00 | 14.00 | TROP |
| 67.50 | 18.00 | TROP |
| 68.00 | 14.00 | BARO |
| 68.50 | 18.00 | BARO |
| 69.00 | 13.00 | BARO |
| 69.00 | 15.00 | BARO |
| 69.50 | 17.00 | BARO |
| 69.50 | 19.00 | BARO |
| 70.00 | 12.00 | BARO |
| 70.50 | 16.00 | BARO |
| 71.00 | 17.00 | BARO |
| 71.50 | 21.00 | BARO |
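For the R code that follows, the same 37 storms can be entered as a data frame, using the variable names from the listing above (this construction is my own; the values are copied from the table):

```r
# Hurricane data of Elsner, Lehmiller, and Kimberlain (1996), in the
# order listed above: 9 BARO, then 18 TROP, then 10 BARO storms
hurricanes <- data.frame(
  LONGITUD = c(59.0, 59.5, 60.0, 60.5, 61.0, 61.0, 61.5, 61.5, 62.0,
               63.0, 63.5, 64.0, 64.5, 65.0, 65.0, 65.0, 65.5, 65.5, 65.5,
               66.0, 66.0, 66.0, 66.5, 66.5, 66.5, 67.0, 67.5,
               68.0, 68.5, 69.0, 69.0, 69.5, 69.5, 70.0, 70.5, 71.0, 71.5),
  LATITUDE = c(17, 21, 12, 16, 13, 15, 17, 19, 14,
               15, 19, 12, 16, 12, 15, 17, 16, 19, 21,
               13, 14, 17, 17, 18, 21, 14, 18,
               14, 18, 13, 15, 17, 19, 12, 16, 17, 21),
  CLASS = factor(c(rep("BARO", 9), rep("TROP", 18), rep("BARO", 10)))
)
```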

A linear discriminant analysis of hurricane *Class* (*Baro* or *Trop*) using Longitude and Latitude as predictors correctly classifies only 20 of the 37 hurricanes (54%). A classification tree for *Class* using the *C&RT-style exhaustive search for univariate splits* option correctly classifies all 37 hurricanes. The tree graph for the classification tree is shown below.

library(rpart)

hfit <- rpart(CLASS ~ ., data = hurricanes)

plot(hfit, uniform = TRUE, main = "Hurricane Classes")

text(hfit, use.n = TRUE, all = TRUE, cex = .8)

The headings of the graph give the summary information that the *classification tree* has 2 splits and 3 *terminal nodes*. Terminal nodes, or terminal leaves as they are sometimes called, are points on the tree beyond which no further decisions are made. In the graph itself, terminal nodes are outlined with dotted red lines, while the remaining *decision nodes* or *split nodes* are outlined with solid black lines. The tree starts with the top decision node, sometimes called the *root node*. In the graph it is labeled as node 1 in its top-left corner. Initially, all 37 hurricanes are assigned to the root node and tentatively classified as *Baro* hurricanes, as indicated by the *Baro* label in the top-right corner of the root node. *Baro* is chosen as the initial classification because there are slightly more *Baro* than *Trop* hurricanes, as indicated by the histogram plotted within the *root node*. The *legend* identifying which bars in the *node histograms* correspond to *Baro* and *Trop* hurricanes is located in the top-left corner of the graph.

The root node is split, forming two new nodes. The text below the root node describes the split. It indicates that hurricanes with *Longitude* coordinate values of less than or equal to 67.75 are sent to node number 2 and tentatively classified as *Trop* hurricanes, and that hurricanes with *Longitude* coordinate values of greater than 67.75 are assigned to node number 3 and classified as *Baro* hurricanes. The values of 27 and 10 printed above nodes 2 and 3, respectively, indicate the number of cases sent to each of these two *child nodes* from their *parent*, the root node. Similarly, node 2 is subsequently split. The split is such that the 9 hurricanes with *Longitude* coordinate values of less than or equal to 62.5 are sent to node number 4 and classified as *Baro* hurricanes, and the remaining 18 hurricanes with *Longitude* coordinate values of greater than 62.5 are sent to node number 5 and classified as *Trop* hurricanes.

The tree graph presents all this information in a simple, straightforward way, and probably allows us to digest the information in much less time than it takes to read the two preceding paragraphs. Getting to the bottom line, the histograms plotted within the tree’s terminal nodes show that the classification tree classifies the hurricanes perfectly. Each of the terminal nodes is “pure,” containing no misclassified hurricanes. All the information in the tree graph is also available in the tree structure spreadsheet shown below.

Note that in the spreadsheet, nodes 3 through 5 are identified as terminal nodes because no split is performed at those nodes. Also note the signs of the Split constants displayed in the spreadsheet, for example, -67.75 for the split at node 1. In the tree graph, the split condition at node 1 is described as LONGITUD ≤ 67.75 rather than as (the equivalent) -67.75 + LONGITUD ≤ 0. This is done simply to save space on the graph.

When univariate splits are performed, the predictor variables can be ranked on a 0 – 100 scale in terms of their potential importance in accounting for responses on the dependent variable. For this example, Longitude is clearly very important and Latitude is relatively unimportant.
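rpart computes a comparable measure in the `variable.importance` component of a fitted tree. As a self-contained sketch, the example below uses the kyphosis data set that ships with rpart, and rescales the raw scores so the top predictor gets 100, mirroring the 0–100 ranking described above (the rescaling is my own convention, not rpart’s):

```r
library(rpart)

# Fit a tree to kyphosis, a small data set bundled with the rpart package
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

imp <- fit$variable.importance  # raw improvement-based importance scores
round(100 * imp / max(imp))     # rescaled so the top predictor scores 100
```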

ggplot(hurricanes, aes(x = CLASS)) + geom_bar(aes(fill = ..count..))

A classification tree for *Class* using the *Discriminant-based univariate split selection method* option produces similar results. The tree structure spreadsheet for this analysis shows that the splits of *-63.4716* and *-67.7516* are quite similar to the splits found using the *C&RT-style exhaustive search for univariate splits* option, although 1 *Trop* hurricane in terminal node 2 is misclassified as *Baro*.

A categorized scatterplot for *Longitude* and *Latitude* clearly shows why linear discriminant analysis fails so miserably at predicting *Class*, and why the classification tree succeeds so well.

The plot clearly shows that there is no strong linear relationship of longitude or latitude coordinates with *Class*, or of any possible linear combination of longitude and latitude with *Class*. *Class* is not functionally related to longitude or latitude, at least in the linear sense. The LDF (Linear Discriminant Function) Split shown on the graph is almost a “shot in the dark” at trying to separate predicted *Trop* hurricanes (above the split line) from predicted *Baro* hurricanes (below the split line). The C&RT univariate splits, because they are not restricted to a single linear combination of longitude and latitude scores, find the “cut points” on the *Longitude* dimension that allow the best possible (in this case, perfect) classification of hurricane *Class*.

## Simplicity of results

In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for purposes of rapid classification of new observations (it is much easier to evaluate just one or two logical conditions, than to compute classification scores for each possible group, or predicted values, based on all predictors and using possibly some complex nonlinear model equations), but can also often yield a much simpler “model” for explaining why observations are classified or predicted in a particular manner (e.g., when analyzing business problems, it is much easier to present a few simple if-then statements to management, than some elaborate equations).

## Tree methods are nonparametric and nonlinear

The final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or that they are even monotonic in nature. For example, some continuous outcome variable of interest could be positively related to a variable Income if the income is less than some certain amount, but negatively related if it is more than that amount (i.e., the tree could reveal multiple splits based on the same variable Income, revealing such a non-monotonic relationship between the variables). Thus, tree methods are particularly well suited for data mining tasks, where there is often little prior knowledge nor any coherent set of theories or predictions regarding which variables are related and how. In those types of data analyses, tree methods can often reveal simple relationships between just a few variables that could have easily gone unnoticed using other analytic techniques.
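A small simulated example (the variable names are invented here) shows a regression tree recovering exactly such a non-monotonic pattern by splitting more than once on the same predictor:

```r
library(rpart)

set.seed(42)
income <- runif(500, 0, 100)
# Outcome rises with income below 50 and falls above it (non-monotonic)
outcome <- ifelse(income < 50, income, 100 - income) + rnorm(500, sd = 2)

fit <- rpart(outcome ~ income)
print(fit)  # every split in this tree is on the single variable 'income'
```

No linear model of outcome on income alone could represent this relationship, but the tree captures it with a handful of if-then conditions.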

## References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). *Classification and regression trees. *Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.

Fisher, R. A. (1936). *Statistical Methods for Research Workers* (6th ed.). Edinburgh: Oliver and Boyd.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. *Annals of Eugenics*, *7*, 179-188.

Elsner, J. B., Lehmiller, G. S., & Kimberlain, T. B. (1996). Objective classification of Atlantic hurricanes. *Journal of Climate*, *9*, 2880–2889. doi:10.1175/1520-0442(1996)009<2880:OCOAH>2.0.CO;2

Ripley, B. D. (1996). *Pattern recognition and neural networks*. Cambridge: Cambridge University Press.

**Authored by:**

**Jeffrey Strickland, Ph.D.**

Jeffrey Strickland, Ph.D., is the author of “Predictive Analytics Using R” and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the financial and insurance industries for over 20 years. Jeff is a Certified Modeling and Simulation Professional (CMSP) and an Associate Systems Engineering Professional. He has published nearly 200 blog posts on LinkedIn, is a frequently invited guest speaker, and is the author of 20 books, including:

- Discrete Event simulation using ExtendSim
- Crime Analysis and Mapping
- Missile Flight Simulation
- Mathematical modeling of Warfare and Combat Phenomenon
- Predictive Modeling and Analytics
- Using Math to Defeat the Enemy
- Verification and Validation for Modeling and Simulation
- Simulation Conceptual Modeling
- System Engineering Process and Practices
- Weird Scientist: the Creators of Quantum Physics
- Albert Einstein: No one expected me to lay a golden eggs
- The Men of Manhattan: the Creators of the Nuclear Era
- Fundamentals of Combat Modeling

