Random Forest using iPython

Introduction

In my last article, I presented Python programming using iPython. There, I used an example of logistic regression modeling for mothers with children having low birth weights. In this article, using the same example, I introduce Random Forest with iPython Notebook.

Random Forest is a machine learning algorithm used for classification, regression, and feature selection. It’s an ensemble technique, meaning it combines the output of decision trees in order to get a stronger result.

In simplistic terms, Random Forest works by averaging decision tree output. It also scores each individual tree’s output by comparing it to the known output from the training data, which allows it to rank features. With Random Forest, some of the decision trees will perform better than others, and the features within those trees will be deemed more important. A Random Forest that generalizes well will have high accuracy in each tree and high diversity among its trees.
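
To make the averaging idea concrete, here is a minimal sketch, assuming scikit-learn is installed and using a synthetic dataset from make_classification rather than the birth weight data. It checks that the forest's class probabilities equal the mean of its individual trees' probabilities, and it prints the feature ranking that scikit-learn exposes as feature_importances_ (an impurity-based measure, which may differ in detail from the ranking described above). The names X_demo, y_demo, and forest are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration, not the birth weight data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree
tree_probs = np.mean([tree.predict_proba(X_demo) for tree in forest.estimators_],
                     axis=0)
forest_probs = forest.predict_proba(X_demo)
print(np.allclose(tree_probs, forest_probs))

# Impurity-based importance of each feature, normalized to sum to 1
importances = forest.feature_importances_
print(importances)
```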

The Dataset

In this example, we are going to train a Random Forest classification algorithm to predict the class in the test data. The dataset I chose for this example is the Longitudinal Low Birth Weight Study (CLSLOWBWT.DAT) [Hosmer and Lemeshow (2000), Applied Logistic Regression: Second Edition]. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. I have split the data into a training set and a testing set: train1 is the first half of the set and test1 is the other half (244 rows each, as the outputs later in this article show).

Variable  Description                                 Codes/Values                       Name

  1.      Identification Code                         ID Number                          ID
  2.      Birth Number                                1-4                                BIRTH
  3.      Smoking Status During Pregnancy             0 = No, 1 = Yes                    SMOKE
  4.      Race                                        1 = White, 2 = Black, 3 = Other    RACE
  5.      Age of Mother                               Years                              AGE
  6.      Weight of Mother at Last Menstrual Period   Pounds                             LWT
  7.      Birth Weight                                Grams                              BWT
  8.      Low Birth Weight                            1 = BWT <=2500g, 0 = BWT >2500g    LOW

Problem Statement

In this example, we want to predict Low Birth Weight using the remaining dataset variables. Low Birth Weight (LOW) is the dependent variable, coded 1 when BWT <=2500g and 0 when BWT >2500g.
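
The coding of the dependent variable can be checked with a short sketch. The BWT values below are the ones that train.head() displays for the first five rows of train1.csv in this article; the variable names bwt and low are only illustrative.

```python
import numpy as np

# BWT values (grams) for the first five training rows, as displayed by train.head()
bwt = np.array([2457, 2932, 2092, 3402, 3538])

# LOW is 1 when birth weight is at or below 2500 g, otherwise 0
low = (bwt <= 2500).astype(int)
print(low)  # [1 0 1 0 0]
```

The result matches the LOW column of those same five rows.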

Import Modules

Note: you need scikit-learn, pandas, NumPy, SciPy, and matplotlib installed for this example. You can install them all easily using pip (‘pip install scipy’, etc.). You could also download the Anaconda distribution, which bundles them.

In [177]:
# First let's import required modules
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation

Import Datasets

Now let’s import the dataset using pandas (imported above as pd).

In [178]:
# Make sure you're in the right directory if using iPython
train = pd.read_csv("C:/Users/Strickland/Documents/Python Scripts/train1.csv")
test = pd.read_csv("C:/Users/Strickland/Documents/Python Scripts/test1.csv")
train.head()
Out[178]:
ID BIRTH SMOKE RACE AGE LWT BWT LOW
0 1 2 1 1 24 166 2457 1
1 2 1 1 1 27 124 2932 0
2 3 2 1 1 30 136 2092 1
3 4 1 1 1 28 215 3402 0
4 5 2 1 1 32 230 3538 0
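
The article does not say how train1.csv and test1.csv were produced, but the ID values suggest a split by row order. A minimal sketch of such a half-and-half split, assuming a small made-up data frame named full in place of the real file, might look like this:

```python
import pandas as pd

# Hypothetical stand-in for the full dataset: 10 rows with ID and LOW columns
full = pd.DataFrame({"ID": range(1, 11),
                     "LOW": [1, 0, 1, 0, 0, 0, 1, 1, 0, 0]})

half = len(full) // 2
train_demo = full.iloc[:half]   # first half of the rows
test_demo = full.iloc[half:]    # remaining rows

print(len(train_demo), len(test_demo))
```

Splitting by row order is the simplest choice; a shuffled split is a common alternative when the row order carries meaning.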

Data Visualization

Before we delve into modeling, let’s explore the data a little. We will use histograms to do this, and plot them within the Notebook.

In [179]:
# show plots in the notebook
%matplotlib inline
In [180]:
# histogram of birth number
train.BIRTH.hist()
plt.title('Histogram of Birth Number')
plt.xlabel('Birth Number')
plt.ylabel('Frequency')
Out[180]:
<matplotlib.text.Text at 0x244b2ef0>
[Figure RF01: histogram of birth number]
In [181]:
# histogram of age of mother
train.AGE.hist()
plt.title('Histogram of Age of Mother')
plt.xlabel('Age')
plt.ylabel('Frequency')
Out[181]:
<matplotlib.text.Text at 0x244a7b38>
[Figure RF02: histogram of age of mother]

Let’s take a look at the distribution of smokers for those having children with low birth weights versus those who do not.

In [182]:
# Barplot of low birth weights grouped by smoker status (True or False)
pd.crosstab(train.SMOKE, train.LOW.astype(bool)).plot(kind='bar')
plt.title('Smoker Distribution by Low Birth Weight')
plt.xlabel('Smoker')
plt.ylabel('Frequency')
Out[182]:
<matplotlib.text.Text at 0x26e7e588>
[Figure RF03: bar plot of low birth weight counts by smoker status]
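
To see what pd.crosstab computes before it is plotted, here is a small sketch on made-up values (not the real data). The frame name demo and its contents are assumptions for illustration.

```python
import pandas as pd

# Tiny made-up sample: smoker flag and low-birth-weight flag
demo = pd.DataFrame({"SMOKE": [1, 1, 0, 0, 1],
                     "LOW":   [1, 0, 0, 0, 1]})

# Rows are SMOKE values, columns are LOW as False/True, cells are counts
table = pd.crosstab(demo.SMOKE, demo.LOW.astype(bool))
print(table)
```

Each bar in the plot above corresponds to one cell of a table like this one.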

Configure the Data

The data from the training set have to be put into NumPy arrays in order for the Random Forest algorithm to accept them. Also, the dependent variable array must be 1-D, as opposed to a column vector. Selecting the columns and taking .values converts the data frame to a NumPy array (older pandas versions offered train.as_matrix() for this, but it has since been removed), and np.ravel() flattens the column vector into a 1-D array.

In [183]:
# The data have to be in a numpy array in order for the random forest algorithm to accept it
# Also, output must be separated
cols = ['BIRTH', 'SMOKE', 'RACE', 'AGE', 'LWT', 'BWT']
colsRes = ['LOW']
trainArr = train[cols].values # training array
trainRes = np.ravel(train[colsRes].values) # training results

Let’s check our arrays.

In [184]:
trainArr
Out[184]:
array([[   2,    1,    1,   24,  166, 2457],
       [   1,    1,    1,   27,  124, 2932],
       [   2,    1,    1,   30,  136, 2092],
       ..., 
       [   1,    1,    1,   29,  140, 3238],
       [   2,    1,    1,   33,  161, 2966],
       [   1,    1,    1,   19,  138, 2591]], dtype=int64)
In [185]:
trainRes
Out[185]:
array([1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0], dtype=int64)
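
The difference between a column vector and a 1-D array, mentioned in this section, can be seen in a short sketch. The small frame demo is made up for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one feature column and the LOW target
demo = pd.DataFrame({"AGE": [24, 27, 30], "LOW": [1, 0, 1]})

column_vector = demo[["LOW"]].values   # 2-D array, shape (3, 1)
flat = np.ravel(column_vector)         # 1-D array, shape (3,)

print(column_vector.shape, flat.shape)
```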

Fit the Data

Now, we fit the data using Random Forest.

In [186]:
## Training
rf = RandomForestClassifier(n_estimators=100) # initialize
rf.fit(trainArr, trainRes) # fit the data to the algorithm
Out[186]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
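
The output above shows random_state=None, which means the forest is built with a different random seed on each run, so results can vary slightly between runs. A minimal sketch, assuming synthetic data and an arbitrary seed of 42, shows how fixing random_state makes a fit repeatable:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here only so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=1)

# Two forests built with the same seed should make identical predictions
rf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

same = np.array_equal(rf_a.predict(X_demo), rf_b.predict(X_demo))
print(same)
```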

Prepare the Testing Data

We prepare the testing data the same way we did for the training data.

In [187]:
## Testing
# put the test data in the same format
testArr = test[cols].values
results = rf.predict(testArr)

Predictions

Next, we add the predictions we obtained with the test data back to the data frame, so we can compare them side by side with the actual values.

In [188]:
# Add predictions back to the data frame
test['predictions'] = results
In [189]:
test
Out[189]:
ID BIRTH SMOKE RACE AGE LWT BWT LOW predictions
0 245 1 1 3 28 120 2865 0 0
1 246 2 1 3 33 141 2609 0 0
2 247 1 0 1 29 130 2613 0 0
3 248 2 0 1 34 151 3125 0 0
4 249 3 0 1 37 144 2481 1 1
5 250 1 1 2 31 187 1841 1 1
6 251 2 1 2 35 209 1598 1 1
7 252 3 1 2 41 217 2015 1 1
8 253 1 0 3 25 105 3489 0 0
9 254 2 0 3 30 129 3554 0 0
10 255 1 0 3 25 85 2719 0 0
11 256 2 0 3 30 106 2957 0 0
12 257 1 0 3 27 150 3226 0 0
13 258 2 0 3 33 172 3293 0 0
14 259 3 0 3 36 175 3091 0 0
15 260 1 0 3 23 97 3138 0 0
16 261 2 0 3 25 106 3247 0 0
17 262 3 0 3 31 128 3159 0 0
18 263 1 0 2 24 128 2796 0 0
19 264 2 0 2 29 152 2603 0 0
20 265 3 0 2 35 156 2884 0 0
21 266 1 0 3 24 132 3158 0 0
22 267 2 0 3 27 147 3523 0 0
23 268 1 1 1 21 165 3104 0 0
24 269 2 1 1 24 183 3012 0 0
25 270 1 1 1 29 105 3176 0 0
26 271 2 1 1 31 120 2826 0 0
27 272 3 1 1 37 130 2231 1 1
28 273 1 1 1 19 91 3335 0 0
29 274 2 1 1 24 112 3647 0 0
... (rows 30 to 213 are not displayed)
214 459 3 0 1 33 107 2411 1 1
215 460 1 0 1 33 202 3241 0 0
216 461 2 0 1 39 220 3666 0 0
217 462 1 0 3 28 120 3021 0 0
218 463 2 0 3 32 140 3428 0 0
219 464 3 0 3 37 140 3532 0 0
220 465 1 0 3 25 120 3134 0 0
221 466 2 0 3 27 136 3284 0 0
222 467 3 0 3 31 138 3812 0 0
223 468 4 0 3 34 129 3202 0 0
224 469 1 0 1 28 167 2172 1 1
225 470 2 0 1 32 190 2034 1 1
226 471 3 0 1 37 193 2990 0 0
227 472 1 1 1 17 122 2067 1 1
228 473 2 1 1 23 148 1702 1 1
229 474 1 0 1 29 150 2692 0 0
230 475 2 0 1 35 174 3308 0 0
231 476 1 1 2 26 168 3542 0 0
232 477 2 1 2 31 194 3386 0 0
233 478 1 0 2 17 113 2705 0 0
234 479 2 0 2 21 129 2917 0 0
235 480 3 0 2 26 132 2968 0 0
236 481 4 0 2 29 130 2878 0 0
237 482 1 0 2 17 113 3938 0 0
238 483 2 0 2 22 130 4513 0 0
239 484 1 1 1 24 90 2131 1 1
240 485 2 1 1 26 107 1452 1 1
241 486 1 1 2 32 121 2907 0 0
242 487 2 1 2 35 143 2465 1 1
243 488 1 0 1 25 155 2944 0 0

244 rows × 9 columns

Predicting Probabilities

We now predict class labels for the test set again, storing them in predicted (this repeats the rf.predict call made earlier). We will also generate the class probabilities, just to take a look.

In [190]:
predicted = rf.predict(testArr)
print(predicted)
[0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0
 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0
 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 1 1 0 0 0 0 1 1
 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0
 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0]
In [191]:
# generate class probabilities
probs = rf.predict_proba(testArr)
print(probs)
[[ 0.92  0.08]
 [ 0.9   0.1 ]
 [ 0.96  0.04]
 [ 0.99  0.01]
 [ 0.11  0.89]
 [ 0.27  0.73]
 [ 0.27  0.73]
 [ 0.28  0.72]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 0.97  0.03]
 [ 0.98  0.02]
 [ 0.96  0.04]
 [ 0.95  0.05]
 [ 0.99  0.01]
 [ 0.98  0.02]
 [ 0.97  0.03]
 [ 0.99  0.01]
 [ 0.97  0.03]
 [ 1.    0.  ]
 [ 1.    0.  ]
 [ 0.96  0.04]
 [ 0.97  0.03]
 [ 0.98  0.02]
 [ 0.99  0.01]
 [ 1.    0.  ]
 [ 0.95  0.05]
 [ 0.01  0.99]
 [ 0.99  0.01]
 [ 0.96  0.04]
 [ 0.93  0.07]
 [ 0.02  0.98]
 [ 0.06  0.94]
 [ 0.96  0.04]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 0.97  0.03]
 [ 0.85  0.15]
 [ 0.14  0.86]
 [ 0.97  0.03]
 [ 0.06  0.94]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 1.    0.  ]
 [ 0.06  0.94]
 [ 0.07  0.93]
 [ 0.06  0.94]
 [ 0.99  0.01]
 [ 0.98  0.02]
 [ 0.96  0.04]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 0.04  0.96]
 [ 0.15  0.85]
 [ 0.99  0.01]
 [ 1.    0.  ]
 [ 0.94  0.06]
 [ 0.1   0.9 ]
 [ 0.02  0.98]
 [ 0.    1.  ]
 [ 0.97  0.03]
 [ 0.01  0.99]
 [ 0.03  0.97]
 [ 0.97  0.03]
 [ 0.99  0.01]
 [ 0.97  0.03]
 [ 0.99  0.01]
 [ 0.99  0.01]
 [ 0.99  0.01]
 [ 0.16  0.84]
 [ 0.98  0.02]
 [ 0.99  0.01]
 [ 0.99  0.01]
 [ 1.    0.  ]
 [ 0.9   0.1 ]
 [ 0.09  0.91]
 [ 0.03  0.97]
 [ 0.06  0.94]
 [ 0.03  0.97]
 [ 0.98  0.02]
 [ 0.97  0.03]
 [ 1.    0.  ]
 [ 0.98  0.02]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 0.95  0.05]
 [ 0.94  0.06]
 [ 0.97  0.03]
 [ 0.98  0.02]
 [ 0.97  0.03]
 [ 0.89  0.11]
 [ 0.05  0.95]
 [ 0.03  0.97]
 [ 0.03  0.97]
 [ 0.98  0.02]
 [ 0.95  0.05]
 [ 0.26  0.74]
 [ 0.98  0.02]
 [ 0.01  0.99]
 [ 0.02  0.98]
 [ 0.01  0.99]
 [ 0.23  0.77]
 [ 0.99  0.01]
 [ 0.27  0.73]
 [ 0.3   0.7 ]
 [ 0.99  0.01]
 [ 0.99  0.01]
 [ 0.96  0.04]
 [ 0.96  0.04]
 [ 0.12  0.88]
 [ 0.15  0.85]
 [ 0.18  0.82]
 [ 0.97  0.03]
 [ 1.    0.  ]
 [ 0.99  0.01]
 [ 1.    0.  ]
 [ 0.96  0.04]
 [ 0.99  0.01]
 [ 0.06  0.94]
 [ 0.01  0.99]
 [ 0.04  0.96]
 [ 0.09  0.91]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 0.01  0.99]
 [ 1.    0.  ]
 [ 0.99  0.01]
 [ 0.03  0.97]
 [ 0.01  0.99]
 [ 0.01  0.99]
 [ 0.04  0.96]
 [ 0.11  0.89]
 [ 0.11  0.89]
 [ 1.    0.  ]
 [ 0.94  0.06]
 [ 0.99  0.01]
 [ 0.97  0.03]
 [ 0.95  0.05]
 [ 1.    0.  ]
 [ 1.    0.  ]
 [ 0.99  0.01]
 [ 0.96  0.04]
 [ 0.92  0.08]
 [ 0.94  0.06]
 [ 0.02  0.98]
 [ 0.01  0.99]
 [ 0.02  0.98]
 [ 0.96  0.04]
 [ 0.98  0.02]
 [ 0.96  0.04]
 [ 0.91  0.09]
 [ 0.99  0.01]
 [ 0.99  0.01]
 [ 0.97  0.03]
 [ 0.99  0.01]
 [ 0.97  0.03]
 [ 1.    0.  ]
 [ 1.    0.  ]
 [ 0.99  0.01]
 [ 0.98  0.02]
 [ 0.99  0.01]
 [ 0.99  0.01]
 [ 0.99  0.01]
 [ 0.99  0.01]
 [ 0.89  0.11]
 [ 0.    1.  ]
 [ 0.01  0.99]
 [ 0.02  0.98]
 [ 0.06  0.94]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 0.97  0.03]
 [ 0.93  0.07]
 [ 0.98  0.02]
 [ 0.98  0.02]
 [ 0.9   0.1 ]
 [ 0.95  0.05]
 [ 0.99  0.01]
 [ 0.92  0.08]
 [ 0.96  0.04]
 [ 0.93  0.07]
 [ 0.98  0.02]
 [ 0.96  0.04]
 [ 0.96  0.04]
 [ 0.97  0.03]
 [ 1.    0.  ]
 [ 0.92  0.08]
 [ 0.98  0.02]
 [ 0.02  0.98]
 [ 0.92  0.08]
 [ 0.99  0.01]
 [ 0.98  0.02]
 [ 0.04  0.96]
 [ 0.98  0.02]
 [ 0.97  0.03]
 [ 0.96  0.04]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 0.94  0.06]
 [ 0.03  0.97]
 [ 0.06  0.94]
 [ 0.99  0.01]
 [ 0.98  0.02]
 [ 0.94  0.06]
 [ 0.89  0.11]
 [ 0.09  0.91]
 [ 0.11  0.89]
 [ 0.13  0.87]
 [ 0.97  0.03]
 [ 0.98  0.02]
 [ 0.02  0.98]
 [ 1.    0.  ]
 [ 1.    0.  ]
 [ 0.97  0.03]
 [ 0.99  0.01]
 [ 0.99  0.01]
 [ 0.99  0.01]
 [ 0.97  0.03]
 [ 1.    0.  ]
 [ 0.99  0.01]
 [ 0.04  0.96]
 [ 0.22  0.78]
 [ 1.    0.  ]
 [ 0.05  0.95]
 [ 0.03  0.97]
 [ 0.95  0.05]
 [ 1.    0.  ]
 [ 0.96  0.04]
 [ 0.99  0.01]
 [ 0.96  0.04]
 [ 0.99  0.01]
 [ 0.97  0.03]
 [ 0.97  0.03]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 0.02  0.98]
 [ 0.01  0.99]
 [ 0.94  0.06]
 [ 0.2   0.8 ]
 [ 1.    0.  ]]
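
The link between these probabilities and the predicted labels can be sketched with the first five rows of the output above. This assumes the classes are ordered [0, 1], so that column 0 is the probability of LOW = 0 and column 1 is the probability of LOW = 1; the name probs_head is illustrative.

```python
import numpy as np

# First five rows of the probability output printed above
probs_head = np.array([[0.92, 0.08],
                       [0.90, 0.10],
                       [0.96, 0.04],
                       [0.99, 0.01],
                       [0.11, 0.89]])

# Pick the column (class) with the larger probability in each row
labels = np.argmax(probs_head, axis=1)
print(labels)  # [0 0 0 0 1]
```

These labels agree with the first five values of predicted.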

Predicting the Probability of Low Birth Weight Child

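Just for fun, let’s predict the probability of a low birth weight child for a hypothetical woman not present in the dataset. The input must list the features in the same order as the training columns: [BIRTH, SMOKE, RACE, AGE, LWT, BWT]. The values passed below are BIRTH = 0, SMOKE = 1 (a smoker), RACE = 1, AGE = 35, LWT = 192, and BWT = 1. Because the model was trained with BWT as a feature, a birth weight has to be supplied as well; the value of 1 used here is not a realistic birth weight, so the result is purely illustrative.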

In [192]:
rf.predict_proba(np.array([[0, 1, 1, 35, 192, 1]])) # 2-D input: one row, six features
Out[192]:
array([[ 0.22,  0.78]])
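
Because the order of the values matters, one way to avoid mistakes is to build the input row from named values. The sketch below is only an illustration: the dictionary sample is a hypothetical helper, and its values are copied from the first row of the test table shown earlier.

```python
import numpy as np

cols = ['BIRTH', 'SMOKE', 'RACE', 'AGE', 'LWT', 'BWT']

# Feature values copied from the first row of the test table shown earlier
sample = {'BIRTH': 1, 'SMOKE': 1, 'RACE': 3, 'AGE': 28, 'LWT': 120, 'BWT': 2865}

# Build a 2-D array with one row, ordered like the training columns
row = np.array([[sample[c] for c in cols]])
print(row.shape)  # (1, 6)
```

An array built this way could then be passed to rf.predict_proba(row).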

Accuracy Check

Finally, we check the accuracy on the test set and generate evaluation metrics.

In [193]:
testRes = np.ravel(test[colsRes].values) # test results (true labels)
# check the accuracy on the test set
rf.score(testArr, testRes)
Out[193]:
1.0
In [194]:
# generate evaluation metrics
from sklearn import metrics  # needed for accuracy_score and roc_auc_score
print(metrics.accuracy_score(testRes, predicted))
print(metrics.roc_auc_score(testRes, probs[:, 1]))
1.0
1.0

Though this will not always happen, our predictions appear to be perfect.
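
For readers who want to see what the accuracy figure measures, here is a minimal sketch on made-up labels (not the article's results). It computes accuracy by hand, compares it with metrics.accuracy_score, and prints a confusion matrix; the arrays y_true and y_pred are assumptions for illustration.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration (not the article's results)
y_true = np.array([0, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0])

# Accuracy is the fraction of predictions that match the true labels
manual_accuracy = np.mean(y_true == y_pred)
print(manual_accuracy, metrics.accuracy_score(y_true, y_pred))

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
```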

Conclusion

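The Random Forest algorithm predicted the class perfectly with this dataset. That is unlikely to happen with larger datasets, e.g., more records and more variables. It is worth noting that BWT is among the predictors and LOW is defined directly from it (LOW = 1 when BWT <=2500g), so the trees can recover the class from that single variable, which likely explains the perfect score.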

Sometimes in machine learning, models will be overfitted. That is, we may build our models too specifically to the training data, so that the model takes on the random variation of the training data. This can cause problems when we try to generalize the model. As good practice, if your initial dataset is large enough, split the data into training and test data, as we did here.
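
Cross-validation is another common way to check for overfitting, and cross_val_score was imported at the start of this article for that purpose. The sketch below is a hedged illustration on synthetic data, not the birth weight data; the names X_demo, y_demo, and model, and the choice of five folds, are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data so the sketch is self-contained (not the birth weight data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation: fit on four folds, score accuracy on the held-out fold
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores.mean())
```

A large gap between the training accuracy and the cross-validated accuracy would be one sign of overfitting.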


Authored by:
Jeffrey Strickland, Ph.D.

Jeffrey Strickland, Ph.D., is the Author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the Financial and Insurance Industries for over 20 years. Jeff is a Certified Modeling and Simulation professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is also a frequently invited guest speaker and the author of 20 books including:

  • Operations Research using Open-Source Tools
  • Discrete Event simulation using ExtendSim
  • Crime Analysis and Mapping
  • Missile Flight Simulation
  • Mathematical Modeling of Warfare and Combat Phenomenon
  • Predictive Modeling and Analytics
  • Using Math to Defeat the Enemy
  • Verification and Validation for Modeling and Simulation
  • Simulation Conceptual Modeling
  • System Engineering Process and Practices

