In my last article, I presented Python programming using *iPython*. There, I used an example of logistic regression modeling for mothers with children having low birth weights. In this article, using the same example, I introduce **Random Forest** with *iPython Notebook*.

**Random Forest** is a machine learning algorithm used for classification, regression, and feature selection. It’s an ensemble technique, meaning it combines the output of decision trees in order to get a stronger result.

In simple terms, Random Forest works by averaging the output of many **decision trees**. It also scores each individual tree’s output by comparing it to the known output from the training data, which allows it to rank features: some of the decision trees will perform better than others, so the features within those trees are deemed more important. A Random Forest generalizes well when each tree is accurate on its own and the trees are diverse.
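
The averaging idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data are synthetic (generated with `make_classification`, not the birth weight data), and the names `forest`, `per_tree`, and `averaged` are made up for the example. Note that scikit-learn's `feature_importances_` is an impurity-based ranking, which is one way to rank features and not necessarily the exact mechanism described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for illustration): 4 features, 2 of them informative
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Each fitted tree produces class probabilities; average them over all trees
per_tree = np.array([tree.predict_proba(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's own probabilities match the average of its trees
print(np.allclose(averaged, forest.predict_proba(X)))

# One importance score per feature; the scores are normalized to sum to 1
print(forest.feature_importances_)
```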

## The Dataset

In this example, we are going to train a Random Forest classification algorithm to predict the class in the test data. The dataset I chose for this example is the Longitudinal Low Birth Weight Study (CLSLOWBWT.DAT) [Hosmer and Lemeshow (2000), *Applied Logistic Regression*, Second Edition]. These data are **copyrighted** by John Wiley & Sons Inc. and must be acknowledged and used accordingly. I have split the data into a training set and a testing set so that each class is represented in both: *train1* is one half of the set (245 rows) and *test1* is the other half (245 rows).

| Variable | Description | Codes/Values | Name |
|---|---|---|---|
| 1 | Identification Code | ID Number | ID |
| 2 | Birth Number | 1-4 | BIRTH |
| 3 | Smoking Status During Pregnancy | 0 = No, 1 = Yes | SMOKE |
| 4 | Race | 1 = White, 2 = Black, 3 = Other | RACE |
| 5 | Age of Mother | Years | AGE |
| 6 | Weight of Mother at Last Menstrual Period | Pounds | LWT |
| 7 | Birth Weight | Grams | BWT |
| 8 | Low Birth Weight | 1 = BWT <=2500g, 0 = BWT >2500g | LOW |
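
For readers who want to produce a half-and-half split themselves, here is one possible sketch. It is not necessarily how *train1* and *test1* were created: it assumes scikit-learn's `train_test_split` with stratification on `LOW`, and the small `data` frame is a hypothetical stand-in for the full dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the full dataset: 20 rows, 6 with LOW = 1
data = pd.DataFrame({"AGE": range(20, 40), "LOW": [1] * 6 + [0] * 14})

# Split in half, stratifying on LOW so both halves keep the same class balance
train_half, test_half = train_test_split(
    data, test_size=0.5, stratify=data["LOW"], random_state=0)

print(len(train_half), len(test_half))                 # rows in each half
print(train_half["LOW"].sum(), test_half["LOW"].sum())  # LOW = 1 rows in each half
```

Stratifying keeps each class represented in both halves, which matters when one class is much smaller than the other.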

## Problem Statement

In this example, we want to predict Low Birth Weight using the remaining dataset variables. Low Birth Weight (LOW), the dependent variable, is coded 1 when BWT <=2500g and 0 when BWT >2500g.
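
The coding of the dependent variable can be written out directly. In this small sketch the birth weights are hypothetical values chosen to sit on either side of the cutoff; they are not taken from the study data.

```python
import pandas as pd

# Hypothetical birth weights in grams (not from the study data)
births = pd.DataFrame({"BWT": [2400, 2500, 2501, 3100]})

# LOW is 1 when BWT <= 2500 g and 0 otherwise
births["LOW"] = (births["BWT"] <= 2500).astype(int)
print(births)
```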

## Import Modules

Note – you have to have *scikit-learn*, *pandas*, *numPy*, and *sciPy* installed for this example. You can install them all easily using pip (‘pip install scipy’, etc). You could also download *Anaconda*.

```
# First let's import required modules
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
# cross_val_score lives in sklearn.model_selection in current scikit-learn
from sklearn.model_selection import cross_val_score
```

## Import Datasets

Now let’s import the dataset using **Pandas**, or *pd*.

```
# Make sure you're in the right directory if using iPython
train = pd.read_csv("C:/Users/Strickland/Documents/Python Scripts/train1.csv")
test = pd.read_csv("C:/Users/Strickland/Documents/Python Scripts/test1.csv")
train.head()
```

## Data Visualization

Before we delve into modeling, let’s explore the data a little. We will use histograms to do this, and plot them within the Notebook.

```
# show plots in the notebook
%matplotlib inline
```

```
# histogram of birth number
train.BIRTH.hist()
plt.title('Histogram of Birth Number')
plt.xlabel('Birth Number')
plt.ylabel('Frequency')
```

```
# histogram of age of mother
train.AGE.hist()
plt.title('Histogram of Age of Mother')
plt.xlabel('Age')
plt.ylabel('Frequency')
```

Let’s take a look at the distribution of smokers for those having children with low birth weights versus those who do not.

```
# Barplot of low birth weights grouped by smoker status (True or False)
pd.crosstab(train.SMOKE, train.LOW.astype(bool)).plot(kind='bar')
plt.title('Smoker Distribution by Low Birth Weight')
plt.xlabel('Smoker')
plt.ylabel('Frequency')
```

## Configure the Data

The data from the training set has to be put into *numpy* arrays in order for the Random Forest algorithm to accept it. Also, the dependent variable array must be 1-D, as opposed to a column vector. `train[cols].values` returns the selected columns as an array (the `as_matrix()` method that older pandas offered for this has since been removed), and `ravel()` converts the column vector into a 1-D array.

```
# The data have to be in a numpy array in order for the random forest algorithm to accept it
# Also, output must be separated
cols = ['BIRTH', 'SMOKE', 'RACE', 'AGE', 'LWT', 'BWT']
colsRes = ['LOW']
trainArr = train[cols].values  # training array
trainRes = np.ravel(train[colsRes].values)  # training results
```
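
To see the difference between a column vector and a 1-D array, here is a small sketch on a hypothetical three-row frame (the values are made up; only the column names echo the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame for illustration
frame = pd.DataFrame({"AGE": [19, 25, 31], "LOW": [1, 0, 0]})

column = frame[["LOW"]].values  # selecting a list of columns gives a 2-D array
flat = np.ravel(column)         # flattened into a 1-D array

print(column.shape)  # (3, 1)
print(flat.shape)    # (3,)
```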

Let’s check our arrays.

`trainArr`

`trainRes`

## Fit the Data

Now, we fit the data using Random Forest.

```
## Training
rf = RandomForestClassifier(n_estimators=100)  # initialize
rf.fit(trainArr, trainRes)  # fit the data to the algorithm
```
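
A note on the settings: `n_estimators=100` is the number of trees in the forest. Because each tree is grown from a random sample, two fits can differ slightly unless a seed is fixed. The sketch below is an optional variation, not part of the original walkthrough: it assumes synthetic data from `make_classification` and uses scikit-learn's `random_state` and `oob_score` parameters to get repeatable fits and an out-of-bag accuracy estimate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

# Same seed, same data: the two forests are built identically
rf_a = RandomForestClassifier(n_estimators=100, random_state=42, oob_score=True)
rf_b = RandomForestClassifier(n_estimators=100, random_state=42, oob_score=True)
rf_a.fit(X_demo, y_demo)
rf_b.fit(X_demo, y_demo)

same = bool((rf_a.predict(X_demo) == rf_b.predict(X_demo)).all())
print(same)

# Accuracy estimated on the rows each tree did not see during training
print(rf_a.oob_score_)
```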

## Prepare the Testing Data

We prepare the testing data the same way we did the training data.

```
## Testing
# put the test data in the same format
testArr = test[cols].values
results = rf.predict(testArr)
```

## Predictions

Next, we add the predictions we obtained with the test data back to the data frame, so we can compare side-by-side.

```
# Add predictions back to the data frame
test['predictions'] = results
```

`test`

## Predicting Probabilities

We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.

```
predicted = rf.predict(testArr)
print(predicted)
```

```
# generate class probabilities
probs = rf.predict_proba(testArr)
print(probs)
```
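
The class labels and the class probabilities are closely related: the predicted label is the class with the highest probability. Here is a minimal sketch of that relationship, assuming synthetic data and a hypothetical model named `model` in place of the forest fitted above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=2)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

proba = model.predict_proba(X_demo)  # one row per sample, one column per class
labels = model.predict(X_demo)

# Columns follow the order of model.classes_; pick the most probable class per row
from_proba = model.classes_[np.argmax(proba, axis=1)]

print(proba.shape)
print((from_proba == labels).all())
```

This is also why `probs[:, 1]` is the probability of class 1 (low birth weight) when the classes are 0 and 1.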

## Predicting the Probability of Low Birth Weight Child

Just for fun, let’s predict the probability of a low birth weight child for a random woman not present in the dataset. She’s 35 years old, of Other race, has had 2 births (has 2 children), is a smoker, and weighs 132 pounds. The model expects one row with the features in the order [BIRTH, SMOKE, RACE, AGE, LWT, BWT].

```
# one row (2-D array), features in the order BIRTH, SMOKE, RACE, AGE, LWT, BWT
rf.predict_proba(np.array([[2, 1, 3, 35, 132, 1]]))
```

## Accuracy Check

Finally, we check the accuracy on the test set and generate evaluation metrics.

```
testRes = np.ravel(test[colsRes].values)  # test results
# check the accuracy on the test set
rf.score(testArr, testRes)
```

```
# generate evaluation metrics
from sklearn import metrics
print(metrics.accuracy_score(testRes, predicted))
print(metrics.roc_auc_score(testRes, probs[:, 1]))
```
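
A single train/test split gives one accuracy number. As an optional complement, cross-validation repeats the check over several splits. The sketch below assumes synthetic data and uses `cross_val_score` from `sklearn.model_selection`; it is an illustration, not a result from the birth weight data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=3)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: fit on four folds, score accuracy on the fifth, five times
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores)
print(scores.mean())
```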

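Though this will not always happen, our predictions appear to be perfect. Keep in mind that BWT is one of the predictors and LOW is defined directly from it (1 = BWT <=2500g), so the model only has to recover that cutoff.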
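
How much a predictor that determines the label can drive the score is easy to see on made-up data. In this sketch everything is hypothetical: the ages and birth weights are randomly generated, and `holdout_accuracy` is a helper written for the example. It compares test accuracy with and without the birth weight column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 400
age = rng.randint(15, 45, size=n)     # hypothetical ages, unrelated to the label
bwt = rng.normal(3000, 600, size=n)   # hypothetical birth weights in grams
low = (bwt <= 2500).astype(int)       # label defined directly from bwt

X_with = np.column_stack([age, bwt])  # predictors include birth weight
X_without = age.reshape(-1, 1)        # predictors exclude birth weight

half = n // 2

def holdout_accuracy(X):
    """Fit on the first half of the rows and score on the second half."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[:half], low[:half])
    return model.score(X[half:], low[half:])

acc_with = holdout_accuracy(X_with)
acc_without = holdout_accuracy(X_without)
print(acc_with, acc_without)
```

With the birth weight column included, the forest can come very close to a perfect score; without it, accuracy falls back toward what the class balance alone allows.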

## Conclusion

The Random Forest algorithm predicted class perfectly with this dataset. That is unlikely to happen with larger datasets, e.g., more records and more variables.

Sometimes in machine learning, models will be overfitted. That is, we may build a model that is too specific to the training data, so that it takes on the random variations of the training data. This can cause problems when we try to generalize the model. As good practice, if your initial dataset is large enough, split the data into training and test data.
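
One simple way to spot overfitting is to compare accuracy on the training data with accuracy on held-out test data. The following sketch assumes synthetic data with deliberately noisy labels (`flip_y=0.2`); the variable names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with about 20% of labels randomly reassigned (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=3,
                                     flip_y=0.2, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5,
                                          random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # accuracy on the data the model has seen
test_acc = model.score(X_te, y_te)   # accuracy on data the model has not seen
print(train_acc, test_acc)
```

A large gap between the two numbers suggests the model has learned the noise in the training data as well as the signal.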

**Authored by:**

**Jeffrey Strickland, Ph.D.**

Jeffrey Strickland, Ph.D., is the author of *Predictive Analytics Using R* and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the Financial and Insurance Industries for over 20 years. Jeff is a Certified Modeling and Simulation professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is also a frequently invited guest speaker and the author of 20 books including:

- *Operations Research using Open-Source Tools*
- *Discrete Event simulation using ExtendSim*
- *Crime Analysis and Mapping*
- *Missile Flight Simulation*
- *Mathematical Modeling of Warfare and Combat Phenomenon*
- *Predictive Modeling and Analytics*
- *Using Math to Defeat the Enemy*
- *Verification and Validation for Modeling and Simulation*
- *Simulation Conceptual Modeling*
- *System Engineering Process and Practices*
