Introduction
IPython provides a rich architecture for interactive computing, including [1]:
- A powerful interactive shell.
- A kernel for Jupyter.
- Support for interactive data visualization and use of GUI toolkits.
- Flexible, embeddable interpreters to load into your own projects.
- Easy to use, high performance tools for parallel computing.
Beyond this point in the article, all of the headings, body text, code, and output are copied from an IPython notebook saved in HTML format. Moreover, all of the code was executed from the IPython notebook.
Logistic Regression
This is an example of performing logistic regression in Python with the Scikit-learn module. In this example, we use many Python functions beyond what a simple model requires, partly to explore more advanced modeling, array manipulation, and evaluation techniques. We will also use several external modules for importing, formatting, and manipulating data, as well as for modeling and graphical exploration.
Dataset
The dataset I chose for this example is the Longitudinal Low Birth Weight Study (CLSLOWBWT.DAT), from Hosmer and Lemeshow (2000), Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly.
List of Variables
| Variable | Description | Codes/Values | Name |
|----------|-------------|--------------|------|
| 1 | Identification Code | ID Number | ID |
| 2 | Birth Number | 1-4 | BIRTH |
| 3 | Smoking Status During Pregnancy | 0 = No, 1 = Yes | SMOKE |
| 4 | Race | 1 = White, 2 = Black, 3 = Other | RACE |
| 5 | Age of Mother | Years | AGE |
| 6 | Weight of Mother at Last Menstrual Period | Pounds | LWT |
| 7 | Birth Weight | Grams | BWT |
| 8 | Low Birth Weight | 1 = BWT <= 2500g, 0 = BWT > 2500g | LOW |
Problem Statement
In this example, we want to predict Low Birth Weight using the remaining dataset variables. The dependent variable, Low Birth Weight (LOW), equals 1 when BWT <= 2500g and 0 when BWT > 2500g.
Import Modules
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.cross_validation import cross_val_score
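Note: this notebook was run against the Scikit-learn release current at the time of writing. In Scikit-learn 0.18 and later, the cross_validation module was folded into model_selection, so with a newer installation the equivalent imports would be:
# equivalent imports for scikit-learn >= 0.18
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score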
Data Pre-Processing
# load dataset (a raw string keeps the Windows path's backslashes from being read as escapes)
dta = pd.read_csv(r"C:\Users\Strickland\Documents\Python Scripts\CLSLOWBWT.csv")
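Before exploring the data, it is worth a quick sanity check that the file loaded as expected (a small addition, not in the original notebook):
# confirm the dimensions and peek at the first few rows
print(dta.shape)
dta.head()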
Data Exploration
dta.groupby('LOW').mean()
We can see that, on average, women who have children with a low birth weight are more likely to be smokers. This is consistent with what scientific studies have found. Let’s take another look at the Birth Number variable.
dta.groupby('BIRTH').mean()
Low Birth Weight trends upward with more births.
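Returning to the smoking comparison, it can be made explicit by taking the mean of SMOKE within each LOW group; since SMOKE is coded 0/1, the group mean is the smoking rate (a quick check, not in the original notebook):
# smoking rate among mothers of normal versus low birth weight children
dta.groupby('LOW')['SMOKE'].mean()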
Data Visualization
# show plots in the notebook
%matplotlib inline
# histogram of birth number
dta.BIRTH.hist()
plt.title('Histogram of Birth Number')
plt.xlabel('Birth Number')
plt.ylabel('Frequency')
plt.show()
# histogram of age of mother
dta.AGE.hist()
plt.title('Histogram of Age of Mother')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Let’s take a look at the distribution of smokers for those having children with low birth weights versus those who do not.
# barplot of low birth weights grouped by smoker status (True or False)
pd.crosstab(dta.SMOKE, dta.LOW.astype(bool)).plot(kind='bar')
plt.title('Smoker Distribution by Low Birth Weight')
plt.xlabel('Smoker')
plt.ylabel('Frequency')
Now let’s use a stacked barplot to look at the percentage of women having children with low birth weights by age.
# cross-tabulate Age of Mother against LOW, then normalize each row so the bars show percentages
low_age = pd.crosstab(dta.AGE, dta.LOW.astype(bool))
low_age.div(low_age.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Low Birth Weight Percentage by Age of Mother')
plt.xlabel('Age of Mother')
plt.ylabel('Percentage')
Prepare Data for Logistic Regression
To prepare the data, we want to add an intercept column to the design matrix. The dmatrices function from the patsy module can do that using formula language. For this first model, Age of Mother and Weight of Mother at Last Menstrual Period remain numeric predictors; we will revisit AGE as a categorical variable (with patsy-generated dummy columns) in a later section.
# create dataframes with an intercept column and dummy variables for
y, X = dmatrices('LOW ~ ID + BIRTH + SMOKE + RACE + AGE + LWT',
dta, return_type="dataframe")
print X.columns
Later, we will convert the numeric (interval) variable AGE into a categorical variable with four classes. For now, we need to flatten y into a 1-D array so that Scikit-learn will properly understand it as the response variable.
# flatten y into a 1-D array
y = np.ravel(y)
First Logistic Regression
Let’s go ahead and run logistic regression on the entire data set, and see how accurate it is.
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)
# check the accuracy on the training set
model.score(X, y)
70.7% accuracy seems good, but what’s the null error rate?
# what percentage had low birth weights?
y.mean()
Only 31% of the women had low birth weight children, which means that you could obtain 69% accuracy by always predicting “no”. So we’re doing better than the null error rate, but not by much. Let’s examine the coefficients to see what we learn.
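That null-accuracy reasoning can be checked directly from the flattened response (a one-line addition):
# accuracy of always predicting "no" (the null model)
print(1 - y.mean())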
# examine the coefficients
pd.DataFrame(zip(X.columns, np.transpose(model.coef_)))
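Since logistic regression coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are often easier to interpret. A small follow-up, not part of the original notebook:
# odds ratios: the multiplicative change in the odds of LOW per unit increase in each predictor
pd.DataFrame(zip(X.columns, np.transpose(np.exp(model.coef_))))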
Converting Variables & Merging Dataframes
Now, we want to take the variable AGE and convert it to a categorical variable, to see if we can improve the model. We will do this by creating two data frames from our original data and then merging them, so both need to contain the ID variable. One data frame will have the converted categorical variable age_groups, and the other will have the dependent variable and the remaining independent variables.
df1 = pd.DataFrame(dta, columns=['ID','AGE'])
df2 = pd.DataFrame(dta, columns=['ID', 'BIRTH', 'SMOKE', 'RACE', 'LWT', 'LOW'])
bins = [15, 25, 35, 45, 55]
group_names = ['15-24', '25-34', '35-44', '45-55']
# bin AGE into four groups and store the result as a new column
df1['age_groups'] = pd.cut(df1['AGE'], bins, labels=group_names)
df1.head(5)
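Note that pd.cut produces right-closed intervals by default, so the bins here are (15, 25], (25, 35], and so on; a mother aged exactly 25 falls in the '15-24' group. If left-closed bins are preferred, pd.cut accepts right=False (a variation, not used in this notebook):
# left-closed bins: [15, 25), [25, 35), ...
pd.cut(df1['AGE'], bins, labels=group_names, right=False)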
We now merge the two data frames.
left = df2
right = df1
result = pd.merge(left, right, on='ID')
result.head(5)
Second Logistic Regression
We are now ready to build and evaluate our second logistic regression model, using the merged data frames.
y, Z = dmatrices('LOW ~ BIRTH + SMOKE + RACE + age_groups + LWT', result, return_type="dataframe")
print Z.columns
Since we rebuilt y when we converted AGE and merged the data frames, we again need to flatten it into a 1-D array so that Scikit-learn will properly understand it as the response variable.
# flatten y into a 1-D array
y = np.ravel(y)
Before we perform the logistic regression, we want to check the matrix we just formed to ensure it is consistent with our intent.
Z.head(5)
Finally, we are ready to execute the logistic regression model and see how accurate it is.
# instantiate a logistic regression model, and fit with X and y
model1 = LogisticRegression()
model1 = model1.fit(Z, y)
# check the accuracy on the training set
model1.score(Z, y)
71.3% accuracy seems good, but what’s the null error rate? We also want to recheck the percentage of low birth weights.
# what percentage had low birth weights?
y.mean()
Still, only 31% of the women had low birth weight children, which means that you could obtain 69% accuracy by always predicting “no”. So we’re doing better than the null error rate, but not by much. Let’s examine the coefficients to see what we learn.
# examine the coefficients of the second model
pd.DataFrame(zip(Z.columns, np.transpose(model1.coef_)))
Increases in Birth Number and RACE correspond to a decrease in the likelihood of having a Low Birth Weight child, while smoking corresponds to an increase in that likelihood. For Age Group, the lowest likelihood of having a Low Birth Weight child corresponds to the baseline age group (15-24), since all of the dummy coefficients are positive.
Model Evaluation Using a Validation Set
So far, we have trained and tested on the same set. Let’s instead split the data into a training set and a testing set.
# evaluate the model by splitting into train and test sets
Z_train, Z_test, y_train, y_test = train_test_split(Z, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(Z_train, y_train)
We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.
# predict class labels for the test set
predicted = model2.predict(Z_test)
print predicted
# generate class probabilities
probs = model2.predict_proba(Z_test)
print probs
As you can see, the classifier is predicting a 1 (having a Low Birth Weight child) any time the probability in the second column is greater than 0.5. Now let’s generate some evaluation metrics.
# generate evaluation metrics
print metrics.accuracy_score(y_test, predicted)
print metrics.roc_auc_score(y_test, probs[:, 1])
The accuracy is 67.8%, which is close to what we experienced when training and predicting on the same data. We can also see the confusion matrix and a classification report with other metrics.
print metrics.confusion_matrix(y_test, predicted)
print metrics.classification_report(y_test, predicted)
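The default 0.5 cut-off is not sacred. If a different operating point is desired, say to catch more low birth weight cases at the cost of more false positives, the probability column can be thresholded directly (a sketch with a hypothetical 0.3 threshold):
# classify as low birth weight whenever the predicted probability exceeds 0.3
predicted_03 = (probs[:, 1] > 0.3).astype(int)
print(metrics.confusion_matrix(y_test, predicted_03))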
Model Evaluation Using Cross-Validation
Now let’s try 10-fold cross-validation, to see if the accuracy holds up more rigorously.
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), Z, y, scoring='accuracy', cv=10)
print scores
print scores.mean()
Looks good. It’s still performing at 69% accuracy.
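cross_val_score accepts other metrics as well; for instance, cross-validated AUC under the same 10-fold scheme (a variation, not run in the original notebook):
# 10-fold cross-validated AUC
auc_scores = cross_val_score(LogisticRegression(), Z, y, scoring='roc_auc', cv=10)
print(auc_scores.mean())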
Predicting the Probability of Low Birth Weight Child
Just for fun, let’s predict the probability of a low birth weight child for a random woman not present in the dataset. She is 35 years old, of Other race, has had 2 births (has 2 children), is a smoker, and weighs 132 pounds.
# new observation, in the column order of the first design matrix X:
# Intercept, ID, BIRTH, SMOKE, RACE, AGE, LWT (the ID value is arbitrary)
model.predict_proba(np.array([[1, 0, 2, 1, 3, 35, 132]]))
The predicted probability of a low birth weight child is 39.3%.
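To guard against ordering mistakes, the new observation can also be built as a one-row DataFrame that reuses the design matrix’s own column names (a defensive variant, assuming X is still in scope):
# same observation, with columns labeled by name rather than matched by position alone
new_obs = pd.DataFrame([[1, 0, 2, 1, 3, 35, 132]], columns=X.columns)
print(model.predict_proba(new_obs))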
Next Steps
There are many different steps that could be tried in order to improve the model:
- including interaction terms
- removing features
- regularization techniques (see the sketch below)
- using a non-linear model
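As a concrete illustration of the regularization item above: Scikit-learn’s LogisticRegression applies L2 regularization by default, with strength controlled by the inverse-regularization parameter C. A quick sketch comparing a few illustrative (untuned) values:
# smaller C means stronger regularization
for C in (0.1, 1.0, 10.0):
    scores = cross_val_score(LogisticRegression(C=C), Z, y, scoring='accuracy', cv=10)
    print('C=%s: mean accuracy %.3f' % (C, scores.mean()))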
References
[1] IPython Home Page, http://ipython.org/
Authored by:
Jeffrey Strickland, Ph.D.
Jeffrey Strickland, Ph.D., is the author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation, and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the financial and insurance industries for over 20 years. Jeff is a Certified Modeling and Simulation Professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is a frequently invited guest speaker, and is the author of 20 books, including:
- Operations Research using Open-Source Tools
- Discrete Event Simulation using ExtendSim
- Crime Analysis and Mapping
- Missile Flight Simulation
- Mathematical Modeling of Warfare and Combat Phenomenon
- Predictive Modeling and Analytics
- Using Math to Defeat the Enemy
- Verification and Validation for Modeling and Simulation
- Simulation Conceptual Modeling
- System Engineering Process and Practices