Articles

Logistic Regression with iPython

Introduction

IPython is a growing project, with increasingly language-agnostic components. IPython 3.x was the last monumental release of IPython, containing the notebook server, qtconsole, etc. As of IPython 4.0, the language-agnostic parts of the project: the notebook format, message protocol, qtconsole, notebook web application, etc. have moved to new projects under the name Jupyter. IPython itself is focused on interactive Python, part of which is providing a Python kernel for Jupyter [1].

IPython provides a rich architecture for interactive computing with [1]:

  • A powerful interactive shell.
  • A kernel for Jupyter.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into your own projects.
  • Easy to use, high performance tools for parallel computing.

Beyond this point in the article, all of the headings, body, code, and output are copy and pasted from an iPython notebook, saved in html format. Moreover, all of the code was execute from the iPython notebook

Logistic Regression

This is an example of performing logistic regression in Python with the Scikit-learn module. In this example, we perform many useful python functions beyond what we need for a simple model. This is done partially to explore some more advanced modeling, array manipulation, evaluation, and so on. We will also use several external modules importing data, formatting data, manipulating data, modeling and graphical exploration.

Dataset

The dataset I chose for this example in Longitudinal Low Birth Weight Study (CLSLOWBWT.DAT). Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly.

List of Variables

Variable Description Codes/Values Name


1 Identification Code ID Number ID

2 Birth Number 1-4 BIRTH

3 Smoking Status 0 = No, 1 = Yes SMOKE During Pregnancy

4 Race 1 = White, 2 = Black RACE 3 = Other

5 Age of Mother Years AGE

6 Weight of Mother at Pounds LWT Last Menstrual Period

7 Birth Weight Grams BWT

8 Low Birth Weight 1 = BWT <=2500g, LOW 0 = BWT >2500g


Problem Statement

In this example, we want to predict Low Birth Weight using the remaining dataset variables. Low Birth Weight, the dependent variable, 1 = BWT <=2500g and 0 = BWT >2500g.

Import Modules

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.cross_validation import cross_val_score

Data Pre-Processing

In [2]:
# load dataset
dta = pd.read_csv("C:\Users\Strickland\Documents\Python Scripts\CLSLOWBWT.csv")

Data Exploration

In [3]:
dta.groupby('LOW').mean()
Out[3]:
ID BIRTH SMOKE RACE AGE LWT BWT
LOW
0 92.875371 1.836795 0.338279 1.976261 26.163205 144.741840 3204.059347
1 95.092715 1.953642 0.536424 1.576159 27.059603 138.304636 2033.867550

We can see that on average, women who have children with a low birth weight are more likely to be smokers than nonsmokers. This makes sense based on scientific studies. Let’s take another look at the Birth Number variable.

In [4]:
dta.groupby('BIRTH').mean()
Out[4]:
ID SMOKE RACE AGE LWT BWT LOW
BIRTH
1 94.500000 0.393617 1.851064 22.819149 129.851064 2848.159574 0.281915
2 94.500000 0.393617 1.851064 26.930851 150.638298 2857.367021 0.303191
3 89.091837 0.408163 1.857143 31.153061 150.520408 2838.153061 0.367347
4 99.642857 0.500000 1.857143 35.500000 155.642857 2578.857143 0.357143

Low Birth Weight trends upward with more births.

Data Visualization

In [5]:
# show plots in the notebook
%matplotlib inline
In [6]:
# histogram of birth number
dta.BIRTH.hist()
plt.title('Histogram of Low Birth Weight')
plt.xlabel('Birth Number')
plt.ylabel('Frequency')
Out[6]:
<matplotlib.text.Text at 0x19843978>
ipython-histo01
In [7]:
# histogram of age of mother
dta.AGE.hist()
plt.title('Histogram of Age of Mother')
plt.xlabel('Age')
plt.ylabel('Frequency')
Out[7]:
<matplotlib.text.Text at 0x199d47b8>
 ipython-histo01a

Let’s take a look at the distribution of smokers for those having children with low birth weights versus those who do not.

In [8]:
# barplot of low birth weights grouped by smoker status (True or False)
pd.crosstab(dta.SMOKE, dta.LOW.astype(bool)).plot(kind='bar')
plt.title('Somker Distribution by Low Birth Weight')
plt.xlabel('Smoker')
plt.ylabel('Frequency')
Out[8]:
<matplotlib.text.Text at 0x19b084e0>
 ipython-histo03

Now let’s use a stacked barplot to look at the percentage of women having children with low birth weights by age.

In [9]:
low_age = pd.crosstab(dta.AGE, dta.LOW.astype(bool))
low_age.div(low_age.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Low Birth Weight Percentage by Age of Mother')
plt.xlabel('Age of Mother')
plt.ylabel('Percentage')
Out[9]:
<matplotlib.text.Text at 0x19ff5780>
 ipython-histo04

Prepare Data for Logistic Regression

To prepare the data, we want to add an intercept column as well as dummy variables for Age of Mother and Weight of Mother at Last Menstrual Period, since we are treating them as categorical variables. The dmatrices function from the patsy module can do that using formula language.

The column names for the dummy variables are messy, so let’s rename those.

In [10]:
# create dataframes with an intercept column and dummy variables for
y, X = dmatrices('LOW ~ ID + BIRTH + SMOKE + RACE + AGE + LWT',
                  dta, return_type="dataframe")
print X.columns
Index([u'Intercept', u'ID', u'BIRTH', u'SMOKE', u'RACE', u'AGE', u'LWT'], dtype='object')

We now want to convert the numeric (interval) variable AGE to a categorical variable with 4 classes.

We also need to flatten y into a 1-D array, so that Scikit-learn will properly understand it as the response variable.

In [11]:
# flatten y into a 1-D array
y = np.ravel(y)

First Logistic Regression

Let’s go ahead and run logistic regression on the entire data set, and see how accurate it is.

In [12]:
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)
Out[12]:
0.70696721311475408

70.7% accuracy seems good, but what’s the null error rate?

In [13]:
# what percentage had low birth weights?
y.mean()
Out[13]:
0.3094262295081967

Only 31% of the women had low birth rate children, which means that you could obtain 69% accuracy by always predicting “no”. So we’re doing better than the null error rate, but not by much. Let’s examine the coefficients to see what we learn.

In [14]:
# examine the coefficients
pd.DataFrame(zip(X.columns, np.transpose(model.coef_)))
Out[14]:
0 1
0 Intercept [0.0336484036311]
1 ID [0.00231314008486]
2 BIRTH [0.185995251393]
3 SMOKE [0.546655639398]
4 RACE [-0.413795537922]
5 AGE [0.0257483072665]
6 LWT [-0.0115073134929]

Converting Variables & Merging Dataframes

Now, we want to take the variable AGE and convert it to a categorical variable, to see if we can improve the model. We will do this by creating two data frames from our original data. We will then merge the two data frames, so both need to contain the ID variable. One data from will have the converted categorical variable age_group, and the other will have the dependent variable and the other independent variables.

In [15]:
df1 = pd.DataFrame(dta, columns=['ID','AGE'])
df2 = pd.DataFrame(dta, columns=['ID', 'BIRTH', 'SMOKE', 'RACE', 'LWT', 'LOW'])
bins = [15, 25, 35, 45, 55]
group_names = ['15-24', '25-34', '35-44', '45-55']
age_groups = pd.cut(df1['AGE'], bins, labels=group_names)
df1['age_groups'] = pd.cut(df1['AGE'], bins, labels=group_names)
categories
df1.head(5)
Out[15]:
ID AGE age_groups
0 1 28 25-34
1 1 33 25-34
2 2 29 25-34
3 2 34 25-34
4 2 37 35-44

We now merge the two data frames.

In [16]:
left = df2
right = df1
result = pd.merge(left, right, on='ID')
result.head(5)
Out[16]:
ID BIRTH SMOKE RACE LWT LOW AGE age_groups
0 1 1 1 3 120 0 28 25-34
1 1 1 1 3 120 0 33 25-34
2 1 2 1 3 141 0 28 25-34
3 1 2 1 3 141 0 33 25-34
4 2 1 0 1 130 0 29 25-34

Second Logistic Regression

We are now ready to build and evaluate our second logistic regression model, using the merged data frames.

In [17]:
y, Z = dmatrices('LOW ~ BIRTH + SMOKE + RACE + age_groups + LWT', result, return_type="dataframe")
print Z.columns
Index([u'Intercept', u'age_groups[T.25-34]', u'age_groups[T.35-44]',
       u'age_groups[T.45-55]', u'BIRTH', u'SMOKE', u'RACE', u'LWT'],
      dtype='object')

Since we change the size of y when we converted AGE and merged, we also need to flatten y again into a 1-D array, so that Scikit-learn will properly understand it as the response variable

In [18]:
# flatten y into a 1-D array
y = np.ravel(y)

Before we perform the logistic regression, we want to check the matrix we just formed to ensure it is consistent with our intent.

In [19]:
Z.head(5)
Out[19]:
Intercept age_groups[T.25-34] age_groups[T.35-44] age_groups[T.45-55] BIRTH SMOKE RACE LWT
0 1 1 0 0 1 1 3 120
1 1 1 0 0 1 1 3 120
2 1 1 0 0 2 1 3 141
3 1 1 0 0 2 1 3 141
4 1 1 0 0 1 0 1 130

Finally, we are ready to execute the logistic regression model and see how accurate it is.

In [20]:
# instantiate a logistic regression model, and fit with X and y
model1 = LogisticRegression()
model1 = model1.fit(Z, y)

# check the accuracy on the training set
model1.score(Z, y)
Out[20]:
0.71320754716981127

71.3% accuracy seems good, but what’s the null error rate? We also want to recheck the percentage of low birth weights.

In [21]:
# what percentage had low birth weights?
y.mean()
Out[21]:
0.30867924528301888

Still, only 31% of the women had low birth rate children, which means that you could obtain 69% accuracy by always predicting “no”. So we’re doing better than the null error rate, but not by much. Let’s examine the coefficients to see what we learn.

In [22]:
# examine the coefficients
pd.DataFrame(zip(Z.columns, np.transpose(model.coef_)))
Out[22]:
0 1
0 Intercept [0.0336484036311]
1 age_groups[T.25-34] [0.00231314008486]
2 age_groups[T.35-44] [0.185995251393]
3 age_groups[T.45-55] [0.546655639398]
4 BIRTH [-0.413795537922]
5 SMOKE [0.0257483072665]
6 RACE [-0.0115073134929]

Increases in Birth Number and RACE correspond to a decrease in the likelihood of having a Low Birth Weight child. A decrease in Smoking Status corresponds to a decrease in the likelihood of having a Low Birth Weight child. For Age Group, the lowest likelihood of having Low Birth Weight child corresponds to the baseline age group (15-24), since all of the dummy coefficients are positive.

Model Evaluation Using a Validation Set

So far, we have trained and tested on the same set. Let’s instead split the data into a training set and a testing set.

In [23]:
# evaluate the model by splitting into train and test sets
Z_train, Z_test, y_train, y_test = train_test_split(Z, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(Z_train, y_train)
Out[23]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.

In [24]:
# predict class labels for the test set
predicted = model2.predict(Z_test)
print predicted
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.
  1.  1.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  1.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  1.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.
  0.  1.  1.  0.  0.  0.  0.  0.  0.  1.  0.  0.  1.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  1.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.
  0.  1.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  1.  0.  0.
  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  1.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  1.  1.  0.  0.
  1.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.
  1.  0.  1.  1.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]
In [25]:
# generate class probabilities
probs = model2.predict_proba(Z_test)
print probs
[[ 0.70610757  0.29389243]
 [ 0.82942743  0.17057257]
 [ 0.78996271  0.21003729]
 [ 0.71714883  0.28285117]
 [ 0.82860543  0.17139457]
 [ 0.66919892  0.33080108]
 [ 0.80411336  0.19588664]
 [ 0.82942743  0.17057257]
 [ 0.80489925  0.19510075]
             .
             .
             .
 [ 0.59655861  0.40344139]
 [ 0.78276183  0.21723817]
 [ 0.71000202  0.28999798]]

As you can see, the classifier is predicting a 1 (having a Low Birth Weight child) any time the probability in the second column is greater than 0.5. Now let’s generate some evaluation metrics.

In [26]:
# generate evaluation metrics
print metrics.accuracy_score(y_test, predicted)
print metrics.roc_auc_score(y_test, probs[:, 1])
0.678391959799
0.678443267259

The accuracy is 67.8%, which is the close to what we experienced when training and predicting on the same data. We can also see the confusion matrix and a classification report with other metrics.

In [27]:
print metrics.confusion_matrix(y_test, predicted)
print metrics.classification_report(y_test, predicted)
[[239  27]
 [101  31]]
             precision    recall  f1-score   support

        0.0       0.70      0.90      0.79       266
        1.0       0.53      0.23      0.33       132

avg / total       0.65      0.68      0.64       398

Model Evaluation Using Cross-Validation

Now let’s try 10-fold cross-validation, to see if the accuracy holds up more rigorously.

In [28]:
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), Z, y, scoring='accuracy', cv=10)
print scores
print scores.mean()
[ 0.53383459  0.7518797   0.69924812  0.68421053  0.64661654  0.80451128
  0.73484848  0.66666667  0.61363636  0.71755725]
0.685300951894

Looks good. It’s still performing at 69% accuracy.

Predicting the Probability of Low Birth Weight Child

Just for fun, let’s predict the probability of a low birth weight child for a random woman not present in the dataset. She’s a 35-year-old Other race, has had 2 births (has 2 children), is a smoker, and her weight is 132.

In [29]:
model.predict_proba(np.array([0, 0, 1, 1, 3, 2, 1]))
Out[29]:
array([[ 0.60709037,  0.39290963]])

The predicted probability of low birth weight child is 39.3%

Next Steps

There are many different steps that could be tried in order to improve the model:

• including interaction terms

• removing features

• regularization techniques

• using a non-linear model

References

  1. IPython Home Page, http://ipython.org/

Profile_PicAuthored by:
Jeffrey Strickland, Ph.D.

Jeffrey Strickland, Ph.D., is the Author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the Financial and Insurance Industries for over 20 years. Jeff is a Certified Modeling and Simulation professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is also a frequently invited guest speaker and the author of 20 books including:

  • Operations Research using Open-Source Tools
  • Discrete Event simulation using ExtendSim
  • Crime Analysis and Mapping
  • Missile Flight Simulation
  • Mathematical Modeling of Warfare and Combat Phenomenon
  • Predictive Modeling and Analytics
  • Using Math to Defeat the Enemy
  • Verification and Validation for Modeling and Simulation
  • Simulation Conceptual Modeling
  • System Engineering Process and Practices
ADVERTISEMENT

Connect with Jeffrey Strickland
Contact Jeffrey Strickland

2 replies »

  1. OOOPS! THIS IS WHAT I MEANT! 🙂
    Dr.Strickland: This is the kind of post that needs to be read and re-read. TYVM. If you ADD the re-blog button, I would love to re-blog this AND FUTURE POSTS ON PYTHON and the like, if you don’t mind.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s