# Introduction

__Jupyter__. IPython itself is focused on interactive Python, part of which is providing a Python kernel for Jupyter [1].

IPython provides a rich architecture for interactive computing with [1]:

- A powerful interactive shell.
- A kernel for
__Jupyter__. - Support for interactive data visualization and use of
__GUI toolkits__. - Flexible,
__embeddable__interpreters to load into your own projects. - Easy to use, high performance tools for
__parallel computing__.

Beyond this point in the article, all of the headings, body, code, and output are copy and pasted from an iPython notebook, saved in html format. Moreover, all of the code was execute from the iPython notebook

# Logistic Regression

This is an example of performing logistic regression in Python with the Scikit-learn module. In this example, we perform many useful python functions beyond what we need for a simple model. This is done partially to explore some more advanced modeling, array manipulation, evaluation, and so on. We will also use several external modules importing data, formatting data, manipulating data, modeling and graphical exploration.

# Dataset

The dataset I chose for this example in Longitudinal Low Birth Weight Study (CLSLOWBWT.DAT). Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly.

# List of Variables

Variable Description Codes/Values Name

1 Identification Code ID Number ID

2 Birth Number 1-4 BIRTH

3 Smoking Status 0 = No, 1 = Yes SMOKE During Pregnancy

4 Race 1 = White, 2 = Black RACE 3 = Other

5 Age of Mother Years AGE

6 Weight of Mother at Pounds LWT Last Menstrual Period

7 Birth Weight Grams BWT

8 Low Birth Weight 1 = BWT <=2500g, LOW 0 = BWT >2500g

# Problem Statement

In this example, we want to predict Low Birth Weight using the remaining dataset variables. Low Birth Weight, the dependent variable, 1 = BWT <=2500g and 0 = BWT >2500g.

# Import Modules

**import** **numpy** **as** **np**
**import** **pandas** **as** **pd**
**import** **statsmodels.api** **as** **sm**
**import** **matplotlib.pyplot** **as** **plt**
**from** **patsy** **import** dmatrices
**from** **sklearn.linear_model** **import** LogisticRegression
**from** **sklearn.cross_validation** **import** train_test_split
**from** **sklearn** **import** metrics
**from** **sklearn.cross_validation** **import** cross_val_score

# Data Pre-Processing

*# load dataset*
dta = pd.read_csv("C:\Users\Strickland\Documents\Python Scripts\CLSLOWBWT.csv")

# Data Exploration

```
dta.groupby('LOW').mean()
```

We can see that on average, women who have children with a low birth weight are more likely to be smokers than nonsmokers. This makes sense based on scientific studies. Let’s take another look at the Birth Number variable.

```
dta.groupby('BIRTH').mean()
```

Low Birth Weight trends upward with more births.

# Data Visualization

*# show plots in the notebook*
%**matplotlib** inline

*# histogram of birth number*
dta.BIRTH.hist()
plt.title('Histogram of Low Birth Weight')
plt.xlabel('Birth Number')
plt.ylabel('Frequency')

*# histogram of age of mother*
dta.AGE.hist()
plt.title('Histogram of Age of Mother')
plt.xlabel('Age')
plt.ylabel('Frequency')

Let’s take a look at the distribution of smokers for those having children with low birth weights versus those who do not.

*# barplot of low birth weights grouped by smoker status (True or False)*
pd.crosstab(dta.SMOKE, dta.LOW.astype(bool)).plot(kind='bar')
plt.title('Somker Distribution by Low Birth Weight')
plt.xlabel('Smoker')
plt.ylabel('Frequency')

Now let’s use a stacked barplot to look at the percentage of women having children with low birth weights by age.

```
low_age = pd.crosstab(dta.AGE, dta.LOW.astype(bool))
low_age.div(low_age.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Low Birth Weight Percentage by Age of Mother')
plt.xlabel('Age of Mother')
plt.ylabel('Percentage')
```

# Prepare Data for Logistic Regression

To prepare the data, we want to add an intercept column as well as dummy variables for Age of Mother and Weight of Mother at Last Menstrual Period, since we are treating them as categorical variables. The dmatrices function from the patsy module can do that using formula language.

The column names for the dummy variables are messy, so let’s rename those.

*# create dataframes with an intercept column and dummy variables for*
y, X = dmatrices('LOW ~ ID + BIRTH + SMOKE + RACE + AGE + LWT',
dta, return_type="dataframe")
**print** X.columns

We now want to convert the numeric (interval) variable AGE to a categorical variable with 4 classes.

We also need to flatten y into a 1-D array, so that Scikit-learn will properly understand it as the response variable.

*# flatten y into a 1-D array*
y = np.ravel(y)

# First Logistic Regression

Let’s go ahead and run logistic regression on the entire data set, and see how accurate it is.

*# instantiate a logistic regression model, and fit with X and y*
model = LogisticRegression()
model = model.fit(X, y)
*# check the accuracy on the training set*
model.score(X, y)

70.7% accuracy seems good, but what’s the null error rate?

*# what percentage had low birth weights?*
y.mean()

Only 31% of the women had low birth rate children, which means that you could obtain 69% accuracy by always predicting “no”. So we’re doing better than the null error rate, but not by much. Let’s examine the coefficients to see what we learn.

*# examine the coefficients*
pd.DataFrame(zip(X.columns, np.transpose(model.coef_)))

# Converting Variables & Merging Dataframes

Now, we want to take the variable AGE and convert it to a categorical variable, to see if we can improve the model. We will do this by creating two data frames from our original data. We will then merge the two data frames, so both need to contain the ID variable. One data from will have the converted categorical variable age_group, and the other will have the dependent variable and the other independent variables.

```
df1 = pd.DataFrame(dta, columns=['ID','AGE'])
df2 = pd.DataFrame(dta, columns=['ID', 'BIRTH', 'SMOKE', 'RACE', 'LWT', 'LOW'])
bins = [15, 25, 35, 45, 55]
group_names = ['15-24', '25-34', '35-44', '45-55']
age_groups = pd.cut(df1['AGE'], bins, labels=group_names)
df1['age_groups'] = pd.cut(df1['AGE'], bins, labels=group_names)
categories
df1.head(5)
```

We now merge the two data frames.

```
left = df2
right = df1
result = pd.merge(left, right, on='ID')
result.head(5)
```

# Second Logistic Regression

We are now ready to build and evaluate our second logistic regression model, using the merged data frames.

```
y, Z = dmatrices('LOW ~ BIRTH + SMOKE + RACE + age_groups + LWT', result, return_type="dataframe")
```**print** Z.columns

Since we change the size of y when we converted AGE and merged, we also need to flatten y again into a 1-D array, so that Scikit-learn will properly understand it as the response variable

*# flatten y into a 1-D array*
y = np.ravel(y)

Before we perform the logistic regression, we want to check the matrix we just formed to ensure it is consistent with our intent.

```
Z.head(5)
```

Finally, we are ready to execute the logistic regression model and see how accurate it is.

*# instantiate a logistic regression model, and fit with X and y*
model1 = LogisticRegression()
model1 = model1.fit(Z, y)
*# check the accuracy on the training set*
model1.score(Z, y)

71.3% accuracy seems good, but what’s the null error rate? We also want to recheck the percentage of low birth weights.

*# what percentage had low birth weights?*
y.mean()

Still, only 31% of the women had low birth rate children, which means that you could obtain 69% accuracy by always predicting “no”. So we’re doing better than the null error rate, but not by much. Let’s examine the coefficients to see what we learn.

*# examine the coefficients*
pd.DataFrame(zip(Z.columns, np.transpose(model.coef_)))

Increases in Birth Number and RACE correspond to a decrease in the likelihood of having a Low Birth Weight child. A decrease in Smoking Status corresponds to a decrease in the likelihood of having a Low Birth Weight child. For Age Group, the lowest likelihood of having Low Birth Weight child corresponds to the baseline age group (15-24), since all of the dummy coefficients are positive.

# Model Evaluation Using a Validation Set

So far, we have trained and tested on the same set. Let’s instead split the data into a training set and a testing set.

*# evaluate the model by splitting into train and test sets*
Z_train, Z_test, y_train, y_test = train_test_split(Z, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(Z_train, y_train)

We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.

*# predict class labels for the test set*
predicted = model2.predict(Z_test)
**print** predicted

*# generate class probabilities*
probs = model2.predict_proba(Z_test)
**print** probs

As you can see, the classifier is predicting a 1 (having a Low Birth Weight child) any time the probability in the second column is greater than 0.5. Now let’s generate some evaluation metrics.

*# generate evaluation metrics*
**print** metrics.accuracy_score(y_test, predicted)
**print** metrics.roc_auc_score(y_test, probs[:, 1])

The accuracy is 67.8%, which is the close to what we experienced when training and predicting on the same data. We can also see the confusion matrix and a classification report with other metrics.

**print** metrics.confusion_matrix(y_test, predicted)
**print** metrics.classification_report(y_test, predicted)

# Model Evaluation Using Cross-Validation

Now let’s try 10-fold cross-validation, to see if the accuracy holds up more rigorously.

*# evaluate the model using 10-fold cross-validation*
scores = cross_val_score(LogisticRegression(), Z, y, scoring='accuracy', cv=10)
**print** scores
**print** scores.mean()

Looks good. It’s still performing at 69% accuracy.

# Predicting the Probability of Low Birth Weight Child

Just for fun, let’s predict the probability of a low birth weight child for a random woman not present in the dataset. She’s a 35-year-old Other race, has had 2 births (has 2 children), is a smoker, and her weight is 132.

```
model.predict_proba(np.array([0, 0, 1, 1, 3, 2, 1]))
```

The predicted probability of low birth weight child is 39.3%

# Next Steps

There are many different steps that could be tried in order to improve the model:

• including interaction terms

• removing features

• regularization techniques

• using a non-linear model

# References

- IPython Home Page, http://ipython.org/

**Authored by:
**

**Jeffrey Strickland, Ph.D.**Jeffrey Strickland, Ph.D., is the Author of * Predictive Analytics Using R* and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the Financial and Insurance Industries for over 20 years. Jeff is a Certified Modeling and Simulation professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is also a frequently invited guest speaker and the author of 20 books including:

__Operations Research using Open-Source Tools__*Discrete Event simulation using ExtendSim**Crime Analysis and Mapping**Missile Flight Simulation**Mathematical Modeling of Warfare and Combat Phenomenon**Predictive Modeling and Analytics**Using Math to Defeat the Enemy**Verification and Validation for Modeling and Simulation**Simulation Conceptual Modeling**System Engineering Process and Practices*

Connect with __Jeffrey Strickland
__Contact

__Jeffrey Strickland__

Categories: Articles, Education & Training, Featured, Jeffrey Strickland

Dr.Strickland: This is the kind of post that needs to be read and re-read. TYVM. If you the re-blog button, I would love to re-blog this if you dont mind.

A

https://wordpress.com/post/99864023/112/

LikeLike

OOOPS! THIS IS WHAT I MEANT! 🙂

Dr.Strickland: This is the kind of post that needs to be read and re-read. TYVM. If you ADD the re-blog button, I would love to re-blog this AND FUTURE POSTS ON PYTHON and the like, if you don’t mind.

LikeLike