# Install the necessary dependencies

import os
import sys
!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython seaborn

11.4. Logistic regression

https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-regression/logistic-linear.png

Fig. 11.12 Logistic regression to predict categories. Infographic by Dasani Madipalli

A demo of logistic-regression. [source]

11.4.1. Introduction

In this final section on Regression, one of the basic classic Machine Learning techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?

In this section, you will learn:

  • A new library for data visualization

  • Techniques for logistic regression

See also

Deepen your understanding of working with this type of regression in this Learn module

11.4.2. Prerequisite

Having worked with the pumpkin data, we are now familiar enough with it to realize that there’s one binary category that we can work with: Color.

Let’s build a logistic regression model to predict, given some variables, what color a given pumpkin is likely to be (orange 🎃 or white 👻).

Note

Why are we talking about binary classification in a section grouping about regression? Only for linguistic convenience, as logistic regression is really a classification method, albeit a linear-based one. Learn about other ways to classify data in the next lesson group.

11.4.3. Define the question

For our purposes, we will express this as a binary: ‘Orange’ or ‘Not Orange’. There is also a ‘striped’ category in our dataset, but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset anyway.

See also

Fun fact: we sometimes call white pumpkins ‘ghost’ pumpkins. They aren’t very easy to carve, so they aren’t as popular as the orange ones, but they are cool looking!

11.4.4. About logistic regression

Logistic regression differs from linear regression, which you learned about previously, in a few important ways.

11.4.4.1. Binary classification

Logistic regression does not offer the same features as linear regression. The former offers a prediction about a binary category (“orange or not orange”), whereas the latter is capable of predicting continuous values, for example, given the origin of a pumpkin and the time of harvest, how much its price will rise.

https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-regression/pumpkin-classifier.png

Fig. 11.13 Infographic by Dasani Madipalli

11.4.4.2. Other classifications

There are other types of logistic regression, including multinomial and ordinal:

  • Multinomial, which involves having more than two categories - “Orange, White, and Striped”.

  • Ordinal, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).

https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-regression/multinomial-ordinal.png

Fig. 11.14 Infographic by Dasani Madipalli

11.4.4.3. It’s still linear

Even though this type of regression is all about ‘category predictions’, it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It’s good to get an idea of whether there is any linearity dividing these variables or not.

11.4.4.4. Variables DO NOT have to correlate

Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don’t have to align. That works for this data, which has somewhat weak correlations.

11.4.4.5. You need a lot of clean data

Logistic regression will give more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.

Note

Think about the types of data that would lend themselves well to logistic regression.

11.4.5. Exercise - tidy the data

First, clean the data a bit, dropping null values and selecting only some of the columns:

  1. Add the following code:

import pandas as pd
import numpy as np

pumpkins = pd.read_csv('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/us-pumpkins.csv')
pumpkins.head()
   City Name  Type  Package       Variety      Sub Variety  Grade  Date     Low Price  High Price  Mostly Low  ...  Unit of Sale  Quality  Condition  Appearance  Storage  Crop  Repack  Trans Mode  Unnamed: 24  Unnamed: 25
0  BALTIMORE  NaN   24 inch bins  NaN          NaN          NaN    4/29/17  270.0      280.0       270.0       ...  NaN           NaN      NaN        NaN         NaN      NaN   E       NaN         NaN          NaN
1  BALTIMORE  NaN   24 inch bins  NaN          NaN          NaN    5/6/17   270.0      280.0       270.0       ...  NaN           NaN      NaN        NaN         NaN      NaN   E       NaN         NaN          NaN
2  BALTIMORE  NaN   24 inch bins  HOWDEN TYPE  NaN          NaN    9/24/16  160.0      160.0       160.0       ...  NaN           NaN      NaN        NaN         NaN      NaN   N       NaN         NaN          NaN
3  BALTIMORE  NaN   24 inch bins  HOWDEN TYPE  NaN          NaN    9/24/16  160.0      160.0       160.0       ...  NaN           NaN      NaN        NaN         NaN      NaN   N       NaN         NaN          NaN
4  BALTIMORE  NaN   24 inch bins  HOWDEN TYPE  NaN          NaN    11/5/16  90.0       100.0       90.0        ...  NaN           NaN      NaN        NaN         NaN      NaN   N       NaN         NaN          NaN

5 rows × 26 columns

from sklearn.preprocessing import LabelEncoder

new_columns = ['Color','Origin','Item Size','Variety','City Name','Package']

new_pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)

new_pumpkins.dropna(inplace=True)

new_pumpkins = new_pumpkins.apply(LabelEncoder().fit_transform)

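You can always take a peek at your new dataframe. Note that info is referenced here without parentheses, so the notebook displays the bound method’s repr (which happens to include the dataframe’s contents); call new_pumpkins.info() to get the column summary instead: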

new_pumpkins.info
<bound method DataFrame.info of       City Name  Package  Variety  Origin  Item Size  Color
2             1        3        4       3          3      0
3             1        3        4      17          3      0
4             1        3        4       5          2      0
5             1        3        4       5          2      0
6             1        4        4       5          3      0
...         ...      ...      ...     ...        ...    ...
1694         12        3        5       4          6      1
1695         12        3        5       4          6      1
1696         12        3        5       4          6      1
1697         12        3        5       4          6      1
1698         12        3        5       4          6      1

[991 rows x 6 columns]>
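To read the encoded Color column, it helps to know which integer stands for which color. LabelEncoder assigns integer codes following the sorted order of the unique values it sees. The sketch below imitates that rule in plain Python; `encode_labels` is a hypothetical helper (not part of Scikit-learn), and the color strings are assumed stand-ins rather than values read from the CSV.

```python
def encode_labels(values):
    """Imitate LabelEncoder: map each unique value to its index in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Assumed stand-in values for the Color column (not read from the dataset).
colors = ["WHITE", "ORANGE", "ORANGE", "WHITE"]
codes, mapping = encode_labels(colors)

print(mapping)  # which color received which integer code
print(codes)    # the encoded column
```

With the real encoder, the fitted `classes_` attribute lists the original values in code order. Because the lesson applies `LabelEncoder().fit_transform` column by column without keeping the encoder objects, you would need to fit a separate encoder on Color if you want to inspect that mapping.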

11.4.5.1. Visualization - side-by-side grid

By now you have loaded up the starter notebook with pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including Color. Let’s visualize the dataframe in the notebook using a different library: Seaborn, which is built on Matplotlib, which we used earlier.

Seaborn offers some neat ways to visualize your data. For example, you can compare distributions of the data for each point in a side-by-side grid.

  1. Create such a grid by instantiating a PairGrid, using our pumpkin data new_pumpkins, followed by calling map():

import seaborn as sns

g = sns.PairGrid(new_pumpkins)
g.map(sns.scatterplot)
<seaborn.axisgrid.PairGrid at 0x7f81b4dc80d0>
../../_images/logistic-regression_19_1.png

By observing data side-by-side, you can see how the Color data relates to the other columns.

See also

Given this scatterplot grid, what are some interesting explorations you can envision?

11.4.5.2. Use a swarm plot

Since Color is a binary category (Orange or Not), it’s called ‘categorical data’ and needs ‘a more specialized approach to visualization’. There are other ways to visualize the relationship of this category with other variables.

You can visualize variables side-by-side with Seaborn plots.

  1. Try a ‘swarm’ plot to show the distribution of values:

sns.swarmplot(x="Color", y="Item Size", data=new_pumpkins)
/usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/seaborn/categorical.py:3544: UserWarning: 63.4% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/seaborn/categorical.py:3544: UserWarning: 21.8% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
<AxesSubplot: xlabel='Color', ylabel='Item Size'>
/usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/seaborn/categorical.py:3544: UserWarning: 79.2% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/seaborn/categorical.py:3544: UserWarning: 35.9% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
../../_images/logistic-regression_22_3.png

11.4.5.3. Violin plot

A ‘violin’ type plot is useful as you can easily visualize the way that data in the two categories is distributed. Violin plots don’t work so well with smaller datasets as the distribution is displayed more ‘smoothly’.

  1. Pass x="Color", y="Item Size", and kind="violin" as parameters and call catplot():

sns.catplot(x="Color", y="Item Size",
            kind="violin", data=new_pumpkins)
<seaborn.axisgrid.FacetGrid at 0x7f816aee8e20>
../../_images/logistic-regression_24_1.png

See also

Try creating this plot, and other Seaborn plots, using other variables.

Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let’s explore logistic regression to determine a given pumpkin’s likely color.

See also

Show Me The Math

Remember how linear regression often used ordinary least squares to arrive at a value? Logistic regression relies on the concept of ‘maximum likelihood’ using sigmoid functions. A ‘Sigmoid Function’ on a plot looks like an ‘S’ shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a ‘logistic curve’. Its formula looks like this:

https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-regression/sigmoid.png

Fig. 11.15 Logistic function[Loo]

In the formula \(f(x) = \frac{L}{1 + e^{-k(x - x_0)}}\), the sigmoid’s midpoint is at \(x_0\), \(L\) is the curve’s maximum value, and \(k\) is the curve’s steepness. If the outcome of the function is more than 0.5, the label in question will be given the class ‘1’ of the binary choice. If not, it will be classified as ‘0’.
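As a minimal sketch of the logistic function and the 0.5 decision rule described above, the snippet below evaluates the curve at a few points. The function names `logistic` and `classify` and the default parameter values (maximum 1, steepness 1, midpoint 0) are assumptions chosen for illustration; they are not what Scikit-learn fits internally for the pumpkin data.

```python
import math


def logistic(x, L=1.0, k=1.0, x0=0.0):
    """Logistic curve with maximum L, steepness k and midpoint x0."""
    return L / (1.0 + math.exp(-k * (x - x0)))


def classify(x, threshold=0.5):
    """Assign class 1 when the curve's output exceeds the threshold, else class 0."""
    return 1 if logistic(x) > threshold else 0


for x in (-4, 0, 4):
    print(x, round(logistic(x), 3), classify(x))
```

At the midpoint the output is exactly 0.5, which is not more than the threshold, so that input falls into class 0 under the rule as stated.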


11.4.6. Build your model

Building a model to find this binary classification is surprisingly straightforward in Scikit-learn.

  1. Select the variables you want to use in your classification model and split the training and test sets by calling train_test_split():

from sklearn.model_selection import train_test_split

Selected_features = ['Origin','Item Size','Variety','City Name','Package']

X = new_pumpkins[Selected_features]
y = new_pumpkins['Color']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

  2. Now you can train your model by calling fit() with your training data, and print out its result:

from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))
print('Predicted labels: ', predictions)
print('Accuracy: ', accuracy_score(y_test, predictions))
              precision    recall  f1-score   support

           0       0.83      0.98      0.90       166
           1       0.00      0.00      0.00        33

    accuracy                           0.81       199
   macro avg       0.42      0.49      0.45       199
weighted avg       0.69      0.81      0.75       199

Predicted labels:  [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
Accuracy:  0.8140703517587939

Note

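Take a look at your model’s scoreboard. The overall accuracy is about 81% with only about 1000 rows of data, but precision and recall for class 1 are both 0.00, so that accuracy comes almost entirely from the majority class.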
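One way to put the accuracy figure in context is to compare it with a baseline that always predicts the most common class. The sketch below does that arithmetic using only the support counts and accuracy printed in the classification report above; the variable names are illustrative.

```python
# Support counts and accuracy copied from the classification report above.
support = {0: 166, 1: 33}
model_accuracy = 0.8140703517587939

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a "model" that predicts the majority class for every row.
baseline_accuracy = support[majority_class] / total

print(f"majority class:    {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"model accuracy:    {model_accuracy:.4f}")
```

When the classes are this imbalanced, a comparison like this suggests that accuracy alone can look healthy even when the minority class is rarely predicted correctly.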

11.4.7. Better comprehension via a confusion matrix

While you can get a scoreboard report by printing out the items above, you might be able to understand your model more easily by using a confusion matrix, which helps show how the model is performing.

Note

🎓 A ‘confusion matrix’ (or ‘error matrix’) is a table that expresses your model’s true vs. false positives and negatives, thus gauging the accuracy of predictions.

  1. To use a confusion matrix, call confusion_matrix():

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
array([[162,   4],
       [ 33,   0]])

In Scikit-learn, the rows (axis 0) of a confusion matrix are the actual labels and the columns (axis 1) are the predicted labels.

What’s going on here? Let’s say our model is asked to classify pumpkins between two binary categories, category ‘orange’ and category ‘not-orange’, with ‘orange’ as the positive class (label 1) and ‘not-orange’ as the negative class (label 0). Which color actually carries label 1 depends on how the encoder assigned the codes, so check that before interpreting the numbers.

  • If your model predicts a pumpkin as not orange and it belongs to category ‘not-orange’ in reality, we call it a true negative, shown by the top left number.

  • If your model predicts a pumpkin as not orange and it belongs to category ‘orange’ in reality, we call it a false negative, shown by the bottom left number.

  • If your model predicts a pumpkin as orange and it belongs to category ‘not-orange’ in reality, we call it a false positive, shown by the top right number.

  • If your model predicts a pumpkin as orange and it belongs to category ‘orange’ in reality, we call it a true positive, shown by the bottom right number.

As you might have guessed, it’s preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.
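The four cells can be pulled out of the matrix by position. The sketch below hard-codes the matrix printed earlier and assumes Scikit-learn's layout for binary labels 0 and 1 (rows are actual labels, columns are predicted labels, class 1 is the positive class); it then recomputes the accuracy from the cells.

```python
# Confusion matrix copied from the output above: rows = actual, columns = predicted.
matrix = [[162, 4],
          [33, 0]]

# Row 0 holds actual class 0 (TN, FP); row 1 holds actual class 1 (FN, TP).
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.4f}")
```

The recomputed accuracy matches the value printed by accuracy_score, which is a quick check that the cells were read in the right order.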

How does the confusion matrix relate to precision and recall? Remember, the classification report printed above showed precision (0.83) and recall (0.98) for class 0. Those figures treat class 0 as the class of interest, so here tp is the top left number (162), fp is the bottom left number (33), and fn is the top right number (4):

\[ \begin{aligned} \text{Precision} &= \frac{tp}{tp + fp} = \frac{162}{162 + 33} = 0.8307692307692308 \\ \text{Recall} &= \frac{tp}{tp + fn} = \frac{162}{162 + 4} = 0.9759036144578314 \end{aligned} \]

Note

Q: According to the confusion matrix, how did the model do? A: It is a mixed result; there are a good number of true negatives (162), but also 33 false negatives and no true positives.

Let’s revisit the terms we saw earlier with the help of the confusion matrix’s mapping of TP/TN and FP/FN:

🎓 Precision: \(\frac{TP}{TP + FP}\) The fraction of relevant instances among the retrieved instances (e.g. which labels were well-labeled)

🎓 Recall: \(\frac{TP}{TP + FN}\) The fraction of relevant instances that were retrieved, whether well-labeled or not

🎓 f1-score: \(\frac{2 * precision * recall}{precision + recall}\) The harmonic mean of the precision and recall, with best being 1 and worst being 0

🎓 Support: The number of occurrences of each label retrieved

🎓 Accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\) The percentage of labels predicted accurately for a sample.

🎓 Macro Avg: The calculation of the unweighted mean metrics for each label, not taking label imbalance into account.

🎓 Weighted Avg: The calculation of the mean metrics for each label, taking label imbalance into account by weighting them by their support (the number of true instances for each label).
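To see how these definitions produce the numbers in the classification report, the sketch below recomputes them from the confusion matrix alone, treating each label in turn as the class of interest. The helper `safe_div` and the `report` dictionary are illustrative names, and returning 0 for a zero denominator is an assumption made here to keep the arithmetic defined.

```python
# Confusion matrix copied from the output above: rows = actual, columns = predicted.
matrix = [[162, 4],
          [33, 0]]


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


report = {}
for label in (0, 1):
    tp = matrix[label][label]                               # actual == predicted == label
    fp = sum(matrix[row][label] for row in (0, 1)) - tp     # predicted label, actually other
    fn = sum(matrix[label]) - tp                            # actually label, predicted other
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    report[label] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn}

total = sum(r["support"] for r in report.values())
macro_f1 = sum(r["f1"] for r in report.values()) / len(report)
weighted_f1 = sum(r["f1"] * r["support"] for r in report.values()) / total

for label, r in report.items():
    print(label, {name: round(value, 2) for name, value in r.items()})
print("macro avg f1:", round(macro_f1, 2))
print("weighted avg f1:", round(weighted_f1, 2))
```

Rounded to two decimals, these values line up with the rows of the report printed earlier, including the zeros for class 1.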

See also

Can you think which metric you should watch if you want your model to reduce the number of false negatives?

11.4.8. Visualize the ROC curve of this model

This model’s accuracy is in the 80% range, so in principle you could use it to predict the color of a pumpkin given a set of variables, although the confusion matrix shows that it rarely identifies class 1 correctly.

Let’s do one more visualization to see the so-called ‘ROC’ curve:

from sklearn.metrics import roc_curve, roc_auc_score

y_scores = model.predict_proba(X_test)
# calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:, 1])
sns.lineplot(x=[0, 1], y=[0, 1])
sns.lineplot(x=fpr, y=tpr)
<AxesSubplot: >
../../_images/logistic-regression_39_1.png

The code above uses Seaborn again to plot the model’s Receiver Operating Characteristic, or ROC. ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. “ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis.” Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly.
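To make the axes concrete, here is a rough sketch of how the points of a ROC curve can be computed by hand: for each threshold, every row whose score is at or above it is predicted positive, and the false and true positive rates are recorded. The labels and scores are made up for illustration, and `roc_points` is a hypothetical helper, not Scikit-learn's roc_curve (which may, for instance, drop intermediate points).

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, highest threshold first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels and class-1 probabilities, not taken from the pumpkin model.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"fpr={fpr:.2f} tpr={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so the curve always runs from the bottom left corner to the top right corner; how far it bows above the diagonal is what distinguishes classifiers.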

Finally, use Scikit-learn’s roc_auc_score API to compute the actual ‘Area Under the Curve’ (AUC):

auc = roc_auc_score(y_test,y_scores[:, 1])
print(auc)
0.6976998904709748

The result is 0.6976998904709748. AUC ranges from 0 to 1, where a model that is 100% correct in its predictions has an AUC of 1 and random guessing scores about 0.5, so you want a big score; at roughly 0.70, this model is better than chance but leaves clear room for improvement.
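One way to build intuition for the AUC is to read it as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up labels and scores; `auc_by_pairs` is a hypothetical helper written for illustration, not the algorithm roc_auc_score uses internally.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and class-1 probabilities, not taken from the pumpkin model.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

print(round(auc_by_pairs(y_true, scores), 4))
```

Under this reading, a score of 1 means every positive outranks every negative, while identical scores for all rows give 0.5.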

In future lessons on classifications, you will learn how to iterate to improve your model’s scores. But for now, congratulations! You’ve completed these regression sections!

11.4.9. Your turn! 🚀

There’s a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? Tip: try Kaggle for interesting datasets.

Assignment - Retrying some regression

11.4.10. Self study

Read the first few pages of this paper from Stanford on some practical uses for logistic regression. Think about tasks that are better suited for one or the other type of regression tasks that we have studied up to this point. What would work best?

11.4.11. Acknowledgments

Thanks to Microsoft for creating the open-source course ML-For-Beginners. It inspires the majority of the content in this chapter.