{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "2789b6b6", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython seaborn" ] }, { "cell_type": "markdown", "id": "9c151ff7", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "f13f3d0d", "metadata": {}, "source": [ "# Logistic regression\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-regression/logistic-linear.png\n", "---\n", "name: 'Logistic vs. linear regression infographic'\n", "width: 100%\n", "---\n", "Logistic regression to predict categories. Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)\n", ":::" ] }, { "cell_type": "markdown", "id": "6763f160", "metadata": {}, "source": [ "
\n", "\n", "A demo of logistic regression. [source]\n", "
" ] }, { "cell_type": "markdown", "id": "47b25fce", "metadata": {}, "source": [ "## Introduction\n", "\n", "In this final section on Regression, one of the basic _classic_ Machine Learning techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n", "\n", "In this section, you will learn:\n", "\n", "- A new library for data visualization\n", "- Techniques for logistic regression\n", "\n", ":::{seealso}\n", "Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-77952-leestott)\n", ":::" ] }, { "cell_type": "markdown", "id": "ef6f3c62", "metadata": {}, "source": [ "## Prerequisite\n", "\n", "Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: `Color`.\n", "\n", "Let's build a logistic regression model to predict that, given some variables, _what color a given pumpkin is likely to be_ (orange 🎃 or white 👻).\n", "\n", ":::{note}\n", "Why are we talking about binary classification in a section grouping about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. Learn about other ways to classify data in the next lesson group.\n", ":::" ] }, { "cell_type": "markdown", "id": "cb63989b", "metadata": {}, "source": [ "## Define the question\n", "\n", "For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. 
It disappears once we remove null values from the dataset, anyway.\n", "\n", ":::{seealso}\n", "Fun fact: we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones, but they are cool looking!\n", ":::" ] }, { "cell_type": "markdown", "id": "97e9cb2d", "metadata": {}, "source": [ "## About logistic regression\n", "\n", "Logistic regression differs from linear regression, which you learned about previously, in a few important ways." ] }, { "cell_type": "markdown", "id": "3c1e1c2a", "metadata": {}, "source": [ "### Binary classification\n", "\n", "Logistic regression does not offer the same features as linear regression. The former predicts a binary category (\"orange or not orange\"), whereas the latter predicts continuous values: for example, given the origin of a pumpkin and the time of harvest, _how much its price will rise_.\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-regression/pumpkin-classifier.png\n", "---\n", "name: 'Pumpkin classification Model'\n", "width: 100%\n", "---\n", "Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)\n", ":::" ] }, { "cell_type": "markdown", "id": "3d9f3190", "metadata": {}, "source": [ "### Other classifications\n", "\n", "There are other types of logistic regression, including multinomial and ordinal:\n", "\n", "- **Multinomial**, which involves more than two categories - \"Orange, White, and Striped\".\n", "- **Ordinal**, which involves ordered categories, useful if we want to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-regression/multinomial-ordinal.png\n", "---\n", "name: 'Multinomial vs ordinal regression'\n", "width: 100%\n", "---\n", "Infographic by [Dasani 
Madipalli](https://twitter.com/dasani_decoded)\n", ":::" ] }, { "cell_type": "markdown", "id": "bc7a6521", "metadata": {}, "source": [ "### It's still linear\n", "\n", "Even though this type of Regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not." ] }, { "cell_type": "markdown", "id": "12889962", "metadata": {}, "source": [ "### Variables DO NOT have to correlate\n", "\n", "Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to align. That works for this data which has somewhat weak correlations." ] }, { "cell_type": "markdown", "id": "5d8e9e71", "metadata": {}, "source": [ "### You need a lot of clean data\n", "\n", "Logistic regression will give more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.\n", "\n", ":::{note}\n", "Think about the types of data that would lend themselves well to logistic regression.\n", ":::" ] }, { "cell_type": "markdown", "id": "ec33d95b", "metadata": {}, "source": [ "## Exercise - tidy the data\n", "\n", "First, clean the data a bit, dropping null values and selecting only some of the columns:\n", "\n", "1. Add the following code:" ] }, { "cell_type": "code", "execution_count": 2, "id": "0606f3cb", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" }, "tags": [ "output-scoll" ] }, "outputs": [ { "data": { "text/html": [ "\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n", " <th>City Name</th>\n", " <th>Type</th>\n", " <th>Package</th>\n", " <th>Variety</th>\n", " <th>Sub Variety</th>\n", " <th>Grade</th>\n", " <th>Date</th>\n", " <th>Low Price</th>\n", " <th>High Price</th>\n", " <th>Mostly Low</th>\n", " <th>...</th>\n", " <th>Unit of Sale</th>\n", " <th>Quality</th>\n", " <th>Condition</th>\n", " <th>Appearance</th>\n", " <th>Storage</th>\n", " <th>Crop</th>\n", " <th>Repack</th>\n", " <th>Trans Mode</th>\n", " <th>Unnamed: 24</th>\n", " <th>Unnamed: 25</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n", " <th>0</th>\n", " <td>BALTIMORE</td>\n", " <td>NaN</td>\n", " <td>24 inch bins</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>4/29/17</td>\n", " <td>270.0</td>\n", " <td>280.0</td>\n", " <td>270.0</td>\n", " <td>...</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>E</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n",
" <tr>\n", " <th>1</th>\n", " <td>BALTIMORE</td>\n", " <td>NaN</td>\n", " <td>24 inch bins</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>5/6/17</td>\n", " <td>270.0</td>\n", " <td>280.0</td>\n", " <td>270.0</td>\n", " <td>...</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>E</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n",
" <tr>\n", " <th>2</th>\n", " <td>BALTIMORE</td>\n", " <td>NaN</td>\n", " <td>24 inch bins</td>\n", " <td>HOWDEN TYPE</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>9/24/16</td>\n", " <td>160.0</td>\n", " <td>160.0</td>\n", " <td>160.0</td>\n", " <td>...</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n",
" <tr>\n", " <th>3</th>\n", " <td>BALTIMORE</td>\n", " <td>NaN</td>\n", " <td>24 inch bins</td>\n", " <td>HOWDEN TYPE</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>9/24/16</td>\n", " <td>160.0</td>\n", " <td>160.0</td>\n", " <td>160.0</td>\n", " <td>...</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n",
" <tr>\n", " <th>4</th>\n", " <td>BALTIMORE</td>\n", " <td>NaN</td>\n", " <td>24 inch bins</td>\n", " <td>HOWDEN TYPE</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>11/5/16</td>\n", " <td>90.0</td>\n", " <td>100.0</td>\n", " <td>90.0</td>\n", " <td>...</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 26 columns</p>\n", "