{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "5bf19128-f39c-43ce-8503-9eff8df18434", "metadata": { "slideshow": { "slide_type": "" }, "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython\n" ] }, { "cell_type": "markdown", "id": "cb578cb6-9845-4eb3-94ab-51c5e677031a", "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "663dea79", "metadata": { "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Linear and polynomial regression\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-regression/linear-polynomial.png\n", "---\n", "name: 'Linear vs polynomial regression infographic'\n", "width: 100%\n", "---\n", "Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)\n", ":::" ] }, { "cell_type": "markdown", "id": "ed080429-60c9-4f56-a3b5-7f3c533ec961", "metadata": { "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "```{seealso}\n", "Run the notebook accompanying this lesson and look at the Month to Price scatterplot. Does the data associating Month to Price for pumpkin sales seem to have high or low correlation, according to our visual interpretation of the scatterplot? Does that change if we use a more fine-grained measure instead of `Month`, e.g. *day of the year* (i.e.
number of days since the beginning of the year)?\n", "```" ] }, { "cell_type": "markdown", "id": "d0525b63-aae3-492a-b251-8131622648d0", "metadata": { "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Build a regression model using Scikit-learn: regression four ways" ] }, { "cell_type": "code", "execution_count": 2, "id": "f1dfeda2", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n", "A demo of linear-regression. [source]\n", "
\n" ], "text/plain": [ "\n", "\n", "A demo of linear-regression. [source]\n", "
\n", "\"\"\"\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 2, "id": "28b44e74", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " A demo of gradient_react_3D. [source]\n", "
\n", "\n", " \n", " A demo of gradient_react_3D. [source]\n", "
\n", "\n", " \n", " A demo of gradient-descent-visualiser. [source]\n", "
\n", "\n", " \n", " A demo of gradient-descent-visualiser. [source]\n", "
|  | City Name | Type | Package | Variety | Sub Variety | Grade | Date | Low Price | High Price | Mostly Low | ... | Unit of Sale | Quality | Condition | Appearance | Storage | Crop | Repack | Trans Mode | Unnamed: 24 | Unnamed: 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BALTIMORE | NaN | 24 inch bins | NaN | NaN | NaN | 4/29/17 | 270.0 | 280.0 | 270.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | E | NaN | NaN | NaN |
| 1 | BALTIMORE | NaN | 24 inch bins | NaN | NaN | NaN | 5/6/17 | 270.0 | 280.0 | 270.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | E | NaN | NaN | NaN |
| 2 | BALTIMORE | NaN | 24 inch bins | HOWDEN TYPE | NaN | NaN | 9/24/16 | 160.0 | 160.0 | 160.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | N | NaN | NaN | NaN |
| 3 | BALTIMORE | NaN | 24 inch bins | HOWDEN TYPE | NaN | NaN | 9/24/16 | 160.0 | 160.0 | 160.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | N | NaN | NaN | NaN |
| 4 | BALTIMORE | NaN | 24 inch bins | HOWDEN TYPE | NaN | NaN | 11/5/16 | 90.0 | 100.0 | 90.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | N | NaN | NaN | NaN |

5 rows × 26 columns
|  | Month | DayOfYear | Variety | City | Package | Low Price | High Price | Price |
|---|---|---|---|---|---|---|---|---|
| 70 | 9 | 267 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 15.0 | 15.0 | 13.636364 |
| 71 | 9 | 267 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 18.0 | 18.0 | 16.363636 |
| 72 | 10 | 274 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 18.0 | 18.0 | 16.363636 |
| 73 | 10 | 274 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 17.0 | 17.0 | 15.454545 |
| 74 | 10 | 281 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 15.0 | 15.0 | 13.636364 |
LinearRegression()
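As a minimal sketch of what fitting such a model can look like, the snippet below trains `LinearRegression` on the five sample rows shown in the table above, with `DayOfYear` as the input and `Price` as the target. The variable names and the tiny sample are assumptions made for illustration, not the lesson's actual training code, and five rows are far too few to draw conclusions from.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Five sample rows copied from the table above (illustration only)
day_of_year = np.array([267, 267, 274, 274, 281], dtype=float).reshape(-1, 1)
price = np.array([13.636364, 16.363636, 16.363636, 15.454545, 13.636364])

# Fit price as a straight-line function of the day of the year
lin_reg = LinearRegression()
lin_reg.fit(day_of_year, price)

slope = lin_reg.coef_[0]        # price change per day
intercept = lin_reg.intercept_  # predicted price at day 0
predicted = lin_reg.predict(np.array([[270.0]]))
print(f'slope={slope:.4f}, intercept={intercept:.2f}, day 270 -> {predicted[0]:.2f}')
```

scikit-learn expects a two-dimensional input array, which is why the single feature is reshaped into a column with `reshape(-1, 1)`.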
\n", "\n", "A demo of logistic-regression. [source]\n", "
" ] }, { "cell_type": "markdown", "id": "c2831e2f", "metadata": { "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Another type of Linear Regression is Polynomial Regression. While sometimes there's a linear relationship between variables - the bigger the pumpkin in volume, the higher the price - sometimes these relationships can't be plotted as a plane or straight line.\n", "\n", ":::{seealso}\n", "Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could use Polynomial Regression\n", ":::\n", "\n", "Take another look at the relationship between Date and Price. Does this scatterplot seem like it should necessarily be analyzed by a straight line? Can't prices fluctuate? In this case, we can try polynomial regression.\n", "\n", ":::{note}\n", "Polynomials are mathematical expressions that might consist of one or more variables and coefficients.\n", ":::\n", "\n", "Polynomial regression creates a curved line to better fit nonlinear data. In our case, if we include a squared `DayOfYear` variable in input data, we should be able to fit our data with a parabolic curve, which will have a minimum at a certain point within the year.\n", "\n", "Scikit-learn includes a helpful [pipeline API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=pipeline#sklearn.pipeline.make_pipeline) to combine different steps of data processing together. A **pipeline** is a chain of **estimators**. 
In our case, we will create a pipeline that first adds polynomial features to our input data, and then trains the regression:" ] }, { "cell_type": "code", "execution_count": 18, "id": "bf0c99b8", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" }, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),\n", " ('linearregression', LinearRegression())])
PolynomialFeatures()
LinearRegression()
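The pipeline shown in the output above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the lesson's own cell: the data is a synthetic, noise-free parabola standing in for `DayOfYear` and `Price`, and the variable names are hypothetical. Only `make_pipeline`, `PolynomialFeatures` and `LinearRegression` come from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for (DayOfYear, Price): an exact parabola with its minimum at day 290
day_of_year = np.arange(240, 340, 5, dtype=float).reshape(-1, 1)
price = 0.01 * (day_of_year.ravel() - 290.0) ** 2 + 15.0

# Step 1 expands x into [1, x, x^2]; step 2 fits an ordinary linear regression on those columns
pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
pipeline.fit(day_of_year, price)

r2 = pipeline.score(day_of_year, price)         # coefficient of determination on the training data
step_names = [name for name, _ in pipeline.steps]
predictions = pipeline.predict(day_of_year)
cheapest_day = day_of_year[np.argmin(predictions), 0]
print(step_names, round(r2, 4), cheapest_day)
```

`make_pipeline` names each step after its lowercased class name, which matches the `polynomialfeatures` and `linearregression` labels in the repr above. The model stays linear in its coefficients even though the fitted curve is a parabola.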
|  | FAIRYTALE | MINIATURE | MIXED HEIRLOOM VARIETIES | PIE TYPE |
|---|---|---|---|---|
| 70 | 0 | 0 | 0 | 1 |
| 71 | 0 | 0 | 0 | 1 |
| 72 | 0 | 0 | 0 | 1 |
| 73 | 0 | 0 | 0 | 1 |
| 74 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... |
| 1738 | 0 | 1 | 0 | 0 |
| 1739 | 0 | 1 | 0 | 0 |
| 1740 | 0 | 1 | 0 | 0 |
| 1741 | 0 | 1 | 0 | 0 |
| 1742 | 0 | 1 | 0 | 0 |

415 rows × 4 columns
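The indicator table above has the shape of a one-hot encoding of the `Variety` column. A small sketch of how such a table can be produced, assuming `pandas.get_dummies` is the tool used (the source does not show the call) and using a hypothetical five-row sample:

```python
import pandas as pd

# Hypothetical sample of the Variety column (illustration only)
varieties = pd.Series(
    ['PIE TYPE', 'PIE TYPE', 'MINIATURE', 'FAIRYTALE', 'MIXED HEIRLOOM VARIETIES'],
    name='Variety',
)

# One indicator column per distinct variety; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(varieties, dtype=int)
print(dummies)
```

Each row contains exactly one 1, in the column of its variety, so the encoded columns can be passed to a regression model as numeric features.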
\n", "