Tools of the trade
# Install the necessary dependencies
import os
import sys
!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython
11.1. Tools of the trade#
11.1.1. Get started with Python and Scikit-learn for regression models#
11.1.2. Introduction#
In these four sections, you will discover how to build regression models. We will discuss what these are for shortly. But before you do anything, make sure you have the right tools in place to start the process!
In this section, you will learn how to:
Configure your computer for local machine learning tasks.
Work with Jupyter notebooks.
Use Scikit-learn, including installation.
Explore linear regression with a hands-on exercise.
11.1.3. Installations and configurations#
See also
Video: Setup Python with Visual Studio Code, by Alfredo Deza.
1. Install Python. Ensure that Python is installed on your computer. You will use Python for many data science and machine learning tasks. Most computer systems already include a Python installation. Python Coding Packs are also available to ease the setup for some users.
Different uses of Python, however, can require different versions of the software. For this reason, it's useful to work within a virtual environment.
2. Install Visual Studio Code. Make sure you have Visual Studio Code installed on your computer, following its installation instructions for the basic setup. You are going to use Python in Visual Studio Code in this course, so you might want to brush up on how to configure Visual Studio Code for Python development.
Note
Get comfortable with Python by working through this collection of Learn modules
3. Install Scikit-learn by following its official installation instructions. Since you need to ensure that you use Python 3, it's recommended that you use a virtual environment. Note that if you are installing this library on an M1 Mac, the installation page has special instructions.
4. Install Jupyter Notebook. You will need to install the Jupyter package.
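Once the four installation steps are done, a quick check from Python can confirm that the tools are visible to the interpreter you plan to use. The sketch below is only an illustration: the list of package names is an assumption about what this section needs (they are the names used on PyPI), and a missing package is reported as None rather than raising an error.

```python
import sys
from importlib import metadata

# Assumed PyPI distribution names for the tools used in this section.
REQUIRED = ["scikit-learn", "numpy", "matplotlib", "notebook"]

# Confirm the interpreter is Python 3, as the installation steps require.
is_python3 = sys.version_info.major >= 3
print("Python version:", sys.version.split()[0])

# Look up each package's installed version; None means it was not found.
versions = {}
for name in REQUIRED:
    try:
        versions[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        versions[name] = None

for name, version in versions.items():
    print(f"{name}: {version if version else 'not installed'}")
```

Run this inside the virtual environment you created, so that the versions reported are the ones your notebooks will actually use.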
11.1.5. Up and running with Scikit-learn#
Now that Python is set up in your local environment, and you are comfortable with Jupyter notebooks, let's get equally comfortable with Scikit-learn (pronounce the "sci" as in "science"). Scikit-learn provides an extensive API to help you perform ML tasks.
According to its website, "Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities."
In this course, you will use Scikit-learn and other tools to build machine learning models to perform what we call "traditional machine learning" tasks. We have deliberately avoided neural networks and deep learning, as they are better covered in our forthcoming "AI for Beginners" curriculum.
Scikit-learn makes it straightforward to build models and evaluate them for use. It is primarily focused on using numeric data and contains several ready-made datasets for use as learning tools. It also includes pre-built models for students to try. Let's explore the process of loading prepackaged data and using a built-in estimator to train a first ML model with Scikit-learn on some basic data.
11.1.6. Exercise - your first Scikit-learn notebook#
Note
This tutorial was inspired by the linear regression example on Scikit-learn's web site.
In the regression-tools.ipynb file associated with this section, clear out all the cells by pressing the "trash can" icon.
In this section, you will work with a small dataset about diabetes that is built into Scikit-learn for learning purposes. Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.
See also
There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use linear regression, as you're seeking a numeric value. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a category assignment, so you would use logistic regression. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.
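To make the distinction between a numeric answer and a category answer concrete, here is a minimal sketch using two Scikit-learn estimators. The ages and heights are made-up toy values for illustration only, and the "tall" label (taller than 120 cm) is a hypothetical category invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up toy data: ages in years and heights in centimetres.
ages = np.array([[2], [4], [6], [8], [10], [12]])
heights_cm = np.array([86, 102, 115, 128, 138, 149])

# Linear regression answers "how much?" with a number.
height_model = LinearRegression().fit(ages, heights_cm)
predicted_height = height_model.predict([[7]])[0]
print("Predicted height at age 7:", predicted_height)

# Logistic regression answers "which category?" with a class label.
# Hypothetical category: 1 if taller than 120 cm, otherwise 0.
is_tall = (heights_cm > 120).astype(int)
category_model = LogisticRegression().fit(ages, is_tall)
predicted_label = category_model.predict([[11]])[0]
label_probabilities = category_model.predict_proba([[11]])[0]
print("Predicted category at age 11:", predicted_label)
print("Class probabilities:", label_probabilities)
```

The first model returns a continuous value, while the second returns one of the known labels together with a probability for each label.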
Let's get started on this task.
11.1.6.1. Import libraries#
For this task we will import some libraries:
matplotlib. It's a useful graphing tool and we will use it to create a line plot.
numpy. It's a useful library for handling numeric data in Python.
sklearn. This is the Scikit-learn library.
Add the imports by typing the following code:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, model_selection
Above you are importing matplotlib and numpy, and you are importing datasets, linear_model and model_selection from sklearn. model_selection is used for splitting data into training and test sets.
11.1.6.2. The diabetes dataset#
The built-in diabetes dataset includes 442 samples of data about diabetes, with 10 feature variables, including:
age: age in years
bmi: body mass index
bp: average blood pressure
s1 tc: total serum cholesterol
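The full list of feature names can be read from the dataset object itself. The sketch below loads the dataset without return_X_y, so that the returned object exposes its feature_names attribute, and looks up the column position of bmi. The variable names are chosen for illustration.

```python
from sklearn import datasets

# Load the dataset as a single object that carries its metadata.
diabetes = datasets.load_diabetes()

# The ten feature names, in column order.
feature_names = diabetes.feature_names
print(feature_names)

# Find which column holds the body mass index.
bmi_index = feature_names.index("bmi")
print("bmi is column", bmi_index)
```

Knowing the column position of a feature is useful whenever you want to pick a single variable out of the data matrix.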
Note
This dataset includes the concept of "sex" as a feature variable important to research around diabetes. Many medical datasets include this type of binary classification. Think a bit about how categorizations such as this might exclude certain parts of a population from treatments.
Now, load up the X and y data.
Note
Remember, this is supervised learning, and we need a named "y" target.
In a new code cell, load the diabetes dataset by calling load_diabetes(). The input return_X_y=True signals that X will be a data matrix, and y will be the regression target.
1. Add some print commands to show the shape of the data matrix and its first element:
X, y = datasets.load_diabetes(return_X_y=True)
print(X.shape)
print(X[0])
(442, 10)
[ 0.03807591 0.05068012 0.06169621 0.02187239 -0.0442235 -0.03482076
-0.04340085 -0.00259226 0.01990749 -0.01764613]
The call returns a tuple of two values, which you unpack into X and y respectively. Learn more about tuples.
You can see that this data has 442 items shaped in arrays of 10 elements.
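The small decimal values in the first row may look surprising for quantities such as age or blood pressure. According to the Scikit-learn documentation, each feature column in this dataset has been mean centered and scaled so that its sum of squares is 1. A short check, sketched here with illustrative variable names, makes this visible and also shows the shape of the target:

```python
import numpy as np
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)

# One target value per sample.
print(X.shape, y.shape)

# Each column is centered, so its mean is (numerically) zero.
column_means = X.mean(axis=0)
print(np.round(column_means, 6))

# Each column is scaled so that its squared values sum to one.
column_square_sums = (X ** 2).sum(axis=0)
print(np.round(column_square_sums, 6))
```

This is why a feature such as bmi appears later in this section as a scaled value rather than in its original units.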
See also
Think a bit about the relationship between the data and the regression target. Linear regression predicts relationships between feature X and target variable y. Can you find the target for the diabetes dataset in the documentation? What is this dataset demonstrating, given that target?
2. Next, select a single feature of this dataset to plot (the column at index 2) and keep it as a two-dimensional column array by indexing with numpy's newaxis object. We are going to use linear regression to generate a line between values in this data, according to a pattern it determines.
X = X[:, np.newaxis, 2]
Note
At any time, print out the data to check its shape.
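To see what np.newaxis changes, it can help to try the same indexing on a tiny array. The sketch below uses a made-up 4 by 3 array in place of the diabetes data. Plain indexing with a column number returns a flat one-dimensional array, while adding np.newaxis keeps a column with shape (n_samples, 1), which is the two-dimensional layout Scikit-learn estimators expect for X.

```python
import numpy as np

# Made-up stand-in for a data matrix: 4 samples, 3 features.
data = np.arange(12).reshape(4, 3)

# Plain column selection drops a dimension.
flat = data[:, 2]
print(flat.shape)

# Adding np.newaxis keeps the result as a single-column matrix.
column = data[:, np.newaxis, 2]
print(column.shape)

# np.newaxis is an alias for None, and slicing gives the same column.
same_as_slice = np.array_equal(column, data[:, 2:3])
print(np.newaxis is None, same_as_slice)
```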
3. Now that you have data ready to be plotted, you can see if a machine can help determine a logical split between the numbers in this dataset. To do this, you need to split both the data (X) and the target (y) into test and training sets. Scikit-learn has a straightforward way to do this; you can set aside a given fraction of the samples as test data, chosen at random.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33)
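Because train_test_split shuffles the samples before splitting, each run of the line above produces a different split. If you want repeatable results, one option is to pass the random_state parameter. The sketch below uses a seed of 0, which is an arbitrary choice made for illustration, and prints the resulting set sizes.

```python
import numpy as np
from sklearn import datasets, model_selection

X, y = datasets.load_diabetes(return_X_y=True)

# Fixing random_state makes the shuffle, and so the split, repeatable.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0
)
print(len(X_train), len(X_test))

# Splitting again with the same seed gives the same test rows.
_, X_test_again, _, _ = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0
)
repeat_matches = np.array_equal(X_test, X_test_again)
print(repeat_matches)
```

About a third of the 442 samples end up in the test set, and the rest are used for training.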
4. Now you are ready to train your model! Load up the linear regression model and train it with your X and y training sets using model.fit():
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
Note
model.fit() is a function you'll see in many ML libraries such as TensorFlow.
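After fitting, a LinearRegression model stores what it learned in the coef_ and intercept_ attributes. For a single feature this is just the slope and intercept of a straight line. The sketch below is a simplified illustration that fits on the whole bmi column rather than on a training split, and then reproduces the model's predictions by hand.

```python
import numpy as np
from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)
X_bmi = X[:, np.newaxis, 2]

# For simplicity this sketch fits on all samples, not a training split.
model = linear_model.LinearRegression()
fitted = model.fit(X_bmi, y)

# fit() returns the estimator itself, which is why a notebook cell
# ending in model.fit(...) displays LinearRegression().
print(fitted is model)

slope = model.coef_[0]
intercept = model.intercept_
print("slope:", slope, "intercept:", intercept)

# The prediction is the straight line: slope * x + intercept.
manual_predictions = slope * X_bmi[:, 0] + intercept
matches = np.allclose(manual_predictions, model.predict(X_bmi))
print(matches)
```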
5. Then, create a prediction using the test data, using the function predict(). This will be used to draw the regression line through the data.
y_pred = model.predict(X_test)
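Besides plotting, you can put a number on how close the predictions are to the true test targets. The sketch below is one possible way to do this, using two functions from sklearn.metrics that are not otherwise used in this section: mean_squared_error and r2_score. The random_state value is an arbitrary seed chosen so that the sketch is repeatable.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection
from sklearn.metrics import mean_squared_error, r2_score

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, np.newaxis, 2]

# Arbitrary seed, used only to make this sketch repeatable.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0
)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Mean squared error: average squared gap between truth and prediction.
mse = mean_squared_error(y_test, y_pred)

# R squared: 1.0 is a perfect fit; values near 0 mean the model does
# little better than always predicting the mean of the target.
r2 = r2_score(y_test, y_pred)
print("MSE:", mse)
print("R squared:", r2)
```

With a single feature you should not expect a high score; the point is to have a baseline to compare against when you try other variables.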
6. Now it's time to show the data in a plot. Matplotlib is a very useful tool for this task. Create a scatterplot of all the X and y test data, and use the prediction to draw the fitted line through the data points.
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xlabel('Scaled BMIs')
plt.ylabel('Disease Progression')
plt.title('A Graph Plot Showing Diabetes Progression Against BMI')
plt.show()
See also
Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model.
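One way to explore the question above is to ask the fitted model about BMI values that are not in the dataset. In the sketch below, the two scaled BMI values are hypothetical inputs chosen for illustration, and the model is fitted on all samples to keep the example short. Note that predict() expects a two-dimensional array, even for a single new point.

```python
import numpy as np
from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)
X_bmi = X[:, np.newaxis, 2]

# For brevity this sketch fits on all samples, not a training split.
model = linear_model.LinearRegression()
model.fit(X_bmi, y)

# Two hypothetical, unseen scaled BMI values: one lower, one higher.
new_points = np.array([[-0.05], [0.05]])
new_predictions = model.predict(new_points)
print(new_predictions)

# The line slopes upward, so the higher BMI gets the higher prediction.
higher_bmi_predicts_more = new_predictions[1] > new_predictions[0]
print(higher_bmi_predicts_more)
```

Each prediction is simply the height of the fitted line above the given position on the x axis.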
Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot!
11.1.7. Self study#
In this tutorial, you worked with simple linear regression, which uses a single feature, rather than multiple linear regression, which uses several features at once. Read a little about the differences between these methods, or take a look at this video.
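As a starting point for that reading, the sketch below contrasts the two setups on the diabetes data: one model fitted on the bmi column only, and one fitted on all ten features. It compares the R squared score on the training data, which for ordinary least squares cannot get worse when features are added. The seed is an arbitrary choice made for repeatability.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

X, y = datasets.load_diabetes(return_X_y=True)

# Arbitrary seed, used only to make this sketch repeatable.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Simple linear regression: one feature (bmi, column 2).
simple_model = linear_model.LinearRegression()
simple_model.fit(X_train[:, [2]], y_train)

# Multiple linear regression: all ten features.
multiple_model = linear_model.LinearRegression()
multiple_model.fit(X_train, y_train)

# One coefficient per feature in each model.
print(simple_model.coef_.shape, multiple_model.coef_.shape)

# R squared on the training data for each model.
simple_train_r2 = simple_model.score(X_train[:, [2]], y_train)
multiple_train_r2 = multiple_model.score(X_train, y_train)
print(simple_train_r2, multiple_train_r2)
```

Scores on the test set are the fairer comparison in practice, since a model with more features can fit the training data better without predicting unseen data better.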
Read more about the concept of regression and think about what kinds of questions can be answered by this technique. Take this tutorial to deepen your understanding.
11.1.8. Your turn!#
Plot a different variable from this dataset. Hint: edit this line: X = X[:, np.newaxis, 2]. Given this dataset's target, what are you able to discover about the progression of diabetes as a disease?
Assignment - Regression with scikit-learn
11.1.9. Acknowledgments#
Thanks to Microsoft for creating the open-source course ML-For-Beginners. It inspires the majority of the content in this chapter.