42.76. Random forests intro and regression#

Random forests are powerful machine learning algorithms used for supervised classification and regression. A random forest works by averaging the predictions of multiple randomized decision trees. Decision trees tend to overfit, so by combining multiple trees, the effect of overfitting can be minimized.

Random forests are a type of ensemble model. More about ensemble models in the next notebook.

Unlike many other learning algorithms, random forests provide a way to measure the importance of each feature, and this is implemented in Sklearn.
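To make the averaging idea concrete, here is a minimal sketch (not part of the assignment; every name in it is illustrative) that trains a handful of decision trees on bootstrap samples of a synthetic dataset and averages their predictions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 80)

# Train several trees, each on a bootstrap sample of the data,
# then average their predictions to get the "forest" prediction.
preds = []
for seed in range(10):
    idx = rng.choice(len(X), size=len(X), replace=True)
    tree = DecisionTreeRegressor(random_state=seed).fit(X[idx], y[idx])
    preds.append(tree.predict(X))

forest_prediction = np.mean(preds, axis=0)

Strictly speaking, this sketch is bagging of decision trees; a random forest additionally randomizes the subset of features considered at each split, which further decorrelates the trees.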

42.76.1. Imports#

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt
import pytest
import ipytest
import unittest

ipytest.autoconfig()

42.76.2. Loading the data#

In this regression task with random forests, we will use the Machine CPU (Central Processing Unit) dataset, which is available on OpenML.

If you are reading this, it’s very likely that you know what a CPU is, or that you have thought about it once (or many times) when buying a computer. In this notebook, we will predict the relative performance of the CPU given the following data:

  • MYCT: machine cycle time in nanoseconds (integer)

  • MMIN: minimum main memory in kilobytes (integer)

  • MMAX: maximum main memory in kilobytes (integer)

  • CACH: cache memory in kilobytes (integer)

  • CHMIN: minimum channels in units (integer)

  • CHMAX: maximum channels in units (integer)

  • PRP: published relative performance (integer) (target variable)

# Let's hide warnings

import warnings

warnings.filterwarnings("ignore")
machine_data = pd.read_csv("../../../assets/data/machine_cup.csv")
type(machine_data)
machine_data.shape
machine_data.head()

42.76.3. Tasks and roles#

42.76.3.1. Task 1: Exploratory analysis#

Before doing exploratory analysis, let’s get the training and test data.

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(machine_data, test_size=0.2, random_state=20)
print(
    "The size of training data is: {} \nThe size of testing data is: {}".format(
        len(train_data), len(test_data)
    )
)

42.76.3.1.1. Part 1: The histogram#

def df_hist(df):
    if df is not None and not df.empty:
        df.hist(bins=50, figsize=(15, 10))
df_hist(train_data)
plt.show()
Check result by executing below... 📝
%%ipytest -qq

from unittest.mock import Mock, patch

class TestDFHist(unittest.TestCase):
  
    def test_df_hist_happy_case(self):
        # assign
        test_df = Mock(return_value=pd.DataFrame(
            {
                'c1': [1, 2, 3, 4, 5], 
            }
        ))
        test_df.empty = False
        
        with patch.object(test_df, 'hist') as mock_df_hist:
            # act
            actual_result = df_hist(test_df)

            # assert
            mock_df_hist.assert_called_once()

    def test_df_hist_with_empty_df(self):
        # assign
        test_df = Mock(return_value=pd.DataFrame())
        
        with patch.object(test_df, 'hist') as mock_df_hist:
            # act
            actual_result = df_hist(test_df)

            # assert
            mock_df_hist.assert_not_called()
            
    def test_df_hist_with_none_df(self):
        # assign
        test_df = Mock(return_value=None)
        
        with patch.object(test_df, 'hist') as mock_df_hist:
            # act
            actual_result = df_hist(test_df)

            # assert
            mock_df_hist.assert_not_called()
👩‍💻 Hint

You can consider using pandas.DataFrame.hist.

42.76.3.1.2. Part 2: The pairplot#

def df_pairplot(df):
    if df is not None and not df.empty:
        sns.pairplot(df)
df_pairplot(train_data)
Check result by executing below... 📝
%%ipytest -qq

from unittest.mock import Mock, patch

class TestDFPairplot(unittest.TestCase):
  
    def test_df_pairplot_happy_case(self):
        # assign
        test_df = Mock(return_value=pd.DataFrame(
            {
                'c1': [1, 2, 3, 4, 5], 
                'c2': [2, 4, 6, 8, 10], 
            }
        ))
        test_df.empty = False
        
        with patch.object(sns, 'pairplot') as mock_pairplot:
            # act
            actual_result = df_pairplot(test_df)

            # assert
            mock_pairplot.assert_called_once_with(test_df)

    def test_df_pairplot_with_empty_df(self):
        # assign
        test_df = Mock(return_value=pd.DataFrame())
        
        with patch.object(sns, 'pairplot') as mock_df_pairplot:
            # act
            actual_result = df_pairplot(test_df)

            # assert
            mock_df_pairplot.assert_not_called()
            
    def test_df_pairplot_with_none_df(self):
        # assign
        test_df = Mock(return_value=None)
        
        with patch.object(sns, 'pairplot') as mock_df_pairplot:
            # act
            actual_result = df_pairplot(test_df)

            # assert
            mock_df_pairplot.assert_not_called()
👩‍💻 Hint

You can consider using seaborn.pairplot.

42.76.3.1.3. Part 3: Check the train data#

def df_desc(df):
    if df is None:
        raise Exception('df cannot be None.')
    return df.describe()
# Let's check the summary stats
df_desc(train_data)
def df_null(df):
    if df is None:
        raise Exception('df cannot be None.')
    return df.isnull().sum()
# Let's check the missing values
df_null(train_data)

Great! We don’t have any missing values.

42.76.3.1.4. Part 4: Look at the correlation#

def df_corr(df):
    if df is not None and not df.empty:
        corr = df.corr()
        return corr
corr = df_corr(train_data)
corr["class"]
def df_heat(correlation):
    if correlation is not None:
        return sns.heatmap(correlation, annot=True, cmap="crest")
## Visualizing correlation

plt.figure(figsize=(12, 7))
df_heat(corr)

42.76.3.2. Task 2: Data preprocessing#

This is where we prepare the data in the proper format for the machine learning model. Let’s set up a pipeline to scale the features, but before that, let’s take the training input data and labels.

X_train = train_data.drop("class", axis=1)
y_train = train_data["class"]
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

scale_pipe = Pipeline([("scaler", StandardScaler())])
X_train_scaled = scale_pipe.fit_transform(X_train)

42.76.3.3. Task 3: Training random forests regressor#

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(
    min_samples_split=2, bootstrap=False, random_state=42, n_jobs=-1
)

forest_reg.fit(X_train_scaled, y_train)

42.76.3.4. Task 4: Evaluating random forests regressor#

Let’s first check the root mean squared error on the training set. It is not advised to evaluate the model on the test data since we haven’t improved it yet. We will make a function to make this easier and to avoid repetition.

from sklearn.metrics import mean_squared_error


def predict(input_data, model, labels):
    """
    Take the input data, model and labels and return predictions

    """

    preds = model.predict(input_data)
    mse = mean_squared_error(labels, preds)
    rmse = np.sqrt(mse)

    return rmse
predict(X_train_scaled, forest_reg, y_train)

42.76.3.5. Task 5: Improving random forests#

forest_reg.get_params()

We will use grid search to find the best hyperparameters and retrain the model with them. By setting refit to True, the random forest will automatically be retrained on the dataset with the best hyperparameters. By default, refit is True.

This may take a long time to run.

from sklearn.model_selection import GridSearchCV

params_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_leaf_nodes": list(range(2, 52)),
}

# refit is true by default. The best estimator is trained on the whole dataset

grid_search = GridSearchCV(
    RandomForestRegressor(min_samples_split=2, bootstrap=False, random_state=42),
    params_grid,
    verbose=1,
    cv=5,
)

grid_search.fit(X_train_scaled, y_train)
grid_search.best_params_
grid_search.best_estimator_
forest_best = grid_search.best_estimator_

Let’s make predictions on the training data again.

predict(X_train_scaled, forest_best, y_train)

Surprisingly, searching over the model hyperparameters did not improve the model. Can you guess why? We can observe many things while running grid search and reading about random forests. If you can’t get good results, look at the bootstrap hyperparameter. It is True by default, which means that each tree is trained on bootstrap samples of the training set instead of the whole training set (here we set it to False). Try going back to the original model, change it to True, and note how the prediction changes. Also learn more about the other hyperparameters.
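As a quick sketch of that experiment (the variable forest_reg_boot is ours, not part of the original assignment), you can retrain the same regressor with bootstrap=True and compare the training RMSE:

forest_reg_boot = RandomForestRegressor(
    min_samples_split=2, bootstrap=True, random_state=42, n_jobs=-1
)
forest_reg_boot.fit(X_train_scaled, y_train)

# Same helper as above: returns the RMSE on the training set
predict(X_train_scaled, forest_reg_boot, y_train)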

42.76.3.6. Task 6: Feature importance#

Unlike many other machine learning models, random forests can show how much each feature contributed to the model’s generalization. Let’s find that. The results are values between 0 and 1; the closer to 1, the more important the feature was to the model.

feat_import = forest_best.feature_importances_

feat_dict = {"Features": X_train.columns, "Feature Importance": feat_import}

pd.DataFrame(feat_dict)

As you can see above, the 2 features which contributed most to the prediction of the relative performance of the CPU are MMAX, the maximum main memory in kilobytes, and CACH (cache memory).

It makes sense that the model was able to find that out. Main memory (RAM, Random Access Memory) and cache memory (which stores frequently used information, thus enabling faster processing and quick retrieval of information) are two of the most important factors in CPU performance, and if you are going to buy a new computer, you want high RAM and cache memory in order to have a powerful machine that can compute and retrieve things faster.
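To make the ranking easier to read, one possible visualization (this plotting cell is an illustration, and feat_df is a name we introduce here) is a sorted horizontal bar chart of the importances:

# Sort the importances from the table above, most important first
feat_df = pd.DataFrame(feat_dict).sort_values("Feature Importance", ascending=False)

plt.figure(figsize=(10, 5))
plt.barh(feat_df["Features"], feat_df["Feature Importance"])
plt.gca().invert_yaxis()  # put the most important feature on top
plt.xlabel("Feature importance")
plt.show()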

42.76.3.7. Task 7: Evaluating the Model on the Test Set#

Let us evaluate the model on the test set. But first we need to run the pipeline on the test data. Note that we only transform (not fit_transform).

train_data, test_data = train_test_split(machine_data, test_size=0.2, random_state=20)
X_test = test_data.drop("class", axis=1)
y_test = test_data["class"]

test_scaled = scale_pipe.transform(X_test)
predict(test_scaled, forest_best, y_test)

The results on the test set are not appealing, and that is a sign that the model is still overfitting the data (it does well on the training set but poorly on new data). One way to improve the model is to regularize it by searching for better hyperparameters, and to increase the amount and quality of the data. The latter is what improves the model in many scenarios.
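As one possible starting point for that regularization, here is a hedged sketch of a broader search; the grid values and the names params_grid_reg and grid_search_reg are illustrative, not tuned recommendations.

# max_depth and min_samples_leaf limit tree growth, which regularizes the forest;
# max_features controls how many features each split considers.
params_grid_reg = {
    "n_estimators": [200, 400],
    "max_depth": [4, 8, 16, None],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt"],
}

grid_search_reg = GridSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    params_grid_reg,
    cv=5,
)
grid_search_reg.fit(X_train_scaled, y_train)

# Evaluate the retrained best estimator on the held-out test set
predict(test_scaled, grid_search_reg.best_estimator_, y_test)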

This is the end of this assignment. We have learned the fundamental idea behind random forests and used it to predict CPU performance. In the next assignment, we will use random forests for a classification task on a real-world dataset so that we can practically improve them.

42.76.4. Acknowledgments#

Thanks to Nyandwi for creating the open-source course Machine Learning Complete. It inspired the majority of the content in this chapter.