42.78. Beyond random forests: more ensemble models#

Ensembles combine different machine learning models; by aggregating the results of several models trained on the same dataset, the ensemble can perform better than any individual model. These types of models can be used for all kinds of tasks, including classification, regression, and anomaly detection.

In his book Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, Aurélien Géron gives a perfect analogy to describe ensemble learning: “Suppose you pose a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert’s answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.”

The random forest that was used in the previous assignment is an example of an ensemble model: it combines different decision trees. These types of models are usually classified into three groups, which will be covered in this assignment:

  • Averaging Methods: such as voting classifiers/regressors, bagging classifiers/regressors, random forests, extra-trees classifiers/regressors, etc.

  • Boosting Methods: Gradient Boosting, AdaBoost, XGBoost

  • Stacking Methods: in which, instead of averaging the results of multiple models, multiple models are trained on the full training set, and a final model trained on different subsets (folds) of the training set combines their predictions.

Most of the above algorithms, including stacking, are implemented in Scikit-Learn; XGBoost is provided by a separate library.

As said in the beginning, these types of models can be used for both regression and classification, but this assignment will focus on classification. To keep the focus on the ensemble models themselves, we will use the same dataset we used previously, with random forests.

42.78.1. Ensemble methods for classification#

42.78.2. Imports#

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt

42.78.3. Loading the data#

In this assignment, we will use ensemble methods to build a classifier that identifies the increase or decrease of the electricity price, using “the data that was collected from the Australian New South Wales Electricity Market. In this market, prices are not fixed and are affected by demand and supply of the market. They are set every five minutes. Electricity transfers to/from the neighboring state of Victoria were done to alleviate fluctuations.”

“The dataset contains 45,312 instances dated from 7 May 1996 to 5 December 1998. Each example of the dataset refers to a period of 30 minutes, i.e. there are 48 instances for each time period of one day. Each example on the dataset has 5 fields, the day of week, the time stamp, the New South Wales electricity demand, the Victoria electricity demand, the scheduled electricity transfer between states and the class label. The class label identifies the change of the price (UP or DOWN) in New South Wales relative to a moving average of the last 24 hours (and removes the impact of longer term price trends).” Source: Open ML electricity.

Here is the information about the features:

  • Date: date between 7 May 1996 and 5 December 1998. Here normalized between 0 and 1

  • Day: day of the week (1-7)

  • Period: time of the measurement (1-48) in half hour intervals over 24 hours. Here normalized between 0 and 1

  • NSWprice: New South Wales electricity price, normalized between 0 and 1

  • NSWdemand: New South Wales electricity demand, normalized between 0 and 1

  • VICprice: Victoria electricity price, normalized between 0 and 1

  • VICdemand: Victoria electricity demand, normalized between 0 and 1

  • transfer: scheduled electricity transfer between both states, normalized between 0 and 1

# Let's hide warnings

import warnings

warnings.filterwarnings("ignore")
elec_df = pd.read_csv("../../../assets/data/elec_data.csv")
type(elec_df)
elec_df.shape
elec_df.head()

42.78.4. Tasks and roles#

42.78.4.1. Task 1: Exploratory data analysis#

Before doing exploratory analysis, as always, let’s split the data into training and test sets.

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(elec_df, test_size=0.25, random_state=20)

print(
    "The size of training data is: {} \nThe size of testing data is: {}".format(
        len(train_data), len(test_data)
    )
)

Let’s take a quick look into the dataset

train_data.shape
train_data.head(10)
# Displaying the last rows

train_data.tail()
train_data.info()

Two things to draw from the dataset for now:

  • The target feature class is categorical. We will make sure to encode that during data preprocessing.

  • All numerical features are already normalized, so we won’t need to normalize these types of features.

# Checking summary stats

train_data.describe()
# Checking missing values

train_data.isnull().sum()

Great, we don’t have any missing values. Usually, there are three ways to handle them if they are present:

  • We can remove all missing values completely

  • We can leave them as they are

  • We can fill them with a given strategy, such as the mean, median, or most frequent value. Both Scikit-Learn and Pandas provide quick ways to fill in these kinds of values, as sketched below.

If you still want to know more about how to deal with missing values, please refer to this article.
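As a minimal, hedged sketch of the third strategy (not needed for this dataset, since nothing is missing), Scikit-Learn’s SimpleImputer fills missing values with a chosen statistic:

from sklearn.impute import SimpleImputer

# Fit on the numerical columns; with no missing values this is a no-op,
# but it shows the API
imputer = SimpleImputer(strategy="median")
numeric_imputed = imputer.fit_transform(train_data.select_dtypes(include="number"))

With Pandas, fillna offers a similar one-liner, e.g. df["col"].fillna(df["col"].median()).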

# Checking feature correlation

corr = train_data.corr()
## Visualizing correlation

plt.figure(figsize=(12, 7))

sns.heatmap(corr, annot=True, cmap="crest")

It seems that we don’t have features that are highly correlated. The correlation shown above varies from -1 to 1. If the correlation between two features is close to 1, it means that they contain nearly the same information. If it is close to -1, it means that they are strongly negatively correlated: one increases as the other decreases. For example, vicdemand correlates with nswdemand at 0.67.

So if you drop one of two highly correlated features, your model will likely not be affected much. Also, contrary to what you may have seen in many articles, a feature that does not correlate with the target feature is not necessarily useless.

In the above correlation matrix, you can see that the class feature is not there; this is because it still holds categorical values. A quick sketch of including it follows below.
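As a hedged sketch (assuming the labels are the strings “UP” and “DOWN”, as in the dataset description), we can encode class as 0/1 and see how each feature correlates with it:

# Encode the class as 0/1 and recompute the correlation so that
# the target appears in the matrix
corr_with_class = (
    train_data.drop("class", axis=1)
    .assign(class_num=(train_data["class"] == "UP").astype(int))
    .corr()
)
corr_with_class["class_num"].sort_values(ascending=False)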

42.78.4.2. Task 2: More data exploration#

Before preprocessing the data, let’s take a look into specific features.

Let’s see how many Ups/Downs are in the class feature.

plt.figure(figsize=(12, 7))
sns.countplot(data=train_data, x="class")

Day is the day of the week, from 1 to 7 (Monday to Sunday). Let’s count the occurrences of each day with respect to the ups/downs of the electricity price.

plt.figure(figsize=(13, 8))

sns.countplot(data=train_data, x="day", hue="class")

It seems that most days had more downs than ups. From the beginning of the week, there was a consistent increase in downs (the price of electricity went down) and a decrease in ups. Let’s see if there is an appealing relationship between the demand/price of electricity in New South Wales and Victoria.

plt.figure(figsize=(13, 8))
sns.scatterplot(data=train_data, x="vicdemand", y="nswdemand", hue="class")

The relationship between electricity demand in New South Wales and Victoria is roughly linear. Let’s see if we can get any other insights by bringing days into the demand analysis.

plt.figure(figsize=(20, 10))
sns.scatterplot(data=train_data, x="vicdemand", y="nswdemand", hue="day", size="day")

Although it is hard to draw a strong conclusion, there is less demand for electricity in both states on Monday and Sunday than on other days. We can use a line plot to show the demand in both states over the course of the week.

plt.figure(figsize=(13, 8))
sns.lineplot(data=train_data, x="day", y="nswdemand", color="green")
plt.figure(figsize=(13, 8))
sns.lineplot(data=train_data, x="day", y="vicdemand", color="red")

Another interesting thing to look for in the dataset is whether there are seasonalities/trends in the demand/price in either Victoria or New South Wales over time. In time series analysis, seasonality is a repetitive pattern or consistent behaviour that recurs over the course of time.

If you look at the demand for electricity in both states over the course of the dates (7 May 1996 to 5 December 1998), you can see what appear to be some seasonalities. It is not 100% clear, but if this dataset had been collected for more than two years, it would probably be easier to tell for sure whether there are seasonalities.

plt.figure(figsize=(20, 10))
sns.lineplot(data=train_data, x="date", y="nswdemand")
plt.figure(figsize=(20, 10))
sns.lineplot(data=train_data, x="date", y="vicdemand")

One last thing about data analysis: let’s plot the histograms of all the numerical features.

train_data.hist(bins=50, figsize=(15, 10))
plt.show()

42.78.4.3. Task 3: Data preprocessing#

It is here that we prepare the data to be in the proper format for the machine learning model.

Let’s encode the categorical feature class. But before that, let’s take training input data and labels.

X_train = train_data.drop("class", axis=1)
y_train = train_data["class"]
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
y_train_prepared = label_enc.fit_transform(y_train)
X_train.head()

In Python, it does not make sense to multiply a sequence (list, tuple, etc.) by a float, so for some of the subsequent operations we need to convert X_train into a NumPy array.

X_train = np.array(X_train)
X_train
y_train_prepared

Now we are ready to train the machine learning model.

But if you look at the data again, the day feature is not normalized like the other features. We can normalize it (a minimal sketch follows below) or leave it as is; for now, let’s leave it and go ahead and train the ensemble classifiers.
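If we did want to normalize it, a hedged sketch with Scikit-Learn’s MinMaxScaler might look like this (we leave the feature unscaled in what follows):

from sklearn.preprocessing import MinMaxScaler

# Scale the day feature to the [0, 1] range, like the other features
scaler = MinMaxScaler()
day_scaled = scaler.fit_transform(train_data[["day"]])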

42.78.4.4. Task 4: Training ensemble classifiers#

42.78.4.4.1. 4.1: Voting classifier#

Let’s assume that you have trained 3 different classifiers on the training data, but none of them achieved outstanding results.

The idea of the voting ensemble technique is fairly simple: we aggregate the predictions of all those 3 classifiers, and the combined result will often be better than any single classifier.

Let’s train 3 classifiers on the training data and then we will go ahead and aggregate their results.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

from sklearn.metrics import accuracy_score

log_classifier = LogisticRegression()
sv_classifier = SVC()
sgd_classifier = SGDClassifier()


def classifiers(clf1, clf2, clf3, X_train, y_train):

    """
    A function that takes 5 inputs: 3 classifiers, training data & labels
    And return the list of accuracies on all classifiers

    """

    # A list of all classifiers
    clfs = [clf1, clf2, clf3]

    # An empty list to comprehend
    all_clfs_acc = []

    # Train each classifier, evaluate it on the training set
    # And append the accuracy to 'all_clfs_acc'

    for clf in clfs:

        clf.fit(X_train, y_train)
        preds = clf.predict(X_train)
        acc = accuracy_score(y_train, preds)
        all_clfs_acc.append(float(acc))

    return all_clfs_acc
classifiers(log_classifier, sv_classifier, sgd_classifier, X_train, y_train_prepared)

As you can see, the function returned 3 accuracies on the training set. The first accuracy corresponds to Logistic Regression, the second to the Support Vector Classifier, and the third to SGD (Stochastic Gradient Descent).

Now, let us use Voting Classifier to aggregate the results of all of those 3 classifiers.

from sklearn.ensemble import VotingClassifier

vot_classifier = VotingClassifier(
    estimators=[
        ("log_reg", log_classifier),
        ("svc", sv_classifier),
        ("sgd", sgd_classifier),
    ],
    voting="hard",
)

vot_classifier.fit(X_train, y_train_prepared)

Since we will need to calculate accuracy often, let’s write a function that we can call whenever we need it.

from sklearn.metrics import accuracy_score


def accuracy(model, data, labels):

    predictions = model.predict(data)
    acc = accuracy_score(labels, predictions)

    return acc

Let’s use the above function to find the accuracy of the voting classifier.

accuracy(vot_classifier, X_train, y_train_prepared)

As we can see, it slightly outperformed all the individual classifiers. The increase in accuracy is small, but the ensemble will usually be at least as good as the best individual classifier. Note that we used hard voting, which takes a majority vote of the predicted classes; a soft-voting sketch follows below.
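As a hedged variant, here is a sketch of soft voting, which averages the predicted class probabilities instead of counting votes; it often works better when the base classifiers are well calibrated. SVC needs probability=True to expose predict_proba, and SGDClassifier with its default hinge loss has no predict_proba, so it is omitted here:

# Soft voting: average class probabilities across classifiers
vot_soft_classifier = VotingClassifier(
    estimators=[
        ("log_reg", LogisticRegression()),
        ("svc", SVC(probability=True)),
    ],
    voting="soft",
)

vot_soft_classifier.fit(X_train, y_train_prepared)
accuracy(vot_soft_classifier, X_train, y_train_prepared)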

42.78.4.4.2. 4.2: Bagging classifier#

Instead of training different algorithms on the same data and averaging their results as voting does, the bagging ensemble method trains one type of classifier/regressor on different subsets of the training data and aggregates the results across all subsets.

When this is used with complex models (that overfit data easily) like decision trees, the overfitting can be reduced.

Let’s use bagging to train 500 decision trees on different subsets of data and then average the predictions on those subsets.

By setting max_samples=0.5, max_features=0.5, bootstrap=False, we are using random 50% subsets of the training samples and random 50% subsets of the features. If bootstrap is True, the training samples are drawn from the training data with replacement; if it is False, they are drawn without replacement. When bootstrap is False, the technique is called pasting. There are also other techniques, called random subspaces and random patches, which differ in how the samples and features are drawn from the data. You can learn more about these techniques and the other hyperparameters in the Scikit-Learn documentation.

One of the best ways to improve a particular machine learning model is to learn about its hyperparameters and what each one stands for. So, you can learn more about bagging here.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_classifier = BaggingClassifier(
    DecisionTreeClassifier(class_weight="balanced"),
    n_estimators=500,
    max_samples=0.5,
    max_features=0.5,
    bootstrap=False,
)

bag_classifier.fit(X_train, y_train_prepared)
accuracy(bag_classifier, X_train, y_train_prepared)

Wow, this is much better. Bagging ensembles work well, and here the bagging ensemble clearly outperformed the voting ensemble. Another remarkable thing about them is that they can reduce overfitting, especially when used with decision trees (decision trees tend to overfit easily). Before looking at the next type of ensemble model, the gradient boosting classifier, here is a quick comparison with classic bootstrap bagging.
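This is only a sketch for comparison: the same ensemble with bootstrap=True samples with replacement (classic bagging) instead of pasting; the accuracies are typically close.

bag_bootstrap_classifier = BaggingClassifier(
    DecisionTreeClassifier(class_weight="balanced"),
    n_estimators=500,
    max_samples=0.5,
    max_features=0.5,
    bootstrap=True,  # sample with replacement (bagging) instead of pasting
)

bag_bootstrap_classifier.fit(X_train, y_train_prepared)
accuracy(bag_bootstrap_classifier, X_train, y_train_prepared)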

42.78.4.4.3. 4.3: Gradient boosting classifier#

Gradient boosting resembles the bagging methods; the main difference is that instead of training models on subsets of the training data, the models (decision trees) are trained in a sequence, where each tree learns from the errors of the previous tree to correct them, and the sequence goes on.

Simply put, the initial model is trained on the full data, and each subsequent model tries to minimize the errors of the previous one. The toy sketch below illustrates the idea.
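To make the idea concrete, here is a minimal sketch on a made-up toy regression problem (the data, tree depth, and number of trees are arbitrary): each tree is fit to the residual errors of the trees before it, and the ensemble prediction is the sum of the trees’ predictions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A hypothetical toy dataset, just to illustrate the boosting idea
rng = np.random.RandomState(42)
X_toy = rng.rand(100, 1)
y_toy = 3 * X_toy[:, 0] ** 2 + 0.05 * rng.randn(100)

# Each tree fits the errors left by the previous trees
tree1 = DecisionTreeRegressor(max_depth=2).fit(X_toy, y_toy)
residuals1 = y_toy - tree1.predict(X_toy)
tree2 = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals1)
residuals2 = residuals1 - tree2.predict(X_toy)
tree3 = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals2)

# The ensemble prediction is the sum of the trees' predictions
y_toy_pred = sum(tree.predict(X_toy) for tree in (tree1, tree2, tree3))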

Just like other models, the gradient boosting classifier has hyperparameters, but the most important ones are the number of estimators or trees (n_estimators) and the learning rate (learning_rate).

from sklearn.ensemble import GradientBoostingClassifier

grad_boost_clf = GradientBoostingClassifier(
    n_estimators=500, learning_rate=0.8, random_state=42, max_depth=2
)

grad_boost_clf.fit(X_train, y_train_prepared)

Let’s evaluate it on the training set.

accuracy(grad_boost_clf, X_train, y_train_prepared)

One disadvantage of the gradient boosting ensemble method is that it can easily overfit, and that has to do with how it works: by minimizing the errors consecutively (tree after tree), it can fit the training data almost perfectly, but of course it won’t be as good on the test data.

One way to avoid overfitting is to carefully choose a learning rate that scales well with the number of estimators. Although a high number of estimators does not necessarily cause overfitting (gradient boosting is a fairly robust model), it should be paired with a low learning rate. In other words, there is a trade-off between these two hyperparameters: the higher the learning rate, the fewer estimators you need, and vice versa. If you can get good results with a low learning rate, there is a good chance that the model will generalize well on the test set too.

You can spend some time changing these two hyperparameters and observing their effects, for example with the sketch below.
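Here is a rough sketch of that experiment; the (learning_rate, n_estimators) pairs below are arbitrary, and training each model can take a while:

# Compare training accuracies for a few arbitrary hyperparameter pairs
for lr, n_est in [(0.8, 500), (0.1, 500), (0.05, 1000)]:
    gb_clf = GradientBoostingClassifier(
        n_estimators=n_est, learning_rate=lr, max_depth=2, random_state=42
    )
    gb_clf.fit(X_train, y_train_prepared)
    print(lr, n_est, accuracy(gb_clf, X_train, y_train_prepared))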

For large datasets, Scikit-Learn recommends using the histogram-based gradient boosting classifier, a much faster version of the gradient boosting classifier. It is inspired by Microsoft’s Light Gradient Boosting Machine (LightGBM). LightGBM is faster, has low memory usage and good performance, supports distributed training and GPUs, and can handle large datasets. A minimal sketch follows below.
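As a hedged sketch (assuming a recent version of Scikit-Learn, where HistGradientBoostingClassifier is available in sklearn.ensemble), it is a near drop-in replacement:

from sklearn.ensemble import HistGradientBoostingClassifier

# Mostly default hyperparameters; it bins features into histograms,
# which makes training much faster on larger datasets
hist_grad_clf = HistGradientBoostingClassifier(random_state=42)
hist_grad_clf.fit(X_train, y_train_prepared)
accuracy(hist_grad_clf, X_train, y_train_prepared)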

42.78.4.4.4. 4.4: AdaBoost classifier#

AdaBoost is another ensemble model in the class of boosting methods. It is very much like gradient boosting, but instead of fitting each new model to the errors of the previous models, it updates the weights of the training instances.

So, the first model (a decision tree) is trained on the full training data, the instance weights are then updated based on the previous model’s mistakes, the next model is trained with those weights, and so forth.

The main hyperparameters to tune to make AdaBoost work well are the number of estimators and the maximum depth of each estimator.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adaboost_clf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=3, class_weight="balanced"),
    # base estimator is decision trees by default
    n_estimators=300,
    learning_rate=0.5,
)

adaboost_clf.fit(X_train, y_train_prepared)

Let’s evaluate it on the training set.

accuracy(adaboost_clf, X_train, y_train_prepared)

Again, you can tune the number of estimators and the depth of the base estimator. The base estimator is Decision Trees by default.

42.78.4.4.5. 4.5: Stacking classifier#

Also referred to as stacked generalization, this is an ensemble method employed to reduce the biases and the error rates made by multiple predictors (individual models/estimators).

Instead of averaging the predictions made by the individual models, in stacking, multiple models are trained on the full training data, and then a final model, trained on different subsets (folds) of the training data, takes the predictions of the former models and produces the final predictions.

In its original paper, it is noted that stacking can be seen as a more sophisticated version of cross-validation (which is the reason why the final estimator is trained on subsets, or folds, of the training data).

from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


base_estimators = [
    ("rand", RandomForestClassifier(random_state=42)),
    ("svc", SVC(random_state=42)),
]

final_estimator = LogisticRegression()

stack_clf = StackingClassifier(
    estimators=base_estimators, final_estimator=final_estimator
)

stack_clf.fit(X_train, y_train_prepared)

Let’s evaluate it on the training set.

accuracy(stack_clf, X_train, y_train_prepared)

Oh, the accuracy looks appealing! But is it really that good? We may well have overfit the training data. One way to check is to estimate the generalization error with cross-validation, as sketched below.
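This sketch uses 3-fold cross-validation on the training set; note that it re-trains the stacked ensemble several times and can be slow:

from sklearn.model_selection import cross_val_score

# Cross-validated accuracy is a more honest estimate
# than the training accuracy above
cv_scores = cross_val_score(stack_clf, X_train, y_train_prepared, cv=3)
cv_scores.mean()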

42.78.4.5. Task 5: Evaluating the ensemble model on the test set#

Let’s say that all we have been doing was trying to find an ensemble model that fits the data well, and so we want to test it on the test set before shipping it to production.

To narrow the choice down, we will use the gradient boosting model.

As always, we will prepare the test set in the same way that we prepared the training set. Let’s go!

X_test = test_data.drop("class", axis=1)
y_test = test_data["class"]

y_test_prepared = label_enc.transform(y_test)
accuracy(grad_boost_clf, X_test, y_test_prepared)

That’s not really bad, considering that the model never saw the test data in any of the previous steps. The gradient boosting model had nearly 92% accuracy on the training data.

Let’s also evaluate the stacking classifier. It was overly optimistic on the training data, nearly 100%!

accuracy(stack_clf, X_test, y_test_prepared)

How about trying the bagging classifier as well? It had nearly 98% accuracy on the training data.

accuracy(bag_classifier, X_test, y_test_prepared)

As we can see, none of the ensembles generalizes to the test set as well as it performed on the training set. The results are not that bad, but there is room for improvement. One sure way to improve the results of a machine learning model is to improve the data.

You may also have to tune the hyperparameters of the particular ensemble method, and this sometimes works too; a grid-search sketch follows below.
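As a hedged sketch of such tuning with GridSearchCV (the parameter grid below is illustrative, not a recommendation, and the search re-trains the model many times):

from sklearn.model_selection import GridSearchCV

# An illustrative grid of the two key gradient boosting hyperparameters
param_grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.05, 0.1, 0.5],
}

grid_search = GridSearchCV(
    GradientBoostingClassifier(max_depth=2, random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)

grid_search.fit(X_train, y_train_prepared)
grid_search.best_params_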

42.78.5. Final note#

This notebook was all about ensemble learning methods. There is truly wisdom in the crowd (cf. Aurélien Géron). By aggregating the results of different models, we are able to improve the overall prediction.

There is a notion that ensemble models are slow and expensive to run in production. That is true, but with today’s computational power and how well these types of ML models work, this is no longer a blocker for most applications. They are complex algorithms, but they can also reduce complexity. How?

Think about it: instead of building a single complex model (typically a neural network), you can build small models that train and compute predictions faster, and then aggregate their results using a given ensemble method. That way, you are leveraging ensemble methods to reduce complexity.

To learn more about ensemble models, you can refer to Chapter 7, Ensemble Learning and Random Forests, of the book Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow.

42.78.6. Acknowledgments#

Thanks to Nyandwi for creating the open-source course Machine Learning Complete, which inspires the majority of the content in this chapter.