Random forests for classification
42.77. Random forests for classification#
Random forests are powerful machine learning algorithms used for supervised classification and regression. A random forest works by averaging the predictions of multiple randomized decision trees. A single decision tree tends to overfit, so combining many decision trees reduces the effect of overfitting.
Random forests are a type of ensemble model. More about ensemble models in the next assignment.
Unlike many other learning algorithms, random forests provide a built-in way to estimate the importance of each feature, and this is implemented in Sklearn.
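The averaging idea can be sketched by hand. The snippet below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` (not the electricity dataset), trains 25 randomized decision trees on bootstrap samples, and averages their class probabilities, which is roughly what a random forest classifier does internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration (not the electricity dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Each tree sees a bootstrap sample (rows drawn with replacement) ...
    rows = rng.integers(0, len(X), size=len(X))
    # ... and considers a random subset of the features at every split.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X[rows], y[rows]))

# Average the per-tree class probabilities, then pick the most probable class.
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print(avg_proba[:3], ensemble_pred[:3])
```

Each individual tree is noisy, but the averaged probabilities are smoother than those of any single tree, which is where the reduction in overfitting comes from.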
42.77.1. Imports#
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt
42.77.2. Loading the data#
In this assignment, we will use random forests to build a classifier that identifies whether the electricity price goes up or down, using “the data that was collected from the Australian New South Wales Electricity Market. In this market, prices are not fixed and are affected by demand and supply of the market. They are set every five minutes. Electricity transfers to/from the neighboring state of Victoria were done to alleviate fluctuations.”
“The dataset contains 45,312 instances dated from 7 May 1996 to 5 December 1998. Each example of the dataset refers to a period of 30 minutes, i.e. there are 48 instances for each time period of one day. Each example on the dataset has 5 fields, the day of week, the time stamp, the New South Wales electricity demand, the Victoria electricity demand, the scheduled electricity transfer between states and the class label. The class label identifies the change of the price (UP or DOWN) in New South Wales relative to a moving average of the last 24 hours (and removes the impact of longer term price trends).” Source: Open ML electricity.
Here is the information about the features:
Date: date between 7 May 1996 to 5 December 1998. Here normalized between 0 and 1
Day: day of the week (1-7)
Period: time of the measurement (1-48) in half hour intervals over 24 hours. Here normalized between 0 and 1
NSWprice: New South Wales electricity price, normalized between 0 and 1
NSWdemand: New South Wales electricity demand, normalized between 0 and 1
VICprice: Victoria electricity price, normalized between 0 and 1
VICdemand: Victoria electricity demand, normalized between 0 and 1
transfer: scheduled electricity transfer between both states, normalized between 0 and 1
# Let's hide warnings
import warnings
warnings.filterwarnings("ignore")
elec_df = pd.read_csv("../../../assets/data/elec_data.csv")
type(elec_df)
elec_df.shape
elec_df.head()
42.77.2.1. Task 1: Exploratory data analysis#
Before doing exploratory analysis, as always, let’s split the data into training and test sets.
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(elec_df, test_size=0.25, random_state=20)
print(
"The size of training data is: {} \nThe size of testing data is: {}".format(
len(train_data), len(test_data)
)
)
Taking a quick look into the dataset
train_data.head(10)
# Displaying the last rows
train_data.tail()
train_data.info()
Two things to draw from the dataset for now:
The target feature class is categorical. We will make sure to encode it during data preprocessing.
The numerical features other than day are already normalized, so we won’t need to normalize these types of features.
# Checking summary statistics
train_data.describe()
# Checking missing values
train_data.isnull().sum()
Great, we don’t have any missing values. When missing values are present, there are usually three options:
We can remove all missing values completely
We can leave them as they are
We can fill them with a given strategy such as the mean, median or most frequent value. Both Sklearn and Pandas provide quick ways to fill these kinds of values.
If you still want to know more about how to deal with missing values, please refer to this article
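The options listed above can be sketched on a small frame. The `toy` DataFrame below is hypothetical (the electricity data has no gaps); the column names are borrowed from the dataset only to keep the example familiar.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with gaps; the electricity data has none.
toy = pd.DataFrame(
    {
        "nswdemand": [0.2, np.nan, 0.6, 0.4],
        "vicdemand": [0.1, 0.3, np.nan, 0.3],
    }
)

# Remove every row that contains a missing value.
dropped = toy.dropna()

# Fill with pandas: replace each gap with the median of its column.
filled = toy.fillna(toy.median())

# Fill with Sklearn: the imputer learns the fill values in fit() and can
# reuse them on another split (for example the test set) with transform().
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(dropped, filled, imputed, sep="\n\n")
```

Learning the fill values on the training set and reusing them on the test set avoids leaking test information into preprocessing.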
# Checking feature correlation (numeric columns only, since class is still categorical)
corr = train_data.corr(numeric_only=True)
## Visualizing correlation
plt.figure(figsize=(12, 7))
sns.heatmap(corr, annot=True, cmap="crest")
It seems that we don’t have features that are too strongly correlated. The correlation shown above varies from -1 to 1. If the correlation between two features is close to 1, they contain nearly the same information. If it is close to -1, they also carry nearly the same information, but one goes down as the other goes up. A value close to 0 means that there is no linear relationship between them. Take an example: vicdemand correlates with nswdemand at 0.67.
So if you drop one of those features, it’s likely that your model will not be affected much. And contrary to what you may have seen in many articles, a feature that does not correlate with the target feature is not necessarily useless.
In the above correlation matrix, you can see that the class feature is not there. This is because it still has categorical values.
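To make the meaning of the correlation values concrete, here is a small sketch on made-up columns (the names `same_direction`, `opposite_direction` and `unrelated` are illustrative, not part of the dataset). It shows the three typical cases: close to 1, close to -1, and close to 0.

```python
import numpy as np
import pandas as pd

x = np.linspace(0, 1, 50)
rng = np.random.default_rng(0)

toy = pd.DataFrame(
    {
        "x": x,
        "same_direction": 2 * x + 1,       # rises with x: correlation of 1
        "opposite_direction": -3 * x,      # falls as x rises: correlation of -1
        "unrelated": rng.normal(size=50),  # independent noise: close to 0
    }
)

# Pearson correlation of every column with x.
corr_with_x = toy.corr()["x"]
print(corr_with_x)
```

Both the 1 and the -1 case mean that one column can be recovered from the other, so either of them is redundant once the other is known.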
42.77.2.2. Task 2: More data exploration#
Before preprocessing the data, let’s take a look into specific features.
Let’s see how many Ups/Downs are in the class feature.
plt.figure(figsize=(12, 7))
sns.countplot(data=train_data, x="class")
Day is the day of the week, from 1 to 7, Monday to Sunday. Let’s count the occurrences of each day with respect to the ups/downs of the electricity’s price.
plt.figure(figsize=(13, 8))
sns.countplot(data=train_data, x="day", hue="class")
It seems that most days had more downs than ups. From the beginning of the week, there was a consistent increase in downs (the price of electricity went down) and a decrease in ups. Let’s see if there is an interesting relationship between the demand for electricity in New South Wales and in Victoria.
plt.figure(figsize=(13, 8))
sns.scatterplot(data=train_data, x="vicdemand", y="nswdemand", hue="class")
The relationship between the electricity demand in New South Wales and in Victoria is roughly linear. Let’s see if we can get any other insights by bringing days into the demand analysis.
plt.figure(figsize=(20, 10))
sns.scatterplot(data=train_data, x="vicdemand", y="nswdemand", hue="day", size="day")
Although it is hard to draw a strong conclusion, there is less demand for electricity in both states on Monday and Sunday than on other days. We can use a line plot to show the demand in both states over the course of the week.
plt.figure(figsize=(13, 8))
sns.lineplot(data=train_data, x="day", y="nswdemand", color="green")
plt.figure(figsize=(13, 8))
sns.lineplot(data=train_data, x="day", y="vicdemand", color="red")
Another interesting thing to look for in the dataset is whether there are seasonalities or trends in the demand/price in either Victoria or New South Wales over time. In time series analysis, seasonality is a pattern that repeats at regular intervals over time.
If you look at the demand for electricity in both states over the date range (7 May 1996 to 5 December 1998), you can see some signs of seasonality. It is not conclusive, and if the dataset had been collected over a longer period, it would probably be easier to confirm whether seasonality is present.
plt.figure(figsize=(20, 10))
sns.lineplot(data=train_data, x="date", y="nswdemand")
plt.figure(figsize=(20, 10))
sns.lineplot(data=train_data, x="date", y="vicdemand")
One last thing about data analysis: let’s plot histograms of all the numerical features.
train_data.hist(bins=50, figsize=(15, 10))
plt.show()
42.77.2.3. Task 3: Data preprocessing#
It is here that we prepare the data to be in the proper format for the machine learning model.
Let’s encode the categorical feature class. But before that, let’s separate the training input data from the labels.
X_train = train_data.drop("class", axis=1)
y_train = train_data["class"]
from sklearn.preprocessing import LabelEncoder
label_enc = LabelEncoder()
y_train_prepared = label_enc.fit_transform(y_train)
X_train.head()
y_train_prepared
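The encoding step can be illustrated on toy labels. This sketch assumes the class values are the strings "UP" and "DOWN", as described in the dataset summary; `LabelEncoder` sorts the classes, so "DOWN" becomes 0 and "UP" becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, assumed to match the UP/DOWN values of the class column.
train_labels = ["UP", "DOWN", "DOWN", "UP"]
test_labels = ["DOWN", "UP"]

enc = LabelEncoder()
# fit_transform learns the mapping from the training labels and applies it.
train_encoded = enc.fit_transform(train_labels)
# transform reuses the learned mapping without changing it.
test_encoded = enc.transform(test_labels)
# inverse_transform maps the integers back to the original strings.
decoded = enc.inverse_transform(test_encoded)

print(list(enc.classes_), train_encoded, test_encoded, decoded)
```

Keeping the fitted encoder around matters later: the same mapping has to be applied to the test labels so that 0 and 1 mean the same thing in both splits.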
Now we are ready to train the machine learning model.
But again, if you look at the data, the day feature is not normalized like the other features. We can normalize it or leave it, but for now let’s go ahead and train the random forests classifier.
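If you did want to normalize day, a minimal sketch could use `MinMaxScaler`, which maps the smallest value to 0 and the largest to 1. The `days` frame below is a stand-in for the real column. Note that decision trees split on thresholds, so rescaling a feature like this does not change which splits are possible, which is why leaving day as it is works fine for a random forest.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the day column: one row per day of the week.
days = pd.DataFrame({"day": [1, 2, 3, 4, 5, 6, 7]})

scaler = MinMaxScaler()
# Computes (day - min) / (max - min), here (day - 1) / 6.
day_scaled = scaler.fit_transform(days).ravel()

print(day_scaled)
```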
42.77.2.4. Task 4: Training random forests classifier#
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(
min_samples_split=2,
bootstrap=False,
max_depth=None,
random_state=42,
n_jobs=-1,
max_features="sqrt",
)
forest_clf.fit(X_train, y_train_prepared)
42.77.2.5. Task 5: Evaluating random forests classifier#
Let’s build 3 functions to display accuracy, confusion matrix, and classification report.
Accuracy provides a percentage score of the model’s ability to make correct predictions.
Confusion matrix shows the predicted classes and the actual classes: True Negatives (TN), False Positives (FP), False Negatives (FN), and True Positives (TP).
Classification report contains all useful metrics such as precision, recall, and F1 score.
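Before applying these metrics to the forest, here is a tiny hand-checkable sketch with made-up labels and predictions (eight examples, not taken from the dataset). For binary labels, Sklearn lays the confusion matrix out as [[TN, FP], [FN, TP]], so `ravel()` returns the four counts in that order.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions: 0 is the negative class, 1 the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy is the share of correct predictions: (TN + TP) / total.
acc = accuracy_score(y_true, y_pred)

print(tn, fp, fn, tp, acc)
```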
from sklearn.metrics import accuracy_score
def accuracy(input_data, model, labels):
    """
    Take the input data, model and labels and return accuracy
    """
    preds = model.predict(input_data)
    acc = accuracy_score(labels, preds)
    return acc
from sklearn.metrics import confusion_matrix
def conf_matrix(input_data, model, labels):
    """
    Take the input data, model and labels and return confusion matrix
    """
    preds = model.predict(input_data)
    cm = confusion_matrix(labels, preds)
    return cm
from sklearn.metrics import classification_report
def class_report(input_data, model, labels):
    """
    Take the input data, model and labels and print the classification report
    """
    preds = model.predict(input_data)
    report = classification_report(labels, preds)
    # Print the report so that it renders as an aligned table
    print(report)
Let’s find the accuracy on the training set.
accuracy(X_train, forest_clf, y_train_prepared)
Ohh, the model overfitted the training set. Let’s also display the confusion matrix and classification report.
conf_matrix(X_train, forest_clf, y_train_prepared)
class_report(X_train, forest_clf, y_train_prepared)
The model clearly overfitted the data. Let’s see how we can regularize it.
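Training accuracy alone cannot confirm overfitting; the gap between training and held-out accuracy does. The sketch below illustrates this on synthetic, deliberately noisy data (an assumption made for the example, not the electricity dataset). It compares a forest of fully grown trees with one whose trees are limited to 8 leaves.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with label noise (flip_y), for illustration only.
X, y = make_classification(n_samples=1000, n_features=8, flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Fully grown trees without bootstrapping can memorize the training set.
full = RandomForestClassifier(bootstrap=False, max_depth=None, random_state=42)
# Limiting the number of leaves is one way to regularize each tree.
small = RandomForestClassifier(bootstrap=False, max_leaf_nodes=8, random_state=42)

scores = {}
for name, model in [("full", full), ("small", small)]:
    model.fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for each model.
    scores[name] = (model.score(X_tr, y_tr), model.score(X_val, y_val))

print(scores)
```

A large drop from the first to the second number in a pair is the signature of overfitting; a regularized model typically shows a much smaller gap.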
42.77.2.6. Task 6: Improving random forests#
# Random forest model parameters
forest_clf.get_params()
We will use GridSearch to find the best hyperparameters and retrain the model with them. When refit is set to True, the random forest is automatically retrained on the whole training set with the best hyperparameters. By default, refit is True.
We will also set class_weight to balanced since the data is not balanced. By doing that, the model sets the class weights automatically, inversely proportional to the number of examples available in each class.
But this step takes a lot of time.
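Before running the search, the balanced weighting mentioned above can be checked by hand. With class_weight set to balanced, Sklearn uses n_samples / (n_classes * count of each class). The labels below are a toy example with 6 negatives and 4 positives, not the real class counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: 6 examples of class 0 and 4 of class 1.
y = np.array([0] * 6 + [1] * 4)

# What Sklearn computes for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same formula written out: n_samples / (n_classes * count per class).
manual = len(y) / (2 * np.bincount(y))

print(weights, manual)
```

The rarer class gets the larger weight, so mistakes on it cost more during training.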
from sklearn.model_selection import GridSearchCV
params_grid = {
"n_estimators": [100, 200, 300, 500],
"max_leaf_nodes": list(range(2, 12)),
"min_samples_leaf": [1, 2, 3, 4, 5],
}
# refit is True by default. The best estimator is retrained on the whole training set
grid_search = GridSearchCV(
RandomForestClassifier(
bootstrap=False, class_weight="balanced", n_jobs=-1, max_features="sqrt"
),
params_grid,
verbose=1,
cv=3,
)
grid_search.fit(X_train, y_train_prepared)
grid_search.best_params_
grid_search.best_estimator_
forest_best = grid_search.best_estimator_
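To see why the search above is slow, it helps to count the model fits. This sketch repeats the same grid and uses `ParameterGrid` to enumerate every combination; each one is trained once per cross-validation fold, plus one final refit of the best combination.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
params_grid = {
    "n_estimators": [100, 200, 300, 500],
    "max_leaf_nodes": list(range(2, 12)),
    "min_samples_leaf": [1, 2, 3, 4, 5],
}

# 4 * 10 * 5 combinations of hyperparameters.
n_candidates = len(ParameterGrid(params_grid))
cv_folds = 3
# One fit per combination and fold during cross-validation.
n_cv_fits = n_candidates * cv_folds
# Plus the final refit on the whole training set (refit=True).
n_total_fits = n_cv_fits + 1

print(n_candidates, n_cv_fits, n_total_fits)
```

Every extra value added to any list multiplies the total, so grids grow expensive quickly.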
Let’s find the accuracy again.
accuracy(X_train, forest_best, y_train_prepared)
conf_matrix(X_train, forest_best, y_train_prepared)
In a confusion matrix, each row represents an actual class and each column represents a predicted class.
So, from the results above:
16928 negative examples (N) were correctly predicted as negative (true negatives).
2613 negative examples (N) were incorrectly classified as positive (false positives).
4223 positive examples (P) were incorrectly classified as negative (false negatives).
10220 positive examples (P) were correctly classified as positive (true positives).
class_report(X_train, forest_best, y_train_prepared)
Not so impressive, but this is much better than the first model. By only setting the class weight to balanced and finding the best values of the hyperparameters, we were able to improve our model.
If you remember, the classes are imbalanced. You can see the number of examples in each class in the support column of the classification report. Even so, about 80% of the examples that the model predicts as negative are truly negative, and about 80% of those it predicts as positive are truly positive, without overfitting. That is precision.
A few notes about precision, recall, and F1 score:
Precision is the fraction of the examples predicted as positive that are actually positive.
Recall is the fraction of the actual positive examples that the model correctly identifies.
F1 score is the harmonic mean of precision and recall.
The higher the precision and recall are, the higher the F1 score. But there is often a tradeoff between them: increasing precision tends to reduce recall, and vice versa. So it’s fair to say that it depends on the problem you’re trying to solve and the metrics you want to optimize for.
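These definitions can be worked through with the four counts read from the training confusion matrix above. The sketch assumes the matrix layout [[TN, FP], [FN, TP]] described earlier and treats the class encoded as 1 as the positive class.

```python
# Counts taken from the training confusion matrix discussed above.
tn, fp, fn, tp = 16928, 2613, 4223, 10220

# Of everything predicted positive, how much is truly positive?
precision = tp / (tp + fp)
# Of everything truly positive, how much did the model find?
recall = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Share of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot get a high F1 score by doing well on only one of them.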
One way to improve the model is to search over more hyperparameters. Adding more good data is also often the best cure.
42.77.2.7. Task 7: Evaluating the model on the test set#
Let us evaluate the model on the test set. But first we need to run the label encoder on the class feature, as we did for the training labels. Note that we only transform (not fit_transform).
X_test = test_data.drop("class", axis=1)
y_test = test_data["class"]
y_test_prepared = label_enc.transform(y_test)
accuracy(X_test, forest_best, y_test_prepared)
conf_matrix(X_test, forest_best, y_test_prepared)
class_report(X_test, forest_best, y_test_prepared)
As you can see, the model is no longer overfitting. On the training set, the accuracy was 79%, which is very similar to the accuracy on the test set, even though the model never saw the test data. To improve the model in a case like this, it is often best to add more data if possible.
42.77.2.8. Task 8: Feature importance#
Unlike many other machine learning models, random forests can show how much each feature contributed to the model’s predictions. Let’s find it.
The results are values between 0 and 1 that sum to 1. The closer to 1, the more important the feature was to the model.
feat_import = forest_best.feature_importances_
feat_df = pd.DataFrame(
feat_import, columns=["Feature Importance"], index=X_train.columns
)
feat_df
From the dataframe above, the price of electricity in New South Wales had the highest importance for predicting the electricity price fluctuation (Up/Down). Other features that highly influenced the model are the demands in both New South Wales and Victoria.
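Importances like those above are easier to read once sorted. The sketch below uses synthetic data with made-up column names `f0` to `f4` (an assumption for illustration, not the electricity features), where only the first two columns carry signal. The values in `feature_importances_` are impurity-based, so they describe how much each feature helped reduce impurity in the trees during training.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False, the 2 informative features come first.
X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
cols = [f"f{i}" for i in range(5)]  # hypothetical feature names

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank the features from most to least important.
importances = pd.Series(model.feature_importances_, index=cols).sort_values(
    ascending=False
)
print(importances)
```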
This is the end of the assignment. We have learned the fundamental idea behind the random forests, how to overcome overfitting and how to find the feature importance.
42.77.3. Acknowledgments#
Thanks to Nyandwi for creating the open-source course Machine Learning complete. It inspires the majority of the content in this chapter.