LICENSE

Copyright 2018 Google LLC.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

42.89. Introduction

Climate Prediction-Random Forest is a worked example that trains a random forest regressor to predict temperature. The dataset holds 348 daily records with calendar fields, earlier temperature readings (temp_1, temp_2), a historical average, and the observed temperature (the actual column), which is the prediction target. A random forest averages the predictions of many decision trees, each fit to a random resample of the training data, which lets it capture non-linear relationships between the features and the target.
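
A random forest regressor predicts by averaging the outputs of its individual decision trees. The sketch below checks this on synthetic data; the data, the variable names, and the small forest size are assumptions made for illustration and are not part of the temperature dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=200)

# A small forest keeps the example fast
forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Predictions of each individual tree, shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

# The forest's prediction should match the mean over its trees
forest_pred = forest.predict(X[:5])
print('Forest equals mean of trees:', np.allclose(per_tree.mean(axis=0), forest_pred))
```

Averaging many trees that were each fit to a different resample tends to reduce the variance of any single tree, which is the main reason forests usually generalize better than one deep tree.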

42.89.1. Importing Libraries

# Pandas is used for data manipulation
import pandas as pd

# Use numpy to convert to arrays
import numpy as np

# Import tools needed for visualization
import matplotlib.pyplot as plt
%matplotlib inline

42.89.2. Data Exploration

# Read the data into a DataFrame
df = pd.read_csv('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/classification/temps.csv')
# displaying first 5 rows
df.head(5)
   year  month  day  week  temp_2  temp_1  average  actual  friend
0  2019      1    1   Fri      45      45     45.6      45      29
1  2019      1    2   Sat      44      45     45.7      44      61
2  2019      1    3   Sun      45      44     45.8      41      56
3  2019      1    4   Mon      44      41     45.9      40      53
4  2019      1    5  Tues      41      40     46.0      44      41
# the shape of our features
df.shape
(348, 9)
# column names
df.columns
Index(['year', 'month', 'day', 'week', 'temp_2', 'temp_1', 'average', 'actual',
       'friend'],
      dtype='object')
# checking for null values
df.isnull().sum()
year       0
month      0
day        0
week       0
temp_2     0
temp_1     0
average    0
actual     0
friend     0
dtype: int64

No column contains null values, so no imputation or row removal is needed.

42.89.3. One-Hot Encoding

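One-hot encoding replaces a categorical column with one binary indicator column per category. Here the week column (day of the week) becomes seven columns, week_Fri through week_Wed, so the model receives it as numeric input. Numeric columns are left unchanged.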

# One-hot encode categorical features
df = pd.get_dummies(df)
df.head(5)
   year  month  day  temp_2  temp_1  average  actual  friend  week_Fri  week_Mon  week_Sat  week_Sun  week_Thurs  week_Tues  week_Wed
0  2019      1    1      45      45     45.6      45      29      True     False     False     False       False      False     False
1  2019      1    2      44      45     45.7      44      61     False     False      True     False       False      False     False
2  2019      1    3      45      44     45.8      41      56     False     False     False      True       False      False     False
3  2019      1    4      44      41     45.9      40      53     False      True     False     False       False      False     False
4  2019      1    5      41      40     46.0      44      41     False     False     False     False       False       True     False
print('Shape of features after one-hot encoding:', df.shape)
Shape of features after one-hot encoding: (348, 15)
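
To see what pd.get_dummies does on a smaller scale, the sketch below encodes a three-row toy frame. The values and the name toy are hypothetical; only the column names temp_1 and week are borrowed from the dataset above.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical column
toy = pd.DataFrame({'temp_1': [45, 44, 41], 'week': ['Fri', 'Sat', 'Fri']})

encoded = pd.get_dummies(toy)

# The numeric column passes through unchanged; 'week' becomes one
# indicator column per distinct category found in the data
print(list(encoded.columns))
print(encoded)
```

Only categories that actually occur get a column, which is why the full dataset, with all seven weekdays present, grows from 9 to 15 columns.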

42.89.4. Features and Labels

# Labels are the values we want to predict
labels = df['actual']

# Remove the labels from the features
df = df.drop('actual', axis = 1)

# Saving feature names for later use
feature_list = list(df.columns)

42.89.5. Train Test Split

# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(
    df, labels, test_size=0.20, random_state=42
)
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)
Training Features Shape: (278, 14)
Training Labels Shape: (278,)
Testing Features Shape: (70, 14)
Testing Labels Shape: (70,)
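
The split sizes follow from the row count: with 348 rows and test_size = 0.20, scikit-learn rounds the test set up to 70 rows and leaves 278 for training. Fixing random_state makes the shuffle repeatable. The sketch below illustrates both points on placeholder arrays; the arrays and variable names are assumptions and stand in for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (348)
X = np.arange(348 * 2).reshape(348, 2)
y = np.arange(348)

# Two splits with the same random_state
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.20, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.20, random_state=42)

print('Train/test shapes:', a_train.shape, a_test.shape)
print('Same rows in both test sets:', np.array_equal(a_test, b_test))
```

Reusing the same random_state across runs keeps the evaluation comparable when the model or its settings change.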

42.89.6. Training the Forest

# Import the model we are using
from sklearn.ensemble import RandomForestRegressor

# Instantiate model
rf = RandomForestRegressor(n_estimators=1000, random_state=42)

# Train the model on training data
rf.fit(train_features, train_labels);
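
The feature names saved in feature_list can be paired with a fitted forest's feature_importances_ attribute to see which inputs drive the predictions. The sketch below is a minimal illustration on synthetic data: the frame demo_features, the noise column, and the model rf_demo are assumptions and do not reproduce the results of the model trained above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the label depends on 'temp_1' and not on 'noise'
rng = np.random.default_rng(42)
n = 300
demo_features = pd.DataFrame({
    'temp_1': rng.normal(60, 10, n),
    'noise': rng.normal(0, 1, n),
})
demo_labels = demo_features['temp_1'] + rng.normal(0, 2, n)
demo_feature_list = list(demo_features.columns)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=42)
rf_demo.fit(demo_features, demo_labels)

# Pair each feature name with its importance and sort, largest first
importances = sorted(
    zip(demo_feature_list, rf_demo.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in importances:
    print(f'{name:10s} importance: {score:.2f}')
```

The importances sum to 1, so each value can be read as a feature's share of the total impurity reduction across the forest.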

42.89.7. Make Predictions on Test Data

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
Mean Absolute Error: 3.78 degrees.
# Calculate the absolute percentage error of each prediction
mape = 100 * (errors / test_labels)

# Accuracy is reported as 100 minus the mean absolute percentage error (MAPE)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
Accuracy: 94.02 %.
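
The two metrics can be checked by hand on a few numbers. The sketch below uses four toy values chosen for easy arithmetic (an assumption, not data from the test set) and follows the same steps as the cells above.

```python
import numpy as np

# Toy values chosen for easy arithmetic (not taken from the temperature data)
actual = np.array([50.0, 40.0, 80.0, 100.0])
predicted = np.array([45.0, 44.0, 80.0, 90.0])

# Absolute errors: [5, 4, 0, 10]
errors = np.abs(predicted - actual)
mae = errors.mean()

# Percentage errors relative to the actual values: [10, 10, 0, 10]
pct_errors = 100 * errors / actual
mape = pct_errors.mean()

# "Accuracy" as used in this chapter: 100 minus the MAPE
accuracy = 100 - mape

print('MAE:', round(mae, 2))
print('MAPE:', round(mape, 2), '%')
print('Accuracy:', round(accuracy, 2), '%')
```

Here the MAE is 4.75, the MAPE is 7.5 % and the accuracy is 92.5 %. Because the percentage error divides by the actual value, this measure is undefined when an actual value is zero, and it inflates errors on values close to zero.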

42.89.8. Visualizing a Single Decision Tree

[Figure: Decision Tree]
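
One way to inspect a single tree is to pull it out of the fitted forest through the estimators_ attribute and print its split rules with sklearn.tree.export_text. The sketch below is an illustration under assumptions: it trains a small, shallow forest (rf_small) on synthetic data so the printed tree stays readable, and it does not reproduce the figure from the original page.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic data with two of the dataset's column names (values are made up)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    'temp_1': rng.normal(60, 10, n),
    'average': rng.normal(60, 5, n),
})
target = 0.7 * demo['temp_1'] + 0.3 * demo['average'] + rng.normal(0, 2, n)

# A shallow forest so that each tree is small enough to read
rf_small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
rf_small.fit(demo, target)

# Pick one tree from the forest and print its split rules as text
tree = rf_small.estimators_[5]
tree_text = export_text(tree, feature_names=list(demo.columns))
print(tree_text)
```

Each indented line is a split on one feature, and each leaf shows the value the tree predicts for samples that reach it. For a graphical version, sklearn.tree.plot_tree draws the same tree with matplotlib.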

42.89.9. Your turn! 🚀

You can practice your random-forest skills by following the assignment Climate Prediction-Random Forest.

42.89.10. Acknowledgments

Thanks to Kaggle for creating the open-source course Climate Prediction-Random Forest. Some of the content in this chapter is adapted from it.