42.83. Random Forest Classifier with Feature Importance#
42.83.1. The problem statement#
The prediction task is to determine whether a person makes over 50K a year. To answer this question, we build a Random Forest classifier with Python and Scikit-Learn.
42.83.2. Import libraries#
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')
42.83.3. Import dataset#
data = 'https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/random-forest-income_evaluation.csv'
df = pd.read_csv(data)
42.83.4. Exploratory data analysis#
42.83.4.1. View dimensions of dataset#
# print the shape
print('The shape of the dataset : ', df.shape)
We can see that there are 32561 instances and 15 attributes in the data set.
42.83.4.2. Preview the dataset#
df.head()
42.83.4.3. Rename column names#
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
             'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
             'hours_per_week', 'native_country', 'income']
df.columns = col_names
df.columns
42.83.4.4. View summary of dataset#
df.info()
42.83.4.5. Check the data types of columns#
The df.info() command above gives us the number of non-null values in each column along with its data type.
If we simply want to check the data types of the columns, we can use the following command.
df.dtypes
42.83.4.6. View statistical properties of dataset#
df.describe()
df.describe().T
We can see that df.describe().T presents the same statistical properties transposed, with one row per column, which is easier to scan. Passing include='all' extends the summary to the categorical columns as well.
df.describe(include='all')
# check for missing values
df.isnull().sum()
42.83.4.7. Check with ASSERT statement#
assert pd.notnull(df).all().all()
The assertion passes silently, confirming that there are no NaN values. Note, however, that this dataset encodes missing entries as the string ' ?' rather than NaN, so they are not caught here; we deal with them below.
42.83.4.8. Functional approach to EDA#
def initial_eda(df):
    if isinstance(df, pd.DataFrame):
        total_na = df.isna().sum().sum()
        print("Dimensions : %d rows, %d columns" % (df.shape[0], df.shape[1]))
        print("Total NA Values : %d " % (total_na))
        print("%38s %10s %10s %10s" % ("Column Name", "Data Type", "#Distinct", "NA Values"))
        col_name = df.columns
        dtyp = df.dtypes
        uniq = df.nunique()
        na_val = df.isna().sum()
        for i in range(len(df.columns)):
            print("%38s %10s %10s %10s" % (col_name[i], dtyp.iloc[i], uniq.iloc[i], na_val.iloc[i]))
    else:
        print("Expected a DataFrame but got a %15s" % (type(df)))
initial_eda(df)
42.83.5. Explore Categorical Variables#
42.83.5.1. Find categorical variables#
categorical = [var for var in df.columns if df[var].dtype=='O']
print('There are {} categorical variables\n'.format(len(categorical)))
print('The categorical variables are :\n\n', categorical)
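An equivalent and slightly more idiomatic way to find these columns is pandas' select_dtypes; a minimal sketch:
# equivalent: let pandas pick out the object-typed columns directly
categorical_alt = df.select_dtypes(include='object').columns.tolist()
print(categorical_alt)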
42.83.5.2. Preview categorical variables#
df[categorical].head()
42.83.5.3. Summary of categorical variables#
There are 9 categorical variables: workclass, education, marital_status, occupation, relationship, race, sex, native_country and the target variable income. Three of them (workclass, occupation and native_country) encode missing values as the string ' ?', which we handle below.
42.83.5.4. Frequency distribution of categorical variables#
for var in categorical:
    print(df[var].value_counts())
42.83.5.5. Percentage of frequency distribution of values#
for var in categorical:
    print(df[var].value_counts()/float(len(df)))
42.83.5.6. Explore the variables#
# check for missing values
df['income'].isnull().sum()
# view number of unique values
df['income'].nunique()
# view the unique values
df['income'].unique()
# view the frequency distribution of values
df['income'].value_counts()
# view percentage of frequency distribution of values
df['income'].value_counts()/len(df)
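The same percentages are available directly through the normalize flag of value_counts:
# equivalent: normalize=True returns relative frequencies directly
df['income'].value_counts(normalize=True)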
# visualize frequency distribution of income variable
f, ax = plt.subplots(1, 2, figsize=(18, 8))
ax[0] = df['income'].value_counts().plot.pie(explode=[0, 0], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Income Share')
ax[1] = sns.countplot(x="income", data=df, palette="Set1")
ax[1].set_title("Frequency distribution of income variable")
plt.show()
f, ax = plt.subplots(figsize=(8, 6))
ax = sns.countplot(y="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable")
plt.show()
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt sex")
plt.show()
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="race", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt race")
plt.show()
# check number of unique labels
df.workclass.nunique()
# view the unique labels
df.workclass.unique()
# view frequency distribution of values
df.workclass.value_counts()
# replace '?' values in workclass variable with `NaN`
df['workclass'].replace(' ?', np.nan, inplace=True)
# again check the frequency distribution of values in workclass variable
df.workclass.value_counts()
f, ax = plt.subplots(figsize=(10, 6))
ax = df.workclass.value_counts().plot(kind="bar", color="green")
ax.set_title("Frequency distribution of workclass variable")
ax.set_xticklabels(df.workclass.value_counts().index, rotation=30)
plt.show()
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt income")
ax.legend(loc='upper right')
plt.show()
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt sex")
ax.legend(loc='upper right')
plt.show()
# check number of unique labels
df.occupation.nunique()
# view unique labels
df.occupation.unique()
# view frequency distribution of values
df.occupation.value_counts()
# replace '?' values in occupation variable with `NaN`
df['occupation'].replace(' ?', np.nan, inplace=True)
# again check the frequency distribution of values
df.occupation.value_counts()
# visualize frequency distribution of `occupation` variable
f, ax = plt.subplots(figsize=(12, 8))
# pass order= so the bars and their labels both follow the value_counts ordering
ax = sns.countplot(x="occupation", data=df, palette="Set1",
                   order=df.occupation.value_counts().index)
ax.set_title("Frequency distribution of occupation variable")
ax.set_xticklabels(ax.get_xticklabels(), rotation=30)
plt.show()
# check number of unique labels
df.native_country.nunique()
# view unique labels
df.native_country.unique()
# check frequency distribution of values
df.native_country.value_counts()
# replace '?' values in native_country variable with `NaN`
df['native_country'].replace(' ?', np.nan, inplace=True)
# visualize frequency distribution of `native_country` variable
f, ax = plt.subplots(figsize=(16, 12))
# pass order= so the bars and their labels both follow the value_counts ordering
ax = sns.countplot(x="native_country", data=df, palette="Set1",
                   order=df.native_country.value_counts().index)
ax.set_title("Frequency distribution of native_country variable")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
df[categorical].isnull().sum()
42.83.5.7. Number of labels: Cardinality#
# check for cardinality in categorical variables
for var in categorical:
    print(var, ' contains ', len(df[var].unique()), ' labels')
We can see that the native_country column contains a relatively large number of labels compared to the other columns. We will check for cardinality again after the train-test split.
42.83.6. Declare feature vector and target variable#
X = df.drop(['income'], axis=1)
y = df['income']
42.83.7. Split data into separate training and test set#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# check the shape of X_train and X_test
X_train.shape, X_test.shape
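The income classes are imbalanced (roughly 76% <=50K versus 24% >50K), so a stratified split that preserves the class proportions in both sets can be a sensible variant; a minimal sketch (the names X_train_s etc. are our hypothetical additions):
# optional: stratify on the target to keep class proportions identical in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)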
42.83.8. Feature Engineering#
42.83.8.1. Display categorical variables in training set#
categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']
categorical
42.83.8.2. Display numerical variables in training set#
numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']
numerical
42.83.8.3. Engineering missing values in categorical variables#
# print percentage of missing values in the categorical variables in training set
X_train[categorical].isnull().mean()
# print categorical variables with missing data
for col in categorical:
    if X_train[col].isnull().mean() > 0:
        print(col, (X_train[col].isnull().mean()))
# impute missing categorical variables with most frequent value
for df2 in [X_train, X_test]:
    df2['workclass'].fillna(X_train['workclass'].mode()[0], inplace=True)
    df2['occupation'].fillna(X_train['occupation'].mode()[0], inplace=True)
    df2['native_country'].fillna(X_train['native_country'].mode()[0], inplace=True)
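The loop above imputes with the training-set mode, which avoids leaking information from the test set. For reference, scikit-learn's SimpleImputer expresses the same idea; a minimal sketch, not executed here since the loop has already filled the values:
# equivalent imputation with scikit-learn: fit on train, transform both
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
cat_cols = ['workclass', 'occupation', 'native_country']
X_train[cat_cols] = imputer.fit_transform(X_train[cat_cols])
X_test[cat_cols] = imputer.transform(X_test[cat_cols])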
# check missing values in categorical variables in X_train
X_train[categorical].isnull().sum()
# check missing values in categorical variables in X_test
X_test[categorical].isnull().sum()
As a final check, we look for missing values in X_train and X_test.
# check missing values in X_train
X_train.isnull().sum()
# check missing values in X_test
X_test.isnull().sum()
We can see that there are no missing values in X_train and X_test.
42.83.8.4. Encode categorical variables#
# preview categorical variables in X_train
X_train[categorical].head()
# import category encoders
import category_encoders as ce
# encode categorical variables with one-hot encoding
encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation',
                                 'relationship', 'race', 'sex', 'native_country'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
X_train.head()
X_train.shape
X_test.head()
X_test.shape
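One advantage of fitting the encoder on the training set and reusing it on the test set is that the two column sets stay aligned even when a category appears in only one of them. The same behaviour is available in scikit-learn's own OneHotEncoder; a minimal sketch, assuming hypothetical pre-encoding copies X_train_raw and X_test_raw:
# equivalent one-hot encoding with scikit-learn;
# handle_unknown='ignore' zero-fills categories unseen during fit
from sklearn.preprocessing import OneHotEncoder
cat_cols = ['workclass', 'education', 'marital_status', 'occupation',
            'relationship', 'race', 'sex', 'native_country']
ohe = OneHotEncoder(handle_unknown='ignore')
encoded_train = ohe.fit_transform(X_train_raw[cat_cols]).toarray()  # X_train_raw: hypothetical copy made before encoding
encoded_test = ohe.transform(X_test_raw[cat_cols]).toarray()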
We now have the training and test sets ready for model building. Before that, we should map all the feature variables onto the same scale; this is called feature scaling, and we do it as follows.
42.83.9. Feature Scaling#
cols = X_train.columns
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# rebuild DataFrames with the original column labels (note: columns=cols, not columns=[cols],
# which would create a MultiIndex and break the column lookups below)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
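RobustScaler centers each feature on its median and divides by the interquartile range (IQR = Q3 - Q1), so outliers such as the extreme capital_gain values influence the scale far less than they would with standardization. A minimal illustration on made-up numbers:
# RobustScaler computes (x - median) / IQR per feature
demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one outlier
print(RobustScaler().fit_transform(demo).ravel())
# median = 3 and IQR = 4 - 2 = 2, so the outlier maps to (100 - 3) / 2 = 48.5
# while the bulk of the data stays within about +/- 1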
The X_train dataset is now ready to be fed into the Random Forest classifier, which we build as follows.
42.83.10. Random Forest Classifier model with default parameters#
# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
# instantiate the classifier
rfc = RandomForestClassifier(random_state=0)
# fit the model
rfc.fit(X_train, y_train)
# Predict the Test set results
y_pred = rfc.predict(X_test)
# Check accuracy score
from sklearn.metrics import accuracy_score
print('Model accuracy score with default parameters : {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
Note that in scikit-learn 0.22 and later the default n_estimators is already 100, so this model coincides with the explicit n_estimators=100 model in the next section; in versions before 0.22 the default was 10 trees.
42.83.11. Random Forest Classifier model with 100 Decision Trees#
# instantiate the classifier with n_estimators = 100
rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)
# fit the model to the training set
rfc_100.fit(X_train, y_train)
# Predict on the test set results
y_pred_100 = rfc_100.predict(X_test)
# Check accuracy score
print('Model accuracy score with 100 decision-trees : {0:0.4f}'.format(accuracy_score(y_test, y_pred_100)))
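A single train-test split can be lucky or unlucky; cross-validation averages over several splits and gives a more stable estimate. A minimal sketch (it refits the model five times, so it takes correspondingly longer):
# 5-fold cross-validated accuracy on the training set
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rfc_100, X_train, y_train, cv=5, scoring='accuracy')
print('CV accuracy : {0:0.4f} +/- {1:0.4f}'.format(scores.mean(), scores.std()))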
# create the classifier with n_estimators = 100
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# fit the model to the training set
clf.fit(X_train, y_train)
# view the feature scores
feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_scores
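The ranking is easier to digest visually; a minimal sketch that plots, say, the ten highest-scoring features with the seaborn API already imported above:
# visualize the ten most important features
top10 = feature_scores.head(10)
f, ax = plt.subplots(figsize=(10, 6))
ax = sns.barplot(x=top10.values, y=top10.index, palette="Set1")
ax.set_title("Top 10 feature importances")
ax.set_xlabel("Feature importance score")
plt.show()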
42.83.12. Build the Random Forest model on selected features#
# drop the least important feature from X_train and X_test
X_train = X_train.drop(['native_country_41'], axis=1)
X_test = X_test.drop(['native_country_41'], axis=1)
# instantiate the classifier with n_estimators = 100
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# fit the model to the training set
clf.fit(X_train, y_train)
# Predict on the test set results
y_pred = clf.predict(X_test)
# Check accuracy score
print('Model accuracy score with native_country_41 variable removed : {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
42.83.13. Confusion matrix#
A confusion matrix is a tool for summarizing the performance of a classification algorithm. It gives a clear picture of the model's performance and of the types of errors it makes, presented as a table of correct and incorrect predictions broken down by category.
# Print the Confusion Matrix and slice it into four pieces
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n\n', cm)
# visualize confusion matrix with seaborn heatmap
# scikit-learn convention: rows of cm are actual classes, columns are predicted classes
cm_matrix = pd.DataFrame(data=cm,
                         index=['Actual <=50K', 'Actual >50K'],
                         columns=['Predicted <=50K', 'Predicted >50K'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()
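The comment above promises to slice the matrix into four pieces. With the two income classes sorted alphabetically, <=50K occupies row and column 0 and >50K row and column 1, so treating >50K as the positive class:
# slice the confusion matrix into its four cells
# (scikit-learn convention: rows are actual classes, columns are predicted)
TN, FP, FN, TP = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
print('True Negatives  :', TN)
print('False Positives :', FP)
print('False Negatives :', FN)
print('True Positives  :', TP)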
42.83.14. Classification Report#
The classification report shows the precision, recall, f1-score and support for each class, giving a per-class view of the numbers summarized in the confusion matrix.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
42.84. Acknowledgments#
Thanks to prashant111 for creating random-forest-classifier-feature-importance, which inspires the majority of the content in this chapter.