42.83. Random Forest Classifier with Feature Importance#

42.83.1. The problem statement#

The prediction task is to determine whether a person makes over 50K a year. To answer this question, we implement a Random Forest classifier with Python and Scikit-Learn, train it to predict whether a person's income exceeds 50K, and examine the feature importances it produces.

42.83.2. Import libraries#

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set(style="whitegrid")
import warnings

warnings.filterwarnings('ignore')

42.83.3. Import dataset#

data = 'https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/random-forest-income_evaluation.csv'

df = pd.read_csv(data)

42.83.4. Exploratory data analysis#

42.83.4.1. View dimensions of dataset#

# print the shape
print('The shape of the dataset : ', df.shape)

We can see that there are 32561 instances and 15 attributes in the data set.

42.83.4.2. Preview the dataset #

df.head()

42.83.4.3. Rename column names#

col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = col_names

df.columns

42.83.4.4. View summary of dataset#

df.info()

42.83.4.5. Check the data types of columns#

The above df.info() command gives us the number of non-null values along with the data types of the columns.

If we simply want to check the data types of the columns without the rest of the summary, we can use the following command.

df.dtypes
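If we only care about one particular column, we can ask it directly; a small example using the age column from the renamed dataset:

# data type of a single column
df['age'].dtype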

42.83.4.6. View statistical properties of dataset#

df.describe()
df.describe().T

We can see that the above df.describe().T command presents the statistical properties in transposed (horizontal) form, with one row per column.

df.describe(include='all')
# check for missing values

df.isnull().sum()

42.83.4.7. Check with ASSERT statement#

assert pd.notnull(df).all().all()
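The assertion above passes, but only because the missing entries in this dataset are stored as the placeholder string ' ?' rather than as NaN (we handle these placeholders later). A minimal sketch to count the placeholders, assuming the values carry a leading space as in the raw file; obj_cols and placeholder_counts are local helper names introduced here:

# count placeholder ' ?' values in the object (string) columns
obj_cols = df.select_dtypes(include='object').columns
placeholder_counts = (df[obj_cols] == ' ?').sum()
print(placeholder_counts[placeholder_counts > 0])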

42.83.4.8. Functional approach to EDA#

def initial_eda(df):
    if isinstance(df, pd.DataFrame):
        total_na = df.isna().sum().sum()
        print("Dimensions : %d rows, %d columns" % (df.shape[0], df.shape[1]))
        print("Total NA Values : %d " % (total_na))
        print("%38s %10s     %10s %10s" % ("Column Name", "Data Type", "#Distinct", "NA Values"))
        # iterate over the columns directly instead of positional indexing into Series
        for col in df.columns:
            print("%38s %10s   %10s %10s" % (col, df[col].dtype, df[col].nunique(), df[col].isna().sum()))
    else:
        print("Expected a DataFrame but got a %15s" % (type(df)))


initial_eda(df)

42.83.5. Explore Categorical Variables#

42.83.5.1. Find categorical variables#

categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :\n\n', categorical)

42.83.5.2. Preview categorical variables#

df[categorical].head()

42.83.5.3. Summary of categorical variables#

42.83.5.4. Frequency distribution of categorical variables#

for var in categorical: 
    
    print(df[var].value_counts())

42.83.5.5. Percentage of frequency distribution of values#

for var in categorical:
    print(df[var].value_counts()/float(len(df)))

42.83.5.6. Explore the variables#

# check for missing values

df['income'].isnull().sum()
# view number of unique values

df['income'].nunique()
# view the unique values

df['income'].unique()
# view the frequency distribution of values

df['income'].value_counts()
# view percentage of frequency distribution of values

df['income'].value_counts()/len(df)
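The same percentages can also be obtained directly from value_counts by passing normalize=True:

# equivalent percentages via the normalize flag
df['income'].value_counts(normalize=True)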
# visualize frequency distribution of income variable

f,ax=plt.subplots(1,2,figsize=(18,8))

ax[0] = df['income'].value_counts().plot.pie(explode=[0,0],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Income Share')


#f, ax = plt.subplots(figsize=(6, 8))
ax[1] = sns.countplot(x="income", data=df, palette="Set1")
ax[1].set_title("Frequency distribution of income variable")

plt.show()
f, ax = plt.subplots(figsize=(8, 6))
ax = sns.countplot(y="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable")
plt.show()
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt sex")
plt.show()
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="race", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt race")
plt.show()
# check number of unique labels 

df.workclass.nunique()
# view the unique labels

df.workclass.unique()
# view frequency distribution of values

df.workclass.value_counts()
# replace '?' values in workclass variable with `NaN`

df['workclass'] = df['workclass'].replace(' ?', np.nan)
# again check the frequency distribution of values in workclass variable

df.workclass.value_counts()
f, ax = plt.subplots(figsize=(10, 6))
ax = df.workclass.value_counts().plot(kind="bar", color="green")
ax.set_title("Frequency distribution of workclass variable")
ax.set_xticklabels(df.workclass.value_counts().index, rotation=30)
plt.show()
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt income")
ax.legend(loc='upper right')
plt.show()
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt sex")
ax.legend(loc='upper right')
plt.show()
# check number of unique labels

df.occupation.nunique()
# view unique labels

df.occupation.unique()
# view frequency distribution of values

df.occupation.value_counts()
# replace '?' values in occupation variable with `NaN`

df['occupation'] = df['occupation'].replace(' ?', np.nan)
# again check the frequency distribution of values

df.occupation.value_counts()
# visualize frequency distribution of `occupation` variable

f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="occupation", data=df, palette="Set1")
ax.set_title("Frequency distribution of occupation variable")
ax.set_xticklabels(ax.get_xticklabels(), rotation=30)  # rotate the existing labels instead of re-ordering them
plt.show()
# check number of unique labels

df.native_country.nunique()
# view unique labels 

df.native_country.unique()
# check frequency distribution of values

df.native_country.value_counts()
# replace '?' values in native_country variable with `NaN`

df['native_country'] = df['native_country'].replace(' ?', np.nan)
# visualize frequency distribution of `native_country` variable

f, ax = plt.subplots(figsize=(16, 12))
ax = sns.countplot(x="native_country", data=df, palette="Set1")
ax.set_title("Frequency distribution of native_country variable")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)  # rotate the existing labels instead of re-ordering them
plt.show()
df[categorical].isnull().sum()

42.83.5.7. Number of labels: Cardinality#

# check for cardinality in categorical variables

for var in categorical:
    
    print(var, ' contains ', len(df[var].unique()), ' labels')

We can see that the native_country column contains a relatively large number of labels compared to the other columns. I will check the cardinality again after the train-test split.

42.83.6. Declare feature vector and target variable#

X = df.drop(['income'], axis=1)

y = df['income']

42.83.7. Split data into separate training and test set#

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# check the shape of X_train and X_test

X_train.shape, X_test.shape
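As noted above, we re-check cardinality after the train-test split. A small sketch that also flags labels appearing in the test set but not in the training set; cat_cols and unseen are local helper names introduced here:

# re-check cardinality after the split and look for test-set labels unseen during training
cat_cols = [col for col in X_train.columns if X_train[col].dtype == 'O']

for var in cat_cols:
    unseen = set(X_test[var].dropna().unique()) - set(X_train[var].dropna().unique())
    print('%15s : %2d labels in train, %2d labels in test, unseen in train:'
          % (var, X_train[var].nunique(), X_test[var].nunique()), unseen if unseen else 'none')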

42.83.8. Feature Engineering#

42.83.8.1. Display categorical variables in training set#

categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

categorical

42.83.8.2. Display numerical variables in training set#

numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

numerical

42.83.8.3. Engineering missing values in categorical variables#

# print percentage of missing values in the categorical variables in training set

X_train[categorical].isnull().mean()
# print categorical variables with missing data

for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))
# impute missing categorical variables with most frequent value

for df2 in [X_train, X_test]:
    # fill with the most frequent value computed on the training set only
    df2['workclass'] = df2['workclass'].fillna(X_train['workclass'].mode()[0])
    df2['occupation'] = df2['occupation'].fillna(X_train['occupation'].mode()[0])
    df2['native_country'] = df2['native_country'].fillna(X_train['native_country'].mode()[0])
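If you prefer to keep the imputation inside scikit-learn, SimpleImputer with strategy='most_frequent' achieves the same effect. A hedged sketch, shown for reference only since the mode imputation has already been applied above:

# equivalent imputation with scikit-learn's SimpleImputer (for reference only)
from sklearn.impute import SimpleImputer

cols_with_na = ['workclass', 'occupation', 'native_country']
imputer = SimpleImputer(strategy='most_frequent')

# fit on the training set only, then transform both sets
# (commented out because the columns above have already been imputed)
# X_train[cols_with_na] = imputer.fit_transform(X_train[cols_with_na])
# X_test[cols_with_na] = imputer.transform(X_test[cols_with_na])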
# check missing values in categorical variables in X_train

X_train[categorical].isnull().sum()
# check missing values in categorical variables in X_test

X_test[categorical].isnull().sum()

As a final check, I will look for any remaining missing values in X_train and X_test.

# check missing values in X_train

X_train.isnull().sum()
# check missing values in X_test

X_test.isnull().sum()

We can see that there are no missing values in X_train and X_test.
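We can make this explicit with the same ASSERT style used earlier in the EDA:

# fail loudly if any missing values survived the imputation
assert X_train.isnull().sum().sum() == 0
assert X_test.isnull().sum().sum() == 0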

42.83.8.4. Encode categorical variables#

# preview categorical variables in X_train

X_train[categorical].head()
# import category encoders

import category_encoders as ce
# encode categorical variables with one-hot encoding

encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation', 'relationship', 
                                 'race', 'sex', 'native_country'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)
X_train.head()
X_train.shape
X_test.head()
X_test.shape
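For reference, an equivalent encoding could be set up with scikit-learn's own OneHotEncoder wrapped in a ColumnTransformer. This is only a sketch of the alternative; it would be applied to the raw (un-encoded) X_train and X_test, not to the frames we just transformed:

# alternative encoding sketch using scikit-learn only (not used in the rest of this chapter)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

cat_cols = ['workclass', 'education', 'marital_status', 'occupation', 'relationship',
            'race', 'sex', 'native_country']
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
    remainder='passthrough'  # keep the numerical columns unchanged
)
# would be fitted on the raw training data:
# X_train_enc = ct.fit_transform(X_train)
# X_test_enc = ct.transform(X_test)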
We now have the training and test sets ready for model building. Before that, we should map all the feature variables onto the same scale. This is called feature scaling, and we will do it as follows.

42.83.9. Feature Scaling#

cols = X_train.columns
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)

We now have the X_train dataset ready to be fed into the Random Forest classifier, which we will build as follows.

42.83.10. Random Forest Classifier model with default parameters#

# import Random Forest classifier

from sklearn.ensemble import RandomForestClassifier



# instantiate the classifier 

rfc = RandomForestClassifier(random_state=0)



# fit the model

rfc.fit(X_train, y_train)



# Predict the Test set results

y_pred = rfc.predict(X_test)



# Check accuracy score 

from sklearn.metrics import accuracy_score

print('Model accuracy score with default parameters : {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

Note that in scikit-learn 0.22 and later the default number of trees is 100 (older versions used 10), so with a recent version this score will match the explicit 100-tree model in the next section.

42.83.11. Random Forest Classifier model with 100 Decision Trees#

# instantiate the classifier with n_estimators = 100

rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)



# fit the model to the training set

rfc_100.fit(X_train, y_train)



# Predict on the test set results

y_pred_100 = rfc_100.predict(X_test)



# Check accuracy score 

print('Model accuracy score with 100 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred_100)))
# create the classifier with n_estimators = 100

clf = RandomForestClassifier(n_estimators=100, random_state=0)



# fit the model to the training set

clf.fit(X_train, y_train)
# view the feature scores

feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

feature_scores
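The importance scores are easier to compare visually. A minimal sketch with seaborn; the figure size and the choice to show only the top 20 features are arbitrary:

# visualize the most important features (top 20 is an arbitrary cut-off)
f, ax = plt.subplots(figsize=(10, 8))
sns.barplot(x=feature_scores[:20], y=feature_scores[:20].index)
ax.set_title("Feature importances of the Random Forest model (top 20)")
ax.set_xlabel("Feature importance score")
ax.set_ylabel("Features")
plt.show()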

42.83.12. Build the Random Forest model on selected features#

# drop the least important feature from X_train and X_test

X_train = X_train.drop(['native_country_41'], axis=1)

X_test = X_test.drop(['native_country_41'], axis=1)
# instantiate the classifier with n_estimators = 100

clf = RandomForestClassifier(n_estimators=100, random_state=0)



# fit the model to the training set

clf.fit(X_train, y_train)


# Predict on the test set results

y_pred = clf.predict(X_test)



# Check accuracy score 

print('Model accuracy score with native_country_41 variable removed : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

42.83.13. Confusion matrix#

A confusion matrix is a tool for summarizing the performance of a classification algorithm. It gives us a clear picture of the model's performance and of the types of errors it makes: a summary of correct and incorrect predictions, broken down by category and presented in tabular form.

# Print the Confusion Matrix and slice it into four pieces

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix\n\n', cm)
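The comment above mentions slicing the matrix into four pieces. confusion_matrix orders the classes alphabetically, so the first row and column correspond to the '<=50K' class, which we treat as the negative class here; a small sketch under that assumption:

# slice the confusion matrix into its four pieces
# rows are actual classes, columns are predicted classes
# class order is alphabetical: index 0 = '<=50K' (negative), index 1 = '>50K' (positive)
TN, FP, FN, TP = cm.ravel()

print('True Negatives  (actual <=50K, predicted <=50K) :', TN)
print('False Positives (actual <=50K, predicted >50K)  :', FP)
print('False Negatives (actual >50K, predicted <=50K)  :', FN)
print('True Positives  (actual >50K, predicted >50K)   :', TP)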
# visualize confusion matrix with seaborn heatmap

cm_matrix = pd.DataFrame(data=cm, index=['Actual <=50K', 'Actual >50K'],
                         columns=['Predicted <=50K', 'Predicted >50K'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')

42.83.14. Classification Report#

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
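The per-class numbers in the report can be reproduced from the four pieces we sliced out of the confusion matrix above, treating '>50K' as the positive class; a small sketch:

# recompute precision, recall and f1 for the '>50K' class from TP, FP, FN, TN
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print('Precision : {0:0.4f}'.format(precision))
print('Recall    : {0:0.4f}'.format(recall))
print('F1-score  : {0:0.4f}'.format(f1))
print('Accuracy  : {0:0.4f}'.format(accuracy))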

42.84. Acknowledgments#

Thanks to prashant111 for creating the notebook random-forest-classifier-feature-importance, which inspires the majority of the content in this chapter.