# Random Forest Classifier with Feature Importance

## The problem statement
The task is to predict whether a person earns more than 50K a year. We build a Random Forest classifier with Python and Scikit-Learn to make this prediction.


## Import libraries





In [1]:
# numerical and data-handling libraries used throughout the chapter

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set(style="whitegrid")

In [None]:
import warnings

warnings.filterwarnings('ignore')

## Import dataset




In [None]:
data = 'https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/random-forest-income_evaluation.csv'

df = pd.read_csv(data)

## Exploratory data analysis

### View dimensions of dataset

In [None]:
# print the shape
print('The shape of the dataset : ', df.shape)

We can see that there are 32561 instances and 15 attributes in the data set.

### Preview the dataset

In [None]:
df.head()

### Rename column names

In [None]:
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = col_names

df.columns

### View summary of dataset

In [None]:
df.info()

### Check the data types of columns

The above df.info() command gives us the number of non-null values along with the data type of each column.

If we only want the data types of the columns, we can use the following command.

In [None]:
df.dtypes

### View statistical properties of dataset

In [None]:
df.describe()

In [None]:
df.describe().T

The df.describe().T command transposes the summary, so each variable appears as a row and each statistic as a column.

In [None]:
df.describe(include='all')

In [None]:
# check for missing values

df.isnull().sum()

### Check with ASSERT statement

In [None]:
assert pd.notnull(df).all().all()
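
Note that this assertion only detects true nulls (`NaN`/`None`). A placeholder string such as ` ?` is an ordinary value as far as pandas is concerned, so the check can pass while the data still contains unknown entries. A minimal sketch of this behaviour, assuming a made-up `toy` frame rather than the income data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: ' ?' is a placeholder string, not a real null
toy = pd.DataFrame({"workclass": [" Private", " ?", " State-gov"],
                    "age": [39, 50, 38]})

# The null check passes because ' ?' is a regular string
no_nulls_before = bool(pd.notnull(toy).all().all())

# After mapping the placeholder to NaN, the same check fails
toy_clean = toy.replace(" ?", np.nan)
no_nulls_after = bool(pd.notnull(toy_clean).all().all())

print(no_nulls_before, no_nulls_after)
```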

### Functional approach to EDA

In [None]:
def initial_eda(df):
    """Print dimensions, total NA count and a per-column summary of a DataFrame."""
    if isinstance(df, pd.DataFrame):
        total_na = df.isna().sum().sum()
        print("Dimensions : %d rows, %d columns" % (df.shape[0], df.shape[1]))
        print("Total NA Values : %d " % (total_na))
        print("%38s %10s %10s %10s" % ("Column Name", "Data Type", "#Distinct", "NA Values"))
        col_name = df.columns
        dtyp = df.dtypes
        uniq = df.nunique()
        na_val = df.isna().sum()
        for i in range(len(df.columns)):
            # these Series are indexed by column name, so use .iloc for positional access
            print("%38s %10s %10s %10s" % (col_name[i], dtyp.iloc[i], uniq.iloc[i], na_val.iloc[i]))
    else:
        print("Expect a DataFrame but got a %15s" % (type(df)))


In [None]:
initial_eda(df)

## Explore Categorical Variables

### Find categorical variables 

In [None]:
categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :\n\n', categorical)

### Preview categorical variables 

In [None]:
df[categorical].head()

### Summary of categorical variables 

### Frequency distribution of categorical variables

In [None]:
for var in categorical: 
    
    print(df[var].value_counts())

### Percentage of frequency distribution of values

In [None]:
for var in categorical:
    print(df[var].value_counts()/float(len(df)))
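
The same proportions can be obtained with the `normalize=True` argument of `value_counts`. A small sketch on a hypothetical four-row `income` Series (not the real data), showing that both forms agree:

```python
import pandas as pd

# Hypothetical sample: three low-income rows and one high-income row
income = pd.Series([" <=50K", " <=50K", " <=50K", " >50K"], name="income")

# Manual division, as in the cell above
manual = income.value_counts() / float(len(income))

# Built-in equivalent
builtin = income.value_counts(normalize=True)

print(builtin)
```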

### Explore the variables 

In [None]:
# check for missing values

df['income'].isnull().sum()

In [None]:
# view number of unique values

df['income'].nunique()

In [None]:
# view the unique values

df['income'].unique()

In [None]:
# view the frequency distribution of values

df['income'].value_counts()

In [None]:
# view percentage of frequency distribution of values

df['income'].value_counts()/len(df)

In [None]:
# visualize frequency distribution of income variable

f, ax = plt.subplots(1, 2, figsize=(18, 8))

# pie chart of class shares on the left axes
df['income'].value_counts().plot.pie(explode=[0, 0], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Income Share')

# count plot on the right axes (pass ax explicitly instead of relying on the current axes)
sns.countplot(x="income", data=df, palette="Set1", ax=ax[1])
ax[1].set_title("Frequency distribution of income variable")

plt.show()

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
ax = sns.countplot(y="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable")
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt sex")
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="race", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt race")
plt.show()

In [None]:
# check number of unique labels 

df.workclass.nunique()

In [None]:
# view the unique labels

df.workclass.unique()

In [None]:
# view frequency distribution of values

df.workclass.value_counts()

In [None]:
# replace ' ?' values in workclass variable with `NaN`
# (assign the result back; np.nan is the spelling supported by all NumPy versions)

df['workclass'] = df['workclass'].replace(' ?', np.nan)

In [None]:
# again check the frequency distribution of values in workclass variable

df.workclass.value_counts()

In [None]:
f, ax = plt.subplots(figsize=(10, 6))
ax = df.workclass.value_counts().plot(kind="bar", color="green")
ax.set_title("Frequency distribution of workclass variable")
ax.set_xticklabels(df.workclass.value_counts().index, rotation=30)
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt income")
ax.legend(loc='upper right')
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt sex")
ax.legend(loc='upper right')
plt.show()

In [None]:
# check number of unique labels

df.occupation.nunique()

In [None]:
# view unique labels

df.occupation.unique()


In [None]:
# view frequency distribution of values

df.occupation.value_counts()

In [None]:
# replace ' ?' values in occupation variable with `NaN`

df['occupation'] = df['occupation'].replace(' ?', np.nan)


In [None]:
# again check the frequency distribution of values

df.occupation.value_counts()

In [None]:
# visualize frequency distribution of `occupation` variable

f, ax = plt.subplots(figsize=(12, 8))
# draw the bars in the same order as the tick labels set below
order = df.occupation.value_counts().index
ax = sns.countplot(x="occupation", data=df, palette="Set1", order=order)
ax.set_title("Frequency distribution of occupation variable")
ax.set_xticklabels(order, rotation=30)
plt.show()

In [None]:
# check number of unique labels

df.native_country.nunique()

In [None]:
# view unique labels 

df.native_country.unique()


In [None]:
# check frequency distribution of values

df.native_country.value_counts()


In [None]:
# replace ' ?' values in native_country variable with `NaN`

df['native_country'] = df['native_country'].replace(' ?', np.nan)

In [None]:
# visualize frequency distribution of `native_country` variable

f, ax = plt.subplots(figsize=(16, 12))
# draw the bars in the same order as the tick labels set below
order = df.native_country.value_counts().index
ax = sns.countplot(x="native_country", data=df, palette="Set1", order=order)
ax.set_title("Frequency distribution of native_country variable")
ax.set_xticklabels(order, rotation=90)
plt.show()

In [None]:
df[categorical].isnull().sum()
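
The `?` placeholders could also be converted to `NaN` while the file is read, instead of column by column afterwards. The sketch below assumes a hypothetical three-row excerpt in the same comma-plus-space layout as the income CSV. Note that `skipinitialspace=True` also strips the leading blank from every label, so later code that compares against values such as `' ?'` would need adjusting.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the same layout as the income CSV
raw = io.StringIO(
    "age, workclass, occupation, native_country\n"
    "39, State-gov, Adm-clerical, United-States\n"
    "54, ?, ?, ?\n"
    "32, Private, Sales, ?\n"
)

# skipinitialspace strips the blank after each comma, so '?' matches na_values
sample = pd.read_csv(raw, na_values="?", skipinitialspace=True)

print(sample.isnull().sum())
```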

### Number of labels: Cardinality 


In [None]:
# check for cardinality in categorical variables

for var in categorical:
    
    print(var, ' contains ', len(df[var].unique()), ' labels')

We can see that the native_country column contains a relatively large number of labels compared to the other columns. I will check for cardinality again after the train-test split.

## Declare feature vector and target variable

In [None]:
X = df.drop(['income'], axis=1)

y = df['income']

## Split data into separate training and test set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
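
Because the income classes are imbalanced, it may be worth keeping the class proportions equal in both splits. This is not done in the cell above; the sketch below only illustrates the `stratify` argument of `train_test_split` on a hypothetical target (`X_demo` and `y_demo` are made up for the illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 760 negatives and 240 positives
y_demo = pd.Series([" <=50K"] * 760 + [" >50K"] * 240)
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

# stratify=y_demo keeps the class shares (roughly) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

train_share = y_tr.value_counts(normalize=True)[" >50K"]
test_share = y_te.value_counts(normalize=True)[" >50K"]
print(train_share, test_share)
```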


In [None]:
# check the shape of X_train and X_test

X_train.shape, X_test.shape

## Feature Engineering

### Display categorical variables in training set


In [None]:
categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

categorical

### Display numerical variables in training set


In [None]:
numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

numerical

### Engineering missing values in categorical variables

In [None]:
# print percentage of missing values in the categorical variables in training set

X_train[categorical].isnull().mean()

In [None]:
# print categorical variables with missing data

for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))

In [None]:
# impute missing categorical values with the most frequent value of the training set
# (assign the result back instead of calling fillna(inplace=True) on a column selection)

for df2 in [X_train, X_test]:
    for col in ['workclass', 'occupation', 'native_country']:
        df2[col] = df2[col].fillna(X_train[col].mode()[0])
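
The same idea (learn the most frequent value on the training set, then apply it to both sets) can be expressed with scikit-learn's `SimpleImputer`. This is an alternative sketch on hypothetical toy frames, not the method used in this chapter:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames with a missing value in each
train_demo = pd.DataFrame(
    {"workclass": [" Private", " Private", np.nan, " State-gov"]}, dtype=object
)
test_demo = pd.DataFrame({"workclass": [np.nan, " Local-gov"]}, dtype=object)

# The mode is learned from the training data only, then reused for the test data
imputer = SimpleImputer(strategy="most_frequent")
train_filled = pd.DataFrame(imputer.fit_transform(train_demo), columns=train_demo.columns)
test_filled = pd.DataFrame(imputer.transform(test_demo), columns=test_demo.columns)

print(test_filled)
```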

In [None]:
# check missing values in categorical variables in X_train

X_train[categorical].isnull().sum()

In [None]:
# check missing values in categorical variables in X_test

X_test[categorical].isnull().sum()

As a final check, I will check for missing values in X_train and X_test.

In [None]:
# check missing values in X_train

X_train.isnull().sum()

In [None]:
# check missing values in X_test

X_test.isnull().sum()

We can see that there are no missing values in X_train and X_test.

### Encode categorical variables


In [None]:
# preview categorical variables in X_train

X_train[categorical].head()

In [None]:
# import category encoders

import category_encoders as ce

In [None]:
# encode categorical variables with one-hot encoding

encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation', 'relationship', 
                                 'race', 'sex', 'native_country'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [None]:
X_train.head()

In [None]:
X_train.shape

In [None]:
X_test.head()

In [None]:
X_test.shape

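We now have the training and test sets ready for model building. Before that, we map all the feature variables onto the same scale, a step called **feature scaling**. Random Forest splits on per-feature thresholds, so it does not strictly require scaling; we apply it here as a general preprocessing step.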

## Feature Scaling

In [None]:
cols = X_train.columns


In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)


In [None]:
# pass cols directly; wrapping it in a list would create MultiIndex columns
X_train = pd.DataFrame(X_train, columns=cols)

In [None]:
X_test = pd.DataFrame(X_test, columns=cols)
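
With its default settings, `RobustScaler` subtracts the median of each feature and divides by its interquartile range (75th minus 25th percentile), which makes it less sensitive to outliers than mean/variance scaling. A minimal sketch on a hypothetical single-column array that checks this against a manual computation:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one large outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Manual version: (x - median) / (Q3 - Q1)
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

print(scaled.ravel())
```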

We now have the X_train dataset ready to be fed into the Random Forest classifier. We will do it as follows.

## Random Forest Classifier model with default parameters

In [None]:
# import Random Forest classifier

from sklearn.ensemble import RandomForestClassifier

# instantiate the classifier
# older scikit-learn releases defaulted to 10 trees; since version 0.22 the default
# is 100, so n_estimators is set explicitly to keep the 10-tree baseline

rfc = RandomForestClassifier(n_estimators=10, random_state=0)

# fit the model

rfc.fit(X_train, y_train)

# Predict the Test set results

y_pred = rfc.predict(X_test)

# Check accuracy score

from sklearn.metrics import accuracy_score

print('Model accuracy score with 10 decision-trees : {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

## Random Forest Classifier model with 100 Decision Trees

In [None]:
# instantiate the classifier with n_estimators = 100

rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)



# fit the model to the training set

rfc_100.fit(X_train, y_train)



# Predict on the test set results

y_pred_100 = rfc_100.predict(X_test)



# Check accuracy score 

print('Model accuracy score with 100 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred_100)))

In [None]:
# create the classifier with n_estimators = 100

clf = RandomForestClassifier(n_estimators=100, random_state=0)



# fit the model to the training set

clf.fit(X_train, y_train)


In [None]:
# view the feature scores

feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

feature_scores
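
The importances returned by `feature_importances_` sum to 1, so each score can be read as a share of the total. The sketch below uses synthetic data from `make_classification` (the names `X_syn`, `forest` and `scores` are hypothetical) and also shows one way to pick the least important feature by position instead of typing its name by hand.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, 3 of them informative
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_syn = pd.DataFrame(X_syn, columns=[f"f{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_syn, y_syn)

# Sorted importance scores, highest first
scores = pd.Series(forest.feature_importances_, index=X_syn.columns).sort_values(ascending=False)

# The last entry of the sorted Series is the least important feature
least_important = scores.index[-1]
X_reduced = X_syn.drop(columns=[least_important])

print(scores)
print("Dropped:", least_important)
```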

## Build the Random Forest model on selected features

In [None]:
# drop the least important feature from X_train and X_test

X_train = X_train.drop(['native_country_41'], axis=1)

X_test = X_test.drop(['native_country_41'], axis=1)


In [None]:
# instantiate the classifier with n_estimators = 100

clf = RandomForestClassifier(n_estimators=100, random_state=0)



# fit the model to the training set

clf.fit(X_train, y_train)


# Predict on the test set results

y_pred = clf.predict(X_test)



# Check accuracy score 

print('Model accuracy score with native_country_41 variable removed : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))


## Confusion matrix


A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.
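
In scikit-learn's `confusion_matrix`, rows correspond to the actual classes and columns to the predicted classes, with labels in sorted order. A small sketch on hypothetical label lists, treating ` >50K` as the positive class:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for five people
actual = [" <=50K", " <=50K", " <=50K", " >50K", " >50K"]
predicted = [" <=50K", " <=50K", " >50K", " >50K", " <=50K"]

# Row i = actual class i, column j = predicted class j
cm_demo = confusion_matrix(actual, predicted)

# For two classes, ravel() yields TN, FP, FN, TP (positive class = ' >50K')
tn, fp, fn, tp = cm_demo.ravel()

print(cm_demo)
print(tn, fp, fn, tp)
```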




In [None]:
# Print the Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix\n\n', cm)



In [None]:
# visualize confusion matrix with seaborn heatmap
# rows of cm are the actual classes, columns are the predicted classes (labels in sorted order)

cm_matrix = pd.DataFrame(data=cm, index=['Actual <=50K', 'Actual >50K'],
                                 columns=['Predicted <=50K', 'Predicted >50K'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')

## Classification Report

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
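
The precision, recall and F1 columns of the report can be derived from the confusion matrix counts. The sketch below uses hypothetical label lists and treats ` >50K` as the positive class, then checks the hand-computed values against scikit-learn's metric functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual negatives and 4 actual positives
actual = [" <=50K"] * 4 + [" >50K"] * 4
predicted = [" <=50K", " <=50K", " <=50K", " >50K",
             " >50K", " >50K", " <=50K", " <=50K"]
positive = " >50K"

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(actual, predicted, pos_label=positive),
      recall_score(actual, predicted, pos_label=positive),
      f1_score(actual, predicted, pos_label=positive))
```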

# Acknowledgments

Thanks to [prashant111](https://www.kaggle.com/prashant111) for creating [random-forest-classifier-feature-importance](https://www.kaggle.com/code/prashant111/random-forest-classifier-feature-importance). It inspires the majority of the content in this chapter.