42.83. Random Forest Classifier with Feature Importance#
42.83.1. The problem statement#
The task is to predict whether a person makes over 50K a year. To do so, we build a Random Forest classifier with Python and Scikit-Learn.
42.83.2. Import libraries#
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
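# Note: os.walk expects a local directory, so with a URL it finds nothing and this loop prints nothing
for dirname, _, filenames in os.walk('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/random-forest-income_evaluation.csv'):
    for filename in filenames:
        print(os.path.join(dirname, filename))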
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
42.83.3. Import dataset#
data = 'https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/random-forest-income_evaluation.csv'
df = pd.read_csv(data)
42.83.4. Exploratory data analysis#
View dimensions of dataset#
print('The shape of the dataset : ', df.shape)
We can see that there are 32561 instances and 15 attributes in the data set.
Preview the dataset#
df.head()
Rename column names#
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
df.columns = col_names
df.columns
View summary of dataset#
df.info()
Check the data types of columns#
The above df.info() command gives us the number of filled values along with the data types of columns.
If we only want the data type of each column, we can use the following command.
df.dtypes
View statistical properties of dataset#
df.describe().T
We can see that the above df.describe().T command presents statistical properties in horizontal form, with one row per numerical variable.
df.isnull().sum()
Check with ASSERT statement#
assert pd.notnull(df).all().all()
Functional approach to EDA#
def initial_eda(df):
    # Print the dimensions, the total number of missing values, and a per-column summary
    if isinstance(df, pd.DataFrame):
        total_na = df.isna().sum().sum()
        print("Dimensions : %d rows, %d columns" % (df.shape[0], df.shape[1]))
        print("Total NA Values : %d " % (total_na))
        print("%38s %10s %10s %10s" % ("Column Name", "Data Type", "#Distinct", "NA Values"))
        col_name = df.columns
        dtyp = df.dtypes
        uniq = df.nunique()
        na_val = df.isna().sum()
        for i in range(len(df.columns)):
            # .iloc selects by position; plain [i] on a label-indexed Series is deprecated
            print("%38s %10s %10s %10s" % (col_name[i], dtyp.iloc[i], uniq.iloc[i], na_val.iloc[i]))
    else:
        print("Expect a DataFrame but got a %15s" % (type(df)))
42.83.5. Explore Categorical Variables#
Find categorical variables#
categorical = [var for var in df.columns if df[var].dtype=='O']
print('There are {} categorical variables\n'.format(len(categorical)))
print('The categorical variables are :\n\n', categorical)
Preview categorical variables#
df[categorical].head()
Summary of categorical variables#
Frequency distribution of categorical variables#
for var in categorical:
    print(df[var].value_counts())
Percentage of frequency distribution of values#
for var in categorical:
    print(df[var].value_counts()/float(len(df)))
Explore the variables#
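# create one figure with two side-by-side axes (the figure size here is a free choice)
f, ax = plt.subplots(1, 2, figsize=(18, 8))
ax[0] = df['income'].value_counts().plot.pie(explode=[0,0],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Income Share')
#f, ax = plt.subplots(figsize=(6, 8))
# pass ax=ax[1] so the count plot is drawn on the second axes
ax[1] = sns.countplot(x="income", data=df, palette="Set1", ax=ax[1])
ax[1].set_title("Frequency distribution of income variable")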
f, ax = plt.subplots(figsize=(8, 6))
ax = sns.countplot(y="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable")
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt sex")
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="race", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt race")
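# replace the ' ?' placeholder with NaN; assign the result back rather than using inplace=True on a column
df['workclass'] = df['workclass'].replace(' ?', np.nan)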
# again check the frequency distribution of values in workclass variable
f, ax = plt.subplots(figsize=(10, 6))
ax = df.workclass.value_counts().plot(kind="bar", color="green")
ax.set_title("Frequency distribution of workclass variable")
ax.set_xticklabels(df.workclass.value_counts().index, rotation=30)
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt income")
ax.legend(loc='upper right')
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt sex")
ax.legend(loc='upper right')
# visualize frequency distribution of `occupation` variable
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="occupation", data=df, palette="Set1")
ax.set_title("Frequency distribution of occupation variable")
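# rotate the existing tick labels; countplot orders bars by first appearance, not by frequency,
# so relabeling with value_counts().index would attach the wrong names to the bars
ax.tick_params(axis='x', labelrotation=30)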
f, ax = plt.subplots(figsize=(16, 12))
ax = sns.countplot(x="native_country", data=df, palette="Set1")
ax.set_title("Frequency distribution of native_country variable")
# rotate the existing tick labels instead of relabeling them in value_counts() order
ax.tick_params(axis='x', labelrotation=90)
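The ' ?' replacement shown earlier for workclass can be applied to several columns at once. The sketch below uses a small made-up frame rather than the census data, and it assumes that occupation and native_country use the same ' ?' marker, as the later imputation step suggests.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the census data; the values are made up.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " ?"],
    "native_country": [" ?", " Cuba", " India"],
})

# Columns assumed to use ' ?' (with a leading space) as a missing-value marker.
cols_with_marker = ["workclass", "occupation", "native_country"]

# Replace the placeholder with a real NaN so that isnull() can see it.
toy[cols_with_marker] = toy[cols_with_marker].replace(" ?", np.nan)

missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Until the marker is replaced, isnull() reports zero missing values because ' ?' is an ordinary string.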
df[categorical].isnull().sum()
Number of labels: Cardinality#
for var in categorical:
    print(var, ' contains ', len(df[var].unique()), ' labels')
We can see that the native_country column contains a relatively large number of labels compared to the other columns. I will check for cardinality after the train-test split.
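Cardinality matters because one-hot encoding adds one column per label. A minimal sketch on a made-up frame shows how the label counts translate into encoded width; the frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame; the labels are made up for illustration.
toy = pd.DataFrame({
    "sex": [" Male", " Female", " Male", " Female"],
    "native_country": [" Cuba", " India", " Mexico", " Cuba"],
})

# nunique() counts the distinct non-missing labels in each column.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)

# One-hot encoding creates one column per label, so the encoded width is the sum.
encoded_width = int(cardinality.sum())
print("Columns after one-hot encoding:", encoded_width)
```

Note that nunique() ignores NaN, whereas len(df[var].unique()) counts NaN as an extra label.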
42.83.6. Declare feature vector and target variable#
X = df.drop(['income'], axis=1)
y = df['income']
42.83.7. Split data into separate training and test set#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
X_train.shape, X_test.shape
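The effect of test_size can be checked on a small synthetic example. The sketch below also passes stratify, which the split above does not use; it is shown only as an optional variation that keeps the class proportions similar in both parts. All data here is made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 25% of them in the ' >50K' class.
X_toy = pd.DataFrame({"age": np.arange(20, 120)})
y_toy = pd.Series([" >50K"] * 25 + [" <=50K"] * 75)

# test_size=0.3 holds out 30% of the rows; stratify keeps class shares similar.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

train_share = (y_tr == " >50K").mean()
test_share = (y_te == " >50K").mean()
print(X_tr.shape, X_te.shape, round(train_share, 3), round(test_share, 3))
```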
42.83.8. Feature Engineering#
Display categorical variables in training set#
categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']
categorical
Display numerical variables in training set#
numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']
numerical
Engineering missing values in categorical variables#
# print categorical variables with missing data
for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))
# impute missing categorical variables with the most frequent value of the training set
for df2 in [X_train, X_test]:
    # assign the result back; fillna with inplace=True on a selected column is not applied reliably in newer pandas
    df2['workclass'] = df2['workclass'].fillna(X_train['workclass'].mode()[0])
    df2['occupation'] = df2['occupation'].fillna(X_train['occupation'].mode()[0])
    df2['native_country'] = df2['native_country'].fillna(X_train['native_country'].mode()[0])
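The imputation above learns the fill value from the training set and reuses it for the test set, which avoids leaking test information into preprocessing. A minimal sketch of that pattern on made-up data:

```python
import numpy as np
import pandas as pd

# Toy train and test frames; the values are made up.
train = pd.DataFrame({"workclass": [" Private", " Private", np.nan, " State-gov"]})
test = pd.DataFrame({"workclass": [np.nan, " State-gov"]})

# Learn the most frequent value from the training set only.
fill_value = train["workclass"].mode()[0]

# Apply the same value to both sets, assigning the result back.
train["workclass"] = train["workclass"].fillna(fill_value)
test["workclass"] = test["workclass"].fillna(fill_value)

remaining_missing = int(train.isnull().sum().sum() + test.isnull().sum().sum())
print("Fill value:", fill_value, "| remaining missing:", remaining_missing)
```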
As a final check, I will look for missing values in X_train and X_test.
X_train.isnull().sum()
X_test.isnull().sum()
We can see that there are no missing values in X_train and X_test.
Encode categorical variables#
import category_encoders as ce
encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation', 'relationship',
'race', 'sex', 'native_country'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
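The fit-on-train, transform-on-test pattern above can be illustrated on made-up data. This sketch swaps in scikit-learn's OneHotEncoder instead of category_encoders, so the output is a plain array rather than a DataFrame with indexed column names such as native_country_41; the handle_unknown="ignore" setting is an assumption chosen to show what happens to labels never seen in training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data; ' Other' appears only in the test set.
train = pd.DataFrame({"sex": [" Male", " Female", " Male"]})
test = pd.DataFrame({"sex": [" Female", " Other"]})

# Learn the label set from the training data only.
enc = OneHotEncoder(handle_unknown="ignore")
train_encoded = enc.fit_transform(train).toarray()

# Unseen labels are encoded as an all-zero row instead of raising an error.
test_encoded = enc.transform(test).toarray()

print(enc.categories_)
print(test_encoded)
```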
We now have training and testing set ready for model building. Before that, we should map all the feature variables onto the same scale. It is called feature scaling. We will do it as follows.
42.83.9. Feature Scaling#
cols = X_train.columns
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# pass cols directly; wrapping it in a list would create MultiIndex column labels
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
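With its default settings, RobustScaler subtracts the median of each feature and divides by its interquartile range, so a single outlier has little influence on how the remaining values are scaled. The sketch below checks this against a manual computation on a made-up column.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One made-up feature with an outlier (100).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual version: (x - median) / (75th percentile - 25th percentile).
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(manual.ravel())
```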
We now have X_train dataset ready to be fed into the Random Forest classifier. We will do it as follows.
42.83.10. Random Forest Classifier model with default parameters#
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
from sklearn.metrics import accuracy_score
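# note: n_estimators defaults to 100 in scikit-learn 0.22 and later (it was 10 in earlier versions)
print('Model accuracy score with default parameters : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))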
42.83.11. Random Forest Classifier model with 100 Decision Trees#
rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)
rfc_100.fit(X_train, y_train)
y_pred_100 = rfc_100.predict(X_test)
print('Model accuracy score with 100 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred_100)))
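To compare forests of different sizes, n_estimators has to be set explicitly, because scikit-learn 0.22 and later already default to 100 trees. A minimal sketch on synthetic data (generated with make_classification, not the census data) might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, used only for illustration.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# Fit one forest per size and record the test accuracy of each.
scores = {}
for n_trees in (10, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_tr, y_tr)
    scores[n_trees] = accuracy_score(y_te, model.predict(X_te))
    print(n_trees, "trees:", round(scores[n_trees], 4))
```

The scores depend on the data and the random seed, so the difference between the two forests should not be assumed in advance.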
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
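The feature_importances_ attribute holds one impurity-based score per feature, and the scores sum to 1. The sketch below fits a forest on synthetic data and ranks the scores; the feature names are hypothetical. The lowest-ranked feature is a natural candidate for removal.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 2 informative features out of 5; names are hypothetical.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(5)]
X_syn = pd.DataFrame(X_syn, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_syn, y_syn)

# Rank features from most to least important.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
least_important = importances.index[-1]

print(importances)
print("Candidate for removal:", least_important)
```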
42.83.12. Build the Random Forest model on selected features#
X_train = X_train.drop(['native_country_41'], axis=1)
X_test = X_test.drop(['native_country_41'], axis=1)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Model accuracy score with native_country_41 variable removed : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))
42.83.13. Confusion matrix#
A confusion matrix summarizes the performance of a classification algorithm in tabular form. It breaks down correct and incorrect predictions by class, which gives a clear picture of both the model's overall performance and the types of errors it produces.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n\n', cm)
# rows of cm are the actual classes and columns are the predicted classes, in sorted label order
cm_matrix = pd.DataFrame(data=cm, columns=['Predicted <=50K', 'Predicted >50K'],
                         index=['Actual <=50K', 'Actual >50K'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
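In scikit-learn's confusion_matrix, row i holds the samples whose actual class is i and column j holds those predicted as j. A small hand-checkable example with made-up labels makes the orientation concrete:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: three actual ' <=50K' and two actual ' >50K'.
y_true = [" <=50K", " <=50K", " <=50K", " >50K", " >50K"]
y_hat = [" <=50K", " <=50K", " >50K", " >50K", " <=50K"]

# Fix the label order explicitly so rows and columns are unambiguous.
labels = [" <=50K", " >50K"]
cm_toy = confusion_matrix(y_true, y_hat, labels=labels)

# Row 0: actual ' <=50K' (2 predicted correctly, 1 predicted as ' >50K').
# Row 1: actual ' >50K'  (1 predicted as ' <=50K', 1 predicted correctly).
print(cm_toy)
```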
42.83.14. Classification Report#
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
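The precision, recall and F1 values in the report follow directly from the counts of true positives, false positives and false negatives. The sketch below recomputes them by hand on made-up 0/1 labels and compares the results with scikit-learn's metric functions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up binary labels; 1 is treated as the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```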
