Predicting Income from Census Data

Predicting Income from Census Data

In this post, I am going to use census data to predict income based on the information provided in the census data. The data comes from the UCI Machine Learning Repository. The descriptions of the 13 features in the dataset are given as follows:

    age: the age of the individual in years

    workclass: the classification of the individual's working status (does the person work for the federal government, work for the local government, work without pay, and so on)

    fnlwgt

    education: the level of education of the individual (e.g., 5th-6th grade, high school graduate, PhD, so on)

    education-num:

    maritalstatus: the marital status of the individual

    occupation: the type of work the individual does (e.g., administrative/clerical work, farming/fishing, sales and so on)

    relationship: relationship of individual to his/her household

    race: the individual's race

    sex: the individual's sex

    capitalgain: the capital gains of the individual in 1994 (from selling an asset such as a stock or bond for more than the original purchase price)

    capitalloss: the capital losses of the individual in 1994 (from selling an asset such as a stock or bond for less than the original purchase price)

    hoursperweek: the number of hours the individual works per week

    nativecountry: the native country of the individual

    over50k: whether or not the individual earned more than $50,000 in 1994

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style='white', context='notebook', palette='deep')

%matplotlib inline

import warnings
warnings.simplefilter(action='ignore')

Import data from UCI Machine Learning Repository

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/census-income/census-income.data"

census_data = pd.read_csv(url, header=None)

col_names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 
             'occupation',
       'relationship', 'race', 'sex', 'capitalgain', 'capitalloss',
       'hoursperweek', 'nativecountry', 'over50k']
census_data.columns = col_names
census_data.head()
Out[2]:
age workclass fnlwgt education educationnum maritalstatus occupation relationship race sex capitalgain capitalloss hoursperweek nativecountry over50k
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

Explatory Data Analysis

The variable over50k represents the income class with two categories: <=50K and >50K. For ease of analysis, lets encodes the two string class to numerical values, 0 and 1, respectively.

In [3]:
group = {' <=50K':0, ' >50K':1}
census_data['over50k'] = census_data['over50k'].map(group)
census_data['over50k'].unique()
Out[3]:
array([0, 1], dtype=int64)
In [4]:
num_vars = [var for var in census_data.columns if census_data[var].dtype !='O']
cat_vars = [var for var in census_data.columns if census_data[var].dtype =='O']
print('number of numeric variable      = ', len(num_vars))
print('number of categorical variables = ', len(cat_vars))
number of numeric variable      =  7
number of categorical variables =  8

Some of the variables have missing values, indicated by question marks ('?'). Let fill with 'Missing'

In [5]:
# we will replace missing values for the categorical variables with the mode
def fill_missing_values(df, var):
    
    df = df.copy()
    # find mode
    mode = df[var].mode()
    #remove any white spaces infront of the values
    df[var] = df[var].str.strip()
    # repace '?' with 'Missing'
    df[var] = np.where(df[var]=='?', mode, df[var])
    
    return df

for var in cat_vars:
    census_data = fill_missing_values(census_data, var)
census_data['nativecountry'] = np.where(census_data['nativecountry'].str.contains('United-States'), 'USA',
                                       census_data['nativecountry'])
In [6]:
census_data['nativecountry'].unique()
Out[6]:
array(['USA', 'Cuba', 'Jamaica', 'India', 'Mexico', 'South',
       'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany', 'Iran',
       'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',
       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
       'China', 'Japan', 'Yugoslavia', 'Peru',
       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
       'Holand-Netherlands'], dtype=object)
In [7]:
#let's check if any of the variables has null values
[var for var in census_data.columns if census_data[var].isnull().sum()>0]
Out[7]:
[]

Now, the data is cleaned, let divide it into training and testing datasets

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(census_data.drop('over50k', axis=1),
                                                                   census_data.over50k,
                                                                   test_size=.25, 
                                                                   random_state=0)

Cardinality of categorical variables

Let's check the number of uniques labels for each categorical variable

Removing rare values

In [9]:
for var in cat_vars:
    #plt.figure(figsize=(8, 14))
    if var !='over50k':
        g = sns.factorplot(x=var,y="over50k",data=census_data,kind="bar",
        palette = "muted", height=4, aspect=2)
        g.despine(left=True)
        g = g.set_ylabels("Income >50K Probability")
        g = g.set_xlabels(var)
        g = g.set_xticklabels(rotation=30)
        #g.set_xticklabels(g.get_xticklabels(), rotation=45)

The above bar charts show that there is indeed income variation among categories in each variable. Now, let find and model any rare categories. Let's define rare categories which occur less that 1% of the time

In [10]:
cat_vars2 = [var for var in X_train.columns if X_train[var].dtypes =='O']
cat_vars
Out[10]:
['workclass',
 'education',
 'maritalstatus',
 'occupation',
 'relationship',
 'race',
 'sex',
 'nativecountry']
In [11]:
def find_frequent_categories(df, var, pct):
    
    #df = df.copy()
    temp = df.groupby(var)[var].count()/len(df)
    
    frequent_list = temp[temp>pct].index
    
    return frequent_list

for var in cat_vars2:
    #find frequent categories
    frequent_categories = find_frequent_categories(X_train, var, 0.01)
    
    #replace rare values in training dataset
    X_train[var] = np.where(X_train[var].isin(frequent_categories), X_train[var], 'Rare')
        
    # replace rare values in testing dataset
    X_test[var] = np.where(X_test[var].isin(frequent_categories), X_test[var], 'Rare')
In [12]:
X_train['nativecountry'].unique()
Out[12]:
array(['USA', 'Rare', 'Mexico'], dtype=object)

Encoding categorical variables

In [13]:
def encode_categorical_variables(df, var):
    
    df= df.copy()
    
    df = pd.concat([df, pd.get_dummies(df[var], prefix = var)], axis=1)
    
    df.drop(labels = var, axis=1, inplace=True)
    
    return df

for var in cat_vars2:
    X_train = encode_categorical_variables(X_train,var)
    X_test = encode_categorical_variables(X_test, var)
In [14]:
X_train.head()
Out[14]:
age fnlwgt educationnum capitalgain capitalloss hoursperweek workclass_ Private workclass_Federal-gov workclass_Local-gov workclass_Private ... relationship_Wife race_Asian-Pac-Islander race_Black race_Rare race_White sex_Female sex_Male nativecountry_Mexico nativecountry_Rare nativecountry_USA
26464 59 61885 8 0 0 35 0 0 0 1 ... 0 0 1 0 0 0 1 0 0 1
16134 71 180733 14 0 0 20 0 0 0 1 ... 0 0 0 0 1 1 0 0 0 1
4747 42 107762 14 0 0 40 0 0 0 1 ... 0 0 0 0 1 0 1 0 0 1
8369 26 35917 9 0 0 40 0 0 0 1 ... 0 0 0 0 1 0 1 0 0 1
5741 46 256522 2 0 0 40 0 0 0 1 ... 0 0 0 0 1 0 1 0 1 0

5 rows × 65 columns

Numerical Variables

In [15]:
census_data[num_vars].hist(bins=30, figsize=(10,10))
plt.show()

A quick look at the distributions of the variables shows that the variables are not normally distributed and some form of transformation may be required.

Let's now compare the distribution by the income category

In [16]:
for var in num_vars:
    g = sns.FacetGrid(census_data, col='over50k')
    g = g.map(sns.distplot, var)

Majority of the instances have zero capitalgain/loss. Let include binary variables that show whether the capital gain/loss > 0 or not.

In [17]:
for df in [X_train, X_test]:
    df['hasCapitalGain'] = np.where(df['capitalgain']>0, 1, 0)
    df['hasCapitalLoss'] = np.where(df['capitalloss']>0, 1, 0)
In [18]:
X_train.groupby('hasCapitalLoss')['hasCapitalLoss'].count()/len(X_train)
Out[18]:
hasCapitalLoss
0    0.953686
1    0.046314
Name: hasCapitalLoss, dtype: float64

Age may be more meaningful if coverted into age groups. Let's also group age into age_groups <25, 26-50, 51-64, 65+. Then we will encod each age group.

In [19]:
# this function creates age groups, encodes the age groups, and remove age from the data
def find_age_group(var):
    if var <=25:
        return '<=25'
    elif var < 45:
        return '26-50'
    elif var < 65:
        return '51-64'
    else:
        return '65+'
X_train['age_group'] = X_train['age'].apply(find_age_group)
X_test['age_group'] = X_test['age'].apply(find_age_group)

# encode age groups
X_train = encode_categorical_variables(X_train, 'age_group')
X_test = encode_categorical_variables(X_test, 'age_group')
    
# remove age from the dataset
#X_train.drop(labels='age', axis=1, inplace=True)
#X_test.drop(labels='age', axis=1, inplace=True)
In [20]:
X_train.head()
Out[20]:
age fnlwgt educationnum capitalgain capitalloss hoursperweek workclass_ Private workclass_Federal-gov workclass_Local-gov workclass_Private ... sex_Male nativecountry_Mexico nativecountry_Rare nativecountry_USA hasCapitalGain hasCapitalLoss age_group_26-50 age_group_51-64 age_group_65+ age_group_<=25
26464 59 61885 8 0 0 35 0 0 0 1 ... 1 0 0 1 0 0 0 1 0 0
16134 71 180733 14 0 0 20 0 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
4747 42 107762 14 0 0 40 0 0 0 1 ... 1 0 0 1 0 0 1 0 0 0
8369 26 35917 9 0 0 40 0 0 0 1 ... 1 0 0 1 0 0 1 0 0 0
5741 46 256522 2 0 0 40 0 0 0 1 ... 1 0 1 0 0 0 0 1 0 0

5 rows × 71 columns

Let's remove fnlwgt, educationnum, capitalgain, capitalloss, and scale hoursperweek

In [21]:
X_train.drop(labels=['fnlwgt', 'educationnum', 'capitalgain', 'capitalloss'], axis=1, inplace=True)
X_test.drop(labels=['fnlwgt', 'educationnum', 'capitalgain', 'capitalloss'], axis=1, inplace=True)
In [22]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

#hoursperweek
scaler.fit(X_train['hoursperweek'].values.reshape(-1,1))

X_train['hoursperweek'] = scaler.transform(X_train['hoursperweek'].values.reshape(-1,1))
X_test['hoursperweek'] = scaler.transform(X_test['hoursperweek'].values.reshape(-1,1))

#age
scaler.fit(X_train['age'].values.reshape(-1,1))

X_train['age'] = scaler.transform(X_train['age'].values.reshape(-1,1))
X_test['age'] = scaler.transform(X_test['age'].values.reshape(-1,1))

Comparing Classifiers

Here, we will compare 8 classifiers

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier

from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.metrics import roc_curve
In [25]:
models = []
models.append(('LR', LogisticRegression(C=0.0005, random_state=0)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SGD', SGDClassifier(loss='log', random_state=42)))
models.append(('RF', RandomForestClassifier(n_estimators=100)))
models.append(('SVC', SVC(probability=True)))
models.append(('BAG', BaggingClassifier(n_estimators=100)))
models.append(('GBM', GradientBoostingClassifier(n_estimators=100)))

# evalutate each model in turn
train_score = []
test_score = []
test_predictions = {}
names = []


for name, model in models:
    #train the model
    model.fit(X_train, y_train)
    # score on train dataset
    pred_train = model.predict_proba(X_train)[:,1]
    train_roc_auc_score = round(roc_auc_score(y_train, pred_train),4)
    train_score.append(train_roc_auc_score)
    
    # score on test dataset
    pred_test = model.predict_proba(X_test)[:,1]
    test_predictions[name] = pred_test
    
    test_roc_auc_score = round(roc_auc_score(y_test, pred_test),4)
    test_score.append(test_roc_auc_score)
    
    names.append(name)
    # plot roc curve
       
results = pd.DataFrame(list(zip(names, train_score, test_score)),
            columns=['Model', 'Train Score', 'Test Score']).set_index('Model')
results.sort_values('Test Score', ascending=False)
Out[25]:
Train Score Test Score
Model
GBM 0.9075 0.9032
SVC 0.9035 0.9019
SGD 0.9011 0.9008
RF 0.9974 0.8738
BAG 0.9974 0.8683
LR 0.8595 0.8556
KNN 0.9359 0.8452
CART 0.9985 0.7344

ROC Curve

Comparing models on the test dataset

In [26]:
plt.figure(figsize=(8, 6))
clrs = ["b--", "g-", "c-", "y--", "k--", "m--", "y-", 'r--']
i = -1
for model in test_predictions.keys():  
    i = i+1
    # plot roc curve
    fpr, tpr, thresholds = roc_curve(y_test,test_predictions[model]) 
    plt.plot(fpr, tpr , clrs[i], label=model)
    plt.plot([0, 1], [0, 1], 'k--') 
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=14) 
    plt.ylabel('True Positive Rate (Recall)', fontsize=14)
    plt.legend(loc="lower right", fontsize=16)
    plt.grid(True)
    plt.savefig('roc_curve.png')

The GBM and SVC classifiers performs well in both the train and test datasets. On the other hand, although the Decision Tree model has score of 0.998 on the training dataset, it the lowest score in the test dataset. This shows overfitting.