Statistics for Machine Learning

Example of logistic regression using German credit data

Open source German credit data has been utilized from the UCI machine learning repository to model logistic regression: https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data).

>>> import os 
>>> os.chdir("D:\\Book writing\\Codes\\Chapter 3") 
 
>>> import numpy as np 
>>> import pandas as pd 
 
>>> from sklearn.model_selection import train_test_split     
>>> from sklearn.metrics import accuracy_score,classification_report 
 
>>> credit_data = pd.read_csv("credit_data.csv") 

The following code displays the top five rows of the data:

>>> print (credit_data.head()) 

Altogether, we have 20 explanatory variables and one dependent variable called class. A class value of 2 indicates default and 1 indicates non-default. In order to model this as a 0-1 problem, we subtract 1 from the class variable in the following code:

>>> credit_data['class'] = credit_data['class']-1 

In order to know the predictive ability of each variable with respect to the dependent variable, here we have performed an information value (IV) calculation. In the following code, both categorical and continuous variables have been taken into consideration.

If the datatype is object, the variable is treated as categorical; any other type, such as int64, is treated as continuous and binned into 10 equal parts (also known as deciles) accordingly.

>>> def IV_calc(data,var):
...     # Categorical variable (object dtype): group directly on the variable's levels
...     if data[var].dtypes == "object":
...         dataf = data.groupby([var])['class'].agg(['count','sum'])
...         dataf.columns = ["Total","bad"]
...         dataf["good"] = dataf["Total"] - dataf["bad"]
...         dataf["bad_per"] = dataf["bad"]/dataf["bad"].sum()
...         dataf["good_per"] = dataf["good"]/dataf["good"].sum()
...         dataf["I_V"] = (dataf["good_per"] - dataf["bad_per"]) * np.log(dataf["good_per"]/dataf["bad_per"])
...         return dataf
...     else:
...         # Continuous variable: bin into deciles first, then group on the bins
...         data['bin_var'] = pd.qcut(data[var].rank(method='first'),10)
...         dataf = data.groupby(['bin_var'])['class'].agg(['count','sum'])
...         dataf.columns = ["Total","bad"]
...         dataf["good"] = dataf["Total"] - dataf["bad"]
...         dataf["bad_per"] = dataf["bad"]/dataf["bad"].sum()
...         dataf["good_per"] = dataf["good"]/dataf["good"].sum()
...         dataf["I_V"] = (dataf["good_per"] - dataf["bad_per"]) * np.log(dataf["good_per"]/dataf["bad_per"])
...         return dataf

Information value has been calculated for Credit_history (categorical) and Duration_in_month (continuous) for illustration purposes. The total IV obtained for Credit_history is 0.29, which indicates medium predictive power, and for Duration_in_month it is 0.34, which indicates a strong predictor.

>>> print ("\n\nCredit History - Information Value\n")
>>> print (IV_calc(credit_data,'Credit_history'))
>>> print ("\n\nCredit History - Duration in month\n")
>>> print (IV_calc(credit_data,'Duration_in_month'))
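The medium/strong labels above follow the commonly used rule-of-thumb bands for total information value (roughly: below 0.02 not predictive, 0.02-0.1 weak, 0.1-0.3 medium, above 0.3 strong). A small helper to attach these labels automatically might look like the following; the iv_strength function is an assumed convenience, not part of the original listing:

>>> def iv_strength(iv_total):
...     # Conventional rule-of-thumb bands for total information value
...     if iv_total < 0.02:
...         return "not predictive"
...     elif iv_total < 0.1:
...         return "weak"
...     elif iv_total < 0.3:
...         return "medium"
...     else:
...         return "strong"

>>> ch_iv = IV_calc(credit_data,'Credit_history')["I_V"].sum()
>>> print (round(ch_iv,3), iv_strength(ch_iv))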

However, in real scenarios, the initial data sometimes has around 500 variables or even more. In that case, it is difficult to run the code for each variable separately. The following code automates the calculation of the total information value for all the discrete and continuous variables in one single go:

>>> discrete_columns = ['Status_of_existing_checking_account', 'Credit_history', 'Purpose', 'Savings_Account','Present_Employment_since', 'Personal_status_and_sex', 'Other_debtors','Property','Other_installment_plans', 'Housing', 'Job', 'Telephone', 'Foreign_worker']

>>> continuous_columns = ['Duration_in_month', 'Credit_amount', 'Installment_rate_in_percentage_of_disposable_income', 'Present_residence_since', 'Age_in_years','Number_of_existing_credits_at_this_bank', 'Number_of_People_being_liable_to_provide_maintenance_for']

>>> total_columns = discrete_columns + continuous_columns
# List of IV values
>>> Iv_list = []
>>> for col in total_columns:
...     assigned_data = IV_calc(data = credit_data,var = col)
...     iv_val = round(assigned_data["I_V"].sum(),3)
...     dt_type = credit_data[col].dtypes
...     Iv_list.append((iv_val,col,dt_type))

>>> Iv_list = sorted(Iv_list,reverse = True)

>>> for i in range(len(Iv_list)):
...     print (Iv_list[i][0],",",Iv_list[i][1],",type =",Iv_list[i][2])

In the following output, all the variables are listed with their information values in descending order. After the information value, the variable name and the type of the variable are also shown. If the type is object, it is a categorical variable; similarly, if the type is int64, it is a 64-bit integer value. We will be considering the top 15 variables for the next stage of analysis.
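As a small convenience (not part of the original listing), the top 15 names can also be extracted programmatically from the sorted Iv_list built above and split back into discrete and continuous groups:

>>> top15 = Iv_list[:15]
>>> top15_discrete = [col for iv_val, col, dt_type in top15 if dt_type == "object"]
>>> top15_continuous = [col for iv_val, col, dt_type in top15 if dt_type != "object"]
>>> print (top15_discrete)
>>> print (top15_continuous)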

After retaining the top 15 variables, we have the following variables in the discrete and continuous categories. Subsequently, we will do dummy variable coding for the discrete variables and use the continuous variables as they are.

>>> dummy_stseca = pd.get_dummies(credit_data['Status_of_existing_checking_account'], prefix='status_exs_accnt')
>>> dummy_ch = pd.get_dummies(credit_data['Credit_history'], prefix='cred_hist')
>>> dummy_purpose = pd.get_dummies(credit_data['Purpose'], prefix='purpose')
>>> dummy_savacc = pd.get_dummies(credit_data['Savings_Account'], prefix='sav_acc')
>>> dummy_presc = pd.get_dummies(credit_data['Present_Employment_since'], prefix='pre_emp_snc')
>>> dummy_perssx = pd.get_dummies(credit_data['Personal_status_and_sex'], prefix='per_stat_sx')
>>> dummy_property = pd.get_dummies(credit_data['Property'], prefix='property')
>>> dummy_othinstpln = pd.get_dummies(credit_data['Other_installment_plans'], prefix='oth_inst_pln')
>>> dummy_forgnwrkr = pd.get_dummies(credit_data['Foreign_worker'], prefix='forgn_wrkr')
>>> dummy_othdts = pd.get_dummies(credit_data['Other_debtors'], prefix='oth_debtors')

>>> continuous_columns = ['Duration_in_month', 'Credit_amount', 'Installment_rate_in_percentage_of_disposable_income', 'Age_in_years', 'Number_of_existing_credits_at_this_bank' ]

>>> credit_continuous = credit_data[continuous_columns]
>>> credit_data_new = pd.concat([dummy_stseca,dummy_ch, dummy_purpose,dummy_savacc, dummy_presc,dummy_perssx, dummy_property, dummy_othinstpln,dummy_othdts, dummy_forgnwrkr,credit_continuous,credit_data['class']],axis=1)

Data has been split between train and test at a 70-30 ratio; random_state=42 is used as the seed for pseudo-random number generation in order to create reproducible results when the code is run multiple times or by multiple users.

>>> x_train,x_test,y_train,y_test = train_test_split( credit_data_new.drop(['class'] ,axis=1),credit_data_new['class'],train_size = 0.7,random_state=42)

>>> y_train = pd.DataFrame(y_train)
>>> y_test = pd.DataFrame(y_test)

While generating dummy variables using the pd.get_dummies() function, the number of dummies produced is equal to the number of classes in the variable. However, one dummy variable fewer than the number of classes is enough to represent a variable in this setting (if all the remaining dummies are 0, that pattern represents the one extra class). For example, if a sample variable, decision response, can take any of three values (yes, no, and can't say), it can be represented with two dummy variables (d1, d2):

  • If d1 =1 and d2 = 0, we can assign category yes
  • If d1=0 and d2 = 1, we can assign category no
  • If d1 = 0 and d2 = 0, we can assign category can't say

In this way, using two dummy variables we have represented all three categories. Similarly, we can represent a variable with N categories using N-1 dummy variables.
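As an aside (not from the original listing), pandas can produce this N-1 coding directly via the drop_first parameter of pd.get_dummies(). The toy decision column below is hypothetical and only illustrates the yes/no/can't say example above:

>>> toy = pd.DataFrame({'decision': ["yes", "no", "can't say", "yes"]})
>>> print (pd.get_dummies(toy['decision'], prefix='d', drop_first=True))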

In fact, keeping the same number of dummy variables as categories will produce NaN values in the output due to the redundancy it creates. Hence, here we remove one extra category column from each of the categorical variables for which dummies have been created:

>>> remove_cols_extra_dummy = ['status_exs_accnt_A11', 'cred_hist_A30', 'purpose_A40', 'sav_acc_A61','pre_emp_snc_A71','per_stat_sx_A91', 'oth_debtors_A101','property_A121', 'oth_inst_pln_A141','forgn_wrkr_A201']

Here, we have created an extra list for removing insignificant variables step by step while working through the backward elimination methodology; at the end of each iteration, we will keep adding the most insignificant and multi-collinear variables to the remove_cols_insig list, so that those variables are removed while training the model.

>>> remove_cols_insig = []
>>> remove_cols = list(set(remove_cols_extra_dummy+remove_cols_insig))

Now for the most important step of the model: applying the Logit function to the n variables and creating the summary:

>>> import statsmodels.api as sm
>>> logistic_model = sm.Logit(y_train, sm.add_constant(x_train.drop( remove_cols, axis=1))).fit()
>>> print (logistic_model.summary())

The summary code generates the following output, in which the most insignificant variable is purpose_A46, with a p-value of 0.937:

Also, VIF values are calculated to check multi-collinearity, although insignificant variables need to be removed before multi-collinear variables with VIF > 5 are removed. From the following results, per_stat_sx_A93 comes out as the most multi-collinear variable, with a VIF of 6.177:

>>> print ("\nVariance Inflation Factor")
>>> cnames = x_train.drop(remove_cols,axis=1).columns
>>> for i in np.arange(0,len(cnames)):
...     # Regress each predictor on all the others; VIF = 1/(1 - R^2)
...     xvars = list(cnames)
...     yvar = xvars.pop(i)
...     mod = sm.OLS(x_train.drop(remove_cols,axis=1)[yvar], sm.add_constant( x_train.drop(remove_cols,axis=1)[xvars]))
...     res = mod.fit()
...     vif = 1/(1-res.rsquared)
...     print (yvar,round(vif,3))

We also check how good the classifier is at predicting the results, for which we compute the c-statistic, the proportion of concordant pairs out of the total pairs. The higher the value, the better; at a minimum, it should be 0.7 for the model to be deployed in a production environment. The following code describes the various steps involved in calculating the c-statistic from first principles:

>>> y_pred = pd.DataFrame (logistic_model. predict(sm.add_constant (x_train.drop (remove_cols,axis=1))))
>>> y_pred.columns = ["probs"]
>>> both = pd.concat([y_train,y_pred],axis=1)

Zeros is the subset of the actual-versus-predicted table where the actual class is zero, whereas ones is the subset where the actual class is one.

>>> zeros = both[['class','probs']][both['class']==0]
>>> ones = both[['class','probs']][both['class']==1]

>>> def df_crossjoin(df1, df2, **kwargs):
...     # Cartesian product of two DataFrames via a temporary constant merge key
...     df1['_tmpkey'] = 1
...     df2['_tmpkey'] = 1
...     res = pd.merge(df1, df2, on='_tmpkey', **kwargs).drop('_tmpkey', axis=1)
...     res.index = pd.MultiIndex.from_product((df1.index, df2.index))
...     df1.drop('_tmpkey', axis=1, inplace=True)
...     df2.drop('_tmpkey', axis=1, inplace=True)
...     return res

In the following step, we produce the Cartesian product of the ones and zeros data in order to calculate concordant and discordant pairs:

>>> joined_data = df_crossjoin(ones,zeros)

A pair is concordant if the predicted probability for the actual 1 observation is higher than that for the actual 0 observation, and discordant if it is lower. If both probabilities are the same, we put the pair in the tied category. The higher the number of concordant pairs, the better the model!

>>> joined_data['concordant_pair'] = 0
>>> joined_data.loc[joined_data['probs_x'] > joined_data['probs_y'], 'concordant_pair'] =1
>>> joined_data['discordant_pair'] = 0
>>> joined_data.loc[joined_data['probs_x'] < joined_data['probs_y'], 'discordant_pair'] =1
>>> joined_data['tied_pair'] = 0
>>> joined_data.loc[joined_data['probs_x'] == joined_data['probs_y'],'tied_pair'] =1
>>> p_conc = (sum(joined_data['concordant_pair'])*1.0 )/ (joined_data.shape[0])
>>> p_disc = (sum(joined_data['discordant_pair'])*1.0 )/ (joined_data.shape[0])

>>> c_statistic = 0.5 + (p_conc - p_disc)/2.0
>>> print ("\nC-statistic:",round(c_statistic,4))

The c-statistic obtained is 0.8388, which is greater than the 0.7 needed to be considered as a good model.
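As a quick sanity check (not part of the original text), the c-statistic is the same quantity as the area under the ROC curve, so it can be cross-verified with scikit-learn's roc_auc_score on the same training probabilities:

>>> from sklearn.metrics import roc_auc_score
>>> print ("AUC via sklearn:", round(roc_auc_score(both['class'], both['probs']),4))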

We will always keep a tab on how the c-statistic and the log-likelihood (and the related AIC) change (here the log-likelihood is -309.29) while removing variables one by one, in order to justify when to stop.
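A minimal way to track these two quantities between iterations, using the attributes that statsmodels exposes on the fitted model, is shown below (a convenience sketch, not part of the original listing):

>>> print ("Log-likelihood:", round(logistic_model.llf,2))
>>> print ("AIC:", round(logistic_model.aic,2))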

Prior to removing the insignificant variable purpose_A46, it is important to check its VIF and the p-value of the per_stat_sx_A93 variable. There might be some situations in which we need to take both metrics into consideration and make trade-offs as well.

However, the following table makes it clear that we need to remove the purpose_A46 variable:

After we remove the purpose_A46 variable, we need to reiterate the process until no insignificant and multi-collinear variables remain, while keeping a tab on how the AIC and c-statistic values change. In the following code, we have shown the order in which variables were removed one by one; however, we encourage readers to perform this hands-on to validate the results independently:

>>> remove_cols_insig = ['purpose_A46', 'purpose_A45', 'purpose_A44', 'sav_acc_A63',
...     'oth_inst_pln_A143','property_A123', 'status_exs_accnt_A12', 'pre_emp_snc_A72',
...     'pre_emp_snc_A75','pre_emp_snc_A73', 'cred_hist_A32', 'cred_hist_A33',
...     'purpose_A410','pre_emp_snc_A74','purpose_A49', 'purpose_A48', 'property_A122',
...     'per_stat_sx_A92','forgn_wrkr_A202','per_stat_sx_A94', 'purpose_A42',
...     'oth_debtors_A102','Age_in_years','sav_acc_A64','sav_acc_A62', 'sav_acc_A65',
...     'oth_debtors_A103']
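For readers who want to automate the loop rather than curate the list by hand, a rough, p-value-only sketch of the backward elimination is shown below. This is an illustrative addition, not part of the original listing: the list above was chosen manually, also taking VIF into account, so the automated result will not match it exactly.

>>> # Hypothetical backward-elimination sketch, dropping the worst p-value each round
>>> auto_removed = []
>>> while True:
...     cols_to_drop = list(set(remove_cols_extra_dummy + auto_removed))
...     X = sm.add_constant(x_train.drop(cols_to_drop, axis=1))
...     fit = sm.Logit(y_train, X).fit(disp=0)
...     pvals = fit.pvalues.drop('const')
...     if pvals.max() < 0.05:
...         break
...     auto_removed.append(pvals.idxmax())
>>> print (auto_removed)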

Finally, after eliminating the insignificant and multi-collinear variables, the following final results are obtained:

  • Log-likelihood: -334.35
  • c-statistic: 0.8035

If we compare these with the initial values, the log-likelihood decreased from -309.29 to -334.35 and the c-statistic also decreased slightly, from 0.8388 to 0.8035. Both drops are modest, so the model still holds good with a much lower number of variables. Removing extra variables without greatly impacting model performance also creates efficiency during the implementation of the model!

The ROC curve has been plotted as TPR versus FPR in the following chart; the area under the curve, with a value of 0.80, is also shown.

>>> import matplotlib.pyplot as plt 
>>> from sklearn import metrics 
>>> from sklearn.metrics import auc 
>>> fpr, tpr, thresholds = metrics.roc_curve(both['class'],both['probs'], pos_label=1) 
 
>>> roc_auc = auc(fpr,tpr) 
>>> plt.figure() 
>>> lw = 2 
>>> plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc) 
>>> plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--') 
>>> plt.xlim([0.0, 1.0]) 
>>> plt.ylim([0.0, 1.05]) 
>>> plt.xlabel('False Positive Rate (1-Specificity)') 
>>> plt.ylabel('True Positive Rate') 
>>> plt.title('ROC Curve - German Credit Data') 
>>> plt.legend(loc="lower right") 
>>> plt.show() 

Once we have found the best model from the training dataset, the next and final task is to predict the category from the probability of default. There are many ways to set the threshold value for converting a predicted probability into an actual class. In the following code, we have done a simple grid search to determine the best probability threshold cut-off. Nonetheless, sensitivity and specificity curves could also be utilized for this task.

>>> for i in list(np.arange(0,1,0.1)):
...     both["y_pred"] = 0
...     both.loc[both["probs"] > i, 'y_pred'] = 1
...     print ("Threshold",i,"Train Accuracy:", round(accuracy_score(both['class'], both['y_pred']),4))

From the preceding grid-search results, we find that a threshold of 0.5 gives the maximum accuracy of 0.7713. Hence, we set the threshold at 0.5 for classifying the test data as well:

>>> both["y_pred"] = 0
>>> both.loc[both["probs"] > 0.5, 'y_pred'] = 1
>>> print ("\nTrain Confusion Matrix\n\n", pd.crosstab(both['class'], both['y_pred'], ... rownames = ["Actuall"],colnames = ["Predicted"]))
>>> print ("\nTrain Accuracy:",round(accuracy_score(both['class'],both['y_pred']),4))

The results are discussed next. After setting the threshold to 0.5 and classifying the predicted probabilities into the 0 or 1 classes, the confusion matrix has been calculated by taking actual values in rows and predicted values in columns, which shows an accuracy of 0.7713, or 77.13 percent:

Now, a threshold of 0.5 will be applied to the test data to verify whether the model is consistent across datasets, with the following code:

>>> y_pred_test = pd.DataFrame( logistic_model.predict( sm.add_constant(
...     x_test.drop(remove_cols,axis=1))))
>>> y_pred_test.columns = ["probs"]
>>> both_test = pd.concat([y_test,y_pred_test],axis=1)
>>> both_test["y_pred"] = 0
>>> both_test.loc[both_test["probs"] > 0.5, 'y_pred'] = 1
>>> print ("\nTest Confusion Matrix\n\n", pd.crosstab( both_test['class'], ... both_test['y_pred'],rownames = ["Actuall"],colnames = ["Predicted"]))
>>> print ("\nTest Accuracy:", round(accuracy_score( both_test['class'], ... both_test['y_pred']),4))

From the results on the test data, the accuracy obtained is 0.8053, or 80.53 percent; our logistic regression classifier is separating default from non-default very well!
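Since classification_report was imported at the beginning of the section, it can also be used to inspect per-class precision, recall, and F1-score on the test data (an optional addition, not part of the original listing):

>>> print (classification_report(both_test['class'], both_test['y_pred']))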

R code for logistic regression is as follows:

# Variable Importance 
library(mctest)
library(dummies)
library(Information)
library(pROC)
credit_data = read.csv("credit_data.csv")
credit_data$class = credit_data$class-1

# I.V Calculation

IV <- create_infotables(data=credit_data, y="class", parallel=FALSE)
for (i in 1:(length(colnames(credit_data))-1)){
  seca = IV[[1]][i][1]
  print(paste(colnames(credit_data)[i],",IV_Value:",round(sum(seca[[1]][5]),4)))
}

# Dummy variables creation

dummy_stseca =data.frame(dummy(credit_data$Status_of_existing_checking_account))
dummy_ch = data.frame(dummy(credit_data$Credit_history))
dummy_purpose = data.frame(dummy(credit_data$Purpose))
dummy_savacc = data.frame(dummy(credit_data$Savings_Account))
dummy_presc = data.frame(dummy(credit_data$Present_Employment_since))
dummy_perssx = data.frame(dummy(credit_data$Personal_status_and_sex))
dummy_othdts = data.frame(dummy(credit_data$Other_debtors))
dummy_property = data.frame(dummy(credit_data$Property))
dummy_othinstpln = data.frame(dummy(credit_data$Other_installment_plans))
dummy_forgnwrkr = data.frame(dummy(credit_data$Foreign_worker))

# Cleaning the variables name from . to _

colClean <- function(x){ colnames(x) <- gsub("\\.", "_", colnames(x)); x }
dummy_stseca = colClean(dummy_stseca); dummy_ch = colClean(dummy_ch)
dummy_purpose = colClean(dummy_purpose); dummy_savacc = colClean(dummy_savacc)
dummy_presc= colClean(dummy_presc);dummy_perssx= colClean(dummy_perssx);
dummy_othdts= colClean(dummy_othdts);dummy_property= colClean(dummy_property);
dummy_othinstpln= colClean(dummy_othinstpln);dummy_forgnwrkr= colClean(dummy_forgnwrkr);

continuous_columns = c('Duration_in_month', 'Credit_amount','Installment_rate_in_percentage_of_disposable_income', 'Age_in_years','Number_of_existing_credits_at_this_bank')
credit_continuous = credit_data[,continuous_columns]
credit_data_new = cbind(dummy_stseca,dummy_ch,dummy_purpose,dummy_savacc,dummy_presc,dummy_perssx, dummy_othdts,dummy_property,dummy_othinstpln,dummy_forgnwrkr,credit_continuous,credit_data$class)
colnames(credit_data_new)[51] <- "class"

# Setting seed for repeatability of results of train & test split

set.seed(123)
numrow = nrow(credit_data_new)
trnind = sample(1:numrow,size = as.integer(0.7*numrow))
train_data = credit_data_new[trnind,]
test_data = credit_data_new[-trnind,]

remove_cols_extra_dummy = c("Status_of_existing_checking_account_A11","Credit_history_A30", "Purpose_A40", "Savings_Account_A61", "Present_Employment_since_A71", "Personal_status_and_sex_A91" "Other_debtors_A101", "Property_A121", "Other_installment_plans_A141", "Foreign_worker_A201")

# Removing insignificant variables one by one
remove_cols_insig = c("Purpose_A46","Purpose_A45","Purpose_A44","Savings_Account_A63", "Other_installment_plans_A143", "Property_A123", "Status_of_existing_checking_account_A12", "Present_Employment_since_A72", "Present_Employment_since_A75", "Present_Employment_since_A73","Credit_history_A32","Credit_history_A33", "Purpose_A40","Present_Employment_since_A74","Purpose_A49","Purpose_A48", "Property_A122","Personal_status_and_sex_A92","Foreign_worker_A202", "Personal_status_and_sex_A94","Purpose_A42","Other_debtors_A102", "Age_in_years","Savings_Account_A64","Savings_Account_A62", "Savings_Account_A65", "Other_debtors_A103")
remove_cols = c(remove_cols_extra_dummy,remove_cols_insig)
glm_fit = glm(class ~.,family = "binomial",data = train_data[,!(names(train_data) %in% remove_cols)])

# Significance check - p_value
summary(glm_fit)


# Multi collinearity check - VIF
remove_cols_vif = c(remove_cols,"class")
vif_table = imcdiag(train_data[,!(names(train_data) %in% remove_cols_vif)],train_data$class,detr=0.001, conf=0.99)
vif_table

# Predicting probabilities

train_data$glm_probs = predict(glm_fit,newdata = train_data,type = "response")
test_data$glm_probs = predict(glm_fit,newdata = test_data,type = "response")

# Area under ROC

ROC1 <- roc(as.factor(train_data$class),train_data$glm_probs)
plot(ROC1, col = "blue")
print(paste("Area under the curve",round(auc(ROC1),4)))

# Actual prediction based on threshold tuning

threshold_vals = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9)
for (thld in threshold_vals){
train_data$glm_pred = 0
train_data$glm_pred[train_data$glm_probs>thld]=1
tble = table(train_data$glm_pred,train_data$class)
acc = (tble[1,1]+tble[2,2])/sum(tble)
print(paste("Threshold",thld,"Train accuracy",round(acc,4)))
}

# Best threshold from above search is 0.5 with accuracy as 0.7841

best_threshold = 0.5

# Train confusion matrix & accuracy

train_data$glm_pred = 0
train_data$glm_pred[train_data$glm_probs>best_threshold]=1
tble = table(train_data$glm_pred,train_data$class)
acc = (tble[1,1]+tble[2,2])/sum(tble)
print(paste("Confusion Matrix - Train Data"))
print(tble)
print(paste("Train accuracy",round(acc,4)))

# Test confusion matrix & accuracy
test_data$glm_pred = 0
test_data$glm_pred[test_data$glm_probs>best_threshold]=1
tble_test = table(test_data$glm_pred,test_data$class)
acc_test = (tble_test[1,1]+tble_test[2,2])/sum(tble_test)
print(paste("Confusion Matrix - Test Data")) print(tble_test)
print(paste("Test accuracy",round(acc_test,4)))