Gradient boosting classification


We use a classification model to predict which customers will default on their credit card debt.

Data

The credit data is a simulated data set containing information on ten thousand customers (taken from ). We want to predict which customers will default on their credit card debt, i.e., fail to repay it. The data set contains the following variables:

default: A categorical variable with levels No and Yes indicating whether the customer defaulted on their debt
student: A categorical variable with levels No and Yes indicating whether the customer is a student
balance: The average balance that the customer has remaining on their credit card after making their monthly payment
income: Income of customer

Import data

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/kirenz/classification/main/_static/data/Default.csv')

Inspect data

df

	default	student	balance	income
0	No	No	729.526495	44361.625074
1	No	Yes	817.180407	12106.134700
2	No	No	1073.549164	31767.138947
3	No	No	529.250605	35704.493935
4	No	No	785.655883	38463.495879
...	...	...	...	...
9995	No	No	711.555020	52992.378914
9996	No	No	757.962918	19660.721768
9997	No	No	845.411989	58636.156984
9998	No	No	1569.009053	36669.112365
9999	No	Yes	200.922183	16862.952321

10000 rows × 4 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   default  10000 non-null  object 
 1   student  10000 non-null  object 
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: float64(2), object(2)
memory usage: 312.6+ KB

# check for missing values
print(df.isnull().sum())

default    0
student    0
balance    0
income     0
dtype: int64

Data preparation

Categorical data

First, we convert categorical data into indicator variables:

dummies = pd.get_dummies(df[['default', 'student']], drop_first=True, dtype=float)
dummies.head(3)

	default_Yes	student_Yes
0	0.0	0.0
1	0.0	1.0
2	0.0	0.0
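To see what drop_first=True does, here is a minimal sketch on a small made-up frame (the toy data is hypothetical and not taken from the credit data). With two levels per variable, dropping the first level leaves one indicator column per variable, and the dropped level (No) becomes the baseline:

```python
import pandas as pd

# Hypothetical toy data with the same two categorical columns
toy = pd.DataFrame({
    "default": ["No", "Yes", "No"],
    "student": ["Yes", "No", "No"],
})

# Without drop_first: one indicator column per level (4 columns)
full = pd.get_dummies(toy, dtype=float)

# With drop_first: the first level ("No") is dropped and acts as the baseline
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)

print(list(full.columns))
print(list(reduced.columns))
```

The reduced frame carries the same information as the full one, because default_No is always 1 minus default_Yes.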

# combine data and drop original categorical variables
df = pd.concat([df, dummies], axis=1).drop(columns = ['default', 'student'])
df.head(3)

	balance	income	default_Yes	student_Yes
0	729.526495	44361.625074	0.0	0.0
1	817.180407	12106.134700	0.0	1.0
2	1073.549164	31767.138947	0.0	0.0

Label and features

Next, we separate the label y from the features X:

y = df['default_Yes']
X = df.drop(columns = 'default_Yes')

Train test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 1)
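The split above is purely random. When one class is rare, as defaults are here, a stratified split is a common alternative because it keeps the class proportions equal in the training and test data. The following is a minimal sketch on made-up data (X_toy and y_toy are hypothetical stand-ins, not the credit data), using the stratify parameter of train_test_split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 1000 rows, 5% positives
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"balance": rng.normal(800, 400, size=1000)})
y_toy = pd.Series([1.0] * 50 + [0.0] * 950)

# stratify=y_toy keeps the share of positives the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Whether stratification changes the results for the credit data is not shown here; the sketch only illustrates the mechanism.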

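Data exploration

We combine the training features and labels into one data frame for exploratory data analysis, so that the test data is not used for exploration.

# copy the training features and add the label as a column
train_dataset = X_train.copy()
train_dataset['default_Yes'] = y_train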

import seaborn as sns

sns.pairplot(train_dataset, hue='default_Yes');

[Figure: pair plot of the training data, colored by default_Yes]

Model

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100,
                                 learning_rate=1.0,
                                 max_depth=1,
                                 random_state=0)

# fit the model once on the training data
clf.fit(X_train, y_train)

# predict class labels for the test data
y_pred = clf.predict(X_test)

# Return the mean accuracy on the given test data and labels:
clf.score(X_test, y_test)

0.9696666666666667
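A single score hides how the ensemble develops over the boosting iterations. As a minimal sketch, the following uses staged_predict to compute the test accuracy after each stage. It runs on synthetic data from make_classification (an assumption made so the example is self-contained, not the credit data) and uses only 20 estimators to keep it short:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the credit data)
X_syn, y_syn = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1
)

gb = GradientBoostingClassifier(
    n_estimators=20, learning_rate=1.0, max_depth=1, random_state=0
).fit(X_tr, y_tr)

# staged_predict yields the test predictions after each boosting stage
staged_accuracy = [
    accuracy_score(y_te, y_stage) for y_stage in gb.staged_predict(X_te)
]

print(len(staged_accuracy), staged_accuracy[-1])
```

The last entry corresponds to the full ensemble, so it equals the value returned by score on the same test data.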

Confusion matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=clf.classes_)
disp.plot()
plt.show()

[Figure: confusion matrix for the test data]
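The cells of a binary confusion matrix can also be unpacked directly to compute precision and recall by hand. The following minimal sketch uses made-up labels and predictions (y_true and y_hat are hypothetical, not the model output from above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (0 = No, 1 = Yes)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted defaults that are correct
recall = tp / (tp + fn)     # share of actual defaults that are found

print(tn, fp, fn, tp)
print(precision, recall)
```

Rows of the matrix correspond to the true classes and columns to the predicted classes, which is why the false positives sit in the top-right cell.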

Classification report

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=['No', 'Yes'], zero_division=0))

              precision    recall  f1-score   support

          No       0.97      1.00      0.98      2909
         Yes       0.00      0.00      0.00        91

    accuracy                           0.97      3000
   macro avg       0.48      0.50      0.49      3000
weighted avg       0.94      0.97      0.95      3000
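The report shows a recall of 0.00 for the class Yes, so the high accuracy comes entirely from the majority class. As a quick check, using only the support values from the report above, we can compute the accuracy of a rule that always predicts No:

```python
# Support values taken from the classification report above
n_no, n_yes = 2909, 91

# Accuracy of a classifier that always predicts "No"
baseline_accuracy = n_no / (n_no + n_yes)

print(round(baseline_accuracy, 4))
```

This baseline matches the test accuracy of about 0.97 reported earlier, which suggests that accuracy alone is not an informative metric for this imbalanced data.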

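Change threshold

By default, predict assigns the class with the highest predicted probability, which for two classes corresponds to a threshold of 0.5. Here we instead classify a customer as a default when the predicted probability of default exceeds 0.25: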

pred_proba = clf.predict_proba(X_test)

df_ = pd.DataFrame({'y_test': y_test, 'y_pred': pred_proba[:,1] > .25})
cm = confusion_matrix(y_test, df_['y_pred'])

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=clf.classes_)
disp.plot()
plt.show()
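Lowering the threshold can only keep or increase the number of predicted defaults, so recall cannot fall, while precision often does. The following minimal sketch compares two thresholds on made-up labels and probabilities (y_true and proba are hypothetical values chosen for illustration, not output of the model above):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities for class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.20, 0.35, 0.55, 0.90])

results = {}
for threshold in (0.5, 0.25):
    # Classify as 1 when the probability exceeds the threshold
    y_hat = (proba > threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_hat, zero_division=0),
        recall_score(y_true, y_hat, zero_division=0),
    )
    print(threshold, results[threshold])
```

In this toy example, moving the threshold from 0.5 to 0.25 raises recall from 0.5 to 0.75 and lowers precision from about 0.67 to 0.5. Which trade-off is acceptable depends on the relative cost of missed defaults and false alarms.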

Classification report

print(classification_report(y_test, df_['y_pred'], zero_division=0))

              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98      2909
         1.0       0.00      0.00      0.00        91

    accuracy                           0.97      3000
   macro avg       0.48      0.50      0.49      3000
weighted avg       0.94      0.97      0.95      3000
