Costumer churn prediction
<!DOCTYPE html>
Introduction
Since acquiring a new customer is far more expensive than retaining an existing one, customers’ exit is a seriously undesirable customer behavior that banks (like many other businesses) try to prevent. In this regard, developing a model that can predict a customer’s exit comes to a great importance. Using such model, banks can predict exit of their valuable customers and take measures to prevent that. In this project, we develop a classification model (based on logistic regression) to predict churn behavior of customers of a hypothetical bank.
### loading libraries
library(readr)
library(knitr)
library(caret)
library(e1071)
library(generalhoslem)
library(scales)
library(ggplot2)
library(psych)
library(corrplot)
library(table1)
Prepairing data and some exploratory analysis
The data set for this project is selected from Kaggle. This data set consists of different features of about 10000 customers of a bank and the target variable is a binary variable reflecting the fact whether the customer left the bank (closed his account) or he continues to be a customer.
Churn_Data <- read_csv("/Users/hamed/Documents/Academics/Generalized Linear Models/Term Project/Churn_Modelling.csv")
## Parsed with column specification:
## cols(
## RowNumber = col_double(),
## CustomerId = col_double(),
## Surname = col_character(),
## CreditScore = col_double(),
## Geography = col_character(),
## Gender = col_character(),
## Age = col_double(),
## Tenure = col_double(),
## Balance = col_double(),
## NumOfProducts = col_double(),
## HasCrCard = col_double(),
## IsActiveMember = col_double(),
## EstimatedSalary = col_double(),
## Exited = col_double()
## )
# Excluding columns of id and name
Churn_Data <- subset(Churn_Data,select = 4:ncol(Churn_Data))
Churn_Data$Exited <- as.factor(Churn_Data$Exited)
## For Distribution of Response Variable
ggplot2::ggplot(Churn_Data,aes(Exited,fill =Exited))+geom_bar() + ggtitle("Distribution of Response variable")
## For categorical Predictor Variables
# Geography
p1=ggplot2::ggplot(Churn_Data,aes(x =Geography ,fill = Exited)) +
geom_bar(position = "Dodge") +
theme(legend.position = "bottom") +labs(title = "Geography Vs Exited")
# Gender
p2=ggplot2::ggplot(Churn_Data,aes(x = Gender ,fill = Exited)) + geom_bar(position = "Dodge") +
theme(legend.position = "bottom") + labs(title = "Gender Vs Exited")
gridExtra::grid.arrange(p1, p2, ncol = 2)
# For Tenure
p1=ggplot2::ggplot(Churn_Data,aes(x = Tenure ,fill = Exited)) + geom_bar(position = "Dodge") +
theme(legend.position = "bottom") + labs(title = "Tensure Vs Exited")
# For Number of Products
p2=ggplot2::ggplot(Churn_Data,aes(x = NumOfProducts ,fill = Exited)) + geom_bar(position = "Dodge") +
theme(legend.position = "bottom") + labs(title = "Num. Of Products Vs Exited")
gridExtra::grid.arrange(p1, p2, ncol = 2)
# Examining multi-collinearity between continuous independant variables.
con_var <- subset(Churn_Data, select = c("CreditScore","Age","Tenure","Balance","EstimatedSalary"))
psych::pairs.panels(con_var,
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
kable(cor(con_var))
CreditScore | Age | Tenure | Balance | EstimatedSalary | |
---|---|---|---|---|---|
CreditScore | 1.0000000 | -0.0039649 | 0.0008419 | 0.0062684 | -0.0013843 |
Age | -0.0039649 | 1.0000000 | -0.0099968 | 0.0283084 | -0.0072010 |
Tenure | 0.0008419 | -0.0099968 | 1.0000000 | -0.0122539 | 0.0077838 |
Balance | 0.0062684 | 0.0283084 | -0.0122539 | 1.0000000 | 0.0127975 |
EstimatedSalary | -0.0013843 | -0.0072010 | 0.0077838 | 0.0127975 | 1.0000000 |
cr=cor(con_var,method=c("pearson"))
## For Correaltion Plot
corrplot::corrplot(cr,method="circle")
There is no multi-collinearity problem. correlations are not extreme and basically as long as our objective is prediction, multi-collinearity is not a concern. Because even in extreme case of correlation between independant variables, coefficients’ estimates are still unbiased.
Spliting dataset to train and test segments
set.seed(648)
ind <- sample(2, nrow(Churn_Data), replace = TRUE, prob = c(0.8, 0.2))
train_data <- Churn_Data[ind==1,]
test_data <- Churn_Data[ind==2,]
#Check proportion of churned and unchurned users in train and test data
df1 <- data.frame(
"Chruned Customers" = c(sum(train_data$Exited == 1),sum(test_data$Exited == 1)),
"Chruned Proportion" = c(percent(sum(train_data$Exited == 1) / nrow(train_data)), percent(sum(test_data$Exited == 1) / nrow(test_data))),
"Not-churned Customers" = c(sum(train_data$Exited == 0),sum(test_data$Exited == 0)),
"Not-chruned Proportion" = c(percent(sum(train_data$Exited == 0) / nrow(train_data)), percent(sum(test_data$Exited == 0) / nrow(test_data)))
)
row.names(df1) <- c("Train data", "Test data")
kable(df1, align = 'r',caption = "Proportion of Churned Customers in training and test datasets")
Chruned.Customers | Chruned.Proportion | Not.churned.Customers | Not.chruned.Proportion | |
---|---|---|---|---|
Train data | 1665 | 21% | 6342 | 79% |
Test data | 372 | 19% | 1621 | 81% |
Choosing link function
I fitted model 1, which is the model with all independant variables and data set. I also tried to find influential values (outliers) using cook’s distance, but results are a bit strange. Standardized residuals does not show any problematic point. it will be great if you could add some analyses to this part
#### MODEL 1
model1 <- glm(Exited ~., family = binomial(), data = train_data)
summary(model1)
##
## Call:
## glm(formula = Exited ~ ., family = binomial(), data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3013 -0.6670 -0.4632 -0.2745 2.9528
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.391e+00 2.720e-01 -12.465 < 2e-16 ***
## CreditScore -4.582e-04 3.107e-04 -1.475 0.1402
## GeographyGermany 7.305e-01 7.527e-02 9.706 < 2e-16 ***
## GeographySpain -2.749e-02 7.821e-02 -0.351 0.7252
## GenderMale -5.216e-01 6.034e-02 -8.644 < 2e-16 ***
## Age 7.157e-02 2.827e-03 25.315 < 2e-16 ***
## Tenure -1.551e-02 1.037e-02 -1.497 0.1345
## Balance 2.376e-06 5.682e-07 4.182 2.89e-05 ***
## NumOfProducts -1.228e-01 5.258e-02 -2.336 0.0195 *
## HasCrCard -4.335e-02 6.590e-02 -0.658 0.5107
## IsActiveMember -1.042e+00 6.366e-02 -16.363 < 2e-16 ***
## EstimatedSalary 4.618e-07 5.245e-07 0.880 0.3786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8186.7 on 8006 degrees of freedom
## Residual deviance: 6965.1 on 7995 degrees of freedom
## AIC: 6989.1
##
## Number of Fisher Scoring iterations: 5
#### Maybe we should drop this section too
#plot residuals against continuous explanatory variables
plot(train_data$CreditScore,rstandard(model1))
plot(train_data$Age,rstandard(model1))
plot(train_data$Balance,rstandard(model1))
plot(train_data$EstimatedSalary,rstandard(model1))
### ModeL 1 - Probit link function
model1_prob <- glm(Exited ~., family = binomial(link = "probit"), data = train_data)
summary(model1_prob)
##
## Call:
## glm(formula = Exited ~ ., family = binomial(link = "probit"),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2781 -0.6772 -0.4654 -0.2482 3.1883
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.985e+00 1.542e-01 -12.871 < 2e-16 ***
## CreditScore -2.930e-04 1.764e-04 -1.661 0.0967 .
## GeographyGermany 4.195e-01 4.338e-02 9.669 < 2e-16 ***
## GeographySpain -1.323e-02 4.357e-02 -0.304 0.7614
## GenderMale -3.010e-01 3.419e-02 -8.803 < 2e-16 ***
## Age 4.127e-02 1.603e-03 25.744 < 2e-16 ***
## Tenure -8.475e-03 5.893e-03 -1.438 0.1504
## Balance 1.377e-06 3.199e-07 4.305 1.67e-05 ***
## NumOfProducts -6.239e-02 3.015e-02 -2.069 0.0385 *
## HasCrCard -2.605e-02 3.741e-02 -0.696 0.4862
## IsActiveMember -5.641e-01 3.525e-02 -16.004 < 2e-16 ***
## EstimatedSalary 2.637e-07 2.977e-07 0.886 0.3758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8186.7 on 8006 degrees of freedom
## Residual deviance: 6966.6 on 7995 degrees of freedom
## AIC: 6990.6
##
## Number of Fisher Scoring iterations: 5
### ModeL 1 - Cloglog link function
model1_cloglog <- glm(Exited ~., family = binomial(link = "cloglog"), data = train_data)
summary(model1_cloglog)
##
## Call:
## glm(formula = Exited ~ ., family = binomial(link = "cloglog"),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1631 -0.6637 -0.4759 -0.3052 2.8081
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.959e+00 2.234e-01 -13.243 < 2e-16 ***
## CreditScore -3.753e-04 2.558e-04 -1.467 0.1423
## GeographyGermany 5.725e-01 6.129e-02 9.341 < 2e-16 ***
## GeographySpain -2.076e-02 6.732e-02 -0.308 0.7578
## GenderMale -4.230e-01 5.024e-02 -8.419 < 2e-16 ***
## Age 5.510e-02 2.166e-03 25.440 < 2e-16 ***
## Tenure -1.433e-02 8.588e-03 -1.669 0.0951 .
## Balance 2.014e-06 4.817e-07 4.182 2.89e-05 ***
## NumOfProducts -1.065e-01 4.328e-02 -2.462 0.0138 *
## HasCrCard -4.797e-02 5.464e-02 -0.878 0.3800
## IsActiveMember -9.478e-01 5.396e-02 -17.564 < 2e-16 ***
## EstimatedSalary 3.388e-07 4.338e-07 0.781 0.4348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8186.7 on 8006 degrees of freedom
## Residual deviance: 6989.8 on 7995 degrees of freedom
## AIC: 7013.8
##
## Number of Fisher Scoring iterations: 8
df2 <- data.frame(
"Deviance" = c(model1$deviance, model1_prob$deviance, model1_cloglog$deviance),
"AIC" = c(model1$aic, model1_prob$aic, model1_cloglog$aic)
)
row.names(df2) <- c("Logit","Probit","Cloglog")
kable(df2)
Deviance | AIC | |
---|---|---|
Logit | 6965.088 | 6989.088 |
Probit | 6966.581 | 6990.581 |
Cloglog | 6989.756 | 7013.756 |
Regarding deviance and AIC criteria (which are necessarily consistent) model with the logit link function fits the best and we proceed our analysis with logit link function.
Suggesting a more parsimonious model
I am dropping insignificant independant variables to see if we can have more parsimonous model
model2 <- glm(Exited ~.-CreditScore-Tenure-HasCrCard-EstimatedSalary, family = binomial(), data = train_data)
summary(model2)
##
## Call:
## glm(formula = Exited ~ . - CreditScore - Tenure - HasCrCard -
## EstimatedSalary, family = binomial(), data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3467 -0.6651 -0.4646 -0.2749 2.9265
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.748e+00 1.609e-01 -23.290 < 2e-16 ***
## GeographyGermany 7.273e-01 7.519e-02 9.672 < 2e-16 ***
## GeographySpain -2.850e-02 7.817e-02 -0.365 0.7154
## GenderMale -5.230e-01 6.029e-02 -8.675 < 2e-16 ***
## Age 7.152e-02 2.825e-03 25.316 < 2e-16 ***
## Balance 2.394e-06 5.677e-07 4.217 2.48e-05 ***
## NumOfProducts -1.226e-01 5.253e-02 -2.334 0.0196 *
## IsActiveMember -1.040e+00 6.353e-02 -16.370 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8186.7 on 8006 degrees of freedom
## Residual deviance: 6970.7 on 7999 degrees of freedom
## AIC: 6986.7
##
## Number of Fisher Scoring iterations: 5
model2$deviance
## [1] 6970.709
Dev_test <- abs(model2$deviance - model1$deviance) > qchisq(0.95,4)
Dev_test
## [1] FALSE
According to Deviance test, null hypothesis is not rejected and covariates in model 2 are enough to explain response variable.
Hosmer-Lemeshow goodness-of-fit statistic
hoslem <- logitgof(train_data$Exited, model2$fitted.values, g=10)
hoslem
##
## Hosmer and Lemeshow test (binary model)
##
## data: train_data$Exited, model2$fitted.values
## X-squared = 13.22, df = 8, p-value = 0.1045
we cannot reject null-hypothesis (goodness of fit of this model) in %95 significance level
Confusion matrix and checking accuracy of models
Model with all independant variables
###accuracy of prediction for model 1
prediction1 <- predict(model1, test_data,type = "response")
threshold = 0.5
prediction1[prediction1 >= threshold] <- 1
prediction1[prediction1 < threshold] <- 0
confusionMatrix(as.factor(prediction1), test_data$Exited, positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1572 291
## 1 49 81
##
## Accuracy : 0.8294
## 95% CI : (0.8122, 0.8457)
## No Information Rate : 0.8133
## P-Value [Acc > NIR] : 0.03398
##
## Kappa : 0.2502
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.21774
## Specificity : 0.96977
## Pos Pred Value : 0.62308
## Neg Pred Value : 0.84380
## Prevalence : 0.18665
## Detection Rate : 0.04064
## Detection Prevalence : 0.06523
## Balanced Accuracy : 0.59376
##
## 'Positive' Class : 1
##
Although accuracy of this model is more than %82, attending accuracy as the only criteria is misleading. Since, we are mainly interested in predicting churning users precisely (i.e. sensitivity) and this model predicts churning users correctly in %22 of cases. This is terrible for our objective. Thus, we can improve sensitivity (obviously at the cost of specificity and therefore accuracy) by changing threshold of classificatin. In the plot below, we examine how these three criteria change in deferent thresholds.
thresholds <- seq(0.1,1,0.1)
acc <- rep(0,10)
sens <- rep(0,10)
spec <- rep(0,10)
for (i in 1:10){
pred <- predict(model1, test_data,type = "response")
pred[pred >= thresholds[i]] <- 1
pred[pred < thresholds[i]] <- 0
c <- confusionMatrix(as.factor(pred), test_data$Exited, positive = '1')
acc[i] <- c$overall['Accuracy']
sens[i] <- c$byClass['Sensitivity']
spec[i] <- c$byClass['Specificity']
}
## Warning in confusionMatrix.default(as.factor(pred), test_data$Exited, positive =
## "1"): Levels are not in the same order for reference and data. Refactoring data
## to match.
## Warning in confusionMatrix.default(as.factor(pred), test_data$Exited, positive =
## "1"): Levels are not in the same order for reference and data. Refactoring data
## to match.
plot(thresholds,acc, type = "o", pch=19, ylim = c(0,1), xlab = "Thresholds", ylab = "", main = "Accuracy, Sensitivity and Specificity in different thresholds of pi")
points(thresholds,sens, type = "o", pch=19, col="red")
points(thresholds,spec, type = "o", pch=19, col="blue")
legend("right",c("Accuracy","Sensitivity","Specificity"),pch=c(19,19,19),col = c("black","red","blue"))
Parsimonious model
###accuracy of prediction for model 2
prediction2 <- predict(model2, test_data,type = "response")
threshold = 0.5
prediction2[prediction2 >= threshold] <- 1
prediction2[prediction2 < threshold] <- 0
confusionMatrix(as.factor(prediction2), test_data$Exited, positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1571 292
## 1 50 80
##
## Accuracy : 0.8284
## 95% CI : (0.8111, 0.8447)
## No Information Rate : 0.8133
## P-Value [Acc > NIR] : 0.04384
##
## Kappa : 0.2458
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.21505
## Specificity : 0.96915
## Pos Pred Value : 0.61538
## Neg Pred Value : 0.84326
## Prevalence : 0.18665
## Detection Rate : 0.04014
## Detection Prevalence : 0.06523
## Balanced Accuracy : 0.59210
##
## 'Positive' Class : 1
##
thresholds <- seq(0.1,1,0.1)
acc2 <- rep(0,10)
sens2 <- rep(0,10)
spec2 <- rep(0,10)
for (i in 1:10){
pred2 <- predict(model2, test_data,type = "response")
pred2[pred2 >= thresholds[i]] <- 1
pred2[pred2 < thresholds[i]] <- 0
c2 <- confusionMatrix(as.factor(pred2), test_data$Exited, positive = '1')
acc2[i] <- c2$overall['Accuracy']
sens2[i] <- c2$byClass['Sensitivity']
spec2[i] <- c2$byClass['Specificity']
}
## Warning in confusionMatrix.default(as.factor(pred2), test_data$Exited, positive
## = "1"): Levels are not in the same order for reference and data. Refactoring
## data to match.
## Warning in confusionMatrix.default(as.factor(pred2), test_data$Exited, positive
## = "1"): Levels are not in the same order for reference and data. Refactoring
## data to match.
plot(thresholds,acc2, type = "o", pch=19, ylim = c(0,1), xlab = "Thresholds", ylab = "" ,main = "Accuracy, Sensitivity and Specificity in different thresholds of pi")
points(thresholds,sens2, type = "o", pch=19, col="red")
points(thresholds,spec2, type = "o", pch=19, col="blue")
points(thresholds,sens, type = "o", pch=19, col="green")
legend("right",c("Accuracy-Model2","Sensitivity-Model2","Specificity-Model2", "Sensitivity-Model1"),pch=c(19,19,19,19),col = c("black","red","blue","green"))
Since sensitivity is important to us, I have also compared with the sensitivity of model 1. Not much difference can be seen.
Policy impacts of the model
In this section, we aim to extract some policy takeaways from interpretation of coefficients of the model:
* First and the most obvious one, The more a customer is intensely and frequently using products of the company, the more loyal they will be.
* Second, The coefficient of ‘Number of products’ shows that if a customer uses one more product of the company, odds of their churn drops by about 20%. Therefore, the company should actively pursue cross-selling in their sales policies.
* Geographically, odds of churn in Germany is almost 2 times compared to France. Thus, the company may take some measures improve loyalty of its customers in Germany.