Customer churn prediction

12 minute read


Introduction

Since acquiring a new customer is far more expensive than retaining an existing one, customer churn is a behavior that banks (like many other businesses) try hard to prevent. A model that can predict which customers are likely to leave is therefore of great value: the bank can identify valuable customers at risk of exiting and take measures to retain them. In this project, we develop a classification model based on logistic regression to predict the churn behavior of customers of a hypothetical bank.

### loading libraries
library(readr)
library(knitr)
library(caret)
library(e1071)
library(generalhoslem)
library(scales)
library(ggplot2)
library(psych)
library(corrplot)
library(table1)

Preparing the data and some exploratory analysis

The data set for this project comes from Kaggle. It contains features of about 10,000 customers of a bank, and the target is a binary variable indicating whether the customer left the bank (closed their account) or remained a customer.

Churn_Data <- read_csv("/Users/hamed/Documents/Academics/Generalized Linear Models/Term Project/Churn_Modelling.csv")
## Parsed with column specification:
## cols(
##   RowNumber = col_double(),
##   CustomerId = col_double(),
##   Surname = col_character(),
##   CreditScore = col_double(),
##   Geography = col_character(),
##   Gender = col_character(),
##   Age = col_double(),
##   Tenure = col_double(),
##   Balance = col_double(),
##   NumOfProducts = col_double(),
##   HasCrCard = col_double(),
##   IsActiveMember = col_double(),
##   EstimatedSalary = col_double(),
##   Exited = col_double()
## )
# Exclude the row-number, ID, and surname columns
Churn_Data <- subset(Churn_Data,select = 4:ncol(Churn_Data))

Churn_Data$Exited <- as.factor(Churn_Data$Exited)
## For Distribution of Response Variable
ggplot2::ggplot(Churn_Data, aes(Exited, fill = Exited)) + geom_bar() + ggtitle("Distribution of response variable")

## For categorical Predictor Variables
# Geography
p1 <- ggplot2::ggplot(Churn_Data, aes(x = Geography, fill = Exited)) +
  geom_bar(position = "dodge") +
  theme(legend.position = "bottom") + labs(title = "Geography vs Exited")

# Gender
p2 <- ggplot2::ggplot(Churn_Data, aes(x = Gender, fill = Exited)) +
  geom_bar(position = "dodge") +
  theme(legend.position = "bottom") + labs(title = "Gender vs Exited")
gridExtra::grid.arrange(p1, p2, ncol = 2)

# For Tenure
p1 <- ggplot2::ggplot(Churn_Data, aes(x = Tenure, fill = Exited)) +
  geom_bar(position = "dodge") +
  theme(legend.position = "bottom") + labs(title = "Tenure vs Exited")

# For Number of Products
p2 <- ggplot2::ggplot(Churn_Data, aes(x = NumOfProducts, fill = Exited)) +
  geom_bar(position = "dodge") +
  theme(legend.position = "bottom") + labs(title = "Num. of Products vs Exited")
gridExtra::grid.arrange(p1, p2, ncol = 2)

# Examining multicollinearity between continuous independent variables
con_var <- subset(Churn_Data, select = c("CreditScore","Age","Tenure","Balance","EstimatedSalary"))
psych::pairs.panels(con_var,
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

kable(cor(con_var))
|                | CreditScore|        Age|     Tenure|    Balance| EstimatedSalary|
|:---------------|-----------:|----------:|----------:|----------:|---------------:|
|CreditScore     |   1.0000000| -0.0039649|  0.0008419|  0.0062684|      -0.0013843|
|Age             |  -0.0039649|  1.0000000| -0.0099968|  0.0283084|      -0.0072010|
|Tenure          |   0.0008419| -0.0099968|  1.0000000| -0.0122539|       0.0077838|
|Balance         |   0.0062684|  0.0283084| -0.0122539|  1.0000000|       0.0127975|
|EstimatedSalary |  -0.0013843| -0.0072010|  0.0077838|  0.0127975|       1.0000000|
cr=cor(con_var,method=c("pearson"))
## For Correaltion Plot
corrplot::corrplot(cr,method="circle")

There is no multicollinearity problem: the correlations are all close to zero. Moreover, as long as our objective is prediction, multicollinearity is not a major concern; even with highly correlated independent variables, the coefficient estimates remain unbiased (only their variances are inflated), so predictive performance is largely unaffected.
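
As an additional check, one could compute variance inflation factors. A minimal sketch using the car package (an assumption, since it is not among the libraries loaded above):

# Variance inflation factors for the continuous predictors (assumes the car
# package is installed); VIF depends only on the predictors, so values near 1
# confirm the absence of multicollinearity
vif_fit <- glm(Exited ~ CreditScore + Age + Tenure + Balance + EstimatedSalary,
               family = binomial(), data = Churn_Data)
car::vif(vif_fit)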

Splitting the dataset into train and test segments

set.seed(648)
ind <- sample(2, nrow(Churn_Data), replace = TRUE, prob = c(0.8, 0.2))
train_data <- Churn_Data[ind==1,]
test_data <- Churn_Data[ind==2,]

# Check proportion of churned and non-churned customers in train and test data
df1 <- data.frame(
  "Churned Customers" = c(sum(train_data$Exited == 1), sum(test_data$Exited == 1)),
  "Churned Proportion" = c(percent(sum(train_data$Exited == 1) / nrow(train_data)), percent(sum(test_data$Exited == 1) / nrow(test_data))),
  "Not-churned Customers" = c(sum(train_data$Exited == 0), sum(test_data$Exited == 0)),
  "Not-churned Proportion" = c(percent(sum(train_data$Exited == 0) / nrow(train_data)), percent(sum(test_data$Exited == 0) / nrow(test_data)))
)
row.names(df1) <- c("Train data", "Test data")
kable(df1, align = 'r', caption = "Proportion of churned customers in the training and test datasets")
Table: Proportion of churned customers in the training and test datasets

|           | Churned.Customers| Churned.Proportion| Not.churned.Customers| Not.churned.Proportion|
|:----------|-----------------:|------------------:|---------------------:|----------------------:|
|Train data |              1665|                21%|                  6342|                    79%|
|Test data  |               372|                19%|                  1621|                    81%|
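
The churn proportion differs slightly between the two splits (21% vs 19%). If an exactly stratified split were preferred, caret (already loaded) provides one; a minimal sketch:

# Stratified 80/20 split that preserves the churn proportion in both parts
set.seed(648)
idx <- createDataPartition(Churn_Data$Exited, p = 0.8, list = FALSE)
strat_train <- Churn_Data[idx, ]
strat_test  <- Churn_Data[-idx, ]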

Suggesting a more parsimonious model

I drop the insignificant independent variables to see whether a more parsimonious model suffices.
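
Throughout, model1 denotes the full logistic regression on all covariates; a minimal sketch of its fit, consistent with how it is used in the deviance test and predictions below:

# Full model: all remaining covariates enter linearly
model1 <- glm(Exited ~ ., family = binomial(), data = train_data)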

model2 <- glm(Exited ~.-CreditScore-Tenure-HasCrCard-EstimatedSalary, family = binomial(), data = train_data)
summary(model2)
## 
## Call:
## glm(formula = Exited ~ . - CreditScore - Tenure - HasCrCard - 
##     EstimatedSalary, family = binomial(), data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3467  -0.6651  -0.4646  -0.2749   2.9265  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.748e+00  1.609e-01 -23.290  < 2e-16 ***
## GeographyGermany  7.273e-01  7.519e-02   9.672  < 2e-16 ***
## GeographySpain   -2.850e-02  7.817e-02  -0.365   0.7154    
## GenderMale       -5.230e-01  6.029e-02  -8.675  < 2e-16 ***
## Age               7.152e-02  2.825e-03  25.316  < 2e-16 ***
## Balance           2.394e-06  5.677e-07   4.217 2.48e-05 ***
## NumOfProducts    -1.226e-01  5.253e-02  -2.334   0.0196 *  
## IsActiveMember   -1.040e+00  6.353e-02 -16.370  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8186.7  on 8006  degrees of freedom
## Residual deviance: 6970.7  on 7999  degrees of freedom
## AIC: 6986.7
## 
## Number of Fisher Scoring iterations: 5
model2$deviance
## [1] 6970.709
# Likelihood-ratio (deviance) test: the 4 df correspond to the four dropped covariates
Dev_test <- abs(model2$deviance - model1$deviance) > qchisq(0.95, 4)
Dev_test
## [1] FALSE

According to the deviance test, the null hypothesis is not rejected: the covariates in model 2 are sufficient to explain the response variable, and the four dropped covariates add no significant explanatory power.
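
Equivalently, this likelihood-ratio test can be run with the built-in anova method for nested GLMs; a minimal sketch:

# Likelihood-ratio test of model 2 (reduced) against model 1 (full);
# the reported p-value replaces the manual chi-squared comparison above
anova(model2, model1, test = "Chisq")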

Hosmer-Lemeshow goodness-of-fit statistic

hoslem <- logitgof(train_data$Exited, model2$fitted.values, g=10)
hoslem
## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  train_data$Exited, model2$fitted.values
## X-squared = 13.22, df = 8, p-value = 0.1045

We cannot reject the null hypothesis (that the model fits well) at the 95% confidence level (p = 0.1045), so there is no evidence of lack of fit.

Confusion matrix and checking accuracy of models

Model with all independent variables

### Accuracy of prediction for model 1

prediction1 <- predict(model1, test_data,type = "response")
threshold = 0.5
prediction1[prediction1 >= threshold] <- 1
prediction1[prediction1 < threshold] <- 0
confusionMatrix(as.factor(prediction1), test_data$Exited, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1572  291
##          1   49   81
##                                           
##                Accuracy : 0.8294          
##                  95% CI : (0.8122, 0.8457)
##     No Information Rate : 0.8133          
##     P-Value [Acc > NIR] : 0.03398         
##                                           
##                   Kappa : 0.2502          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.21774         
##             Specificity : 0.96977         
##          Pos Pred Value : 0.62308         
##          Neg Pred Value : 0.84380         
##              Prevalence : 0.18665         
##          Detection Rate : 0.04064         
##    Detection Prevalence : 0.06523         
##       Balanced Accuracy : 0.59376         
##                                           
##        'Positive' Class : 1               
## 

Although the accuracy of this model exceeds 82%, accuracy alone is a misleading criterion. We are mainly interested in identifying churning customers correctly (i.e., sensitivity), and this model catches only about 22% of them, which is poor for our objective. We can improve sensitivity (at the cost of specificity, and therefore accuracy) by changing the classification threshold. The plot below shows how these three criteria change across different thresholds.

thresholds <- seq(0.1, 1, 0.1)
acc <- rep(0, 10)
sens <- rep(0, 10)
spec <- rep(0, 10)
for (i in 1:10){
  pred <- predict(model1, test_data, type = "response")
  pred[pred >= thresholds[i]] <- 1
  pred[pred < thresholds[i]] <- 0
  # Set the factor levels explicitly so extreme thresholds (where only one
  # class is predicted) do not trigger level-mismatch warnings
  pred <- factor(pred, levels = c(0, 1))
  cm <- confusionMatrix(pred, test_data$Exited, positive = '1')
  acc[i] <- cm$overall['Accuracy']
  sens[i] <- cm$byClass['Sensitivity']
  spec[i] <- cm$byClass['Specificity']
}
plot(thresholds,acc, type = "o", pch=19, ylim = c(0,1), xlab = "Thresholds", ylab = "", main = "Accuracy, Sensitivity and Specificity in different thresholds of pi")
points(thresholds,sens, type = "o", pch=19, col="red")
points(thresholds,spec, type = "o", pch=19, col="blue")

legend("right",c("Accuracy","Sensitivity","Specificity"),pch=c(19,19,19),col = c("black","red","blue"))

Parsimonious model

### Accuracy of prediction for model 2

prediction2 <- predict(model2, test_data,type = "response")
threshold = 0.5
prediction2[prediction2 >= threshold] <- 1
prediction2[prediction2 < threshold] <- 0
confusionMatrix(as.factor(prediction2), test_data$Exited, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1571  292
##          1   50   80
##                                           
##                Accuracy : 0.8284          
##                  95% CI : (0.8111, 0.8447)
##     No Information Rate : 0.8133          
##     P-Value [Acc > NIR] : 0.04384         
##                                           
##                   Kappa : 0.2458          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.21505         
##             Specificity : 0.96915         
##          Pos Pred Value : 0.61538         
##          Neg Pred Value : 0.84326         
##              Prevalence : 0.18665         
##          Detection Rate : 0.04014         
##    Detection Prevalence : 0.06523         
##       Balanced Accuracy : 0.59210         
##                                           
##        'Positive' Class : 1               
## 
thresholds <- seq(0.1, 1, 0.1)
acc2 <- rep(0, 10)
sens2 <- rep(0, 10)
spec2 <- rep(0, 10)
for (i in 1:10){
  pred2 <- predict(model2, test_data, type = "response")
  pred2[pred2 >= thresholds[i]] <- 1
  pred2[pred2 < thresholds[i]] <- 0
  # As above, fix the factor levels to avoid warnings at extreme thresholds
  pred2 <- factor(pred2, levels = c(0, 1))
  cm2 <- confusionMatrix(pred2, test_data$Exited, positive = '1')
  acc2[i] <- cm2$overall['Accuracy']
  sens2[i] <- cm2$byClass['Sensitivity']
  spec2[i] <- cm2$byClass['Specificity']
}
plot(thresholds,acc2, type = "o", pch=19, ylim = c(0,1), xlab = "Thresholds", ylab = "" ,main = "Accuracy, Sensitivity and Specificity in different thresholds of pi")
points(thresholds,sens2, type = "o", pch=19, col="red")
points(thresholds,spec2, type = "o", pch=19, col="blue")
points(thresholds,sens, type = "o", pch=19, col="green")
legend("right",c("Accuracy-Model2","Sensitivity-Model2","Specificity-Model2", "Sensitivity-Model1"),pch=c(19,19,19,19),col = c("black","red","blue","green"))

Since sensitivity is what matters most to us, I have also plotted the sensitivity of model 1 (green) for comparison: the two models are practically indistinguishable.

Policy impacts of the model

In this section, we extract some policy takeaways from the interpretation of the model's coefficients (the odds ratios behind these numbers are computed in the sketch after the list):
* First, and most obvious: the more intensely and frequently a customer uses the company's products, the more loyal they are. Being an active member lowers the odds of churn by about 65% (coefficient of IsActiveMember: -1.04).
* Second, the coefficient of NumOfProducts (-0.123) implies that each additional product a customer uses lowers the odds of churn by about 12%. The company should therefore actively pursue cross-selling in its sales policies.
* Geographically, the odds of churn in Germany are about twice those in France. The company might therefore take measures to improve the loyalty of its German customers.
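
These numbers follow directly from exponentiating the coefficients of model 2:

# Odds ratios: the multiplicative change in churn odds per unit
# increase in each covariate
round(exp(coef(model2)), 3)
# From the coefficients reported above: exp(-0.1226) ≈ 0.885, i.e. about a
# 12% drop in odds per additional product, and exp(0.7273) ≈ 2.07, i.e.
# roughly doubled churn odds for Germany relative to France (the baseline).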