Costumer churn prediction

12 minute read

<!DOCTYPE html>

2020-03-20-Churn.utf8

Introduction

Since acquiring a new customer is far more expensive than retaining an existing one, customers’ exit is a seriously undesirable customer behavior that banks (like many other businesses) try to prevent. In this regard, developing a model that can predict a customer’s exit comes to a great importance. Using such model, banks can predict exit of their valuable customers and take measures to prevent that. In this project, we develop a classification model (based on logistic regression) to predict churn behavior of customers of a hypothetical bank.

### loading libraries
library(readr)
library(knitr)
library(caret)
library(e1071)
library(generalhoslem)
library(scales)
library(ggplot2)
library(psych)
library(corrplot)
library(table1)

Prepairing data and some exploratory analysis

The data set for this project is selected from Kaggle. This data set consists of different features of about 10000 customers of a bank and the target variable is a binary variable reflecting the fact whether the customer left the bank (closed his account) or he continues to be a customer.

Churn_Data <- read_csv("/Users/hamed/Documents/Academics/Generalized Linear Models/Term Project/Churn_Modelling.csv")

## Parsed with column specification:
## cols(
##   RowNumber = col_double(),
##   CustomerId = col_double(),
##   Surname = col_character(),
##   CreditScore = col_double(),
##   Geography = col_character(),
##   Gender = col_character(),
##   Age = col_double(),
##   Tenure = col_double(),
##   Balance = col_double(),
##   NumOfProducts = col_double(),
##   HasCrCard = col_double(),
##   IsActiveMember = col_double(),
##   EstimatedSalary = col_double(),
##   Exited = col_double()
## )

# Excluding columns of id and name
Churn_Data <- subset(Churn_Data,select = 4:ncol(Churn_Data))

Churn_Data$Exited <- as.factor(Churn_Data$Exited)

## For Distribution of Response Variable
ggplot2::ggplot(Churn_Data,aes(Exited,fill =Exited))+geom_bar() + ggtitle("Distribution of Response variable")

## For categorical Predictor Variables
# Geography
p1=ggplot2::ggplot(Churn_Data,aes(x =Geography ,fill = Exited)) + 
geom_bar(position = "Dodge") +   
theme(legend.position = "bottom") +labs(title = "Geography Vs Exited")

# Gender
p2=ggplot2::ggplot(Churn_Data,aes(x = Gender ,fill = Exited)) + geom_bar(position = "Dodge") +
theme(legend.position = "bottom") + labs(title = "Gender Vs Exited")
gridExtra::grid.arrange(p1, p2, ncol = 2)

# For Tenure
p1=ggplot2::ggplot(Churn_Data,aes(x = Tenure ,fill = Exited)) + geom_bar(position = "Dodge") +
theme(legend.position = "bottom") + labs(title = "Tensure Vs Exited")

# For Number of Products
p2=ggplot2::ggplot(Churn_Data,aes(x = NumOfProducts ,fill = Exited)) + geom_bar(position = "Dodge") +
theme(legend.position = "bottom") + labs(title = "Num. Of Products Vs Exited")
gridExtra::grid.arrange(p1, p2, ncol = 2)

# Examining multi-collinearity between continuous independant variables.
con_var <- subset(Churn_Data, select = c("CreditScore","Age","Tenure","Balance","EstimatedSalary"))
psych::pairs.panels(con_var,
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

kable(cor(con_var))

	CreditScore	Age	Tenure	Balance	EstimatedSalary
CreditScore	1.0000000	-0.0039649	0.0008419	0.0062684	-0.0013843
Age	-0.0039649	1.0000000	-0.0099968	0.0283084	-0.0072010
Tenure	0.0008419	-0.0099968	1.0000000	-0.0122539	0.0077838
Balance	0.0062684	0.0283084	-0.0122539	1.0000000	0.0127975
EstimatedSalary	-0.0013843	-0.0072010	0.0077838	0.0127975	1.0000000

cr=cor(con_var,method=c("pearson"))
## For Correaltion Plot
corrplot::corrplot(cr,method="circle")

There is no multi-collinearity problem. correlations are not extreme and basically as long as our objective is prediction, multi-collinearity is not a concern. Because even in extreme case of correlation between independant variables, coefficients’ estimates are still unbiased.

Spliting dataset to train and test segments

set.seed(648)
ind <- sample(2, nrow(Churn_Data), replace = TRUE, prob = c(0.8, 0.2))
train_data <- Churn_Data[ind==1,]
test_data <- Churn_Data[ind==2,]

#Check proportion of churned and unchurned users in train and test data
df1 <- data.frame(
  "Chruned Customers" = c(sum(train_data$Exited == 1),sum(test_data$Exited == 1)),
  "Chruned Proportion" = c(percent(sum(train_data$Exited == 1) / nrow(train_data)), percent(sum(test_data$Exited == 1) / nrow(test_data))),
  "Not-churned Customers" = c(sum(train_data$Exited == 0),sum(test_data$Exited == 0)),
  "Not-chruned Proportion" = c(percent(sum(train_data$Exited == 0) / nrow(train_data)), percent(sum(test_data$Exited == 0) / nrow(test_data)))
)
row.names(df1) <- c("Train data", "Test data")
kable(df1, align = 'r',caption = "Proportion of Churned Customers in training and test datasets")

Proportion of Churned Customers in training and test datasets
	Chruned.Customers	Chruned.Proportion	Not.churned.Customers	Not.chruned.Proportion
Train data	1665	21%	6342	79%
Test data	372	19%	1621	81%

Choosing link function

I fitted model 1, which is the model with all independant variables and data set. I also tried to find influential values (outliers) using cook’s distance, but results are a bit strange. Standardized residuals does not show any problematic point. it will be great if you could add some analyses to this part

#### MODEL 1
model1 <- glm(Exited ~., family = binomial(), data = train_data)
summary(model1)

## 
## Call:
## glm(formula = Exited ~ ., family = binomial(), data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3013  -0.6670  -0.4632  -0.2745   2.9528  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.391e+00  2.720e-01 -12.465  < 2e-16 ***
## CreditScore      -4.582e-04  3.107e-04  -1.475   0.1402    
## GeographyGermany  7.305e-01  7.527e-02   9.706  < 2e-16 ***
## GeographySpain   -2.749e-02  7.821e-02  -0.351   0.7252    
## GenderMale       -5.216e-01  6.034e-02  -8.644  < 2e-16 ***
## Age               7.157e-02  2.827e-03  25.315  < 2e-16 ***
## Tenure           -1.551e-02  1.037e-02  -1.497   0.1345    
## Balance           2.376e-06  5.682e-07   4.182 2.89e-05 ***
## NumOfProducts    -1.228e-01  5.258e-02  -2.336   0.0195 *  
## HasCrCard        -4.335e-02  6.590e-02  -0.658   0.5107    
## IsActiveMember   -1.042e+00  6.366e-02 -16.363  < 2e-16 ***
## EstimatedSalary   4.618e-07  5.245e-07   0.880   0.3786    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8186.7  on 8006  degrees of freedom
## Residual deviance: 6965.1  on 7995  degrees of freedom
## AIC: 6989.1
## 
## Number of Fisher Scoring iterations: 5

#### Maybe we should drop this section too

#plot residuals against continuous explanatory variables
plot(train_data$CreditScore,rstandard(model1))

plot(train_data$Age,rstandard(model1))

plot(train_data$Balance,rstandard(model1))

plot(train_data$EstimatedSalary,rstandard(model1))

### ModeL 1 - Probit link function
model1_prob <- glm(Exited ~., family = binomial(link = "probit"), data = train_data)
summary(model1_prob)

## 
## Call:
## glm(formula = Exited ~ ., family = binomial(link = "probit"), 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2781  -0.6772  -0.4654  -0.2482   3.1883  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.985e+00  1.542e-01 -12.871  < 2e-16 ***
## CreditScore      -2.930e-04  1.764e-04  -1.661   0.0967 .  
## GeographyGermany  4.195e-01  4.338e-02   9.669  < 2e-16 ***
## GeographySpain   -1.323e-02  4.357e-02  -0.304   0.7614    
## GenderMale       -3.010e-01  3.419e-02  -8.803  < 2e-16 ***
## Age               4.127e-02  1.603e-03  25.744  < 2e-16 ***
## Tenure           -8.475e-03  5.893e-03  -1.438   0.1504    
## Balance           1.377e-06  3.199e-07   4.305 1.67e-05 ***
## NumOfProducts    -6.239e-02  3.015e-02  -2.069   0.0385 *  
## HasCrCard        -2.605e-02  3.741e-02  -0.696   0.4862    
## IsActiveMember   -5.641e-01  3.525e-02 -16.004  < 2e-16 ***
## EstimatedSalary   2.637e-07  2.977e-07   0.886   0.3758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8186.7  on 8006  degrees of freedom
## Residual deviance: 6966.6  on 7995  degrees of freedom
## AIC: 6990.6
## 
## Number of Fisher Scoring iterations: 5

### ModeL 1 - Cloglog link function
model1_cloglog <- glm(Exited ~., family = binomial(link = "cloglog"), data = train_data)
summary(model1_cloglog)

## 
## Call:
## glm(formula = Exited ~ ., family = binomial(link = "cloglog"), 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1631  -0.6637  -0.4759  -0.3052   2.8081  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.959e+00  2.234e-01 -13.243  < 2e-16 ***
## CreditScore      -3.753e-04  2.558e-04  -1.467   0.1423    
## GeographyGermany  5.725e-01  6.129e-02   9.341  < 2e-16 ***
## GeographySpain   -2.076e-02  6.732e-02  -0.308   0.7578    
## GenderMale       -4.230e-01  5.024e-02  -8.419  < 2e-16 ***
## Age               5.510e-02  2.166e-03  25.440  < 2e-16 ***
## Tenure           -1.433e-02  8.588e-03  -1.669   0.0951 .  
## Balance           2.014e-06  4.817e-07   4.182 2.89e-05 ***
## NumOfProducts    -1.065e-01  4.328e-02  -2.462   0.0138 *  
## HasCrCard        -4.797e-02  5.464e-02  -0.878   0.3800    
## IsActiveMember   -9.478e-01  5.396e-02 -17.564  < 2e-16 ***
## EstimatedSalary   3.388e-07  4.338e-07   0.781   0.4348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8186.7  on 8006  degrees of freedom
## Residual deviance: 6989.8  on 7995  degrees of freedom
## AIC: 7013.8
## 
## Number of Fisher Scoring iterations: 8

df2 <- data.frame(
  "Deviance" = c(model1$deviance, model1_prob$deviance, model1_cloglog$deviance),
  "AIC" = c(model1$aic, model1_prob$aic, model1_cloglog$aic)
)
row.names(df2) <- c("Logit","Probit","Cloglog")
kable(df2)

	Deviance	AIC
Logit	6965.088	6989.088
Probit	6966.581	6990.581
Cloglog	6989.756	7013.756

Regarding deviance and AIC criteria (which are necessarily consistent) model with the logit link function fits the best and we proceed our analysis with logit link function.

Suggesting a more parsimonious model

I am dropping insignificant independant variables to see if we can have more parsimonous model

model2 <- glm(Exited ~.-CreditScore-Tenure-HasCrCard-EstimatedSalary, family = binomial(), data = train_data)
summary(model2)

## 
## Call:
## glm(formula = Exited ~ . - CreditScore - Tenure - HasCrCard - 
##     EstimatedSalary, family = binomial(), data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3467  -0.6651  -0.4646  -0.2749   2.9265  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.748e+00  1.609e-01 -23.290  < 2e-16 ***
## GeographyGermany  7.273e-01  7.519e-02   9.672  < 2e-16 ***
## GeographySpain   -2.850e-02  7.817e-02  -0.365   0.7154    
## GenderMale       -5.230e-01  6.029e-02  -8.675  < 2e-16 ***
## Age               7.152e-02  2.825e-03  25.316  < 2e-16 ***
## Balance           2.394e-06  5.677e-07   4.217 2.48e-05 ***
## NumOfProducts    -1.226e-01  5.253e-02  -2.334   0.0196 *  
## IsActiveMember   -1.040e+00  6.353e-02 -16.370  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8186.7  on 8006  degrees of freedom
## Residual deviance: 6970.7  on 7999  degrees of freedom
## AIC: 6986.7
## 
## Number of Fisher Scoring iterations: 5

model2$deviance

## [1] 6970.709

Dev_test <- abs(model2$deviance - model1$deviance) > qchisq(0.95,4)
Dev_test

## [1] FALSE

According to Deviance test, null hypothesis is not rejected and covariates in model 2 are enough to explain response variable.

Hosmer-Lemeshow goodness-of-fit statistic

hoslem <- logitgof(train_data$Exited, model2$fitted.values, g=10)
hoslem

## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  train_data$Exited, model2$fitted.values
## X-squared = 13.22, df = 8, p-value = 0.1045

we cannot reject null-hypothesis (goodness of fit of this model) in %95 significance level

Confusion matrix and checking accuracy of models

Model with all independant variables

###accuracy of prediction for model 1

prediction1 <- predict(model1, test_data,type = "response")
threshold = 0.5
prediction1[prediction1 >= threshold] <- 1
prediction1[prediction1 < threshold] <- 0
confusionMatrix(as.factor(prediction1), test_data$Exited, positive = '1')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1572  291
##          1   49   81
##                                           
##                Accuracy : 0.8294          
##                  95% CI : (0.8122, 0.8457)
##     No Information Rate : 0.8133          
##     P-Value [Acc > NIR] : 0.03398         
##                                           
##                   Kappa : 0.2502          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.21774         
##             Specificity : 0.96977         
##          Pos Pred Value : 0.62308         
##          Neg Pred Value : 0.84380         
##              Prevalence : 0.18665         
##          Detection Rate : 0.04064         
##    Detection Prevalence : 0.06523         
##       Balanced Accuracy : 0.59376         
##                                           
##        'Positive' Class : 1               
##

Although accuracy of this model is more than %82, attending accuracy as the only criteria is misleading. Since, we are mainly interested in predicting churning users precisely (i.e. sensitivity) and this model predicts churning users correctly in %22 of cases. This is terrible for our objective. Thus, we can improve sensitivity (obviously at the cost of specificity and therefore accuracy) by changing threshold of classificatin. In the plot below, we examine how these three criteria change in deferent thresholds.

thresholds <- seq(0.1,1,0.1)
acc <- rep(0,10)
sens <- rep(0,10)
spec <- rep(0,10)
for (i in 1:10){
  pred <- predict(model1, test_data,type = "response")
  pred[pred >= thresholds[i]] <- 1
  pred[pred < thresholds[i]] <- 0
  c <- confusionMatrix(as.factor(pred), test_data$Exited, positive = '1')
  acc[i] <- c$overall['Accuracy']
  sens[i] <- c$byClass['Sensitivity']
  spec[i] <- c$byClass['Specificity']
}

## Warning in confusionMatrix.default(as.factor(pred), test_data$Exited, positive =
## "1"): Levels are not in the same order for reference and data. Refactoring data
## to match.

## Warning in confusionMatrix.default(as.factor(pred), test_data$Exited, positive =
## "1"): Levels are not in the same order for reference and data. Refactoring data
## to match.

plot(thresholds,acc, type = "o", pch=19, ylim = c(0,1), xlab = "Thresholds", ylab = "", main = "Accuracy, Sensitivity and Specificity in different thresholds of pi")
points(thresholds,sens, type = "o", pch=19, col="red")
points(thresholds,spec, type = "o", pch=19, col="blue")

legend("right",c("Accuracy","Sensitivity","Specificity"),pch=c(19,19,19),col = c("black","red","blue"))

Parsimonious model

###accuracy of prediction for model 2

prediction2 <- predict(model2, test_data,type = "response")
threshold = 0.5
prediction2[prediction2 >= threshold] <- 1
prediction2[prediction2 < threshold] <- 0
confusionMatrix(as.factor(prediction2), test_data$Exited, positive = '1')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1571  292
##          1   50   80
##                                           
##                Accuracy : 0.8284          
##                  95% CI : (0.8111, 0.8447)
##     No Information Rate : 0.8133          
##     P-Value [Acc > NIR] : 0.04384         
##                                           
##                   Kappa : 0.2458          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.21505         
##             Specificity : 0.96915         
##          Pos Pred Value : 0.61538         
##          Neg Pred Value : 0.84326         
##              Prevalence : 0.18665         
##          Detection Rate : 0.04014         
##    Detection Prevalence : 0.06523         
##       Balanced Accuracy : 0.59210         
##                                           
##        'Positive' Class : 1               
##

thresholds <- seq(0.1,1,0.1)
acc2 <- rep(0,10)
sens2 <- rep(0,10)
spec2 <- rep(0,10)
for (i in 1:10){
  pred2 <- predict(model2, test_data,type = "response")
  pred2[pred2 >= thresholds[i]] <- 1
  pred2[pred2 < thresholds[i]] <- 0
  c2 <- confusionMatrix(as.factor(pred2), test_data$Exited, positive = '1')
  acc2[i] <- c2$overall['Accuracy']
  sens2[i] <- c2$byClass['Sensitivity']
  spec2[i] <- c2$byClass['Specificity']
}

## Warning in confusionMatrix.default(as.factor(pred2), test_data$Exited, positive
## = "1"): Levels are not in the same order for reference and data. Refactoring
## data to match.

## Warning in confusionMatrix.default(as.factor(pred2), test_data$Exited, positive
## = "1"): Levels are not in the same order for reference and data. Refactoring
## data to match.

plot(thresholds,acc2, type = "o", pch=19, ylim = c(0,1), xlab = "Thresholds", ylab = "" ,main = "Accuracy, Sensitivity and Specificity in different thresholds of pi")
points(thresholds,sens2, type = "o", pch=19, col="red")
points(thresholds,spec2, type = "o", pch=19, col="blue")
points(thresholds,sens, type = "o", pch=19, col="green")
legend("right",c("Accuracy-Model2","Sensitivity-Model2","Specificity-Model2", "Sensitivity-Model1"),pch=c(19,19,19,19),col = c("black","red","blue","green"))

Since sensitivity is important to us, I have also compared with the sensitivity of model 1. Not much difference can be seen.

Policy impacts of the model

In this section, we aim to extract some policy takeaways from interpretation of coefficients of the model:
* First and the most obvious one, The more a customer is intensely and frequently using products of the company, the more loyal they will be.
* Second, The coefficient of ‘Number of products’ shows that if a customer uses one more product of the company, odds of their churn drops by about 20%. Therefore, the company should actively pursue cross-selling in their sales policies.
* Geographically, odds of churn in Germany is almost 2 times compared to France. Thus, the company may take some measures improve loyalty of its customers in Germany.

Share on

Twitter Facebook Google+ LinkedIn