Choosing Logistic Regression’s Cutoff Value for an Unbalanced Dataset

All the code (“unbalanced_code”) and the data (“data” folder) can be found here.

This documentation focuses on choosing the optimal cutoff value for logistic regression when dealing with an unbalanced dataset. The same notion applies to other classification algorithms whose predictions for unknown outcomes are probabilities. We will not give a thorough introduction to logistic regression, and the discussion of whether other algorithms could boost classification performance is also left out.

Getting Started

Logistic regression is a technique that is well suited for binary classification problems. After giving the model your input parameters ( also called variables or predictors ), the model calculates the probability that each observation belongs to one of the two classes ( depending on which one you choose as the “positive” class ). The formula for this regression is:

\[ P(y) = \frac{1}{ 1 + e^{ -( B_0 + B_1X_1 + \dots + B_nX_n ) } } \]

Where \(P(y)\) is the calculated probability; the \(B\)s denote the model’s parameters and the \(X\)s refer to your input parameters.
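
To make the formula concrete, here’s a tiny numeric sanity check with made-up coefficients ( purely illustrative, not tied to the dataset used below ):

# made-up coefficients: B0 = -1, B1 = 2, and a single input X1 = 0.5
# the linear predictor is -1 + 2 * 0.5 = 0, so the probability is 1 / (1 + exp(0)) = 0.5
b0 <- -1; b1 <- 2; x1 <- 0.5
1 / ( 1 + exp( -(b0 + b1 * x1) ) )   # same as plogis( b0 + b1 * x1 )
## [1] 0.5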

Problem Description

In this document we’re given a human resource dataset, where our goal is to find the employees that are likely to leave in the future and act upon our findings, which is, of course, to retain them before they choose to leave.

# environment setting 
library(ROCR)
library(grid)
library(broom)
library(caret)
library(tidyr)
library(dplyr)
library(scales)
library(ggplot2)
library(ggthemr)
library(ggthemes)
library(gridExtra)
library(data.table)
setwd("/Users/ethen/machine-learning/unbalanced")

# read in the dataset ("HR.csv")
data <- fread( list.files( "data", full.names = TRUE )[2] )
str(data)
## Classes 'data.table' and 'data.frame':   12000 obs. of  7 variables:
##  $ S      : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ LPE    : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ NP     : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ ANH    : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ TIC    : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Newborn: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left   : int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

This dataset contains 12000 observations and 7 variables, each representing :

  • S The satisfaction level on a scale of 0 to 1.
  • LPE Last project evaluation by a client on a scale of 0 to 1.
  • NP The number of projects the employee worked on in the last 12 months.
  • ANH The average number of hours the employee worked in the last 12 months.
  • TIC The amount of time the employee has spent in the company, measured in years.
  • Newborn This variable takes the value 1 if the employee had a newborn within the last 12 months and 0 otherwise.
  • left 1 if the employee left the company, 0 if they’re still working here.

We’ll do a quick summary to check whether any columns contain missing values ( NAs ) and require cleaning. We’ll also use the findCorrelation function to determine whether any variables are highly correlated with each other, so that we can remove them from the model training.
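
Besides the summary below, an explicit count of missing values per column could be obtained with something like the following ( a small illustrative check, not part of the original code ):

# count NAs in each column; every entry should be zero given how clean this dataset is
colSums( is.na(data) )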

# using summary  
summary(data)
##        S               LPE               NP             ANH       
##  Min.   :0.0900   Min.   :0.3600   Min.   :2.000   Min.   : 96.0  
##  1st Qu.:0.4800   1st Qu.:0.5700   1st Qu.:3.000   1st Qu.:157.0  
##  Median :0.6600   Median :0.7200   Median :4.000   Median :199.5  
##  Mean   :0.6295   Mean   :0.7166   Mean   :3.802   Mean   :200.4  
##  3rd Qu.:0.8200   3rd Qu.:0.8600   3rd Qu.:5.000   3rd Qu.:243.0  
##  Max.   :1.0000   Max.   :1.0000   Max.   :7.000   Max.   :310.0  
##       TIC           Newborn            left       
##  Min.   :2.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :3.000   Median :0.0000   Median :0.0000  
##  Mean   :3.229   Mean   :0.1542   Mean   :0.1667  
##  3rd Qu.:4.000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :6.000   Max.   :1.0000   Max.   :1.0000
# find correlations to exclude from the model 
findCorrelation( cor(data), cutoff = .75, names = TRUE )
## character(0)

Surprisingly, the dataset is quite clean, which is oftentimes not the case.

To get a feel of why this problem is worth solving, let’s look at the proportion of employees that have left the company.

prop.table( table(data$left) )
## 
##         0         1 
## 0.8333333 0.1666667

If you’re the HR manager of this company, this probability table tells you that around 16 percent of the employees who joined your staff have left! If those employees are all the ones performing well, then your company is probably not going to last long… Let’s train a logistic regression model on our dataset to see if we can find out what’s causing employees to leave ( we’ll leave the exploratory analysis to you ).

Model Training

To train and evaluate the model, we’ll split the dataset into two parts: 80 percent of the dataset will be used to train the model, while the rest will be used to evaluate its accuracy, i.e. the out-of-sample error. The “Newborn” variable is also converted to factor type so that it will be treated as a categorical variable.

# convert the newborn to factor variables
data[ , Newborn := as.factor(Newborn) ]

set.seed(4321)
test <- createDataPartition( data$left, p = .2, list = FALSE )
data_train <- data[ -test, ]
data_test  <- data[ test, ]
rm(data)

# train logistic regression model
model_glm <- glm( left ~ . , data = data_train, family = binomial(logit) )
summary_glm <- summary(model_glm)

Again we’ll quickly check two things for this model. First, the p-values: values below .05 indicate significance, which means the coefficients ( the parameters ) estimated by our model are reliable. Second, the pseudo R squared: this value, ranging from 0 to 1, indicates how much variance is explained by our model; if you’re familiar with linear regression, it plays the same role as the R squared value. The value below is rounded to two digits after the decimal point.

list( summary_glm$coefficient, 
      round( 1 - ( summary_glm$deviance / summary_glm$null.deviance ), 2 ) )
## [[1]]
##                 Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.250958586 0.1779190471  -7.031055  2.049778e-12
## S           -3.768736063 0.1340995830 -28.104010 8.750779e-174
## LPE          0.562660942 0.2014790879   2.792652  5.227793e-03
## NP          -0.368893552 0.0296574238 -12.438489  1.615298e-35
## ANH          0.003659085 0.0006950292   5.264649  1.404571e-07
## TIC          0.625952648 0.0301033607  20.793447  4.961523e-96
## Newborn1    -1.611876758 0.1325034654 -12.164789  4.786780e-34
## 
## [[2]]
## [1] 0.21

A fast check on all the p-values of the model indicates significance, meaning that our model is a legitimate one. A pseudo R squared of 0.21 tells us that only 21 percent of the variance is explained. In other words, the model isn’t powerful enough to predict employees that left with high reliability. Since this is more of a dataset problem ( it suggests collecting other variables to add to the dataset ) and there’s not much we can do about it at this stage, we’ll simply move on to the next part, where we’ll start looking at the predictions made by the model.

Predicting and Assessing the Model

For this section we start off by obtaining the predicted probability that an employee will leave in the future for both the training and testing set. After that we’ll perform a quick evaluation on the training set by plotting the probability (score) estimated by our model with a double density plot.

# prediction
data_train$prediction <- predict( model_glm, newdata = data_train, type = "response" )
data_test$prediction  <- predict( model_glm, newdata = data_test , type = "response" )

# distribution of the prediction score grouped by known outcome
ggplot( data_train, aes( prediction, color = as.factor(left) ) ) + 
geom_density( size = 1 ) +
ggtitle( "Training Set's Predicted Score" ) + 
scale_color_economist( name = "data", labels = c( "negative", "positive" ) ) + 
theme_economist()

Given that our model’s final objective is to classify new instances into one of two categories, whether the employee will leave or not, we want the model to give high scores to positive instances ( 1 : employee that left ) and low scores to negative ones ( 0 : employee that stayed ). Thus for an ideal double density plot you want the distribution of scores to be separated, with the scores of the negative instances on the left and the scores of the positive instances on the right.
In the current case, both distributions are slightly skewed to the left. Not only is the predicted probability for the negative outcomes low, but the probability for the positive outcomes is also lower than it should be. The reason for this is that our dataset consists of only 16 percent positive instances ( employees that left ), so our predicted scores get pulled towards lower values because the majority of the data are negative instances.

A slight digression: when developing models for prediction, we all know that we want the model to be as accurate as possible, or in other words, to do a good job of predicting the target variable on out-of-sample observations.

Our skewed double density plot, however, can actually tell us a very important thing: Accuracy will not be a suitable measurement for this model. We’ll show why below.

Since the prediction of a logistic regression model is a probability, in order to use it as a classifier we’ll have to choose a cutoff value, also called a threshold value: scores above this value are classified as positive and those below as negative. ( We’ll be using the term cutoff throughout the rest of the documentation. )
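
For example, using an arbitrary cutoff of .5 ( just for illustration, not the value we’ll end up choosing ), converting the scores into class labels and computing the accuracy takes only a couple of lines:

# illustrative example: classify the test set with an arbitrary cutoff of .5
pred_class <- ifelse( data_test$prediction >= .5, 1, 0 )
# accuracy = proportion of predicted labels that match the actual outcome
mean( pred_class == data_test$left )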

Here we’ll use a function to loop through several cutoff values and compute the model’s accuracy on both training and testing set.

AccuracyCutoffInfo Obtain the accuracy on the training and testing dataset for cutoff values ranging from .4 to .8 ( with a .05 increase ). Input parameters :

  • train Data.table or data.frame type training data, assumed to already contain the predicted score and the actual outcome.
  • test Same as above, but for the test set.
  • predict The predicted score’s column name, assumed to be the same for both the train and test set; must be specified as a character.
  • actual Same as above, but for the actual outcome’s column name.
  • The function returns a list consisting of :
    • data : data.table with three columns. Each row indicates the cutoff value and the accuracy for the train and test set respectively.
    • plot : a single plot that visualizes the data.table.
# functions are sourced in, to reduce document's length
source("unbalanced_code/unbalanced_functions.R")

accuracy_info <- AccuracyCutoffInfo( train = data_train, test = data_test, 
                                     predict = "prediction", actual = "left" )
# define the theme for the next plot
ggthemr("light")
accuracy_info$plot
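
The sourced function itself isn’t reproduced here to keep the document short, but a minimal sketch of the idea behind it ( looping over cutoff values and recording the train / test accuracy; the actual implementation lives in unbalanced_code/unbalanced_functions.R ) could look like this:

# illustrative sketch only, not the sourced AccuracyCutoffInfo function
acc_at  <- function( d, cutoff ) mean( ifelse( d$prediction >= cutoff, 1, 0 ) == d$left )
cutoffs <- seq( .4, .8, by = .05 )
data.frame( cutoff = cutoffs,
            train  = sapply( cutoffs, function(c) acc_at( data_train, c ) ),
            test   = sapply( cutoffs, function(c) acc_at( data_test , c ) ) )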

From the output, you can see that starting from the cutoff value of .6, our model’s accuracy for both the training and testing set grows higher and higher, showing no sign of decreasing at all. We’ll use another function to visualize the confusion matrix of the test set to see what’s causing this. For those who aren’t familiar with the terms related to the confusion matrix, or have forgotten them, here’s the wiki page for a quick refresher. We’ll not be going through the terms here.

ConfusionMatrixInfo Obtain the confusion matrix plot and data.table for a given dataset that already contains the predicted score and actual outcome columns.

  • data Data.table or data.frame type data that contains the columns of the predicted score and actual outcome.
  • predict The prediction’s column name, must be specified as a character.
  • actual Same as above, but for the actual outcome’s column name.
  • cutoff Cutoff value for the prediction score.
  • The function returns a list consisting of :
    • data : A data.table consisting of three columns. The first two columns store the original values of the prediction and actual outcome from the passed-in data. The third indicates the type: given the chosen cutoff value, whether this row is a true/false positive/negative.
    • plot : Plot that visualizes the data.table.
# visualize .6 cutoff (lowest point of the previous plot)
cm_info <- ConfusionMatrixInfo( data = data_test, predict = "prediction", 
                                actual = "left", cutoff = .6 )
ggthemr("flat")
cm_info$plot

To avoid confusion, the predicted scores are jittered along their predicted label ( along the 0 and 1 outcome ). It makes sense to do so when visualizing a large number of individual observations representing each outcome so we can spread the points along the x axis. Without jittering, we would essentially see two vertical lines with tons of points overlapped on top of each other.

The above plot depicts the tradeoff we face upon choosing a reasonable cutoff. If we increase the cutoff value, the number of true negatives (TN) increases and the number of true positives (TP) decreases. Or, put differently, if we increase the cutoff value, the number of false positives (FP) is lowered, while the number of false negatives (FN) rises.

Here, because we have very few positive instances in our dataset, our model is less likely to make a false negative mistake. This means that if we keep raising the cutoff value, we’ll actually increase our model’s accuracy, since we have a higher chance of turning false positives into true negatives.
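
To make the counts concrete, a quick base R cross tabulation at the .6 cutoff carries the same information as the plot above ( an illustrative check, not part of the sourced functions ):

# illustrative cross tabulation of predicted class versus actual outcome at cutoff .6
table( predicted = as.integer( data_test$prediction >= .6 ), actual = data_test$left )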

To make this idea sink in, suppose that for our test set we simply predict every single observation as a negative instance ( 0 : meaning the employee will not leave in the near future ).

# class proportions of the test set; predicting every observation as 0 gives this accuracy
prop.table( table( data_test$left ) )
## 
##      0      1 
## 0.8425 0.1575

Then what happens is that we’d still obtain an 84 percent accuracy, which is pretty much the same as our logistic regression model…

Section Takeaway: Accuracy is not a suitable indicator for the model when you have an unbalanced distribution or unbalanced costs.

Choosing the Suitable Cutoff Value

Since accuracy isn’t suitable for this situation, we’ll have to use another measurement to decide which cutoff value to choose: the ROC curve.

Remember we said in the last section that when choosing our cutoff value, we’re striking a balance between the false positive rate (FPR) and the false negative rate (FNR). You can think of this as the objective function for our model, where we’re trying to minimize the number of mistakes we’re making, or the so-called cost. The ROC curve’s purpose is to visualize and quantify this tradeoff between the two measures. The curve is created by plotting the true positive rate (TPR) on the y axis against the false positive rate (FPR) on the x axis at various cutoff settings ( between 0 and 1 ).
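
Since the ROCR package is already loaded, the raw curve itself can be drawn in a couple of lines ( a quick illustration that is independent of the sourced functions used below ):

# illustrative ROC curve with ROCR: TPR versus FPR across all possible cutoffs
pred_obj <- prediction( data_test$prediction, data_test$left )
perf_obj <- performance( pred_obj, measure = "tpr", x.measure = "fpr" )
plot( perf_obj )
# area under the curve
performance( pred_obj, measure = "auc" )@y.values[[1]]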

We’ll use the data returned by the ConfusionMatrixInfo function ( used in the last section ) and pass it to another function that calculates and returns a plot and the associated information about the ROC curve.

print(cm_info$data)
##       actual    predict type
##    1:      1 0.43694965   FN
##    2:      1 0.30684752   FN
##    3:      1 0.19962594   FN
##    4:      1 0.30894387   FN
##    5:      1 0.27749634   FN
##   ---                       
## 2396:      0 0.06898270   TN
## 2397:      0 0.07509377   TN
## 2398:      0 0.16193161   TN
## 2399:      0 0.85864236   FP
## 2400:      0 0.62797360   FP

Note that you don’t have to use this exact data; as long as your data contains the predicted score and actual outcome columns, the function will work fine. We’ll list the input parameters of this function before explaining a little bit more.

ROCInfo Pass in the data that includes the predicted score and actual outcome column to obtain the ROC curve information. Input parameters :

  • data Data.table or data.frame type data that contains the columns of the predicted score and actual outcome.
  • predict Predicted score’s column name.
  • actual Actual results’ column name.
  • cost.fp Associated cost for a false positive instance.
  • cost.fn Associated cost for a false negative instance.
  • The function returns a list consisting of :
    • plot : A side by side ROC and cost plot, with the title showing the optimal cutoff value, the total cost, and the area under the curve (auc). Wrap the grid.draw function around the plot to visualize it!
    • cutoff : Optimal cutoff value according to the specified FP and FN cost.
    • totalcost : Total cost according to the specified FP and FN cost.
    • auc : Area under the curve.
    • sensitivity : TP / (TP + FN) for the optimal cutoff.
    • specificity : TN / (FP + TN) for the optimal cutoff.

As you’ll notice, there are input parameters that allow you to specify different costs for making a false positive (FP) mistake and a false negative (FN) mistake. This is because, in real world applications, the costs that come along with making these two mistakes are usually quite different, and committing a false negative (FN) is usually more costly than committing a false positive (FP).

Take our case for example: a false negative (FN) means that an employee left our company but our model failed to detect that, while a false positive (FP) means that an employee is still working at our company but our model told us they will be leaving. The former mistake would be a tragedy, since, well, the employee left and we didn’t do anything about it! ( Human resources are a company’s most valuable asset. ) As for the latter mistake, we might simply waste about 15 minutes of an HR manager’s time arranging a face to face interview with an employee, asking how the company can do better to retain them, while they’re perfectly fine with the current situation.
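
Conceptually, the optimal cutoff is simply the one that minimizes the total cost, FP * cost.fp + FN * cost.fn. A minimal sketch of that search over the test set ( illustrative only; the ROCInfo function we actually use below is sourced from unbalanced_code/unbalanced_functions.R ) might look like this:

# illustrative sketch: total misclassification cost across candidate cutoffs
total_cost <- function( cutoff, cost_fp, cost_fn ) {
    pred <- as.integer( data_test$prediction >= cutoff )
    fp   <- sum( pred == 1 & data_test$left == 0 )
    fn   <- sum( pred == 0 & data_test$left == 1 )
    fp * cost_fp + fn * cost_fn
}
cutoffs <- seq( .05, .95, by = .05 )
costs   <- sapply( cutoffs, total_cost, cost_fp = 100, cost_fn = 200 )
cutoffs[ which.min(costs) ]   # the cutoff with the lowest total cost under these costs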

Let’s use the function before going any further, so all these notions won’t seem so opaque.

# reset to default ggplot theme 
ggthemr_reset()

# user-defined different cost for false negative and false positive
cost_fp <- 100
cost_fn <- 200
roc_info <- ROCInfo( data = cm_info$data, predict = "predict", 
                     actual = "actual", cost.fp = cost_fp, cost.fn = cost_fn )
grid.draw(roc_info$plot)

So what does this side by side plot tell us?

  1. The title of the entire plot tells us that when we assigned the cost for a false positive (FP) and a false negative (FN) to be 100 and 200 respectively, our optimal cutoff is actually 0.26, and the total cost for choosing this cutoff value is 49,800.

  2. The plot on the left is the ROC curve for this model. It shows the trade off between the rate at which you correctly predict something and the rate at which you incorrectly predict something as you choose different cutoff values. We’ve also calculated the area under this ROC curve (auc) to be 0.848. In short, this measure, ranging from 0 to 1, shows how well the classification model is performing in general, where the higher the number the better. The tilted blue line marks the boundary of an average model, with a .5 area under the curve.

  3. The cost plot on the right shows the associated cost for choosing each cutoff value.

  4. For both plots, the cyan dotted line denotes where the optimal point lies. For the cost plot, it shows the optimal cutoff value. For the ROC curve plot, it indicates the false positive rate (FPR) and true positive rate (TPR) corresponding to the optimal cutoff value. The color on the curve denotes the cost associated with each point: “greener” means the cost is lower, while “blacker” means the opposite.

We can re-plot the confusion matrix to see what happened when we switch to this cutoff value.

# re-plot the confusion matrix plot with the new cutoff value
cm_info <- ConfusionMatrixInfo( data = data_test, predict = "prediction", 
                                actual = "left", cutoff = roc_info$cutoff )
ggthemr("flat")
cm_info$plot

The confusion matrix plot clearly shows that when we change the cutoff value to 0.26, our classification model makes fewer false negative (FN) errors, since the cost associated with them is 2 times higher than that of a false positive (FP).

Interpretation and Reporting

We’ll return to our logistic regression model for a minute and look at the estimated parameters (coefficients). Since the model’s parameters are recorded on the logit scale, we’ll transform them into odds ratios so that they’ll be easier to interpret.

# tidy from the broom package
coefficient <- tidy(model_glm)[ , c( "term", "estimate", "statistic" ) ]

# transform the coefficients into odds ratios
coefficient$estimate <- exp( coefficient$estimate )
coefficient
##          term   estimate  statistic
## 1 (Intercept) 0.28623029  -7.031055
## 2           S 0.02308122 -28.104010
## 3         LPE 1.75533714   2.792652
## 4          NP 0.69149902 -12.438489
## 5         ANH 1.00366579   5.264649
## 6         TIC 1.87002659  20.793447
## 7    Newborn1 0.19951283 -12.164789

Some interpretation : with all other input variables unchanged, every one unit increase in the satisfaction level multiplies the odds of leaving the company (versus not leaving) by a factor of 0.02; in other words, higher satisfaction strongly decreases the odds of leaving.

Now that we have our logistic regression model, we’ll load in the dataset with unknown actual outcome and use the model to predict the probability.

# set the column class 
col_class <- sapply( data_test, class )[1:6]

# use the model to predict on data with unknown outcomes ("HR_unknown.csv")
data <- read.csv( list.files( "data", full.names = TRUE )[1], colClasses = col_class )

# predict
data$prediction <- predict( model_glm, newdata = data, type = "response" )
list( head(data), nrow(data) )
## [[1]]
##      S  LPE NP ANH TIC Newborn prediction
## 1 0.86 0.69  4 105   4       1 0.01334377
## 2 0.52 0.98  4 209   2       0 0.10733910
## 3 0.84 0.60  5 207   2       0 0.01956542
## 4 0.60 0.65  3 143   2       1 0.01646569
## 5 0.85 0.57  3 227   2       0 0.04078384
## 6 0.82 0.61  4 246   3       0 0.06322873
## 
## [[2]]
## [1] 1000

After predicting how likely each employee is to leave the company, we can use the cutoff value we obtained in the last section to determine which employees we should pay closer attention to.

# cutoff
data <- data[ data$prediction >= roc_info$cutoff, ]
list( head(data), nrow(data) )
## [[1]]
##       S  LPE NP ANH TIC Newborn prediction
## 10 0.36 0.65  5 119   5       0  0.3725606
## 19 0.49 0.79  3 234   3       0  0.2639120
## 31 0.28 0.64  4 147   3       0  0.2677766
## 38 0.22 0.41  2 248   3       0  0.5493722
## 43 0.20 0.75  4 248   3       0  0.4321773
## 57 0.95 0.73  3 286   6       0  0.3262969
## 
## [[2]]
## [1] 146

Using the cutoff value of 0.26, we have decreased the number of employees that we might have to take action upon, to prevent them from leaving the company, to only 146.

Given these estimated probabilities, there are two things worth noticing through simple visualization :

First, the relationship between the time spent in the company and the probability that the employee will leave the company (attrition). We compute the median attrition rate and the number of employees for each value of TIC ( time spent in the company ).

# compute the median estimated probability for each TIC group
median_tic <- data %>% group_by(TIC) %>% 
                       summarise( prediction = median(prediction), count = n() )
ggthemr("fresh")
ggplot( median_tic, aes( TIC, prediction, size = count ) ) + 
geom_point() + theme( legend.position = "none" ) +
labs( title = "Time and Employee Attrition", y = "Attrition Probability", 
      x = "Time Spent in the Company" ) 

You’ll notice that the estimated probability of an employee leaving the company is positively correlated with the time spent in the company. This could indicate a bad thing, since it shows that your company can’t retain employees that have stayed in the company for a long time. In the field of marketing, it makes sense that if your customers are still loyal to you after many years, they will most likely stay loyal forever. For our human resource example, failing to retain “loyal” employees could possibly ( I’m saying could possibly here ) mean that the company failed to propose a decent career plan to these employees. The points’ (bubble) sizes show that this is not a rare case.

The second is the relationship between LPE ( the last project evaluation by the client ) and the estimated probability to leave. We’ll do the same thing as we did for the last plot, except for one thing: unlike TIC ( time spent in the company ), which takes only five distinct values, LPE is a numeric index ranging from 0 to 1, so in order to plot it in the same fashion as the last one we’ll add an extra step and cut (split) the LPE variable into four groups.

data$LPECUT <- cut( data$LPE, breaks = quantile(data$LPE), include.lowest = TRUE )
median_lpe <- data %>% group_by(LPECUT) %>% 
                       summarise( prediction = median(prediction), count = n() )

ggplot( median_lpe, aes( LPECUT, prediction ) ) + 
geom_point( aes( size = count ), color = "royalblue3" ) + 
theme( legend.position = "none" ) +
labs( title = "Last Project's Evaluation and Employee Attrition", 
      y = "Attrition Probability", x = "Last Project's Evaluation by Client" )

From the plot above we can see that the last project evaluation is neither positively nor negatively correlated with the estimated probability. This is an indication that it may be worth trying out other classification algorithms, since logistic regression assumes a monotonic relationship ( either entirely increasing or decreasing ) between the input parameters and the outcome ( the same is true for linear regression ). Meaning if more of a quantity is good, then much more of that quantity is better, which is often not the case in real world applications.

Now, back to where we were on retaining our employees. Apart from knowing whom we want to retain, we can also prioritize our actions by adding in how much we wish to retain each employee. Recall that our dataset contains the employees’ performance information, the last project evaluation (LPE).

Knowing this information, we can easily create a visualization to tell the story:

ggplot( data, aes( prediction, LPE ) ) + 
geom_point() + 
ggtitle( "Performance vs. Probability to Leave" )

We first have the employees that are underperforming ( lower y axis ); we should probably improve their performance, or you could say we can’t wait for them to leave…. For employees that are not likely to leave ( lower x axis ), we should manage them as usual if we’re really short on resources ( such as time to interview them in this case ). Then, in the short run, we should focus on those that have good performance but also a high probability of leaving.

The next thing we can do is to quantify our priority by multiplying the probability of leaving by the performance. We’ll also use the row names of the data.frame to serve as imaginary employee ids.

result <- data %>% 
          mutate( priority = prediction * LPE ) %>%
          mutate( id = rownames(data) ) %>%
          arrange( desc(priority) )
head(result)
##      S  LPE NP ANH TIC Newborn prediction       LPECUT  priority  id
## 1 0.22 0.94  3 251   6       0  0.8824965 (0.818,0.99] 0.8295467 928
## 2 0.27 0.90  2 246   6       0  0.8962227 (0.818,0.99] 0.8066004 588
## 3 0.31 0.85  2 284   6       0  0.8924586 (0.818,0.99] 0.7585898 477
## 4 0.25 0.91  5 274   6       0  0.7742936 (0.818,0.99] 0.7046072 684
## 5 0.24 0.93  3 193   5       0  0.7497174 (0.818,0.99] 0.6972371 732
## 6 0.24 0.82  3 209   6       0  0.8480916 (0.818,0.99] 0.6954351 235

This gives us the priority score table above, where the score is high for the employees we wish to act upon as soon as possible and lower for the others. After obtaining this result, we can schedule face to face interviews with employees starting from the top of the list.

Conclusion:

Using a classification algorithm like logistic regression in this example enabled us to detect events that will happen in the future, that is, which employees are more likely to leave the company. Based on this information, we can come up with a more efficient strategy to cope with the matter at hand. Hopefully, after this documentation, the one thing that you’ll still remember is: if the dataset happens to be unbalanced, meaning that there is a lot more negative outcome data than positive, or vice versa, we shouldn’t use accuracy as the measurement to evaluate our model’s performance.

There’s actually more to this HR example. We know who is more likely to leave and which ones we’re more interested in retaining; the next question is what should we say to those employees? Are they leaving because the salary is too low, or because the workload is simply too much and they need a break? View this documentation to see how a clustering algorithm can give us a little hint on this ( there are two examples in that documentation, please refer to the second one ).

R Session Information

sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] data.table_1.9.6 gridExtra_2.2.1  ggthemes_3.0.3   ggthemr_1.0.1   
##  [5] scales_0.4.0     dplyr_0.4.3      tidyr_0.4.1      caret_6.0-68    
##  [9] ggplot2_2.1.0    lattice_0.20-33  broom_0.4.0      ROCR_1.0-7      
## [13] gplots_3.0.1    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.4        class_7.3-14       gtools_3.5.0      
##  [4] assertthat_0.1     digest_0.6.9       psych_1.5.8       
##  [7] foreach_1.4.3      mime_0.4           R6_2.1.2          
## [10] plyr_1.8.3         chron_2.3-47       MatrixModels_0.4-1
## [13] stats4_3.2.4       evaluate_0.8.3     e1071_1.6-7       
## [16] highr_0.5.1        lazyeval_0.1.10    rstudioapi_0.5    
## [19] minqa_1.2.4        gdata_2.17.0       SparseM_1.7       
## [22] miniUI_0.1.1       car_2.1-2          nloptr_1.0.4      
## [25] Matrix_1.2-4       rmarkdown_0.9.6    labeling_0.3      
## [28] splines_3.2.4      lme4_1.1-11        stringr_1.0.0     
## [31] questionr_0.5      munsell_0.4.3      shiny_0.13.2      
## [34] httpuv_1.3.3       mnormt_1.5-4       mgcv_1.8-12       
## [37] htmltools_0.3.5    nnet_7.3-12        codetools_0.2-14  
## [40] MASS_7.3-45        bitops_1.0-6       nlme_3.1-125      
## [43] xtable_1.8-2       gtable_0.2.0       DBI_0.3.1         
## [46] magrittr_1.5       formatR_1.3        rmdformats_0.2    
## [49] KernSmooth_2.23-15 stringi_1.0-1      reshape2_1.4.1    
## [52] iterators_1.0.8    tools_3.2.4        parallel_3.2.4    
## [55] pbkrtest_0.4-6     yaml_2.1.13        colorspace_1.2-6  
## [58] caTools_1.17.1     knitr_1.12.3       quantreg_5.21

Ming-Yu Liu

November 25, 2015