Finding Groups Within the Data

You can find all the code and datasets (in the data folder) used in this documentation here. This documentation assumes you’re familiar with hierarchical clustering.

We will go over two examples of how identifying groups within data allows businesses to improve efficiency. Finding groups within our data allows us to strike the right balance between similarities and differences. The reasoning is that we want to treat similar cases with similar approaches in order to benefit from economies of scale, while treating different scenarios with approaches better suited to their particular situations, which improves the effectiveness of our actions. The following documentation covers two examples of where this notion can be used in the business world.

Example 1: Supply Chain Management

Our concern as a supply chain manager is that we want to organize our products’ stock more efficiently. To be explicit, we don’t want to store large amounts of stock for every single product, as this increases inventory costs. On the other hand, we also wish to maintain the stock of each product at a suitable level, so that we can deliver the product on time when it’s demanded. Thus, the goal for this task is to identify groups of products whose stock “behaves” the same way and can be treated in a similar fashion.

The common approach to this issue is to analyze the question along two dimensions: the average daily sales of each stock keeping unit, or “SKU”, on one axis, and the “volatility” of each SKU’s daily sales on the other.

  • SKU : Note that in the field of inventory management, an SKU refers to a specific item stored at a specific “location”. The key word here is “location”. In a retail network, the same product, say a TV, can have multiple SKUs. Examples of this are :
    • There are multiple warehouses where this TV can be stored. This is the most typical case, where the same TV will have one SKU per warehouse.
    • Even items in the same warehouse can have multiple SKUs, because they are sold at different locations ( e.g. retail stores in City A, shopping malls in City B and so on ).
    • A product may also have many variants based on attributes such as size, color or conditioning. This will also lead to multiple SKUs for the same product.
  • volatility : This can be measured by the coefficient of variation, which is simply the standard deviation of sales divided by the mean sales for each SKU (a short sketch of this calculation is shown below).
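
To make these two dimensions concrete, here is a minimal sketch of how ADS and CV could be computed from a raw table of daily sales per SKU. The daily_sales table and its column names are made up for illustration; the dataset we’ll load below already comes pre-aggregated.

# hypothetical daily sales for two SKUs
library(data.table)
daily_sales <- data.table(
    sku   = rep( c("sku_1", "sku_2"), each = 5 ),
    sales = c( 1, 0, 3, 0, 2, 9, 10, 8, 9, 11 )
)

# ADS = mean daily sales, CV = standard deviation / mean, per SKU
daily_sales[ , .( ADS = mean(sales), CV = sd(sales) / mean(sales) ), by = sku ]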

Let’s read in the dataset to get a feel for what it looks like before going further. As a reminder, the link to the datasets used in this documentation is provided at the beginning of the documentation.

# caret : findCorrelation for flagging highly correlated variables
# scales : rescale for max-min normalization
# ggradar : radar chart
library(caret)
library(scales)
library(ggradar)
library(ggplot2)
library(data.table)

# point the working directory at the folder that contains the data/ folder
# and start from a clean workspace
setwd("/Users/ethen/Business-Analytics/finding_groups")
rm( list = ls() )

# ----------------------------------------------------------
#              example 1 : supply chain
# ----------------------------------------------------------
data_sku <- fread("data/DATA_2.01_SKU.csv")
head(data_sku)
##    ADS   CV
## 1:   1 0.68
## 2:   3 0.40
## 3:   1 0.59
## 4:   2 0.39
## 5:   9 0.11
## 6:   2 0.56

There are 100 rows and 2 variables within this dataset, namely :

  • ADS : average daily sales.
  • CV : coefficient of variation.

As there are only two variables (columns), we can easily visualize them and look at their relationship.

ggplot( data_sku, aes(CV, ADS) ) + geom_point() +  
geom_hline(yintercept = 4, color = "blue") + 
geom_vline(xintercept = .2, color = "blue") + 
labs( x = "Coefficient of Variation", y = "Average Daily Sales", 
      title = "SKU Example" )

As you’ll notice from the plot, there are three distinct groups, separated by the horizontal and vertical lines on the graph. In practice, however, there are only a few situations where you can determine how many distinct groups exist in the data through straightforward visualization, because often you’ll have more than two variables to consider for your clustering. Therefore, we’ll now use a more rigorous approach, the hierarchical clustering algorithm, to group our data.

One precaution: from the data (or from the visualization) you’ll notice that the two variables are recorded in different units. When input variables differ by orders of magnitude, performing some kind of normalization is often required to make them comparable. Here we’ll apply the widely used z-score normalization through the built-in scale function, which, given an input variable, subtracts the variable’s mean from every element and then divides by its standard deviation.
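
As a quick sanity check of what the scale function is doing, the z-score can also be computed by hand on a made-up vector x (this snippet is only illustrative):

# z-score by hand versus the built-in scale function
x <- c( 2, 4, 9, 13 )
manual <- ( x - mean(x) ) / sd(x)
all.equal( manual, as.vector( scale(x) ) )  # TRUE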

# hierarchical clustering, with k (number of clusters) = 3
d <- dist( scale(data_sku), method = "euclidean" )
cluster <- hclust( d, method = "ward.D" )
data_sku[ ,  groups := as.factor( cutree( cluster, k = 3 ) ) ]

Now that we have obtained the grouping of the data, let’s visualize it to confirm that the algorithm’s result matches our intuition.

head(data_sku)
##    ADS   CV groups
## 1:   1 0.68      1
## 2:   3 0.40      1
## 3:   1 0.59      1
## 4:   2 0.39      1
## 5:   9 0.11      2
## 6:   2 0.56      1
ggplot( data_sku, aes(CV, ADS, color = groups) ) + geom_point() + 
labs( x = "Coefficient of Variation", y = "Average Daily Sales", 
      title = "SKU Example" )

After we’ve identified the groups, the next step is to transform these insights into actions, that is, to decide what we should do with the patterns we found.

  • group 1 : SKUs with low sales and high variability. We’ll call them “crickets” (giving groups names makes business audiences more likely to remember them), because not only are the sales for these products low, they can also jump unexpectedly. These stocks will be “made to order”.
    • Make to Order : This strategy waits until an order comes in before starting the production process. Since the sales are small in any case, it’s not really efficient to prepare production too far in advance, so we reduce the risk by producing the goods only when an order is placed at the end of the chain.
  • group 2 : SKUs with high sales and low variability. We’ll call them “horses”, because these products are strong and reliable. These stocks will be “made to stock”.
    • Make to Stock : This production strategy forecasts demand to determine how much stock should be produced. If demand for the product can be forecasted accurately, the MTS strategy can be an efficient choice. That is reasonable here, because the sales for these products are expected to be high and the risk of an inaccurate forecast is low (low coefficient of variation).
  • group 3 : SKUs with high sales and high variability. We’ll call them “wild bulls”, because despite their potentially high sales, they are difficult to control. These SKUs will be treated on a case-by-case basis. Things may be a bit difficult to anticipate, but their returns may be high.

This concludes the first example.

Example 2: Human Resources

For the next example, suppose you’re an HR manager at a big consulting company and you’re concerned by the high number of employees leaving the firm. In practice, you cannot follow up with each one of them too often, as that would be very time consuming. Hence, this is also a case where you have to rely on an HR analytics solution to understand the most frequent situations that explain why an employee decides to leave, and to discover the actions we can take in order to retain them.

Now let’s say your company has collected a bunch of data from employees that have already left.

data_hr <- fread("data/DATA_2.02_HR.csv")
head(data_hr)
##       S  LPE NP ANH TIC Newborn
## 1: 0.38 0.53  2 157   3       0
## 2: 0.80 0.86  5 262   6       0
## 3: 0.11 0.88  7 272   4       0
## 4: 0.72 0.87  5 223   5       0
## 5: 0.37 0.52  2 159   3       0
## 6: 0.41 0.50  2 153   3       0

This dataset contains 2000 observations and 6 variables, each representing :

  • S The satisfaction level, on a scale of 0 to 1.
  • LPE Last project evaluation by a client, on a scale of 0 to 1.
  • NP The number of projects the employee worked on in the last 12 months.
  • ANH Average number of hours worked by the employee in the last 12 months.
  • TIC The amount of time the employee has spent in the company, measured in years.
  • Newborn This variable takes the value 1 if the employee had a newborn within the last 12 months and 0 otherwise.

There are two lessons we can learn from this example.

  1. This is the kind of data where we can’t use visualization to determine the number of groups before applying the clustering algorithm, like we did in the first example, because now we have a total of 6 variables to consider.

  2. Before applying the clustering algorithm, we should determine which variables to include. If the dataset contains no missing values, as data scientists we’re often used to doing this by looking at the correlations between the variables ( of course, there are more complex ways to do this ). Suppose the correlation between variables A and B is high; then we can choose to drop one of them, because after obtaining the results we can use the variable we’ve retained to reasonably infer the behavior of the one we dropped. We can easily achieve this step with the findCorrelation function from the caret package, which, given a correlation matrix, prints out the highly correlated attributes.

# scale the data (you don't really need to do this when comparing correlations),
# it is used for the later plotting and clustering steps
scaled <- scale(data_hr)
data_hr_scaled <- data.table(scaled)

# find variables that have correlations higher than 0.8 and print out variable names
findCorrelation( cor(data_hr_scaled), cutoff = 0.8, names = TRUE )
## [1] "LPE" "NP"

In this case, the function suggests that we remove the NP attribute, as it correlates highly with the LPE attribute. Here, however, it might be a bad idea to use this correlation method to determine which variables to exclude, since neither of them is normally distributed ( the default method of the cor function, which performs the correlation calculation, is Pearson correlation, and this method is sensitive to non-normality in the variables ). The histograms below show the two distributions.

ggplot( melt( data_hr_scaled[ , c( "LPE", "NP" ), with = FALSE ] ), 
        aes( value, fill = variable ) ) + 
geom_histogram( alpha = .4, position = "identity" )
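
If we still wanted a purely statistical screen, one rank-based alternative that is less sensitive to non-normality is Spearman correlation. A hedged sketch of plugging it into the same findCorrelation step is shown below; its output is not shown here, and it may or may not flag the same variables.

# same screen as above, but with a Spearman correlation matrix
# instead of the default Pearson one
findCorrelation( cor( data_hr_scaled, method = "spearman" ),
                 cutoff = 0.8, names = TRUE )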

So next, we’ll look at this feature selection approach from the business point of view. In this case, we should note that the S variable, the level of satisfaction, is actually a consequence of everything else. We cannot really act on it directly; it has to be seen as a consequence and not as a driver of managerial impact.

Knowing this by intuition, we can remove this variable before conducting the clustering. This time we’ll choose the number of clusters to be 4 (you can call it the number of segments if you wish). This number was decided by plotting the dendrogram of the clustering result; it’s not shown here in the report since it looks a bit awful, which is a common problem with dendrograms on bigger datasets. We’ll also calculate the median of each variable for each of the four groups to discover patterns, and count the proportion size of each group. Note that we will also include the S variable in the aggregation.
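
For completeness, below is a minimal sketch of how one might still inspect the dendrogram on a dataset of this size, by suppressing the 2000 leaf labels and drawing boxes around a candidate cut at k = 4. The hr_d and hr_cluster names are throwaway; the same distance matrix and clustering are rebuilt in the main code that follows.

# dendrogram without leaf labels, with rectangles marking a 4-cluster cut
hr_d       <- dist( data_hr_scaled[ , -1, with = FALSE ], method = "euclidean" )
hr_cluster <- hclust( hr_d, method = "ward.D" )
plot( hr_cluster, labels = FALSE, hang = -1, main = "HR dendrogram" )
rect.hclust( hr_cluster, k = 4 )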

# clustering 
d <- dist( data_hr_scaled[ , -1, with = FALSE ], method = "euclidean" ) 
cluster <- hclust( d, method = "ward.D" )
# plot(cluster) # dendrogram
data_hr_scaled[ ,  groups := as.factor( cutree( cluster, k = 4 ) ) ]
##                S        LPE          NP        ANH        TIC    Newborn
##    1: -0.2264726 -0.9673386 -1.03270343 -0.8301389 -0.9019539 -0.2353322
##    2:  1.3600255  0.7038481  0.61786146  0.8818575  2.1798931 -0.2353322
##    3: -1.2463643  0.8051322  1.71823805  1.0449048  0.1253284 -0.2353322
##    4:  1.0578354  0.7544902  0.61786146  0.2459731  1.1526108 -0.2353322
##    5: -0.2642464 -1.0179806 -1.03270343 -0.7975295 -0.9019539 -0.2353322
##   ---                                                                   
## 1996: -0.2642464 -0.7647705 -1.03270343 -0.9931862 -0.9019539 -0.2353322
## 1997: -1.2463643  1.0077003  1.71823805  1.3873041  0.1253284 -0.2353322
## 1998: -0.1131513 -0.9673386 -1.03270343 -0.8301389 -0.9019539 -0.2353322
## 1999:  1.5111206  1.2102684  0.06767316  0.6372866  1.1526108 -0.2353322
## 2000: -0.1509251 -1.0686226 -1.03270343 -0.9768815 -0.9019539 -0.2353322
##       groups
##    1:      1
##    2:      2
##    3:      3
##    4:      2
##    5:      1
##   ---       
## 1996:      1
## 1997:      3
## 1998:      1
## 1999:      2
## 2000:      1
# median aggregation 
hr_agg <- data_hr_scaled[ , lapply( .SD, median ), by = groups ]

# order by cluster's proportion size
t <- table(data_hr_scaled$groups)
hr_agg[ , proportion := t / sum(t) ][ order(-proportion) ]
##    groups           S        LPE          NP        ANH        TIC
## 1:      1 -0.11315131 -1.0686226 -1.03270343 -1.0421004 -0.9019539
## 2:      3 -1.24636428  0.7038481  1.16804975  0.9959906  0.1253284
## 3:      2  1.39779931  1.0077003  0.61786146  0.6943531  1.1526108
## 4:      4 -0.07537755  0.3493540  0.06767316  0.1644495  0.1253284
##       Newborn proportion
## 1: -0.2353322     0.3965
## 2: -0.2353322     0.2880
## 3: -0.2353322     0.2630
## 4:  4.2471867     0.0525

From the result of this table, you can see that :

  • group 1 : We can see low values for the four variables LPE (last project evaluation by a client), NP (number of projects), ANH (average number of working hours) and TIC (total time spent in the company). These are probably the “low performance” ones, and they are still quite junior in the company, so it may not be our top priority to try to retain them.

  • group 3 : This is the second row of the table, don’t get confused. The considerably high LPE, NP, ANH and TIC together with a low satisfaction level S suggest that these people are the ones that are over-utilized. Simply put, they are doing a good job but are “burned out”. Given this result, as an HR manager these people should be your first priority. Next time you see that a high-performing employee is working too much, you should be able to anticipate the situation and proactively help that employee take a step back.

  • group 2 : This group is a bit tricky. You’ll notice that, similar to group 3, they have high LPE, NP, ANH and TIC levels, but from the high rating of the S variable it seems that they’re still quite satisfied with their work. So what can we do to retain these “high-performing” employees who are happy but want to leave nonetheless? For these people we can assume that we’re paying them too little or that the projects assigned to them are simply too easy. We can then try to give them a raise, a promotion or more challenging projects.

  • group 4 : This final group is assumed to be a “miscellaneous” segment, as it doesn’t have any obvious characteristic apart from the Newborn column indicating that the employee had a newborn within the last 12 months. A straightforward way to retain these employees is to provide parental leave or on-site child care. But notice from the proportion column that this is a relatively small group; given the limited resources we have, it’s not really a priority either.

When presenting the result table above, one visualization you can use is the radar chart.

  • To use the ggradar function, you will need some extra fonts for it to work; see the instructions here to get them. Currently, it does not support faceting and the first column has to be the group (future work).
  • Each variable of the data is recorded on a different scale, and normalizing them via the scale function produces negative numbers, which are not very intuitive for radar charts. Thus a different kind of normalization is used here, namely max-min normalization. With this method, you subtract each column’s minimum value and divide by the difference between the maximum and minimum value, ( x - min(x) ) / ( max(x) - min(x) ). The rescaled scores may look a bit odd, as this normalization will produce values of exactly 0 (a quick check of this rescaling is shown right below).
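
Here is that quick check, on a made-up vector x, confirming that the rescale function from the scales package performs exactly this max-min normalization:

# rescale versus the max-min formula written out by hand
x <- c( 3, 5, 11 )
all.equal( rescale(x), ( x - min(x) ) / ( max(x) - min(x) ) )  # TRUE

With that settled, we apply this rescaling to the aggregated table:
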
hr_agg2 <- hr_agg[ , -c( "groups", "proportion" ), with = FALSE ]
hr_agg2 <- hr_agg2[ , lapply( .SD, rescale ) ]
hr_agg2[ , groups := as.factor( paste0( "group", 1:nrow(hr_agg2) ) ) ]
##            S       LPE   NP   ANH TIC Newborn groups
## 1: 0.4285714 0.0000000 0.00 0.000 0.0       0 group1
## 2: 1.0000000 1.0000000 0.75 0.852 1.0       0 group2
## 3: 0.0000000 0.8536585 1.00 1.000 0.5       0 group3
## 4: 0.4428571 0.6829268 0.50 0.592 0.5       1 group4
name <- colnames(hr_agg2)
setcolorder( hr_agg2, c( "groups", name[ name != "groups" ] ) )

# radar plot
ggradar(hr_agg2)

End of the second example.

Takeaways

  1. There are different perspectives (intuition, business sense, technical approaches) as to which variables should or should not be included in the dataset prior to running the clustering method. As clustering algorithms are categorized as unsupervised algorithms, there is no way of measuring whether the output is “correct”, so you should choose your input variables (features) wisely! Speaking of this, suppose that for some reason you think it’s nonsense to exclude the S variable, the satisfaction level, like we did in this documentation. You should add it back and run the clustering; you’ll obtain similar grouping results for this dataset. And when you realize that whether you include a variable or not does not affect the result, you should exclude it, as this lessens the computational load for the algorithm and your computer (a sketch of this check is included after this list).

  2. After you obtain the results, providing visualizations and naming your segments in a self-explanatory and meaningful way can make your results more convincing. We’ve heard a billion times already that a picture is worth a thousand words. More importantly, you should always ask yourself: given the results, what are you going to do with them, and how can they help your business goals?

  3. Following up on the second point about interpreting the results: it may be the case that even though we’ve identified different groups, they’ll end up being treated similarly. There will also be times when we can’t transform every result into immediate actions, times when the events may be exogenous and can’t be explained. If such a group occurs with a high proportion, we may want to collect additional data to try to understand this specific case.

  4. For clustering methods, the number of segments (clusters) is a parameter that has to be user-specified before running the algorithm. In this module this number was chosen mostly by common sense and business criteria.
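
As a sketch of the check described in the first takeaway, we could re-run the clustering with the S variable included and cross-tabulate the two group assignments. The d_all and cluster_all names are throwaway; note that the cluster labels themselves may be permuted between runs, so what matters is that most observations land in matching cells.

# cluster on all variables (only the previously assigned groups column is dropped)
d_all       <- dist( data_hr_scaled[ , -c("groups"), with = FALSE ], method = "euclidean" )
cluster_all <- hclust( d_all, method = "ward.D" )

# cross-tabulate the new assignment against the one obtained without S
table( with_S = cutree( cluster_all, k = 4 ), without_S = data_hr_scaled$groups )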

R Session Information

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.4 (2016-03-10)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2016-12-28
## Packages ------------------------------------------------------------------
##  package      * version date       source                               
##  assertthat     0.1     2013-12-06 CRAN (R 3.2.0)                       
##  bookdown       0.1     2016-07-13 CRAN (R 3.2.5)                       
##  car            2.1-2   2016-03-25 CRAN (R 3.2.4)                       
##  caret        * 6.0-71  2016-08-05 CRAN (R 3.2.5)                       
##  codetools      0.2-14  2015-07-15 CRAN (R 3.2.4)                       
##  colorspace     1.2-6   2015-03-11 CRAN (R 3.2.0)                       
##  data.table   * 1.10.0  2016-12-03 CRAN (R 3.2.5)                       
##  devtools       1.12.0  2016-06-24 CRAN (R 3.2.5)                       
##  digest         0.6.9   2016-01-08 CRAN (R 3.2.3)                       
##  evaluate       0.9     2016-04-29 cran (@0.9)                          
##  foreach        1.4.3   2015-10-13 CRAN (R 3.2.0)                       
##  formatR        1.4     2016-05-09 cran (@1.4)                          
##  ggplot2      * 2.2.0   2016-11-11 CRAN (R 3.2.5)                       
##  ggradar      * 0.1     2016-04-12 Github (ricardo-bion/ggradar@559eeb7)
##  gtable         0.2.0   2016-02-26 CRAN (R 3.2.3)                       
##  highr          0.6     2016-05-09 cran (@0.6)                          
##  htmltools      0.3.5   2016-03-21 CRAN (R 3.2.4)                       
##  httpuv         1.3.3   2015-08-04 CRAN (R 3.2.0)                       
##  iterators      1.0.8   2015-10-13 CRAN (R 3.2.0)                       
##  knitr          1.14    2016-08-13 CRAN (R 3.2.4)                       
##  labeling       0.3     2014-08-23 CRAN (R 3.2.0)                       
##  lattice      * 0.20-33 2015-07-14 CRAN (R 3.2.4)                       
##  lazyeval       0.2.0   2016-06-12 CRAN (R 3.2.5)                       
##  lme4           1.1-11  2016-02-12 CRAN (R 3.2.3)                       
##  magrittr       1.5     2014-11-22 CRAN (R 3.2.0)                       
##  MASS           7.3-45  2015-11-10 CRAN (R 3.2.4)                       
##  Matrix         1.2-4   2016-03-02 CRAN (R 3.2.4)                       
##  MatrixModels   0.4-1   2015-08-22 CRAN (R 3.2.0)                       
##  memoise        1.0.0   2016-01-29 CRAN (R 3.2.3)                       
##  mgcv           1.8-12  2016-03-03 CRAN (R 3.2.4)                       
##  mime           0.4     2015-09-03 CRAN (R 3.2.0)                       
##  miniUI         0.1.1   2016-01-15 CRAN (R 3.2.3)                       
##  minqa          1.2.4   2014-10-09 CRAN (R 3.2.0)                       
##  munsell        0.4.3   2016-02-13 CRAN (R 3.2.3)                       
##  nlme           3.1-125 2016-02-27 CRAN (R 3.2.4)                       
##  nloptr         1.0.4   2014-08-04 CRAN (R 3.2.0)                       
##  nnet           7.3-12  2016-02-02 CRAN (R 3.2.3)                       
##  pbkrtest       0.4-6   2016-01-27 CRAN (R 3.2.3)                       
##  plyr           1.8.4   2016-06-08 cran (@1.8.4)                        
##  quantreg       5.21    2016-02-13 CRAN (R 3.2.3)                       
##  questionr      0.5     2016-03-15 CRAN (R 3.2.4)                       
##  R6             2.1.2   2016-01-26 CRAN (R 3.2.3)                       
##  Rcpp           0.12.5  2016-05-14 cran (@0.12.5)                       
##  reshape2       1.4.1   2014-12-06 CRAN (R 3.2.0)                       
##  rmarkdown      1.1     2016-10-16 CRAN (R 3.2.4)                       
##  rmdformats     0.3     2016-09-05 CRAN (R 3.2.5)                       
##  rstudioapi     0.6     2016-06-27 CRAN (R 3.2.5)                       
##  scales       * 0.4.1   2016-11-09 CRAN (R 3.2.5)                       
##  shiny          0.13.2  2016-03-28 CRAN (R 3.2.4)                       
##  SparseM        1.7     2015-08-15 CRAN (R 3.2.0)                       
##  stringi        1.0-1   2015-10-22 CRAN (R 3.2.0)                       
##  stringr        1.0.0   2015-04-30 CRAN (R 3.2.0)                       
##  tibble         1.2     2016-08-26 CRAN (R 3.2.5)                       
##  withr          1.0.1   2016-02-04 CRAN (R 3.2.3)                       
##  xtable         1.8-2   2016-02-05 CRAN (R 3.2.3)                       
##  yaml           2.1.13  2014-06-12 CRAN (R 3.2.0)

Ethen Liu

2016-04-05