Faceted Bar Charts

Source code can be obtained here

Pie Chart

The purpose of a pie chart is to show how parts relate to a whole. That said, pie charts can still be really bad at the one thing they’re ostensibly designed to do. Consider the following pie chart from the Wall Street Journal article What Data Scientists Do All Day at Work.

The problem with the plot is that our original goal was to compare and contrast these six tasks, yet at a glance, do you have any idea whether more time is spent on “Presenting Analysis” or “Data Cleaning”? If the primary intent is to compare the hour categories within each task, while keeping some ability to compare the same time category across tasks, then bar plots are probably the way to go.

The code below prepares the data d, which contains three columns:

  • Task: the different categories of task.
  • Hours: the amount of time spent.
  • Percentage: the percentage of people that selected this answer.
library(scales)
library(ggplot2)
library(ggthemes)
library(data.table)
setwd("/Users/ethen/Business-Analytics/articles/avoid_pie_charts")
d <- fread("avoid_pie_charts_data.txt")
d <- melt( d, id.vars = "Task", 
           variable.name = "Hours", value.name = "Percentage" )
head(d)
##                               Task      Hours Percentage
## 1: Basic exploratory data analysis < 1 a week         11
## 2:                   Data cleaning < 1 a week         19
## 3:     Machine learning/statistics < 1 a week         34
## 4:         Creating visualizations < 1 a week         23
## 5:             Presenting analysis < 1 a week         27
## 6:          Extract/transform/load < 1 a week         43
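
The melt call above reshapes the data from wide format (one column per time category) to long format (one row per task and time category), which is the shape ggplot2 expects. Here is a minimal sketch on a toy table. The task names, column names and numbers are made up for illustration and are not the survey data.

```r
library(data.table)

# Toy wide table: one row per task, one column per time category.
# All values here are invented for illustration only.
wide <- data.table(
    Task = c("Task A", "Task B"),
    `< 1 a week` = c(60, 30),
    `1+ a week` = c(40, 70)
)

# Same call as in the article: every non-id column becomes a level of
# Hours, and its cell values are collected in Percentage.
long <- melt(wide, id.vars = "Task",
             variable.name = "Hours", value.name = "Percentage")
print(long)
```

Note that melt stores Hours as a factor whose levels follow the original column order, which is why the bars later appear in the intended order rather than alphabetically.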

Alternative 1: Bar Plot

Rotate the x-axis text and label the exact percentages as text right on the graph, which can be useful when you want to pick out a specific number to emphasize.

p1 <- ggplot( d, aes( x = Hours, y = Percentage ) ) + 
      geom_bar( stat = "identity" ) + 
      facet_wrap( ~ Task ) + 
      xlab("Hours spent per week") + 
      geom_text( aes( label = paste0( Percentage, "%" ), y = Percentage ),
                 vjust = 1.4, size = 3, color = "white" )
p1 + theme_bw() + 
theme( axis.text.x = element_text( angle = 90,  hjust = 1 ) )

Alternative 2: Bar Plot

Use theme_tufte from the ggthemes package. The theme drops all the borders, grids, and axis lines to maximize the data-ink ratio.

p1 + theme_tufte()

Alternative 3: Stacked Bar Plot

ggplot( d, aes( x = Task, y = Percentage, fill = Hours ) ) + 
geom_bar( stat = "identity", position = "stack" ) +
coord_flip() +
scale_fill_brewer( palette = "YlGnBu" ) +
theme_minimal() + theme( legend.position = "bottom" )

Alternative 4: Refined Version of Bar Plot

Some notes on ggplot2’s grammar.

  • geom_bar’s size controls the bar’s border size (in ggplot2 3.4.0 and later this argument is named linewidth).
  • expand is a numeric vector of length two giving multiplicative and additive expansion constants. These constants ensure that the data is placed some distance away from the axes.
  • strip.background / strip.text control the title section of each facet.
  • panel.grid.minor / panel.grid.major control the grid lines in the plot.
  • panel.spacing controls the margin between facets.
  • as_labeller is the function that changes the strip’s label without changing the underlying data. All you have to do is create a named character vector that maps each original label (the name) to its new label (the value), call as_labeller on it, and pass the result to the labeller argument of facet_wrap (or facet_grid).
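
To make the expand note concrete, here is a rough base-R sketch of the arithmetic, assuming each side of the scale's range is widened by the range width times the multiplicative constant, plus the additive constant. The helper expanded_range is a hypothetical name used for illustration, not a ggplot2 function, and the sketch assumes the four Hours levels sit at positions 1 to 4 on the discrete x axis.

```r
# Hypothetical helper: widen a range on both sides by
# (range width * mult) + add.
expanded_range <- function(limits, mult = 0, add = 0) {
    limits + c(-1, 1) * (diff(limits) * mult + add)
}

# Four discrete levels at positions 1..4 with expand = c(0, 1):
# one full unit of padding on each side.
x_range <- expanded_range(c(1, 4), mult = 0, add = 1)
print(x_range)  # 0 5

# A continuous axis limited to c(0, 0.5) with expand = c(0.05, 0):
# 5% of the range width is added on each side.
y_range <- expanded_range(c(0, 0.5), mult = 0.05, add = 0)
print(y_range)  # -0.025 0.525
```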

The plot shows the amount of time spent on various tasks by surveyed non-managers in data-science positions (this could be added to the plot as a subtitle). Note that it isn’t actually going to tell us which tasks data scientists spend the most time on; for that we would need some kind of weighted measure to estimate the mean.

# refined x label and strip label
x_labels <- c( "<1 hr/\nweek", "1-4 hrs/\nweek", "1-3 hrs/\nday", "4+ hrs/\nday" )
label_names <- c( "Basic exploratory data analysis" = "Basic Exploratory\nData Analysis", 
                  "Data cleaning" = "Data\nCleaning", 
                  "Machine learning/statistics" = "Machine Learning,\nStatistics", 
                  "Creating visualizations" = "Creating\nVisualizations", 
                  "Presenting analysis" = "Presenting\nAnalysis", 
                  "Extract/transform/load" = "Extract,\nTransform, Load" )

ggplot( d, aes( x = Hours, y = Percentage / 100, fill = Hours ) ) +
geom_bar( stat = "identity", width = 0.75, color = "#2b2b2b", size = 0.05 ) + 
scale_y_continuous( labels = percent, limits = c( 0, 0.5 ) ) + 
scale_x_discrete( expand = c( 0, 1 ), labels = x_labels ) + 
scale_fill_manual( values = c( "#a6cdd9", "#d2e4ee", "#b7b079", "#efc750" ) ) +
facet_wrap( ~ Task, labeller = as_labeller(label_names) ) + 
labs( x = NULL, y = NULL, title = "Where Does the Time Go?" ) +
theme( strip.text = element_text( size = 12, color = "white", hjust = 0.5 ),
       strip.background = element_rect( fill = "#858585", color = NA ),    
       panel.background = element_rect( fill = "#efefef", color = NA ),
       panel.grid.major.x = element_blank(),
       panel.grid.minor.x = element_blank(),
       panel.grid.minor.y = element_blank(),
       panel.grid.major.y = element_line( color = "#b2b2b2" ),
       panel.spacing.x = unit( 1, "cm" ),
       panel.spacing.y = unit( 0.5, "cm" ),
       legend.position = "none" ) 
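
As for the weighted measure mentioned above, one rough way to approximate it is to assign each time category a representative number of weekly hours and take the percentage-weighted mean per task. The sketch below is only an illustration under stated assumptions: the helper estimate_mean_hours is a hypothetical name, the midpoints are guesses (a 5-day week, with the open-ended top category arbitrarily pinned at 5 hours a day), and the toy data frame contains made-up numbers rather than the survey results.

```r
# Assumed weekly hours per category, in the order of the Hours levels:
# <1 hr/week, 1-4 hrs/week, 1-3 hrs/day, 4+ hrs/day. These are guesses.
midpoints <- c(0.5, 2.5, 10, 25)

# Hypothetical helper: percentage-weighted mean of the assumed hours,
# computed separately for each task in a long-format table like d.
estimate_mean_hours <- function(d, midpoints) {
    stopifnot(is.factor(d$Hours), length(midpoints) == nlevels(d$Hours))
    hrs <- midpoints[as.integer(d$Hours)]
    est <- sapply(split(seq_len(nrow(d)), d$Task), function(i) {
        weighted.mean(hrs[i], w = d$Percentage[i])
    })
    sort(est, decreasing = TRUE)
}

# Toy long-format data with invented tasks, level names and percentages.
hour_levels <- c("bin 1", "bin 2", "bin 3", "bin 4")
toy <- data.frame(
    Task = rep(c("Task A", "Task B"), each = 4),
    Hours = factor(rep(hour_levels, times = 2), levels = hour_levels),
    Percentage = c(10, 40, 40, 10, 50, 30, 15, 5)
)

est <- estimate_mean_hours(toy, midpoints)
print(est)  # Task A: 7.55, Task B: 3.75
```

Calling estimate_mean_hours(d, midpoints) on the melted data should work the same way, since melt stores Hours as a factor in column order. The resulting ranking depends heavily on the chosen midpoints, especially for the open-ended top category.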

R Session Information

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.4 (2016-03-10)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2016-12-28
## Packages ------------------------------------------------------------------
##  package      * version date       source        
##  assertthat     0.1     2013-12-06 CRAN (R 3.2.0)
##  bookdown       0.1     2016-07-13 CRAN (R 3.2.5)
##  colorspace     1.2-6   2015-03-11 CRAN (R 3.2.0)
##  data.table   * 1.10.0  2016-12-03 CRAN (R 3.2.5)
##  devtools       1.12.0  2016-06-24 CRAN (R 3.2.5)
##  digest         0.6.9   2016-01-08 CRAN (R 3.2.3)
##  evaluate       0.9     2016-04-29 cran (@0.9)   
##  formatR        1.4     2016-05-09 cran (@1.4)   
##  ggplot2      * 2.2.0   2016-11-11 CRAN (R 3.2.5)
##  ggthemes     * 3.0.3   2016-04-09 CRAN (R 3.2.4)
##  gtable         0.2.0   2016-02-26 CRAN (R 3.2.3)
##  highr          0.6     2016-05-09 cran (@0.6)   
##  htmltools      0.3.5   2016-03-21 CRAN (R 3.2.4)
##  httpuv         1.3.3   2015-08-04 CRAN (R 3.2.0)
##  knitr          1.14    2016-08-13 CRAN (R 3.2.4)
##  labeling       0.3     2014-08-23 CRAN (R 3.2.0)
##  lazyeval       0.2.0   2016-06-12 CRAN (R 3.2.5)
##  magrittr       1.5     2014-11-22 CRAN (R 3.2.0)
##  memoise        1.0.0   2016-01-29 CRAN (R 3.2.3)
##  mime           0.4     2015-09-03 CRAN (R 3.2.0)
##  miniUI         0.1.1   2016-01-15 CRAN (R 3.2.3)
##  munsell        0.4.3   2016-02-13 CRAN (R 3.2.3)
##  plyr           1.8.4   2016-06-08 cran (@1.8.4) 
##  questionr      0.5     2016-03-15 CRAN (R 3.2.4)
##  R6             2.1.2   2016-01-26 CRAN (R 3.2.3)
##  RColorBrewer   1.1-2   2014-12-07 CRAN (R 3.2.0)
##  Rcpp           0.12.5  2016-05-14 cran (@0.12.5)
##  rmarkdown      1.1     2016-10-16 CRAN (R 3.2.4)
##  rmdformats     0.3     2016-09-05 CRAN (R 3.2.5)
##  rstudioapi     0.6     2016-06-27 CRAN (R 3.2.5)
##  scales       * 0.4.1   2016-11-09 CRAN (R 3.2.5)
##  shiny          0.13.2  2016-03-28 CRAN (R 3.2.4)
##  stringi        1.0-1   2015-10-22 CRAN (R 3.2.0)
##  stringr        1.0.0   2015-04-30 CRAN (R 3.2.0)
##  tibble         1.2     2016-08-26 CRAN (R 3.2.5)
##  withr          1.0.1   2016-02-04 CRAN (R 3.2.3)
##  xtable         1.8-2   2016-02-05 CRAN (R 3.2.3)
##  yaml           2.1.13  2014-06-12 CRAN (R 3.2.0)

Ethen Liu

2016-12-28