Faceted Bar Charts
Source code can be obtained here
Pie Chart
The purpose of a pie chart is to show the relationship of parts to a whole. That said, pie charts can still be really bad at the one thing they’re ostensibly designed to do. Consider the following pie chart from the Wall Street Journal article “What Data Scientists Do All Day at Work”.
The problem with this plot is that our original goal was to compare and contrast the six tasks. But at a glance, do you have any idea whether more time is spent on “Presenting Analysis” or “Data Cleaning”? If the primary intent is to compare the time categories within each task, while keeping some ability to compare the same time category across tasks, then bar plots are probably the way to go.
Prepares the data d, which contains three columns.
- Task: Different categories of task.
- Hours: Amount of time spent.
- Percentage: Percentage of people that selected this answer.
library(scales)
library(ggplot2)
library(ggthemes)
library(data.table)
setwd("/Users/ethen/Business-Analytics/articles/avoid_pie_charts")
d <- fread("avoid_pie_charts_data.txt")
d <- melt( d, id.vars = "Task",
variable.name = "Hours", value.name = "Percentage" )
head(d)
## Task Hours Percentage
## 1: Basic exploratory data analysis < 1 a week 11
## 2: Data cleaning < 1 a week 19
## 3: Machine learning/statistics < 1 a week 34
## 4: Creating visualizations < 1 a week 23
## 5: Presenting analysis < 1 a week 27
## 6: Extract/transform/load < 1 a week 43
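The wide-to-long reshaping above can be illustrated on a small made-up table. This is only a minimal sketch: the table wide, its task names, and its numbers are hypothetical and are not the survey data. Only melt and its arguments are taken from the code above.

```r
library(data.table)

# hypothetical wide table: one row per task, one column per time bucket
# (values are invented for illustration, not the survey results)
wide <- data.table(
    Task = c( "Task A", "Task B" ),
    `< 1 a week` = c( 10, 20 ),
    `1 to 4 a week` = c( 30, 40 )
)

# melt stacks the bucket columns into ( Hours, Percentage ) pairs,
# repeating the id column Task for every bucket
long <- melt( wide, id.vars = "Task",
              variable.name = "Hours", value.name = "Percentage" )

# 2 tasks x 2 buckets = 4 rows; Hours becomes a factor whose levels
# keep the original column order, which later fixes the order of the bars
print(long)
levels(long$Hours)
```

Because Hours comes out as a factor in column order, the bars in the plots that follow appear in the same order as the columns of the original file, assuming the file lists the buckets from least to most time.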
Alternative 1: Bar Plot
Rotates the x axis’s text and labels the exact percentages as text right on the graph, which can be useful when you want to pick out a specific number and emphasize it.
p1 <- ggplot( d, aes( x = Hours, y = Percentage ) ) +
geom_bar( stat = "identity" ) +
facet_wrap( ~ Task ) +
xlab("Hours spent per week") +
geom_text( aes( label = paste0( Percentage, "%" ), y = Percentage ),
vjust = 1.4, size = 3, color = "white" )
p1 + theme_bw() +
theme( axis.text.x = element_text( angle = 90, hjust = 1 ) )
Alternative 2: Bar Plot
Uses theme_tufte from the ggthemes package. The theme will drop all the borders, grids, and axis lines to maximize the data / ink ratio.
p1 + theme_tufte()
Alternative 3: Stacked Bar Plot
ggplot( d, aes( x = Task, y = Percentage, fill = Hours ) ) +
geom_bar( stat = "identity", position = "stack" ) +
coord_flip() +
scale_fill_brewer( palette = "YlGnBu" ) +
theme_minimal() + theme( legend.position = "bottom" )
Alternative 4: Refined Version of Bar Plot
Some notes on ggplot2’s grammar.
- geom_bar’s size controls the bar’s border size.
- expand is a numeric vector of length two giving multiplicative and additive expansion constants. These constants ensure that the data is placed some distance away from the axes.
- strip.background / strip.text control the title section for each facet.
- panel.grid.minor / panel.grid.major control the grid lines in the plot.
- panel.spacing controls the margin between facets.
- as_labeller is the function that changes the strip’s label without changing the underlying data. All you have to do is create a named character vector whose names are the original labels and whose values are the new labels, call the function on it, and pass the result to the labeller argument of facet_wrap or facet_grid.
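The as_labeller note can be tried in isolation on toy data. This is a minimal sketch under stated assumptions: the data frame toy, its column names, and the labels are all hypothetical. Only facet_wrap, as_labeller and geom_bar are real ggplot2 functions.

```r
library(ggplot2)

# hypothetical toy data; group names and values are invented
toy <- data.frame( group = c( "a", "a", "b", "b" ),
                   x = c( "u", "v", "u", "v" ),
                   y = c( 1, 2, 3, 4 ) )

# names are the original facet values, values are the labels to display
new_labels <- c( "a" = "Group A", "b" = "Group B" )

p <- ggplot( toy, aes( x = x, y = y ) ) +
    geom_bar( stat = "identity" ) +
    facet_wrap( ~ group, labeller = as_labeller(new_labels) )
p

# only the strip text changes; the data still holds the original values
unique(toy$group)
```

Keeping the relabelling in the plot layer means the same data can feed other plots or summaries without first undoing a rename.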
A possible subtitle for the plot: “The amount of time spent on various tasks by surveyed non-managers in data-science positions.” Note that this plot doesn’t actually tell us which tasks data scientists spend the most time on; for that we would need some kind of weighted measure to estimate the mean time per task.
# refined x label and strip label
x_labels <- c( "<1 hr/\nweek", "1-4 hrs/\nweek", "1-3 hrs/\nday", "4+ hrs/\nday" )
label_names <- c( "Basic exploratory data analysis" = "Basic Exploratory\nData Analysis",
"Data cleaning" = "Data\nCleaning",
"Machine learning/statistics" = "Machine Learning,\nStatistics",
"Creating visualizations" = "Creating\nVisualizations",
"Presenting analysis" = "Presenting\nAnalysis",
"Extract/transform/load" = "Extract,\nTransform, Load" )
ggplot( d, aes( x = Hours, y = Percentage / 100, fill = Hours ) ) +
geom_bar( stat = "identity", width = 0.75, color = "#2b2b2b", size = 0.05 ) +
scale_y_continuous( labels = percent, limits = c( 0, 0.5 ) ) +
scale_x_discrete( expand = c( 0, 1 ), labels = x_labels ) +
scale_fill_manual( values = c( "#a6cdd9", "#d2e4ee", "#b7b079", "#efc750" ) ) +
facet_wrap( ~ Task, labeller = as_labeller(label_names) ) +
labs( x = NULL, y = NULL, title = "Where Does the Time Go?" ) +
theme( strip.text = element_text( size = 12, color = "white", hjust = 0.5 ),
strip.background = element_rect( fill = "#858585", color = NA ),
panel.background = element_rect( fill = "#efefef", color = NA ),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_line( color = "#b2b2b2" ),
panel.spacing.x = unit( 1, "cm" ),
panel.spacing.y = unit( 0.5, "cm" ),
legend.position = "none" )
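The weighted measure mentioned before the plot could look roughly like the sketch below. It is not part of the original analysis, and it rests on loud assumptions: the table toy and its percentages are invented, the bucket names are taken from the x axis labels, and the representative weekly hours per bucket (0.5, 2.5, 10 and 25, based on a 5-day week) are judgment calls rather than anything reported by the survey.

```r
library(data.table)

# hypothetical long-format data in the same shape as d
# (percentages are invented, not the survey results)
buckets <- c( "<1 hr/week", "1-4 hrs/week", "1-3 hrs/day", "4+ hrs/day" )
toy <- data.table(
    Task = rep( c( "Task A", "Task B" ), each = 4 ),
    Hours = factor( rep( buckets, times = 2 ), levels = buckets ),
    Percentage = c( 10, 20, 40, 30,
                    50, 30, 15, 5 )
)

# assumed representative weekly hours per bucket, in level order:
# 0.5, 2.5, 2 * 5 = 10 and 5 * 5 = 25 ( the last bucket is open-ended )
weekly_hours <- c( 0.5, 2.5, 10, 25 )

# weight each bucket's assumed hours by the share of respondents in it
est <- toy[ , .( MeanHours = weighted.mean( weekly_hours[ as.integer(Hours) ],
                                            w = Percentage ) ),
            by = Task ]

# tasks ranked by estimated mean weekly hours
est[ order(-MeanHours) ]
```

The ranking is sensitive to the value chosen for the open-ended top bucket, so it is worth re-running with a few different choices before drawing conclusions.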
R Session Information
devtools::session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.2.4 (2016-03-10)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2016-12-28
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.2.0)
## bookdown 0.1 2016-07-13 CRAN (R 3.2.5)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
## data.table * 1.10.0 2016-12-03 CRAN (R 3.2.5)
## devtools 1.12.0 2016-06-24 CRAN (R 3.2.5)
## digest 0.6.9 2016-01-08 CRAN (R 3.2.3)
## evaluate 0.9 2016-04-29 cran (@0.9)
## formatR 1.4 2016-05-09 cran (@1.4)
## ggplot2 * 2.2.0 2016-11-11 CRAN (R 3.2.5)
## ggthemes * 3.0.3 2016-04-09 CRAN (R 3.2.4)
## gtable 0.2.0 2016-02-26 CRAN (R 3.2.3)
## highr 0.6 2016-05-09 cran (@0.6)
## htmltools 0.3.5 2016-03-21 CRAN (R 3.2.4)
## httpuv 1.3.3 2015-08-04 CRAN (R 3.2.0)
## knitr 1.14 2016-08-13 CRAN (R 3.2.4)
## labeling 0.3 2014-08-23 CRAN (R 3.2.0)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.2.5)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
## memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
## mime 0.4 2015-09-03 CRAN (R 3.2.0)
## miniUI 0.1.1 2016-01-15 CRAN (R 3.2.3)
## munsell 0.4.3 2016-02-13 CRAN (R 3.2.3)
## plyr 1.8.4 2016-06-08 cran (@1.8.4)
## questionr 0.5 2016-03-15 CRAN (R 3.2.4)
## R6 2.1.2 2016-01-26 CRAN (R 3.2.3)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.2.0)
## Rcpp 0.12.5 2016-05-14 cran (@0.12.5)
## rmarkdown 1.1 2016-10-16 CRAN (R 3.2.4)
## rmdformats 0.3 2016-09-05 CRAN (R 3.2.5)
## rstudioapi 0.6 2016-06-27 CRAN (R 3.2.5)
## scales * 0.4.1 2016-11-09 CRAN (R 3.2.5)
## shiny 0.13.2 2016-03-28 CRAN (R 3.2.4)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.0)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
## tibble 1.2 2016-08-26 CRAN (R 3.2.5)
## withr 1.0.1 2016-02-04 CRAN (R 3.2.3)
## xtable 1.8-2 2016-02-05 CRAN (R 3.2.3)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)