R’s data.table

The source of this tutorial, including the example datasets, is available here on GitHub

SQL Joins

For joining data.tables, the basic syntax are:

JOIN type DT syntax data.table::merge() syntax
INNER X[Y, nomatch = 0] merge(X, Y, all = FALSE)
LEFT OUTER Y[X] merge(X, Y, all.x = TRUE)
RIGHT OUTER X[Y] merge(X, Y, all.y = TRUE)
FULL OUTER - merge(X, Y, all = TRUE)

The example data consists of the following two tables:

library(data.table)
## Warning: package 'data.table' was built under R version 3.2.5
setwd("/Users/ethen/Business-Analytics/R/data_table")
employees <- fread("data/employees.csv")
departments <- fread("data/departments.csv")
knitr::kable(employees, caption = "Table Employees")
Table Employees
Employee EmployeeName Department Salary
1 Alice 11 800
2 Bob 11 600
3 Carla 12 900
4 Daniel 12 1000
5 Evelyn 13 800
6 Ferdinand 21 700
knitr::kable(departments, caption = "Table Departments")
Table Departments
Department DepartmentName Manager
11 Production 1
12 Sales 4
13 Marketing 5
14 Research NA

Inner Join

Figure linked from Pinal Dave
Figure1 from Pinal Dave

the INNER JOIN returns the rows with a match in both tables

# inner join with merge
merge(employees, departments, by = "Department")
##    Department Employee EmployeeName Salary DepartmentName Manager
## 1:         11        1        Alice    800     Production       1
## 2:         11        2          Bob    600     Production       1
## 3:         12        3        Carla    900          Sales       4
## 4:         12        4       Daniel   1000          Sales       4
## 5:         13        5       Evelyn    800      Marketing       5
# inner join with keys
# set the keys of the tables to represent the by:
setkey(employees, Department)
setkey(departments, Department)
# equivalent to setkeyv( departments, "Department" )

# note that you only need to set the keys once
# you can confirm if it works with
key(employees)
## [1] "Department"
# perform the join,
# nomatch 0 means no rows will be returned 
# from not matched rows from the Right table
employees[ departments, nomatch = 0 ]
##    Employee EmployeeName Department Salary DepartmentName Manager
## 1:        1        Alice         11    800     Production       1
## 2:        2          Bob         11    600     Production       1
## 3:        3        Carla         12    900          Sales       4
## 4:        4       Daniel         12   1000          Sales       4
## 5:        5       Evelyn         13    800      Marketing       5

As we’ll see later, without anything else, TABLE_X[TABLE_Y] returns a right outer join; setting nomatch=0 returns a inner join.

note:

Employee #6, Ferdinand, has not been returned, as he’s in a yet to be created department 21. Neither has department #14, Research, as there are no employees.

Left Outer Join

Figure linked from Pinal Dave
Figure2 from Pinal Dave

the LEFT OUTER JOIN returns all the rows from the left table, filling in matched columns (or NA) from the right table

# left join with merge, notice that the column's ordering is different
merge(employees, departments, by = "Department", all.x = TRUE)
##    Department Employee EmployeeName Salary DepartmentName Manager
## 1:         11        1        Alice    800     Production       1
## 2:         11        2          Bob    600     Production       1
## 3:         12        3        Carla    900          Sales       4
## 4:         12        4       Daniel   1000          Sales       4
## 5:         13        5       Evelyn    800      Marketing       5
## 6:         21        6    Ferdinand    700             NA      NA
departments[employees]
##    Department DepartmentName Manager Employee EmployeeName Salary
## 1:         11     Production       1        1        Alice    800
## 2:         11     Production       1        2          Bob    600
## 3:         12          Sales       4        3        Carla    900
## 4:         12          Sales       4        4       Daniel   1000
## 5:         13      Marketing       5        5       Evelyn    800
## 6:         21             NA      NA        6    Ferdinand    700

note:

Employee #6, Ferdinand, has been returned with department name as NA.
Department #14, Research, has not been returned.

Right Outer Join

Figure linked from Pinal Dave
Figure2 from Pinal Dave

the RIGHT OUTER JOIN returns all the rows from the right table, filling in matched columns (or else NA) from the left table

# right join with merge
merge(employees, departments, by = "Department", all.y = TRUE)
##    Department Employee EmployeeName Salary DepartmentName Manager
## 1:         11        1        Alice    800     Production       1
## 2:         11        2          Bob    600     Production       1
## 3:         12        3        Carla    900          Sales       4
## 4:         12        4       Daniel   1000          Sales       4
## 5:         13        5       Evelyn    800      Marketing       5
## 6:         14       NA           NA     NA       Research      NA
employees[departments]
##    Employee EmployeeName Department Salary DepartmentName Manager
## 1:        1        Alice         11    800     Production       1
## 2:        2          Bob         11    600     Production       1
## 3:        3        Carla         12    900          Sales       4
## 4:        4       Daniel         12   1000          Sales       4
## 5:        5       Evelyn         13    800      Marketing       5
## 6:       NA           NA         14     NA       Research      NA
# we can also do a not matched with !
# returning the rows in the employees table that are not in the
# department table
employees[!departments]
##    Employee EmployeeName Department Salary
## 1:        6    Ferdinand         21    700

Full Outer Join

Figure linked from Pinal Dave
Figure2 from Pinal Dave

the FULL OUTER JOIN returns all the rows from both tables, filling in matched columns (or else NA)

merge(employees, departments, by = "Department", all = TRUE)
##    Department Employee EmployeeName Salary DepartmentName Manager
## 1:         11        1        Alice    800     Production       1
## 2:         11        2          Bob    600     Production       1
## 3:         12        3        Carla    900          Sales       4
## 4:         12        4       Daniel   1000          Sales       4
## 5:         13        5       Evelyn    800      Marketing       5
## 6:         14       NA           NA     NA       Research      NA
## 7:         21        6    Ferdinand    700             NA      NA

Other Tips and Tricks

.SD and .N

First .SD refers to the subset of data for each group. The example below gets the sum of y and z and the number of rows (using .N) in each group while grouped by x.

DT <- data.table( x = sample(1:2, 7, replace = TRUE), 
                  y = sample(1:13, 7), 
                  z = sample(1:14, 7) ) 
DT[ , c( lapply(.SD, sum), .N ), by = x ]
##    x  y  z N
## 1: 1 35 35 4
## 2: 2 23 22 3

Key Indexing

Set the keys and select the matching rows without using ==.

DT <- data.table( A = letters[ c(2, 1, 2, 3, 1, 2, 3) ],
                  B = c(5, 4, 1, 9, 8, 8, 6), C = 6:12 )
setkey(DT, A, B)

# matches rows where A == "b"
DT["b"]
##    A B  C
## 1: b 1  8
## 2: b 5  6
## 3: b 8 11
# matches rows where A == "b" or A == "c"
DT[ c("b", "c"), ]
##    A B  C
## 1: b 1  8
## 2: b 5  6
## 3: b 8 11
## 4: c 6 12
## 5: c 9  9

Since we did not provide any values for the second key column “B”, it just matches c( “b”, “c” ) against the first key column “A” and returns all the matched rows. Consider the example below that matches both keys.

DT[ .("b", 5), ]
##    A B C
## 1: b 5 6

“b” is first matched against the first key column “A”. And within those matching rows, 5 is matched against the second key column “B” to obtain row indices where both “A” and “B” match the given values.

# Subset all rows where just the second key column matches
DT[ .(unique(A), 5), ]
##    A B  C
## 1: a 5 NA
## 2: b 5  6
## 3: c 5 NA

Selecting Parts of a Filtered Group

Select the first row of the b and c groups using mult.

DT[ c("b", "c"), mult = "first" ]
##    A B  C
## 1: b 1  8
## 2: c 6 12

When i is a data.table, DT[ i, j, by = .EACHI ] evaluates j for the groups of DT that each row in i joins to. We call this grouping by each i. See difference with and without .EACHI below.

# we can use [] for .SD to subset the data
DT[ c("b", "c"), .SD[ c(1, .N), ] ]
##    A B C
## 1: b 1 8
## 2: c 9 9
DT[ c("b", "c"), .SD[ c(1, .N), ], by = .EACHI ]
##    A B  C
## 1: b 1  8
## 2: b 8 11
## 3: c 6 12
## 4: c 9  9

Notice that .N here is used to retrieve the last row.

R Session Information

sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5     bookdown_0.1    digest_0.6.9    assertthat_0.1 
##  [5] mime_0.4        R6_2.1.2        xtable_1.8-2    formatR_1.4    
##  [9] magrittr_1.5    evaluate_0.9    rmdformats_0.3  highr_0.6      
## [13] stringi_1.0-1   rstudioapi_0.6  miniUI_0.1.1    rmarkdown_1.1  
## [17] tools_3.2.4     questionr_0.5   stringr_1.0.0   shiny_0.14.2   
## [21] httpuv_1.3.3    yaml_2.1.13     htmltools_0.3.5 knitr_1.14     
## [25] tibble_1.2

Ethen Liu

2017-01-13