R’s data.table
The source of this tutorial, including the example datasets, is available here on GitHub
SQL Joins
For joining data.tables, the basic syntax are:
JOIN type | DT syntax | data.table::merge() syntax |
---|---|---|
INNER | X[Y, nomatch = 0] | merge(X, Y, all = FALSE) |
LEFT OUTER | Y[X] | merge(X, Y, all.x = TRUE) |
RIGHT OUTER | X[Y] | merge(X, Y, all.y = TRUE) |
FULL OUTER | - | merge(X, Y, all = TRUE) |
The example data consists of the following two tables:
library(data.table)
## Warning: package 'data.table' was built under R version 3.2.5
setwd("/Users/ethen/Business-Analytics/R/data_table")
employees <- fread("data/employees.csv")
departments <- fread("data/departments.csv")
knitr::kable(employees, caption = "Table Employees")
Employee | EmployeeName | Department | Salary |
---|---|---|---|
1 | Alice | 11 | 800 |
2 | Bob | 11 | 600 |
3 | Carla | 12 | 900 |
4 | Daniel | 12 | 1000 |
5 | Evelyn | 13 | 800 |
6 | Ferdinand | 21 | 700 |
knitr::kable(departments, caption = "Table Departments")
Department | DepartmentName | Manager |
---|---|---|
11 | Production | 1 |
12 | Sales | 4 |
13 | Marketing | 5 |
14 | Research | NA |
Inner Join
Figure linked from Pinal Dave
the INNER JOIN returns the rows with a match in both tables
# inner join with merge
merge(employees, departments, by = "Department")
## Department Employee EmployeeName Salary DepartmentName Manager
## 1: 11 1 Alice 800 Production 1
## 2: 11 2 Bob 600 Production 1
## 3: 12 3 Carla 900 Sales 4
## 4: 12 4 Daniel 1000 Sales 4
## 5: 13 5 Evelyn 800 Marketing 5
# inner join with keys
# set the keys of the tables to represent the by:
setkey(employees, Department)
setkey(departments, Department)
# equivalent to setkeyv( departments, "Department" )
# note that you only need to set the keys once
# you can confirm if it works with
key(employees)
## [1] "Department"
# perform the join,
# nomatch 0 means no rows will be returned
# from not matched rows from the Right table
employees[ departments, nomatch = 0 ]
## Employee EmployeeName Department Salary DepartmentName Manager
## 1: 1 Alice 11 800 Production 1
## 2: 2 Bob 11 600 Production 1
## 3: 3 Carla 12 900 Sales 4
## 4: 4 Daniel 12 1000 Sales 4
## 5: 5 Evelyn 13 800 Marketing 5
As we’ll see later, without anything else, TABLE_X[TABLE_Y]
returns a right outer join; setting nomatch=0
returns a inner join.
note:
Employee #6, Ferdinand, has not been returned, as he’s in a yet to be created department 21. Neither has department #14, Research, as there are no employees.
Left Outer Join
Figure linked from Pinal Dave
the LEFT OUTER JOIN returns all the rows from the left table, filling in matched columns (or NA) from the right table
# left join with merge, notice that the column's ordering is different
merge(employees, departments, by = "Department", all.x = TRUE)
## Department Employee EmployeeName Salary DepartmentName Manager
## 1: 11 1 Alice 800 Production 1
## 2: 11 2 Bob 600 Production 1
## 3: 12 3 Carla 900 Sales 4
## 4: 12 4 Daniel 1000 Sales 4
## 5: 13 5 Evelyn 800 Marketing 5
## 6: 21 6 Ferdinand 700 NA NA
departments[employees]
## Department DepartmentName Manager Employee EmployeeName Salary
## 1: 11 Production 1 1 Alice 800
## 2: 11 Production 1 2 Bob 600
## 3: 12 Sales 4 3 Carla 900
## 4: 12 Sales 4 4 Daniel 1000
## 5: 13 Marketing 5 5 Evelyn 800
## 6: 21 NA NA 6 Ferdinand 700
note:
Employee #6, Ferdinand, has been returned with department name as NA.
Department #14, Research, has not been returned.
Right Outer Join
Figure linked from Pinal Dave
the RIGHT OUTER JOIN returns all the rows from the right table, filling in matched columns (or else NA) from the left table
# right join with merge
merge(employees, departments, by = "Department", all.y = TRUE)
## Department Employee EmployeeName Salary DepartmentName Manager
## 1: 11 1 Alice 800 Production 1
## 2: 11 2 Bob 600 Production 1
## 3: 12 3 Carla 900 Sales 4
## 4: 12 4 Daniel 1000 Sales 4
## 5: 13 5 Evelyn 800 Marketing 5
## 6: 14 NA NA NA Research NA
employees[departments]
## Employee EmployeeName Department Salary DepartmentName Manager
## 1: 1 Alice 11 800 Production 1
## 2: 2 Bob 11 600 Production 1
## 3: 3 Carla 12 900 Sales 4
## 4: 4 Daniel 12 1000 Sales 4
## 5: 5 Evelyn 13 800 Marketing 5
## 6: NA NA 14 NA Research NA
# we can also do a not matched with !
# returning the rows in the employees table that are not in the
# department table
employees[!departments]
## Employee EmployeeName Department Salary
## 1: 6 Ferdinand 21 700
Full Outer Join
Figure linked from Pinal Dave
the FULL OUTER JOIN returns all the rows from both tables, filling in matched columns (or else NA)
merge(employees, departments, by = "Department", all = TRUE)
## Department Employee EmployeeName Salary DepartmentName Manager
## 1: 11 1 Alice 800 Production 1
## 2: 11 2 Bob 600 Production 1
## 3: 12 3 Carla 900 Sales 4
## 4: 12 4 Daniel 1000 Sales 4
## 5: 13 5 Evelyn 800 Marketing 5
## 6: 14 NA NA NA Research NA
## 7: 21 6 Ferdinand 700 NA NA
Other Tips and Tricks
.SD
and .N
First .SD
refers to the subset of data for each group. The example below gets the sum of y and z and the number of rows (using .N
) in each group while grouped by x.
DT <- data.table( x = sample(1:2, 7, replace = TRUE),
y = sample(1:13, 7),
z = sample(1:14, 7) )
DT[ , c( lapply(.SD, sum), .N ), by = x ]
## x y z N
## 1: 1 35 35 4
## 2: 2 23 22 3
Key Indexing
Set the keys and select the matching rows without using ==
.
DT <- data.table( A = letters[ c(2, 1, 2, 3, 1, 2, 3) ],
B = c(5, 4, 1, 9, 8, 8, 6), C = 6:12 )
setkey(DT, A, B)
# matches rows where A == "b"
DT["b"]
## A B C
## 1: b 1 8
## 2: b 5 6
## 3: b 8 11
# matches rows where A == "b" or A == "c"
DT[ c("b", "c"), ]
## A B C
## 1: b 1 8
## 2: b 5 6
## 3: b 8 11
## 4: c 6 12
## 5: c 9 9
Since we did not provide any values for the second key column “B”, it just matches c( “b”, “c” ) against the first key column “A” and returns all the matched rows. Consider the example below that matches both keys.
DT[ .("b", 5), ]
## A B C
## 1: b 5 6
“b” is first matched against the first key column “A”. And within those matching rows, 5 is matched against the second key column “B” to obtain row indices where both “A” and “B” match the given values.
# Subset all rows where just the second key column matches
DT[ .(unique(A), 5), ]
## A B C
## 1: a 5 NA
## 2: b 5 6
## 3: c 5 NA
Selecting Parts of a Filtered Group
Select the first row of the b
and c
groups using mult
.
DT[ c("b", "c"), mult = "first" ]
## A B C
## 1: b 1 8
## 2: c 6 12
When i is a data.table, DT[ i, j, by = .EACHI ]
evaluates j for the groups of DT that each row in i joins to. We call this grouping by each i. See difference with and without .EACHI
below.
# we can use [] for .SD to subset the data
DT[ c("b", "c"), .SD[ c(1, .N), ] ]
## A B C
## 1: b 1 8
## 2: c 9 9
DT[ c("b", "c"), .SD[ c(1, .N), ], by = .EACHI ]
## A B C
## 1: b 1 8
## 2: b 8 11
## 3: c 6 12
## 4: c 9 9
Notice that .N
here is used to retrieve the last row.
R Session Information
sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.10.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.5 bookdown_0.1 digest_0.6.9 assertthat_0.1
## [5] mime_0.4 R6_2.1.2 xtable_1.8-2 formatR_1.4
## [9] magrittr_1.5 evaluate_0.9 rmdformats_0.3 highr_0.6
## [13] stringi_1.0-1 rstudioapi_0.6 miniUI_0.1.1 rmarkdown_1.1
## [17] tools_3.2.4 questionr_0.5 stringr_1.0.0 shiny_0.14.2
## [21] httpuv_1.3.3 yaml_2.1.13 htmltools_0.3.5 knitr_1.14
## [25] tibble_1.2