R’s data.table

SQL Joins
Other Tips and Tricks

The source of this tutorial, including the example datasets, is available here on GitHub

SQL Joins

For joining data.tables, the basic syntax are:

JOIN type	DT syntax	data.table::merge() syntax
INNER	X[Y, nomatch = 0]	merge(X, Y, all = FALSE)
LEFT OUTER	Y[X]	merge(X, Y, all.x = TRUE)
RIGHT OUTER	X[Y]	merge(X, Y, all.y = TRUE)
FULL OUTER	-	merge(X, Y, all = TRUE)

The example data consists of the following two tables:

library(data.table)

## Warning: package 'data.table' was built under R version 3.2.5

setwd("/Users/ethen/Business-Analytics/R/data_table")
employees <- fread("data/employees.csv")
departments <- fread("data/departments.csv")
knitr::kable(employees, caption = "Table Employees")

Table Employees
Employee	EmployeeName	Department	Salary
1	Alice	11	800
2	Bob	11	600
3	Carla	12	900
4	Daniel	12	1000
5	Evelyn	13	800
6	Ferdinand	21	700

knitr::kable(departments, caption = "Table Departments")

Table Departments
Department	DepartmentName	Manager
11	Production	1
12	Sales	4
13	Marketing	5
14	Research	NA

Inner Join

Figure linked from Pinal Dave

the INNER JOIN returns the rows with a match in both tables

# inner join with merge
merge(employees, departments, by = "Department")

##    Department Employee EmployeeName Salary DepartmentName Manager
## 1:         11        1        Alice    800     Production       1
## 2:         11        2          Bob    600     Production       1
## 3:         12        3        Carla    900          Sales       4
## 4:         12        4       Daniel   1000          Sales       4
## 5:         13        5       Evelyn    800      Marketing       5

# inner join with keys
# set the keys of the tables to represent the by:
setkey(employees, Department)
setkey(departments, Department)
# equivalent to setkeyv( departments, "Department" )

# note that you only need to set the keys once
# you can confirm if it works with
key(employees)

## [1] "Department"

# perform the join,
# nomatch 0 means no rows will be returned 
# from not matched rows from the Right table
employees[ departments, nomatch = 0 ]

##    Employee EmployeeName Department Salary DepartmentName Manager
## 1:        1        Alice         11    800     Production       1
## 2:        2          Bob         11    600     Production       1
## 3:        3        Carla         12    900          Sales       4
## 4:        4       Daniel         12   1000          Sales       4
## 5:        5       Evelyn         13    800      Marketing       5

As we’ll see later, without anything else, TABLE_X[TABLE_Y] returns a right outer join; setting nomatch=0 returns a inner join.

note:

Employee #6, Ferdinand, has not been returned, as he’s in a yet to be created department 21. Neither has department #14, Research, as there are no employees.

Left Outer Join

Figure linked from Pinal Dave
Figure2 from Pinal Dave

the LEFT OUTER JOIN returns all the rows from the left table, filling in matched columns (or NA) from the right table

# left join with merge, notice that the column's ordering is different
merge(employees, departments, by = "Department", all.x = TRUE)

##    Department Employee EmployeeName Salary DepartmentName Manager
## 1:         11        1        Alice    800     Production       1
## 2:         11        2          Bob    600     Production       1
## 3:         12        3        Carla    900          Sales       4
## 4:         12        4       Daniel   1000          Sales       4
## 5:         13        5       Evelyn    800      Marketing       5
## 6:         21        6    Ferdinand    700             NA      NA

departments[employees]

##    Department DepartmentName Manager Employee EmployeeName Salary
## 1:         11     Production       1        1        Alice    800
## 2:         11     Production       1        2          Bob    600
## 3:         12          Sales       4        3        Carla    900
## 4:         12          Sales       4        4       Daniel   1000
## 5:         13      Marketing       5        5       Evelyn    800
## 6:         21             NA      NA        6    Ferdinand    700

note:

Employee #6, Ferdinand, has been returned with department name as NA.
Department #14, Research, has not been returned.

Right Outer Join

Figure linked from Pinal Dave
Figure2 from Pinal Dave

the RIGHT OUTER JOIN returns all the rows from the right table, filling in matched columns (or else NA) from the left table

# right join with merge
merge(employees, departments, by = "Department", all.y = TRUE)

##    Department Employee EmployeeName Salary DepartmentName Manager
## 1:         11        1        Alice    800     Production       1
## 2:         11        2          Bob    600     Production       1
## 3:         12        3        Carla    900          Sales       4
## 4:         12        4       Daniel   1000          Sales       4
## 5:         13        5       Evelyn    800      Marketing       5
## 6:         14       NA           NA     NA       Research      NA

employees[departments]

##    Employee EmployeeName Department Salary DepartmentName Manager
## 1:        1        Alice         11    800     Production       1
## 2:        2          Bob         11    600     Production       1
## 3:        3        Carla         12    900          Sales       4
## 4:        4       Daniel         12   1000          Sales       4
## 5:        5       Evelyn         13    800      Marketing       5
## 6:       NA           NA         14     NA       Research      NA

# we can also do a not matched with !
# returning the rows in the employees table that are not in the
# department table
employees[!departments]

##    Employee EmployeeName Department Salary
## 1:        6    Ferdinand         21    700

Full Outer Join

Figure linked from Pinal Dave
Figure2 from Pinal Dave

the FULL OUTER JOIN returns all the rows from both tables, filling in matched columns (or else NA)

merge(employees, departments, by = "Department", all = TRUE)

##    Department Employee EmployeeName Salary DepartmentName Manager
## 1:         11        1        Alice    800     Production       1
## 2:         11        2          Bob    600     Production       1
## 3:         12        3        Carla    900          Sales       4
## 4:         12        4       Daniel   1000          Sales       4
## 5:         13        5       Evelyn    800      Marketing       5
## 6:         14       NA           NA     NA       Research      NA
## 7:         21        6    Ferdinand    700             NA      NA

Other Tips and Tricks

`.SD` and `.N`

First .SD refers to the subset of data for each group. The example below gets the sum of y and z and the number of rows (using .N) in each group while grouped by x.

DT <- data.table( x = sample(1:2, 7, replace = TRUE), 
                  y = sample(1:13, 7), 
                  z = sample(1:14, 7) ) 
DT[ , c( lapply(.SD, sum), .N ), by = x ]

##    x  y  z N
## 1: 1 35 35 4
## 2: 2 23 22 3

Key Indexing

Set the keys and select the matching rows without using ==.

DT <- data.table( A = letters[ c(2, 1, 2, 3, 1, 2, 3) ],
                  B = c(5, 4, 1, 9, 8, 8, 6), C = 6:12 )
setkey(DT, A, B)

# matches rows where A == "b"
DT["b"]

##    A B  C
## 1: b 1  8
## 2: b 5  6
## 3: b 8 11

# matches rows where A == "b" or A == "c"
DT[ c("b", "c"), ]

##    A B  C
## 1: b 1  8
## 2: b 5  6
## 3: b 8 11
## 4: c 6 12
## 5: c 9  9

Since we did not provide any values for the second key column “B”, it just matches c( “b”, “c” ) against the first key column “A” and returns all the matched rows. Consider the example below that matches both keys.

DT[ .("b", 5), ]

##    A B C
## 1: b 5 6

“b” is first matched against the first key column “A”. And within those matching rows, 5 is matched against the second key column “B” to obtain row indices where both “A” and “B” match the given values.

# Subset all rows where just the second key column matches
DT[ .(unique(A), 5), ]

##    A B  C
## 1: a 5 NA
## 2: b 5  6
## 3: c 5 NA

Selecting Parts of a Filtered Group

Select the first row of the b and c groups using mult.

DT[ c("b", "c"), mult = "first" ]

##    A B  C
## 1: b 1  8
## 2: c 6 12

When i is a data.table, DT[ i, j, by = .EACHI ] evaluates j for the groups of DT that each row in i joins to. We call this grouping by each i. See difference with and without .EACHI below.

# we can use [] for .SD to subset the data
DT[ c("b", "c"), .SD[ c(1, .N), ] ]

##    A B C
## 1: b 1 8
## 2: c 9 9

DT[ c("b", "c"), .SD[ c(1, .N), ], by = .EACHI ]

##    A B  C
## 1: b 1  8
## 2: b 8 11
## 3: c 6 12
## 4: c 9  9

Notice that .N here is used to retrieve the last row.

R Session Information

sessionInfo()

## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5     bookdown_0.1    digest_0.6.9    assertthat_0.1 
##  [5] mime_0.4        R6_2.1.2        xtable_1.8-2    formatR_1.4    
##  [9] magrittr_1.5    evaluate_0.9    rmdformats_0.3  highr_0.6      
## [13] stringi_1.0-1   rstudioapi_0.6  miniUI_0.1.1    rmarkdown_1.1  
## [17] tools_3.2.4     questionr_0.5   stringr_1.0.0   shiny_0.14.2   
## [21] httpuv_1.3.3    yaml_2.1.13     htmltools_0.3.5 knitr_1.14     
## [25] tibble_1.2

R’s data.table

R’s data.table

SQL Joins

Inner Join

Left Outer Join

Right Outer Join

Full Outer Join

Other Tips and Tricks

.SD and .N

Key Indexing

Selecting Parts of a Filtered Group

R Session Information

Reference

`.SD` and `.N`