Efficient Looping with R

Apply Family

lapply: list apply. When you want to apply a function to each element of a list and get a list back.

x <- list(a = 1, b = 1:3, c = 10:100) 
lapply(x, length)
## $a
## [1] 1
## 
## $b
## [1] 3
## 
## $c
## [1] 91
lapply(x, sum)
## $a
## [1] 1
## 
## $b
## [1] 6
## 
## $c
## [1] 5005

sapply: simplified apply. When you want to apply a function to each element of a list and get a vector back.

x <- list(a = 1, b = 1:3, c = 10:100) 
sapply(x, length)
##  a  b  c 
##  1  3 91
sapply(x, sum)
##    a    b    c 
##    1    6 5005

In practice, though, consider avoiding sapply in code we intend to reuse: its simplification is not guaranteed, and there's a faster, type-stable alternative below.
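
To see why the simplification can surprise us, here is a quick illustration with toy inputs: when all the outputs share a length, sapply returns a matrix, but when the lengths differ, it silently falls back to a list, so downstream code that expects a matrix can break.

# all outputs have length 2, so sapply simplifies the result to a 2 x 2 matrix
sapply(list(a = 1:2, b = 3:4), function(i) i * 2)

# the outputs have different lengths, so sapply silently returns a list
sapply(list(a = 1:2, b = 1:3), function(i) i * 2)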

vapply: vector apply. It provides the same functionality as sapply, but we also explicitly state the type of data it will return instead of letting R spend time figuring that out. This slight extra effort can speed up our code a bit, and it guarantees the output type is what we expect.

# everything returned should be
# a numeric of length 1; another commonly
# used template is character(1)
x <- list(a = 1, b = 1:3, c = 10:100) 
vapply(x, length, numeric(1))
##  a  b  c 
##  1  3 91
# we can also pass an anonymous function
vapply(x, function(i) {
    # the argument i takes on each element of the list in turn
    summed <- sum(i, na.rm = TRUE)
    return(summed)
}, numeric(1))
##    a    b    c 
##    1    6 5005

mapply: multivariate apply. Use it when we have multiple data structures (e.g. vectors, lists) and we want to apply a function to the corresponding elements of each input sequence.

x1 <- c('Michael', 'Kobe')
x2 <- c('Jordan', 'Bryant')

# we can also define the function separately and
# pass it as an argument; the ... means the
# function accepts any number of arguments
paste2 <- function(...) paste(..., sep = '-')
mapply(paste2, x1, x2)
##          Michael             Kobe 
## "Michael-Jordan"    "Kobe-Bryant"
# note that this is just for illustration;
# for the code above, we can simply do:
paste(x1, x2, sep = '-')
## [1] "Michael-Jordan" "Kobe-Bryant"

Parallel Programming

When we say parallel programming, it's worth noting that there's a difference between parallel code that runs locally on a single multi-processor (multi-core) computer and code that runs on a cluster of computers, as well as between implementations for Unix-like and Windows systems.

Here we'll illustrate parallel code that runs locally on a single multi-processor (multi-core) computer using the doParallel library. This package provides a parallel backend for the foreach package and represents a merger of the doMC (works on Unix-like systems) and doSNOW (works on Windows) packages.

Let’s say we have an N-core processor (on a local machine) and we want to use N-1 cores to run our program.

library(doParallel)

# we specify the number of cores/workers we want to use
n_cores <- detectCores() - 1
n_cores
## [1] 7
# a toy function that simply generates the
# summary of a bunch of random numbers;
# its argument i is ignored
summarize <- function(i) {
    summary( rnorm(100000) )
}

# the time difference between using n_cores and not using it
inputs <- 1:20
system.time({
    results <- mclapply(inputs, summarize, mc.cores = n_cores)
})
##    user  system elapsed 
##   0.284   0.061   0.072
system.time({
    results <- lapply(inputs, summarize)
})
##    user  system elapsed 
##   0.384   0.002   0.386

Warning: from the documentation, mclapply relies on Unix's process forking, so it cannot run in parallel on Windows. Consider trying the following code on Windows instead:

# explicitly create the cluster
cl <- makeCluster(n_cores)
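
# side note: unlike mclapply's forked workers, these are fresh R processes;
# if summarize relied on objects or packages from our session, we would
# ship them over first, e.g. clusterExport(cl, 'some_object')
# ('some_object' is a hypothetical name; summarize only needs base R)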

# run the parallel code using parLapply and
# pass the created cluster as the first argument
system.time({
    results <- parLapply(cl, inputs, summarize)
})

# remember to stop the cluster once we're finished
stopCluster(cl)

Caveat 1: Overhead

Sending tasks out to other processes and collecting the results back takes time, so we don't want to parallelize things that already run really fast sequentially, especially in R, where many operations can be vectorized.

# overhead when parallelizing trivial tasks
inputs <- 1:10000
system.time({
    results <- mclapply(inputs, sqrt, mc.cores = n_cores)
})
##    user  system elapsed 
##   0.067   0.034   0.012
system.time({
    results <- lapply(inputs, sqrt)
})
##    user  system elapsed 
##   0.003   0.000   0.003

So when parallelizing code, it's important to keep in mind that for very short jobs, the parallelization overhead will diminish the parallelization benefits. If we're not sure whether parallelizing a specific piece of code improves its speed, we can always check the execution time using system.time().

For Loop For Life

The lapply family is cool, but it can be a bit constraining; we might have to structure our code in a way we're not used to. At the end of the day, we just want to write the for loops that exist in every popular programming language. It turns out there's a package called foreach that lets us do just that:

foreach simply gives us a construct/template for writing for loops, and it takes care of spreading the work inside the body of the loop across multiple processors/cores on our machine. It works as follows:

# take a subset of the iris dataset, and train a bunch of
# logistic regressions on the same dataset
iris_subset <- iris[iris$Species != 'setosa',]
iris_subset$Species <- factor( iris_subset$Species, 
                               levels = c('versicolor', 'virginica') )
# number of models to train
trials <- 1000

# we first register a backend specifying the number
# of cores that we wish to use
registerDoParallel(cores = detectCores() - 1)

# parallelized foreach
system.time({
    result <- foreach(1:trials) %dopar% {
        # the foreach loop behaves like a function:
        # the value of the last expression is what gets returned
        model <- glm(Species ~., data = iris_subset, family = 'binomial')
    }
})
##    user  system elapsed 
##   1.283   0.150   2.248
# foreach also allows us to run things serially;
# to do so, we simply change %dopar% to %do%
system.time({
    result <- foreach(1:trials) %do% {
        model <- glm(Species ~., data = iris_subset, family = 'binomial')
    }
})
##    user  system elapsed 
##   2.901   0.104   3.006

foreach not only runs the code in parallel, but also merges and returns the results produced by each iteration of the loop. By default, foreach returns a list.

head(result, 2)
## [[1]]
## 
## Call:  glm(formula = Species ~ ., family = "binomial", data = iris_subset)
## 
## Coefficients:
##  (Intercept)  Sepal.Length   Sepal.Width  Petal.Length   Petal.Width  
##      -42.638        -2.465        -6.681         9.429        18.286  
## 
## Degrees of Freedom: 99 Total (i.e. Null);  95 Residual
## Null Deviance:       138.6 
## Residual Deviance: 11.9  AIC: 21.9
## 
## [[2]]
## 
## Call:  glm(formula = Species ~ ., family = "binomial", data = iris_subset)
## 
## Coefficients:
##  (Intercept)  Sepal.Length   Sepal.Width  Petal.Length   Petal.Width  
##      -42.638        -2.465        -6.681         9.429        18.286  
## 
## Degrees of Freedom: 99 Total (i.e. Null);  95 Residual
## Null Deviance:       138.6 
## Residual Deviance: 11.9  AIC: 21.9
# we can specify how we want it to
# combine the results through the .combine
# argument; say we want it to return a vector
result <- foreach(1:trials, .combine = c) %dopar% {
    model <- glm(Species ~., data = iris_subset, family = 'binomial')
    deviance <- model$deviance
    return(deviance)
}
head(result, 2)
## [1] 11.89855 11.89855

When writing a foreach loop, remember to keep track of the object that we expect each iteration to return and how the objects from all iterations should be combined.

There are other options for combining results, such as rbind and cbind.
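
For instance, here is a minimal sketch of .combine = rbind with toy values (not tied to the models above): each iteration returns a named vector, and foreach stacks them into a matrix, one row per iteration.

row_result <- foreach(i = 1:3, .combine = rbind) %dopar% {
    c(index = i, doubled = i * 2)
}
row_result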

Caveat 2: Modifying Global State

When we pass in our data to run a parallel task, each worker process gets a copy of the data, so the task is not going to modify the global/original data in any way.

# create a vector of 0s and fill in
# each element by multiplying
# its index by 2
x1 <- rep(0, times = 5)
for(i in 1:5) {
    x1[i] <- i * 2
}

# we try to do the same
# thing in a foreach loop
x2 <- rep(0, times = 5)
foreach(i = 1:5) %dopar% {
    x2[i] <- i * 2
}
# the two different results
list(serial = x1, parallel = x2)
## $serial
## [1]  2  4  6  8 10
## 
## $parallel
## [1] 0 0 0 0 0

As each worker gets a copy of the vector that we've passed in, modifying the data in place simply modifies the copy rather than the original vector.

So the better way to do this is simply to have the parallelized task return the information we want.

# the way to fix this is to simply return
# what we actually want
x3 <- foreach(i = 1:5, .combine = c) %dopar% {
    i * 2
}
x3
## [1]  2  4  6  8 10

As we can see from this example, apart from parallelizing the body of a loop, foreach has an advantage over a regular for loop whenever the purpose of the loop is to create a data structure such as a vector, list, or matrix: it removes the need to write extra code to initialize and fill that structure.

Nested Foreach

The foreach package also provides an option for nested for loop implementation using the %:% operator. The operator turns multiple foreach loops into a single loop, creating a single stream of tasks that can all be executed in parallel.

# simple simulation function
sim <- function(a, b) {
    return(10 * a + b)
}

# loop over all combinations of the values
# stored in the two vectors avec and bvec
# to create the simulated data
avec <- 1:2
bvec <- 1:4

# for loop way
x <- matrix(0, length(avec), length(bvec))
for(j in 1:length(bvec)) {
    for(i in 1:length(avec)) {
        x[i, j] <- sim(avec[i], bvec[j])
    }
}
x
##      [,1] [,2] [,3] [,4]
## [1,]   11   12   13   14
## [2,]   21   22   23   24
# foreach way; notice that the body of the outer
# foreach is the inner foreach itself, with no braces around it
x <- foreach(b = bvec, .combine = 'cbind') %:%
    foreach(a = avec, .combine = 'c') %dopar% {
        sim(a, b)
    }
x
##      result.1 result.2 result.3 result.4
## [1,]       11       12       13       14
## [2,]       21       22       23       24

In the foreach example above, the inner loop returns the columns of the result matrix as vectors, which are then combined in the outer loop into a matrix.

Once we're done using the registered cluster, remember to explicitly stop it. This ensures the "health" of future cluster objects.

stopImplicitCluster()

In sum, there are lots of naturally parallel problems that can benefit from parallelized code. Think of independent tasks such as cross validation or grid search: these computationally expensive, easily parallelizable for loops are a great place to start looking for opportunities to apply parallel programming.
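
As a minimal sketch of the idea, here is a parallelized cross validation over the iris_subset data from earlier (the fold assignment and variable names are our own, and a parallel backend must be registered, so re-run registerDoParallel if it was stopped above):

# randomly assign each row of iris_subset to one of k folds
k <- 5
folds <- sample( rep(1:k, length.out = nrow(iris_subset)) )

# each iteration trains on k - 1 folds and computes the
# accuracy on the held-out fold, independently of the others
cv_accuracy <- foreach(fold = 1:k, .combine = c) %dopar% {
    train <- iris_subset[folds != fold, ]
    holdout <- iris_subset[folds == fold, ]
    model <- glm(Species ~ ., data = train, family = 'binomial')
    prob <- predict(model, newdata = holdout, type = 'response')
    mean( (prob > 0.5) == (holdout$Species == 'virginica') )
}
mean(cv_accuracy)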

Before we continue on to the next section, we'll do a simple programming exercise. The task is to write a loop that prints the numbers from 1 to 5. This is not a trick question. If the answer in your mind is something like the following:

for(i in 1:5) {
    print(i)
}

Continue on to the next section to see a better alternative.

Iterators

The issue with this code is probably unnoticeable in such a toy example, since we're only looping through a small sequence. But be aware of what the 1:5 part is doing.

1:5
## [1] 1 2 3 4 5

It's generating a sequence of numbers from 1 to 5, and we're using that sequence to tell the for loop we want to loop 5 times. But what if we want to loop 1000000 times? Do we really need to create 1000000 numbers that fill up our memory just to drive the loop?

To solve this issue, we'll introduce iterators. An iterator is an object for moving through a container one item at a time. Although iterators are not part of the native R language, they are implemented in the iterators and itertools packages.

# note that the iterators library is
# also loaded when we load the doParallel library
library(iterators)
library(itertools)

# create an iterator of length 2;
# iteration is performed manually by calling nextElem()
# on the iterator, and each call yields the next element in the sequence
iter_count <- icount(2)
nextElem(iter_count)
## [1] 1
nextElem(iter_count)
## [1] 2
# when the sequence has been exhausted, we'll get a StopIteration error
# nextElem(iter_count)

We can also create iterators from an existing sequence.

name <- c('Bob', 'Mary')
iter_name <- iter(name)
nextElem(iter_name)
## [1] "Bob"
nextElem(iter_name)
## [1] "Mary"

So another way to write the loop that prints the numbers from 1 to 5, using an iterator, is as follows:

# wrap the iterator with the ihasNext function;
# while the sequence has not been exhausted, print
# the next element
iter_count <- ihasNext( icount(5) )
while( hasNext(iter_count) ) {
    print( nextElem(iter_count) )
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Unfortunately, the syntax above might be a bit cumbersome to write, but the good news is that it looks better with the apply family and foreach. The following examples assume that we're not just going to print the sequence, but also do some other work in the loop and return the result.

# iterators with the apply family,
# using multiplying each index by 2 as an example
x1 <- vapply( icount(5), function(i) {
    return(i * 2)
}, numeric(1) )

# iterators with foreach
x2 <- foreach(i = icount(5), .combine = c) %dopar% {
    return(i * 2)
}
list(apply_way = x1, foreach_way = x2)
## $apply_way
## [1]  2  4  6  8 10
## 
## $foreach_way
## [1]  2  4  6  8 10

There are a lot of useful functions in these two packages; consider exploring their documentation in your free time. For example, ireadLines creates an iterator to read lines from a file, which is useful when the dataset is extremely large and we wish to do some preprocessing on each line of the data instead of reading everything into memory at once.
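
As a minimal sketch of that idea ('data.txt' is a hypothetical placeholder for a large text file), we can combine ireadLines with ihasNext to process a file one line at a time:

# 'data.txt' is a hypothetical placeholder file name
it <- ihasNext( ireadLines('data.txt', n = 1) )
while( hasNext(it) ) {
    line <- nextElem(it)
    # toy per-line preprocessing: print the number of characters
    print( nchar(line) )
}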

R Session Information

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.4 (2016-03-10)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2017-01-13
## Packages ------------------------------------------------------------------
##  package    * version date       source        
##  assertthat   0.1     2013-12-06 CRAN (R 3.2.0)
##  bookdown     0.1     2016-07-13 CRAN (R 3.2.5)
##  codetools    0.2-14  2015-07-15 CRAN (R 3.2.4)
##  devtools     1.12.0  2016-06-24 CRAN (R 3.2.5)
##  digest       0.6.9   2016-01-08 CRAN (R 3.2.3)
##  doParallel * 1.0.10  2015-10-14 CRAN (R 3.2.0)
##  evaluate     0.9     2016-04-29 cran (@0.9)   
##  foreach    * 1.4.3   2015-10-13 CRAN (R 3.2.0)
##  formatR      1.4     2016-05-09 cran (@1.4)   
##  highr        0.6     2016-05-09 cran (@0.6)   
##  htmltools    0.3.5   2016-03-21 CRAN (R 3.2.4)
##  httpuv       1.3.3   2015-08-04 CRAN (R 3.2.0)
##  iterators  * 1.0.8   2015-10-13 CRAN (R 3.2.0)
##  itertools  * 0.1-3   2014-03-12 CRAN (R 3.2.0)
##  knitr        1.14    2016-08-13 CRAN (R 3.2.4)
##  magrittr     1.5     2014-11-22 CRAN (R 3.2.0)
##  memoise      1.0.0   2016-01-29 CRAN (R 3.2.3)
##  mime         0.4     2015-09-03 CRAN (R 3.2.0)
##  miniUI       0.1.1   2016-01-15 CRAN (R 3.2.3)
##  questionr    0.5     2016-03-15 CRAN (R 3.2.4)
##  R6           2.1.2   2016-01-26 CRAN (R 3.2.3)
##  Rcpp         0.12.5  2016-05-14 cran (@0.12.5)
##  rmarkdown    1.1     2016-10-16 CRAN (R 3.2.4)
##  rmdformats   0.3     2016-09-05 CRAN (R 3.2.5)
##  rstudioapi   0.6     2016-06-27 CRAN (R 3.2.5)
##  shiny        0.14.2  2016-11-01 CRAN (R 3.2.5)
##  stringi      1.0-1   2015-10-22 CRAN (R 3.2.0)
##  stringr      1.0.0   2015-04-30 CRAN (R 3.2.0)
##  tibble       1.2     2016-08-26 CRAN (R 3.2.5)
##  withr        1.0.2   2016-06-20 CRAN (R 3.2.5)
##  xtable       1.8-2   2016-02-05 CRAN (R 3.2.3)
##  yaml         2.1.13  2014-06-12 CRAN (R 3.2.0)

Ethen Liu

2017-01-13