Introduction

This guide outlines tools and tips for improving the speed and efficiency of R code.

Sometimes, simply tweaking a few lines of code can lead to large performance gains in the execution of a program. Other issues may take more time to work through but can be a huge benefit to a project in the long term.

An important lesson to learn when it comes to optimizing an R (or any) program is knowing both when to start and when to stop. You most likely want to optimize your code because it is “too slow”, but what that means will vary from project to project. Be sure to consider what “fast enough” is for your project and how much needs to be optimized. If your program takes an hour to complete, spending 5 hours trying to make it faster can be time well spent if the script will be run regularly, and a complete waste of time if it’s an ad-hoc analysis.

For more information, see the CRAN Task View High-Performance and Parallel Computing with R.

The “Performant Code” section of Hadley Wickham’s Advanced R is another great resource and provides a deeper dive into what is covered in this guide.


Update Your Installation

One of the easiest ways to improve the performance of R is to update R. In general, R has a major annual release (e.g., 3.5.0) in the spring and around 3-4 smaller patch releases (e.g., 3.5.1) throughout the rest of the year. If the middle digit of your installation is behind the current release, you should consider updating.
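
If you are not sure which release you are currently running, you can check from within R:

# print the version of R you are currently running
R.version.string

# or return it as a version object that can be compared against other versions
getRversion()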

For instance, R 3.5.0 implemented an improved method for reading text files: a 5GB file that took over 5 minutes to read in R 3.4.4 took less than half that time in 3.5.0.
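
If you want to run a similar comparison on your own machine, a minimal sketch looks like the following (the file path is just a placeholder for a large text file of your own):

# time the read of a large text file; swap in a large file of your own
system.time(
    df <- read.csv("path/to/large-file.csv")
)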

To see what the R-core development team is up to, check out the NEWS file from the R project.
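
You can also browse release notes directly from the R console:

# display R's NEWS entries from within your current installation
news()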


Profiling & Benchmarking

In order to efficiently optimize your code, you’ll first need to know where it’s running slowest. The profvis package provides a nice way of visualizing the execution time and memory usage of your program.

library(profvis)
library(dplyr)

profvis({
    # read the raw data
    diamonds <- read.csv("optimization/data/diamonds.csv")

    # calculate the mean of each numeric column, by cut
    diamonds_by_cut <- diamonds %>%
        group_by(cut) %>%
        summarise_if(is.numeric, mean)

    # write the summary table back out
    write.csv(diamonds_by_cut, file = "optimization/data/diamonds_by_cut.csv")
})

In this toy example it looks like the read.csv function is the bottleneck, so work on optimizing that first.

Once you find the bottleneck that needs to be optimized, it can be useful to benchmark different potential solutions. The microbenchmark package can help you choose between different options. Continuing with the simple example with the diamonds dataset, compare the base read.csv function with read_csv from the readr package.

library(microbenchmark)

microbenchmark(
    read.csv("optimization/data/diamonds.csv"),
    readr::read_csv("optimization/data/diamonds.csv")
)
Unit: milliseconds
                                               expr       min        lq      mean    median        uq      max neval
        read.csv("optimization/data/diamonds.csv") 103.14624 111.61502 135.10956 115.71928 127.93492 453.5855   100
 readr::read_csv("optimization/data/diamonds.csv")  55.57689  59.80873  75.97688  63.64992  71.88532 372.4557   100

In this case, read_csv is about twice as fast as the base R implementation.

Parallel Computing

Often, time-intensive R code can be sped up by breaking the execution of the job across additional cores of your computer. This is called parallel computing.
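
Before going parallel, it helps to know how many cores your machine has available; for example:

# count the cores available on this machine
parallel::detectCores()

# a more conservative alternative from the future framework used below
future::availableCores()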

Learn lapply/purrr::map

Learning the lapply (and variants) function from Base R or the map (and variants) function from the purrr package is the first step in learning to run R code in parallel. Once you understand how lapply and map work, running your code in parallel will be simple.

Say you have a vector of numbers and want to find the square root of each one (ignore for now that sqrt is vectorized, which will be covered later). You could write a for loop and iterate over each element of the vector:

x <- c(1, 4, 9, 16)

out <- vector("list", length(x))

for (i in seq_along(x)) {
    out[[i]] <- sqrt(x[[i]])
}

unlist(out)
[1] 1 2 3 4

The lapply function essentially handles the overhead of constructing a for loop for you. The syntax is:

lapply(X, FUN, ...)

lapply will then take each element of X and apply the FUNction to it.

Our simple example then becomes:

x <- c(1, 4, 9, 16)

out <- lapply(x, sqrt)

unlist(out)
[1] 1 2 3 4

Those working within the tidyverse may use map from the purrr package equivalently:

library(purrr)

x <- c(1, 4, 9, 16)

out <- map(x, sqrt)

unlist(out)
[1] 1 2 3 4

Motivating Example

Once you are comfortable with lapply and/or map, running the same code in parallel takes just an additional line of code.

For lapply users, the future.apply package contains an equivalent future_lapply function. Just be sure to call plan(multisession) beforehand, which will handle the back-end orchestration needed to run in parallel.

# install.packages("future.apply")
library(future.apply)

plan(multisession)

out <- future_lapply(x, sqrt)

unlist(out)
[1] 1 2 3 4

For purrr users, the furrr (i.e., future purrr) package includes an equivalent future_map function:

# install.packages("furrr")
library(furrr)

plan(multisession)

y <- future_map(x, sqrt)

unlist(y)
[1] 1 2 3 4

How much faster did this simple example run in parallel?

library(future.apply)
plan(multisession)

x <- c(1, 4, 9, 16)

microbenchmark::microbenchmark(
    sequential = lapply(x, sqrt),
    parallel = future_lapply(x, sqrt),
    unit = "s"
)
Unit: seconds
       expr         min           lq         mean       median          uq         max neval
 sequential 0.000001763 0.0000020705 0.0000029848 0.0000029315 0.000003772 0.000009799   100
   parallel 0.026585548 0.0282111980 0.0333569452 0.0291357070 0.030628517 0.337903181   100

Parallelization was actually slower. In this case, the overhead of setting up the code to run in parallel far outweighed any performance gain. In general, parallelization works well on long-running, compute-intensive jobs.

A (somewhat) More Complex Example

In this example we’ll use the diamonds dataset from ggplot2 and perform k-means clustering. We’ll use lapply to iterate over the number of clusters from 2 to 5:

df <- ggplot2::diamonds
df <- dplyr::select(df, -c(cut, color, clarity))

centers <- 2:5

system.time(
    lapply(centers,
           function(x) kmeans(df, centers = x, nstart = 500))
)
   user  system elapsed 
 35.229   2.688  42.309 

And now running the same code in parallel:

library(future.apply)
plan(multisession)

system.time(
    future_lapply(centers,
                  function(x) kmeans(df, centers = x, nstart = 500))
)
   user  system elapsed 
  0.876   0.210  21.655 

While we didn’t achieve perfect scaling, we still get a nice reduction in execution time.

Additional Packages

For the sake of ease and brevity, this guide focused on the futures framework for parallelization. However, you should be aware that there are a number of other ways to parallelize your code.

The parallel Package

The parallel package is included in your base R installation. It includes analogues of the various apply functions:

  • parLapply

  • mclapply - not available on Windows

These functions generally require more setup, especially on Windows machines.
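
As a rough sketch of how these are used (assuming two worker processes and the same toy vector from above):

library(parallel)

x <- c(1, 4, 9, 16)

# mclapply uses forking, so it is simple to call but does not work on Windows
out <- mclapply(x, sqrt, mc.cores = 2)

# parLapply works everywhere but requires explicitly creating (and stopping) a cluster
cl <- makeCluster(2)
out <- parLapply(cl, x, sqrt)
stopCluster(cl)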

The doParallel Package

The doParallel package builds off of parallel and is useful for code that uses for loops instead of lapply. Like the parallel package, it generally requires more setup, especially on Windows machines.
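
For instance, a minimal sketch of a parallel for loop using the foreach package with a doParallel back end:

library(foreach)
library(doParallel)

# register two workers for %dopar%
registerDoParallel(cores = 2)

# each iteration runs on a worker; .combine = c collects the results into a vector
out <- foreach(i = c(1, 4, 9, 16), .combine = c) %dopar% sqrt(i)

stopImplicitCluster()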

Machine Learning - caret

For those running machine learning models, the caret package can easily leverage doParallel to speed up the execution of multiple models. Lifting the example from the package documentation:

library(caret)
library(doParallel)

cl <- makePSOCKcluster(5) # number of cores to use
registerDoParallel(cl)

## All subsequent models are then run in parallel
model <- train(y ~ ., data = training, method = "rf")

## When you are done:
stopCluster(cl)

Be sure to check out the full documentation for more detail.


Big Data

As data collection and storage become easier and cheaper, it is increasingly simple to obtain relatively large data files. An important point to keep in mind is that the size of your data will generally expand when it is read from a storage device into R. A general rule of thumb is that a file will take somewhere around 3-4 times more space in memory than it does on disk.

For instance, compare the size of the iris data set when it is saved as a .csv file locally vs. the size of the object when it is read into an R session:

file.size("optimization/data/iris.csv") / 1000
[1] 3.716
df <- readr::read_csv("optimization/data/iris.csv")
pryr::object_size(df)
10.14 kB

This means that on a standard Urban Institute desktop, you may have issues reading in files that are larger than 4 GB.

Object Size

The type of your data can have a big impact on the size of your data frame when you are dealing with larger files. There are four main types of atomic vectors in R:

  1. logical

  2. integer

  3. double (also called numeric)

  4. character

Each of these data types occupies a different amount of space in memory: logical and integer vectors use 4 bytes per element, while a double will occupy 8 bytes. R uses a global string pool, so character vectors are hard to estimate, but they will generally take up more space per element.

Consider the following example:

x <- 1:100
pryr::object_size(x)
680 B
pryr::object_size(as.double(x))
680 B
pryr::object_size(as.character(x))
1.32 kB

An incorrect data type can easily cost you a lot of space in memory, especially at scale. This often happens when reading data from a text or csv file: data may have a format such as c(1.0, 2.0, 3.0) and will be read in as a numeric column, when integer is more appropriate and compact.
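
As a small illustration (exact sizes will vary slightly by R version), storing whole numbers as integers rather than doubles roughly halves the memory used:

x <- as.numeric(1:1e6)   # doubles: ~8 bytes per element
y <- as.integer(x)       # integers: ~4 bytes per element

pryr::object_size(x)
pryr::object_size(y)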

You may also be familiar with factor variables within R. Essentially, a factor will represent your data as integers and map them back to their character representation. This can save memory when you have a small number of unique levels:

x <- sample(letters, 10000, replace = TRUE)
pryr::object_size(as.character(x))
81.50 kB
pryr::object_size(as.factor(x))
42.10 kB

However, if each element is unique, or if there is not a lot of overlap among elements, then the overhead will make a factor larger than its character representation:

pryr::object_size(as.factor(letters))
2.22 kB
pryr::object_size(as.character(letters))
1.71 kB

Cloud Computing

Sometimes, you will have data that are simply too large to ever fit on your local desktop machine. If that is the case, then the Elastic Cloud Computing Environment from the Office of Technology and Data Science can provide you with easy access to powerful analytic tools for computationally intensive projects.

The Elastic Cloud Computing Environment allows researchers to quickly spin up an Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance. These instances offer increased memory to read in large datasets, along with additional CPUs to provide the ability to process data in parallel at an impressive scale.

| Instance     | CPU | Memory (GB) |
|--------------|-----|-------------|
| Desktop      | 8   | 16          |
| c5.4xlarge   | 16  | 32          |
| c5.9xlarge   | 36  | 72          |
| c5.18xlarge  | 72  | 144         |
| x1e.8xlarge  | 32  | 976         |
| x1e.16xlarge | 64  | 1952        |

Feel free to contact Erika Tyagi (etyagi@urban.org) if this would be useful for your project.


Common Pitfalls

For Loops and Vector Allocation

A refrain you will often hear is that for loops in R are slow and need to be avoided at all costs. This is not true! Rather, an improperly constructed loop in R can bring the execution of your program to a near standstill.

A common for loop structure may look something like:

x <- 1:100

out <- c()
for (i in x) {
    out <- c(out, sqrt(i))
}

The bottleneck in this loop is with the allocation of the vector out. Every time we iterate over an item in x and append it to out, R makes a copy of all the items already in out. As the size of the loop grows, your code will take longer and longer to run.

A better practice is to pre-allocate out to be the correct length, and then insert the results as the loop runs.

x <- 1:100

out <- rep(NA, length(x))
for (i in seq_along(x)) {
    out[i] <- sqrt(x[i])
}

A quick benchmark shows how much more efficient a loop with a pre-allocated results vector is:

bad_loop <- function(x) {
    out <- c()
    for (i in x) {
        # growing out with c() forces R to copy the vector on every iteration
        out <- c(out, sqrt(i))
    }
    out
}

good_loop <- function(x) {
    # pre-allocate the results vector, then fill it in place
    out <- rep(NA, length(x))
    for (i in seq_along(x)) {
        out[i] <- sqrt(x[i])
    }
    out
}

x <- 1:100

microbenchmark::microbenchmark(
    bad_loop(x),
    good_loop(x)
)
Unit: microseconds
         expr      min       lq       mean    median        uq       max neval
  bad_loop(x) 1042.179 1267.577 1891.78264 1328.5640 1446.9720 10173.125   100
 good_loop(x)    6.191    6.437   32.23338    6.7035   11.2545  2366.725   100

And note how performance of the “bad” loop degrades as the loop size grows.

y <- 1:250

microbenchmark::microbenchmark(
    bad_loop(y),
    good_loop(y)
)
Unit: microseconds
         expr       min         lq       mean    median        uq       max neval
  bad_loop(y) 19249.582 22663.9595 24695.7231 23492.774 24909.407 81335.882   100
 good_loop(y)    14.022    14.5345    21.2626    23.329    26.486    64.616   100

Vectorized Functions

Many functions in R are vectorized, meaning they can accept an entire vector (and not just a single value) as input. The sqrt function from the prior examples is one:

x <- c(1, 4, 9, 16)
sqrt(x)
[1] 1 2 3 4

This removes the need to use lapply or a for loop. Vectorized functions in R are generally written in a compiled language like C, C++, or FORTRAN, which makes their implementation faster.

x <- 1:100

microbenchmark::microbenchmark(
    lapply(x, sqrt),
    sqrt(x)
)
Unit: nanoseconds
            expr   min    lq     mean median      uq   max neval
 lapply(x, sqrt) 20172 20418 20847.27  20541 20725.5 37228   100
         sqrt(x)   287   328   397.70    369   369.0  2296   100