Introduction
This guide outlines tools and tips for improving the speed and efficiency of R code.
Sometimes, simply tweaking a few lines of code can lead to large performance gains in the execution of a program. Other issues may take more time to work through but can be a huge benefit to a project in the long term.
An important lesson when it comes to optimizing an R (or any) program is knowing both whether to start and when to stop. You most likely want to optimize your code because it is “too slow”, but what that means will vary from project to project. Be sure to consider what “fast enough” is for your project and how much needs to be optimized. If your program takes an hour to complete, spending 5 hours trying to make it faster can be time well spent if the script will be run regularly, and a complete waste of time if it’s an ad-hoc analysis.
For more information, see the CRAN Task View High-Performance and Parallel Computing with R.
The “Performant Code” section of Hadley Wickham’s Advanced R is another great resource and provides a deeper dive into what is covered in this guide.
Update Your Installation
One of the easiest ways to improve the performance of R is to update R. In general, R has a big annual release (e.g., 3.5.0) in the spring and around 3-4 smaller patch releases (e.g., 3.5.1) throughout the rest of the year. If the middle digit of your installation is behind the current release, you should consider updating.
For instance, R 3.5.0 implemented improved reading of text files: a 5GB file that took over 5 minutes to read in 3.4.4 took less than half that time in 3.5.0.
To see what the R-core development team is up to, check out the NEWS file from the R project.
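As a quick check, you can print the version of your current installation with base R and compare it against the latest release listed on CRAN:

# Print the version of the running R installation (base R, no packages needed)
getRversion()
R.version.string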
Profiling & Benchmarking
In order to efficiently optimize your code, you’ll first need to know where it’s running slowest. The profvis package provides a nice way of visualizing the execution time and memory usage of your program.
library(profvis)
library(dplyr)

profvis({
  diamonds <- read.csv("optimization/data/diamonds.csv")

  diamonds_by_cut <- diamonds %>%
    group_by(cut) %>%
    summarise_if(is.numeric, mean)

  write.csv(diamonds_by_cut, file = "optimization/data/diamonds_by_cut.csv")
})
In this toy example, it looks like the read.csv function is the bottleneck, so work on optimizing that first.
Once you find the bottleneck that needs to be optimized, it can be useful to benchmark different potential solutions. The microbenchmark package can help you choose between different options. Continuing with the simple example with the diamonds dataset, compare the base read.csv function with read_csv from the readr package.
library(microbenchmark)
microbenchmark(
  read.csv("optimization/data/diamonds.csv"),
  readr::read_csv("optimization/data/diamonds.csv")
)
Unit: milliseconds
                                               expr       min        lq      mean    median        uq      max neval
        read.csv("optimization/data/diamonds.csv") 103.14624 111.61502 135.10956 115.71928 127.93492 453.5855   100
 readr::read_csv("optimization/data/diamonds.csv")  55.57689  59.80873  75.97688  63.64992  71.88532 372.4557   100
In this case, read_csv is about twice as fast as the base R implementation.
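If you save the benchmark object, the microbenchmark package also provides plotting methods that can make comparisons easier to read; a minimal sketch (boxplot() works out of the box, and an autoplot() method is available if ggplot2 is installed):

mb <- microbenchmark(
  read.csv("optimization/data/diamonds.csv"),
  readr::read_csv("optimization/data/diamonds.csv")
)
boxplot(mb)   # distribution of timings for each expression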
Parallel Computing
Often, time-intensive R code can be sped up by breaking the execution of
the job across additional cores of your computer. This is called parallel computing.
Learn lapply / purrr::map
Learning the lapply (and variants) function from Base R or the map (and variants) function from the purrr package is the first step in learning to run R code in parallel. Once you understand how lapply and map work, running your code in parallel will be simple.
Say you have a vector of numbers and want to find the square root of each one
(ignore for now that sqrt
is vectorized, which will be covered later).
You could write a for loop and iterate over each element of the vector:
x <- c(1, 4, 9, 16)

out <- vector("list", length(x))
for (i in seq_along(x)) {
  out[[i]] <- sqrt(x[[i]])
}
unlist(out)
[1] 1 2 3 4
The lapply
function essentially handles the overhead of constructing a for
loop for you. The syntax is:
lapply(X, FUN, ...)
lapply will then take each element of X and apply the function FUN to it. Our simple example then becomes:
x <- c(1, 4, 9, 16)

out <- lapply(x, sqrt)
unlist(out)
[1] 1 2 3 4
Those working within the tidyverse
may use map
from the purrr
package equivalently:
library(purrr)
x <- c(1, 4, 9, 16)

out <- map(x, sqrt)
unlist(out)
unlist(out)
[1] 1 2 3 4
Motivating Example
Once you are comfortable with lapply and/or map, running the same code in parallel takes just an additional line of code.
For lapply users, the future.apply package contains an equivalent future_lapply function. Just be sure to call plan(multisession) beforehand, which will handle the back-end orchestration needed to run in parallel.
# install.packages("future.apply")
library(future.apply)
plan(multisession)
out <- future_lapply(x, sqrt)
unlist(out)
[1] 1 2 3 4
For purrr users, the furrr (i.e., future purrr) package includes an equivalent future_map function:
# install.packages("furrr")
library(furrr)
plan(multisession)
y <- future_map(x, sqrt)
unlist(y)
[1] 1 2 3 4
How much faster did this simple example run in parallel?
library(future.apply)
plan(multisession)
x <- c(1, 4, 9, 16)

microbenchmark::microbenchmark(
  sequential = lapply(x, sqrt),
  parallel = future_lapply(x, sqrt),
  unit = "s"
)
Unit: seconds
       expr         min           lq         mean       median          uq         max neval
 sequential 0.000001763 0.0000020705 0.0000029848 0.0000029315 0.000003772 0.000009799   100
   parallel 0.026585548 0.0282111980 0.0333569452 0.0291357070 0.030628517 0.337903181   100
Parallelization was actually slower. In this case, the overhead of setting up the code to run in parallel far outweighed any performance gain. In general, parallelization works well on long-running and compute-intensive jobs.
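Before parallelizing a longer job, it can also help to check how many cores your machine exposes and to cap the number of workers explicitly; a minimal sketch with the future package (the worker count of 2 is just an illustration):

library(future)

availableCores()                  # how many cores the current machine exposes
plan(multisession, workers = 2)   # cap the number of background R sessions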
A (somewhat) More Complex Example
In this example we’ll use the diamonds dataset from ggplot2 and perform k-means clustering. We’ll use lapply to iterate over the number of clusters, from 2 to 5:
df <- ggplot2::diamonds
df <- dplyr::select(df, -c(cut, color, clarity))

centers <- 2:5

system.time(
  lapply(centers,
         function(x) kmeans(df, centers = x, nstart = 500)
  )
)
user system elapsed
35.229 2.688 42.309
And now running the same code in parallel:
library(future.apply)
plan(multisession)
system.time(
  future_lapply(centers,
                function(x) kmeans(df, centers = x, nstart = 500)
  )
)
user system elapsed
0.876 0.210 21.655
While we didn’t achieve perfect scaling, we still get a nice reduction in execution time.
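For purrr users, the same pattern carries over to furrr; a sketch assuming df and centers are defined as above:

library(furrr)
plan(multisession)

# future_map() accepts purrr-style lambdas; each cluster count runs on a worker
system.time(
  future_map(centers, ~ kmeans(df, centers = .x, nstart = 500))
)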
Additional Packages
For the sake of ease and brevity, this guide focused on the future framework for parallelization. However, you should be aware that there are a number of other ways to parallelize your code.
The parallel Package
The parallel package is included in your base R installation. It includes analogues of the various apply functions:
parLapply
mclapply - not available on Windows
These functions generally require more setup, especially on Windows machines.
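For reference, a minimal parLapply sketch might look like this (a cluster of 2 workers, purely for illustration; note that you create and stop the cluster yourself):

library(parallel)

cl <- makeCluster(2)                       # start two worker processes
out <- parLapply(cl, c(1, 4, 9, 16), sqrt)
stopCluster(cl)                            # always shut the workers down
unlist(out)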
The doParallel Package
The doParallel package builds off of parallel and is useful for code that uses for loops instead of lapply. Like the parallel package, it generally requires more setup, especially on Windows machines.
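A minimal sketch of the foreach / %dopar% pattern that doParallel enables (again, 2 workers purely for illustration):

library(doParallel)
library(foreach)

cl <- makeCluster(2)
registerDoParallel(cl)

# %dopar% sends each iteration to a registered worker; .combine collects results
res <- foreach(i = c(1, 4, 9, 16), .combine = c) %dopar% sqrt(i)
res

stopCluster(cl)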
Machine Learning - caret
For those running machine learning models, the caret package can easily leverage doParallel to speed up the execution of multiple models. Lifting the example from the package documentation:
library(caret)        # for train()
library(doParallel)

cl <- makePSOCKcluster(5)   # number of cores to use
registerDoParallel(cl)

## All subsequent models are then run in parallel
## ('training' is the placeholder dataset used in the caret documentation)
model <- train(y ~ ., data = training, method = "rf")

## When you are done:
stopCluster(cl)
Be sure to check out the full caret documentation for more detail.
Big Data
As data collection and storage become easier and cheaper, it is relatively simple to obtain large data files. An important point to keep in mind is that the size of your data will generally expand when it is read from a storage device into R. A general rule of thumb is that a file will take up somewhere around 3-4 times more space in memory than it does on disk.
For instance, compare the size of the iris data set when it is saved as a .csv file locally vs the size of the object when it is read into an R session:

file.size("optimization/data/iris.csv") / 1000
[1] 3.716

df <- readr::read_csv("optimization/data/iris.csv")
pryr::object_size(df)
10.14 kB
This means that on a standard Urban Institute desktop, you may have issues
reading in files that are larger than 4 GB.
Object Size
The type of your data can have a big impact on the size of your data frame
when you are dealing with larger files. There are four main types of atomic
vectors in R:
logical
integer
double (also called numeric)
character
Each of these data types occupies a different amount of space in memory: logical and integer vectors use 4 bytes per element, while a double will occupy 8 bytes. R uses a global string pool, so character vectors are hard to estimate, but they will generally take up more space per element.
Consider the following example:
x <- 1:100

pryr::object_size(x)
680 B

pryr::object_size(as.double(x))
680 B

pryr::object_size(as.character(x))
1.32 kB
An incorrect data type can easily cost you a lot of space in memory, especially at scale. This often happens when reading data from a text or csv file - data may have a format such as c(1.0, 2.0, 3.0) and will be read in as a numeric column, when integer is more appropriate and compact.
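As a rough illustration of the potential savings (a sketch only; sizes will vary, so no output is shown):

x_dbl <- rep(c(1.0, 2.0, 3.0), 1e5)   # stored as double (8 bytes per element)
x_int <- as.integer(x_dbl)            # same whole-number values as integer (4 bytes)

pryr::object_size(x_dbl)
pryr::object_size(x_int)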
You may also be familiar with factor variables within R. Essentially, a factor will represent your data as integers and map them back to their character representation. This can save memory when you have a relatively small number of unique levels:
x <- sample(letters, 10000, replace = TRUE)

pryr::object_size(as.character(x))
81.50 kB

pryr::object_size(as.factor(x))
42.10 kB
However, if each element is unique, or if there is not a lot of overlap among elements, then the overhead will make a factor larger than its character representation:
pryr::object_size(as.factor(letters))
2.22 kB

pryr::object_size(as.character(letters))
1.71 kB
Cloud Computing
Sometimes, you will have data that are simply too large to ever fit on your local desktop machine. If that is the case, then the Elastic Cloud Computing Environment from the Office of Technology and Data Science can provide you with easy access to powerful analytic tools for computationally intensive projects.
The Elastic Cloud Computing Environment allows researchers to quickly spin-up
an Amazon Web Services (AWS) Elastic Cloud Compute (EC2) instance. These
instances offer increased memory to read in large datasets, along with
additional CPUs to provide the ability to process data in parallel at an
impressive scale.
| Instance     | CPU | Memory (GB) |
|--------------|-----|-------------|
| Desktop      | 8   | 16          |
| c5.4xlarge   | 16  | 32          |
| c5.9xlarge   | 36  | 72          |
| c5.18xlarge  | 72  | 144         |
| x1e.8xlarge  | 32  | 976         |
| x1e.16xlarge | 64  | 1952        |
Feel free to contact Erika Tyagi (etyagi@urban.org) if this would be useful
for your project.
Common Pitfalls
For Loops and Vector Allocation
A refrain you will often hear is that for loops in R are slow and need to be
avoided at all costs. This is not true! Rather, an improperly constructed loop
in R can bring the execution of your program to a near standstill.
A common for loop structure may look something like:
x <- 1:100

out <- c()
for (i in x) {
  out <- c(out, sqrt(x))
}
The bottleneck in this loop is with the allocation of the vector out. Every time we iterate over an item in x and append it to out, R makes a copy of all the items already in out. As the size of the loop grows, your code will take longer and longer to run.
A better practice is to pre-allocate out
to be the correct length, and then
insert the results as the loop runs.
x <- 1:100

out <- rep(NA, length(x))
for (i in seq_along(x)) {
  out[i] <- sqrt(x[i])
}
A quick benchmark shows how much more efficient a loop with a pre-allocated
results vector is:
bad_loop <- function(x) {
  out <- c()
  for (i in x) {
    out <- c(out, sqrt(x))
  }
}

good_loop <- function(x) {
  out <- rep(NA, length(x))
  for (i in seq_along(x)) {
    out[i] <- sqrt(x[i])
  }
}

x <- 1:100

microbenchmark::microbenchmark(
  bad_loop(x),
  good_loop(x)
)
Unit: microseconds
         expr      min       lq       mean    median        uq       max neval
  bad_loop(x) 1042.179 1267.577 1891.78264 1328.5640 1446.9720 10173.125   100
 good_loop(x)    6.191    6.437   32.23338    6.7035   11.2545  2366.725   100
And note how performance of the “bad” loop degrades as the loop size grows.
y <- 1:250

microbenchmark::microbenchmark(
  bad_loop(y),
  good_loop(y)
)
Unit: microseconds
         expr       min         lq       mean    median        uq       max neval
  bad_loop(y) 19249.582 22663.9595 24695.7231 23492.774 24909.407 81335.882   100
 good_loop(y)    14.022    14.5345    21.2626    23.329    26.486    64.616   100
Vectorized Functions
Many functions in R are vectorized, meaning they can accept an entire vector
(and not just a single value) as input. The sqrt
function from the
prior examples is one:
x <- c(1, 4, 9, 16)
sqrt(x)
[1] 1 2 3 4
This removes the need to use lapply
or a for loop. Vectorized functions in
R are generally written in a compiled language like C, C++, or FORTRAN, which
makes their implementation faster.
x <- 1:100

microbenchmark::microbenchmark(
  lapply(x, sqrt),
  sqrt(x)
)
Unit: nanoseconds
            expr   min    lq     mean median      uq   max neval
 lapply(x, sqrt) 20172 20418 20847.27  20541 20725.5 37228   100
         sqrt(x)   287   328   397.70    369   369.0  2296   100
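The same principle applies to many common operations; for instance, a dedicated vectorized helper like colMeans is typically much faster than applying mean over each column (a quick sketch; no timings shown since they vary by machine):

m <- matrix(runif(1e6), ncol = 100)

microbenchmark::microbenchmark(
  apply(m, 2, mean),   # loops over columns in R
  colMeans(m)          # vectorized, implemented in compiled code
)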