Boost Computational Speed with Efficient Coding in R

Making Sense of Big Data

Learn code benchmarking to optimize data analysis of hundreds of observations from the HolzingerSwineford1939 dataset


R is built for smart statistical data analysis and lets us write very little code compared to other programming languages such as C or Python. However, it can feel a bit slow from time to time. Since some of us work with large and complex datasets, computational speed comes in quite handy. But honestly, do we always construct our code in the most efficient way? I am not sure. That is why it is essential to become familiar with the main techniques for speeding up your analysis: they make it much easier to get your results quickly instead of waiting ages for the code to finish. This is what I will show you by means of a typical workflow for exploratory data analysis in R, including:

  • Getting an overview of the data
  • Wrangling data to a suited format
  • Summarizing data to get some descriptive statistics
  • Plotting aspects of interest

Sometimes you may also end up grouping your data into smaller pieces to gain additional insights (e.g., per person, per state, per sex…). Now how can we speed up the process and become more powerful programmers?

Before we start with a little case study, here are some general pieces of advice to speed up your analysis:

1. Keep your R version up-to-date

Make sure you update your R version regularly. New versions of R usually include speed improvements and bug fixes that were developed under the hood. But speaking from personal experience, I know that updating can sometimes be a pain because R may no longer find the packages in your library. In that case it can help to close R, open your file explorer and navigate to user > documents > R > win-library. Then select the folder holding the packages of the earlier version (in my case it was called 3.4, and I wanted to upgrade to version 4.0) and delete it manually. This avoids confusing errors like “package or namespace load failed for …” and R can find its packages without any trouble.

2. Avoid common speed bottlenecks

Even if this does not make your code faster as such, you can avoid some common pitfalls that make it slower than it needs to be. First of all, do not store more variables/objects than necessary. For example, if you are interested in the mean of X, a common habit is to store the result using the ‘<-’ operator. This allocates memory for the object, so make sure you only store the variables you need for the subsequent steps of your analysis. If you are only interested in a quick insight, you can print the result directly without allocating extra memory on your computer.
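A minimal illustration (the data frame df and its column x are hypothetical):

    # Storing the result creates an object in memory that you may never reuse
    mean_x <- mean(df$x)

    # For a quick look, simply call the function without assigning the result
    mean(df$x)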

Another way to save time is to avoid repetition. I can remember a time in 2018 when I started coding and ran the same procedure across each row of a data frame by hand, producing embarrassingly tedious and messy code:
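(The original code was shown as a screenshot; this is a hypothetical reconstruction with a made-up data frame df.)

    # The same calculation, repeated by hand for every single row
    df$total    <- NA
    df$total[1] <- df$x1[1] + df$x2[1] + df$x3[1]
    df$total[2] <- df$x1[2] + df$x2[2] + df$x3[2]
    df$total[3] <- df$x1[3] + df$x2[3] + df$x3[3]
    # ... and so on, one line per row of the data frame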

It could be fixed by using tidier dplyr syntax like:
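(Again a sketch, sticking with the hypothetical df from above.)

    library(dplyr)

    # One vectorized mutate() call handles every row at once
    df <- df %>%
      mutate(total = x1 + x2 + x3)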

Without knowing more elegant ways to do this, I ended up with endless lines of code, making it nearly impossible to keep track of what all the steps did. If you repeat a custom procedure over and over again and there is no built-in function for this purpose, write a function of your own to keep your code tidy (for example, to generate a specific plot for different groups without faceting, which I will show you below). Last but not least, avoid loops where possible. Here is a simple example: suppose you would like to generate a sequence of integers with a fixed length n. You could keep it as simple as this:
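(A minimal sketch; n is an arbitrary length.)

    n <- 100000

    # Vectorized: a single call returns the whole sequence
    X <- seq_len(n)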

Or you could run a for loop, iterating from 1 to n to fill the vector X:
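(The same result, built element by element.)

    # Pre-allocated vector, filled element by element inside a loop
    X <- numeric(n)
    for (i in 1:n) {
      X[i] <- i
    }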

This is not necessary, since you could simply use the first version, but it only slows down your code by a few seconds at most. What if you instead created an empty vector X and grew it step by step?
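That would look something like this:

    # Growing the vector at every single iteration
    X <- c()
    for (i in 1:n) {
      X <- c(X, i)
    }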

PLEASE NEVER DO THIS! Why? At each iteration of this for-loop, R has to copy the existing vector and request more memory, which can take hours when you want to create large vectors.

My rule of thumb: if there is an option to run the same operation in one line of code (e.g., using base R, data.table or dplyr functions), go for it, since those functions are already heavily speed-optimized by their respective developer teams. You may also have heard programmers suggest to “vectorize” your code. This means using built-in functions that are called only once and return a vector with a fixed number of entries (e.g., drawing n values from a normal distribution with rnorm(n)). In contrast, if you use a loop to run the same operation, the function gets called once per iteration (like in the last example). Depending on how large the resulting vector should be, growing it inside a for-loop can take hours to execute.

Last but not least, try to use matrices instead of data frames wherever possible: if your data contains only one data type (character, integer, numeric, etc.), a matrix will give you a massive speed boost because it is more memory-efficient than a data frame. If, however, you need to store different types of data in a rectangular shape, data frames, tibbles or data.tables are the more appropriate solution.
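To make the idea of vectorization concrete, here is a small sketch:

    n <- 1000000

    # Vectorized: rnorm() is called once and returns n values
    x <- rnorm(n)

    # Not vectorized: rnorm() is called n times, once per iteration
    x <- numeric(n)
    for (i in 1:n) {
      x[i] <- rnorm(1)
    }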

3. Benchmark your code

You cannot really know whether your code is “too slow” or “fast enough” if you have no reference to compare it with, and this is where code benchmarking and profiling come into play. They are based on a simple principle: write different versions of the same code, track the time each of them takes to execute, and then pick the fastest. Even if some operations simply cannot be optimized any further, this really helps to find the bottlenecks that slow down the whole execution. If you repeat this process from time to time, you will develop an eye for the troublemakers and for how to overcome them, leaving you with a sense of accomplishment and self-efficacy. There are multiple ways to benchmark your code in R, and I will present them in the following sections.

Disclaimer: All images are created by the author unless stated otherwise.


Suppose you had a time machine and went back to the year 1939. Your mission is to help a group of teachers who have conducted a set of mental ability tests on 301 seventh- and eighth-grade children from two different schools (Pasteur and Grant-White); the dataset is called HolzingerSwineford1939 and comes with the lavaan package. You would like to prepare meaningful feedback for each pupil, a tedious job for a teacher in 1939! Thanks to your data science skills, you know what to do: you are going to prepare, summarize and visualize the results to end up with a nice graph for each child. Because your time machine only has limited time slots for travel, you need to optimize the speed of code execution. This is what we are going to do together: we will play through different versions of the code while tracking time and memory, using base R functions, functions from the tidyverse, and functions from the data.table package. To distinguish the different syntax styles, here is a short overview of the packages:

Base R

As the name suggests, you do not need to install an extra package to use base R; it is built in and loaded automatically. The $ operator and the subset() function, for example, are part of base R.

Tidyverse/dplyr

This style has become very popular in the R community, and I honestly think this is totally justified: the tidyverse style of writing code made my life so much easier once I had discovered the %>% pipe operator and the sheer beauty of its readability, consistency, and reproducibility. The tidyverse actually comprises a range of packages that are highly compatible with each other, such as dplyr, stringr, magrittr and purrr. In particular, it offers a set of very useful functions for data manipulation such as mutate(), filter() and group_by().
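A small taste of this style (assuming the HolzingerSwineford1939 data from lavaan is already loaded; the pipeline itself is just an illustration):

    library(dplyr)

    # Filter rows, derive a column and summarise by group in one readable pipeline
    HolzingerSwineford1939 %>%
      filter(grade == 7) %>%
      mutate(age = ageyr + agemo / 12) %>%
      group_by(school) %>%
      summarise(mean_age = mean(age))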

data.table

data.table is extremely memory-efficient, especially when it comes to data manipulation, but its syntax differs from that of the tidyverse and is a bit harder to learn and read. Unlike the tidyverse, it refers to a single package: data.table. According to the developers, it allows for:

“Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all […], Offers a natural and flexible syntax, for faster development.”

It even supports low-level parallelism: many common operations are internally parallelized to use multiple CPU threads. Its superpower, consequently, is speed, even when working with large datasets and complicated data manipulation operations. Does this mean that data.table functions will really be more time-efficient than their tidyverse or base R counterparts at every step of the data manipulation pipeline? We will see. Let's find out.
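A first taste of the syntax (the grouped summary below is my own illustration):

    library(data.table)

    # The same kind of grouped summary in data.table syntax
    DT <- as.data.table(HolzingerSwineford1939)
    DT[grade == 7, .(mean_age = mean(ageyr + agemo / 12)), by = school]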

Firstly, we will set our workspace and load the HolzingerSwineford1939 dataset from the lavaan package.
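A sketch of that setup (the original code was shown as a screenshot; the package choices follow the packages used throughout this post):

    # Load the packages used throughout this post
    library(lavaan)          # provides the HolzingerSwineford1939 dataset
    library(tidyverse)       # dplyr, tidyr, ggplot2, purrr, ...
    library(data.table)
    library(microbenchmark)
    library(profvis)

    # Attach the dataset: 301 observations, 15 columns
    data("HolzingerSwineford1939")
    dim(HolzingerSwineford1939)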

[Image: overview of the HolzingerSwineford1939 dataset]

As you can see, the dataset contains 301 observations and 15 columns, including the pupils’ ID, sex, age, school, grade and scores on each of the 9 intelligence tasks. To make sure each column has the right format, we will use two different functions to inspect the structure of our data. For this purpose, we wrap each of them into system.time() to track how long the computer takes to execute it.
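A sketch of the first timed call:

    # Time the tidyverse way of inspecting the structure of the data
    system.time(glimpse(HolzingerSwineford1939))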

[Image: system.time() output for glimpse()]

In the output you can see that R first executes the function we are actually interested in before giving us the timing result, which consists of several key indicators:

  • User time is the CPU time charged for the execution of user instructions.
  • System time is the CPU time charged to the system on behalf of the calling process.
  • Elapsed time (German: “verstrichen”) is the time we are usually interested in, since it is approximately the sum of user and system time. In our example, this was really fast: only 0.07 seconds.

What if we used str() instead of glimpse()?
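The corresponding call would look like this (a sketch):

    # Time the base R equivalent
    system.time(str(HolzingerSwineford1939))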

[Image: system.time() output for str()]

It seems like the base R function runs a bit faster, but 0.04 seconds should not make much of a difference. Next, we are going to wrangle the data. More specifically, we will

  • Select only the variables we are interested in,
  • assign more meaningful names to each of the 9 intelligence tasks,
  • change the data from wide to long format and
  • change the Task variable to a factor, which will be useful for plotting later.

For this purpose, we write the procedure both in dplyr and in data.table syntax. To compare them against each other, we wrap each version into a function so that we can use system.time() again.

dplyr-version

[Image: dplyr wrangling code and its system.time() output]
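The screenshot above held the author's code and timing; a sketch of this step could look as follows (the task labels reflect the usual interpretation of x1 to x9 in HolzingerSwineford1939, and the helper name wrangle_dplyr() is my own):

    library(tidyr)

    wrangle_dplyr <- function(data) {
      data %>%
        # keep only the columns we need
        select(id, sex, school, x1:x9) %>%
        # give the nine tasks meaningful names
        rename(Visual_perception = x1, Cubes = x2, Lozenges = x3,
               Paragraph_comprehension = x4, Sentence_completion = x5,
               Word_meaning = x6, Speeded_addition = x7,
               Speeded_counting_of_dots = x8,
               Speeded_discrimination_capitals = x9) %>%
        # reshape from wide to long format
        pivot_longer(cols = -c(id, sex, school),
                     names_to = "Task", values_to = "Score") %>%
        # turn Task into a factor for plotting
        mutate(Task = factor(Task))
    }

    system.time(df_long <- wrangle_dplyr(HolzingerSwineford1939))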

data.table-version

[Image: data.table wrangling code and its system.time() output]
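Again a sketch of what the screenshot above showed (the helper name wrangle_dt() is my own):

    task_names <- c("Visual_perception", "Cubes", "Lozenges",
                    "Paragraph_comprehension", "Sentence_completion",
                    "Word_meaning", "Speeded_addition",
                    "Speeded_counting_of_dots",
                    "Speeded_discrimination_capitals")

    wrangle_dt <- function(data) {
      # keep only the columns we need
      DT <- as.data.table(data)[, c("id", "sex", "school", paste0("x", 1:9)), with = FALSE]
      # give the nine tasks meaningful names
      setnames(DT, paste0("x", 1:9), task_names)
      # melt() reshapes from wide to long; the Task column becomes a factor by default
      DT_long <- melt(DT, id.vars = c("id", "sex", "school"),
                      variable.name = "Task", value.name = "Score")
      DT_long[]
    }

    system.time(dt_long <- wrangle_dt(HolzingerSwineford1939))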

Okay, it appears that data.table does a more efficient job of wrangling the data into the right format and is about three times faster than dplyr.

For the interested readers, I have found an interesting Stack Overflow discussion on the differences between data.table and dplyr.

Now, let’s extend our timing comparisons a bit by making use of the microbenchmark package. Its advantage over system.time() is that it provides a more detailed summary of the timings of different functions at one glance and runs several iterations per function to give you more reliable results. In the package description, the developers state that more precise timing is achieved by using sub-millisecond (supposedly nanosecond) accurate timing functions written in C code. To make use of it, we will write different versions of code that calculate the mean score per task and school and wrap each of them into a function. Later, this summary will serve as a reference for comparing each pupil’s scores to the school-specific performance on each task.
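A sketch of the three versions and of the benchmark call (the function names and times = 100 are my own choices; df_long and dt_long come from the wrangling sketches above):

    mean_base <- function() {
      aggregate(Score ~ school + Task, data = df_long, FUN = mean)
    }

    mean_dplyr <- function() {
      df_long %>%
        group_by(school, Task) %>%
        summarise(mean_score = mean(Score), .groups = "drop")
    }

    mean_dt <- function() {
      dt_long[, .(mean_score = mean(Score)), by = .(school, Task)]
    }

    # Run each version 100 times and compare the timing distributions
    microbenchmark(base = mean_base(), dplyr = mean_dplyr(),
                   data.table = mean_dt(), times = 100)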

[Image: microbenchmark output comparing the base R, dplyr and data.table versions]

Wow, the microbenchmark function even ranks the functions by time efficiency, making it easy to pick the fastest one directly. In this case, the data.table function is fastest, followed by the tidyverse version and then the base R function. Looking at the relative speeds, the base R function is almost 4 times slower and the dplyr function about 3 times slower than the data.table function! So data.table is again the clear speed winner here.

Next, we will challenge each of the packages with a more sophisticated task: we would like to create a graph showing the task performance of each pupil and then automatically export each graph into a prespecified folder. By including the school-specific mean scores, we get a reference line to see where a child stands compared to his or her peers. Put differently, we would like to see whether he or she shows above- or below-average performance on each of the intelligence tasks.
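The plotting code itself was shown as a screenshot; a sketch of a helper that builds such a graph could look like this (the layout details, the reference object school_means and the helper name pupil_plot() are assumptions):

    # Keep the school-level means as a reference for the plots
    school_means <- mean_dplyr()

    pupil_plot <- function(pupil_data, means, pupil_id) {
      ggplot(pupil_data, aes(x = Task, y = Score)) +
        geom_col(fill = "grey85") +
        # coloured dots mark the mean score of each school per task
        geom_point(data = means,
                   aes(x = Task, y = mean_score, colour = school)) +
        coord_flip() +
        labs(title = paste("Pupil", pupil_id),
             x = NULL, y = "Score", colour = "School mean")
    }

    # Example: the feedback graph for the first pupil
    pupil_plot(subset(df_long, id == 1), school_means, pupil_id = 1)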

[Image: feedback graph for a single pupil with the school means as reference points]

In this example, you can see that this pupil appears to show above-average performance on the speeded discrimination of capitals as well as on sentence completion. He or she was particularly strong on the cube task, while showing average performance on counting dots and average-to-below-average performance on the remaining tasks. Looking at the colored dots, we could argue that the Grant-White school seems to do slightly better on the verbal tasks (paragraph comprehension, sentence completion and word meaning) on average, while both schools show similar results on the speed-related tasks (in particular, speeded discrimination of capitals and speeded counting of dots). But please keep in mind that we refer to the school-specific performance only: we have no information about a reference population that could be used to norm the results, and we have no idea about the natural variation in scoring between pupils across schools.

As I said before, we would like to generate a graph for each of the 301 pupils and automatically export it to our prespecified folder. For this purpose, we will analyse each version of the code using the profvis package, a helpful interactive graphical interface that shows you where the computer spends most time and memory. This way you will find the true speed bottlenecks in your code.

As a reference, we first create a loop that iterates through the pupils’ IDs, creates each plot and then saves it. At the end of each version of this code, we delete the files in our plots folder to keep everything constant across trials (and thus not overwrite any existing png file).
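A sketch of that loop (the folder name plots/ and the file-naming scheme are assumptions; pupil_plot() is the hypothetical helper from above):

    dir.create("plots", showWarnings = FALSE)

    profvis({
      for (i in unique(df_long$id)) {
        # build the plot for one pupil and store it temporarily
        p <- pupil_plot(df_long[df_long$id == i, ], school_means, pupil_id = i)
        # export it with the pupil ID in the file name
        ggsave(filename = file.path("plots", paste0("pupil_", i, ".png")), plot = p)
      }
      # empty the folder again so every version starts from the same state
      file.remove(list.files("plots", pattern = "\\.png$", full.names = TRUE))
    })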

[Image: profvis output for the for-loop version]

You can see that saving the ggplot object to a temporary object (to be overwritten in the next iteration) already consumes a considerable amount of memory and accounts for 4% of the total execution time.

[Image: profvis flame graph showing the time spent in ggsave]

What concerns me even more is that almost 96% of the time is spent running the ggsave function on each of the temporary ggplot objects while generating the appropriate file name by iterating through the vector of pupil IDs. At least we now have a folder with a distinct file for every pupil without ever touching it by hand:

[Image: the plots folder containing one png file per pupil]

The advantage of the tidyverse style of graph automation is its comprehensibility: first we group the data by pupil ID and then store a ggplot object in a data frame alongside the pupils’ ID, thanks to the practical do() function. The do() function takes the preceding dplyr manipulations, executes a specific function (ggplot in this case) and stores the output in a data frame.
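A sketch of this version (inside do(), the dot stands for the data of a single pupil; the export step afterwards is my own guess at how the stored plots were written to disk):

    profvis({
      # store one ggplot object per pupil in a list column next to the ID
      plots_df <- df_long %>%
        group_by(id) %>%
        do(plot = pupil_plot(., school_means, pupil_id = unique(.$id)))

      # export every stored plot
      for (k in seq_len(nrow(plots_df))) {
        ggsave(filename = file.path("plots", paste0("pupil_", plots_df$id[k], ".png")),
               plot = plots_df$plot[[k]])
      }
      file.remove(list.files("plots", pattern = "\\.png$", full.names = TRUE))
    })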

[Image: profvis output for the dplyr/do() version]

As you can see, the tidyverse version is a bit faster than the for-loop, but not impressively so (188150 ms vs. 218730 ms). The loop is about 1.16 times slower, which means the tidyverse version of the code brings a speed boost of roughly 14%.

If you move your cursor over the lower portion of the graph, you can see the detailed history of the functions executed behind the scenes: for example, it seems like each ggplot is first printed before being exported, even though we never notice it. In addition, we get a good overview of the functions that take the most time and memory if we click on “Data” instead of the Flame Graph. After our previous analysis, it no longer surprises me that ggsave slows down the process so much. What I do find remarkable is that the grouping procedure allocates such a considerable amount of memory, since we split the data frame into a large number of groups (one for each child).

[Image: profvis output]

Here is a little bonus: we will try a technique from the purrr package to generate the graphs and find out whether it is more efficient than the dplyr version. The idea is to split the data frame by ID; each subset is a data frame stored in a large list. Then we map the ggplot function to each element of this list. This is where pwalk() comes into play: it is a variant of a mapping function that allows you to provide any number of arguments in a list. In our case it helps us go through each element of the list (a data frame per pupil), apply a function .f (the ggplot function) and export the results accordingly.
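A sketch of this idea (again, the concrete code was shown as an image):

    library(purrr)

    profvis({
      # one data frame per pupil, stored in a named list
      plot_list <- split(df_long, df_long$id)

      # pwalk() walks through the list elements and the matching file names
      pwalk(
        list(data     = plot_list,
             filename = file.path("plots", paste0("pupil_", names(plot_list), ".png"))),
        function(data, filename) {
          ggsave(filename = filename,
                 plot = pupil_plot(data, school_means, pupil_id = unique(data$id)))
        }
      )
      file.remove(list.files("plots", pattern = "\\.png$", full.names = TRUE))
    })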

Even though this syntax appears so pretty and elegant to me, the code execution allocates most time to the actual plotting, even more than the dplyr version of the code does. Now the question remains — can data.table do a better job automating plot generation?

As a first step, we create a function that generates and saves a ggplot. Then we go through the data.table and apply this function to each subset (grouped by ID).
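A sketch of this approach (the function name save_pupil_plot() is made up; .SD is the subset of dt_long for one pupil and .BY$id holds that pupil's ID):

    save_pupil_plot <- function(sd, pupil_id) {
      # build and export the plot for one pupil
      p <- pupil_plot(sd, school_means, pupil_id = pupil_id)
      ggsave(filename = file.path("plots", paste0("pupil_", pupil_id, ".png")), plot = p)
      NULL   # return nothing so data.table does not collect a result column
    }

    profvis({
      dt_long[, save_pupil_plot(.SD, .BY$id), by = id]
      file.remove(list.files("plots", pattern = "\\.png$", full.names = TRUE))
    })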

[Image: profvis output for the data.table version]

Huh? Well, data.table is not faster at generating the graphs than the dplyr version; it is even a bit slower. I suspect this is because we have violated several rules of writing efficient R code: we assign a ggplot object to a temporary variable (p) each time we call the function, and the function doing both the plot generation and the saving is called every time the data table gets subsetted by ID. But what actually happens under the hood? Keep in mind that data.table is built in analogy to SQL queries and has the following basic syntax:

    DT[i, j, by]
    #   R:                  i                j        by
    # SQL:  where | order by   select | update  group by

According to the data.table documentation, .BY is a list containing a length-1 vector for each item in by (the pupil IDs in our case). The by variables are also available to j directly by name, which is useful, for example, for titles of graphs if j is a plot command (like in our example). .SD can be understood to stand for Subset, Selfsame, or Self-reference of the Data; in particular, .SD is itself a data.table. So we call our customized function on each of the 301 newly generated data.tables subsetted by pupil ID. I feel like you can do something like this with data.table, but it is not what it is designed for.
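A tiny illustration of .SD and .BY on our data:

    # For each school, .SD holds that school's rows and .BY$school holds its name
    dt_long[, .(n_rows = nrow(.SD), school_name = .BY$school), by = school]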

What we have also learnt is that…

  • Running the grouping operation together with the plotting function (as in the do() call) can take a lot of time.
  • We should avoid saving unnecessary variables in temporary objects (such as ggplots).
  • Loops are not always the worse option compared to map- and apply-functions, since they are based on the same principle.
  • The data.table syntax is a champion for data manipulation, but it cannot be applied to every step of a data analysis.
  • ggsave takes most of the time, but we can tune down the prespecified image resolution a bit to make it less costly.

Here I have tried to combine the best from all worlds:
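A sketch of what that combination could look like: data.table for the data manipulation, the predefined pupil_plot() helper, purrr::pwalk() for the iteration and a reduced resolution in ggsave() (the dpi value of 150 is my own choice):

    profvis({
      # one data.table per pupil, stored in a named list
      plot_list <- split(dt_long, by = "id")

      pwalk(
        list(data     = plot_list,
             filename = file.path("plots", paste0("pupil_", names(plot_list), ".png"))),
        function(data, filename) {
          ggsave(filename = filename,
                 plot = pupil_plot(data, school_means, pupil_id = unique(data$id)),
                 dpi = 150)   # lower resolution (assumed value) to speed up the export
        }
      )
      file.remove(list.files("plots", pattern = "\\.png$", full.names = TRUE))
    })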

Since the ggsave function spends so much of its time exporting a high-resolution image, we tune the resolution down just a bit via the dpi argument.

[Image: profvis output for the combined version]

… And tada! Compared to the purrr version of this code, we were able to reduce the computation time by 22% and have found the fastest version yet to automate graph generation for a large set of individuals. Code benchmarking paid off.

If you are familiar with Python, you may have heard of parallel computing before. In theory, you can boost computational speed by assigning different parts of a task to each CPU core on your computer. The CPU (central processing unit) is the brain of your computer, responsible for the computations that happen behind the scenes, and by default R uses only one of its cores. You can find out how many cores your machine has by running the following code:
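(Using the parallel package that ships with R.)

    library(parallel)

    # Number of CPU cores available on this machine
    detectCores()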

[Image: output of detectCores()]

It appears that I have 4 cores on my computer. Can I make use of them in every piece of my code? Unfortunately not. Not every computation can be run in parallel; it does not work, for example, if each step of the code depends on the result of the previous one, like in our graph automation example. But if you could read the code forwards and backwards and it would still give you the expected results, it might be a good candidate for parallel computation. This often applies to arithmetic operations such as random sampling. If this sounds like a task you might be interested in, the parallel package in R could be for you; there is a nice tutorial online that shows you how it works.
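A minimal sketch of what such a task could look like with the parallel package (the example of drawing random numbers on several worker processes is hypothetical):

    library(parallel)

    # Start one worker process per core, leaving one core free
    cl <- makeCluster(detectCores() - 1)

    # Each worker draws one million random numbers independently
    res <- parLapply(cl, 1:4, function(i) mean(rnorm(1e6)))

    # Always release the workers again
    stopCluster(cl)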

You can learn to write more efficient code if you take the time to reflect on timing; there are various tools in R that help you improve. This pays off especially when you work with larger datasets and more complicated data science tasks, because you will get to your insights faster and avoid trouble if you ever have to work under time pressure. Another upside: it will make your code cleaner and enhance its readability, something your colleagues will definitely benefit from if you work on data science projects in a team. Go and start to analyse your analysis!

References

[1] Y. Rosseel, T. D. Jorgensen, N. Rockwood, D. Oberski, J. Byrnes, L. Vanbrabant, … & H. Du (2021), HolzingerSwineford1939, lavaan R package

[2] R Core Team and contributors worldwide (2021), base R package

[3] H. Wickham (2017), tidyverse R package

[4] M. Dowle, A. Srinivasan, J. Gorecki, M. Chirico, P. Stetsenko, T. Short, … & X. Tan (2019), data.table R package, Extension of ‘data.frame’

[5] O. Mersmann & S. Krey (2011), microbenchmark: A package to accurately benchmark R expressions, The R User Conference, useR! 2011, University of Warwick, Coventry, UK (p. 142)

[6] W. Chang, J. Luraschi & T. Mastny (2019), profvis: Interactive Visualizations for Profiling R Code, R package version 0.3.7

[7] R Core Team and contributors worldwide (2020), parallel R package
