An Introduction to the Pipe in R (2024)

R Fundamentals

R’s most important operator for data processing, explained.

Data analysis often involves many steps. A typical journey from raw data to results might involve filtering cases, transforming values, summarising data, and then running a statistical test. But how can we link all these steps together, while keeping our code efficient and readable? Enter the pipe, R’s most important operator for data processing.

What does the pipe do?

The pipe operator, written as %>%, is a longstanding feature of the magrittr package for R. It takes the output of one function and passes it into another function as its first argument. This allows us to link a sequence of analysis steps.
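To make that concrete, here is a minimal sketch of the rewrite the pipe performs, assuming only that the magrittr package is installed. The vector x and the choice of sqrt, sum and round are illustrative and not part of the analysis in this article.

```r
library(magrittr)

# Illustrative input (an assumption for this sketch)
x <- c(1, 4, 9)

# x %>% f() is evaluated as f(x), so these two lines do the same work
piped  <- x %>% sqrt() %>% sum()
nested <- sum(sqrt(x))

# Extra arguments follow the piped value: x %>% f(y) is f(x, y)
piped_round  <- 3.14159 %>% round(digits = 2)
direct_round <- round(3.14159, digits = 2)

identical(piped, nested)              # TRUE
identical(piped_round, direct_round)  # TRUE
```

Reading the piped line from left to right gives the order in which the functions run, whereas the nested line has to be read from the inside out.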

To visualise this process, imagine a factory with different machines placed along a conveyor belt. Each machine is a function that performs a stage of our analysis, like filtering or transforming data. The pipe therefore works like a conveyor belt, transporting the output of one machine to another for further processing.

We can see exactly how this works in a real example using the mtcars dataset. This dataset comes with base R, and contains data about the specs and fuel efficiency of various cars. The code below groups the data by the number of cylinders in each car, and then returns the mean miles-per-gallon of each group. Make sure to install the tidyverse suite of packages before running this code, since it includes both the pipe and the group_by and summarise functions.

library(tidyverse)

result <- mtcars %>% 
  group_by(cyl) %>% 
  summarise(meanMPG = mean(mpg))

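The pipe operator feeds the mtcars dataframe into the group_by function, and then the output of group_by into summarise. The outcome of this process is stored in result, a tibble with one row for each cylinder count in the data (4, 6 and 8) and the mean miles-per-gallon of that group in the meanMPG column.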

Although this example is very simple, it demonstrates the basic pipe workflow. To go even further, I’d encourage playing around with this. Perhaps swap and add new functions to the ‘pipeline’ to gain more insight into the data. Doing this is the best way to understand how to work with the pipe. But why should we use it in the first place?
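As one possible starting point for that experimentation, the sketch below extends the pipeline with a few more dplyr steps. The extra steps (keeping only manual cars via am == 1, counting the cars in each group, and sorting by the mean) are illustrative choices, not part of the original example.

```r
library(dplyr)

result <- mtcars %>%
  filter(am == 1) %>%                      # keep only manual-transmission cars
  group_by(cyl) %>%                        # one group per cylinder count
  summarise(meanMPG = mean(mpg),           # mean miles-per-gallon of each group
            n = n()) %>%                   # number of cars in each group
  arrange(desc(meanMPG))                   # most fuel-efficient group first

result
```

Each added step is one more line ending in %>%, so steps can be commented out or reordered one at a time to see how the result changes.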

Why should we use the pipe?

The pipe has a huge advantage over any other method of processing data in R: it makes processes easy to read. If we read %>% as “then”, the code from the previous section is very easy to digest as a set of instructions in plain English:

Load tidyverse packages
To get our result, take the mtcars dataframe, THEN
 Group its entries by number of cylinders, THEN
 Compute the mean miles-per-gallon of each group

This is far more readable than if we were to express this process in another way. The two options below are different ways of expressing the previous code, but both are worse for a few reasons.

# Option 1: store each step in the process sequentially
result <- group_by(mtcars, cyl)
result <- summarise(result, meanMPG = mean(mpg))

# Option 2: nest the functions inside one another
result <- summarise(
  group_by(mtcars, cyl), 
  meanMPG = mean(mpg))

Option 1 gets the job done, but overwriting our output dataframe result in every line is problematic. For one, doing this for a procedure with lots of steps isn’t efficient and creates unnecessary repetition in the code. This repetition also makes it harder to identify exactly what is changing on each line in some cases.

Option 2 is even less practical. Nesting each function we want to use gets ugly fast, especially for long procedures. It’s hard to read, and harder to debug. This approach also makes it tough to see the order of steps in the analysis, which is bad news if you want to add new functionality later.
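The difference between the three styles is purely one of readability. As a quick check, the sketch below runs the piped version alongside Option 1 and Option 2 and compares the results. The variable names piped, stepwise and nested are introduced here for illustration only.

```r
library(dplyr)

# The piped version from the previous section
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(meanMPG = mean(mpg))

# Option 1: overwrite an intermediate object at each step
stepwise <- group_by(mtcars, cyl)
stepwise <- summarise(stepwise, meanMPG = mean(mpg))

# Option 2: nest the function calls
nested <- summarise(group_by(mtcars, cyl), meanMPG = mean(mpg))

# All three approaches produce the same tibble
all.equal(piped, stepwise)  # TRUE
all.equal(piped, nested)    # TRUE
```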

It’s easy to see how using the pipe can substantially improve most R scripts. It makes analyses more readable, removes repetition, and simplifies the process of adding and modifying code. Is there anything it can’t do?

What are the pipe’s limitations?

Although it’s immensely handy, the pipe isn’t useful in every situation. Here are a few of its limitations:

Because it chains functions in a linear order, the pipe is less applicable to problems that include multidirectional relationships.
The pipe can only transport one object at a time, meaning it’s not so suited to functions that need multiple inputs or produce multiple outputs.
It doesn’t work with functions that use the current environment, nor functions that use lazy evaluation. Hadley Wickham’s book “R for Data Science” has a couple of examples of these.
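For the second limitation, one partial workaround worth knowing is magrittr's dot placeholder, which lets the piped object land somewhere other than the first argument. The sketch below is illustrative only: the choice of lm and the formula mpg ~ wt are assumptions made for the example, not part of the analysis above.

```r
library(magrittr)

# lm() expects a formula first and the data second, so a plain
# mtcars %>% lm(mpg ~ wt) would put mtcars into the formula slot.
# The dot tells %>% where the piped object should go instead.
model <- mtcars %>% lm(mpg ~ wt, data = .)

# The same fit written without the pipe, for comparison
direct <- lm(mpg ~ wt, data = mtcars)

# Both calls should estimate the same coefficients
all.equal(coef(model), coef(direct))
```

The pipe still carries a single object here. The placeholder only changes which argument receives it, so functions that genuinely need several piped inputs remain a poor fit.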

These things are to be expected. Just as you’d struggle to build a house with a single tool, no lone feature will solve all your programming problems. But for what it’s worth, the pipe is still pretty versatile. Although this piece focused on the basics, there’s plenty of scope for using the pipe in advanced or creative ways. I’ve used it in a variety of scripts, data-focused and not, and it’s made my life easier in each instance.

Bonus pipe tips!

Thanks for reading this far. As a reward, here are some bonus pipe tips and resources:

Fed up of awkwardly typing %>%? The slightly easier keyboard shortcut CTRL + SHIFT + M (CMD + SHIFT + M on macOS) will insert a pipe in RStudio!
Need style guidance about how to format pipes? Check out this helpful section from ‘R Style Guide’ by Hadley Wickham.
Want to learn a bit more about the history of pipes in R? Check out this blog post from Adolfo Álvarez.

The pipe is great. It turns your code into a list of readable instructions and has lots of other practical benefits. Now that you know about the pipe, use it, and watch your code turn into a narrative.