Manipulating, analyzing and exporting data with tidyverse (2024)

Bracket subsetting is handy, but it can be cumbersome and difficultto read, especially for complicated operations. Enterdplyr. dplyris a package for helping with tabular data manipulation. It pairs nicelywith tidyr which enables you to swiftlyconvert between different data formats for plotting and analysis.

The tidyverse package is an“umbrella-package” that installs tidyr,dplyr, and several other useful packagesfor data analysis, such as ggplot2,tibble, etc.

The tidyverse package tries to address3 common issues that arise when doing data analysis in R:

You should already have installed and loaded thetidyverse package. If you haven’t alreadydone so, you can type install.packages("tidyverse")straight into the console. Then, type library(tidyverse) toload the package.

What are dplyr andtidyr?

The package dplyr provides helper toolsfor the most common data manipulation tasks. It is built to workdirectly with data frames, with many common tasks optimized by beingwritten in a compiled language (C++). An additional feature is theability to work directly with data stored in an external database. Thebenefits of doing this are that the data can be managed natively in arelational database, queries can be conducted on that database, and onlythe results of the query are returned.

This addresses a common problem with R in that all operations areconducted in-memory and thus the amount of data you can work with islimited by available memory. The database connections essentially removethat limitation in that you can connect to a database of many hundredsof GB, conduct queries on it directly, and pull back into R only whatyou need for analysis.

The package tidyr addresses the commonproblem of wanting to reshape your data for plotting and usage bydifferent R functions. For example, sometimes we want data sets where wehave one row per measurement. Other times we want a data frame whereeach measurement type has its own column, and rows are instead moreaggregated groups (e.g., a time period, an experimental unit like a plotor a batch number). Moving back and forth between these formats isnon-trivial, and tidyr gives you tools forthis and more sophisticated data manipulation.

To learn more about dplyr andtidyr after the workshop, you may want tocheck out this handydata transformation with dplyrcheatsheet and this oneabout tidyr.

As before, we’ll read in our data using the read_csv()function from the tidyverse packagereadr.

R

surveys <- read_csv("data_raw/portal_data_joined.csv")

OUTPUT

#> Rows: 34786 Columns: 13#> ── Column specification ────────────────────────────────────────────────────────#> Delimiter: ","#> chr (6): species_id, sex, genus, species, taxa, plot_type#> dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight#> #> ℹ Use `spec()` to retrieve the full column specification for this data.#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

R

## inspect the datastr(surveys)

R

## preview the dataview(surveys)

Next, we’re going to learn some of the most commondplyr functions:

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information fromother columns
  • group_by() and summarize(): create summarystatistics on grouped data
  • arrange(): sort results
  • count(): count discrete values

Selecting columns and filtering rows

To select columns of a data frame, use select(). Thefirst argument to this function is the data frame(surveys), and the subsequent arguments are the columns tokeep.

R

select(surveys, plot_id, species_id, weight)

To select all columns except certain ones, put a “-” infront of the variable to exclude it.

R

select(surveys, -record_id, -species_id)

This will select all the variables in surveys exceptrecord_id and species_id.

To choose rows based on a specific criterion, usefilter():

R

filter(surveys, year == 1995)

Pipes

What if you want to select and filter at the same time? There arethree ways to do this: use intermediate steps, nested functions, orpipes.

With intermediate steps, you create a temporary data frame and usethat as input to the next function, like this:

R

surveys2 <- filter(surveys, weight < 5)surveys_sml <- select(surveys2, species_id, sex, weight)

This is readable, but can clutter up your workspace with lots ofobjects that you have to name individually. With multiple steps, thatcan be hard to keep track of.

You can also nest functions (i.e.one function inside of another),like this:

R

surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)

This is handy, but can be difficult to read if too many functions arenested, as R evaluates the expression from the inside out (in this case,filtering, then selecting).

The last option, pipes, are a recent addition to R. Pipeslet you take the output of one function and send it directly to thenext, which is useful when you need to do many things to the samedataset. Pipes in R look like %>% and are made availablevia the magrittr package, installedautomatically with dplyr. If you useRStudio, you can type the pipe with Ctrl

  • Shift + M if you have a PC or Cmd +Shift + M if you have a Mac.

R

surveys %>% filter(weight < 5) %>% select(species_id, sex, weight)

In the above code, we use the pipe to send the surveysdataset first through filter() to keep rows whereweight is less than 5, then through select()to keep only the species_id, sex, andweight columns. Since %>% takes the objecton its left and passes it as the first argument to the function on itsright, we don’t need to explicitly include the data frame as an argumentto the filter() and select() functions anymore.

Some may find it helpful to read the pipe like the word “then.” Forinstance, in the example above, we took the data framesurveys, then we filtered for rowswith weight < 5, then we selectedcolumns species_id, sex, andweight. The dplyr functionsby themselves are somewhat simple, but by combining them into linearworkflows with the pipe we can accomplish more complex manipulations ofdata frames.

If we want to create a new object with this smaller version of thedata, we can assign it a new name:

R

surveys_sml <- surveys %>% filter(weight < 5) %>% select(species_id, sex, weight)surveys_sml

Note that the final data frame is the leftmost part of thisexpression.

Challenge

Using pipes, subset the surveys data to include animalscollected before 1995 and retain only the columns year,sex, and weight.

R

surveys %>% filter(year < 1995) %>% select(year, sex, weight)

Mutate

Frequently you’ll want to create new columns based on the values inexisting columns, for example to do unit conversions, or to find theratio of values in two columns. For this we’ll usemutate().

To create a new column of weight in kg:

R

surveys %>% mutate(weight_kg = weight / 1000)

You can also create a second new column based on the first new columnwithin the same call of mutate():

R

surveys %>% mutate(weight_kg = weight / 1000, weight_lb = weight_kg * 2.2)

If this runs off your screen and you just want to see the first fewrows, you can use a pipe to view the head() of the data.(Pipes work with non-dplyr functions, too,as long as the dplyr ormagrittr package is loaded).

R

surveys %>% mutate(weight_kg = weight / 1000) %>% head()

The first few rows of the output are full of NAs, so ifwe wanted to remove those we could insert a filter() in thechain:

R

surveys %>% filter(!is.na(weight)) %>% mutate(weight_kg = weight / 1000) %>% head()

is.na() is a function that determines whether somethingis an NA. The ! symbol negates the result, sowe’re asking for every row where weight is not anNA.

Challenge

Create a new data frame from the surveys data that meetsthe following criteria: contains only the species_id columnand a new column called hindfoot_cm containing thehindfoot_length values (currently in mm) converted tocentimeters. In this hindfoot_cm column, there are noNAs and all values are less than 3.

Hint: think about how the commands should be orderedto produce this data frame!

R

surveys_hindfoot_cm <- surveys %>% filter(!is.na(hindfoot_length)) %>% mutate(hindfoot_cm = hindfoot_length / 10) %>% filter(hindfoot_cm < 3) %>% select(species_id, hindfoot_cm)

Split-apply-combine data analysis and the summarize()function

Many data analysis tasks can be approached using thesplit-apply-combine paradigm: split the data into groups, applysome analysis to each group, and then combine the results. Key functionsof dplyr for this workflow aregroup_by() and summarize().

The group_by() and summarize()functions

group_by() is often used together withsummarize(), which collapses each group into a single-rowsummary of that group. group_by() takes as arguments thecolumn names that contain the categorical variables forwhich you want to calculate the summary statistics. So to compute themean weight by sex:

R

surveys %>% group_by(sex) %>% summarize(mean_weight = mean(weight, na.rm = TRUE))

You may also have noticed that the output from these calls doesn’trun off the screen anymore. It’s one of the advantages oftbl_df over data frame.

You can also group by multiple columns:

R

surveys %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight, na.rm = TRUE)) %>% tail()

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.

Here, we used tail() to look at the last six rows of oursummary. Before, we had used head() to look at the firstsix rows. We can see that the sex column containsNA values because some animals had escaped before their sexand body weights could be determined. The resultingmean_weight column does not contain NA butNaN (which refers to “Not a Number”) becausemean() was called on a vector of NA valueswhile at the same time setting na.rm = TRUE. To avoid this,we can remove the missing values for weight before we attempt tocalculate the summary statistics on weight. Because the missing valuesare removed first, we can omit na.rm = TRUE when computingthe mean:

R

surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight))

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.

Here, again, the output from these calls doesn’t run off the screenanymore. If you want to display more data, you can use theprint() function at the end of your chain with the argumentn specifying the number of rows to display:

R

surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight)) %>% print(n = 15)

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.

Once the data are grouped, you can also summarize multiple variablesat the same time (and not necessarily on the same variable). Forinstance, we could add a column indicating the minimum weight for eachspecies for each sex:

R

surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight), min_weight = min(weight))

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.

It is sometimes useful to rearrange the result of a query to inspectthe values. For instance, we can sort on min_weight to putthe lighter species first:

R

surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight), min_weight = min(weight)) %>% arrange(min_weight)

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.

To sort in descending order, we need to add the desc()function. If we want to sort the results by decreasing order of meanweight:

R

surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight), min_weight = min(weight)) %>% arrange(desc(mean_weight))

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`#> argument.

Counting

When working with data, we often want to know the number ofobservations found for each factor or combination of factors. For thistask, dplyr provides count().For example, if we wanted to count the number of rows of data for eachsex, we would do:

R

surveys %>% count(sex)

The count() function is shorthand for something we’vealready seen: grouping by a variable, and summarizing it by counting thenumber of observations in that group. In other words,surveys %>% count() is equivalent to:

R

surveys %>% group_by(sex) %>% summarize(count = n())

For convenience, count() provides the sortargument:

R

surveys %>% count(sex, sort = TRUE)

Previous example shows the use of count() to count thenumber of rows/observations for one factor (i.e.,sex). If we wanted to count combination offactors, such as sex and species, wewould specify the first and the second factor as the arguments ofcount():

R

surveys %>% count(sex, species)

With the above code, we can proceed with arrange() tosort the table according to a number of criteria so that we have abetter comparison. For instance, we might want to arrange the tableabove in (i) an alphabetical order of the levels of the species and (ii)in descending order of the count:

R

surveys %>% count(sex, species) %>% arrange(species, desc(n))

From the table above, we may learn that, for instance, there are 75observations of the albigula species that are not specified forits sex (i.e.NA).

Challenge

  1. How many animals were caught in each plot_typesurveyed?

R

surveys %>% count(plot_type)

Challenge(continued)

  1. Use group_by() and summarize() to find themean, min, and max hindfoot length for each species (usingspecies_id). Also add the number of observations (hint: see?n).

R

surveys %>% filter(!is.na(hindfoot_length)) %>% group_by(species_id) %>% summarize( mean_hindfoot_length = mean(hindfoot_length), min_hindfoot_length = min(hindfoot_length), max_hindfoot_length = max(hindfoot_length), n = n() )

Challenge(continued)

  1. What was the heaviest animal measured in each year? Return thecolumns year, genus, species_id,and weight.

R

surveys %>% filter(!is.na(weight)) %>% group_by(year) %>% filter(weight == max(weight)) %>% select(year, genus, species, weight) %>% arrange(year)

Reshaping with pivot_longer and pivot_wider

In the spreadsheetlesson, we discussed how to structure our data leading to the fourrules defining a tidy dataset:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value must have its own cell
  4. Each type of observational unit forms a table

Here we examine the fourth rule: Each type of observational unitforms a table.

In surveys, the rows of surveys contain thevalues of variables associated with each record (the unit), values suchas the weight or sex of each animal associated with each record. What ifinstead of comparing records, we wanted to compare the different meanweight of each genus between plots? (Ignoring plot_type forsimplicity).

We’d need to create a new table where each row (the unit) iscomprised of values of variables associated with each plot. In practicalterms this means the values in genus would become the namesof column variables and the cells would contain the values of the meanweight observed on each plot.

Having created a new table, it is therefore straightforward toexplore the relationship between the weight of different genera within,and between, the plots. The key point here is that we are stillfollowing a tidy data structure, but we have reshapedthe data according to the observations of interest: average genus weightper plot instead of recordings per date.

The opposite transformation would be to transform column names intovalues of a variable.

We can do both these of transformations with two tidyrfunctions, pivot_wider() andpivot_longer().

These may sound like dramatically different data layouts, but thereare some tools that make transitions between these layouts morestraightforward than you might think! The gif below shows how these twoformats relate to each other, and gives you an idea of how we can use Rto shift from one format to the other.

Manipulating, analyzing and exporting data with tidyverse (1)

Pivoting from long to wide format

pivot_wider() takes three principal arguments:

  1. the data
  2. the names_from column variable whose values will become newcolumn names.
  3. the values_from column variable whose values will fill thenew column variables.

Further arguments include values_fill which, if set,fills in missing values with the value provided.

Let’s use pivot_wider() to transform surveys to find themean weight of each genus in each plot over the entire survey period. Weuse filter(), group_by() andsummarize() to filter our observations and variables ofinterest, and create a new variable for themean_weight.

R

surveys_gw <- surveys %>% filter(!is.na(weight)) %>% group_by(plot_id, genus) %>% summarize(mean_weight = mean(weight))

OUTPUT

#> `summarise()` has grouped output by 'plot_id'. You can override using the#> `.groups` argument.

R

str(surveys_gw)

This yields surveys_gw where the observations for eachplot are distributed across multiple rows, 196 observations of 3variables. Using pivot_wider() with the names fromgenus and with values from mean_weight thisbecomes 24 observations of 11 variables, one row for each plot.

R

surveys_wide <- surveys_gw %>% pivot_wider(names_from = genus, values_from = mean_weight)str(surveys_wide)
Manipulating, analyzing and exporting data with tidyverse (2)

We could now plot comparisons between the weight of genera (one iscalled a genus, multiple are called genera) in different plots, althoughwe may wish to fill in the missing values first.

R

surveys_gw %>% pivot_wider(names_from = genus, values_from = mean_weight, values_fill = 0) %>% head()

Pivoting from wide to long format

The opposing situation could occur if we had been provided with datain the form of surveys_wide, where the genus names arecolumn names, but we wish to treat them as values of a genus variableinstead.

In this situation we are reshaping the column names and turning theminto a pair of new variables. One variable represents the column namesas values, and the other variable contains the values previouslyassociated with the column names.

pivot_longer() takes four principal arguments:

  1. the data
  2. the names_to column variable we wish to create from columnnames.
  3. the values_to column variable we wish to create and fillwith values.
  4. cols are the name of the columns we use to make this pivot(or to drop).

To recreate surveys_gw from surveys_wide wewould create a names variable called genus and valuevariable called mean_weight.

In pivoting longer, we also need to specify what columns to reshape.If the columns are directly adjacent as they are here, we don’t evenneed to list the all out: we can just use the :operator!

R

surveys_long <- surveys_wide %>% pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)str(surveys_long)
Manipulating, analyzing and exporting data with tidyverse (3)

Note that now the NA genera are included in the longformat data frame. Pivoting wider and then longer can be a useful way tobalance out a dataset so that every replicate has the samecomposition

We could also have used a specification for what columns to exclude.In this example, we will use all columns exceptplot_id for the names variable. By using the minus sign inthe cols argument, we omit plot_id from beingreshaped

R

surveys_wide %>% pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id) %>% head()

Challenge

  1. Reshape the surveys data frame with yearas columns, plot_id as rows, and the number of genera perplot as the values. You will need to summarize before reshaping, and usethe function n_distinct() to get the number of uniquegenera within a particular chunk of data. It’s a powerful function! See?n_distinct for more.

R

surveys_wide_genera <- surveys %>% group_by(plot_id, year) %>% summarize(n_genera = n_distinct(genus)) %>% pivot_wider(names_from = year, values_from = n_genera)

OUTPUT

#> `summarise()` has grouped output by 'plot_id'. You can override using the#> `.groups` argument.

R

head(surveys_wide_genera)

Challenge(continued)

  1. Now take that data frame and pivot_longer() it, so eachrow is a unique plot_id by yearcombination.

R

surveys_wide_genera %>% pivot_longer(names_to = "year", values_to = "n_genera", cols = -plot_id)

Challenge(continued)

  1. The surveys data set has two measurement columns:hindfoot_length and weight. This makes itdifficult to do things like look at the relationship between mean valuesof each measurement per year in different plot types. Let’s walk througha common solution for this type of problem. First, usepivot_longer() to create a dataset where we have a namescolumn called measurement and a value columnthat takes on the value of either hindfoot_length orweight. Hint: You’ll need to specify which columnswill be part of the reshape.

R

surveys_long <- surveys %>% pivot_longer(names_to = "measurement", values_to = "value", cols = c(hindfoot_length, weight))
  1. With this new data set, calculate the average of eachmeasurement in each year for each differentplot_type. Then pivot_wider() them into a dataset with a column for hindfoot_length andweight. Hint: You only need to specify the namesand values columns for pivot_wider().

R

surveys_long %>% group_by(year, measurement, plot_type) %>% summarize(mean_value = mean(value, na.rm=TRUE)) %>% pivot_wider(names_from = measurement, values_from = mean_value)

OUTPUT

#> `summarise()` has grouped output by 'year', 'measurement'. You can override#> using the `.groups` argument.

Certainly! From the passage you provided, let's break down the concepts discussed:

1. dplyr:

  • Definition: It's a package in R used for data manipulation.
  • Functionality: Offers various functions (like select, filter, mutate, group_by, summarize, etc.) to perform common data manipulation tasks.
  • Performance: Optimized tasks written in C++ to work directly with data frames.
  • Database Compatibility: Allows working with data stored in external databases.

2. tidyr:

  • Definition: Another package in R used for data manipulation, particularly for data reshaping.
  • Functionality: Provides tools (pivot_longer, pivot_wider, etc.) to reshape data into tidy formats.
  • Purpose: Aids in switching between different data layouts, facilitating analysis and plotting.

3. tidyverse:

  • Definition: An "umbrella-package" that installs tidyr, dplyr, along with other helpful packages for data analysis like ggplot2, tibble, etc.
  • Objective: Addresses common issues faced during data analysis in R.
  • Advantages: Encourages tidy data formats, which are more accessible and easier to work with for analysis and visualization.

4. Key dplyr Functions:

  • select(): Subsetting columns
  • filter(): Subsetting rows based on conditions
  • mutate(): Creating new columns based on existing ones
  • group_by() and summarize(): Summarizing data after grouping
  • arrange(): Sorting results
  • count(): Counting discrete values

5. Pipes in dplyr:

  • Definition: Symbolized by %>%, allows chaining multiple functions together for efficient data manipulation.
  • Functionality: Takes the output of one function and passes it directly as input to the next function, enhancing readability and simplicity in code.

6. Data Reshaping with tidyr:

  • pivot_longer(): Converts wide data to long data format.
  • pivot_wider(): Converts long data to wide data format.
  • Purpose: Facilitates comparison between variables, groups, or observations in different formats.

7. Examples in Data Handling:

  • Subset Selection: Using select() and filter().
  • Pivoting: Employing pivot_longer() and pivot_wider() for data format changes.
  • Summarizing Data: Utilizing summarize() after grouping with group_by().

8. Working with External Data:

  • Database Integration: dplyr allows working directly with data stored in external databases.
  • Advantage: Removes limitations on data size due to in-memory operations, facilitating handling of larger datasets.

9. Performance Enhancement:

  • Compiled Language Usage: dplyr uses compiled code (C++) for common tasks, enhancing performance and speed in data manipulation.

10. Workflow Optimization:

  • Pipes for Workflow: Pipes in R (%>%) streamline the workflow, allowing complex data manipulations in a more readable manner.

11. Challenge Exercises:

  • These exercises demonstrate practical application of the discussed functions for data manipulation and summarization.

By understanding these concepts, you can efficiently manipulate, reshape, and summarize data using dplyr and tidyr packages in R.

Manipulating, analyzing and exporting data with tidyverse (2024)
Top Articles
Latest Posts
Article information

Author: Golda Nolan II

Last Updated:

Views: 5923

Rating: 4.8 / 5 (78 voted)

Reviews: 93% of readers found this page helpful

Author information

Name: Golda Nolan II

Birthday: 1998-05-14

Address: Suite 369 9754 Roberts Pines, West Benitaburgh, NM 69180-7958

Phone: +522993866487

Job: Sales Executive

Hobby: Worldbuilding, Shopping, Quilting, Cooking, Homebrewing, Leather crafting, Pet

Introduction: My name is Golda Nolan II, I am a thoughtful, clever, cute, jolly, brave, powerful, splendid person who loves writing and wants to share my knowledge and understanding with you.