7.3 row_number()
Using row_number() with mutate() will create a column of consecutive numbers. The row_number() function is useful for creating an identification number (an ID variable). It is also useful for labeling each observation by a grouping variable.
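As a minimal sketch of both uses, the example below builds a small hypothetical table (`visits`, with made-up `Subject` and `Score` columns that are not part of this chapter's data) and adds an ID across all rows, then a counter that restarts within each subject.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
visits <- tibble(
  Subject = c("a", "a", "b", "b", "b"),
  Score   = c(10, 12, 7, 9, 8)
)

# An ID variable: consecutive numbers across the whole table
with_id <- visits %>%
  mutate(ID = row_number())

# A label within a grouping variable: numbering restarts for each Subject
with_visit <- visits %>%
  group_by(Subject) %>%
  mutate(Visit = row_number()) %>%
  ungroup()
```

The only difference between the two calls is the group_by() step, which makes row_number() count within each group instead of across the whole table.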
### Practice Dataset
library(dplyr)  # provides tibble(), mutate(), group_by(), arrange(), and %>%
practice <- tibble(
  Subject = rep(c(1, 2, 3), 8),
  Date = c("2019-01-02", "2019-01-02", "2019-01-02",
           "2019-01-03", "2019-01-03", "2019-01-03",
           "2019-01-04", "2019-01-04", "2019-01-04",
           "2019-01-05", "2019-01-05", "2019-01-05",
           "2019-01-06", "2019-01-06", "2019-01-06",
           "2019-01-07", "2019-01-07", "2019-01-07",
           "2019-01-08", "2019-01-08", "2019-01-08",
           "2019-01-01", "2019-01-01", "2019-01-01"),
  DV = sample(1:10, 24, replace = TRUE),  # random values, so DV differs between runs
  Inject = rep(c("Pos", "Neg", "Neg", "Neg", "Pos", "Pos"), 4)
)
Using the practice dataset, let’s add a variable called Session. For each subject, a session comprises one positive day and one negative day that are closest in date. For example, the first observation where Inject = Pos and the first observation where Inject = Neg will both have a Session value of 1; the second observation where Inject = Pos and the second observation where Inject = Neg will both have a Session value of 2. In the code below, you will see three methods for creating Session. Which method produces the result we need?
## Method 1
practice %>%
  mutate(Session = row_number())
## # A tibble: 24 x 5
##    Subject Date          DV Inject Session
##      <dbl> <chr>      <int> <chr>    <int>
##  1       1 2019-01-02     9 Pos          1
##  2       2 2019-01-02     4 Neg          2
##  3       3 2019-01-02     7 Neg          3
##  4       1 2019-01-03     8 Neg          4
##  5       2 2019-01-03     8 Pos          5
##  6       3 2019-01-03     3 Pos          6
##  7       1 2019-01-04     3 Pos          7
##  8       2 2019-01-04     3 Neg          8
##  9       3 2019-01-04     7 Neg          9
## 10       1 2019-01-05     6 Neg         10
## # ... with 14 more rows
## Method 2
practice %>%
  group_by(Subject, Inject) %>%
  mutate(Session = row_number())
## # A tibble: 24 x 5
## # Groups:   Subject, Inject [6]
##    Subject Date          DV Inject Session
##      <dbl> <chr>      <int> <chr>    <int>
##  1       1 2019-01-02     9 Pos          1
##  2       2 2019-01-02     4 Neg          1
##  3       3 2019-01-02     7 Neg          1
##  4       1 2019-01-03     8 Neg          1
##  5       2 2019-01-03     8 Pos          1
##  6       3 2019-01-03     3 Pos          1
##  7       1 2019-01-04     3 Pos          2
##  8       2 2019-01-04     3 Neg          2
##  9       3 2019-01-04     7 Neg          2
## 10       1 2019-01-05     6 Neg          2
## # ... with 14 more rows
## Method 3
practice %>%
  group_by(Subject, Inject) %>%
  arrange(Date) %>%
  mutate(Session = row_number())
## # A tibble: 24 x 5
## # Groups:   Subject, Inject [6]
##    Subject Date          DV Inject Session
##      <dbl> <chr>      <int> <chr>    <int>
##  1       1 2019-01-01     7 Neg          1
##  2       2 2019-01-01     7 Pos          1
##  3       3 2019-01-01     1 Pos          1
##  4       1 2019-01-02     9 Pos          1
##  5       2 2019-01-02     4 Neg          1
##  6       3 2019-01-02     7 Neg          1
##  7       1 2019-01-03     8 Neg          2
##  8       2 2019-01-03     8 Pos          2
##  9       3 2019-01-03     3 Pos          2
## 10       1 2019-01-04     3 Pos          2
## # ... with 14 more rows
7.3.1 Exercises
- Create a row ID for diamonds where each row is unique and order doesn’t matter.
- Create an ID that relies on the clarity of diamonds where order doesn’t matter.
- Create an ID that represents the price rank of the diamond.
  - Which diamond is #1 (the highest-priced diamond in the dataset)?
  - Which diamond is ranked #2 in highest price?
- Create an ID that represents price rank within each clarity category.
  - Of the diamonds with the clarity IF, what is the highest ranked/most expensive diamond?
  - Of the diamonds with the clarity SI2, what is the 2nd most expensive diamond (rank = 2)?
This section covers the row_number() function from the dplyr package in R, used to create identification numbers and to label observations within groups. The examples create a Session variable that pairs each positive day with a negative day in the practice dataset.
Regarding the three methods presented:
- Method 1: Uses mutate() with row_number() directly, numbering the rows sequentially without any grouping or ordering.
- Method 2: Applies group_by() to the Subject and Inject columns before using mutate() with row_number(). This numbers the rows separately within each unique combination of Subject and Inject, in the order the rows already appear in the data.
- Method 3: Builds on Method 2 by adding arrange() to sort the data by Date before numbering. By default arrange() ignores grouping, so it sorts the whole table by Date; row_number() is then applied within each Subject and Inject group, so the session numbers follow the chronological order of each group's rows.
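The interaction between group_by() and arrange() is easy to misread, so here is a minimal sketch on a hypothetical four-row table (`toy`, with made-up columns `g` and `d`). It compares the default behavior of arrange() on grouped data with the `.by_group = TRUE` option.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  g = c("a", "b", "a", "b"),
  d = c(4, 3, 2, 1)
)

# Default: grouping is ignored and rows are sorted by d across the whole table
sorted_all <- toy %>%
  group_by(g) %>%
  arrange(d)

# .by_group = TRUE: rows are sorted by g first, then by d within each group
sorted_by_group <- toy %>%
  group_by(g) %>%
  arrange(d, .by_group = TRUE)
```

Either ordering leaves the rows of each group in ascending order of `d`, so a later mutate() with row_number() assigns the same within-group numbers in both cases; only the row order of the printed table differs.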
The correct method for creating the Session variable as described (pairing positive and negative days) is Method 3. It numbers the positive days and the negative days of each subject separately and in chronological order, so the nth positive day and the nth negative day of a subject share the same Session value.
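One way to check this claim is to count the positive and negative days in every Subject and Session combination after applying Method 3. The sketch below rebuilds the practice dataset with two assumptions that are not in the original: a set.seed() call so that DV is reproducible, and rep(..., each = 3) as a shorthand for the same Date values. The names `method3` and `check` are illustrative.

```r
library(dplyr)

set.seed(1)  # assumption: added only so the random DV values are reproducible
practice <- tibble(
  Subject = rep(c(1, 2, 3), 8),
  # same dates as the practice dataset, written with each = 3
  Date = rep(c("2019-01-02", "2019-01-03", "2019-01-04", "2019-01-05",
               "2019-01-06", "2019-01-07", "2019-01-08", "2019-01-01"),
             each = 3),
  DV = sample(1:10, 24, replace = TRUE),
  Inject = rep(c("Pos", "Neg", "Neg", "Neg", "Pos", "Pos"), 4)
)

# Method 3 from the text
method3 <- practice %>%
  group_by(Subject, Inject) %>%
  arrange(Date) %>%
  mutate(Session = row_number()) %>%
  ungroup()

# Each Subject/Session pair should hold exactly one Pos day and one Neg day
check <- method3 %>%
  group_by(Subject, Session) %>%
  summarise(
    n_pos = sum(Inject == "Pos"),
    n_neg = sum(Inject == "Neg"),
    .groups = "drop"
  )

all(check$n_pos == 1 & check$n_neg == 1)
```

Sorting the character Date column works here because the dates are in ISO year-month-day format, which sorts alphabetically in the same order as chronologically.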
Regarding the exercises related to diamonds:
- Create a unique row ID for diamonds: This can be achieved using mutate() and row_number() without any specific ordering.
- Create an ID based on clarity: Employ mutate() with row_number() grouped by clarity to assign IDs within each clarity category, disregarding the order.
- Create an ID representing the price rank of the diamond:
  - To identify the highest-priced diamond (#1), find the diamond with the maximum price.
  - Determine the second-highest-priced diamond (#2) by excluding the highest and finding the next highest price.
- Create an ID representing price rank within each clarity category:
  - For diamonds with clarity IF, identify the highest-priced diamond.
  - For diamonds with clarity SI2, find the second most expensive diamond within that clarity category.
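The approaches listed above could be sketched as follows, assuming the diamonds dataset is loaded from the ggplot2 package. The column names `ID`, `ClarityID`, `ClarityRank`, and `PriceRank` are illustrative choices, not names from the exercises. Passing desc(price) to row_number() ranks rows from most to least expensive and breaks ties by row order, which may or may not be the tie handling you want.

```r
library(dplyr)
library(ggplot2)  # assumption: diamonds is taken from ggplot2

ranked <- diamonds %>%
  mutate(ID = row_number()) %>%                # unique row ID, order irrelevant
  group_by(clarity) %>%
  mutate(
    ClarityID   = row_number(),                # ID within each clarity category
    ClarityRank = row_number(desc(price))      # price rank within clarity
  ) %>%
  ungroup() %>%
  mutate(PriceRank = row_number(desc(price)))  # price rank over the whole dataset

# The two most expensive diamonds overall
ranked %>% filter(PriceRank <= 2) %>% arrange(PriceRank)

# The most expensive IF diamond and the second most expensive SI2 diamond
ranked %>% filter(clarity == "IF", ClarityRank == 1)
ranked %>% filter(clarity == "SI2", ClarityRank == 2)
```

If tied prices should share a rank instead, min_rank(desc(price)) is a possible substitute for row_number(desc(price)).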
To complete these tasks, use mutate() and row_number(), combined with group_by() and arrange() where necessary, to meet the specific criteria of each exercise.