Reason 6 Data wrangling superpowers

The Tidyverse offers many convenience functions for common data manipulation tasks. At its core, are a handful of workhorse functions from the {dplyr} package for manipulating data frames:

The Tidyverse’s most important data wrangling functions
filter() Select observations (data frame rows)
select() Select variables (data frame columns)
mutate() Create new variables (columns) or change existing ones
summarize() Aggregate rows, e.g, to calculate a column’s mean
group_by() Split the data frame
pivot_wider() and pivot_longer() Reshape the data between long and wide format
left_join(), right_join(), inner_join(), full_join() Join observations from multiple data frames
bind_cols(), bind_rows() Combining multiple data frames into one, either by adding additional columns or additional rows

6.1 Filtering and counting rows

Let’s do a simple data wrangling task using functions from the Tidyverse:

  1. Find all peaks that are higher than 7,000 meters and were climbed before 1980
  2. Count which countries the climbers came from

We can use the filter() function to select only the rows (observations) we want. Then we can use the count() function to count how many rows there are for each kind of first_ascent_country:

before80_7k <- filter(peaks, first_ascent_year < 1980, height_metres > 7000)
count(before80_7k, first_ascent_country)
# A tibble: 28 × 2
   first_ascent_country     n
   <chr>                <int>
 1 Austria                  2
 2 Austria, Nepal           1
 3 Austria, USA, Nepal      1
 4 Austria, W Germany       1
 5 France                   3
 6 France, Nepal            1
 7 Germany, Austria         1
 8 Germany, Nepal           1
 9 India                    1
10 Japan                    7
# … with 18 more rows

6.2 The pipe operator %>%

Above, we stored the result of the first function in an intermediate variables and then gave it to the second. There’s another way to chain functions that is often more convenient and readable: Use the pipe operator %>% to pass data from one function to the next. You can read %>% as “…and then…”. Use the keyboard shortcut Ctrl + Shift + M to make it.

peaks %>%
  filter(first_ascent_year < 1980, height_metres > 7000) %>% 
  count(first_ascent_country, sort = TRUE) %>% 
  filter(n > 3) 
# A tibble: 4 × 2
  first_ascent_country     n
  <chr>                <int>
1 Japan, Nepal            14
2 Japan                    7
3 Poland                   4
4 UK, Nepal                4
  # ... the pipe can continue. Here, we filter all peaks that 
  # were climbed more than 3 times

Did you notice that the first argument of all functions (the data frame you want to operate on) has disappeared? The pipe operator fills that one in automatically.

An image of the Star Trek character Data, smoking a pipe

Figure 6.1: Indubitably, sir!

6.3 Create new columns with mutate()

peaks %>% 
  select(peak_name, climbing_status, first_ascent_year, first_ascent_country) %>% 
  mutate(climbed_before80 = first_ascent_year < 1980)
# A tibble: 468 × 5
   peak_name  climbing_status first_ascent_ye… first_ascent_co… climbed_before80
   <chr>      <chr>                      <dbl> <chr>            <lgl>           
 1 Ama Dablam Climbed                     1961 New Zealand, US… TRUE            
 2 Amphu Gya… Climbed                     1953 UK               TRUE            
 3 Annapurna… Climbed                     1950 France           TRUE            
 4 Annapurna… Climbed                     1960 UK, Nepal        TRUE            
 5 Annapurna… Climbed                     1961 India            TRUE            
 6 Annapurna… Climbed                     1955 W Germany        TRUE            
 7 Annapurna… Climbed                     1974 Spain            TRUE            
 8 Annapurna… Climbed                     1980 W Germany        FALSE           
 9 Annapurna… Climbed                     1964 Japan, Nepal     TRUE            
10 Api Main   Climbed                     1960 Japan, Nepal     TRUE            
# … with 458 more rows

6.4 Split-apply-combine with group_by() and summarize()

Use group_by() and summarize() together for powerful split-apply-combine operations:

peaks %>% 
  filter(first_ascent_year > 201) %>% 
  group_by(first_ascent_country) %>% 
  summarize(min_ascent = min(first_ascent_year, na.rm = TRUE)) %>% 
  arrange(min_ascent)
# A tibble: 77 × 2
   first_ascent_country min_ascent
   <chr>                     <dbl>
 1 UK                         1909
 2 Austria                    1930
 3 Austria, UK                1930
 4 Germany, Austria           1930
 5 Germany, Nepal             1930
 6 Switzerland, Germany       1939
 7 Switzerland                1949
 8 Switzerland, Nepal         1949
 9 UK, Nepal                  1949
10 France                     1950
# … with 67 more rows

Dealing with NA (missing values)

In R, there is a special value for missing data, NA. This is not a data type. That is, a value can be, e.g., numeric and missing (NA) at the same time, and a numeric vector can contain a mix of numbers and NA.

Functions that operate on vectors, like mean() or min() do not like to make assumptions about missing values, so when the input contains an NA, the result will automatically also be NA. This makes sense, because the unknown number may well be a value would totally change the mean or the minimum. It is possible (and often necessary) to circumvent this “safety measure” by setting the function argument na.rm = TRUE. This will treat the vector as if the missing values did not exist.

6.5 Joining data frames with left_join()

#...

6.6 Other useful data wrangling functions

There are many other useful functions that can assist you with all sorts of data cleaning and transformation.