Reason 6 Data wrangling superpowers

The Tidyverse offers many convenience functions for common data manipulation tasks. At its core, are a handful of workhorse functions from the {dplyr} package for manipulating data frames:

The Tidyverse’s most important data wrangling functions
`filter()`	Select observations (data frame rows)
`select()`	Select variables (data frame columns)
`mutate()`	Create new variables (columns) or change existing ones
`summarize()`	Aggregate rows, e.g, to calculate a column’s mean
`group_by()`	Split the data frame
`pivot_wider()` and `pivot_longer()`	Reshape the data between long and wide format
`left_join()`, `right_join()`, `inner_join()`, `full_join()`	Join observations from multiple data frames
`bind_cols()`, `bind_rows()`	Combining multiple data frames into one, either by adding additional columns or additional rows

6.1 Filtering and counting rows

Let’s do a simple data wrangling task using functions from the Tidyverse:

Find all peaks that are higher than 7,000 meters and were climbed before 1980
Count which countries the climbers came from

We can use the filter() function to select only the rows (observations) we want. Then we can use the count() function to count how many rows there are for each kind of first_ascent_country:

before80_7k <- filter(peaks, first_ascent_year < 1980, height_metres > 7000)
count(before80_7k, first_ascent_country)

# A tibble: 28 × 2
   first_ascent_country     n
   <chr>                <int>
 1 Austria                  2
 2 Austria, Nepal           1
 3 Austria, USA, Nepal      1
 4 Austria, W Germany       1
 5 France                   3
 6 France, Nepal            1
 7 Germany, Austria         1
 8 Germany, Nepal           1
 9 India                    1
10 Japan                    7
# … with 18 more rows

6.2 The pipe operator %>%

Above, we stored the result of the first function in an intermediate variables and then gave it to the second. There’s another way to chain functions that is often more convenient and readable: Use the pipe operator %>% to pass data from one function to the next. You can read %>% as “…and then…”. Use the keyboard shortcut Ctrl + Shift + M to make it.

peaks %>%
  filter(first_ascent_year < 1980, height_metres > 7000) %>% 
  count(first_ascent_country, sort = TRUE) %>% 
  filter(n > 3)

# A tibble: 4 × 2
  first_ascent_country     n
  <chr>                <int>
1 Japan, Nepal            14
2 Japan                    7
3 Poland                   4
4 UK, Nepal                4

  # ... the pipe can continue. Here, we filter all peaks that 
  # were climbed more than 3 times

Did you notice that the first argument of all functions (the data frame you want to operate on) has disappeared? The pipe operator fills that one in automatically.

An image of the Star Trek character Data, smoking a pipe

Figure 6.1: Indubitably, sir!

6.3 Create new columns with `mutate()`

peaks %>% 
  select(peak_name, climbing_status, first_ascent_year, first_ascent_country) %>% 
  mutate(climbed_before80 = first_ascent_year < 1980)

# A tibble: 468 × 5
   peak_name  climbing_status first_ascent_ye… first_ascent_co… climbed_before80
   <chr>      <chr>                      <dbl> <chr>            <lgl>           
 1 Ama Dablam Climbed                     1961 New Zealand, US… TRUE            
 2 Amphu Gya… Climbed                     1953 UK               TRUE            
 3 Annapurna… Climbed                     1950 France           TRUE            
 4 Annapurna… Climbed                     1960 UK, Nepal        TRUE            
 5 Annapurna… Climbed                     1961 India            TRUE            
 6 Annapurna… Climbed                     1955 W Germany        TRUE            
 7 Annapurna… Climbed                     1974 Spain            TRUE            
 8 Annapurna… Climbed                     1980 W Germany        FALSE           
 9 Annapurna… Climbed                     1964 Japan, Nepal     TRUE            
10 Api Main   Climbed                     1960 Japan, Nepal     TRUE            
# … with 458 more rows

6.4 Split-apply-combine with `group_by()` and `summarize()`

Use group_by() and summarize() together for powerful split-apply-combine operations:

peaks %>% 
  filter(first_ascent_year > 201) %>% 
  group_by(first_ascent_country) %>% 
  summarize(min_ascent = min(first_ascent_year, na.rm = TRUE)) %>% 
  arrange(min_ascent)

# A tibble: 77 × 2
   first_ascent_country min_ascent
   <chr>                     <dbl>
 1 UK                         1909
 2 Austria                    1930
 3 Austria, UK                1930
 4 Germany, Austria           1930
 5 Germany, Nepal             1930
 6 Switzerland, Germany       1939
 7 Switzerland                1949
 8 Switzerland, Nepal         1949
 9 UK, Nepal                  1949
10 France                     1950
# … with 67 more rows

Dealing with NA (missing values)

In R, there is a special value for missing data, NA. This is not a data type. That is, a value can be, e.g., numeric and missing (NA) at the same time, and a numeric vector can contain a mix of numbers and NA.

Functions that operate on vectors, like mean() or min() do not like to make assumptions about missing values, so when the input contains an NA, the result will automatically also be NA. This makes sense, because the unknown number may well be a value would totally change the mean or the minimum. It is possible (and often necessary) to circumvent this “safety measure” by setting the function argument na.rm = TRUE. This will treat the vector as if the missing values did not exist.

6.5 Joining data frames with `left_join()`

#...

6.6 Other useful data wrangling functions

There are many other useful functions that can assist you with all sorts of data cleaning and transformation.