Reason 5 Expandability

What are R packages
What is the Tidyverse
Installing and loading R packages
Loading data from a text file with read_csv()
Saving a data frame to the disk with write_csv()
Exploring a data frame

R is popular among data analysts, data scientists, researchers, and statisticians. As a result, it gives you access to a vast amount of amazing packages written by domain experts, statisticians and programmers from around the world. An R package is a collection of R code that has been bundled up so that it can be easily used by others. The full power of R unfolds when you tap into this vast ecosystem of packages.

R has an official peer-reviewed package repository called CRAN, as well as the domain-specific repositories Bioconductor and Neuroconductor. Many more packages are available on GitHub.

There are tens of thousands of packages to make your life easier! There’s an R package for almost everything! See for yourself at rdrr.io or Metacran. And if anything is ever missing, you can build it yourself – together with others across the globe. It’s fun and you will learn a lot! The R Packages book provides a deep dive into writing and publishing R packages.

By the way: R can also integrate with Python, C, and other languages and interface with other software, such as Mplus, so you are not limited to what R has to offer.

5.1 Introducing the Tidyverse

The Tidyverse (Wickham et al. 2019) is a set of R packages that share a common design philosophy. It aims to provide user-friendly functions for common data tasks and tries to iron out some of the idiosyncrasies of Base R, that is, the way R works out of the box. The functions in the Tidyverse have a consistent interface and easy-to-remember names. The Tidy Data that gives the Tidyverse its name has a tabular shape, with

one observation per row
one variable per column
a single-level header
variable names that are easy to type and easy to remember

The core idea is this: Once the data is in tidy format, all the Tidyverse functions can be used seamlessly. A special operator, the pipe %>%, glues all the functions together, so that the steps of the analysis can be read from top to bottom, like a recipe (see below).
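To make the recipe idea concrete, here is a minimal sketch of a pipeline. It assumes that the {dplyr} package is installed; the table `mountains` and its values are made up for illustration and are not part of the data used in this book.

```r
library(dplyr)

# A tiny tidy table: one observation per row, one variable per column.
# The names and values are invented for this illustration.
mountains <- tibble(
  name   = c("A", "B", "C", "D"),
  height = c(6100, 7200, 8100, 5900),
  status = c("Climbed", "Unclimbed", "Climbed", "Climbed")
)

# Read the pipeline from top to bottom, like a recipe:
tallest_climbed <- mountains %>%
  filter(status == "Climbed") %>%  # keep only the climbed mountains
  arrange(desc(height)) %>%        # sort from tallest to shortest
  head(2)                          # keep the first two rows

tallest_climbed
```

Each step receives the result of the previous step as its first argument, so no intermediate variables are needed.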

The Tidyverse is a relatively recent addition to the R package ecosystem, but its design philosophy has struck a chord with the R developer community and many packages now follow the same principles. As a result, the overall data analysis workflow in R has become more streamlined and integrated.

Nevertheless, some people prefer base R or other alternatives to the Tidyverse, such as the {data.table} package. Moreover, many older packages will never change to a “tidy” design. This leads to a bit of a schism in the R package ecosystem. Eventually, you will have to learn about these alternatives as well. However, you can get a long way by just focusing on the Tidyverse.

5.2 Installing packages

For the examples in this book we will use a table with information about climbing expeditions in the Himalaya that was provided by Tidy Tuesday.

Often you will read data from external sources and the first step is to turn it into something that R can work with, usually a data frame. In this case, the data is provided as a CSV file. CSV stands for comma-separated values and is a popular plain text format for sharing tabular data across the internet.

To read the CSV file and turn it into an R data frame we will use the read_csv() function. To use this function, we first have to install and load the {tidyverse} package.¹⁴

You can install the package from R’s central package repository CRAN. The install.packages() function looks for a package of a given name on CRAN, and, if it finds it, copies it to a special library directory on your computer.

The library() function loads a package that you have installed, so that all of its functions become available for you to use in your current R session. The next time you start RStudio, you will not have to reinstall the package (it is now on your computer), but you will have to load it again.

install.packages("tidyverse")
library(tidyverse)
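If you run the same script repeatedly, you may not want to reinstall the package every time. One possible pattern, shown here as a sketch rather than a required step, is to install only when the package is missing. It uses requireNamespace() from base R, which is not introduced elsewhere in this chapter.

```r
# Install the package only if it is not in the library yet.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load it for the current session.
library(tidyverse)
```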

library vs. package

The distinction between library and package can be confusing. It might help to imagine that you are

using install.packages() to add a package to your system’s library of packages
using library() to check out a package from your library, so that you can use it in your current R session.

Note also that the quotation marks around the package name in install.packages() are required, because this is essentially a search term that will be used to find the package on CRAN. The quotation marks in library() are optional: as a convenience, library() accepts the package name either as a bare name or as a quoted string.
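The following sketch shows the different ways of naming a package in library(). It uses the {stats} package, which ships with R, so nothing has to be installed. The character.only argument is not discussed above; it comes into play when the package name is stored in a variable (here the illustrative variable `pkg`).

```r
# Both forms load the same package.
library(stats)
library("stats")

# When the package name is stored in a variable, tell library() to
# treat the argument as a character string.
pkg <- "stats"
library(pkg, character.only = TRUE)
```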

5.3 Loading data

Now that we have the {tidyverse} package installed and loaded, we can use the read_csv() function to load the data straight from the web into an R data frame in our R session.

peaks <- read_csv("https://raw.githubusercontent.com/teebusch/r-introduction/master/data/peaks.csv")

Rows: 468 Columns: 8

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): peak_id, peak_name, peak_alternative_name, climbing_status, first_a...
dbl (2): height_metres, first_ascent_year


ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
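The message suggests specifying the column types yourself. Here is a minimal sketch of what that might look like. To stay independent of the web, it reads a small made-up file from a temporary location; the object name `toy_peaks` and the values are invented, and only the column names are borrowed from the peaks data.

```r
library(readr)

# Write a small made-up CSV file to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "peak_id,height_metres,first_ascent_year",
  "AAAA,6000,1950",
  "BBBB,7000,NA"
), path)

# Declare the column types up front instead of letting read_csv() guess.
toy_peaks <- read_csv(
  path,
  col_types = cols(
    peak_id           = col_character(),
    height_metres     = col_double(),
    first_ascent_year = col_integer()
  )
)

# Retrieve the full column specification that was used.
spec(toy_peaks)
```

Declaring the types makes the import explicit and also quiets the message.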

5.4 Inspecting the data

When we load the data, R shows us the names of the columns in the data set and the type of each column. We can also have a look at the first few rows with the head() function.

head(peaks)

# A tibble: 6 × 8
  peak_id peak_name     peak_alternative_name height_metres climbing_status
  <chr>   <chr>         <chr>                         <dbl> <chr>          
1 AMAD    Ama Dablam    Amai Dablang                   6814 Climbed        
2 AMPG    Amphu Gyabjen <NA>                           5630 Climbed        
3 ANN1    Annapurna I   <NA>                           8091 Climbed        
4 ANN2    Annapurna II  <NA>                           7937 Climbed        
5 ANN3    Annapurna III <NA>                           7555 Climbed        
6 ANN4    Annapurna IV  <NA>                           7525 Climbed        
# … with 3 more variables: first_ascent_year <dbl>, first_ascent_country <chr>,
#   first_ascent_expedition_id <chr>

You can get a more traditional spreadsheet-like view of the complete data frame by clicking its name in the Environment panel or running the command view(peaks).
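A few more functions are handy for a first look at a data frame. The sketch below assumes that {dplyr} is installed and uses a made-up stand-in called `toy_peaks`, so that it runs without downloading anything; the same calls should work on peaks.

```r
library(dplyr)

# Made-up stand-in for the peaks data frame.
toy_peaks <- tibble(
  peak_id       = c("AAAA", "BBBB", "CCCC"),
  height_metres = c(6000, 7000, 8000)
)

dim(toy_peaks)      # number of rows and number of columns
names(toy_peaks)    # column names
glimpse(toy_peaks)  # one line per column, with its type and first values
tail(toy_peaks, 2)  # the last two rows
```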

To better understand our data, let’s get some summary statistics. The {skimr} package (Waring et al. 2021) provides a really nice function for this.

Exercise 5.1 Install and load the {skimr} package from CRAN.

install.packages("skimr")
library(skimr)

Now we can “skim” our data:

skim(peaks)

Table 5.1: Data summary
Name	peaks
Number of rows	468
Number of columns	8
_______________________
Column type frequency:
character	6
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
peak_id	0	1.00	4	4	468
peak_name	0	1.00	4	25	468
peak_alternative_name	223	0.52	5	49	242
climbing_status	0	1.00	7	9	2
first_ascent_country	132	0.72	2	44	77
first_ascent_expedition_id	135	0.71	9	9	332

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
height_metres	0	1.00	6656.64	571.91	5407	6235.75	6559.5	6911	8850	▂▇▃▁▁
first_ascent_year	132	0.72	1979.08	100.21	201	1963.00	1982.0	2008	2019	▁▁▁▁▇

This summary tells us quite a lot about our data! For example, we can see that some peaks were not successfully climbed until very recently, and some have not been climbed at all. There also appears to be an outlier in the first_ascent_year column that we’ll have to take care of – a peak that was supposedly climbed in the year 201.
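One way to track down such an outlier is to filter for implausible values. The following is only a sketch under stated assumptions: the data frame `toy_peaks` is a made-up stand-in, the cut-off year 1900 is an arbitrary choice for this illustration, and replacing the value with NA is just one of several possible treatments.

```r
library(dplyr)

# Made-up stand-in for the peaks data; one year is implausibly early.
toy_peaks <- tibble(
  peak_id           = c("AAAA", "BBBB", "CCCC", "DDDD"),
  first_ascent_year = c(1953, 201, NA, 2015)
)

# Find the rows with a first ascent before the (arbitrary) cut-off.
suspicious <- toy_peaks %>%
  filter(first_ascent_year < 1900)

suspicious

# One possible treatment: mark the implausible value as missing.
cleaned <- toy_peaks %>%
  mutate(
    first_ascent_year = if_else(first_ascent_year < 1900,
                                NA_real_, first_ascent_year)
  )

cleaned
```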

The :: operator

We only use one function from the {skimr} package, and we only use it this one time. In a case like this, you might want to use the function without loading the entire package. To do so, you can put :: between the package name and the function name. Thus,

skimr::skim(peaks)

is equivalent to

library(skimr)
skim(peaks)

The :: is also often used to clarify (either to R or to the person reading the code) which package a function is coming from. This is particularly useful when there are functions with the same name in different packages. Another approach to solving this problem is the {conflicted} package (???).
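A well-known case of two functions sharing a name is filter(), which exists in both {stats} (part of base R) and {dplyr}. The sketch below assumes that {dplyr} is installed; the table `scores` and its values are made up.

```r
library(dplyr)  # attaching dplyr masks stats::filter() with dplyr::filter()

scores <- tibble(id = 1:4, value = c(2, 4, 6, 8))

# dplyr's filter() keeps the rows that match a condition.
large_scores <- dplyr::filter(scores, value > 4)
large_scores

# stats' filter() is unrelated: it applies a linear filter to a series.
# Here the filter averages two neighbouring values.
smoothed <- stats::filter(scores$value, rep(1 / 2, 2))
smoothed
```

Writing the package name in front of the function removes any doubt about which of the two is meant.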

5.5 Saving data to disk

We loaded the data from the web straight into R. So far, it exists only in the current R session, that is, in the computer’s working memory. To store a copy of the data on the hard drive, we can use the write_csv() function. It takes the data frame and a file name.

In more complex projects you should put your data files into a subfolder. Here, we will keep it simple and put everything in the top-level project folder.

write_csv(peaks, "peaks.csv")
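To check that a file was written as expected, you can read it back in and compare it with the original. The sketch below does this with a made-up data frame `toy_peaks` and a temporary file, both chosen for this illustration so that your project folder stays untouched.

```r
library(readr)

# Made-up stand-in for the peaks data frame.
toy_peaks <- tibble::tibble(
  peak_id       = c("AAAA", "BBBB"),
  height_metres = c(6000, 7000)
)

# Write to a temporary file instead of the project folder.
path <- tempfile(fileext = ".csv")
write_csv(toy_peaks, path)

# Confirm that the file now exists on disk.
file.exists(path)

# Read the file back in and compare the columns with the original.
reloaded <- read_csv(path, show_col_types = FALSE)
identical(reloaded$peak_id, toy_peaks$peak_id)
identical(reloaded$height_metres, toy_peaks$height_metres)
```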

R can connect to many different sorts of data sources, such as databases, APIs, and websites.
It can also read and write many different data formats – including data from software like SPSS, Stata, and SAS. The {rio} package (???) provides a unified interface for a whole range of data formats.

Be careful when using Excel to store and share your data! Excel silently converts certain values, for example by turning entries that look like dates into actual dates. This can cause problems that are hard to spot later. In many cases it may be better to use CSV, JSON, Feather, or a database to store your data.

¹⁴ Usually, packages are collections of functions. The {tidyverse} package (Wickham 2021) is a bit different, as it is actually a collection of other packages, a so-called meta-package. When you load it, it loads the most popular packages from the Tidyverse, such as {ggplot2}, {readr}, and {dplyr}. For example, the read_csv() function is from the {readr} package.