Reason 5 Expandability
- What are R packages
- What is the Tidyverse
- Installing and loading R packages
- Loading data from a text file with
read_csv()
- Saving a data frame to the disk with
write_csv()
- Exploring a data frame
R is popular among data analysts, data scientists, researchers, and statisticians. As a result, it gives you access to a vast amount of amazing packages written by domain experts, statisticians and programmers from around the world. An R package is a collection of R code that has been bundled up so that it can be easily used by others. The full power of R unfolds when you tap into this vast ecosystem of packages.
R has an official peer-reviewed package repository called CRANas well as domain-specific repositories BioConductor, and Neuroconductor. Many more packages are available on Github.
There are tens of thousands of packages to make your life easier! There’s an R package for almost everything! See for yourself at rdrr.io or Metacran. And if there’s ever anything missing you can build it yourself – together with others across the globe. It’s fun and you will learn a lot! The R Packages book provides a deep dive into writing and publishing R packages.
By the way: R can also integrate with Python, C, and other languages and interface with other software, such as MPLUS, so you are not limited to what R has to offer.
5.1 Introducing the Tidyverse
The Tidyverse (Wickham et al. 2019) is a set of R packages that share a common design philosophy. It aims to provide user-friendly functions for common data-tasks and tries to iron out some of the idiosyncrasies of Base R, that is, the way R works out of the box. The functions in the Tidyverse have a consistent interface and easy-to-remember names. The Tidy Data that gives the Tidyverse its name has tabular shape, with
- one observation per row
- one variable per column
- a single-level header
- variable names that are easy to type and easy to remember
The core idea is this: Once the data is in tidy format, all the Tidyverse functions can
be used seamlessly. A special operator, the pipe %>%
, glues all the functions together,
so that the steps of the analysis can be read from top to bottom, like a recipe (see
below).
The Tidyverse is a relatively recent addition to the R package ecosystem but the design philosophy has struck a chord with the R developer community and many packages are now following the same principles. As a result, the overall data analysis workflow in R has become more streamlined and integrated.
Nevertheless, some people prefer base R or other alternatives to the Tidyverse, such as
the {data.table}
package. Moreover, many
older packages will never change to a “tidy” design. This leads to a bit of a schism in
the R package ecosystem. Eventually, you will have to learn about these alternatives as
well. However, you can get a long way by just focusing on the Tidyverse.
5.2 Installing packages
For the examples in this book we will use a table with information about climbing expeditions in the Himalaya that was provided by Tidy Tuesday.
Often you will read data from external sources and the first step is to turn it into something that R can work with, usually a data frame. In this case, the data is provided as a CSV file. CSV stands for comma separated values and is a popular plain text format for sharing tabular data across the internet.
To read the CSV file and turn it into an R data frame we will use the read_csv()
function. To use this function, we first have to install and load the {tidyverse}
package.14
You can install the package from R’s central package repository CRAN. The
install.packages()
function looks for a package of a given name on CRAN, and, if it
finds it, copies it to a special library directory on your computer.
The library()
function loads a package that you have installed, so that all of its
functions become available for you to use in your current R session. The next time you
start RStudio, you will not have to reinstall the package (it is now on your computer),
but you will have to load it again.
library vs. package
The distinction between library and package can be confusing. It might help to imagine that you are
- using
install.packages()
to add a package to your system’s library of packages - using
library()
to check out a package from your library, so that you can use it in your current R session.
Note also that the quotation marks around the package name in install.packages()
are
required, because this is essentially a search term that will be used to find the package
on CRAN. The quotation marks in library()
are optional, because the package exists as an
object in your library, from where you can call it by its name, just like any other
variable.
5.3 Loading data
Now that we have the {tidyverse}
package installed and loaded, we can use the
read_csv()
function to load the data straight from the web into an R data frame in our R
session.
peaks <- read_csv("https://raw.githubusercontent.com/teebusch/r-introduction/master/data/peaks.csv")
Rows: 468 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): peak_id, peak_name, peak_alternative_name, climbing_status, first_a...
dbl (2): height_metres, first_ascent_year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
5.4 Inspecting the data
When we load the data, R shows us the names of the columns in the data set and the type of
each column. We can also have a look at the first few rows with the head()
function.
# A tibble: 6 × 8
peak_id peak_name peak_alternative_name height_metres climbing_status
<chr> <chr> <chr> <dbl> <chr>
1 AMAD Ama Dablam Amai Dablang 6814 Climbed
2 AMPG Amphu Gyabjen <NA> 5630 Climbed
3 ANN1 Annapurna I <NA> 8091 Climbed
4 ANN2 Annapurna II <NA> 7937 Climbed
5 ANN3 Annapurna III <NA> 7555 Climbed
6 ANN4 Annapurna IV <NA> 7525 Climbed
# … with 3 more variables: first_ascent_year <dbl>, first_ascent_country <chr>,
# first_ascent_expedition_id <chr>
You can get a more traditional spreadsheet-like view of the complete data frame by
clicking its name in the Environment panel or running the command view(peaks)
.
To better understand our data, let’s get some summary statistics. The {skimr}
package
(Waring et al. 2021) provides a really nice function for this.
Exercise 5.1 Install and load the {skimr}
package from CRAN.
Now we can “skim” our data:
Name | peaks |
Number of rows | 468 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 6 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
peak_id | 0 | 1.00 | 4 | 4 | 0 | 468 | 0 |
peak_name | 0 | 1.00 | 4 | 25 | 0 | 468 | 0 |
peak_alternative_name | 223 | 0.52 | 5 | 49 | 0 | 242 | 0 |
climbing_status | 0 | 1.00 | 7 | 9 | 0 | 2 | 0 |
first_ascent_country | 132 | 0.72 | 2 | 44 | 0 | 77 | 0 |
first_ascent_expedition_id | 135 | 0.71 | 9 | 9 | 0 | 332 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
height_metres | 0 | 1.00 | 6656.64 | 571.91 | 5407 | 6235.75 | 6559.5 | 6911 | 8850 | ▂▇▃▁▁ |
first_ascent_year | 132 | 0.72 | 1979.08 | 100.21 | 201 | 1963.00 | 1982.0 | 2008 | 2019 | ▁▁▁▁▇ |
This summary tells us quite a lot about our data! For example, we can see that some peaks
have not been successfully climbed until very recently, or not at all. There also appears
to be an outlier in the first_ascent_year
column that we’ll have to take care of – a
peak that was successfully climbed in the year 201.
The :: operator
We only use one function from the {skimr}
package, and we only use it this one time. In
a case like this, you might want to use the function without loading the entire package.
To do so, you can put ::
between the package name and the function name. Thus,
is equivalent to
The ::
is also often used to clarify (either to R or to the person reading the code)
which package a function is coming from. This is particularly useful when there are
functions with the same name in different packages. Another approach to solving this
problem is the {conflicted}
package (???).
5.5 Saving data to disk
We loaded the data from the web straight into R. Thus so far, it only exists in the
current R session, that is, in the computer’s working memory. To store a copy of the data
on the hard drive, we can use the write_csv()
function. It takes the variable and a file
name.
In more complex projects you should put your data files into a subfolder. Here, we will keep it simple and put everything in the top-level project folder.
R can connect to many different sorts of data sources, such as
databases,
APIs, and websites).
It can also read and write many different data formats – including data from software
like SPSS, Stata, and SAS. The {rio}
package (???) provides a unified interface for a
whole range of data formats.
Be careful when using Excel to store and share your data! Excel silently converts certain data. This can cause problems. In many cases it may be better to use CSV, JSON, Feather, or a database to store your data.
Usually, packages are collections of functions. The
{tidyverse}
package (Wickham 2021) is a bit different, as it is actually a collection of other packages — a meta-package. When you load it, it loads the most popular packages from theTidyverse, such as{ggplot2}
,{readr}
, and{dplyr}
. For example, theread_csv()
function is from the{readr}
package.↩︎