Reason 3 A language made for data analysis

  • The most important data types and data structures
  • Assigning data to variables
  • Writing comments
  • Working with vectors
  • Creating data frames

R is a programming language that supports many different programming paradigms, but to get started with data analysis, it’s enough to understand R’s basic data types (numeric, logical, and character) and two data structures, vectors and data frames. These are made for data manipulation and analysis, and along with the many convenience functions that R provides, they allow you to do data analysis without worrying much about things that are fundamental in other programming languages, like loops and conditionals. As you become more familiar with R, you can add those to your repertoire, too, and utilize R’s full power and flexibility.

If you are a seasoned programmer, I recommend skipping this chapter and reading Advanced R, chapters 2-4 (https://adv-r.hadley.nz/vectors-chap.html), instead. These provide a much more thorough overview of R’s data types and data structures.

3.1 Basic Data Types

The data type of a piece of data determines what operations you can perform with it. The bread and butter of data types in R are numeric, logical, and character.

Numeric

Numeric data can be integer, decimal[10], or complex, but usually the difference does not matter. Sometimes, R might display a number in scientific notation, e.g., 1e+23. To the uninitiated, this can look cryptic and a bit scary. But don’t worry! It’s just a very large or very small number written in a more compact form, and you can change this behavior if you want.

1e3 + 230 + 4L + 0.5 + .06 + 7i
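As a rough illustration of the paragraph above, the following sketch checks the class of a few numeric literals and shows one way to write a single number in fixed or scientific notation with format(). The numbers are arbitrary examples.

```r
# Each kind of numeric literal has its own class
class(1e3)    # "numeric" (a decimal number, even though it looks like an integer)
class(4L)     # "integer" (the L suffix marks an integer)
class(7i)     # "complex"

# Mixing them in one expression gives the most general type, here complex
class(1e3 + 230 + 4L + 0.5 + .06 + 7i)    # "complex"

# format() controls how a single number is written out as text
format(1e5, scientific = FALSE)           # "100000"
format(123456, scientific = TRUE)         # "1.23456e+05"
```

If you would rather change the display for the whole session, options(scipen = 999) tells R to strongly prefer fixed notation when printing.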

Logical

Logical data (also known as Boolean data) can take only two values, TRUE or FALSE (or their shorthands, T and F). These are often used to represent binary categorical variables, as function arguments, or as intermediate steps when selecting and filtering data.

(TRUE || FALSE) != (F == T)
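Logical values usually appear as the result of a comparison. The lines below are a small sketch of that idea; the numbers and strings are arbitrary.

```r
# Comparisons return logical values
21 >= 18        # TRUE
"R" == "r"      # FALSE, because upper and lower case are different characters

# Logical operators combine or flip them
!TRUE           # FALSE (! means "not")
TRUE & FALSE    # FALSE (both sides must be TRUE)
TRUE | FALSE    # TRUE  (at least one side must be TRUE)

# In arithmetic, TRUE counts as 1 and FALSE as 0
TRUE + TRUE     # 2
```

One caveat about the shorthands: TRUE and FALSE are reserved words, whereas T and F are ordinary variables that can be overwritten, so the long forms are the safer choice in scripts.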

Character

Also known as strings. Characters are the most flexible data type. Almost anything you type becomes a character string if you surround it with "" or ''.

"Hello World!"

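To get a feeling for character data, here is a short sketch that uses a few base R functions for strings. Note that a number in quotes is text, not a number.

```r
class("Hello World!")       # "character"
nchar("Hello World!")       # 12 (the space and the "!" count, too)
toupper("Hello World!")     # "HELLO WORLD!"
paste("Hello", "World!")    # "Hello World!" (paste() joins strings with a space)

class("42")                 # "character", because of the quotes
class('42')                 # single quotes work the same way
```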
3.2 Assignment

If you want to store data for later use, you can assign it to a variable[11] with the assignment operator <- (in RStudio, use the keyboard shortcut Alt/Option+- to save yourself some hand gymnastics). Once assigned, you can use the variable’s name to represent its content in other operations. Typing a variable name by itself will print the variable’s content.

a <- 1
b <- 9
c <- a + b

c
[1] 10

Note how newly created variables appear in the Environment panel in RStudio. When you are working with R, you are working in a session. The session provides an environment in which data is stored, and the variable name allows you to retrieve the data again. Until you explicitly delete the variable or end the session, the data will be around for you to use.
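What the Environment panel shows can also be inspected from the console. The following is a minimal sketch using the base R functions ls(), exists(), and rm(); the variable names are made up for this example.

```r
first_value <- 1
second_value <- 9

ls()                      # lists the names of the variables in the current environment
exists("first_value")     # TRUE

rm(first_value)           # explicitly delete the variable
exists("first_value")     # FALSE, the data is gone
second_value              # 9, other variables are unaffected
```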

3.2.1 Variable Names

There are some rules for variable names: They must start with a letter and may contain only the letters of the English alphabet, numbers, dots ., and underscores _.[12] Hyphens are not allowed, because R reads - as a minus sign. For long variable names, it is recommended to use snake case, which looks like this: my_firssst_variable 🐍. Also note that R is case-sensitive.
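Here is a short sketch of names that work and names that do not. The variable names are invented for illustration. The base R function make.names() is used as a quick checker: it turns any text into a valid name, so if the text comes back unchanged, it was already valid.

```r
my_first_variable <- 1     # snake case: valid
my.first.variable <- 2     # dots are allowed, too
myFirstVariable2 <- 3      # numbers are fine, as long as they do not come first

# 2nd_variable <- 4        # not valid: starts with a number
# my-variable <- 5         # not valid: R reads this as "my minus variable"

make.names("my_first_variable")    # "my_first_variable" (unchanged, so valid)
make.names("2nd_variable")         # "X2nd_variable"
make.names("my-variable")          # "my.variable"
```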

Exercise 3.1 What would be the result of running the code below? Think about it first and then run it to check your answer.

a <- 10
a + b    # What will be the result of a + b?
c        # What has happened to c?
C        # Why is this not working?

What’s with the arrow, though?

Many other programming languages use = as the assignment operator. You can, for the most part, also use = for assignment in R, just be aware that it will let everyone know that you are not a true believer.

You can also use -> to assign from left to right. This can come in handy when you are messing around in the console, but it may not be a good idea to use it too much in your scripts, because it can make your code harder to read:

"Never look back!" -> x

Finally, there’s <<-, which allows assignment to variables in an enclosing environment. This can be useful if you need to update global variables from within a function. However, this is far beyond the scope of this intro, and chances are you will never encounter a <<- in the wild.
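To make the differences between the operators concrete, here is a minimal sketch. The function increment() and the variable counter are hypothetical names chosen for this example, and the function syntax goes beyond what this chapter covers.

```r
x = 5                        # at the top level of a script, same effect as x <- 5
"Never look back!" -> y      # assigns the string on the left to y

counter <- 0

increment <- function() {
  # <<- looks for `counter` outside the function and updates it there
  counter <<- counter + 1
}

increment()
increment()
counter                      # 2
```

With a plain <- inside the function, R would create a new variable that only exists while the function runs, and the outer counter would stay at 0.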

You can read more about R’s assignment operators on the help page that opens when you run ?assignOps in the console.

3.3 Comments

As you can see in the exercise above, you can use # to leave comments in your script. Everything to the right of the # will be ignored by R. You can quickly “comment out” multiple lines at the same time by going to Code ▸ Comment/Uncomment Lines or pressing Ctrl/Cmd + Shift + C. Learn to write good comments and write them generously – your collaborators and future self will thank you for it! Sarah Drasner’s presentation The Art of Code Comments is a very entertaining introduction to good comment writing.
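As a small, made-up example of what this can look like in practice, the comments below explain why a step is there rather than repeating what the code already says. The variable names and the value are invented.

```r
# Heights were recorded in centimetres, but the later analysis expects metres
height_cm <- 180
height_m <- height_cm / 100   # 1 m = 100 cm

# height_m <- height_cm / 10  # commented out: this line is ignored by R

height_m                      # 1.8
```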

3.4 Vectors

In computer programming, data structures are a way to organize and process particular types of data efficiently. When working with data in R, you will encounter two important data structures: vectors and data frames. Vectors are made for storing values of the same data type (numerical, character, logical, …) in a particular order.

3.4.1 Creating vectors

To create a vector, you can use the “combine” function, c(). There are also the : operator and the seq() function, which can help you create a vector that contains a sequence of numbers:

c(1,2,3,4)      # create a vector manually
0:4             # a vector of all integer numbers from 0 to 4
4:0             # it also works backwards
seq(0, 2, 0.5)  # numbers 0.0 to 2.0 in 0.5 increments
[1] 1 2 3 4
[1] 0 1 2 3 4
[1] 4 3 2 1 0
[1] 0.0 0.5 1.0 1.5 2.0
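A few more ways to build vectors, as a sketch. The length.out and by arguments belong to seq(); rep() is an additional base R function, not mentioned above, that repeats values.

```r
quarters <- seq(0, 1, length.out = 5)      # 5 evenly spaced numbers: 0.00 0.25 0.50 0.75 1.00
countdown <- seq(10, 0, by = -5)           # a negative step counts down: 10 5 0
pattern <- rep(c("a", "b"), times = 2)     # repeat a vector: "a" "b" "a" "b"

length(0:4)                                # 5, the number of elements in the vector
```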

3.4.2 Working with vectors

R is really good at working with vectors! See how easy it is to multiply all numbers by a scalar, get the mean, or sum all elements — all without using a loop or a temporary variable:

x <- c(1,2,3,4) # Create vector and assign to variable `x` 
x * 10
mean(x) 
sum(x)
[1] 10 20 30 40
[1] 2.5
[1] 10
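The same idea extends to operations between two vectors, which R carries out element by element. The sketch below assumes a second vector y, which is made up for this example.

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

x + y         # 11 22 33 44  (first with first, second with second, ...)
x * y         # 10 40 90 160
x^2           # 1 4 9 16

# Comparisons are element-wise, too, and return a logical vector
x > 2         # FALSE FALSE TRUE TRUE
sum(x > 2)    # 2, because each TRUE counts as 1
```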

3.4.3 Accessing elements of a vector

You can access specific positions of a vector using square brackets. In R, indices start at 1, that is, index 1 will return the first element of the vector.

x[1]
[1] 1

You can also use negative indices to exclude elements. For example, index -1 will return everything but the first element of a vector.

x[-1]
[1] 2 3 4

Use : within brackets to access a range of indices.

x[2:3]
[1] 2 3

You can also use a vector of indices to index into another vector in order to select specific elements:

x[c(1,3)]   
[1] 1 3

You can also use the bracket syntax to replace individual elements:

x[2:3] <- 0   # replace 2nd and 3rd element with 0
x[2:3] <- c(7,8)   # replace 2nd and 3rd element with 7 and 8, respectively
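To see what the two replacements do, the sketch below prints x after each step. It starts from a fresh copy of x, and the last lines show one more option: using a logical vector as the index.

```r
x <- c(1, 2, 3, 4)

x[2:3] <- 0          # the single 0 is used for both positions
x                    # 1 0 0 4

x[2:3] <- c(7, 8)
x                    # 1 7 8 4

# A logical vector inside the brackets selects the elements where it is TRUE
x[x > 5]             # 7 8
x[x > 5] <- 5        # cap all values above 5
x                    # 1 5 5 4
```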

3.4.4 Named vectors

Values in a vector can be named. This comes in handy, for example, if you need a simple key-value store:

fav_food <- c(
    Garfield = "Lasagna", 
    Popeye = "Spinach",
    Obelix = "Roast Wild Boar"
)

fav_food["Popeye"]
   Popeye 
"Spinach" 

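A few more things you can do with a named vector, as a sketch. The entry for Snoopy is a hypothetical addition for this example.

```r
fav_food <- c(
    Garfield = "Lasagna",
    Popeye = "Spinach",
    Obelix = "Roast Wild Boar"
)

names(fav_food)                       # "Garfield" "Popeye" "Obelix"
fav_food[c("Garfield", "Obelix")]     # look up several names at once
unname(fav_food["Popeye"])            # "Spinach", without the name attached

fav_food["Snoopy"] <- "Pizza"         # assigning to a new name adds an element
length(fav_food)                      # 4

fav_food["Asterix"]                   # NA, because there is no such name
```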
Implicit Coercion

Vectors can only contain a single data type. If you try to mix data types, R will quietly (without throwing an error) coerce (that is, convert) all elements to a single data type that can represent all of them. This coercion follows a certain hierarchy:

  • logical becomes numerical (FALSE = 0 and TRUE = 1)
  • numerical becomes character

The coercion only goes in one direction. For example, character strings that happen to contain numbers will nevertheless be treated as characters. Consider these examples:

c(FALSE, TRUE, 2, 3)   # logical becomes numerical
[1] 0 1 2 3
c(1, 2, "3")           # numerical becomes character
[1] "1" "2" "3"

This also happens if you update a vector after it has already been created.

x <- 1:3 
x[1] <- TRUE   # this logical will become numerical
x 
[1] 1 2 3

Vectors cannot contain other vectors, so nested vectors automatically get unnested, and as they get unnested, the elements will be coerced to a common data type:

c(1, c(2,3,4)) 
c(c(TRUE, 0), c(TRUE, "0")) # logical and numerical become character
[1] 1 2 3 4
[1] "1"    "0"    "TRUE" "0"   

This behavior can sometimes be used for good, but in general it’s best to avoid it. Make coercion explicit. For example, if you want to convert a character representation of a number to numeric, do so with the as.numeric() function. For more complex applications consider static typing with https://github.com/moodymudskipper/typed. If you really need to store a sequence of mixed data types, you will have to use another data structure called list.
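Explicit coercion can be sketched with the as.*() family of base R functions. The last lines show a list holding mixed data types; the variable name mixed is made up for this example.

```r
as.numeric("3") + 1      # 4, the string "3" is converted before the addition
as.character(42)         # "42"
as.logical("TRUE")       # TRUE
as.numeric(TRUE)         # 1

# If a string cannot be read as a number, the result is NA (plus a warning)
as.numeric("three")

# A list keeps each element's data type instead of coercing
mixed <- list(1, "two", TRUE)
sapply(mixed, class)     # "numeric" "character" "logical"
```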

3.5 Data Frames & Tibbles

Usually, your research data will be more complex than just a single sequence of values. For this, R has data frames. Data frames are two-dimensional like a spreadsheet – with rows usually containing different observations and columns usually containing different variables.

You may see the word Tibble pop up occasionally in the examples in this book. A Tibble is a modern implementation of data frames. Tibbles are backwards-compatible with data frames and for the most part interchangeable. Throughout this book, we will use the word data frame to mean both of them.

3.5.1 Creating data frames

You will often get a data frame by reading data from a tabular data file like a spreadsheet, but you can also make one yourself by combining vectors with the data.frame() function:

cool_numbers <- data.frame(
    "number" = c(1,13,23,42),
    "isPrime" = c(FALSE, TRUE, TRUE, FALSE),
    "oneTwo" = c(1, 2)
)

cool_numbers
  number isPrime oneTwo
1      1   FALSE      1
2     13    TRUE      2
3     23    TRUE      1
4     42   FALSE      2

You don’t have to name the columns in your data frame at all, but it is good practice to give them names that are easy to remember and easy to type. Rows of a data frame can be named, too, but row names can be complicated to work with so it’s better to include them as an additional column.

Note how the vector c(1,2) in the oneTwo column was recycled to match the length of the other two columns. This works as long as the total length is a multiple of the shorter vector’s length. Otherwise, R will produce an error.
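The recycling rule can be tried out with a small sketch. The column names are invented, and tryCatch() is used here only to catch the error so that the script keeps running; you do not need it to follow the rest of the chapter.

```r
# A single value is recycled for every row
ok <- data.frame(
    number = c(1, 13, 23, 42),
    constant = 0
)
nrow(ok)    # 4

# 4 is not a multiple of 3, so data.frame() stops with an error
result <- tryCatch(
    data.frame(number = c(1, 13, 23, 42), oneTwoThree = c(1, 2, 3)),
    error = function(e) "error: lengths do not match"
)
result      # "error: lengths do not match"
```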

3.5.2 Data frame operations

A quick way to access a single column of a data frame is with the $ operator.

cool_numbers$number
[1]  1 13 23 42

Because columns in a data frame are vectors, everything that can be done with a vector can be done with a data frame column!
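For instance, the vector operations from earlier in this chapter work unchanged on a column. The sketch below rebuilds a smaller version of cool_numbers so that it runs on its own; the new column name doubled is made up.

```r
cool_numbers <- data.frame(
    number = c(1, 13, 23, 42),
    isPrime = c(FALSE, TRUE, TRUE, FALSE)
)

cool_numbers$number * 2                      # 2 26 46 84
mean(cool_numbers$number)                    # 19.75
cool_numbers$number[cool_numbers$isPrime]    # 13 23, using one column to filter another

# Assigning to a column name that does not exist yet adds a new column
cool_numbers$doubled <- cool_numbers$number * 2
ncol(cool_numbers)                           # 3
```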

Everything is a vector

In fact, all “atomic” data types in R (numeric, characters, logical) are always stored as vectors – even a single number is actually a numerical vector of length one! This is a really powerful idea and the source of much of R’s strength!
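You can check this yourself with a few lines. This is just a quick sketch using base R functions.

```r
length(42)                # 1, a single number is a vector of length one
is.vector(42)             # TRUE
42[1]                     # 42, so indexing works on a single number, too
identical(42, c(42))      # TRUE

length("Hello World!")    # 1, one string (not 12 characters) in a vector of length one
```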

Exercise 3.2 Make a new data frame my_cool_numbers with a column a that contains all numbers from 1 to 100 and column b that contains your favorite number 100 times.

my_cool_numbers <- data.frame(
    "a" = 1:100,
    "b" = 42
)

Since data frame columns are vectors, we can use the same operations on them. Calculate the standard deviation of all numbers in column a in our my_cool_numbers data frame! Use the RStudio help to find the right function for it.

sd(my_cool_numbers$a)

As for vectors, R comes out of the box with many more commands to create and modify data frames, select columns, and filter rows. You will eventually need to get familiar with these. However, in this book we will skip these in favor of the more streamlined and beginner-friendly functions from the Tidyverse – more about that later.

3.6 Other Data Structures

R has a couple of other data structures that you will eventually encounter. I don’t want to dwell on them for too long, because they will not play a role in the rest of this book. If you are interested, you can read more here. Additional R data structures include …

  • Factor – Factors are numerical vectors with a fixed set of possible values (levels), whereby each level can be assigned a label. The levels may be ordered or unordered. They are most useful as a data structure for ordinal and categorical variables in statistical models. Unfortunately, they are often also a point of confusion for R novices.

  • List – Lists are similar to vectors, but they can contain a mix of data types and data structures, including other lists.

  • Matrix – Matrices are two-dimensional vectors. Like data frames, they have rows and columns. Unlike data frames, all elements in a matrix must be of the same data type. Unless you are implementing statistical methods yourself and want to use R’s matrix algebra functions, you probably won’t have to work with matrices directly.

  • Object – Objects can be used to organize functionality. For instance, a statistical model may be an object that has various attributes and associated functions. R even has several object systems, including S3, S4, and R6.

  • Environment – Do you remember when I said that your R session provides an environment in which data is stored? It turns out that environments are yet another data structure in R. An environment is a data structure that contains other named R data structures. Understanding environments is central to a deep understanding of R, but as a beginner it’s nothing you need to worry about.
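To give a first impression of some of the structures listed above, here is a brief sketch of a factor, a list, a matrix, and an environment. All names and values are invented for illustration.

```r
# Factor: a fixed set of levels, here with an order
sizes <- factor(
    c("small", "large", "medium", "small"),
    levels = c("small", "medium", "large"),
    ordered = TRUE
)
levels(sizes)              # "small" "medium" "large"
as.integer(sizes)          # 1 3 2 1, the numbers behind the labels

# List: mixed data types, accessed by name with $
person <- list(name = "Ada", age = 36, languages = c("R", "Python"))
person$languages[1]        # "R"

# Matrix: rows and columns, a single data type, filled column by column
m <- matrix(1:6, nrow = 2)
dim(m)                     # 2 3 (two rows, three columns)
m[2, 3]                    # 6 (row 2, column 3)

# Environment: a container for named data
e <- new.env()
assign("answer", 42, envir = e)
get("answer", envir = e)   # 42
```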


  10. Decimal numbers are also called floating point numbers or “floats” for short. Note that R uses periods . to designate the decimal places, not commas , as you might be used to, depending on where you are from.

  11. In official R terminology, you are assigning values to objects, but people who have experience with other programming languages sometimes find the term object confusing. Therefore, I will use the word variable in this book.

  12. Unlike in many other programming languages, the dot is not an operator in R and can be used in variable names.