What is R?

R is a statistical computing and data analysis environment. It is a free and open source implementation of the S programming language, which was created in the 1970s with interactive data analysis in mind. Here’s what that means:

  • R is free and open source – R is owned and developed by the community of its users. This has led to a rich and continually evolving ecosystem of freely available extensions (packages) for R that cover many fields and applications.

  • R is interactive – R works like a pocket calculator: you can do calculations, explore data, and analyze data quickly and on the fly. This makes for a very flexible, interactive workflow that is especially well suited for data exploration and analysis.2

  • R is focused on data analysis – Right out of the box, R gives you a powerful tool belt for data analysis. More than in many other programming languages, data and statistics are at the core of R and there are built-in data structures and functions that make data analysis easier.
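To make the last two points more concrete, here is a minimal sketch of what an interactive session at the R console might look like. It uses only base R and the `mtcars` data set that ships with R; the vector `heights_cm` and its values are made up for illustration.

```r
# Use R like a pocket calculator
1 + 2 * 3              # the usual operator precedence applies: 7
sqrt(16)               # 4

# Vectors are a built-in data structure; functions work on all elements at once
heights_cm <- c(172, 181, 165, 190)   # illustrative values
mean(heights_cm)       # 177
heights_cm / 100       # converts every element to metres

# Data frames hold tabular data; mtcars is a built-in example data set
head(mtcars, 3)        # peek at the first three rows
summary(mtcars$mpg)    # quartiles, mean, minimum, and maximum of one column
```

Each line can be typed and run on its own, and the result is printed immediately, which is what makes this style of work feel like using a calculator.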

Although R is a proper programming language3, its main focus is undoubtedly data analysis. So much so that some people call R a domain-specific programming language. Others say that R is an environment for interactive data analysis that happens to have a programming language.

A tool for data analysis

R is popular in academia, industry, and the public sector as a tool for data analysis,4 where it is used to support all parts of the data analysis workflow.

Figure: A sketch of the different steps in the data analysis workflow.

Throughout this book we will get a glimpse of how R can help you with each of these steps:

  • Data collection – This involves obtaining the data from its source, for example, a spreadsheet, a database, the output of another program, or the log files of a measurement device.

  • Data wrangling – This includes tasks like cleaning, transforming, and summarizing the data, merging data sources, dealing with odd data points, and calculating new variables. This part of the data analysis workflow is also known as data munging, or “80% of data science”.

  • Data visualization – This involves turning the data into visual representations, for exploration, explanation, or exhibition.

  • Modeling – This is about applying models to your data in order to understand it better (e.g. clustering), often with the goal of making predictions. It covers relatively common statistical models and tests (such as the general linear model) as well as more advanced machine learning techniques like deep neural networks.

  • Communication – This is about communicating the analysis, for example in the form of a report, a manuscript, a presentation, or a web interface, and doing it in a way that is transparent and reproducible.
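The steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data are hypothetical measurements embedded as an inline string so that the example is self-contained, the variable names are invented, and writing a CSV file is merely a stand-in for a real report or presentation.

```r
# 1. Data collection: read CSV data. The "file" here is an inline string
#    with hypothetical measurements, so the example needs no external source.
csv_text <- "id,group,weight_kg,height_cm
1,a,70.5,172
2,a,82.0,181
3,b,NA,165
4,b,95.3,190
5,b,61.2,158"
raw <- read.csv(text = csv_text)

# 2. Data wrangling: drop incomplete rows and calculate a new variable
clean <- raw[complete.cases(raw), ]
clean$bmi <- clean$weight_kg / (clean$height_cm / 100)^2
aggregate(bmi ~ group, data = clean, FUN = mean)   # summarize per group

# 3. Data visualization: a quick exploratory scatter plot
plot(clean$height_cm, clean$weight_kg,
     xlab = "Height (cm)", ylab = "Weight (kg)")

# 4. Modeling: fit a simple linear model of weight on height
fit <- lm(weight_kg ~ height_cm, data = clean)
coef(fit)                                          # intercept and slope

# 5. Communication: save the cleaned data so it can go into a report
out_file <- tempfile(fileext = ".csv")
write.csv(clean, out_file, row.names = FALSE)
```

In practice each step usually involves dedicated packages and far more care, but the overall shape of the workflow stays the same.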


  2. This workflow is not unique to R. Python, for example, offers similar functionality via the IPython (interactive Python) project. However, R was early to the game.

  3. Some people claim that you can use R for everything.

  4. In 2020, R even briefly climbed to 8th place in the TIOBE index, which ranks programming languages by popularity (https://www.tiobe.com/tiobe-index/r/).