1 This session

1.1 Summary

This session is a very brief introduction to R and RStudio for beginners, with reference to Civil Service People Survey (CSPS) data.

There’s a lot of material about R and about the CSPS data that we aren’t going to have time to cover today.

We’ll be developing this guidance and making it freely available on the web. It will include more information on tidying, analysing, plotting and reporting.

Important note on the data: the data used in this document is not real. Instead, it’s a ‘synthetic’ version created using the {synthpop} package, which preserves the data distributions without any response coming from a real individual.

1.2 Code along

This session is designed as a ‘code-along’. You’ll be asked to type what you see on screen as we progress.

Ideally you have R and RStudio installed already and you are able to download packages. In which case, you can do everything outlined in this document from your computer.

Don’t worry if you don’t have R and RStudio downloaded, or you can’t download packages. Instead, we’ve set up an instance of RStudio in the cloud for this training session, using a non-profit service called Binder.

Click the button below to launch RStudio in your browser with Binder. It may take a few moments to load; retry or switch browser if it fails.

Launch RStudio Binder

In this RStudio instance, the folder structure is set up and the packages and data are pre-installed for you, so you won’t need to follow the steps in:

  • section 2.5: Project folders
  • section 3.6: Install and load packages (don’t use install.packages(), but do use library())
  • section 4.2: Download the data

Note that this is not how you would normally access RStudio; it has been set up so you can follow along with the demonstrations in the session.

After a period of inactivity, your instance of Binder will shut down. Note that any code you write won’t be preserved – you will need to copy, paste and save anything you write into a file on your computer instead.

1.3 Background

The annual CSPS produces a lot of data each year. Departments are provided with summary reports, but can access response-level data (‘microdata’) to perform their own in-depth analyses.

A variety of tools – Excel, SPSS, Stata and others – are used across government to analyse the microdata. Many of these are proprietary and require expensive licences. This variety can make it tricky for analysts to share approaches between departments, and even within them.


Activity

  • What program do you use for analysis?
  • Have you used R before?
  • What makes you want to use R?


Of course, every analyst and every department is welcome to use the tools that are available to them, that they understand and that get the job done. Having said this, we’re advocating for the statistical programming language R and the RStudio code editor.

2 R and RStudio

2.1 Why?

Why R? It:

  • is free and open source
  • has a strong community of users across government and the world
  • handles all steps of an analysis from reading data to writing reports
  • makes your work reproducible because the code can be re-run by you in future with the same or different data
  • has highly configurable graphics, including interactive ones
  • has tools for building apps and interactive ‘dashboards’ to let users explore your data

RStudio is a popular and well-supported piece of software for editing and running R code for both beginners and advanced users. It’s also free of charge and the company behind it is a public benefit corporation with a commitment to producing open source software.

2.2 {cspstools}

In particular, the CSPS team are developing some R-based tools for analysing CSPS data specifically. You will be able to download a package called {cspstools} that contains common functions for analysing CSPS data. This will help provide consistency in analysis and reporting and make tasks easier to perform and more reproducible. The tools will be shared in the open for anyone to use and so that anyone can help to improve them.

2.3 Download

Before starting, you should download:

  • R (the programming language)
  • RStudio (the code editor)

Both are free, but you might need to get in touch with your IT team to get them installed on your computer.

2.4 RStudio layout

Open RStudio – its icon is a white letter ‘R’ in a blue circle:

When you open RStudio for the first time, you’ll see the window is split into three ‘panes’, which are numbered below:

Your window may not look exactly like this one, depending on your operating system.

Labelled in the image are:

  1. The console pane – code is executed here
  2. The environment pane – stored values, tables, plots, etc, are displayed here
  3. The files pane – navigate your folder structure (also has tabs for showing plots and help)

We don’t need to concern ourselves with every button and tab for now.

2.5 Project folders


Binder users

Binder users: you don’t need to run this section because you are already working in a Project folder with the correct folder structure.


There are many benefits to having one folder per analytical project. It means your work is more:

  • organised – all the code, data, outputs, etc, are stored in one place (a single project folder)
  • reproducible – your code can be re-run from scratch to produce the same outputs every time
  • transferable – you can pass the entire project folder to someone else and they’ll be able to run it on their own machine; the filepaths you specify in your code assume the home folder is the project folder, so you can write something like data/dataset.csv rather than file/path/on/my/personal/machine/that/you/cannot/access.csv

RStudio has a system that helps you set this up. You can create an ‘RStudio Project’ like this:

  1. Open RStudio (the icon is a white R inside a blue circle)
  2. File > New Project…
  3. New Directory > New Project
  4. Give your project a meaningful name in the ‘Directory Name’ box (e.g. csps-r for this session)
  5. Browse for the filepath where your R Project folder will be placed
  6. Click ‘Create Project’ and RStudio will open your project (note the project name in the top right)

This creates a folder where you specified that contains an RStudio Project file (extension ‘.Rproj’). This folder is the ‘home’ of your project and this is where you should house all the files and code that you need.

For now, create two new folders – data and output – in your Project folder (we’ll be using these later).

We haven’t created any R script files yet, but they’ll go in the project folder too.

This means we’ll get a folder structure like this:

csps-r/              # the project folder
  ├── csps-r.Rproj   # R Project file
  ├── data/          # read-only raw data
  ├── output/        # processed data
  └── training.R     # R script files

To access your RStudio Project in future, navigate to the project folder and double-click your R Project file, which has the .Rproj extension (e.g. your-project.Rproj). It will open RStudio in the same state that you left it when you last closed it.

2.6 Start a new script

You’ll write your code into a special text file called an R script, which has the extension .R.

Having opened the R Project (.Rproj) file for your analysis, open a new script by clicking File > New File > R script. A new blank script will appear in a new pane in the upper left of the RStudio window.

You can type or copy and paste code into this document. This serves as a record of the actions you used to analyse the data step-by-step.


Tip

  • You can have multiple scripts open at once
  • They will appear as separate tabs in the scripts pane


3 R fundamentals

3.1 Comments

In an R script, any characters prefixed with a hash (#) will be recognised as a comment. R will ignore these when you run your code.

Comments are really helpful for letting people understand what your code is doing. Try to keep a narrative going throughout your code. Be explicit – it might be obvious to you right now why a certain line of code is being written, but you might come back in a few months’ time and have forgotten.

It’s also good to use comments to explain what each block of code is doing and to explain particular lines of code. Don’t worry about the code itself, but here’s an example of comments in use:
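For instance, a commented snippet might look like this (the code itself is just an illustration):

```r
# Calculate a mean engagement score
scores <- c(0.75, 0.35, 0.80)  # example scores for three respondents
mean(scores)                   # take the mean of the vector
```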

It’s also good to add the title, your name, date, etc, as comments at the top of your script so people know what the script is for when they open it.

3.2 Run code

How do you actually run some R code? Let’s start with a small calculation.

First, we’ll add two numbers together. Type the calculation 1 + 1 into your script:
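In the script this is simply:

```r
1 + 1
```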

To execute it, make sure your cursor is on the same line as the code and press Command+Enter on a Mac or Control+Enter on a PC (there’s also a ‘Run’ button in the upper right of the script pane). You can also run multiple lines of code by highlighting them and then running the selection.

What happened when you ran the code? The following was printed to the console in the lower-left pane of RStudio:

[1] 2

Great, we got the answer 2, as expected. (The number in square brackets relates to the number of items returned in the answer and doesn’t concern us right now.)


Tip

  • Don’t forget to save your script file
  • Go to File > Save or use the Control+S or Command+S shortcuts


3.3 Store a result

This is good, but ideally we want to store objects (values, tables, plots, etc), so we can refer to them in other pieces of code later.

You do this in R with a special operator: the ‘assignment arrow’, which is written as <-. The shortcut for it is Alt+- (hyphen).

For example, we can assign 1 + 1 to the name my_num with <-. Execute the following code:
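The assignment looks like this:

```r
my_num <- 1 + 1  # store the result of 1 + 1 under the name my_num
```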

Hm. Nothing printed out in the console. Instead the object is now stored in your ‘environment’ – see the top right pane in RStudio:

You can now refer to this object by name in your script. For example, you can print it:
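For instance, with the print() function:

```r
print(my_num)
```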

[1] 2

Tip

  • You can actually print an object by running its name alone
  • Running my_num is equivalent to print(my_num)
  • I’ll be using print() throughout to be more explicit


The real benefit to this is that you don’t have to repeat yourself every time you want to use that particular calculation. For example, you can refer to the object in new expressions:
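For example, we could multiply my_num by 5 (one calculation consistent with the output below):

```r
print(my_num * 5)  # my_num is 2, so this gives 10
```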

[1] 10

Tip

  • Use names that are meaningful, relatively short and use consistent naming conventions (like ‘snake_case’, which is all lowercase and separates words with underscores)
  • For example, var_mean and var_median
  • Names are case sensitive, and can’t contain spaces or start with a number


Activity

  • Make an object called val1 that stores the value 543
  • Make another object called val2 that stores the value 612
  • Make a third object called calc that is the multiplication (*) of val1 and val2
  • What value do you get when you print calc?


3.4 Object classes

We stored a numeric value in the last section. We can do more than just store one item of data at a time though.

This next chunk of code combines multiple elements with the c() command. This kind of multi-element object is called a ‘vector’.
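For example, a vector of numbers might be created like this (my_nums is an illustrative name):

```r
my_nums <- c(1, 3, 5)  # combine three numbers into one vector
```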

Here’s a vector that contains text rather than numbers. Character strings go inside quotation marks (""), which aren’t needed for numbers.
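We can store some department abbreviations (the object is called dept_names, which we refer to later) and print them:

```r
dept_names <- c("DfE", "DHSC", "DfT")  # three character strings
print(dept_names)
```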

[1] "DfE"  "DHSC" "DfT" 

So each of the elements of the object was returned.

You can see what ‘class’ your object is at any time with the class() function.
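For example, checking the two objects we’ve made so far:

```r
class(my_num)      # a number, so 'numeric'
class(dept_names)  # text strings, so 'character'
```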

[1] "numeric"
[1] "character"

Tip

  • To create a vector of numbers, you could write c(1, 2, 3)
  • There’s a shortcut for this: 1:3
  • This means ‘all the numbers from 1 to 3 including 1 and 3’


So we’ve created objects composed of a single value (my_num) and a vector of values (dept_names).

The next step would be to combine a number of vectors together to create a table with rows and columns. Tables of data with rows and columns are called ‘data frames’ in R and are effectively a bunch of vectors of the same length stuck together.

Here’s an example of a data frame built from scratch:
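A version consistent with the output below, built with data.frame() (the object name df is an assumption, since the original isn’t shown):

```r
# Three vectors of equal length become the columns of the table
df <- data.frame(
  dept = c("DfE", "DHSC", "DfT"),
  headcount = c(6900, 8300, 15000),
  responsibility = c("Education", "Health", "Transport")
)
print(df)
```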

  dept headcount responsibility
1  DfE      6900      Education
2 DHSC      8300         Health
3  DfT     15000      Transport

Can you see how the data frame is three vectors (dept, headcount and responsibility) of the same length (3 values) arranged into columns? The function data.frame() bound these together into a table format. Let’s check the class:
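Assuming the data frame was stored under the name df:

```r
class(df)
```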

[1] "data.frame"

R is capable of building very complex objects, but tabular data with rows and columns is ubiquitous and it’s how the CSPS data is stored. We’ll be focusing on data frames for now.

3.5 Use a function

You’ve been using functions already: print(), c(), data.frame(), class().

A function is a reproducible unit of code that performs a given task, such as reading a data file or fitting a model. There are many of these built into R already, but you can also download ‘packages’ of functions or create your own.

Functions are written as the function name followed by brackets. The brackets contain the ‘arguments’, which are like the settings for the function. One argument might be a filepath to some data; another might describe the colour of points to be plotted. Arguments are separated by commas.

So a generic function might look like this:
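Something like this, where function_name and the argument names are placeholders rather than real R objects:

```r
function_name(       # a generic (made-up) function
  argument1 = value, # first setting (e.g. a filepath)
  argument2 = value  # second setting (e.g. a plotting colour)
)
```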

Note that you can break the function over several lines to improve readability and so you can comment on individual arguments. You can put your cursor on any of these lines and run it. You don’t have to highlight the whole thing.

You can type a question mark followed by a function name to learn about its arguments. The help file will appear in the bottom-right pane. For example, ?plot().


Tip

  • You can create your own functions, but we’re not going to spend time on that now
  • This is a good way to stop repeating code
  • Here’s an example that adds two provided numbers together:
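A sketch of such a function (the name add_two and the inputs 3 and 4 are assumptions that reproduce the output below):

```r
add_two <- function(x, y) {  # x and y are the function's arguments
  x + y                      # the last expression is returned
}

add_two(3, 4)
```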
[1] 7


3.6 Install and load packages


Binder users

Binder users: you don’t need to use install.packages() because the packages have already been installed for you; you will need to use library() though.


Functions can be bundled into packages. A bunch of packages are pre-installed with R, but there are thousands more available for download. These packages extend the basic capabilities of R or improve them.

Packages can be installed to your computer using the install.packages() function. This automatically fetches and downloads packages from a centralised package database on the internet called CRAN, which only accepts packages that meet strict quality criteria.


Tip

  • Packages can be downloaded from places other than CRAN
  • You can download packages from open online repositories like GitHub
  • You may need your IT team to authorise or install packages for you


We’re going to use a few packages to help us:

  • {haven} for reading a range of file types, like Stata’s (.dta) and SPSS’s (.sav)
  • {dplyr} for data cleaning and preparation
  • {ggplot2} for plotting

Tip

  • {haven}, {dplyr} and {ggplot2} are part of a larger suite of packages called ‘the tidyverse’
  • This is a set of packages that share a common design philosophy and seek to make analysis more intuitive
  • The RStudio company produces the majority of the packages in the tidyverse
  • You can download all the packages of the tidyverse at once with install.packages("tidyverse")


Typically you would type install.packages("packagename") to download the package, but we can use the following to install the packages from the tidyverse all at once:
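That is:

```r
install.packages("tidyverse")  # downloads the whole tidyverse suite from CRAN
```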

You only need to run the installation function once per package on your machine.

Each time you start a new session you’ll need to run library("package_name") to make the functions from that package available for use in your script.

Now that we have the tidyverse packages installed, we can load the ones we need with library().
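For this session that means three packages:

```r
library("haven")    # read Stata/SPSS files
library("dplyr")    # data cleaning and preparation
library("ggplot2")  # plotting
```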


Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Sometimes a message will be printed to tell you a bit more about the package, which is what happens for {dplyr}.

We can start using functions from these packages now that they’re loaded.

It’s good practice to write the library() lines near the top of your script file so that others know which packages are being used in the script.

4 Data

4.1 Synthetic data

We aren’t using real CSPS data for these exercises. Instead, we’ll be using a ‘synthetic’ version that mimics the 2019 data.

In short, the data distributions within the variables are preserved, but no response represents a real individual – so we get realistic-looking outputs without revealing anyone’s actual answers.

We’ve also restricted the number of variables (columns) and rows (responses) to keep the data set relatively small, and have added a fake unique ID value.

The variables in the synthetic data set are:

  • ResponseID (a faked unique respondent identifier) and OverallDeptCode (the department name)
  • B01 to B05 (questions about ‘my work’) and B47 to B51 (engagement questions)
  • E03 and E03_GRP (have you been bullied or harassed in the past 12 months?) and E03A_01 to E03A_16 (a column for each possible response about the bullying, harassment or discrimination experienced)
  • W01 to W04 (wellbeing questions)
  • J03 (ethnicity)
  • Z02 (civil service grades)
  • ees (a derived variable that translates each engagement question score to a percentage and takes the mean)
  • mw_p (a derived variable that gives the proportion of responses to the ‘my work’ theme questions that were positive, i.e. ‘agree’ or ‘strongly agree’)

4.2 Get the data set


Binder users

Binder users: You don’t need to run this section because the data set is already in your data/ folder.


Ordinarily we would send you the data for your organisation on request. For this session, we’ve prepared the synthetic data set as a Stata-format (.dta) file.

You can download the data from the Cabinet Office GitHub page. Visit the link, click the ‘download’ button and save the downloaded file to the data/ folder of your project.


Tip

You could also download the file to your machine with the download.file() function. The first argument, url, is the web address where the data are stored; the destfile argument is where you want to save the file on your computer – we want to put it in data/.
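A sketch of that call – the URL here is a placeholder, not the real link:

```r
download.file(
  url = "https://<link-from-the-GitHub-page>/csps_synth.dta",  # placeholder URL
  destfile = "data/csps_synth.dta"  # save into the project's data/ folder
)
```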


Now take a look at the ‘Files’ pane in RStudio and navigate into the data/ folder. The csps_synth.dta file should now be in there.

5 Read

There are a number of functions for reading data into R. A common one is read_csv() from the tidyverse’s {readr} package.

The {haven} package has a function called read_stata() that you can use to read in a .dta file. Let’s read in the data with this function and name the object ‘data’.
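Assuming the file sits in the data/ folder as above:

```r
data <- read_stata("data/csps_synth.dta")  # from the {haven} package
```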

This will read the data in as a ‘tibble’, a fancier type of data frame that’s used by the tidyverse packages. For example, when printed to the console, tibbles use colour coding and are truncated to fit.


Activity

How do you know that the data has been successfully read into R?


5.1 Inspect

It’s good to preview the data and check it looks like what we expected.

5.1.1 Glimpse the structure

The {dplyr} package that we loaded earlier has a function called glimpse(), which tells you about the structure of the data.
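Pass it the data object:

```r
glimpse(data)
```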

Rows: 11,555
Columns: 38
$ ResponseID      <dbl> 100000, 100001, 100002, 100003, 100004, 100005, 10000…
$ OverallDeptCode <chr> "ORGA", "ORGA", "ORGA", "ORGA", "ORGA", "ORGA", "ORGA…
$ B01             <dbl+lbl> 4, 4, 3, 5, 5, 5, 4, 4, 4, 5, 4, 4, 5, 4, 4, 4, 3…
$ B02             <dbl+lbl> 4, 4, 4, 5, 5, 5, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 3…
$ B03             <dbl+lbl> 3, 4, 3, 5, 5, 5, 4, 3, 3, 4, 2, 4, 2, 4, 4, 2, 2…
$ B04             <dbl+lbl> 3, 4, 4, 5, 4, 5, 4, 2, 4, 4, 1, 4, 4, 4, 3, 1, 2…
$ B05             <dbl+lbl> 4, 4, 4, 5, 4, 5, 3, 4, 5, 5, 4, 3, 5, 3, 3, 4, 3…
$ B47             <dbl+lbl> 4, 3, 4, 4, 5, 5, 4, 3, 3, 4, 2, 4, 3, 4, 4, 4, 3…
$ B48             <dbl+lbl> 4, 3, 4, 5, 5, 5, 4, 4, 3, 4, 2, 4, 5, 4, 4, 4, 3…
$ B49             <dbl+lbl> 4, 2, 4, 4, 5, 5, 4, 3, 2, 2, 2, 2, 2, 4, 2, 3, 2…
$ B50             <dbl+lbl> 4, 2, 4, 4, 5, 5, 4, 4, 2, 3, 3, 2, 4, 4, 4, 3, 2…
$ B51             <dbl+lbl> 4, 2, 4, 4, 5, 5, 4, 4, 2, 3, 3, 2, 4, 4, 2, 3, 3…
$ E03             <dbl+lbl> 1, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 1, 4…
$ E03_GRP         <dbl+lbl> 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2…
$ E03A_01         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA…
$ E03A_02         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA…
$ E03A_03         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_04         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_05         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_06         <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_07         <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_08         <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_09         <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_10         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_11         <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_12         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_13         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_14         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_15         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_16         <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ W01             <dbl> 7, 10, 9, 10, 10, 10, 7, 9, 8, 10, 8, 5, 8, 9, 3, 8, …
$ W02             <dbl> 7, 10, 9, 10, 10, 10, 7, 10, 7, 10, 10, 5, 8, 11, 3, …
$ W03             <dbl> 8, 10, NA, 10, 11, 10, 7, 8, 6, 9, 9, 4, 9, 3, 11, 8,…
$ W04             <dbl> 7, 1, 6, 5, 2, 4, 1, 8, 6, 5, 2, 9, 8, 1, 8, 4, 1, 4,…
$ J03             <dbl+lbl> 18, 1, 1, 1, 1, 1, 1, 1, NA, NA, 1, 1, 1, 1, 1, 1…
$ Z02             <dbl+lbl> 2, 3, 2, 4, 3, 4, 2, 2, NA, 4, 4, 3, 4, 5, 3, 2, …
$ ees             <dbl> 0.75, 0.35, 0.75, 0.80, 1.00, 1.00, 0.75, 0.65, 0.35,…
$ mw_p            <dbl> 0.6, 1.0, 0.6, 1.0, 1.0, 1.0, 0.8, 0.4, 0.8, 1.0, 0.6…

The top of the output tells us there are 11,555 observations (rows) and 38 variables (columns).

Column names are then listed with the data type and the first few examples. For example, ‘OverallDeptCode’ contains character class (<chr>) data in the form of strings. Column names starting with ‘B’, ‘E’, ‘J’ and ‘Z’ are question codes and they contain responses expressed in numeric form, so they’re of class ‘double’ (<dbl>).

The numbers encode certain responses. For example, 1 means ‘strongly disagree’ and 5 means ‘strongly agree’ for the ‘B’ series of questions.

How do we know what all the numeric values mean? You’ll see that a number of the columns have the label class (<lbl>) too. This means that the column carries additional ‘attributes’ that give the corresponding labels for the values.

Labels aren’t used that frequently in R data frames, but are used in programs like Stata and SPSS. Since we’ve read in a Stata file, we’ve got these labels available to us.

You can also see that there are also lots of NA values. R uses NA to mean ‘not available’ – the data are missing. In this case, it means that the respondent didn’t supply an answer for that question.

5.1.2 See the full data

Another way to inspect the data is to print() it to the console.
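That is:

```r
print(data)
```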

# A tibble: 11,555 x 38
   ResponseID OverallDeptCode     B01     B02     B03     B04     B05     B47
        <dbl> <chr>           <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l>
 1     100000 ORGA            4 [Agr… 4 [Agr… 3 [Nei… 3 [Nei… 4 [Agr… 4 [Agr…
 2     100001 ORGA            4 [Agr… 4 [Agr… 4 [Agr… 4 [Agr… 4 [Agr… 3 [Nei…
 3     100002 ORGA            3 [Nei… 4 [Agr… 3 [Nei… 4 [Agr… 4 [Agr… 4 [Agr…
 4     100003 ORGA            5 [Str… 5 [Str… 5 [Str… 5 [Str… 5 [Str… 4 [Agr…
 5     100004 ORGA            5 [Str… 5 [Str… 5 [Str… 4 [Agr… 4 [Agr… 5 [Str…
 6     100005 ORGA            5 [Str… 5 [Str… 5 [Str… 5 [Str… 5 [Str… 5 [Str…
 7     100006 ORGA            4 [Agr… 4 [Agr… 4 [Agr… 4 [Agr… 3 [Nei… 4 [Agr…
 8     100007 ORGA            4 [Agr… 3 [Nei… 3 [Nei… 2 [Dis… 4 [Agr… 3 [Nei…
 9     100008 ORGA            4 [Agr… 4 [Agr… 3 [Nei… 4 [Agr… 5 [Str… 3 [Nei…
10     100009 ORGA            5 [Str… 4 [Agr… 4 [Agr… 4 [Agr… 5 [Str… 4 [Agr…
# … with 11,545 more rows, and 30 more variables: B48 <dbl+lbl>, B49 <dbl+lbl>,
#   B50 <dbl+lbl>, B51 <dbl+lbl>, E03 <dbl+lbl>, E03_GRP <dbl+lbl>,
#   E03A_01 <dbl+lbl>, E03A_02 <dbl+lbl>, E03A_03 <dbl+lbl>, E03A_04 <dbl+lbl>,
#   E03A_05 <dbl+lbl>, E03A_06 <dbl+lbl>, E03A_07 <dbl+lbl>, E03A_08 <dbl+lbl>,
#   E03A_09 <dbl+lbl>, E03A_10 <dbl+lbl>, E03A_11 <dbl+lbl>, E03A_12 <dbl+lbl>,
#   E03A_13 <dbl+lbl>, E03A_14 <dbl+lbl>, E03A_15 <dbl+lbl>, E03A_16 <dbl+lbl>,
#   W01 <dbl>, W02 <dbl>, W03 <dbl>, W04 <dbl>, J03 <dbl+lbl>, Z02 <dbl+lbl>,
#   ees <dbl>, mw_p <dbl>

The output is displayed in table format, but is truncated to fit the console window (this prevents you from printing millions of rows to the console!). You can see the labels are printed alongside the values in this view.

If you want to see the whole dataset you could use the View() function:
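Note the capital ‘V’:

```r
View(data)
```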

This opens up a read-only tab in the script pane that displays your data in full. You can scroll around and order the columns by clicking the headers. This doesn’t affect the underlying data at all.

You can also access this by clicking the little image of a table to the right of the object in the environment pane (upper-right).

6 Wrangle

We’re going to use a number of functions from the {dplyr} package, which we loaded earlier, to practice some data manipulation.

Functions in the tidyverse suite of packages are usually verbs that describe what they’re doing, like select() and filter().

We won’t have time to go through all of the functions and their variants, but you should get a flavour of what’s possible.

6.1 Select columns

Firstly, we can select() columns of interest. This means we can return a version of the data set composed of a smaller number of columns. This can be helpful for a number of reasons, but in particular it lets us focus on specific variables of interest.

The {dplyr} functions take the data frame as their first argument, so the first thing we’ll supply the function is our data object. Then we can supply the names of columns that we want to keep. Note that we can also rename columns as we select them with the format new_name = old_name. (Alternatively there is a rename() function that only renames columns.)
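A call consistent with the output below, keeping Z02 and renaming J03 to ethnicity:

```r
select(data, Z02, ethnicity = J03)
```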

# A tibble: 11,555 x 2
            Z02                                          ethnicity
      <dbl+lbl>                                          <dbl+lbl>
 1  2 [eo]      18 [Any other background]                         
 2  3 [SEO/HEO]  1 [English/Welsh/Scottish/Northern Irish/British]
 3  2 [eo]       1 [English/Welsh/Scottish/Northern Irish/British]
 4  4 [G6/7]     1 [English/Welsh/Scottish/Northern Irish/British]
 5  3 [SEO/HEO]  1 [English/Welsh/Scottish/Northern Irish/British]
 6  4 [G6/7]     1 [English/Welsh/Scottish/Northern Irish/British]
 7  2 [eo]       1 [English/Welsh/Scottish/Northern Irish/British]
 8  2 [eo]       1 [English/Welsh/Scottish/Northern Irish/British]
 9 NA           NA                                                
10  4 [G6/7]    NA                                                
# … with 11,545 more rows

See that the order in which we selected the columns is the order in which they appeared when printed.

Instead of naming columns to keep, you can also specify columns to remove by prefixing the column name with a - (minus).


Tip

  • It’s worth noting that our original data set (data) remains unchanged, despite us having selected some columns
  • Usually you will wrangle a data set and save it with a different object name so the original is always available to you
  • You can overwrite an original variable by using its name again, but it’s usually best to avoid this
  • For example, data <- select(data, B01) would overwrite our original data object


To save time you can use some special select() helper functions. For example, you can select a column that contains() or starts_with() certain strings. This is useful if you have lots of columns that share a similarity in their names, like in the CSPS (e.g. B01, B02, etc, all start with “B”).
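For example, selecting the respondent ID along with all the wellbeing (‘W’) columns, consistent with the output below:

```r
select(data, ResponseID, starts_with("W"))
```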

# A tibble: 11,555 x 5
   ResponseID   W01   W02   W03   W04
        <dbl> <dbl> <dbl> <dbl> <dbl>
 1     100000     7     7     8     7
 2     100001    10    10    10     1
 3     100002     9     9    NA     6
 4     100003    10    10    10     5
 5     100004    10    10    11     2
 6     100005    10    10    10     4
 7     100006     7     7     7     1
 8     100007     9    10     8     8
 9     100008     8     7     6     6
10     100009    10    10     9     5
# … with 11,545 more rows

Activity

  1. Use select() to return all the ‘B’ series columns (B01, B02, etc)
  2. How many columns are in this selection?
  3. Do the same, but without the B02 column


6.2 Filter rows

Now to filter the rows of the data set based on certain criteria.

We’re going to make use of some logical operators for filtering our data. These return TRUE or FALSE depending on the statement’s validity.

  • == – equal to: 5 == 2 + 3 returns TRUE
  • != – not equal to: 5 != 3 + 3 returns TRUE
  • %in% – match to a vector (a shortcut for multiple logical tests): 4 %in% c(2, 4, 6) returns TRUE
  • >, < – greater than, less than: 2 < 3 returns TRUE
  • >=, <= – greater than or equal to, less than or equal to: 5 <= 5 returns TRUE
  • & – and (strings together multiple filters): 1 < 2 & 5 == 5 returns TRUE
  • | – or (strings together multiple filters): 1 < 2 | 5 == 6 returns TRUE (only one of them needs to be true)

R also has some special shortcut functions for some logical checks. For example:

  • is.numeric() – is the content numeric class? is.numeric(10) returns TRUE
  • is.character() – is the content character class? is.character("Downing Street") returns TRUE
  • is.na() – is the content an NA? is.na(NA) returns TRUE

You can negate these functions by preceding them with a !, so is.na(NA) returns TRUE but !is.na(NA) returns FALSE.

Let’s start by creating an object that contains the data filtered for senior civil servants (where variable Z02 equals 5) from two of the organisations.
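A sketch of that filter (scs_data is an assumed object name, since the original isn’t shown):

```r
scs_data <- filter(
  data,
  Z02 == 5 & OverallDeptCode %in% c("ORGB", "ORGC")
)
```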

See how there are two filter statements: Z02 == 5 and OverallDeptCode %in% c("ORGB", "ORGC")? We’re asking for both of these things to be true by using the & operator between them.

Notice that we used %in% to match against a vector of department names (this is quicker than writing OverallDeptCode == "ORGB" | OverallDeptCode == "ORGC"). The names are stored as character strings, so we put them in quotation marks.

We could print the columns of interest to see if it worked, but a better method would be to return only the ‘distinct’ (unique) values in these columns:
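Using distinct() on the filtered object (assumed above to be called scs_data):

```r
distinct(scs_data, OverallDeptCode, Z02)
```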

# A tibble: 2 x 2
  OverallDeptCode       Z02
  <chr>           <dbl+lbl>
1 ORGB              5 [scs]
2 ORGC              5 [scs]

Activity

  1. Use filter() to return senior civil servants in Org A only
  2. Use distinct() to make sure it’s worked
  3. How many rows are in this filtered data set?


6.3 Add new columns

Now to create new columns. The function name is mutate(); we’re ‘mutating’ our data frame by adding a new column where there wasn’t one before. Often you’ll be creating new columns based on the content of columns that already exist, like adding the contents of one to another.

One relevant use of this for the CSPS is to create dummy columns. If certain conditions are met in other columns, we put a ‘1’ in the dummy column; otherwise we put a ‘0’.

So we could create a dummy column that flags when a respondent is an SEO/HEO grade. This example uses an ifelse() statement, which fills the column with one value if the logical test is TRUE and another if it’s FALSE.
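A sketch of this step (data_dummy is an illustrative name; data is assumed to be the survey data frame):

```r
# 1 if the respondent is SEO/HEO grade (Z02 == 3), 0 otherwise;
# the !is.na() check means NA grades fall into the 'else' branch and get a 0
data_dummy <- mutate(
  data,
  dummy = ifelse(!is.na(Z02) & Z02 == 3, 1, 0)
)

# Print the relevant columns to inspect the result
select(data_dummy, Z02, J03, dummy)
```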

# A tibble: 11,555 x 3
            Z02                                                J03 dummy
      <dbl+lbl>                                          <dbl+lbl> <dbl>
 1  2 [eo]      18 [Any other background]                              0
 2  3 [SEO/HEO]  1 [English/Welsh/Scottish/Northern Irish/British]     1
 3  2 [eo]       1 [English/Welsh/Scottish/Northern Irish/British]     0
 4  4 [G6/7]     1 [English/Welsh/Scottish/Northern Irish/British]     0
 5  3 [SEO/HEO]  1 [English/Welsh/Scottish/Northern Irish/British]     1
 6  4 [G6/7]     1 [English/Welsh/Scottish/Northern Irish/British]     0
 7  2 [eo]       1 [English/Welsh/Scottish/Northern Irish/British]     0
 8  2 [eo]       1 [English/Welsh/Scottish/Northern Irish/British]     0
 9 NA           NA                                                     0
10  4 [G6/7]    NA                                                     0
# … with 11,545 more rows

Activity

Use mutate() to create a dummy column where:

  • people who responded with ‘strongly agree’ (5) to both B01 and B02 get a 1
  • everyone else gets a 0


mutate() is particularly useful for the CSPS data when we want to overwrite our numeric values with their corresponding text labels. Fortunately, the {haven} package that we loaded earlier has a function that replaces the numeric values with their labels: as_factor().

We want to apply this only to the columns that are numeric. Fortunately there’s a variant of mutate() called mutate_if(), which lets you use logical statements to select columns. This means we don’t have to write out all their names.


Tip

  • There’s more than one function called as_factor() – how can we resolve this?
  • We can specify that we mean the one from the {haven} package by writing it in the form package::function()


6.4 Join

We can use the join() family of functions to merge two data frames together on a common column.

Let’s create a small lookup data frame that maps department codes to full department names, and merge it into our CSPS data.

We’ll use the tibble() function from {dplyr} to build the data frame. Remember: tibbles are data frames with nicer defaults and printing properties; we’ve seen them already in the outputs from our earlier wrangling with {dplyr}.
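Building the lookup might look like this (the values come straight from the printed output below; lookup is an illustrative object name):

```r
# A lookup from department code to full department name
lookup <- tibble(
  OverallDeptCode = c("ORGA", "ORGB", "ORGC"),
  dept_full_name  = c("Dept for A", "Ministry of B", "C Agency")
)

lookup
```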

# A tibble: 3 x 2
  OverallDeptCode dept_full_name
  <chr>           <chr>         
1 ORGA            Dept for A    
2 ORGB            Ministry of B 
3 ORGC            C Agency      

We want what is perhaps the most common join: left_join(). It gives you all the rows from the ‘left’ data set (in our case, data) and merges on the columns from the ‘right’ (our new lookup).

Here’s what we’ll be doing (gif by Garrick Aden-Buie):

To do this, we pass two data frames to arguments x (‘left’) and y (‘right’) and provide the column name to join by.
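In code, that looks something like this (data_joined is an illustrative name):

```r
# All rows from 'data', with lookup's columns merged on by department code
data_joined <- left_join(
  x = data,              # the 'left' data frame
  y = lookup,            # the 'right' data frame
  by = "OverallDeptCode" # the common column to join by
)
```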

Warning: Column `OverallDeptCode` has different attributes on LHS and RHS of
join

You might get a message saying that the attributes for our joining column aren’t the same. That’s okay; it’s because the column in data (the data set on the ‘LHS’, or ‘left-hand side’, of the join) has attributes, but the one in lookup (on the right-hand side) doesn’t.

Let’s check to see if rows from both data frames are present in the joined data set:
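One quick way to eyeball this is to print a handful of columns from each side of the join:

```r
# ResponseID and B01 come from 'data'; dept_full_name comes from 'lookup'
select(data_joined, ResponseID, B01, OverallDeptCode, dept_full_name)
```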

# A tibble: 11,555 x 4
   ResponseID                            B01 OverallDeptCode dept_full_name
        <dbl>                      <dbl+lbl> <chr>           <chr>         
 1     100000 4 [Agree]                      ORGA            Dept for A    
 2     100001 4 [Agree]                      ORGA            Dept for A    
 3     100002 3 [Neither agree nor disagree] ORGA            Dept for A    
 4     100003 5 [Strongly agree]             ORGA            Dept for A    
 5     100004 5 [Strongly agree]             ORGA            Dept for A    
 6     100005 5 [Strongly agree]             ORGA            Dept for A    
 7     100006 4 [Agree]                      ORGA            Dept for A    
 8     100007 4 [Agree]                      ORGA            Dept for A    
 9     100008 4 [Agree]                      ORGA            Dept for A    
10     100009 5 [Strongly agree]             ORGA            Dept for A    
# … with 11,545 more rows

Success: the output has all the rows of the data data frame, plus the new one (dept_full_name) from the lookup data frame.

6.5 Pipes

We’ve seen how to manipulate our data frame a bit. But we’ve been doing it one discrete step at a time, so your script might end up looking something like this:
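A stepwise script might look like this (a sketch: the exact columns and conditions are inferred from the output below, and the intermediate object names are illustrative):

```r
# One discrete step at a time, creating an intermediate object each time
data_selected <- select(data, ResponseID, OverallDeptCode, B01, Z02)
data_filtered <- filter(data_selected, OverallDeptCode == "ORGA")
data_mutated  <- mutate(
  data_filtered,
  positive = ifelse(B01 %in% c(4, 5), "Positive", "Not positive")
)

data_mutated
```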

# A tibble: 1,060 x 5
   ResponseID OverallDeptCode                         B01        Z02 positive   
        <dbl> <chr>                             <dbl+lbl>  <dbl+lbl> <chr>      
 1     100000 ORGA            4 [Agree]                   2 [eo]     Positive   
 2     100001 ORGA            4 [Agree]                   3 [SEO/HE… Positive   
 3     100002 ORGA            3 [Neither agree nor disag… 2 [eo]     Not positi…
 4     100003 ORGA            5 [Strongly agree]          4 [G6/7]   Positive   
 5     100004 ORGA            5 [Strongly agree]          3 [SEO/HE… Positive   
 6     100005 ORGA            5 [Strongly agree]          4 [G6/7]   Positive   
 7     100006 ORGA            4 [Agree]                   2 [eo]     Positive   
 8     100007 ORGA            4 [Agree]                   2 [eo]     Positive   
 9     100009 ORGA            5 [Strongly agree]          4 [G6/7]   Positive   
10     100010 ORGA            4 [Agree]                   4 [G6/7]   Positive   
# … with 1,050 more rows

This is fine, but you will be creating a lot of intermediate objects to get to the final data frame that you want. This clutters up your environment and can fill up your computer’s memory if the data are large enough. You’re in danger of accidentally referring to the wrong object if you don’t name them well.

Instead, you could create one object that is built by chaining all the functions together in order.

We’ll use a special pipe operator – %>% – that will read as ‘take what’s on the left of the operator and pass it through to the next function’. In pseudocode:
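The pseudocode reads like a sentence (these function names are placeholders, not real functions):

```
new_object <- take_this_data %>%  # and then...
  do_this_thing() %>%             # and then...
  do_another_thing()
```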

A real example with our data might look like this:
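A sketch of the piped version (the same assumed columns and conditions as the stepwise version, but no intermediate objects):

```r
data_piped <- data %>%
  select(ResponseID, OverallDeptCode, B01, Z02) %>%
  filter(OverallDeptCode == "ORGA") %>%
  mutate(positive = ifelse(B01 %in% c(4, 5), "Positive", "Not positive"))

data_piped
```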

# A tibble: 1,060 x 5
   ResponseID OverallDeptCode                         B01        Z02 positive   
        <dbl> <chr>                             <dbl+lbl>  <dbl+lbl> <chr>      
 1     100000 ORGA            4 [Agree]                   2 [eo]     Positive   
 2     100001 ORGA            4 [Agree]                   3 [SEO/HE… Positive   
 3     100002 ORGA            3 [Neither agree nor disag… 2 [eo]     Not positi…
 4     100003 ORGA            5 [Strongly agree]          4 [G6/7]   Positive   
 5     100004 ORGA            5 [Strongly agree]          3 [SEO/HE… Positive   
 6     100005 ORGA            5 [Strongly agree]          4 [G6/7]   Positive   
 7     100006 ORGA            4 [Agree]                   2 [eo]     Positive   
 8     100007 ORGA            4 [Agree]                   2 [eo]     Positive   
 9     100009 ORGA            5 [Strongly agree]          4 [G6/7]   Positive   
10     100010 ORGA            4 [Agree]                   4 [G6/7]   Positive   
# … with 1,050 more rows

So the steps for creating the data_piped object are:

  • take the data object
  • then select out some columns
  • then filter on a variable
  • then add a column

This is a bit like a recipe. And it’s easier to read.

You also repeat yourself less. We only need to name the data object once, at the very start. This minimises the chance that you’ll accidentally refer to the wrong object.

6.6 Save your wrangled data

There are a number of ways and formats in which to save our wrangled data.

For example, to save the output as a CSV, we can do one of these:
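For example, either of these (the filename is illustrative):

```r
# Base R
write.csv(data_piped, "output/data_piped.csv")

# Or the tidyverse equivalent from {readr}
readr::write_csv(data_piped, "output/data_piped.csv")
```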

You pass to the function the object name and the filepath for where you want it to be saved.

Note that the labels will be lost if you save as CSV, but they’re retained in .dta and .rds format.
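For those label-preserving formats (filenames again illustrative):

```r
haven::write_dta(data_piped, "output/data_piped.dta")  # Stata format, keeps labels
readr::write_rds(data_piped, "output/data_piped.rds")  # R's own format, keeps labels
```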

Check in your output/ folder to make sure they’ve been saved.

You can then read these back in as we did earlier in this document (you don’t have to do this now):
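A sketch of reading each format back in (object names are illustrative):

```r
csv_data <- readr::read_csv("output/data_piped.csv")  # labels will be absent
dta_data <- haven::read_dta("output/data_piped.dta")  # labels retained
rds_data <- readr::read_rds("output/data_piped.rds")  # labels retained
```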

7 Summarise

So far we’ve been wrangling but not analysing data. Let’s look at the summarise() function for some quick summaries.

A simple example might be to get the total count of responses in the data set and the mean of the engagement scores.
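Something like this (assuming the engagement score column is called ees; the column may be named differently in your data):

```r
summarise(
  data,
  total_count = n(),                   # number of rows (responses)
  ees_mean = mean(ees, na.rm = TRUE)   # mean engagement score, ignoring NAs
)
```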

# A tibble: 1 x 2
  total_count ees_mean
        <int>    <dbl>
1       11555     0.59

That’s good, but we can extend the summary so we get results grouped by some other variables. This is what the group_by() function does. You give group_by() the variables within which to summarise, and you finish by calling ungroup() so that subsequent functions aren’t applied to the groups.

So here’s a more comprehensive example that gets the total count and mean EES grouped by department and the Z02 variable (grade). It first filters out people who didn’t answer Z02, and uses mutate() to suppress any mean EES values based on fewer than 10 responses.
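A sketch of those steps chained with the pipe (again assuming the engagement score column is called ees):

```r
data %>%
  filter(!is.na(Z02)) %>%             # drop people who didn't answer Z02
  group_by(OverallDeptCode, Z02) %>%  # summarise within dept-by-grade groups
  summarise(
    total_count = n(),
    ees_mean = round(mean(ees, na.rm = TRUE), 2)
  ) %>%
  ungroup() %>%                       # stop applying functions by group
  mutate(                             # suppress small-sample means
    ees_mean_supp = ifelse(total_count < 10, NA, ees_mean)
  )
```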

# A tibble: 15 x 5
   OverallDeptCode         Z02 total_count ees_mean ees_mean_supp
   <chr>             <dbl+lbl>       <int>    <dbl>         <dbl>
 1 ORGA            1 [AO/AA]             4    0.75         NA    
 2 ORGA            2 [eo]              131    0.7           0.7  
 3 ORGA            3 [SEO/HEO]         438    0.7           0.7  
 4 ORGA            4 [G6/7]            487    0.64          0.64 
 5 ORGA            5 [scs]              78    0.85          0.85 
 6 ORGB            1 [AO/AA]          4930    0.56          0.56 
 7 ORGB            2 [eo]             1513    0.580         0.580
 8 ORGB            3 [SEO/HEO]        2359    0.64          0.64 
 9 ORGB            4 [G6/7]            133    0.76          0.76 
10 ORGB            5 [scs]              18    0.570         0.570
11 ORGC            1 [AO/AA]             1    1            NA    
12 ORGC            2 [eo]               17    0.79          0.79 
13 ORGC            3 [SEO/HEO]          24    0.61          0.61 
14 ORGC            4 [G6/7]             37    0.6           0.6  
15 ORGC            5 [scs]               4    0.78         NA    

8 Plot

We could have a whole separate session on visualising data.

The tidyverse package for plotting is called {ggplot2}. The ‘gg’ stands for ‘grammar of graphics’. It’s a system to build up a graphic using common components including:

  • data
  • ‘geoms’ (marks that represent the data)
  • a coordinate system

You also supply aesthetic properties like size, colour, x and y locations.


These elements are built up with the + operator. Imagine you’ve created a blank canvas and you’re adding each layer. (This is different to using the pipe, %>%, which is passing information from the left-hand side to the right-hand side.)

The great thing about building plots with code is that you can produce them with the same styles very quickly without all the manual adjustments that might be required in some other programs.

{ggplot2} is a very powerful graphics package that can create all sorts of charts. Check out the R Graph Gallery for some more examples.

8.1 Simple plots

For now, let’s look at a simple bar chart of the answers to question B01 using the ggplot() function from {ggplot2}.
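A minimal version might look like this (assuming plot_data holds one row per B01 response option, with a count column called n; the column names are assumptions):

```r
ggplot(plot_data, aes(x = B01, y = n)) +  # data, then aesthetic mappings
  geom_col()                              # column geom: a bar chart
```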

What just happened? We:

  • gave ggplot() the dataset, plot_data
  • gave it some aesthetic mappings, aes() (in this case, the x and y variables)
  • added the column geom, geom_col(), to make a bar chart

We can spruce this up a little by adding on additional things like a theme or labels.

But we could also split each department’s results into a grid of small multiples, or ‘facets’, with facet_grid().
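A sketch of both ideas together (assuming plot_data also carries an OverallDeptCode column, so the counts can be split by department; the labels are illustrative):

```r
ggplot(plot_data, aes(x = B01, y = n)) +
  geom_col() +
  labs(
    title = "Responses to question B01",
    caption = "Source: synthetic CSPS data") +  # add labels...
  theme_light() +                               # ...and a theme
  facet_grid(. ~ OverallDeptCode)               # one panel per department
```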

8.2 Advanced plot

We can also use {ggplot2} to recreate the style of bar charts used in the PDF reports of People Survey results. First we need to process the data: filter for ORGB, reshape it into a plottable format, and calculate percentages. This section uses a couple of tidyverse packages we haven’t seen yet: {tidyr} for reshaping data frames and {forcats} for working with factors. Their functions are written as package::function() to make them more apparent.
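The processing might look something like this (a sketch: the exact steps are an assumption, but they produce the same column structure as the output below):

```r
plot_data2 <- data %>%
  dplyr::filter(OverallDeptCode == "ORGB") %>%
  dplyr::select(B47:B52) %>%
  tidyr::pivot_longer(                   # reshape: one row per question-response
    cols = dplyr::everything(),
    names_to = "question",
    values_to = "value") %>%
  dplyr::filter(!is.na(value)) %>%
  dplyr::mutate(
    value = haven::as_factor(value),     # numeric codes to text labels
    value = forcats::fct_rev(value)) %>% # reversed for label positioning later
  dplyr::count(question, value, name = "response_count") %>%
  dplyr::group_by(question) %>%
  dplyr::mutate(
    question_count = sum(response_count),       # responses per question
    pc = response_count / question_count) %>%   # percentage of responses
  dplyr::ungroup()
```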

# A tibble: 25 x 5
   question value                      response_count question_count     pc
   <fct>    <fct>                               <int>          <int>  <dbl>
 1 B47      Strongly disagree                     455          10158 0.0448
 2 B47      Disagree                              904          10158 0.0890
 3 B47      Neither agree nor disagree           2397          10158 0.236 
 4 B47      Agree                                4410          10158 0.434 
 5 B47      Strongly agree                       1992          10158 0.196 
 6 B48      Strongly disagree                    1197          10152 0.118 
 7 B48      Disagree                             1995          10152 0.197 
 8 B48      Neither agree nor disagree           3006          10152 0.296 
 9 B48      Agree                                2950          10152 0.291 
10 B48      Strongly agree                       1004          10152 0.0989
# … with 15 more rows

We now have a dataset that has counted the responses for each question-value pair (response_count), the number of responses for each question (question_count) and a percentage response (pc), for questions B47-B52 for respondents in ORGB.

We can now plot this data. Rather than department, we’ll plot the questions on the x-axis and our calculated percentage on the y-axis. (We’ll actually flip these axes at the very end, so it’s best to keep thinking of them in their original x-y positions.)

We can also add data labels, using geom_text().

The PDF survey reports use a colourblind friendly pink-green scale from the {RColorBrewer} package, which provides the palettes developed by the Color Brewer project.

Finally, we apply some customisation to the theme to remove the axis titles, reposition the legend, give the legend keys an outline, and format the title text.

ggplot(plot_data2, aes(x = question, y = pc)) +
  geom_col(aes(fill = value), width = 0.75, colour = "gray60", size = 0.2) +
  geom_text(
    aes(
      label = scales::percent(pc, accuracy = 1),
      colour = value),
    position = position_fill(vjust = 0.5),
    size = 3,
    show.legend = FALSE) +
  # geom_text adds text labels, we set the label aesthetic to the text
  # we've also mapped the colour aesthetic to vary the label text's colour
  # text positioning can be tricky, this is why the value factor was reversed
  # when we created plot_data2 ¯\_(ツ)_/¯
  scale_y_reverse() +
  # reverse the y-axis so that strongly agree will be on the left-hand side
  scale_fill_brewer(palette = "PiYG", direction = -1) +
  # the PiYG palette is the same as is used in the highlights reports
  # it is colourblind friendly, so recommended instead of basic red-green
  scale_colour_manual(
    values = c("Strongly agree" = "white", 
               "Agree" = "gray20",
               "Neither agree nor disagree" = "gray20",
               "Disagree" = "gray20",
               "Strongly disagree" = "white")) +
  # this provides the colours for the text labels, so that the labels for the 
  # 'strongly' values have white text, and the others have grey text
  coord_flip() +
  # flip the axis
  labs(
    title = "Employee engagement question results",
    subtitle = "Almost two-thirds of staff are proud to work for ORG B",
    caption = "Source: B47-B52, ORGB, synthetic CSPS data") +
  theme_light() +
  theme(
    panel.grid = element_blank(),
    # element_blank() removes an element from the plot
    panel.border = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.title.y = element_blank(),
    axis.ticks = element_blank(),
    legend.position = "top",
    legend.title = element_blank(),
    legend.key.size = unit(1, "char"),
    legend.margin = margin(1,0,0,0, "char"),
    plot.title = element_text(face = "bold"))

9 Learn more

9.1 R and RStudio

9.2 Getting help

  • Many departments have R groups and specialists – see if you can join a network or identify other R users in your organisation
  • Many departments also have Coffee & Coding sessions for learning about R, asking questions and sharing ideas
  • Often it helps to produce a small reproducible example (a ‘reprex’) of your code if you run into trouble
  • Getting help with R page of resources from RStudio
  • Explore questions and answers tagged as r on Stack Overflow, or even ask your own question