This session is a very brief introduction to R and RStudio for beginners, with reference to Civil Service People Survey (CSPS) data.
There’s a lot of material about R and about the CSPS data that we aren’t going to have time to cover today.
We’ll be developing this guidance and making it freely available on the web. It will include more information on tidying, analysing, plotting and reporting.
Important note on the data: the data used in this document is not real data. Instead, it's a 'synthetic' version created using the {synthpop} package. This preserves the data distributions without any response being sampled from a real individual.
This session is designed as a 'code-along'. You'll be asked to type what you see on screen as we progress.
Ideally you already have R and RStudio installed and are able to download packages, in which case you can do everything outlined in this document from your own computer.
Don’t worry if you don’t have R and RStudio downloaded, or you can’t download packages. Instead, we’ve set up an instance of RStudio in the cloud for this training session, using a non-profit service called Binder.
Click the button below to launch RStudio in your browser with Binder. It may take a few moments to load; retry or switch browser if it fails.
In this RStudio instance, the folder structure is set up and the packages and data are pre-installed for you, so you won't need to run install.packages(), but you should still use library().

Note that this is not how you would normally access RStudio; it has been set up so you are able to follow along with the demonstrations in the session.
After a period of inactivity, your instance of Binder will shut down and any code you write won't be preserved. You will need to copy, paste and save anything you write into a file on your computer instead.
The annual CSPS produces a lot of data each year. Departments are provided with summary reports, but can access response-level data (‘microdata’) to perform their own in-depth analyses.
Tools like Excel, SPSS and Stata are used across government to analyse the microdata. Many of these are proprietary and require expensive licenses. This variety can make it tricky for analysts to share approaches between departments, and even within them.
Activity
Of course, every analyst and every department is welcome to use the tools that are available to them, that they understand and that get the job done. Having said this, we’re advocating for the statistical programming language R and the RStudio code editor.
Why R? It's a free, open-source language designed for statistical analysis.
RStudio is a popular and well-supported piece of software for editing and running R code for both beginners and advanced users. It’s also free of charge and the company behind it is a public benefit corporation with a commitment to producing open source software.
In particular, the CSPS team are developing some R-based tools for analysing CSPS data specifically. You will be able to download a package called {cspstools} that contains common functions for analysing CSPS data. This will help provide consistency in analysis and reporting and make tasks easier to perform and more reproducible. The tools will be shared in the open for anyone to use and so that anyone can help to improve them.
Before starting, you should download:

- R, the programming language
- RStudio, an editor for writing and running R code

Both are free, but you might need to get in touch with your IT team to get them installed on your computer.
Open RStudio – its icon is a white letter ‘R’ in a blue circle:
When you open RStudio for the first time, you’ll see the window is split into three ‘panes’, which are numbered below:
Your window may not look exactly like this one, depending on your operating system.
Labelled in the image are:
We don’t need to concern ourselves with every button and tab for now.
Binder users
You don't need to run this section because you are already working in a Project folder with the correct folder structure.
There are many benefits to having one folder per analytical project. It means your work is more:
It also means you can refer to files with short relative paths, like data/dataset.csv, rather than fragile absolute paths like file/path/on/my/personal/machine/that/you/cannot/access.csv.
RStudio has a system that helps you set this up: an 'RStudio Project'. To create one, click File > New Project, select 'New Directory' and then 'New Project', and choose a name and location for it (we suggest the name csps-r for this session).

This creates a folder where you specified that contains an RStudio Project file (extension '.Rproj'). This folder is the 'home' of your project and this is where you should house all the files and code that you need.
For now, create two new folders – data and output – in your Project folder (we'll be using these later).
We haven’t created any R script files yet, but they’ll go in the project folder too.
This means we’ll get a folder structure like this:
csps-r/ # the project folder
├── csps-r.Rproj # R Project file
├── data/ # read-only raw data
├── output/ # processed data
└── training.R # R script files
To access your RStudio Project in future, navigate to the project folder and double-click your R Project file, which has the .Rproj extension (e.g. your-project.Rproj). It will open RStudio in the same state that you left it when you last closed it.
You'll write your code into a special text file called an R script, which has the extension .R.
Having opened the R Project (.Rproj) file for your analysis, open a new script by clicking File > New File > R script. A new blank script will appear in a new pane in the upper left of the RStudio window.
You can type or copy and paste code into this document. This serves as a record of the actions you used to analyse the data step-by-step.
Tip
How do you actually run some R code? Let's start with a small calculation.

First, we'll add two numbers together. Type the calculation 1 + 1 into your script:
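1 + 1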
To execute it, make sure your cursor is on the same line as the code and press Command+Enter on a Mac or Control+Enter on a PC (there's also a 'Run' button in the upper right of the script pane). You can also run multiple lines of code by highlighting them and then running them in the same way.
What happened when you ran the code? The following was printed to the console in the lower-left pane of RStudio:
[1] 2
Great, we got the answer 2, as expected. (The number in square brackets is related to the number of items returned in the answer and doesn't concern us right now.)
Tip
Remember to save your script regularly: click File > Save or use the Control+S or Command+S shortcuts.

This is good, but ideally we want to store objects (values, tables, plots, etc), so we can refer to them in other pieces of code later.
You do this in R with a special operator: the 'assignment arrow', which is written as <-. The keyboard shortcut for it is Alt+- (Alt and a hyphen).

For example, we can assign 1 + 1 to the name my_num with <-. Execute the following code:
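my_num <- 1 + 1  # assign the result of 1 + 1 to the name 'my_num'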
Hm. Nothing printed out in the console. Instead the object is now stored in your ‘environment’ – see the top right pane in RStudio:
You can now refer to this object by name in your script. For example, you can print it:
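print(my_num)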
[1] 2
Tip
Typing just the object name, my_num, is equivalent to print(my_num). We'll use print() throughout to be more explicit.

The real benefit of this is that you don't have to repeat yourself every time you want that particular calculation. For example, you can refer to the object in new expressions:
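my_num * 5  # for example, multiply the stored value by 5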
[1] 10
Tip
Give your objects short, descriptive names, like var_mean and var_median.
Activity
- Create an object called val1 that stores the value 543
- Create an object called val2 that stores the value 612
- Create an object called calc that is the multiplication (*) of val1 and val2
- What do you get when you print calc?
01:00
We stored a numeric value in the last section. We can do more than just store one item of data at a time though.
This next chunk of code combines multiple elements with the c() command. This kind of multi-element object is called a 'vector'.

Here's a vector that contains text rather than numbers. You put character strings inside quotation marks (""), which isn't needed for numbers.
# Create an example vector
dept_names <- c("DfE", "DHSC", "DfT") # combine some values
print(dept_names) # have a look at what the object contains
[1] "DfE" "DHSC" "DfT"
So each of the elements of the object was returned.
You can see what 'class' your object is at any time with the class() function.
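class(my_num)  # our single numeric value from earlier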
[1] "numeric"
[1] "character"
Tip
You can create a sequence of integers with a colon, so 1:3 is a shortcut for c(1, 2, 3).
So we've created objects composed of a single value (my_num) and a vector of values (dept_names).
The next step would be to combine a number of vectors together to create a table with rows and columns. Tables of data with rows and columns are called ‘data frames’ in R and are effectively a bunch of vectors of the same length stuck together.
Here’s an example of a data frame built from scratch:
# Create a data frame of selected departments
dept_info <- data.frame(
dept = dept_names, # use vector from earlier
headcount = c(6900, 8300, 15000),
responsibility = c("Education", "Health", "Transport")
)
print(dept_info) # see the data frame
dept headcount responsibility
1 DfE 6900 Education
2 DHSC 8300 Health
3 DfT 15000 Transport
Can you see how the data frame is three vectors (dept, headcount and responsibility) of the same length (3 values) arranged into columns? The function data.frame() bound these together into a table format. Let's check the class:
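class(dept_info)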
[1] "data.frame"
R is capable of building very complex objects, but tabular data with rows and columns is ubiquitous and it’s how the CSPS data is stored. We’ll be focusing on data frames for now.
You've been using functions already: print(), c(), data.frame() and class().
A function is a reproducible unit of code that performs a given task, such as reading a data file or fitting a model. There are many of these built into R already, but you can also download 'packages' of functions, and you can even create your own.
Functions are written as the function name followed by brackets. The brackets contain the 'arguments', which are like the settings for the function. One argument might be a filepath to some data; another might describe the colour of points to be plotted. They're separated by commas.
So a generic function might look like this:
# This isn't a real function; don't run it
function_name(
data = my_data,
colour = "red",
option = 5
)
Note that you can break the function over several lines to improve readability and so you can comment on individual arguments. You can put your cursor on any of these lines and run it. You don’t have to highlight the whole thing.
You can type a question mark followed by a function name to learn about its arguments. The help file will appear in the bottom-right pane. For example, ?plot().
Tip
You can also write your own functions. Here's a simple example:
# Define a function that adds two numbers
add_nums <- function(val_a, val_b) {
val_a + val_b
}
add_nums(val_a = 3, val_b = 4) # use the function
[1] 7
Binder users
You don't need to use install.packages() because the packages have already been installed for you; you will need to use library() though.
Functions can be bundled into packages. A bunch of packages are pre-installed with R, but there are thousands more available for download. These packages extend the basic capabilities of R or improve them.
Packages can be installed to your computer using the install.packages()
function. This automatically fetches and downloads packages from a centralised package database on the internet called CRAN, which only accepts packages that meet strict quality criteria.
We're going to use a few packages to help us, including {dplyr} for manipulating data, {haven} for reading Stata files and {ggplot2} for plotting.

Typically you would type install.packages("packagename") to download each package individually, but we can use the following to install the packages of the tidyverse all at once:

install.packages("tidyverse")
You only need to run the installation function once per package on your machine.
Each time you start a new session you'll need to run library("package_name") to tell R to make the functions from that package available for use in your script.
So now we have the tidyverse packages installed, we can load the ones we need with library().
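library(dplyr)    # functions for manipulating data
library(haven)    # functions for reading Stata files (used shortly)
library(ggplot2)  # functions for plotting (used later)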
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Sometimes a message will be printed to tell you a bit more about the package, which is what happens for {dplyr}.
We can start using functions from these packages now that they’re loaded.
It's good practice to write the library() lines near the top of your script file so that others know which packages are being used in the script.
We aren’t using real CSPS data for these exercises. Instead, we’ll be using a ‘synthetic’ version that mimics the 2019 data.
In short, the data distributions within the variables are preserved, but no response represents a real individual, so we can get realistic-looking outputs.
We’ve also restricted the number of variables (columns) and rows (responses) to keep the data set relatively small, and have added a fake unique ID value.
The variables in the synthetic data set are:
Binder users
You don't need to run this section because the data set is already in your data/ folder.
Ordinarily we would send you the data for your organisation on request. For this session, we’ve prepared the synthetic data set as a Stata-format (.dta) file.
You can download the data from the Cabinet Office GitHub page. Visit the link, click the 'download' button and save the downloaded file to the data/ folder of your project.
Tip
You could also download the file to your machine with the download.file() function. The first argument, url, is the path to the file on the internet. The destfile argument is where you want to save the file on your computer; we want to put it in data/.
# The location of the data on the internet
path <-
"https://github.com/co-analysis/csps-with-r/blob/master/data/csps_synth.dta?raw=true"
# Download from the internet to your computer
download.file(
url = path, # the path to the file on the internet
destfile = "data/csps_synth.dta", # where to save on your computer
mode = "wb" # save as a 'binary file'
)
Now take a look at the 'Files' pane in RStudio and navigate into the data/ folder. The csps_synth.dta file should now be in there.
There are a number of functions for reading data into R. A common one is read_csv() from the tidyverse's {readr} package.

The {haven} package has a function called read_stata() that you can use to read in a .dta file like ours. Let's read in the data with this function and name the object 'data'.
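# Read the file and name the object 'data'
data <- read_stata("data/csps_synth.dta")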
This will read the data in as a ‘tibble’, a fancier type of data frame that’s used by the tidyverse packages. For example, when printed to the console, tibbles use colour coding and are truncated to fit.
Activity
How do you know that the data has been successfully read into R?
It's good to preview the data and check it looks like what we expected.

The {dplyr} package that we loaded earlier has a function called glimpse(), which tells you about the structure of the data.
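glimpse(data)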
Rows: 11,555
Columns: 38
$ ResponseID <dbl> 100000, 100001, 100002, 100003, 100004, 100005, 10000…
$ OverallDeptCode <chr> "ORGA", "ORGA", "ORGA", "ORGA", "ORGA", "ORGA", "ORGA…
$ B01 <dbl+lbl> 4, 4, 3, 5, 5, 5, 4, 4, 4, 5, 4, 4, 5, 4, 4, 4, 3…
$ B02 <dbl+lbl> 4, 4, 4, 5, 5, 5, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 3…
$ B03 <dbl+lbl> 3, 4, 3, 5, 5, 5, 4, 3, 3, 4, 2, 4, 2, 4, 4, 2, 2…
$ B04 <dbl+lbl> 3, 4, 4, 5, 4, 5, 4, 2, 4, 4, 1, 4, 4, 4, 3, 1, 2…
$ B05 <dbl+lbl> 4, 4, 4, 5, 4, 5, 3, 4, 5, 5, 4, 3, 5, 3, 3, 4, 3…
$ B47 <dbl+lbl> 4, 3, 4, 4, 5, 5, 4, 3, 3, 4, 2, 4, 3, 4, 4, 4, 3…
$ B48 <dbl+lbl> 4, 3, 4, 5, 5, 5, 4, 4, 3, 4, 2, 4, 5, 4, 4, 4, 3…
$ B49 <dbl+lbl> 4, 2, 4, 4, 5, 5, 4, 3, 2, 2, 2, 2, 2, 4, 2, 3, 2…
$ B50 <dbl+lbl> 4, 2, 4, 4, 5, 5, 4, 4, 2, 3, 3, 2, 4, 4, 4, 3, 2…
$ B51 <dbl+lbl> 4, 2, 4, 4, 5, 5, 4, 4, 2, 3, 3, 2, 4, 4, 2, 3, 3…
$ E03 <dbl+lbl> 1, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 1, 4…
$ E03_GRP <dbl+lbl> 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2…
$ E03A_01 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA…
$ E03A_02 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA…
$ E03A_03 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_04 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_05 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_06 <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_07 <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_08 <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_09 <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_10 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_11 <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_12 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_13 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_14 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_15 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_16 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ W01 <dbl> 7, 10, 9, 10, 10, 10, 7, 9, 8, 10, 8, 5, 8, 9, 3, 8, …
$ W02 <dbl> 7, 10, 9, 10, 10, 10, 7, 10, 7, 10, 10, 5, 8, 11, 3, …
$ W03 <dbl> 8, 10, NA, 10, 11, 10, 7, 8, 6, 9, 9, 4, 9, 3, 11, 8,…
$ W04 <dbl> 7, 1, 6, 5, 2, 4, 1, 8, 6, 5, 2, 9, 8, 1, 8, 4, 1, 4,…
$ J03 <dbl+lbl> 18, 1, 1, 1, 1, 1, 1, 1, NA, NA, 1, 1, 1, 1, 1, 1…
$ Z02 <dbl+lbl> 2, 3, 2, 4, 3, 4, 2, 2, NA, 4, 4, 3, 4, 5, 3, 2, …
$ ees <dbl> 0.75, 0.35, 0.75, 0.80, 1.00, 1.00, 0.75, 0.65, 0.35,…
$ mw_p <dbl> 0.6, 1.0, 0.6, 1.0, 1.0, 1.0, 0.8, 0.4, 0.8, 1.0, 0.6…
The top of the output tells us there are 11,555 observations (rows) and 38 variables (columns).

Column names are then listed with the data type and the first few values. For example, 'OverallDeptCode' contains character class (<chr>) data in the form of strings. Column names starting with 'B', 'E', 'J' and 'Z' are question codes and they contain responses expressed in numeric form, so they're of class 'double' (<dbl>).
The numbers encode certain responses. For example, 1 means 'strongly disagree' and 5 means 'strongly agree' for the 'B' series of questions.
How do we know what all the numeric values mean? You'll see that a number of the columns have the label class (<lbl>) too. This means that the column carries additional 'attributes' that give the corresponding labels for the values.
Labels aren’t used that frequently in R data frames, but are used in programs like Stata and SPSS. Since we’ve read in a Stata file, we’ve got these labels available to us.
You can also see that there are lots of NA values. R uses NA to mean 'not available' – the data are missing. In this case, it means that the respondent didn't supply an answer for that question.
Another way of inspecting the data is to print() it to the console.
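print(data)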
# A tibble: 11,555 x 38
ResponseID OverallDeptCode B01 B02 B03 B04 B05 B47
<dbl> <chr> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l>
1 100000 ORGA 4 [Agr… 4 [Agr… 3 [Nei… 3 [Nei… 4 [Agr… 4 [Agr…
2 100001 ORGA 4 [Agr… 4 [Agr… 4 [Agr… 4 [Agr… 4 [Agr… 3 [Nei…
3 100002 ORGA 3 [Nei… 4 [Agr… 3 [Nei… 4 [Agr… 4 [Agr… 4 [Agr…
4 100003 ORGA 5 [Str… 5 [Str… 5 [Str… 5 [Str… 5 [Str… 4 [Agr…
5 100004 ORGA 5 [Str… 5 [Str… 5 [Str… 4 [Agr… 4 [Agr… 5 [Str…
6 100005 ORGA 5 [Str… 5 [Str… 5 [Str… 5 [Str… 5 [Str… 5 [Str…
7 100006 ORGA 4 [Agr… 4 [Agr… 4 [Agr… 4 [Agr… 3 [Nei… 4 [Agr…
8 100007 ORGA 4 [Agr… 3 [Nei… 3 [Nei… 2 [Dis… 4 [Agr… 3 [Nei…
9 100008 ORGA 4 [Agr… 4 [Agr… 3 [Nei… 4 [Agr… 5 [Str… 3 [Nei…
10 100009 ORGA 5 [Str… 4 [Agr… 4 [Agr… 4 [Agr… 5 [Str… 4 [Agr…
# … with 11,545 more rows, and 30 more variables: B48 <dbl+lbl>, B49 <dbl+lbl>,
# B50 <dbl+lbl>, B51 <dbl+lbl>, E03 <dbl+lbl>, E03_GRP <dbl+lbl>,
# E03A_01 <dbl+lbl>, E03A_02 <dbl+lbl>, E03A_03 <dbl+lbl>, E03A_04 <dbl+lbl>,
# E03A_05 <dbl+lbl>, E03A_06 <dbl+lbl>, E03A_07 <dbl+lbl>, E03A_08 <dbl+lbl>,
# E03A_09 <dbl+lbl>, E03A_10 <dbl+lbl>, E03A_11 <dbl+lbl>, E03A_12 <dbl+lbl>,
# E03A_13 <dbl+lbl>, E03A_14 <dbl+lbl>, E03A_15 <dbl+lbl>, E03A_16 <dbl+lbl>,
# W01 <dbl>, W02 <dbl>, W03 <dbl>, W04 <dbl>, J03 <dbl+lbl>, Z02 <dbl+lbl>,
# ees <dbl>, mw_p <dbl>
The output is displayed in table format, but is truncated to fit the console window (this prevents you from printing millions of rows to the console!). You can see the labels are printed alongside the values in this view.
If you want to see the whole dataset you could use the View() function:
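View(data)  # note the capital 'V'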
This opens up a read-only tab in the script pane that displays your data in full. You can scroll around and order the columns by clicking the headers. This doesn’t affect the underlying data at all.
You can also access this by clicking the little image of a table to the right of the object in the environment pane (upper-right).
We’re going to use a number of functions from the {dplyr} package, which we loaded earlier, to practice some data manipulation.
Functions in the tidyverse suite of packages are usually verbs that describe what they're doing, like select() and filter().
We won’t have time to go through all of the functions and their variants, but you should get a flavour of what’s possible.
Firstly, we can select() columns of interest. This means we can return a version of the data set composed of a smaller number of columns, which lets us focus on specific variables of interest.

The {dplyr} functions take the data frame as their first argument, so the first thing we'll supply to the function is our data object. Then we can supply the names of columns that we want to keep. Note that we can also rename columns as we select them with the format new_name = old_name. (Alternatively, there is a rename() function that only renames columns.)
# Return specific columns
select(
data, # the first argument is the data
Z02, ethnicity = J03 # then the columns to keep
)
# A tibble: 11,555 x 2
Z02 ethnicity
<dbl+lbl> <dbl+lbl>
1 2 [eo] 18 [Any other background]
2 3 [SEO/HEO] 1 [English/Welsh/Scottish/Northern Irish/British]
3 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British]
4 4 [G6/7] 1 [English/Welsh/Scottish/Northern Irish/British]
5 3 [SEO/HEO] 1 [English/Welsh/Scottish/Northern Irish/British]
6 4 [G6/7] 1 [English/Welsh/Scottish/Northern Irish/British]
7 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British]
8 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British]
9 NA NA
10 4 [G6/7] NA
# … with 11,545 more rows
See that the order in which we selected the columns is the order in which they appeared when printed.
Instead of naming columns to keep, you can also specify columns to remove by prefixing the column name with a - (minus).
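For example, to return everything except the respondent ID:

select(data, -ResponseID)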
Tip
- Our original object (data) remains unchanged, despite us having selected some columns
- Writing data <- select(data, B01) would overwrite our original data object

To save time you can use some special select() helper functions. For example, you can select columns whose names contain (contains()) or start with (starts_with()) certain strings. This is useful if you have lots of columns that share a similarity in their names, like in the CSPS (e.g. B01, B02, etc, all start with "B").
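For example, we can keep ResponseID plus every column whose name starts with "W":

select(data, ResponseID, starts_with("W"))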
# A tibble: 11,555 x 5
ResponseID W01 W02 W03 W04
<dbl> <dbl> <dbl> <dbl> <dbl>
1 100000 7 7 8 7
2 100001 10 10 10 1
3 100002 9 9 NA 6
4 100003 10 10 10 5
5 100004 10 10 11 2
6 100005 10 10 10 4
7 100006 7 7 7 1
8 100007 9 10 8 8
9 100008 8 7 6 6
10 100009 10 10 9 5
# … with 11,545 more rows
Activity
- Use select() to return all the 'B' series columns (B01, B02, etc)
02:00
Now to filter the rows of the data set based on certain criteria.
We're going to make use of some logical operators for filtering our data. These return TRUE or FALSE depending on the statement's validity.
Symbol | Meaning | Example
---|---|---
== | Equal to | 5 == 2 + 3 returns TRUE
!= | Not equal to | 5 != 3 + 3 returns TRUE
%in% | Match to a vector (shortcut for multiple logical tests) | 4 %in% c(2, 4, 6) returns TRUE
>, < | Greater than, less than | 2 < 3 returns TRUE
>=, <= | Greater than or equal to, less than or equal to | 5 <= 5 returns TRUE
& | And (helps string together multiple filters) | 1 < 2 & 5 == 5 returns TRUE
\| | Or (helps string together multiple filters) | 1 < 2 \| 5 == 6 returns TRUE (only one of them needs to be true)
R also has some special shortcut functions for some logical checks. For example:

Function | Meaning | Example
---|---|---
is.numeric() | Is the content numeric class? | is.numeric(10) returns TRUE
is.character() | Is the content character class? | is.character("Downing Street") returns TRUE
is.na() | Is the content an NA? | is.na(NA) returns TRUE
You can negate these functions by preceding them with a !, so is.na(NA) returns TRUE but !is.na(NA) returns FALSE.
Let’s start by creating an object that contains the data filtered for senior civil servants (where variable Z02 equals 5) from two of the organisations.
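We'll store the result in a new object (named data_scs here):

data_scs <- filter(
  data,  # the data frame to filter
  Z02 == 5 & OverallDeptCode %in% c("ORGB", "ORGC")  # both must be true
)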
See how there are two filter statements: Z02 == 5 and OverallDeptCode %in% c("ORGB", "ORGC")? We're asking for both of these things to be true by using the & operator between them.

Notice that we used %in% to match to a vector of department codes (this is quicker than writing OverallDeptCode == "ORGB" | OverallDeptCode == "ORGC"). The codes are stored as character strings, so we put them in quotation marks.
We could print the columns of interest to see if it worked, but a better method would be to return only the ‘distinct’ (unique) values in these columns:
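distinct(data_scs, OverallDeptCode, Z02)  # unique combinations in our filtered object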
# A tibble: 2 x 2
OverallDeptCode Z02
<chr> <dbl+lbl>
1 ORGB 5 [scs]
2 ORGC 5 [scs]
Activity
- Use filter() to return senior civil servants in Org A only
- Use distinct() to make sure it's worked
02:00
Now to create new columns. The function name is mutate(); we're 'mutating' our data frame by adding a new column where there wasn't one before. Often you'll be creating new columns based on the content of columns that already exist, like adding the contents of one to another.

One relevant use of this for the CSPS is to create dummy columns: if certain conditions are met in other columns we can put a '1' in the dummy column, or a '0' otherwise.
So we could create a dummy column that flags respondents who are SEO/HEO grade (Z02 is 3) and whose ethnicity code (J03) is in the range 1 to 4. This example uses an ifelse() statement that fills the column with one value if the logical test is TRUE and another if it's FALSE.
# Add a column that gets a 1 when the condition is true
data_dummy <- mutate(
data,
dummy = ifelse( # create a new column called 'dummy'
test = Z02 == 3 & J03 %in% 1:4, # test this condition
yes = 1, # if TRUE, put a 1 in the dummy column
no = 0 # otherwise put a 0 in the column
)
)
# See if it worked
select(data_dummy, Z02, J03, dummy)
# A tibble: 11,555 x 3
Z02 J03 dummy
<dbl+lbl> <dbl+lbl> <dbl>
1 2 [eo] 18 [Any other background] 0
2 3 [SEO/HEO] 1 [English/Welsh/Scottish/Northern Irish/British] 1
3 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British] 0
4 4 [G6/7] 1 [English/Welsh/Scottish/Northern Irish/British] 0
5 3 [SEO/HEO] 1 [English/Welsh/Scottish/Northern Irish/British] 1
6 4 [G6/7] 1 [English/Welsh/Scottish/Northern Irish/British] 0
7 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British] 0
8 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British] 0
9 NA NA 0
10 4 [G6/7] NA 0
# … with 11,545 more rows
Activity
Use mutate() to create a dummy column where:
- respondents who answered 'strongly agree' (5) to both B01 and B02 get a 1
- everyone else gets a 0
02:00
Mutating is particularly useful for the CSPS data when we want to overwrite the numeric values with their corresponding text labels. Fortunately, the {haven} package that we loaded earlier has a function that replaces the numeric values with their labels: as_factor().
We want to apply this only to the columns that are numeric. Fortunately there's a variant of mutate() called mutate_if(), which lets you use logical statements to select the columns to change. This means we don't have to write out all their names.
# Replace labelled numeric values with their text labels
data_lbl <- mutate_if(
  data,
  is.numeric, # if the column is numeric
  haven::as_factor # then apply the as_factor function
)

# Then convert every column to character class
data_lbl_chr <- mutate_all(data_lbl, as.character)

glimpse(data_lbl_chr) # check the structure of the result
Tip
More than one package can contain a function called as_factor() – how can we resolve this? Use the package::function() syntax, as in haven::as_factor(), to be explicit about which package's function you mean.
We can use one of the join() family of functions to merge two data frames together on a common column.

Let's create a small, trivial data frame that provides a lookup from department codes to full department names, and merge it into our CSPS data.

We'll use the tibble() function from {dplyr} to build the data frame. Remember: tibbles are data frames with nice defaults and printing properties; we've seen them already in the outputs from our earlier wrangling with {dplyr}.
lookup <- tibble(
OverallDeptCode = c("ORGA", "ORGB", "ORGC"),
dept_full_name = c("Dept for A", "Ministry of B", "C Agency")
)
print(lookup)
# A tibble: 3 x 2
OverallDeptCode dept_full_name
<chr> <chr>
1 ORGA Dept for A
2 ORGB Ministry of B
3 ORGC C Agency
We want what is perhaps the most common join: left_join(). It gives you all the rows from the 'left' data set (in our case, data) and merges on the columns from the 'right' (our new lookup).
Here’s what we’ll be doing (gif by Garrick Aden-Buie):
To do this, we pass two data frames to the arguments x ('left') and y ('right') and provide the name of the common column to join by.
data_join <- left_join(
x = data, # the original data set
y = lookup, # the data to merge to it
by = "OverallDeptCode" # the common column between them
)
Warning: Column `OverallDeptCode` has different attributes on LHS and RHS of
join
You might get a message saying that the attributes for our joining column aren't the same. That's okay; it's because the column in data (the data set on the 'LHS', or 'left-hand side', of the join) has attributes, but the one in lookup (on the right-hand side) doesn't.
Let's check to see if columns from both data frames are present in the joined data set:
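select(data_join, ResponseID, B01, OverallDeptCode, dept_full_name)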
# A tibble: 11,555 x 4
ResponseID B01 OverallDeptCode dept_full_name
<dbl> <dbl+lbl> <chr> <chr>
1 100000 4 [Agree] ORGA Dept for A
2 100001 4 [Agree] ORGA Dept for A
3 100002 3 [Neither agree nor disagree] ORGA Dept for A
4 100003 5 [Strongly agree] ORGA Dept for A
5 100004 5 [Strongly agree] ORGA Dept for A
6 100005 5 [Strongly agree] ORGA Dept for A
7 100006 4 [Agree] ORGA Dept for A
8 100007 4 [Agree] ORGA Dept for A
9 100008 4 [Agree] ORGA Dept for A
10 100009 5 [Strongly agree] ORGA Dept for A
# … with 11,545 more rows
Success: the output has all the rows of the data data frame, plus the new column (dept_full_name) from the lookup data frame.
We’ve seen how to manipulate our data frame a bit. But we’ve been doing it one discrete step at a time, so your script might end up looking something like this:
data_select <- select(data, ResponseID, OverallDeptCode, B01, Z02)
data_filter <- filter(data_select, OverallDeptCode == "ORGA" & Z02 != 5)
data_mutate <- mutate(
data_filter,
positive = ifelse(B01 %in% c(4, 5), "Positive", "Not positive")
)
print(data_mutate)
# A tibble: 1,060 x 5
ResponseID OverallDeptCode B01 Z02 positive
<dbl> <chr> <dbl+lbl> <dbl+lbl> <chr>
1 100000 ORGA 4 [Agree] 2 [eo] Positive
2 100001 ORGA 4 [Agree] 3 [SEO/HE… Positive
3 100002 ORGA 3 [Neither agree nor disag… 2 [eo] Not positi…
4 100003 ORGA 5 [Strongly agree] 4 [G6/7] Positive
5 100004 ORGA 5 [Strongly agree] 3 [SEO/HE… Positive
6 100005 ORGA 5 [Strongly agree] 4 [G6/7] Positive
7 100006 ORGA 4 [Agree] 2 [eo] Positive
8 100007 ORGA 4 [Agree] 2 [eo] Positive
9 100009 ORGA 5 [Strongly agree] 4 [G6/7] Positive
10 100010 ORGA 4 [Agree] 4 [G6/7] Positive
# … with 1,050 more rows
This is fine, but you will be creating a lot of intermediate objects to get to the final data frame that you want. This clutters up your environment and can fill up your computer’s memory if the data are large enough. You’re in danger of accidentally referring to the wrong object if you don’t name them well.
Instead, you could create one object that is built by chaining all the functions together in order.

We'll use a special 'pipe' operator – %>% – that can be read as 'take what's on the left of the operator and pass it through to the next function'. In pseudocode:
A real example with our data might look like this:
data_piped <- data %>%
select(ResponseID, OverallDeptCode, B01, Z02) %>%
filter(OverallDeptCode == "ORGA" & Z02 != 5) %>%
mutate(positive = ifelse(B01 %in% c(4, 5), "Positive", "Not positive"))
print(data_piped)
# A tibble: 1,060 x 5
ResponseID OverallDeptCode B01 Z02 positive
<dbl> <chr> <dbl+lbl> <dbl+lbl> <chr>
1 100000 ORGA 4 [Agree] 2 [eo] Positive
2 100001 ORGA 4 [Agree] 3 [SEO/HE… Positive
3 100002 ORGA 3 [Neither agree nor disag… 2 [eo] Not positi…
4 100003 ORGA 5 [Strongly agree] 4 [G6/7] Positive
5 100004 ORGA 5 [Strongly agree] 3 [SEO/HE… Positive
6 100005 ORGA 5 [Strongly agree] 4 [G6/7] Positive
7 100006 ORGA 4 [Agree] 2 [eo] Positive
8 100007 ORGA 4 [Agree] 2 [eo] Positive
9 100009 ORGA 5 [Strongly agree] 4 [G6/7] Positive
10 100010 ORGA 4 [Agree] 4 [G6/7] Positive
# … with 1,050 more rows
So the steps for creating the data_piped object are:
- start with the data object
- select the four columns we want
- filter for Org A respondents who aren't senior civil servants
- mutate to add the 'positive' column

This is a bit like a recipe. And it's easier to read.
You also repeat yourself fewer times: we only need to name the data object once, at the very start. This minimises the chance that you'll accidentally use the wrong object.
There are a number of ways and formats in which to save our wrangled data. For example, we can do any of these:
write_dta(data_piped, "output/data_piped.dta") # Stata-format
# Other options
write_rds(data_piped, "output/data_piped.rds") # R-specific format
write_csv(data_piped, "output/data_piped.csv") # comma-separated values
You pass to the function the object name and the filepath for where you want it to be saved.
Note that the labels will be lost if you save as CSV, but they’re retained in .dta and .rds format.
Check in your output/ folder to make sure they've been saved.
You can then read these back in as we did earlier in this document (you don't have to do this now):
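data_piped <- read_stata("output/data_piped.dta")  # Stata-format
data_piped <- read_rds("output/data_piped.rds")    # R-specific format
data_piped <- read_csv("output/data_piped.csv")    # comma-separated values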
So far we've been wrangling but not analysing the data. Let's look at the summarise() function for some quick summaries.
A simple example might be to get the total count of responses in the data set and the mean of the engagement scores.
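data %>%
  summarise(
    total_count = n(),  # the number of responses (rows)
    ees_mean = round(mean(ees, na.rm = TRUE), 2)  # mean engagement score, 2 dp
  )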
# A tibble: 1 x 2
total_count ees_mean
<int> <dbl>
1 11555 0.59
That's good, but we can extend the summary so we get results grouped by some other variables. This is what the group_by() function does. You give group_by() the variables within which to summarise, and you finish by calling ungroup() so that subsequent functions don't get applied to the groups.
So here's a more comprehensive example that gets the total count and mean EES grouped within departments and the Z02 variable (grade). It then filters out people who didn't answer Z02 and uses mutate() to suppress any mean EES values composed of fewer than 10 responses.
data %>%
group_by(OverallDeptCode, Z02) %>%
summarise(
total_count = n(),
ees_mean = round(mean(ees, na.rm = TRUE), 2)
) %>%
ungroup() %>%
filter(!is.na(Z02)) %>%
mutate(
ees_mean_supp = ifelse(
test = total_count < 10, yes = NA, no = ees_mean
)
)
# A tibble: 15 x 5
OverallDeptCode Z02 total_count ees_mean ees_mean_supp
<chr> <dbl+lbl> <int> <dbl> <dbl>
1 ORGA 1 [AO/AA] 4 0.75 NA
2 ORGA 2 [eo] 131 0.7 0.7
3 ORGA 3 [SEO/HEO] 438 0.7 0.7
4 ORGA 4 [G6/7] 487 0.64 0.64
5 ORGA 5 [scs] 78 0.85 0.85
6 ORGB 1 [AO/AA] 4930 0.56 0.56
7 ORGB 2 [eo] 1513 0.580 0.580
8 ORGB 3 [SEO/HEO] 2359 0.64 0.64
9 ORGB 4 [G6/7] 133 0.76 0.76
10 ORGB 5 [scs] 18 0.570 0.570
11 ORGC 1 [AO/AA] 1 1 NA
12 ORGC 2 [eo] 17 0.79 0.79
13 ORGC 3 [SEO/HEO] 24 0.61 0.61
14 ORGC 4 [G6/7] 37 0.6 0.6
15 ORGC 5 [scs] 4 0.78 NA
We could have a whole separate session on visualising data.
The tidyverse package for plotting is called {ggplot2}. The 'gg' stands for 'grammar of graphics'. It's a system to build up a graphic using common components, including:
- the data you want to plot
- layers of 'geoms', the geometry that represents the data (like bars, points and lines)
- extras like themes, labels and facets
You also supply aesthetic properties like size, colour, x and y locations.
These elements are built up with the + operator. Imagine you've created a blank canvas and you're adding each layer in turn. (This is different to using the pipe, %>%, which passes information from the left-hand side to the right-hand side.)
The great thing about building plots with code is that you can produce them with the same styles very quickly without all the manual adjustments that might be required in some other programs.
{ggplot2} is a very powerful graphics package that can create all sorts of charts. Check out the R Graph Gallery for some more examples.
For now, let's look at a simple bar chart of the answers to question B01, using the ggplot() function from {ggplot2}.
# Prepare the data
plot_data <- data %>%
filter(!is.na(B01)) %>% # remove NAs
count(OverallDeptCode, B01) %>% # count() is a shortcut for summarising
mutate(
B01 = haven::as_factor(B01), # add the text labels
Department = OverallDeptCode
)
# Plot the data
plot_data %>% # with the plot data
ggplot(aes(x = B01, y = n)) + # create a canvas with these coords
geom_col() # apply columns to the canvas given the coords
What just happened? We:
- took the prepared data, plot_data
- created a canvas, supplying the aesthetics to aes() (in this case, the x and y variables)
- added a layer of geometry, geom_col(), to make a bar chart

We can spruce this up a little by adding on additional things like a theme or labels.
ggplot(plot_data, aes(x = B01, y = n)) +
geom_col(aes(fill = Department)) +
coord_flip() + # flip the axes
theme_light() + # apply a theme
scale_fill_brewer(palette = "Blues") + # set the bar colours
labs( # provide overall labels
title = "Most people say they're interested in their work",
subtitle = "This is true across all organisations",
caption = "Source: B01, synthetic CSPS data"
) +
xlab(NULL) + # remove the x axis title
ylab("Count of responses") # y axis title
But we could also split each department's results into a grid of small multiples, or 'facets', with facet_grid().
ggplot(plot_data, aes(x = B01, y = n)) +
geom_col() +
coord_flip() +
theme_light() +
labs(
title = "Most people say they're interested in their work",
subtitle = "This is true across all organisations",
caption = "Source: B01, synthetic CSPS data"
) +
xlab(NULL) +
ylab("Count of responses") +
facet_grid(
cols = vars(OverallDeptCode), # one column per department
scales = "free" # scales are relative to the facet
)
We can also use {ggplot2} to recreate the style of bar charts used in the PDF reports of People Survey results. First we need to process the data: filter for ORGB, reshape the data into a plottable format and calculate percentages. This section uses a couple of tidyverse packages we haven't seen yet: {tidyr} for reshaping data frames and {forcats} for working with factors. Their functions are shown as package::function() to make them more apparent.
plot_data2 <- data %>%
filter(OverallDeptCode == "ORGB") %>%
select(B47:B51) %>%
mutate_all(haven::as_factor) %>% # convert the variables to factors
tidyr::pivot_longer( # turn the data into 'long' format
cols = everything(), # using all the columns
names_to = "question", # assign names to variable 'question'
values_to = "value" # assign values to variable 'value'
) %>%
tidyr::drop_na(value) %>% # drops any missing responses
count( # count the combinations of:
question, # question, and
value, # value, and
name = "response_count") %>% # give the count a specific name
add_count(
question, # add an extra count by question
wt = response_count, # summing the 'wt' variable
name = "question_count") %>% # give it a specific name
mutate(
pc = response_count/question_count, # calculate responses as % of question
value = forcats::fct_rev(value), # character strings are often better as
question = forcats::fct_rev( # factors when plotting, but sometimes
forcats::as_factor(question) # you need to reverse their 'order'
)
)
print(plot_data2)
# A tibble: 25 x 5
question value response_count question_count pc
<fct> <fct> <int> <int> <dbl>
1 B47 Strongly disagree 455 10158 0.0448
2 B47 Disagree 904 10158 0.0890
3 B47 Neither agree nor disagree 2397 10158 0.236
4 B47 Agree 4410 10158 0.434
5 B47 Strongly agree 1992 10158 0.196
6 B48 Strongly disagree 1197 10152 0.118
7 B48 Disagree 1995 10152 0.197
8 B48 Neither agree nor disagree 3006 10152 0.296
9 B48 Agree 2950 10152 0.291
10 B48 Strongly agree 1004 10152 0.0989
# … with 15 more rows
We now have a dataset that has counted the responses for each question-value pair (response_count), the number of responses for each question (question_count) and a percentage response (pc), for questions B47-B51 for respondents in ORGB.

We can now plot this data. Rather than Department, we'll be plotting the questions on the 'x-axis' and our calculated percentage on the 'y-axis' (we'll actually flip these axes, but that's one of the last things we do, so it's best to still think of these in their original x-y positions).
We can also add data labels using geom_text().
The PDF survey reports use a colourblind friendly pink-green scale from the {RColorBrewer} package, which provides the palettes developed by the Color Brewer project.
Finally, we apply some customisation to the theme to remove the axis titles, reposition the legend, give the legend keys an outline, and format the title text.
ggplot(plot_data2, aes(x = question, y = pc)) +
geom_col(aes(fill = value), width = 0.75, colour = "gray60", size = 0.2) +
geom_text(
aes(
label = scales::percent(pc, accuracy = 1),
colour = value),
position = position_fill(vjust = 0.5),
size = 3,
show.legend = FALSE) +
# geom_text adds text labels, we set the label aesthetic to the text
# we've also mapped the colour aesthetic to vary the label text's colour
# text positioning can be tricky, this is why the value factor was reversed
# when we created plot_data2 ¯\_(ツ)_/¯
scale_y_reverse() +
# reverse the y-axis so that strongly agree will be on the left-hand side
scale_fill_brewer(palette = "PiYG", direction = -1) +
# the PiYG palette is the same as is used in the highlights reports
# it is colourblind friendly, so recommended instead of basic red-green
scale_colour_manual(
values = c("Strongly agree" = "white",
"Agree" = "gray20",
"Neither agree nor disagree" = "gray20",
"Disagree" = "gray20",
"Strongly disagree" = "white")) +
# this provides the colours for the text labels, so that the labels for the
# 'strongly' values have white text, and the others have grey text
coord_flip() +
# flip the axis
labs(
title = "Employee engagement question results",
subtitle = "Almost two-thirds of staff are proud to work for ORG B",
caption = "Source: B47-B52, ORGB, synthetic CSPS data") +
theme_light() +
theme(
panel.grid = element_blank(),
# element_blank() removes an element from the plot
panel.border = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks = element_blank(),
legend.position = "top",
legend.title = element_blank(),
legend.key.size = unit(1, "char"),
legend.margin = margin(1,0,0,0, "char"),
plot.title = element_text(face = "bold"))
You can search for questions tagged r on StackOverflow, or even ask your own question.
3.1 Comments
In an R script, any characters prefixed with a hash (#) will be recognised as a comment. R will ignore these when you run your code.

Comments are really helpful for letting people understand what your code is doing. Try to keep a narrative going throughout your code to explain what it's doing. Be explicit – it might be obvious to you right now why a certain line of code is being written, but you might come back in a few months' time and forget.
It’s also good to use comments to explain what each block of code is doing and to explain particular lines of code. Don’t worry about the code itself, but here’s an example of comments in use:
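# Read in the synthetic survey data
data <- read_stata("data/csps_synth.dta")

# Keep only the respondent ID and the engagement score
data_small <- select(data, ResponseID, ees)  # select() is from {dplyr}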
It’s also good to add the title, your name, date, etc, as comments at the top of your script so people know what the script is for when they open it.