Today’s class outline:
- Welcome back, questions on homework (10-15 minutes)
- Reproducibility and Your Personal Values (10 minutes)
- Building a Reproducible Workflow Using Projects (45 minutes)
- Data Transformations using dplyr (45 minutes)
Reproducibility is ethical.
When I post a project, I pore over my code for hours, adding comments, rendering to multiple formats, trying to flag locations in online materials in the manuscript, etc.
I am trying to prevent errors, but I am also trying to make sure that other people know what I did, especially if I did make errors.
Reproducible research is also equitable.
A reproducible research workflow can be downloaded by another person as a starting point, providing tools to other researchers who may not have the same access to education and resources as you.
A reproducible workflow is organized. What does it mean to be organized? At least:
More advanced (later in the class)
A project is a discrete piece of work which has a number of files associated with it such as the data and scripts for an analysis and the production reports.
Using a project-oriented workflow means having a hierarchical folder structure with everything needed to reproduce an analysis.
One research project might have several organizational projects associated with it, for example:
Good Workflows are:
Naming
-- ipcs_data_2019
   |__ipcs_data_2019.Rproj
   |__data
      |__raw_data
         |__2019-03-21_ema_raw.csv
         |__2019-03-21_baseline_raw.csv
      |__clean_data
         |__2019-06-21_ema_long.csv
         |__2019-06-21_ema_long.RData
         |__2019-06-21_baseline_wide.csv
         |__2019-06-21_baseline_wide.RData
   |__results
      |__01_models
         |__E_mortality.RData
         |__A_mortality.RData
      |__02_summaries
         |__E_mortality.RData
         |__A_mortality.RData
      |__03_figures
         |__mortality.png
         |__mortality.pdf
      |__04_tables
         |__zero_order_cors.RData
         |__descriptives.RData
         |__key_terms.RData
         |__all_model_terms.RData
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_background.Rmd
      |__02_data_cleaning.Rmd
      |__03_models.Rmd
      |__04_summary.Rmd
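A folder skeleton like this one can also be scaffolded from R rather than by hand. The sketch below is only an illustration: it assumes you want to try it out in a temporary directory first (the use of tempdir() is my choice here, not part of the original project), and it creates the empty folders only, not the files.

```r
# Build the skeleton inside a temporary directory so nothing in your
# real file system is touched (an assumption made for this sketch).
project_root <- file.path(tempdir(), "ipcs_data_2019")

# Folder names are taken from the example structure above.
folders <- c(
  "data/raw_data",
  "data/clean_data",
  "results/01_models",
  "results/02_summaries",
  "results/03_figures",
  "results/04_tables",
  "refs",
  "analyses"
)

# recursive = TRUE also creates the parent folders (data, results).
for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List every folder that now exists, relative to the project root.
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created)
```

Once the layout looks right, the same loop could be pointed at a real Project directory instead of tempdir().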
A path gives the address - or location - of a filesystem object, such as a file or directory.
An absolute path is given from the “root directory” of the object.
The root directory of a file system is the first or top directory in the hierarchy.
For example, C:\ or M:\ on Windows, or / on a Mac, which is displayed as Macintosh HD in Finder.
The absolute path for a file, pigeon.txt, could be:
C:/Users/edbeck/Desktop/pigeons/data-raw/pigeon.txt
/Users/edbeck/Desktop/pigeons/data-raw/pigeon.txt
http://github.com/emoriebeck/pigeons/data/pigeon.txt
Directory is the old word for what many now call a folder 📂.
Commands that act on directories in most programming languages and environments reflect this.
For example, in R this means “tell me my working directory”:
getwd()  # get working directory in R
The working directory is the default location a program is using. It is where the program will read and write files by default. You have only one working directory at a time.
The terms ‘working directory’, ‘current working directory’ and ‘current directory’ all mean the same thing.
A relative path gives the location of a filesystem object relative to the working directory (i.e., that returned by getwd()).
When pigeon.txt is in the working directory, the relative path is just the file name: pigeon.txt
If there is a folder in the working directory called data-raw and pigeon.txt is in there, then the relative path is data-raw/pigeon.txt
../ allows you to look in the directory above the working directory.
When pigeon.txt is in the folder above the working directory, the relative path is ../pigeon.txt
And if it is in a folder called data-raw which is in the directory above the working directory, then the relative path is ../data-raw/pigeon.txt
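To make the relative path rules concrete, here is a small sketch you can run anywhere. It assumes a throwaway pigeons folder created under tempdir(), and it uses setwd() only to keep the demonstration self-contained; inside an RStudio Project you would not normally need setwd().

```r
# Work inside a throwaway directory (created here only for the demo).
demo_root <- file.path(tempdir(), "pigeons")
dir.create(file.path(demo_root, "data-raw"),
           recursive = TRUE, showWarnings = FALSE)
writeLines("example contents",
           file.path(demo_root, "data-raw", "pigeon.txt"))

# setwd() returns the previous working directory, so we can restore it later.
old_wd <- setwd(demo_root)

# Relative to the working directory, the file is at data-raw/pigeon.txt.
in_subfolder <- file.exists("data-raw/pigeon.txt")

# Move the working directory into data-raw: the file name alone now works.
setwd("data-raw")
in_working_dir <- file.exists("pigeon.txt")

# From data-raw, "../" points back up to the pigeons folder.
via_parent <- file.exists("../data-raw/pigeon.txt")

# normalizePath() converts a relative path into an absolute one.
absolute_path <- normalizePath("pigeon.txt")

# Restore the original working directory.
setwd(old_wd)

print(c(in_subfolder, in_working_dir, via_parent))
print(absolute_path)
```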
You can list the contents of a directory using the dir() command:
- dir() lists the contents of the working directory
- dir("..") lists the contents of the directory above the working directory
- dir("../..") lists the contents of the directory two directories above the working directory
- dir("data-raw") lists the contents of a folder called data-raw which is in the working directory
Most of the time you should use relative paths because that makes your work portable (i.e. to a different machine / user / etc.).
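As a quick illustration of dir() and of building paths without typing separators by hand, here is a minimal sketch. The folder lives under tempdir() and the second file, dove.txt, is a made-up name used only so there is more than one file to list.

```r
# Set up a small demo folder in a temporary location (illustration only).
demo_root <- file.path(tempdir(), "pigeons-listing")
dir.create(file.path(demo_root, "data-raw"),
           recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "data-raw", c("pigeon.txt", "dove.txt")))

# dir() accepts any path, so we can list a folder without changing
# the working directory.
raw_files <- dir(file.path(demo_root, "data-raw"))
print(raw_files)

# file.path() joins the pieces of a path with "/" for you.
relative_path <- file.path("data-raw", "pigeon.txt")
print(relative_path)
```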
🎉 The tab key is your friend!
An RStudio Project is a folder with an .Rproj file in it.
For example, if you create an RStudio Project ipcs_data_2019, your folder structure would look something like this:
-- ipcs_data_2019
   |__ipcs_data_2019.Rproj
   |__data
      |__raw_data
         |__2019-03-21_ema_raw.csv
         |__2019-03-21_baseline_raw.csv
      |__clean_data
         |__2019-06-21_ema_long.csv
         |__2019-06-21_ema_long.RData
         |__2019-06-21_baseline_wide.csv
         |__2019-06-21_baseline_wide.RData
   |__results
      |__01_models
      |__02_summaries
      |__03_figures
      |__04_tables
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_background.Rmd
      |__02_data_cleaning.Rmd
      |__03_models.Rmd
      |__04_summary.Rmd
ipcs_data_2019.Rproj is the .Rproj file, which is the defining feature of an RStudio Project.
When you open an RStudio Project, the working directory is set to the Project directory (i.e., the location of the .Rproj file).
This makes your work portable. You can zip up the project folder and send it to any person, including future you, or any computer.
They will be able to unzip, open the project and have all the code just work.
(This is great for sending code and/or results to your advisors)
You are aiming for structured, systematic and repeatable. For example, the Project directory might contain:
READMEs are a form of documentation which have been widely used for a long time. They contain all the information about the other files in a directory. They can be extensive.
Wikipedia README page
GitHub Doc’s About READMEs
OSF
A minimal README might give:
Here’s an example from one of my webapps
A license tells others what they can and can’t do with your work.
choosealicense.com is a useful explainer.
I typically use:
Create a new Project called iris by:
- clicking File->New Project…, or
- clicking on the little icon (second from the left) at the top
Choose New Project, then New Directory, then New Project. Name the RStudio Project iris.
Create folders in iris called data-raw, data-processed and figures.
Start new scripts called 01-import.R, 02-tidy.R, and 03-figures.R.
Save a copy of iris.csv to your data-raw folder. These data give the information about different species of irises.
In your 01-import.R script, load the tidyverse set of packages.
The relative path to the data is data-raw/iris.csv because your working directory is the Project directory, iris.
This dataset has three observations in a row - it is not ‘tidy’.
Open your 02-tidy.R script and reshape the data. This reformats the data frame in R but does not overwrite the text file of the data.
Don’t worry too much about this right now. We’ll spend a lot of time talking about reshaping data next week!
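For a rough idea of what such a reshape can look like, here is a sketch using tidyr's pivot_longer(). It uses R's built-in iris data frame as a stand-in, since the layout of the course's iris.csv may differ; the column names measurement and value are my own choices.

```r
library(tidyr)

# Stand-in data: R's built-in iris data frame, which is also in a wide
# layout (four measurements per row). The course's iris.csv may differ.
iris_long <- pivot_longer(
  iris,
  cols = -Species,           # reshape every column except Species
  names_to = "measurement",  # new column holding the old column names
  values_to = "value"        # new column holding the measured values
)

# The original iris object is unchanged; only iris_long is new.
dim(iris)
dim(iris_long)
```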
Often we want to write to files.
Write iris to a csv file named iris-long.csv in your data-processed folder:
Open your 03-figures.R script and create a simple plot of this data with:
View plot with:
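The writing and plotting steps can be sketched as follows. This is an illustration under assumptions: the built-in iris data stands in for the course file, a folder under tempdir() stands in for the Project directory, and the choice of a boxplot is mine.

```r
library(tidyr)
library(readr)
library(ggplot2)

# Stand-in for the reshaped course data (built-in iris, made long).
iris_long <- pivot_longer(iris, cols = -Species,
                          names_to = "measurement", values_to = "value")

# A temporary folder stands in for the Project directory in this sketch.
project_dir <- file.path(tempdir(), "iris")
dir.create(file.path(project_dir, "data-processed"),
           recursive = TRUE, showWarnings = FALSE)

# Write the long data to data-processed/iris-long.csv.
out_file <- file.path(project_dir, "data-processed", "iris-long.csv")
write_csv(iris_long, out_file)

# A simple plot: one boxplot per species, one panel per measurement.
fig <- ggplot(iris_long, aes(x = Species, y = value)) +
  geom_boxplot() +
  facet_wrap(~measurement, scales = "free_y")

# Printing the object draws the plot.
print(fig)
```

Inside a real Project, the path would simply be "data-processed/iris-long.csv", because the working directory is already the Project directory.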
A useful function for saving ggplot figures is ggsave().
It has arguments for the size, resolution and device for the image. See the ggsave() reference page.
Assign ggsave() argument values to variables:
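Here is one way this could look. The numbers, the variable names, and the file name iris-scatter.png are arbitrary examples of mine, and a folder under tempdir() stands in for the Project's figures folder.

```r
library(ggplot2)

# Stand-in figure built from the built-in iris data.
fig <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point()

# Argument values held in variables (the numbers are arbitrary examples).
fig_units <- "in"
fig_w <- 3.5
fig_h <- fig_w
fig_dpi <- 300
fig_device <- "png"

# A temporary folder stands in for the Project's figures folder.
fig_dir <- file.path(tempdir(), "figures")
dir.create(fig_dir, showWarnings = FALSE)
fig_file <- file.path(fig_dir, "iris-scatter.png")

# Save the figure, passing the variables to ggsave().
ggsave(
  filename = fig_file,
  plot = fig,
  device = fig_device,
  width = fig_w,
  height = fig_h,
  units = fig_units,
  dpi = fig_dpi
)
```

Keeping the values in variables means that several figures can share the same size and resolution, and you only have to change them in one place.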
dplyr
dplyr Core Functions
- %>%: The pipe. Read as “and then.”
- filter(): Pick observations (rows) by their values.
- select(): Pick variables (columns) by their names.
- arrange(): Reorder the rows.
- group_by(): Implicitly split the data set by grouping by names (columns).
- mutate(): Create new variables with functions of existing variables.
- summarize() / summarise(): Collapse many values down to a single summary.
Although each of these functions is powerful alone, they are incredibly powerful in conjunction with one another. So below, I’ll briefly introduce each function, then link them all together using an example of basic data cleaning and summary.
%>%
%>% is wonderful. It makes coding intuitive. Often in coding, you need to use so-called nested functions. For example, you might want to round a number after taking the square root of 43.
The issue with this comes whenever we need to do a series of operations on a data set or other type of object. In such cases, if we run it in a single call, then we have to start in the middle and read our way out.
The pipe solves this by allowing you to read from left to right (or top to bottom). The easiest way to think of it is that each call of %>% reads and operates as “and then.” So with the rounded square root of 43, for example:
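A minimal sketch of the two styles side by side. Rounding to two decimal places is my own choice for the example.

```r
library(dplyr)  # makes the %>% pipe available

# Nested call: read from the inside out (sqrt first, then round).
nested <- round(sqrt(43), 2)

# Piped call: "take 43, and then take the square root, and then round".
piped <- 43 %>%
  sqrt() %>%
  round(2)

nested
piped
```

Both versions give the same number; the piped version just lists the steps in the order they happen.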
filter()
Often, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.
# A tibble: 6 × 28
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2 4 3 4 4 2 3 3 4 4 3 3 3
2 2 4 5 2 5 5 4 4 3 4 1 1 6
3 5 4 5 4 4 4 5 4 2 5 2 4 4
4 4 4 6 5 5 4 4 3 5 5 5 3 4
5 2 3 3 4 5 4 4 5 3 2 2 2 5
6 6 6 5 6 5 6 6 6 1 3 2 1 6
# ℹ 15 more variables: E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>,
# N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>,
# gender <int>, education <int>, age <int>
But this isn’t quite right. We still have folks below 12. But the beauty of filter() is that you can chain a sequence of OR and AND statements when there is more than one condition, such as up to 18 AND at least 12.
Got it!
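To see the difference between one condition and two, here is a sketch on a tiny made-up data set. The toy tibble and its values are hypothetical stand-ins for bfi, kept small so the result is easy to check by eye.

```r
library(dplyr)

# A tiny made-up data set standing in for bfi (values are hypothetical).
toy <- tibble(
  id = 1:6,
  age = c(9, 12, 15, 18, 21, 40)
)

# One condition: keeps everyone aged 18 or younger, including the 9-year-old.
up_to_18 <- toy %>% filter(age <= 18)

# Two conditions joined with AND (&): at most 18 and at least 12.
teens <- toy %>% filter(age <= 18 & age >= 12)

up_to_18
teens
```

Separating the conditions with a comma, as in filter(age <= 18, age >= 12), is treated as AND as well; OR is written with |.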
filter()
<, >, <=, and >=
bfi data frame to a string.
Now let’s try a few things:
1. Create a data set with only individuals with some college (==).
2. Create a data set with only people age 18 (==).
3. Create a data set with individuals with some college or above (%in%).
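One possible way to approach these three, sketched on made-up data. It assumes education is stored as the text labels that appear in the output elsewhere in these slides ("Below HS", "HS", "Some College", "College", "Higher Degree"); the toy tibble itself is hypothetical.

```r
library(dplyr)

# Made-up rows standing in for bfi (values are hypothetical).
toy <- tibble(
  age = c(17, 18, 18, 25, 33, 41),
  education = c("Below HS", "HS", "Some College",
                "Some College", "College", "Higher Degree")
)

# 1. Only individuals with some college (==).
some_college <- toy %>% filter(education == "Some College")

# 2. Only people aged 18 (==).
age_18 <- toy %>% filter(age == 18)

# 3. Some college or above (%in% checks membership in a set of values).
college_plus <- toy %>%
  filter(education %in% c("Some College", "College", "Higher Degree"))

nrow(some_college)
nrow(age_18)
nrow(college_plus)
```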
select()
If filter() is for pulling certain observations (rows), then select() is for pulling certain variables (columns).
In the bfi data, most of these have been pre-removed, so instead, we’ll imagine we don’t want to use any indicators of Agreeableness (A1-A5) and that we aren’t interested in gender.
With select(), there are a few ways to choose variables. We can give the bare (unquoted) names of the ones we want to keep, give the bare names of the ones we want to remove, or use any of a number of select() helper functions.
select():
# A tibble: 2,800 × 22
C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2 3 3 4 4 3 3 3 4 4 3 4 2
2 5 4 4 3 4 1 1 6 4 3 3 3 3
3 4 5 4 2 5 2 4 4 4 5 4 5 4
4 4 4 3 5 5 5 3 4 4 4 2 5 2
5 4 4 5 3 2 2 2 5 4 5 2 3 4
6 6 6 6 1 3 2 1 6 5 6 3 5 2
# ℹ 2,794 more rows
# ℹ 9 more variables: N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
# O4 <int>, O5 <int>, education <chr>, age <int>
select() helper functions:
- starts_with()
- ends_with()
- contains()
- matches()
- num_range()
- one_of()
- all_of()
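The three ways of choosing columns can be sketched like this. The toy tibble is a hypothetical, cut-down stand-in for bfi. One thing worth knowing: starts_with() ignores case by default, so starts_with("A") would also match age; the sketch sets ignore.case = FALSE to avoid that.

```r
library(dplyr)

# A made-up, cut-down stand-in for bfi (values are hypothetical).
toy <- tibble(
  A1 = 1:3, A2 = 4:6,
  C1 = 3:1, C2 = 6:4,
  gender = c(1, 2, 2),
  age = c(16, 18, 17)
)

# A. Name the columns to keep.
keep_named <- toy %>% select(C1, C2, age)

# B. Name the columns to drop with a minus sign.
drop_named <- toy %>% select(-A1, -A2, -gender)

# C. Use a helper: drop every column whose name starts with a capital "A",
#    plus gender. ignore.case = FALSE stops "age" from being matched too.
drop_helper <- toy %>%
  select(-starts_with("A", ignore.case = FALSE), -gender)

names(keep_named)
names(drop_named)
names(drop_helper)
```

All three calls end up with the same columns here, so which one to use mostly depends on whether it is shorter to list what you keep or what you drop.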
arrange()
R has a sort() function; the arrange() function is the tidyverse version that plays nicely with other tidyverse functions.
So in our previous examples, we could also arrange() our data by age or education, rather than simply filtering. (Or, as we’ll see later, we can do both!)
We can also arrange by multiple columns, like if we wanted to sort by gender then education:
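A small sketch of single-column and multi-column sorting, using made-up rows rather than bfi so the resulting order is easy to verify. The desc() example at the end is an extra I added to show descending order.

```r
library(dplyr)

# Made-up rows standing in for bfi (values are hypothetical).
toy <- tibble(
  gender = c(2, 1, 2, 1),
  education = c("HS", "College", "Below HS", "Below HS"),
  age = c(30, 22, 19, 45)
)

# Sort by a single column (ascending by default).
by_age <- toy %>% arrange(age)

# Sort by gender first, then by education within each gender.
by_gender_edu <- toy %>% arrange(gender, education)

# desc() reverses the order for that column.
oldest_first <- toy %>% arrange(desc(age))

by_age
by_gender_edu
oldest_first
```

Note that education is sorted alphabetically here because it is text, not because "Below HS" comes first in any meaningful ordering.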
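Much of the power of dplyr functions lies in the split-apply-combine method.
A given set of data are:
- split into smaller chunks
- then a function or series of functions are applied to each chunk
- and then the chunks are combined back together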
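To see the three steps spelled out before dplyr does them for us, here is a sketch in base R. The data are made up, and this is only meant to show the idea, not how dplyr implements it internally.

```r
# Made-up ages and education levels (hypothetical values).
toy <- data.frame(
  education = c("HS", "HS", "College", "College", "College"),
  age = c(18, 22, 30, 34, 38)
)

# Split: one vector of ages per education level.
pieces <- split(toy$age, toy$education)

# Apply: compute the mean of each piece.
means <- lapply(pieces, mean)

# Combine: put the results back into a single data frame.
combined <- data.frame(
  education = names(means),
  mean_age = unlist(means, use.names = FALSE)
)
combined
```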
group_by()
The group_by() function is the “split” of the method.
So imagine that we wanted to group_by() education levels to get average ages at each level:
# A tibble: 2,800 × 8
# Groups: education [6]
C1 C2 C3 C4 C5 age gender education
<int> <int> <int> <int> <int> <int> <int> <chr>
1 2 3 3 4 4 16 1 <NA>
2 5 4 4 3 4 18 2 <NA>
3 4 5 4 2 5 17 2 <NA>
4 4 4 3 5 5 17 2 <NA>
5 4 4 5 3 2 17 1 <NA>
6 6 6 6 1 3 21 2 Some College
# ℹ 2,794 more rows
To remove the grouping, use the ungroup() function:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  ungroup() %>%
  print(n = 6)
# A tibble: 2,800 × 8
C1 C2 C3 C4 C5 age gender education
<int> <int> <int> <int> <int> <int> <int> <chr>
1 2 3 3 4 4 16 1 <NA>
2 5 4 4 3 4 18 2 <NA>
3 4 5 4 2 5 17 2 <NA>
4 4 4 3 5 5 17 2 <NA>
5 4 4 5 3 2 17 1 <NA>
6 6 6 6 1 3 21 2 Some College
# ℹ 2,794 more rows
Multiple group_by() calls overwrite previous calls:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  group_by(gender, age) %>%
  print(n = 6)
# A tibble: 2,800 × 8
# Groups: gender, age [115]
C1 C2 C3 C4 C5 age gender education
<int> <int> <int> <int> <int> <int> <int> <chr>
1 2 3 3 4 4 16 1 <NA>
2 5 4 4 3 4 18 2 <NA>
3 4 5 4 2 5 17 2 <NA>
4 4 4 3 5 5 17 2 <NA>
5 4 4 5 3 2 17 1 <NA>
6 6 6 6 1 3 21 2 Some College
# ℹ 2,794 more rows
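You can check which grouping is currently in effect with group_vars(). The sketch below uses made-up data; it also shows the .add = TRUE argument of group_by(), which is not covered above and which adds to the existing grouping instead of replacing it.

```r
library(dplyr)

# Made-up rows standing in for bfi (values are hypothetical).
toy <- tibble(
  education = c("HS", "HS", "College"),
  gender = c(1, 2, 2),
  age = c(18, 18, 30)
)

# A second group_by() replaces the first grouping.
overwritten <- toy %>%
  group_by(education) %>%
  group_by(gender, age)
group_vars(overwritten)

# With .add = TRUE the new variable is added to the existing grouping.
added <- toy %>%
  group_by(education) %>%
  group_by(gender, .add = TRUE)
group_vars(added)
```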
mutate()
mutate() is one of your “apply” functions.
When you use mutate(), the resulting data frame will have the same number of rows you started with.
To demonstrate, let’s add a column that indicates average age levels within each education group:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>%
  mutate(age_by_edu = mean(age, na.rm = TRUE)) %>%
  print(n = 6)
# A tibble: 2,800 × 9
# Groups: education [6]
C1 C2 C3 C4 C5 age gender education age_by_edu
<int> <int> <int> <int> <int> <int> <int> <chr> <dbl>
1 6 6 3 4 5 19 1 Below HS 25.1
2 4 3 5 3 2 21 1 Below HS 25.1
3 5 5 5 2 2 17 1 Below HS 25.1
4 5 5 4 1 1 18 1 Below HS 25.1
5 4 5 4 3 3 18 1 Below HS 25.1
6 3 2 3 4 6 18 2 Below HS 25.1
# ℹ 2,794 more rows
mutate() is also super useful even when you aren’t grouping.
We can create a new category:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender_cat = plyr::mapvalues(gender, c(1, 2), c("Male", "Female")))
# A tibble: 2,800 × 9
C1 C2 C3 C4 C5 age gender education gender_cat
<int> <int> <int> <int> <int> <int> <int> <chr> <chr>
1 2 3 3 4 4 16 1 <NA> Male
2 5 4 4 3 4 18 2 <NA> Female
3 4 5 4 2 5 17 2 <NA> Female
4 4 4 3 5 5 17 2 <NA> Female
5 4 4 5 3 2 17 1 <NA> Male
6 6 6 6 1 3 21 2 Some College Female
7 5 4 4 2 3 18 1 <NA> Male
8 3 2 4 2 4 19 1 HS Male
9 6 6 3 4 5 19 1 Below HS Male
10 6 5 6 2 1 17 2 <NA> Female
# ℹ 2,790 more rows
We could also just overwrite it:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender = plyr::mapvalues(gender, c(1, 2), c("Male", "Female")))
# A tibble: 2,800 × 8
C1 C2 C3 C4 C5 age gender education
<int> <int> <int> <int> <int> <int> <chr> <chr>
1 2 3 3 4 4 16 Male <NA>
2 5 4 4 3 4 18 Female <NA>
3 4 5 4 2 5 17 Female <NA>
4 4 4 3 5 5 17 Female <NA>
5 4 4 5 3 2 17 Male <NA>
6 6 6 6 1 3 21 Female Some College
7 5 4 4 2 3 18 Male <NA>
8 3 2 4 2 4 19 Male HS
9 6 6 3 4 5 19 Male Below HS
10 6 5 6 2 1 17 Female <NA>
# ℹ 2,790 more rows
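The two examples above rely on plyr::mapvalues(). If plyr is not installed, a similar recoding can be sketched with dplyr's own case_when(). To be clear, this is a swapped-in alternative rather than the approach used in these slides, and the toy data are made up.

```r
library(dplyr)

# Made-up gender codes, using the same 1 = Male, 2 = Female coding as above.
toy <- tibble(gender = c(1, 2, 2, 1))

# case_when() checks each condition in order and returns the matching label.
toy_labelled <- toy %>%
  mutate(gender_cat = case_when(
    gender == 1 ~ "Male",
    gender == 2 ~ "Female"
  ))

toy_labelled
```

Any value that matches none of the conditions becomes NA, which makes unexpected codes easy to spot.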
summarize() / summarise()
summarize() is one of your “apply” functions.
# group_by() education
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>%
  summarize(age_by_edu = mean(age, na.rm = TRUE))
# A tibble: 6 × 2
education age_by_edu
<chr> <dbl>
1 Below HS 25.1
2 College 33.0
3 HS 31.5
4 Higher Degree 35.3
5 Some College 27.2
6 <NA> 18.0
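To make the contrast between mutate() and summarize() explicit, here is a sketch that runs both on the same made-up data. The toy tibble is hypothetical and includes one missing age to show what na.rm = TRUE does.

```r
library(dplyr)

# Made-up rows standing in for bfi (values are hypothetical).
toy <- tibble(
  education = c("HS", "HS", "College", "College", "College"),
  age = c(18, 22, 30, 34, NA)
)

# mutate(): every row is kept and the group mean is repeated within each group.
with_means <- toy %>%
  group_by(education) %>%
  mutate(age_by_edu = mean(age, na.rm = TRUE)) %>%
  ungroup()

# summarize(): the result has one row per group.
group_means <- toy %>%
  group_by(education) %>%
  summarize(age_by_edu = mean(age, na.rm = TRUE))

nrow(with_means)   # same number of rows as toy
nrow(group_means)  # one row per education level
```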
Part 1 of these slides was adapted from Emma Rand’s course on reproducibility at the University of York.
Rand E. (2023). White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R (version v1.2). DOI: https://doi.org/10.5281/zenodo.3859818 URL: https://github.com/3mmaRand/pgr_reproducibility
PSC 290 - Data Management and Cleaning