Week 2 Workbook

Author

Emorie D Beck

Week 2: Reproducibility and Data Transformations

Code

library(knitr)
library(psych)
library(emo)
library(plyr)
library(tidyverse)

Overview

Today’s class outline:
- Welcome back, questions on homework (10-15 minutes)
- Reproducibility and Your Personal Values (10 minutes)
- Building a Reproducible Workflow Using Projects (45 minutes)
- Data Transformations using dplyr (45 minutes)

Reproducibility and Your Personal Values

Why reproducibility AND values?

The definition of reproducibility is somewhat debated
- “‘Reproducibility’ refers to instances in which the original researcher’s data and computer codes are used to regenerate the results”
- “‘Reproducibility’ refers to independent researchers arriving at the same results using their own data and methods”
But regardless of what definition you choose, reproducibility starts with a commitment in research to be
- clear
- transparent
- honest
- thorough

Why reproducibility AND values?

Reproducibility is ethical.
When I post a project, I pore over my code for hours, adding comments, rendering to multiple formats, trying to flag in the manuscript where things are located in the online materials, etc.
I am trying to prevent errors, but I am also trying to make sure that other people know what I did, especially if I did make errors.
Reproducible research is also equitable.
A reproducible research workflow can be downloaded by another person as a starting point, providing tools to other researchers who may not have the same access to education and resources as you

Where should we be reproducible?

Planning
- Study planning and design
- Lab Protocols
- Codebooks
- etc.
Analyses
- Scripting
- Communication
- etc.

Aspects of Reproducibility

Data within files should be ‘tidy’ (next week – tidyr)
Project based approach (today)
Consistency: naming, space, style (today)
Documentation: commenting and README (today)
Literate programming e.g. Rmarkdown (every day!)

Building a Reproducible Workflow Using Projects

Reproducible Workflow

A reproducible workflow is organized. What does it mean to be organized? At least:

Use a project based approach, e.g., RStudio project or similar
Have a hierarchical folder structure
Have a consistent and informative naming system that ‘plays nice’
Document code with comments and analyses with README

More advanced (later in the class)

Generalize with functions and packages
version control

What is a project?

A project is a discrete piece of work which has a number of files associated with it, such as the data and scripts for an analysis and the production of reports.
Using a project-oriented workflow means to have a hierarchical folder structure with everything needed to reproduce an analysis.

One research project might have several organizational projects associated with it, for example:

data files and metadata (which may be made into a package)
preregistration
analysis and reporting
a package developed for the analysis
an app for allowing data to be explored by others

Example

Good Workflows are:

structured
systematic
repeatable

Naming

human and machine readable
- no spaces
- use snake/kebab case
- ordering: numbers (zero left padded), dates
- file extensions

-- ipcs_data_2019
   |__ipcs_data_2019.Rproj
   |__data
      |__raw_data
         |__2019-03-21_ema_raw.csv
         |__2019-03-21_baseline_raw.csv
      |__clean_data
         |__2019-06-21_ema_long.csv
         |__2019-06-21_ema_long.RData
         |__2019-06-21_baseline_wide.csv
         |__2019-06-21_baseline_wide.RData
   |__results
      |__01_models
         |__E_mortality.RData
         |__A_mortality.RData
      |__02_summaries
         |__E_mortality.RData
         |__A_mortality.RData
      |__03_figures
         |__mortality.png
         |__mortality.pdf
      |__04_tables
         |__zero_order_cors.RData
         |__descriptives.RData
         |__key_terms.RData
         |__all_model_terms.RData
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_background.Rmd
      |__02_data_cleaning.Rmd
      |__03_models.Rmd
      |__04_summary.Rmd
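
A minimal sketch of how names like the ones in this tree can be generated in R rather than typed by hand. The step numbers, stems, and date are taken from the example tree; the object names are just for illustration.

```r
# Zero left-padded step numbers keep files in the intended order when sorted
steps <- 1:4
stems <- c("background", "data_cleaning", "models", "summary")
script_names <- sprintf("%02d_%s.Rmd", steps, stems)
script_names

# ISO 8601 dates (YYYY-MM-DD) sort chronologically as plain text
date_stamp <- format(as.Date("2019-03-21"), "%Y-%m-%d")
data_name <- sprintf("%s_%s_raw.csv", date_stamp, "ema")
data_name
```

Because the padding and the date format are fixed, an alphabetical sort of these names is also the logical order of the workflow.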

What is a path?

A path gives the address - or location - of a filesystem object, such as a file or directory.

Paths appear in the address bar of your browser or file explorer.
We need to know a file path whenever we want to read, write or refer to a file using code rather than interactively pointing and clicking to navigate.
A path can be absolute or relative
- absolute = whole path from root
- relative = path from current directory

Absolute paths

An absolute path is given from the “root directory” of the file system the object lives on.
The root directory of a file system is the first or top directory in the hierarchy.
For example, C:\ or M:\ on Windows, or / on a Mac, which is displayed as Macintosh HD in Finder.

The absolute path for a file, pigeon.txt could be:

Windows: C:/Users/edbeck/Desktop/pigeons/data-raw/pigeon.txt
Mac/unix systems: /Users/edbeck/Desktop/pigeons/data-raw/pigeon.txt
web: http://github.com/emoriebeck/pigeons/data/pigeon.txt

What is a directory?

Directory is the old word for what many now call a folder 📂.
Commands that act on directories in most programming languages and environments reflect this.
For example, in R, getwd() (get working directory) means “tell me my working directory”.

What is a working directory?

The working directory is the default location a program is using. It is where the program will read and write files by default. You have only one working directory at a time.
The terms ‘working directory’, ‘current working directory’ and ‘current directory’ all mean the same thing.

Find your current working directory with:

Code

getwd()

[1] "/Users/emoriebeck/Documents/teaching/PSC290-cleaning-fall-2023/psc290-data-FQ23/psc290-data-FQ23"

Relative paths

A relative path gives the location of a filesystem object relative to the working directory, (i.e., that returned by getwd()).

When pigeon.txt is in the working directory, the relative path is just the file name: pigeon.txt
If there is a folder in the working directory called data-raw and pigeon.txt is in there then the relative path is data-raw/pigeon.txt

Paths: moving up the hierarchy

../ allows you to look in the directory above the working directory
When pigeon.txt is in the folder above the working directory, the relative path is ../pigeon.txt
And if it is in a folder called data-raw which is in the directory above the working directory then the relative path is ../data-raw/pigeon.txt
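
Here is a small sketch of relative paths in action. It builds a throwaway folder structure inside R's temporary directory (the folder name paths_demo and the file notes.txt are made up for this illustration), temporarily changes the working directory, and checks that the relative paths resolve.

```r
# Build a throwaway structure: paths_demo/notes.txt and paths_demo/pigeons/data-raw/pigeon.txt
root <- file.path(tempdir(), "paths_demo")
dir.create(file.path(root, "pigeons", "data-raw"), recursive = TRUE, showWarnings = FALSE)
writeLines("example", file.path(root, "pigeons", "data-raw", "pigeon.txt"))
writeLines("example", file.path(root, "notes.txt"))

# setwd() returns the previous working directory, so we can restore it afterwards
old_wd <- setwd(file.path(root, "pigeons"))

in_subfolder <- file.exists("data-raw/pigeon.txt")  # relative to the working directory
one_level_up <- file.exists("../notes.txt")         # ../ looks in the directory above

setwd(old_wd)  # put the working directory back

in_subfolder
one_level_up
```

Inside an RStudio Project you would normally not call setwd() at all; it is used here only so the sketch can run anywhere.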

What’s in my directory?

You can list the contents of a directory using the dir() command

dir(): list the contents of the working directory
dir(".."): list the contents of the directory above the working directory
dir("../.."): list the contents of the directory two levels above the working directory
dir("data-raw"): list the contents of a folder called data-raw, which is in the working directory
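
As a quick sketch, dir() can be tried on a throwaway folder in the temporary directory. The folder name dir_demo and the file notes.txt are made up for this example; the pattern argument is a regular expression that restricts the listing.

```r
# Create a small folder with two files in it
demo_dir <- file.path(tempdir(), "dir_demo", "data-raw")
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("pigeon.txt", "notes.txt"))))

# List everything in the folder (returned in alphabetical order)
contents <- dir(demo_dir)
contents

# List only the files whose names start with "pigeon"
pigeon_files <- dir(demo_dir, pattern = "^pigeon")
pigeon_files
```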

Relative or absolute

Most of the time you should use relative paths because that makes your work portable (i.e. to a different machine / user / etc.).
🎉 The tab key is your friend!
You only need to use absolute paths when you are referring to a filesystem outside the one you are using.
I often store the beginning of that path as an object.
- web_wd <- "https://github.com/emoriebeck/pigeons/"
- Then I can use sprintf() or paste() to add different endings

Code

web_wd <- "https://github.com/emoriebeck/pigeons/"
# web_wd already ends in "/", so the format string must not add another one
sprintf("%sdata-raw/pigeon.txt", web_wd)

[1] "https://github.com/emoriebeck/pigeons/data-raw/pigeon.txt"
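
For comparison, here is a sketch of the same idea with paste0() and file.path(). The main thing to watch is whether the stored prefix already ends in a slash: paste0() adds no separator, while file.path() inserts one between every piece. The object names are just for illustration.

```r
web_wd <- "https://github.com/emoriebeck/pigeons/"

# paste0() glues strings together with no separator, so the trailing "/" in web_wd is enough
with_paste0 <- paste0(web_wd, "data-raw/pigeon.txt")

# file.path() adds "/" between its arguments, so start from a prefix WITHOUT a trailing slash
web_wd_no_slash <- sub("/$", "", web_wd)
with_file_path <- file.path(web_wd_no_slash, "data-raw", "pigeon.txt")

with_paste0
with_file_path
```

Both calls give the same string; which one reads better is mostly a matter of taste.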

RStudio Projects

Example

Download and unzip pigeons.zip which has the following structure:

-- pigeons
   |__data-processed
      |__pigeon_long.txt
   |__data-raw
      |__pigeon.txt
   |__figures
      |__fig1.tiff
   |__scripts
      |__analysis.R
      |__import_reshape.R
   |__pigeons.Rproj

RStudio Projects

Project is obviously a commonly used word. When I am referring to an RStudio Project I will use the capitalised words ‘RStudio Project’ or ‘Project’.
In other cases, I will use ‘project’.
An RStudio Project is a directory with an .Rproj file in it.
The name of the RStudio Project is the same as the name of the top level directory which is referred to as the Project directory.

For example, if you create an RStudio Project ipcs_data_2019 your folder structure would look something like this:

-- ipcs_data_2019
   |__ipcs_data_2019.Rproj
   |__data
      |__raw_data
         |__2019-03-21_ema_raw.csv
         |__2019-03-21_baseline_raw.csv
      |__clean_data
         |__2019-06-21_ema_long.csv
         |__2019-06-21_ema_long.RData
         |__2019-06-21_baseline_wide.csv
         |__2019-06-21_baseline_wide.RData
   |__results
      |__01_models
      |__02_summaries
      |__03_figures
      |__04_tables
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_background.Rmd
      |__02_data_cleaning.Rmd
      |__03_models.Rmd
      |__04_summary.Rmd

The .Rproj file is the defining feature of an RStudio Project.
When you open an RStudio Project, the working directory is set to the Project directory (i.e., the location of the .Rproj file).
This makes your work portable. You can zip up the project folder and send it to any person, including future you, or any computer.
They will be able to unzip, open the project and have all the code just work.
(This is great for sending code and/or results to your advisors)

Directory structure

You are aiming for structured, systematic and repeatable. For example, the Project directory might contain:

.Rproj file
README - tell people what the project is and how to use it
License - tell people what they are allowed to do with your project
Directories
data/
prereg/
scripts/
results/
manuscript/

README

READMEs are a form of documentation which have been widely used for a long time. They contain all the information about the other files in a directory. They can be extensive.
Wikipedia README page
GitHub Docs: About READMEs
OSF

A minimal README might give:

Title
Description, 50 words or so on what the project is
Technical Description of the project
- What software and packages are needed including versions
- Any instructions needed to run the analysis/use the software
- Any issues that a user might face in running the analysis/using the software
Instructions on how to use the work
Links to where other files, materials, etc. are stored
- E.g., an OSF readme may point to GitHub, PsyArxiv, etc.
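
One way to get the “software and packages including versions” part right is to let R report them. This is only a sketch: the README text is a placeholder, the file is written to the temporary directory rather than a real project, and stats stands in for whatever packages your analysis actually uses.

```r
# Ask R for its own version and for the version of a package
r_version <- R.version.string
pkg <- "stats"  # placeholder; substitute the packages your project loads
pkg_line <- sprintf("- %s %s", pkg, as.character(packageVersion(pkg)))

# Assemble a bare-bones README (the content here is placeholder text)
readme <- c(
  "# Project title",
  "",
  "## Software",
  paste("-", r_version),
  pkg_line
)

readme_path <- file.path(tempdir(), "README.md")
writeLines(readme, readme_path)
readLines(readme_path)
```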

License

A license tells others what they can and can’t do with your work.

choosealicense.com is a useful explainer.

I typically use:

MIT License for software
CC-BY-SA-4.0 for other work

Exercise

You are going to create an RStudio Project with some directories and use it to organise a very simple analysis.
The analysis will import a data file, reformat it and write the new format to file. It will then create a figure and write the image to file.
You’ll get practice with tidying data (more on that next week) and plotting data.

RStudio Project infrastructure

🎬 create a new Project called iris by:

clicking File -> New Project…, or
clicking on the little icon (second from the left) at the top
Choose New Project, then New Directory, then New Project. Name the RStudio Project iris.
Create folders in iris called data-raw, data-processed and figures.
Start new scripts called 01-import.R, 02-tidy.R, and 03-figures.R
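
If you prefer to set up the folders and empty scripts from the console instead of clicking, something like the following would work. This is a sketch: it builds the structure under the temporary directory so it can be run safely anywhere, whereas in the exercise the paths would be relative to your iris Project directory.

```r
# Stand-in for the Project directory (in the exercise this is your iris Project)
project_dir <- file.path(tempdir(), "iris")

folders <- c("data-raw", "data-processed", "figures")
scripts <- c("01-import.R", "02-tidy.R", "03-figures.R")

# Create each folder; recursive = TRUE also creates project_dir if needed
for (f in folders) {
  dir.create(file.path(project_dir, f), recursive = TRUE, showWarnings = FALSE)
}

# Create the empty script files; file.create() returns TRUE for each success
created <- file.create(file.path(project_dir, scripts))

dir(project_dir)
```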

Save and Import

Save a copy of iris.csv to your data-raw folder. These data give the information about different species of irises.
In your 01-import.R script, load the tidyverse set of packages.

Code

library(tidyverse)
write_csv(iris, file = "data-raw/iris.csv")

Add the command to import the data:

Code

iris <- read_csv("data-raw/iris.csv")

The relative path is data-raw/iris.csv because your working directory is the Project directory, iris.

Reformat the data

This dataset has four measurements in each row (sepal length, sepal width, petal length and petal width) - it is not ‘tidy’.

Open your 02-tidy.R script, and reshape the data using:

Code

iris <- pivot_longer(data = iris, 
                     cols = -Species, 
                     names_to = "attribute", 
                     values_to = "value")

This reformats the dataframe in R but does not overwrite the text file of the data.
Don’t worry too much about this right now. We’ll spend a lot of time talking about reshaping data next week!
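
To see what the reshape does, it can help to compare dimensions before and after. This sketch uses the copy of iris built into R (via datasets::iris) rather than the file imported above, so it runs on its own; the object names wide and long are just for illustration.

```r
library(tidyr)

wide <- datasets::iris  # 150 rows: one flower per row, four measurement columns plus Species

long <- pivot_longer(data = wide,
                     cols = -Species,          # reshape every column except Species
                     names_to = "attribute",   # old column names go here
                     values_to = "value")      # old cell values go here

dim(wide)  # rows and columns before
dim(long)  # four rows per flower, three columns after
unique(long$attribute)
```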

Writing files

Often we want to write to files.

My main reasons for doing so are to save copies of data that have been processed and to save manuscripts and graphics.
Also, as someone who collects a lot of data, I need multiple versions of the same data: the de-identified, fully anonymized files I can share and the identifiable files I collect (which also require encryption, keys, etc.).
Write your dataframe iris to a csv file named iris-long.csv in your data-processed folder:

Code

file <- "data-processed/iris-long.csv"
write_csv(iris, file)

Putting file paths into variables often makes your code easier to read, especially when file paths are long or used multiple times.

Create a plot

Open your 03-figures.R script and create a simple plot of this data with:

Code

fig1 <- ggplot(
  data = iris
  , aes(y = Species, x = value, fill = Species)
  ) + 
  geom_boxplot() +                       
  facet_grid(attribute~.) + 
  scale_x_continuous(name = "Attribute") +
  scale_y_discrete(name = "Species") +
  theme_classic() + 
  theme(legend.position = "none")

View plot

View plot with:

Code

fig1

Write ggplot figure to file

A useful function for saving ggplot figures is ggsave().
It has arguments for the size, resolution and device for the image. See the ggsave() reference page.
Since I often make more than one figure, I might set these arguments first.

Assign ggsave argument values to variables:

Code

# figure saving settings
units <- "in"  
fig_w <- 3.2
fig_h <- fig_w
dpi <- 600
device <- "tiff"

Save the figure to your figures directory:

Code

ggsave("figures/fig1.tiff",
       plot = fig1,
       device = device,
       width = fig_w,
       height = fig_h,
       units = units,
       dpi = dpi)

Check it is there!

Data Manipulation in `dplyr`

`dplyr` Core Functions

%>%: The pipe. Read as “and then.”
filter(): Pick observations (rows) by their values.
select(): Pick variables (columns) by their names.
arrange(): Reorder the rows.
group_by(): Implicitly split the data set by grouping by names (columns).
mutate(): Create new variables with functions of existing variables.
summarize() / summarise(): Collapse many values down to a single summary.

Although each of these functions is powerful alone, they are incredibly powerful in conjunction with one another. So below, I’ll briefly introduce each function, then link them all together using an example of basic data cleaning and summary.

1. `%>%`

The pipe %>% is wonderful. It makes coding intuitive. Often in coding, you need to use so-called nested functions. For example, you might want to round a number after taking the square root of 43.

Code

sqrt(43)

[1] 6.557439

Code

round(sqrt(43), 2)

[1] 6.56

The issue with this comes whenever we need to do a series of operations on a data set or other type of object. In such cases, if we run it in a single call, then we have to start in the middle and read our way out.

Code

round(sqrt(43/2), 2)

[1] 4.64

The pipe solves this by allowing you to read from left to right (or top to bottom). The easiest way to think of it is that each call of %>% reads and operates as “and then.” So with the rounded square root of 43, for example:

Code

sqrt(43) %>%
  round(2)

[1] 6.56
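
The longer nested call from above can be unrolled the same way. This sketch uses R's built-in native pipe |>, which is an assumption beyond the original materials (it needs R 4.1 or newer); for simple chains like this one it behaves like %>%.

```r
# Nested version: read from the inside out
nested <- round(sqrt(43 / 2), 2)

# Piped version: read top to bottom as "divide, and then take the root, and then round"
piped <- (43 / 2) |>
  sqrt() |>
  round(2)

nested
piped
```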

2. `filter()`

Often, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include. filter() keeps only the rows that meet the conditions you give it.

Code

data(bfi) # grab the bfi data from the psych package
bfi <- bfi %>% as_tibble()
head(bfi)

Code

summary(bfi$age) # get age descriptives

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00   20.00   26.00   28.78   35.00   86.00

Code

bfi2 <- bfi %>% # see a pipe!
  filter(age <= 18) # filter to age up to 18

summary(bfi2$age) # summary of the new data

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    3.0    16.0    17.0    16.3    18.0    18.0

But this isn’t quite right. We still have folks below 12. The beauty of filter() is that you can chain a sequence of OR and AND statements when there is more than one condition, such as up to 18 AND at least 12.

Code

bfi2 <- bfi %>%
  filter(age <= 18 & age >= 12) # filter to age up to 18 and at least 12

summary(bfi2$age) # summary of the new data

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   12.0    16.0    17.0    16.4    18.0    18.0

Got it!

But filter() works for more use cases than just the numeric comparisons
- <, >, <=, and >=
It can also be used when we want to match cases against one or more text values.
To do that, let’s convert one of the variables in the bfi data frame to a string.
So let’s change education (1 = Below HS, 2 = HS, 3 = Some College, 4 = College, 5 = Higher Degree) to text (we’ll get into factors later).

Code

bfi$education <- plyr::mapvalues(bfi$education, 1:5, c("Below HS", "HS", "Some College", "College", "Higher Degree"))

Now let’s try a few things:

1. Create a data set with only individuals with some college (==).

Code

bfi2 <- bfi %>% 
  filter(education == "Some College")
unique(bfi2$education)

[1] "Some College"

2. Create a data set with only people age 18 (==).

Code

bfi2 <- bfi %>%
  filter(age == 18)
summary(bfi2$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     18      18      18      18      18      18

3. Create a data set with individuals with some college or above (%in%).

Code

bfi2 <- bfi %>%
  filter(education %in% c("Some College", "College", "Higher Degree"))
unique(bfi2$education)

[1] "Some College"  "Higher Degree" "College"

%in% is great. Rather than comparing a column to just a single value, it compares the column to a vector, so you can match several values at once.

Code

bfi2 <- bfi %>%
  filter(age %in% 12:18)
summary(bfi2$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   12.0    16.0    17.0    16.4    18.0    18.0
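
The examples above all use AND. Here is a sketch of OR, plus two other ways of writing AND, using a small made-up data frame (not the bfi data) so the result is easy to check by eye. between() is a dplyr helper for inclusive ranges.

```r
library(dplyr)

# Hypothetical toy data, not the bfi data
toy <- tibble(id = 1:6, age = c(3, 11, 12, 18, 40, 86))

# OR: keep rows that are under 12 or over 65
extremes <- toy %>%
  filter(age < 12 | age > 65)

# AND written with commas: every comma-separated condition must be TRUE
teens <- toy %>%
  filter(age >= 12, age <= 18)

# between() is shorthand for an inclusive range
teens_between <- toy %>%
  filter(between(age, 12, 18))

extremes
teens
```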

3. `select()`

If filter() is for pulling certain observations (rows), then select() is for pulling certain variables (columns).
Data sets often come with columns you don’t need, and it’s good practice to remove these columns to stop your environment from becoming cluttered and eating up your RAM.
In our bfi data, most of these have been pre-removed, so instead, we’ll imagine we don’t want to use any indicators of Agreeableness (A1-A5) and that we aren’t interested in gender.
With select(), there are a few ways to choose variables. We can bare quote (i.e., type without quotation marks) the names of the ones we want to keep, bare quote the names of the ones we want to remove, or use any of a number of select() helper functions.

A. Bare quote columns we want to keep:

Code

bfi %>%
  select(C1, C2, C3, C4, C5) %>%
  print(n = 6)

# A tibble: 2,800 × 5
     C1    C2    C3    C4    C5
  <int> <int> <int> <int> <int>
1     2     3     3     4     4
2     5     4     4     3     4
3     4     5     4     2     5
4     4     4     3     5     5
5     4     4     5     3     2
6     6     6     6     1     3
# ℹ 2,794 more rows

Code

bfi %>%
  select(C1:C5) %>%
  print(n = 6)

# A tibble: 2,800 × 5
     C1    C2    C3    C4    C5
  <int> <int> <int> <int> <int>
1     2     3     3     4     4
2     5     4     4     3     4
3     4     5     4     2     5
4     4     4     3     5     5
5     4     4     5     3     2
6     6     6     6     1     3
# ℹ 2,794 more rows

B. Bare quote columns we don’t want to keep:

Code

bfi %>% 
  select(-(A1:A5), -gender) %>% # Note the `()` around the columns
  print(n = 6)

# A tibble: 2,800 × 22
     C1    C2    C3    C4    C5    E1    E2    E3    E4    E5    N1    N2    N3
  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1     2     3     3     4     4     3     3     3     4     4     3     4     2
2     5     4     4     3     4     1     1     6     4     3     3     3     3
3     4     5     4     2     5     2     4     4     4     5     4     5     4
4     4     4     3     5     5     5     3     4     4     4     2     5     2
5     4     4     5     3     2     2     2     5     4     5     2     3     4
6     6     6     6     1     3     2     1     6     5     6     3     5     2
# ℹ 2,794 more rows
# ℹ 9 more variables: N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
#   O4 <int>, O5 <int>, education <chr>, age <int>

C. Add or remove using `select()` helper functions.

starts_with()
ends_with()
contains()
matches()
num_range()
one_of()
all_of()

Code

bfi %>%
  select(starts_with("C"))
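
A few of the other helpers, sketched on a small made-up data frame (not the bfi data) whose column names mimic the bfi naming pattern. The object names are just for illustration.

```r
library(dplyr)

# Hypothetical toy data with bfi-style column names
toy <- tibble(A1 = 1:2, A2 = 3:4, C1 = 5:6, C2 = 7:8,
              age = c(20L, 31L), gender = c(1L, 2L))

# ends_with(): columns whose names end in "1"
first_items <- toy %>% select(ends_with("1")) %>% names()

# num_range(): a prefix plus a numeric range, here A1 and A2
a_items <- toy %>% select(num_range("A", 1:2)) %>% names()

# matches(): a regular expression, here any name ending in a digit
all_items <- toy %>% select(matches("[0-9]$")) %>% names()

# all_of(): select using a character vector of names stored in an object
keep <- c("age", "gender")
demographics <- toy %>% select(all_of(keep)) %>% names()

first_items
a_items
all_items
demographics
```

all_of() is handy when the column names come from somewhere else in your script, such as a codebook.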

4. `arrange()`

Sometimes, either to get a better sense of our data or simply to order it, we want to sort our data.
Although there is a base R sort() function, the arrange() function is the tidyverse version that plays nicely with other tidyverse functions.

So in our previous examples, we could also arrange() our data by age or education, rather than simply filtering. (Or as we’ll see later, we can do both!)

Code

# sort by age
bfi %>% 
  select(gender:age) %>%
  arrange(age) %>% 
  print(n = 6)

# A tibble: 2,800 × 3
  gender education       age
   <int> <chr>         <int>
1      1 Higher Degree     3
2      2 <NA>              9
3      2 Some College     11
4      2 <NA>             11
5      2 <NA>             11
6      2 <NA>             12
# ℹ 2,794 more rows

Code

# sort by education
bfi %>%
  select(gender:age) %>%
  arrange(education) %>%
  print(n = 6)

# A tibble: 2,800 × 3
  gender education   age
   <int> <chr>     <int>
1      1 Below HS     19
2      1 Below HS     21
3      1 Below HS     17
4      1 Below HS     18
5      1 Below HS     18
6      2 Below HS     18
# ℹ 2,794 more rows

We can also arrange by multiple columns, like if we wanted to sort by gender then education:

Code

bfi %>%
  select(gender:age) %>%
  arrange(gender, education) %>% 
  print(n = 6)

# A tibble: 2,800 × 3
  gender education   age
   <int> <chr>     <int>
1      1 Below HS     19
2      1 Below HS     21
3      1 Below HS     17
4      1 Below HS     18
5      1 Below HS     18
6      1 Below HS     32
# ℹ 2,794 more rows
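
arrange() sorts in ascending order by default. As a sketch on made-up data (not the bfi data), wrapping a column in dplyr's desc() reverses the order for that column only.

```r
library(dplyr)

# Hypothetical toy data, not the bfi data
toy <- tibble(gender = c(2L, 1L, 2L, 1L), age = c(30L, 25L, 41L, 19L))

# Oldest first
oldest_first <- toy %>%
  arrange(desc(age))

# Gender ascending, then oldest first within each gender
by_gender_then_oldest <- toy %>%
  arrange(gender, desc(age))

oldest_first
by_gender_then_oldest
```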

Bringing it all together: Split-Apply-Combine

Much of the power of dplyr functions lies in the split-apply-combine method
A given set of data is:
- split into smaller chunks
- then a function or series of functions are applied to each chunk
- and then the chunks are combined back together
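
To make the three steps concrete before turning to dplyr, here is a sketch that does them by hand in base R on a small made-up data frame. This is not how dplyr is implemented; it only spells out the idea.

```r
# Hypothetical toy data
toy <- data.frame(
  education = c("HS", "HS", "College", "College", "College"),
  age = c(18, 22, 30, 34, 26)
)

# Split: one data frame per education level
chunks <- split(toy, toy$education)

# Apply: compute the mean age within each chunk
chunk_means <- lapply(chunks, function(d) {
  data.frame(education = d$education[1], mean_age = mean(d$age))
})

# Combine: stack the per-chunk results back into one data frame
combined <- do.call(rbind, chunk_means)
combined
```

With dplyr, group_by() takes over the split and functions like mutate() and summarize() handle the apply and combine steps.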

5. `group_by()`

The group_by() function is the “split” of the method
It implicitly breaks the data set into chunks defined by whatever bare quoted column(s)/variable(s) are supplied as arguments.

So imagine that we wanted to group_by() education levels to get average ages at each level

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  print(n = 6)

# A tibble: 2,800 × 8
# Groups:   education [6]
     C1    C2    C3    C4    C5   age gender education   
  <int> <int> <int> <int> <int> <int>  <int> <chr>       
1     2     3     3     4     4    16      1 <NA>        
2     5     4     4     3     4    18      2 <NA>        
3     4     5     4     2     5    17      2 <NA>        
4     4     4     3     5     5    17      2 <NA>        
5     4     4     5     3     2    17      1 <NA>        
6     6     6     6     1     3    21      2 Some College
# ℹ 2,794 more rows

Hadley’s first law of data cleaning: “What is split, must be combined”
This is super easy with the ungroup() function:

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  ungroup() %>%
  print(n = 6)

# A tibble: 2,800 × 8
     C1    C2    C3    C4    C5   age gender education   
  <int> <int> <int> <int> <int> <int>  <int> <chr>       
1     2     3     3     4     4    16      1 <NA>        
2     5     4     4     3     4    18      2 <NA>        
3     4     5     4     2     5    17      2 <NA>        
4     4     4     3     5     5    17      2 <NA>        
5     4     4     5     3     2    17      1 <NA>        
6     6     6     6     1     3    21      2 Some College
# ℹ 2,794 more rows

A later group_by() call overwrites the grouping from earlier calls:

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  group_by(gender, age) %>%
  print(n = 6)

# A tibble: 2,800 × 8
# Groups:   gender, age [115]
     C1    C2    C3    C4    C5   age gender education   
  <int> <int> <int> <int> <int> <int>  <int> <chr>       
1     2     3     3     4     4    16      1 <NA>        
2     5     4     4     3     4    18      2 <NA>        
3     4     5     4     2     5    17      2 <NA>        
4     4     4     3     5     5    17      2 <NA>        
5     4     4     5     3     2    17      1 <NA>        
6     6     6     6     1     3    21      2 Some College
# ℹ 2,794 more rows

6. `mutate()`

mutate() is one of your “apply” functions
When you use mutate(), the resulting data frame will have the same number of rows you started with
You are modifying existing columns or creating new ones; the original data frame only changes if you assign the result back to it

To demonstrate, let’s add a column that indicates the average age within each education group

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>% 
  mutate(age_by_edu = mean(age, na.rm = TRUE)) %>%
  print(n = 6)

# A tibble: 2,800 × 9
# Groups:   education [6]
     C1    C2    C3    C4    C5   age gender education age_by_edu
  <int> <int> <int> <int> <int> <int>  <int> <chr>          <dbl>
1     6     6     3     4     5    19      1 Below HS        25.1
2     4     3     5     3     2    21      1 Below HS        25.1
3     5     5     5     2     2    17      1 Below HS        25.1
4     5     5     4     1     1    18      1 Below HS        25.1
5     4     5     4     3     3    18      1 Below HS        25.1
6     3     2     3     4     6    18      2 Below HS        25.1
# ℹ 2,794 more rows

mutate() is also super useful even when you aren’t grouping

We can create a new category

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender_cat = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))

We could also just overwrite it:

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))
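
A sketch on made-up data (not the bfi data) that puts grouped and ungrouped mutate() side by side. It also swaps in dplyr's if_else() for plyr::mapvalues() as one alternative way to recode a two-level variable; that substitution is mine and is not part of the examples above.

```r
library(dplyr)

# Hypothetical toy data, not the bfi data
toy <- tibble(
  gender    = c(1L, 2L, 2L, 1L),
  education = c("HS", "HS", "College", "College"),
  age       = c(18, 22, 30, 34)
)

toy2 <- toy %>%
  # Ungrouped: mean(age) is the mean of the whole column
  mutate(age_centered = age - mean(age),
         gender_cat = if_else(gender == 1L, "Male", "Female")) %>%
  # Grouped: mean(age) is computed separately within each education level
  group_by(education) %>%
  mutate(age_by_edu = mean(age)) %>%
  ungroup()

toy2
```

Either way, the result has the same four rows as the input; only the columns change.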

7. `summarize()` / `summarise()`

summarize() is one of your “apply” functions
The resulting data frame will have one row per group (i.e., one row for each unique value of your grouping variable)
The number of groups is 1 for ungrouped data frames, so you get a single row

Code

# group_by() education
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>% 
  summarize(age_by_edu = mean(age, na.rm = TRUE))

Code

# no groups  
bfi %>% 
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  summarize(age_by_edu = mean(age, na.rm = TRUE))
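
A sketch on made-up data (not the bfi data) showing the row counts you get back, and that summarize() can compute several summaries at once. n() is dplyr's function for counting the rows in each group.

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy <- tibble(
  education = c("HS", "HS", "College", "College", "College"),
  age       = c(18, 22, 30, 34, NA)
)

# Grouped: one row per education level
by_edu <- toy %>%
  group_by(education) %>%
  summarize(n = n(),
            mean_age = mean(age, na.rm = TRUE))

# Ungrouped: a single row for the whole data frame
overall <- toy %>%
  summarize(n = n(),
            mean_age = mean(age, na.rm = TRUE))

by_edu
overall
```

Note that n() counts rows, including the one with a missing age, while mean() with na.rm = TRUE drops it.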

--- title: "Week 2 Workbook" author: "Emorie D Beck" format: html: code-tools: true code-copy: true code-line-numbers: true code-link: true theme: united highlight-style: tango df-print: paged code-fold: show toc: true toc-float: true self-contained: true # height: 900 footer: "PSC 290 - Data Cleaning and Management FQ23" logo: "https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/ucdavis_logo_blue.png" editor: visual editor_options: chunk_output_type: console --- # Week 2: Reproducibility and Data Transformations ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.width = 4, fig.height = 4, fig.retina = 3) options(htmltools.dir.version = FALSE) ``` ```{r, echo = T} library(knitr) library(psych) library(emo) library(plyr) library(tidyverse) ``` # Overview Today's class outline:\ - Welcome back, questions on homework (10-15 minutes) - Reproducibility and Your Personal Values (10 minutes) - Building a Reproducible Workflow Using Projects (45 miuntes) - Data Transformations using `dplyr` (45 minutes) # Reproducibility and Your Personal Values ## Why reproducibility AND values? - The definition of reproducibility is somewhat debated - "'Reproducibility' refers to instances in which the original researcher's data and computer codes are used to regenerate the results"\ - "'Reproducibility' refers to independent researchers arriving at the same results using their own data and methods"\ - But regardless of what definition you choose, reproducibility starts with a commitment in research to be - clear\ - transparent - honest\ - thorough ## Why reproducibility AND values? - Reproducibility is *ethical*. - When I post a project, I pour over my code for hours, adding comments, rendering to multiple formats, trying to flag locations in online materials in the mansucript, etc. 
- I am trying to prevent errors, but I am also trying to make sure that other people know what I did, especially if I did make errors - Reproducible research is also *equitable.* - A reproducible research workflow can be downloaded by another person as a starting point, providing tools to other researchers who may not have the same access to education and resources as you ## Where should we reproducible? - Planning - Study planning and design\ - Lab Protocols\ - Codebooks\ - etc.\ - Analyses - Scripting\ - Communication\ - etc. ## Aspects of Reproducibility - Data within files should be 'tidy' (next week -- `tidyr`) - Project based approach (today) - Consistency: naming, space, style (today) - Documentation: commenting and README (today) - Literate programming e.g. Rmarkdown (every day!) # Building a Reproducible Workflow Using Projects ## Reproducible Workflow A reproducible workflow is *organized*. What does it mean to be be organized? At least: - Use a project based approach, e.g., RStudio project or similar\ - Have a hierarchical folder structure\ - Have a consistent and informative naming system that 'plays nice'\ - Document code with comments and analyses with README ::: fragment More advanced (later in the class) ::: - Generalize with functions and packages - version control\  ## What is a project? - A project is a discrete piece of work which has a number of files associated with it such as the data and scripts for an analysis and the production reports. - Using a project-oriented workflow means to have a hierarchical folder structure with everything needed to reproduce an analysis. 
One research project might have several organizational projects associated with it, for example: - data files and metadata (which may be made into a package) - preregistration - analysis and reporting - a package developed for the analysis - an app for allowing data to be explored by others ## Example Good Workflows are: - structured\ - systematic\ - repeatable **Naming** - human and machine readable - no spaces\ - use snake/kebab case\ - ordering: numbers (zero left padded), dates\ - file extensions ``` -- ipcs_data_2019 |__ipcs_data_2019.Rproj |__data |__raw_data |__2019-03-21_ema_raw.csv |__2019-03-21_baseline_raw.csv |__clean_data |__2019-06-21_ema_long.csv |__2019-06-21_ema_long.RData |__2019-06-21_baseline_wide.csv |__2019-06-21_baseline_wide.RData |__results |__01_models |__E_mortality.RData |__A_mortality.RData |__02_summaries |__E_mortality.RData |__A_mortality.RData |__03_figures |__mortality.png |__mortality.pdf |__04_tables |__zero_order_cors.RData |__descriptives.RData |__key_terms.RData |__all_model_terms.RData |__README.md |__refs |__r_refs.bib |__proj_refs.bib |__analyses |__01_background.Rmd |__02_data_cleaning.Rmd |__03_models.Rmd |__04_summary.Rmd ``` ## What is a path? A path gives the address - or location - of a filesystem object, such as a file or directory. - Paths appear in the address bar of your browser or file explorer. - We need to know a file path whenever we want to read, write or refer to a file using code rather than interactively pointing and clicking to navigate. - A path can be **absolute** or **relative** - absolute = whole path from root - relative = path from current directory ### Absolute paths - An Absolute path is given from the "root directory" of the object. - The root directory of a file system is the first or top directory in the hierarchy. - For example, `C:\` or `M:\` on windows or `/` on a Mac which is displayed as Macintosh HD in Finder. 
The absolute path for a file, `pigeon.txt` could be:

- Windows: `C:/Users/edbeck/Desktop/pigeons/data-raw/pigeon.txt`
- Mac/unix systems: `/Users/edbeck/Desktop/pigeons/data-raw/pigeon.txt`\
- web: `http://github.com/emoriebeck/pigeons/data/pigeon.txt`

### What is a directory?

- Directory is the old word for what many now call a folder `r emo::ji("folder")`.
- Commands that act on directories in most programming languages and environments reflect this.
- For example, in `R` this means "tell me my working directory":
  - `getwd()` **get** **w**orking **d**irectory in R

### What is a working directory?

- The working directory is the default location a program is using. It is where the program will read and write files by default. You have only one working directory at a time.
- The terms 'working directory', 'current working directory' and 'current directory' all mean the same thing.

Find your current working directory with:

```{r}
getwd()
```

### Relative paths

A relative path gives the location of a filesystem object *relative* to the working directory (i.e., that returned by `getwd()`).

- When `pigeon.txt` is in the working directory the relative path is just the file name: `pigeon.txt`
- If there is a folder in the working directory called `data-raw` and `pigeon.txt` is in there then the relative path is `data-raw/pigeon.txt`

### Paths: moving up the hierarchy

- `../` allows you to look in the directory above the working directory
- When `pigeon.txt` is in the folder above the working directory, the relative path is `../pigeon.txt`
- And if it is in a folder called `data-raw` which is in the directory above the working directory then the relative path is `../data-raw/pigeon.txt`

### What's in my directory?
You can list the contents of a directory using the `dir()` command

- `dir()` lists the contents of the working directory
- `dir("..")` lists the contents of the directory above the working directory
- `dir("../..")` lists the contents of the directory two directories above the working directory
- `dir("data-raw")` lists the contents of a folder called `data-raw` which is in the working directory.

### Relative or absolute

- Most of the time you should use relative paths because that makes your work portable (i.e. to a different machine / user / etc.).
- `r emo::ji("party")` The tab key is your friend!
- You only need to use absolute paths when you are referring to a filesystem outside the one you are using.
- I often store the beginning of that path as an object.
  - `web_wd <- "https://github.com/emoriebeck/pigeons/"`
- Then I can use `sprintf()` or `paste()` to add different endings

```{r}
web_wd <- "https://github.com/emoriebeck/pigeons/"
# web_wd already ends in "/", so the format string does not add another one
sprintf("%sdata-raw/pigeon.txt", web_wd)
```

# RStudio Projects

## Example

Download and unzip [pigeons.zip](../pigeons.zip) which has the following structure:

```
-- pigeons
   |__data-processed
      |__pigeon_long.txt
   |__data-raw
      |__pigeon.txt
   |__figures
      |__fig1.tiff
   |__scripts
      |__analysis.R
      |__import_reshape.R
   |__pigeons.Rproj
```

## RStudio Projects

- Project is obviously a commonly used word. When I am referring to an [RStudio Project](https://support.posit.co/hc/en-us/articles/200526207-Using-Projects) I will use the capitalised words 'RStudio Project' or 'Project'.
- In other cases, I will use 'project'.
- An RStudio Project is a directory with an `.Rproj` file in it.
- The name of the RStudio Project is the same as the name of the top level directory which is referred to as the Project directory.
For example, if you create an RStudio Project `ipcs_data_2019` your folder structure would look something like this:

```
-- ipcs_data_2019
   |__ipcs_data_2019.Rproj
   |__data
      |__raw_data
         |__2019-03-21_ema_raw.csv
         |__2019-03-21_baseline_raw.csv
      |__clean_data
         |__2019-06-21_ema_long.csv
         |__2019-06-21_ema_long.RData
         |__2019-06-21_baseline_wide.csv
         |__2019-06-21_baseline_wide.RData
   |__results
      |__01_models
      |__02_summaries
      |__03_figures
      |__04_tables
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_background.Rmd
      |__02_data_cleaning.Rmd
      |__03_models.Rmd
      |__04_summary.Rmd
```

- The `.Rproj` file is the defining feature of an RStudio Project.
- When you open an RStudio Project, the working directory is set to the Project directory (i.e., the location of the `.Rproj` file).
- This makes your work portable. You can zip up the project folder and send it to any person, including future you, or any computer.
- They will be able to unzip, open the project and have all the code just work.
- (This is great for sending code and/or results to your advisors)

## Directory structure

You are aiming for structured, systematic and repeatable. For example, the Project directory might contain:

- `.Rproj` file\
- README - tell people what the project is and how to use it\
- License - tell people what they are allowed to do with your project
- Directories
  - data/\
  - prereg/\
  - scripts/
  - results/\
  - manuscript/

## README

- READMEs are a form of documentation which have been widely used for a long time. They contain all the information about the other files in a directory. They can be extensive.
- Wikipedia [README page](https://en.wikipedia.org/wiki/README)\
- GitHub Doc's [About READMEs](https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/about-readmes)\
- OSF

A minimal README might give:

- Title
- Description, 50 words or so on what the project is
- Technical Description of the project
  - What software and packages are needed including versions
  - Any instructions needed to run the analysis/use the software
  - Any issues that a user might face in running the analysis/using the software
- Instructions on how to use the work
- Links to where other files, materials, etc. are stored
  - E.g., an OSF readme may point to GitHub, PsyArxiv, etc.

## License

A license tells others what they can and can't do with your work.

[choosealicense.com](https://choosealicense.com/) is a useful explainer.

I typically use:

- [MIT License](https://choosealicense.com/licenses/mit/) for software
- [CC-BY-SA-4.0](https://choosealicense.com/licenses/cc-by-sa-4.0/) for other work

# Exercise

- You are going to create an RStudio Project with some directories and use it to organise a very simple analysis.
- The analysis will import a data file, reformat it and write the new format to file. It will then create a figure and write the image to file.
- You'll get practice with tidying data (more on that next week) and plotting data.

## RStudio Project infrastructure

`r emo::ji("clapper")` create a new Project called `iris` by:

- clicking **File-\>New Project...**
- clicking on the little icon (second from the left) at the top
- Choose New Project, then New Directory, then New Project. Name the RStudio Project `iris`.
- Create folders in `iris` called `data-raw`, `data-processed` and `figures`.
- Start new scripts called `01-import.R`, `02-tidy.R`, and `03-figures.R`

## Save and Import

- Save a copy of [iris.csv](data/iris.csv) to your `data-raw` folder. These data give the information about different species of irises.
- In your `01-import.R` script, load the tidyverse set of packages.

```{r eval=FALSE}
library(tidyverse)
write_csv(iris, file = "data-raw/iris.csv")
```

- Add the command to import the data:

```{r, eval = F}
iris <- read_csv("data-raw/iris.csv")
```

```{r, echo = F}
data(iris)
```

- The relative path is `data-raw/iris.csv` because your working directory is the Project directory, `iris`.

## Reformat the data

This dataset has four measurements in each row - it is not 'tidy'.

- Open your `02-tidy.R` script, and reshape the data using:

```{r}
iris <- pivot_longer(data = iris,
                     cols = -Species,
                     names_to = "attribute",
                     values_to = "value")
```

- This reformats the dataframe in R but does not overwrite the text file of the data.
- Don't worry too much about this right now. We'll spend a lot of time talking about reshaping data next week!

## Writing files

Often we want to write to files.

- My main reasons for doing so are to save copies of data that have been processed and to save manuscripts and graphics.
- Also, as someone who collects a lot of data, the de-identified, fully anonymized data files I can share and the identifiable data I collect require multiple versions (and encryption, keys, etc.)
- Write your dataframe `iris` to a csv file named `iris-long.csv` in your `data-processed` folder:

```{r, eval = F}
file <- "data-processed/iris-long.csv"
write_csv(iris, file)
```

```{r, echo = F, eval = F}
file <- "iris/data-processed/iris-long.csv"
write_csv(iris, file)
```

- Putting file paths into variables often makes your code easier to read, especially when file paths are long or used multiple times.

## Create a plot

Open your `03-figures.R` script and create a simple plot of this data with:

```{r}
fig1 <- ggplot(
  data = iris
  , aes(y = Species, x = value, fill = Species)
  ) +
  geom_boxplot() +
  facet_grid(attribute~.) +
  scale_x_continuous(name = "Attribute") +
  scale_y_discrete(name = "Species") +
  theme_classic() +
  theme(legend.position = "none")
```

## View plot

View plot with:

```{r, fig.width = 8}
fig1
```

## Write ggplot figure to file

- A useful function for saving ggplot figures is `ggsave()`.
- It has arguments for the size, resolution and device for the image. See the [`ggsave()` reference page](https://ggplot2.tidyverse.org/reference/ggsave.html).
- Since I often make more than one figure, I might set these arguments first.

::: columns
::: column
- Assign `ggsave` argument values to variables:

```{r}
# figure saving settings
units <- "in"
fig_w <- 3.2
fig_h <- fig_w
dpi <- 600
device <- "tiff"
```
:::

::: column
- Save the figure to your figures directory:

```{r, eval = F}
ggsave("figures/fig1.tiff",
       plot = fig1,
       device = device,
       width = fig_w,
       height = fig_h,
       units = units,
       dpi = dpi)
```

```{r, echo = F}
ggsave("iris/figures/fig1.tiff",
       plot = fig1,
       device = device,
       width = fig_w,
       height = fig_h,
       units = units,
       dpi = dpi)
```

- Check it is there!
:::
:::

------------------------------------------------------------------------

::: {.columns style="display: flex !important; height: 90%;"}
::: {.column width="70%" style="display: flex; align-items: center;"}

# Data Manipulation in `dplyr`

:::

::: {.column width="30%" style="display: flex; justify-content: center; align-items: center;"}
```{r, fig.align='center', echo = F}
knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/dplyr.png")
```
:::
:::

# `dplyr` Core Functions

1. **`%>%`**: The pipe. Read as "and then."
2. **`filter()`**: Pick observations (rows) by their values.
3. **`select()`**: Pick variables (columns) by their names.
4. **`arrange()`**: Reorder the rows.
5. **`group_by()`**: Implicitly split the data set by grouping by names (columns).
6. **`mutate()`**: Create new variables with functions of existing variables.
7. **`summarize()` / `summarise()`**: Collapse many values down to a single summary.

## Core Functions

::: columns
::: {.column width="40%"}
::: nonincremental
1. **`%>%`**
2. **`filter()`**
3. **`select()`**
4. **`arrange()`**
5. **`group_by()`**
6. **`mutate()`**
7. **`summarize()`**
:::
:::

::: {.column width="60%" style="text-align: center; background-color: #FFD966; color: black; border: 5px solid #033266;"}
Although each of these functions is powerful alone, they are incredibly powerful in conjunction with one another. So below, I'll briefly introduce each function, then link them all together using an example of basic data cleaning and summary.
:::
:::

## 1. `%>%`

- The pipe `%>%` is wonderful. It makes coding intuitive. Often in coding, you need to use so-called nested functions. For example, you might want to round a number after taking the square root of 43.

```{r, echo = T}
sqrt(43)
round(sqrt(43), 2)
```

The issue with this comes whenever we need to do a series of operations on a data set or other type of object. In such cases, if we run it in a single call, then we have to start in the middle and read our way out.

```{r, echo = T}
round(sqrt(43/2), 2)
```

The pipe solves this by allowing you to read from left to right (or top to bottom). The easiest way to think of it is that each call of `%>%` reads and operates as "and then." So with the rounded square root of 43, for example:

```{r, echo = T}
sqrt(43) %>%
  round(2)
```

## 2. `filter()`

Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don't want to include.

```{r, echo=TRUE}
data(bfi) # grab the bfi data from the psych package
bfi <- bfi %>% as_tibble()
head(bfi)
```
```{r, echo = T}
summary(bfi$age) # get age descriptives
```

```{r, echo = T}
#| code-line-numbers: "|2"
bfi2 <- bfi %>% # see a pipe!
  filter(age <= 18) # filter to age up to 18

summary(bfi2$age) # summary of the new data
```

But this isn't quite right. We still have folks below 12. But, the beauty of `filter()` is that you can do a sequence of `OR` and `AND` statements when there is more than one condition, such as up to 18 `AND` at least 12.

```{r, echo = T}
bfi2 <- bfi %>%
  filter(age <= 18 & age >= 12) # filter to age up to 18 and at least 12

summary(bfi2$age) # summary of the new data
```

Got it!

- But `filter()` works for more use cases than just numeric comparisons
  - `<`, `>`, `<=`, and `>=`
- It can also be used when we want to keep the rows where a column matches a text value.
- To do that, let's convert one of the variables in the `bfi` data frame to a string.
- So let's change education (coded 1 to 5) to text (we'll get into factors later).

```{r, echo = T}
bfi$education <- plyr::mapvalues(bfi$education, 1:5, c("Below HS", "HS", "Some College", "College", "Higher Degree"))
```

Now let's try a few things:

**1. Create a data set with only individuals with some college (`==`).**

```{r, echo = T}
bfi2 <- bfi %>%
  filter(education == "Some College")
unique(bfi2$education)
```

**2. Create a data set with only people age 18 (`==`).**

```{r, echo = T}
bfi2 <- bfi %>%
  filter(age == 18)
summary(bfi2$age)
```

**3. Create a data set with individuals with some college or above (`%in%`).**

```{r, echo = T}
bfi2 <- bfi %>%
  filter(education %in% c("Some College", "College", "Higher Degree"))
unique(bfi2$education)
```

`%in%` is great. Rather than comparing a column to a single value, it compares the column to a vector, so you can match several values at once.

```{r, echo = T}
bfi2 <- bfi %>%
  filter(age %in% 12:18)
summary(bfi2$age)
```

## 3. `select()`

- If `filter()` is for pulling certain observations (rows), then `select()` is for pulling certain variables (columns).
- When a data set includes columns you don't need, it's good practice to remove them to stop your environment from becoming cluttered and eating up your RAM.
- In our `bfi` data, most of these have been pre-removed, so instead, we'll imagine we don't want to use any indicators of Agreeableness (A1-A5) and that we aren't interested in gender.
- With `select()`, there are a few ways to choose variables. We can bare quote the names of the ones we want to keep, bare quote the names of the ones we want to remove, or use any of a number of `select()` helper functions.

### A. Bare quote columns we want to keep:

::: columns
::: column
```{r, echo = T}
#| code-line-numbers: "|2"
bfi %>%
  select(C1, C2, C3, C4, C5) %>%
  print(n = 6)
```
:::

::: column
```{r, echo=T}
#| code-line-numbers: "|2"
bfi %>%
  select(C1:C5) %>%
  print(n = 6)
```
:::
:::

### B. Bare quote columns we don't want to keep:

```{r, echo = T}
#| code-line-numbers: "|2"
bfi %>%
  select(-(A1:A5), -gender) %>% # Note the `()` around the columns
  print(n = 6)
```

### C. Add or remove using `select()` helper functions.

::: columns
::: {.column width="40%"}
- `starts_with()`\
- `ends_with()`
- `contains()`
- `matches()`
- `num_range()`
- `one_of()`
- `all_of()`
:::

::: {.column width="60%"}
::: fragment
```{r, echo = T}
bfi %>%
  select(starts_with("C"))
```
:::
:::
:::

## 4. `arrange()`

- Sometimes, either in order to get a better sense of our data or in order to, well, order our data, we want to sort it
- Although there is a base `R` `sort()` function, the `arrange()` function is the `tidyverse` version that plays nicely with other `tidyverse` functions.

::: columns
So in our previous examples, we could also `arrange()` our data by age or education, rather than simply filtering. (Or as we'll see later, we can do both!)

::: {.column width="50%"}
```{r, echo = T}
#| code-line-numbers: "|4"
# sort by age
bfi %>%
  select(gender:age) %>%
  arrange(age) %>%
  print(n = 6)
```
:::

::: {.column width="50%"}
```{r, echo=TRUE}
#| code-line-numbers: "|4"
# sort by education
bfi %>%
  select(gender:age) %>%
  arrange(education) %>%
  print(n = 6)
```
:::
:::

We can also arrange by multiple columns, like if we wanted to sort by gender then education:

```{r, echo = T}
bfi %>%
  select(gender:age) %>%
  arrange(gender, education) %>%
  print(n = 6)
```

# Bringing it all together: Split-Apply-Combine

- Much of the power of `dplyr` functions lies in the split-apply-combine method
- A given set of data is:
  - *split* into smaller chunks
  - then a function or series of functions are *applied* to each chunk
  - and then the chunks are *combined* back together

## 5. `group_by()`

- The `group_by()` function is the "split" of the method
- It basically implicitly breaks the data set into chunks by whatever bare quoted column(s)/variable(s) are supplied as arguments.

So imagine that we wanted to `group_by()` education levels to get average ages at each level

```{r, echo = T}
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  print(n = 6)
```

- Hadley's first law of data cleaning: "What is split, must be combined"
- This is super easy with the `ungroup()` function:

```{r, echo=TRUE}
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  ungroup() %>%
  print(n = 6)
```

Multiple `group_by()` calls overwrite previous calls:

```{r, echo = T}
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  group_by(gender, age) %>%
  print(n = 6)
```

## 6. `mutate()`

- `mutate()` is one of your "apply" functions
- When you use `mutate()`, the resulting data frame will have the same number of rows you started with
- You are directly mutating the existing data frame, either modifying existing columns or creating new ones

To demonstrate, let's add a column that indicates the average age within each education group

```{r, echo = T}
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>%
  mutate(age_by_edu = mean(age, na.rm = T)) %>%
  print(n = 6)
```

`mutate()` is also super useful even when you aren't grouping

We can create a new category

```{r, echo = T}
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender_cat = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))
```

We could also just overwrite it:

```{r, echo = T}
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))
```

## 7. `summarize()` / `summarise()`

- `summarize()` is one of your "apply" functions
- The resulting data frame will have one row for each group defined by your grouping variable
- Your number of groups is 1 for ungrouped data frames

```{r, echo = T}
# group_by() education
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>%
  summarize(age_by_edu = mean(age, na.rm = T))
```

```{r, echo = T}
# no groups
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  summarize(age_by_edu = mean(age, na.rm = T))
```
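
The seven functions above can also be chained into a single split-apply-combine pipeline. The chunk below is a minimal sketch of that idea. To keep it runnable on its own, it uses a small made-up tibble (`toy`, with hypothetical columns `id`, `age`, `education`, and `score`) rather than the `bfi` data.

```{r, echo = T}
library(dplyr)

# Made-up data for illustration only (this is not the bfi data)
toy <- tibble(
  id        = 1:8,
  age       = c(11, 14, 16, 17, 18, 18, 25, 40),
  education = c("HS", "HS", "HS", "Some College",
                "Some College", "HS", "College", "College"),
  score     = c(3, 4, 2, 5, 4, 3, 5, 1)
)

toy_summary <- toy %>%
  filter(age >= 12 & age <= 18) %>%    # keep rows with age 12 to 18
  select(age, education, score) %>%    # keep only the columns we need
  mutate(adult = age >= 18) %>%        # add a new column (same number of rows)
  group_by(education) %>%              # split by education
  summarize(                           # apply, then combine to one row per group
    n          = n(),
    n_adult    = sum(adult),
    mean_age   = mean(age, na.rm = TRUE),
    mean_score = mean(score, na.rm = TRUE)
  ) %>%
  arrange(desc(n))                     # largest group first

toy_summary
```

Reading top to bottom, each `%>%` is an "and then": filter the rows, and then select the columns, and then group, and then summarize.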
