Week 1 (Workbook) - Getting Situated in R and tidyverse

Author

Emorie D Beck

Code

library(knitr)
library(psych)
library(plyr)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::%+%()     masks psych::%+%()
✖ ggplot2::alpha()   masks psych::alpha()
✖ dplyr::arrange()   masks plyr::arrange()
✖ purrr::compact()   masks plyr::compact()
✖ dplyr::count()     masks plyr::count()
✖ dplyr::desc()      masks plyr::desc()
✖ dplyr::failwith()  masks plyr::failwith()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::id()        masks plyr::id()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::mutate()    masks plyr::mutate()
✖ dplyr::rename()    masks plyr::rename()
✖ dplyr::summarise() masks plyr::summarise()
✖ dplyr::summarize() masks plyr::summarize()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Course Goals & Learning Outcomes

Understand the cognitive and psychological underpinnings of perceiving data visualization.
Identify good data visualizations and describe what makes them good.
Produce data visualizations according to types of questions, data, and more, with a particular emphasis on building a toolbox that you can carry into your own research.

Course Expectations

~50% of the course will be in R
You will get the most from this course if you:
- have your own data you can apply course content to
- know how to clean clean, transform, and manage that data
- today’s workshop is a good litmus test for this

Course Materials

All materials (required and optional) are free and online
- Wickham & Grolemond: R for Data Science https://r4ds.had.co.nz
- Wickham: Advanced R http://adv-r.had.co.nz
- Wilke: Fundamentals of Data Visualization https://clauswilke.com/dataviz/
- Healy: Data Visualization: A Practical Introduction https://socviz.co
- Data Camp: All paid content unlocked

Assignments

Assignment Weights	Percent
Problem Sets (5)	20%
Response Papers + Visualizations	20%
Final Project Proposal	10%
Class Presentation	20%
Final Project	30%
Total	100%

Response Papers / Visualizations

The main homework in the course is your weekly visualization assignment
The goal is to demonstrate how the principles and skills you learn in the class function “in the wild.”
These should be fun and not taken too seriously! No one is judging you for a pulling a graphic from Instagram instead of Nature.
Due 12:00 PM the day before (i.e. Tuesday) class (last class is “free points”)

Problem Sets

About every other week, there will be a practice set to help you practice what you’re learning.
These will have you apply the code you’ve been learning to your own data or a provided data set
Assigning them every other week aims to reduce burden while still allowing you to practice
Frequency / form will be adjusted as needed throughout the quarter

Final Projects

I don’t want you to walk out of this course and not know how to apply what you learned
Final project replaces final exam (there are no exams)
Create a visualization for an ongoing project!
- Stage 1: Proposal (due 02/12/25)
- Stage 2: 1-on-1 meetings + feedback (due by 02/26/25)
- Stage 3: In-class presentations (03/12/25)
- Stage 4: Final visualization + brief write-up (due 03/19/25 at midnight)

Extra Credit

Lots of talk series, etc. this winter
1 pt extra credit for each one you:
- go to,
- take a snap of a data viz,
- and critique it according to what you’ve learned in class
max 5 pts

Class Time

~5-10 min: welcome and review (if needed)
~20-35 min: discussion / some lecture content on readings
~5-10 min: break
~40-60 min: workshop
~20-30 min: open lab

Course Topics

Intro and Overview
Cognitive Perspectives
Proportions and Probability
Differences and Associations
Change and Time Series
Uncertainty
Piecing Visualizations Together

Polishing Visualizations
Interactive Visualizations Additional Topics:

Spatial Information
Automated Reports
Diagrams
More?

Questions on the Syllabus and Course?

Data Visualization

Why Should I Care About Data Visualization

Summarizing huge amounts of information
Seeing the forest and the trees
Errors in probabilistic reasoning
It’s fun!

Why Visualize Data in R

Open source and freeeee
Flexible
Reproducible
Flexible formatting / output
Lots of model- and package-specific support
Did I mention free?

Code

knitr::include_graphics("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/R_logo.svg.png")

Why Use RStudio (Pivot)

Also free
Basically a GUI for R
Organize files, import data, etc. with ease
RMarkdown, Quarto, and more are powerful tools (they were used to create these slides!)
Lots of new features and support

Code

knitr::include_graphics("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/RStudio-Logo-Flat.png")

Why Use the `tidyverse`

Maintained by RStudio (Pivot)
No one should have to use a for loop to change data from long to wide
Tons of integrated tools for data cleaning, manipulation, transformation, and visualization
Even increasing support for modeling (e.g., tidymodels)

Goals for Today

Review core principles of:
- dyplr (data manipulation)
- tidyr (data transformation and reshaping)

Data Manipulation in `dplyr`

`dplyr` Core Functions

%>%: The pipe. Read as “and then.”
filter(): Pick observations (rows) by their values.
select(): Pick variables (columns) by their names.
arrange(): Reorder the rows.
group_by(): Implicitly split the data set by grouping by names (columns).
mutate(): Create new variables with functions of existing variables.
summarize() / summarise(): Collapse many values down to a single summary.

Core Functions

%>%
filter()
select()
arrange()
group_by()
mutate()
summarize()

Although each of these functions are powerful alone, they are incredibly powerful in conjunction with one another. So below, I’ll briefly introduce each function, then link them all together using an example of basic data cleaning and summary.

1. `%>%`

The pipe %>% is wonderful. It makes coding intuitive. Often in coding, you need to use so-called nested functions. For example, you might want to round a number after taking the square of 43.

Code

sqrt(43)

[1] 6.557439

Code

round(sqrt(43), 2)

[1] 6.56

The issue with this comes whenever we need to do a series of operations on a data set or other type of object. In such cases, if we run it in a single call, then we have to start in the middle and read our way out.

Code

round(sqrt(43/2), 2)

[1] 4.64

The pipe solves this by allowing you to read from left to right (or top to bottom). The easiest way to think of it is that each call of %>% reads and operates as “and then.” So with the rounded square root of 43, for example:

Code

sqrt(43) %>%
  round(2)

[1] 6.56

2. `filter()`

Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.

Code

data(bfi) # grab the bfi data from the psych package
bfi <- bfi %>% as_tibble()
head(bfi)

Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.

Code

summary(bfi$age) # get age descriptives

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00   20.00   26.00   28.78   35.00   86.00

Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.

Code

bfi2 <- bfi %>% # see a pipe!
  filter(age <= 18) # filter to age up to 18

summary(bfi2$age) # summary of the new data

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    3.0    16.0    17.0    16.3    18.0    18.0

But this isn’t quite right. We still have folks below 12. But, the beauty of filter() is that you can do sequence of OR and AND statements when there is more than one condition, such as up to 18 AND at least 12.

Code

bfi2 <- bfi %>%
  filter(age <= 18 & age >= 12) # filter to age up to 18 and at least 12

summary(bfi2$age) # summary of the new data

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   12.0    16.0    17.0    16.4    18.0    18.0

Got it!

But filter works for more use cases than just conditional
- <, >, <=, and >=
It can also be used for cases where we want a single values to match cases with text.
To do that, let’s convert one of the variables in the bfi data frame to a string.
So let’s change gender (1 = male, 2 = female) to text (we’ll get into factors later).

Code

bfi$education <- plyr::mapvalues(bfi$education, 1:5, c("Below HS", "HS", "Some College", "College", "Higher Degree"))

Now let’s try a few things:

1. Create a data set with only individuals with some college (==).

Code

bfi2 <- bfi %>% 
  filter(education == "Some College")
unique(bfi2$education)

[1] "Some College"

2. Create a data set with only people age 18 (==).

Code

bfi2 <- bfi %>%
  filter(age == 18)
summary(bfi2$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     18      18      18      18      18      18

3. Create a data set with individuals with some college or above (%in%).

Code

bfi2 <- bfi %>%
  filter(education %in% c("Some College", "College", "Higher Degree"))
unique(bfi2$education)

[1] "Some College"  "Higher Degree" "College"

%in% is great. It compares a column to a vector rather than just a single value, you can compare it to several

Code

bfi2 <- bfi %>%
  filter(age %in% 12:18)
summary(bfi2$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   12.0    16.0    17.0    16.4    18.0    18.0

3. `select()`

If filter() is for pulling certain observations (rows), then select() is for pulling certain variables (columns).
it’s good practice to remove these columns to stop your environment from becoming cluttered and eating up your RAM.
In our bfi data, most of these have been pre-removed, so instead, we’ll imagine we don’t want to use any indicators of Agreeableness (A1-A5) and that we aren’t interested in gender.
With select(), there are few ways choose variables. We can bare quote name the ones we want to keep, bare quote names we want to remove, or use any of a number of select() helper functions.

A. Bare quote columns we want to keep:

Code

bfi %>%
  select(C1, C2, C3, C4, C5) %>%
  print(n = 6)

# A tibble: 2,800 × 5
     C1    C2    C3    C4    C5
  <int> <int> <int> <int> <int>
1     2     3     3     4     4
2     5     4     4     3     4
3     4     5     4     2     5
4     4     4     3     5     5
5     4     4     5     3     2
6     6     6     6     1     3
# ℹ 2,794 more rows

Code

bfi %>%
  select(C1:C5) %>%
  print(n = 6)

# A tibble: 2,800 × 5
     C1    C2    C3    C4    C5
  <int> <int> <int> <int> <int>
1     2     3     3     4     4
2     5     4     4     3     4
3     4     5     4     2     5
4     4     4     3     5     5
5     4     4     5     3     2
6     6     6     6     1     3
# ℹ 2,794 more rows

B. Bare quote columns we don’t want to keep:

Code

bfi %>% 
  select(-(A1:A5), -gender) %>% # Note the `()` around the columns
  print(n = 6)

# A tibble: 2,800 × 22
     C1    C2    C3    C4    C5    E1    E2    E3    E4    E5    N1    N2    N3
  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1     2     3     3     4     4     3     3     3     4     4     3     4     2
2     5     4     4     3     4     1     1     6     4     3     3     3     3
3     4     5     4     2     5     2     4     4     4     5     4     5     4
4     4     4     3     5     5     5     3     4     4     4     2     5     2
5     4     4     5     3     2     2     2     5     4     5     2     3     4
6     6     6     6     1     3     2     1     6     5     6     3     5     2
# ℹ 2,794 more rows
# ℹ 9 more variables: N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
#   O4 <int>, O5 <int>, education <chr>, age <int>

C. Add or remove using `select()` helper functions.

starts_with()
ends_with()
contains()
matches()
num_range()
one_of()
all_of()

Code

bfi %>%
  select(starts_with("C"))

4. `arrange()`

Sometimes, either in order to get a better sense of our data or in order to well, order our data, we want to sort it
Although there is a base R sort() function, the arrange() function is tidyverse version that plays nicely with other tidyverse functions.

So in our previous examples, we could also arrange() our data by age or education, rather than simply filtering. (Or as we’ll see later, we can do both!)

Code

# sort by age
bfi %>% 
  select(gender:age) %>%
  arrange(age) %>% 
  print(n = 6)

# A tibble: 2,800 × 3
  gender education       age
   <int> <chr>         <int>
1      1 Higher Degree     3
2      2 <NA>              9
3      2 Some College     11
4      2 <NA>             11
5      2 <NA>             11
6      2 <NA>             12
# ℹ 2,794 more rows

Code

# sort by education
bfi %>%
  select(gender:age) %>%
  arrange(education) %>%
  print(n = 6)

# A tibble: 2,800 × 3
  gender education   age
   <int> <chr>     <int>
1      1 Below HS     19
2      1 Below HS     21
3      1 Below HS     17
4      1 Below HS     18
5      1 Below HS     18
6      2 Below HS     18
# ℹ 2,794 more rows

We can also arrange by multiple columns, like if we wanted to sort by gender then education:

Code

bfi %>%
  select(gender:age) %>%
  arrange(gender, education) %>% 
  print(n = 6)

# A tibble: 2,800 × 3
  gender education   age
   <int> <chr>     <int>
1      1 Below HS     19
2      1 Below HS     21
3      1 Below HS     17
4      1 Below HS     18
5      1 Below HS     18
6      1 Below HS     32
# ℹ 2,794 more rows

Bringing it all together: Split-Apply-Combine

Much of the power of dplyr functions lay in the split-apply-combine method
A given set of of data are:
- split into smaller chunks
- then a function or series of functions are applied to each chunk
- and then the chunks are combined back together

5. `group_by()`

The group_by() function is the “split” of the method
It basically implicitly breaks the data set into chunks by whatever bare quoted column(s)/variable(s) are supplied as arguments.

So imagine that we wanted to group_by() education levels to get average ages at each level

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  print(n = 6)

# A tibble: 2,800 × 8
# Groups:   education [6]
     C1    C2    C3    C4    C5   age gender education   
  <int> <int> <int> <int> <int> <int>  <int> <chr>       
1     2     3     3     4     4    16      1 <NA>        
2     5     4     4     3     4    18      2 <NA>        
3     4     5     4     2     5    17      2 <NA>        
4     4     4     3     5     5    17      2 <NA>        
5     4     4     5     3     2    17      1 <NA>        
6     6     6     6     1     3    21      2 Some College
# ℹ 2,794 more rows

Hadley’s first law of data cleaning: “What is split, must be combined”
This is super easy with the ungroup() function:

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  ungroup() %>%
  print(n = 6)

# A tibble: 2,800 × 8
     C1    C2    C3    C4    C5   age gender education   
  <int> <int> <int> <int> <int> <int>  <int> <chr>       
1     2     3     3     4     4    16      1 <NA>        
2     5     4     4     3     4    18      2 <NA>        
3     4     5     4     2     5    17      2 <NA>        
4     4     4     3     5     5    17      2 <NA>        
5     4     4     5     3     2    17      1 <NA>        
6     6     6     6     1     3    21      2 Some College
# ℹ 2,794 more rows

Multiple group_by() calls overwrites previous calls:

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  group_by(gender, age) %>%
  print(n = 6)

# A tibble: 2,800 × 8
# Groups:   gender, age [115]
     C1    C2    C3    C4    C5   age gender education   
  <int> <int> <int> <int> <int> <int>  <int> <chr>       
1     2     3     3     4     4    16      1 <NA>        
2     5     4     4     3     4    18      2 <NA>        
3     4     5     4     2     5    17      2 <NA>        
4     4     4     3     5     5    17      2 <NA>        
5     4     4     5     3     2    17      1 <NA>        
6     6     6     6     1     3    21      2 Some College
# ℹ 2,794 more rows

6. `mutate()`

mutate() is one of your “apply” functions
When you use mutate(), the resulting data frame will have the same number of rows you started with
You are directly mutating the existing data frame, either modifying existing columns or creating new ones

To demonstrate, let’s add a column that indicated average age levels within each age group

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>% 
  mutate(age_by_edu = mean(age, na.rm = T)) %>%
  print(n = 6)

# A tibble: 2,800 × 9
# Groups:   education [6]
     C1    C2    C3    C4    C5   age gender education age_by_edu
  <int> <int> <int> <int> <int> <int>  <int> <chr>          <dbl>
1     6     6     3     4     5    19      1 Below HS        25.1
2     4     3     5     3     2    21      1 Below HS        25.1
3     5     5     5     2     2    17      1 Below HS        25.1
4     5     5     4     1     1    18      1 Below HS        25.1
5     4     5     4     3     3    18      1 Below HS        25.1
6     3     2     3     4     6    18      2 Below HS        25.1
# ℹ 2,794 more rows

mutate() is also super useful even when you aren’t grouping

We can create a new category

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender_cat = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))

We could also just overwrite it:

Code

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))

7. `summarize()` / `summarise()`

summarize() is one of your “apply” functions
The resulting data frame will have the same number of rows as your grouping variable
You number of groups is 1 for ungrouped data frames

Code

# group_by() education
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>% 
  summarize(age_by_edu = mean(age, na.rm = T))

Code

# no groups  
bfi %>% 
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  summarize(age_by_edu = mean(age, na.rm = T))

Data Wrangling in tidyr

Code

knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/master/thumbs/tidyr.png")

`tidyr`

Now, let’s build off what we learned from dplyr and focus on reshaping and merging our data.
First, the reshapers:

pivot_longer(), which takes a “wide” format data frame and makes it long.
pivot_wider(), which takes a “long” format data frame and makes it wide.

Next, the mergers:

full_join(), which merges all rows in either data frame
inner_join(), which merges rows whose keys are present in both data frames
left_join(), which “prioritizes” the first data set
right_join(), which “prioritizes” the second data set

(See also:anti_join() and semi_join())

Key `tidyr` Functions

1. `pivot_longer()`

(Formerly gather()) Makes wide data long, based on a key
Core arguments:
- data: the data, blank if piped
- cols: columns to be made long, selected via select() calls
- names_to: name(s) of key column(s) in new long data frame (string or string vector)
- values_to: name of values in new long data frame (string)
- names_sep: separator in column headers, if multiple keys
- values_drop_na: drop missing cells (similar to na.rm = T)

Basic Application

Let’s start with an easy one – one key, one value:

Code

bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = "item"
    , values_to = "values"
    , values_drop_na = T
  ) %>%
  print(n = 8)

# A tibble: 69,492 × 6
  SID   gender education   age item  values
  <chr>  <int> <chr>     <int> <chr>  <int>
1 1          1 <NA>         16 A1         2
2 1          1 <NA>         16 A2         4
3 1          1 <NA>         16 A3         3
4 1          1 <NA>         16 A4         4
5 1          1 <NA>         16 A5         4
6 1          1 <NA>         16 C1         2
7 1          1 <NA>         16 C2         3
8 1          1 <NA>         16 C3         3
# ℹ 69,484 more rows

More Advanced Application

Now a harder one – two keys, one value:

Code

bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = c("trait", "item_num")
    , names_sep = -1
    , values_to = "values"
    , values_drop_na = T
  ) %>%
  print(n = 8)

# A tibble: 69,492 × 7
  SID   gender education   age trait item_num values
  <chr>  <int> <chr>     <int> <chr> <chr>     <int>
1 1          1 <NA>         16 A     1             2
2 1          1 <NA>         16 A     2             4
3 1          1 <NA>         16 A     3             3
4 1          1 <NA>         16 A     4             4
5 1          1 <NA>         16 A     5             4
6 1          1 <NA>         16 C     1             2
7 1          1 <NA>         16 C     2             3
8 1          1 <NA>         16 C     3             3
# ℹ 69,484 more rows

2. `pivot_wider()`

(Formerly spread()) Makes wide data long, based on a key
Core arguments:
- data: the data, blank if piped
- names_from: name(s) of key column(s) in new long data frame (string or string vector)
- names_sep: separator in column headers, if multiple keys
- names_glue: specify multiple or custom separators of multiple keys
- values_from: name of values in new long data frame (string)
- values_fn: function applied to data with duplicate labels

Basic Application

Code

bfi_long <- bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = "item"
    , values_to = "values"
    , values_drop_na = T
  )

More Advanced

Code

bfi_long <- bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = c("trait", "item_num")
    , names_sep = -1
    , values_to = "values"
    , values_drop_na = T
  )

Code

bfi_long %>%
  pivot_wider(
    names_from = c("trait", "item_num")
    , values_from = "values"
    , names_sep = "_"
  )

A Little More Advanced

Code

bfi_long %>%
  select(-item_num) %>%
  pivot_wider(
    names_from = "trait"
    , values_from = "values"
    , names_sep = "_"
    , values_fn = mean
  )

More `dplyr` Functions

The `_join()` Functions

Often we may need to pull different data from different sources
There are lots of reasons to need to do this
We don’t have time to get into all the use cases here, so we’ll talk about them in high level terms
We’ll focus on:
- full_join()
- inner_join()
- left_join()
- right_join()
Let’s separate demographic and BFI data

Code

bfi_only <- bfi %>% 
  rownames_to_column("SID") %>%
  select(SID, matches("[0-9]"))
bfi_only %>% as_tibble() %>% print(n = 6)

# A tibble: 2,800 × 26
  SID      A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1         2     4     3     4     4     2     3     3     4     4     3     3
2 2         2     4     5     2     5     5     4     4     3     4     1     1
3 3         5     4     5     4     4     4     5     4     2     5     2     4
4 4         4     4     6     5     5     4     4     3     5     5     5     3
5 5         2     3     3     4     5     4     4     5     3     2     2     2
6 6         6     6     5     6     5     6     6     6     1     3     2     1
# ℹ 2,794 more rows
# ℹ 13 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
#   O5 <int>

Code

bfi_dem <- bfi %>%
  rownames_to_column("SID") %>%
  select(SID, education, gender, age)
bfi_dem %>% as_tibble() %>% print(n = 6)

# A tibble: 2,800 × 4
  SID   education    gender   age
  <chr> <chr>         <int> <int>
1 1     <NA>              1    16
2 2     <NA>              2    18
3 3     <NA>              2    17
4 4     <NA>              2    17
5 5     <NA>              1    17
6 6     Some College      2    21
# ℹ 2,794 more rows

Before we get into it, as a reminder, this is what the data set looks like before we do any joining:

Code

bfi %>%
  rownames_to_column("SID") %>%
  as_tibble() %>%
  print(n = 6)

# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1         2     4     3     4     4     2     3     3     4     4     3     3
2 2         2     4     5     2     5     5     4     4     3     4     1     1
3 3         5     4     5     4     4     4     5     4     2     5     2     4
4 4         4     4     6     5     5     4     4     3     5     5     5     3
5 5         2     3     3     4     5     4     4     5     3     2     2     2
6 6         6     6     5     6     5     6     6     6     1     3     2     1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
#   O5 <int>, gender <int>, education <chr>, age <int>

3. `full_join()`

Most simply, we can put those back together keeping all observations.

Code

bfi_only %>%
  full_join(bfi_dem) %>%
  as_tibble() %>%
  print(n = 6)

Joining with `by = join_by(SID)`

# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1         2     4     3     4     4     2     3     3     4     4     3     3
2 2         2     4     5     2     5     5     4     4     3     4     1     1
3 3         5     4     5     4     4     4     5     4     2     5     2     4
4 4         4     4     6     5     5     4     4     3     5     5     5     3
5 5         2     3     3     4     5     4     4     5     3     2     2     2
6 6         6     6     5     6     5     6     6     6     1     3     2     1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
#   O5 <int>, education <chr>, gender <int>, age <int>

4. `inner_join()`

We can also keep all rows present in both data frames

Code

bfi_dem %>%
  filter(row_number() %in% 1:1700) %>%
  inner_join(
    bfi_only %>%
      filter(row_number() %in% 1200:2800)
  ) %>%
  as_tibble() %>%
  print(n = 6)

Joining with `by = join_by(SID)`

# A tibble: 501 × 29
  SID   education   gender   age    A1    A2    A3    A4    A5    C1    C2    C3
  <chr> <chr>        <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1200  Some Colle…      2    18     1     5     6     5     5     5     6     5
2 1201  College          2    29     1     5     6     5     5     2     1     4
3 1202  Higher Deg…      1    46     2     5     6     5     6     6     6     6
4 1203  Higher Deg…      1    58     5     4     4     4     5     4     4     5
5 1204  Higher Deg…      2    38     1     4     6     6     6     4     4     5
6 1205  Higher Deg…      2    27     2     3     1     1     1     4     2     2
# ℹ 495 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
#   O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>

5. `left_join()`

Or all rows present in the left (first) data frame, perhaps if it’s a subset of people with complete data

Code

bfi_dem %>%
  drop_na() %>%
  left_join(bfi_only) %>%
  as_tibble() %>%
  print(n = 6)

Joining with `by = join_by(SID)`

# A tibble: 2,577 × 29
  SID   education   gender   age    A1    A2    A3    A4    A5    C1    C2    C3
  <chr> <chr>        <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 6     Some Colle…      2    21     6     6     5     6     5     6     6     6
2 8     HS               1    19     4     3     1     5     1     3     2     4
3 9     Below HS         1    19     4     3     6     3     3     6     6     3
4 11    Below HS         1    21     4     4     5     6     5     4     3     5
5 15    Below HS         1    17     4     5     2     2     1     5     5     5
6 23    Higher Deg…      1    68     1     5     6     5     6     4     3     2
# ℹ 2,571 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
#   O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>

6. `right_join()`

Or all rows present in the right (second) data frame, such as I do when I join a codebook with raw data

Code

bfi_dem %>%
  drop_na() %>%
  right_join(bfi_only) %>%
  as_tibble() %>%
  print(n = 6)

Joining with `by = join_by(SID)`

# A tibble: 2,800 × 29
  SID   education   gender   age    A1    A2    A3    A4    A5    C1    C2    C3
  <chr> <chr>        <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 6     Some Colle…      2    21     6     6     5     6     5     6     6     6
2 8     HS               1    19     4     3     1     5     1     3     2     4
3 9     Below HS         1    19     4     3     6     3     3     6     6     3
4 11    Below HS         1    21     4     4     5     6     5     4     3     5
5 15    Below HS         1    17     4     5     2     2     1     5     5     5
6 23    Higher Deg…      1    68     1     5     6     5     6     4     3     2
# ℹ 2,794 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
#   O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>

--- title: "Week 1 (Workbook) - Getting Situated in R and tidyverse" author: "Emorie D Beck" format: html: code-tools: true code-copy: true code-line-numbers: true code-link: true theme: united highlight-style: tango df-print: paged code-fold: show toc: true toc-float: true self-contained: true # height: 900 footer: "PSC 290 - Data Visualization in R" logo: "https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/ucdavis_logo_blue.png" editor: visual editor_options: chunk_output_type: console --- ```{r} library(knitr) library(psych) library(plyr) library(tidyverse) ``` ## Course Goals & Learning Outcomes 1. Understand the cognitive and psychological underpinnings of perceiving data visualization. 2. Identify good data visualizations and describe what makes them good. 3. Produce data visualizations according to types of questions, data, and more, with a particular emphasis on building a toolbox that you can carry into your own research. ## Course Expectations - \~50% of the course will be in R - You will get the most from this course if you: - have your own data you can apply course content to - know how to clean clean, transform, and manage that data - today's workshop is a good litmus test for this ## Course Materials - All materials (required and optional) are free and online - Wickham & Grolemond: *R for Data Science* <https://r4ds.had.co.nz> - Wickham: *Advanced R* <http://adv-r.had.co.nz> - Wilke: *Fundamentals of Data Visualization* <https://clauswilke.com/dataviz/> - Healy: *Data Visualization: A Practical Introduction* <https://socviz.co> - [Data Camp](https://www.datacamp.com/groups/shared_links/9f274d67a5851e9f7c6245893b50a01c76ca3bdcc4aeb031a2265b6787d54357): All paid content unlocked ## Assignments | | | |----------------------------------|-------------| | **Assignment Weights** | **Percent** | | Problem Sets (5) | 20% | | Response Papers + Visualizations | 20% | | Final Project Proposal | 10% | | Class Presentation | 20% | | Final Project | 30% | | **Total** | **100%** | ### Response Papers / Visualizations - The main homework in the course is your weekly visualization assignment - The goal is to demonstrate how the principles and skills you learn in the class function "in the wild." - These should be fun and not taken too seriously! No one is judging you for a pulling a graphic from Instagram instead of Nature. - Due 12:00 PM the day before (i.e. Tuesday) class (last class is "free points") ### Problem Sets - About every other week, there will be a practice set to help you practice what you're learning. - These will have you apply the code you've been learning to your own data or a provided data set - Assigning them every other week aims to reduce burden while still allowing you to practice - Frequency / form will be adjusted as needed throughout the quarter ### Final Projects - I don't want you to walk out of this course and not know how to apply what you learned - Final project replaces final exam (there are no exams) - Create a visualization for an ongoing project! - Stage 1: Proposal (due 02/12/25) - Stage 2: 1-on-1 meetings + feedback (due by 02/26/25) - Stage 3: In-class presentations (03/12/25) - Stage 4: Final visualization + brief write-up (due 03/19/25 at midnight) ### Extra Credit - Lots of talk series, etc. this winter - 1 pt extra credit for each one you: - go to, - take a snap of a data viz, - and critique it according to what you've learned in class - max 5 pts ## Class Time - \~5-10 min: welcome and review (if needed) - \~20-35 min: discussion / some lecture content on readings - \~5-10 min: break - \~40-60 min: workshop - \~20-30 min: open lab ## Course Topics ::::: columns ::: column 1. Intro and Overview 2. Cognitive Perspectives 3. Proportions and Probability 4. Differences and Associations 5. Change and Time Series 6. Uncertainty 7. Piecing Visualizations Together ::: ::: column 8. Polishing Visualizations 9. Interactive Visualizations Additional Topics: - Spatial Information - Automated Reports - Diagrams - More? ::: ::::: # Questions on the Syllabus and Course? # Data Visualization ## Why Should I Care About Data Visualization - Summarizing huge amounts of information - Seeing the forest and the trees - Errors in probabilistic reasoning - It's fun! ## Why Visualize Data in R ::::: columns ::: {.column width="60%"} - Open source and freeeee - Flexible - Reproducible - Flexible formatting / output - Lots of model- and package-specific support - Did I mention free? ::: ::: {.column width="40%"} ```{r} knitr::include_graphics("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/R_logo.svg.png") ``` ::: ::::: ## Why Use RStudio (Pivot) ::::: columns ::: {.column width="60%"} - Also free - Basically a GUI for R - Organize files, import data, etc. with ease - RMarkdown, Quarto, and more are powerful tools (they were used to create these slides!) - Lots of new features and support ::: ::: {.column width="40%"} ```{r} knitr::include_graphics("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/RStudio-Logo-Flat.png") ``` ::: ::::: ## Why Use the `tidyverse` ::::: columns ::: {.column width="70%"} - Maintained by RStudio (Pivot) - No one should have to use a for loop to change data from long to wide - Tons of integrated tools for data cleaning, manipulation, transformation, and visualization - Even increasing support for modeling (e.g., `tidymodels`) ::: ::: {.column width="30%"} ```{r, fig.align='center', echo = F} knitr::include_graphics("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/tidyverse.png") ``` ::: ::::: ::: {layout="[[1,1, 1, 1], [1, 1, 1,1], [1, 1, 1,1]]"} ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/tidyr.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/stringr.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/shiny.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/rmarkdown.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/quarto.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/knitr.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/ggplot2.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/forcats.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/dplyr.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/broom.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/tibble.png") ``` ```{r, fig.align='center', echo = F, out.width = "60%"} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/purrr.png") ``` ::: ## Goals for Today - **Review** core principles of: - `dyplr` (data manipulation) - `tidyr` (data transformation and reshaping) ::::: {.columns style="display: flex !important; height: 90%;"} ::: {.column width="70%" style="display: flex; align-items: center;"}  # Data Manipulation in `dplyr`  ::: ::: {.column width="30%" style="display: flex; justify-content: center; align-items: center;"} ```{r, fig.align='center', echo = F} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/dplyr.png") ``` ::: ::::: # `dplyr` Core Functions 1. **`%>%`**: The pipe. Read as "and then." 2. **`filter()`**: Pick observations (rows) by their values. 3. **`select()`**: Pick variables (columns) by their names. 4. **`arrange()`**: Reorder the rows. 5. **`group_by()`**: Implicitly split the data set by grouping by names (columns). 6. **`mutate()`**: Create new variables with functions of existing variables. 7. **`summarize()` / `summarise()`**: Collapse many values down to a single summary. ## Core Functions :::::: columns :::: {.column width="40%"} ::: nonincremental 1. **`%>%`** 2. **`filter()`** 3. **`select()`** 4. **`arrange()`** 5. **`group_by()`** 6. **`mutate()`** 7. **`summarize()`** ::: :::: ::: {.column width="60%" style="text-align: center; background-color: #FFD966; color: black; border: 5px solid #033266;"} Although each of these functions are powerful alone, they are incredibly powerful in conjunction with one another. So below, I'll briefly introduce each function, then link them all together using an example of basic data cleaning and summary. ::: :::::: ## 1. `%>%` - The pipe `%>%` is wonderful. It makes coding intuitive. Often in coding, you need to use so-called nested functions. For example, you might want to round a number after taking the square of 43. ```{r, echo = T} sqrt(43) round(sqrt(43), 2) ``` The issue with this comes whenever we need to do a series of operations on a data set or other type of object. In such cases, if we run it in a single call, then we have to start in the middle and read our way out. ```{r, echo = T} round(sqrt(43/2), 2) ``` The pipe solves this by allowing you to read from left to right (or top to bottom). The easiest way to think of it is that each call of `%>%` reads and operates as "and then." So with the rounded square root of 43, for example: ```{r, echo = T} sqrt(43) %>% round(2) ``` ## 2. `filter()` Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don't want to include.  ```{r, echo=TRUE} data(bfi) # grab the bfi data from the psych package bfi <- bfi %>% as_tibble() head(bfi) ``` Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don't want to include. ```{r, echo = T} summary(bfi$age) # get age descriptives ``` Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don't want to include. ```{r, echo = T} #| code-line-numbers: "|2" bfi2 <- bfi %>% # see a pipe! filter(age <= 18) # filter to age up to 18 summary(bfi2$age) # summary of the new data ``` But this isn't quite right. We still have folks below 12. But, the beauty of `filter()` is that you can do sequence of `OR` and `AND` statements when there is more than one condition, such as up to 18 `AND` at least 12. ```{r, echo = T} bfi2 <- bfi %>% filter(age <= 18 & age >= 12) # filter to age up to 18 and at least 12 summary(bfi2$age) # summary of the new data ``` Got it! - But filter works for more use cases than just conditional - `<`, `>`, `<=`, and `>=` - It can also be used for cases where we want a single values to match cases with text. - To do that, let's convert one of the variables in the `bfi` data frame to a string. - So let's change gender (1 = male, 2 = female) to text (we'll get into factors later). ```{r, echo = T} bfi$education <- plyr::mapvalues(bfi$education, 1:5, c("Below HS", "HS", "Some College", "College", "Higher Degree")) ``` Now let's try a few things: **1. Create a data set with only individuals with some college (`==`).** ```{r, echo = T} bfi2 <- bfi %>% filter(education == "Some College") unique(bfi2$education) ``` **2. Create a data set with only people age 18 (`==`).** ```{r, echo = T} bfi2 <- bfi %>% filter(age == 18) summary(bfi2$age) ``` **3. Create a data set with individuals with some college or above (`%in%`).** ```{r, echo = T} bfi2 <- bfi %>% filter(education %in% c("Some College", "College", "Higher Degree")) unique(bfi2$education) ``` `%in%` is great. It compares a column to a vector rather than just a single value, you can compare it to several ```{r, echo = T} bfi2 <- bfi %>% filter(age %in% 12:18) summary(bfi2$age) ``` ## 3. `select()` - If `filter()` is for pulling certain observations (rows), then `select()` is for pulling certain variables (columns). - it's good practice to remove these columns to stop your environment from becoming cluttered and eating up your RAM. - In our `bfi` data, most of these have been pre-removed, so instead, we'll imagine we don't want to use any indicators of Agreeableness (A1-A5) and that we aren't interested in gender. - With `select()`, there are few ways choose variables. We can bare quote name the ones we want to keep, bare quote names we want to remove, or use any of a number of `select()` helper functions. ### A. Bare quote columns we want to keep: ::::: columns ::: column ```{r, echo = T} #| code-line-numbers: "|2" bfi %>% select(C1, C2, C3, C4, C5) %>% print(n = 6) ``` ::: ::: column ```{r, echo=T} #| code-line-numbers: "|2" bfi %>% select(C1:C5) %>% print(n = 6) ```  ::: ::::: ### B. Bare quote columns we don't want to keep: ```{r, echo = T} #| code-line-numbers: "|2" bfi %>% select(-(A1:A5), -gender) %>% # Note the `()` around the columns print(n = 6) ``` ### C. Add or remove using `select()` helper functions. :::::: columns ::: {.column width="40%"} - `starts_with()`\ - `ends_with()` - `contains()` - `matches()` - `num_range()` - `one_of()` - `all_of()` ::: :::: {.column width="60%"} ::: fragment ```{r, echo = T} bfi %>% select(starts_with("C")) ``` ::: :::: :::::: ## 4. `arrange()` - Sometimes, either in order to get a better sense of our data or in order to well, order our data, we want to sort it - Although there is a base `R` `sort()` function, the `arrange()` function is `tidyverse` version that plays nicely with other `tidyverse functions`. ::::: columns So in our previous examples, we could also `arrange()` our data by age or education, rather than simply filtering. (Or as we'll see later, we can do both!) ::: {.column width="50%"} ```{r, echo = T} #| code-line-numbers: "|4" # sort by age bfi %>% select(gender:age) %>% arrange(age) %>% print(n = 6) ``` ::: ::: {.column width="50%"} ```{r, echo=TRUE} #| code-line-numbers: "|4" # sort by education bfi %>% select(gender:age) %>% arrange(education) %>% print(n = 6) ``` ::: ::::: We can also arrange by multiple columns, like if we wanted to sort by gender then education: ```{r, echo = T} bfi %>% select(gender:age) %>% arrange(gender, education) %>% print(n = 6) ``` # Bringing it all together: Split-Apply-Combine - Much of the power of `dplyr` functions lay in the split-apply-combine method - A given set of of data are: - *split* into smaller chunks - then a function or series of functions are *applied* to each chunk - and then the chunks are *combined* back together ## 5. `group_by()` - The `group_by()` function is the "split" of the method - It basically implicitly breaks the data set into chunks by whatever bare quoted column(s)/variable(s) are supplied as arguments. So imagine that we wanted to `group_by()` education levels to get average ages at each level ```{r, echo = T} bfi %>% select(starts_with("C"), age, gender, education) %>% group_by(education) %>% print(n = 6) ``` - Hadley's first law of data cleaning: "What is split, must be combined" - This is super easy with the `ungroup()` function: ```{r, echo=TRUE} bfi %>% select(starts_with("C"), age, gender, education) %>% group_by(education) %>% ungroup() %>% print(n = 6) ``` Multiple `group_by()` calls overwrites previous calls: ```{r, echo = T} bfi %>% select(starts_with("C"), age, gender, education) %>% group_by(education) %>% group_by(gender, age) %>% print(n = 6) ``` ## 6. `mutate()` - `mutate()` is one of your "apply" functions - When you use `mutate()`, the resulting data frame will have the same number of rows you started with - You are directly mutating the existing data frame, either modifying existing columns or creating new ones To demonstrate, let's add a column that indicated average age levels within each age group ```{r, echo = T} bfi %>% select(starts_with("C"), age, gender, education) %>% arrange(education) %>% group_by(education) %>% mutate(age_by_edu = mean(age, na.rm = T)) %>% print(n = 6) ``` `mutate()` is also super useful even when you aren't grouping We can create a new category ```{r, echo = T} bfi %>% select(starts_with("C"), age, gender, education) %>% mutate(gender_cat = plyr::mapvalues(gender, c(1,2), c("Male", "Female"))) ``` We could also just overwrite it: ```{r, echo = T} bfi %>% select(starts_with("C"), age, gender, education) %>% mutate(gender = plyr::mapvalues(gender, c(1,2), c("Male", "Female"))) ``` ## 7. `summarize()` / `summarise()` - `summarize()` is one of your "apply" functions - The resulting data frame will have the same number of rows as your grouping variable - You number of groups is 1 for ungrouped data frames ```{r, echo = T} # group_by() education bfi %>% select(starts_with("C"), age, gender, education) %>% arrange(education) %>% group_by(education) %>% summarize(age_by_edu = mean(age, na.rm = T)) ``` ```{r, echo = T} # no groups bfi %>% select(starts_with("C"), age, gender, education) %>% arrange(education) %>% summarize(age_by_edu = mean(age, na.rm = T)) ``` ::::: {.columns .v-center-container} ::: {.column width="70%"} Data Wrangling in `tidyr` ::: ::: {.column width="30%"} ```{r, fig.align='center'} knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/master/thumbs/tidyr.png") ``` ::: ::::: # `tidyr` - Now, let's build off what we learned from dplyr and focus on *reshaping* and *merging* our data. - First, the reshapers: 1. `pivot_longer()`, which takes a "wide" format data frame and makes it long.\ 2. `pivot_wider()`, which takes a "long" format data frame and makes it wide. - Next, the mergers: 3. `full_join()`, which merges *all* rows in either data frame\ 4. `inner_join()`, which merges rows whose keys are present in *both* data frames\ 5. `left_join()`, which "prioritizes" the first data set\ 6. `right_join()`, which "prioritizes" the second data set (See also:`anti_join()` and `semi_join()`) # Key `tidyr` Functions ## 1. `pivot_longer()` - (Formerly `gather()`) Makes wide data long, based on a key - Core arguments: - `data`: the data, blank if piped - `cols`: columns to be made long, selected via `select()` calls - `names_to`: name(s) of key column(s) in new long data frame (string or string vector) - `values_to`: name of values in new long data frame (string) - `names_sep`: separator in column headers, if multiple keys - `values_drop_na`: drop missing cells (similar to `na.rm = T`) ### Basic Application Let's start with an easy one -- one key, one value: ```{r, echo=TRUE} bfi %>% rownames_to_column("SID") %>% pivot_longer( cols = A1:O5 , names_to = "item" , values_to = "values" , values_drop_na = T ) %>% print(n = 8) ``` ### More Advanced Application Now a harder one -- two keys, one value: ```{r, echo=TRUE} bfi %>% rownames_to_column("SID") %>% pivot_longer( cols = A1:O5 , names_to = c("trait", "item_num") , names_sep = -1 , values_to = "values" , values_drop_na = T ) %>% print(n = 8) ``` ## 2. `pivot_wider()` {.smaller} - (Formerly `spread()`) Makes wide data long, based on a key - Core arguments: - `data`: the data, blank if piped - `names_from`: name(s) of key column(s) in new long data frame (string or string vector) - `names_sep`: separator in column headers, if multiple keys - `names_glue`: specify multiple or custom separators of multiple keys - `values_from`: name of values in new long data frame (string) - `values_fn`: function applied to data with duplicate labels ### Basic Application ```{r} bfi_long <- bfi %>% rownames_to_column("SID") %>% pivot_longer( cols = A1:O5 , names_to = "item" , values_to = "values" , values_drop_na = T ) ``` ### More Advanced ```{r, results = 'hide'} bfi_long <- bfi %>% rownames_to_column("SID") %>% pivot_longer( cols = A1:O5 , names_to = c("trait", "item_num") , names_sep = -1 , values_to = "values" , values_drop_na = T ) ``` ```{r, echo = T} bfi_long %>% pivot_wider( names_from = c("trait", "item_num") , values_from = "values" , names_sep = "_" ) ``` ### A Little More Advanced ```{r, echo = T} bfi_long %>% select(-item_num) %>% pivot_wider( names_from = "trait" , values_from = "values" , names_sep = "_" , values_fn = mean ) ``` # More `dplyr` Functions ## The `_join()` Functions - Often we may need to pull different data from different sources - There are lots of reasons to need to do this - We don't have time to get into all the use cases here, so we'll talk about them in high level terms - We'll focus on: - `full_join()` - `inner_join()` - `left_join()` - `right_join()` - Let's separate demographic and BFI data ```{r, echo = T} #| code-line-numbers: "|3" bfi_only <- bfi %>% rownames_to_column("SID") %>% select(SID, matches("[0-9]")) bfi_only %>% as_tibble() %>% print(n = 6) ``` ```{r, echo = T} #| code-line-numbers: "|3" bfi_dem <- bfi %>% rownames_to_column("SID") %>% select(SID, education, gender, age) bfi_dem %>% as_tibble() %>% print(n = 6) ``` Before we get into it, as a reminder, this is what the data set looks like before we do any joining: ```{r, echo = T} bfi %>% rownames_to_column("SID") %>% as_tibble() %>% print(n = 6) ``` ## 3. `full_join()` Most simply, we can put those back together keeping all observations. ```{r, echo = T} bfi_only %>% full_join(bfi_dem) %>% as_tibble() %>% print(n = 6) ``` ## 4. `inner_join()` We can also keep all rows present in *both* data frames ```{r, echo = T} #| code-line-numbers: "|1-2|4-5|3" bfi_dem %>% filter(row_number() %in% 1:1700) %>% inner_join( bfi_only %>% filter(row_number() %in% 1200:2800) ) %>% as_tibble() %>% print(n = 6) ``` ## 5. `left_join()` Or all rows present in the left (first) data frame, perhaps if it's a subset of people with complete data ```{r, echo = T} #| code-line-numbers: "|2|3" bfi_dem %>% drop_na() %>% left_join(bfi_only) %>% as_tibble() %>% print(n = 6) ``` ## 6. `right_join()` Or all rows present in the right (second) data frame, such as I do when I join a codebook with raw data ```{r, echo = T} #| code-line-numbers: "|3" bfi_dem %>% drop_na() %>% right_join(bfi_only) %>% as_tibble() %>% print(n = 6) ```

Course Goals & Learning Outcomes

Course Expectations

Course Materials

Assignments

Response Papers / Visualizations

Problem Sets

Final Projects

Extra Credit

Class Time

Course Topics

Questions on the Syllabus and Course?

Data Visualization

Why Should I Care About Data Visualization

Why Visualize Data in R

Why Use RStudio (Pivot)

Why Use the tidyverse

Goals for Today

Data Manipulation in dplyr

dplyr Core Functions

Core Functions

1. %>%

2. filter()

3. select()

A. Bare quote columns we want to keep:

B. Bare quote columns we don’t want to keep:

C. Add or remove using select() helper functions.

4. arrange()

Bringing it all together: Split-Apply-Combine

5. group_by()

6. mutate()

7. summarize() / summarise()

tidyr

Key tidyr Functions

1. pivot_longer()

Basic Application

More Advanced Application

2. pivot_wider()

Basic Application

More Advanced

A Little More Advanced

More dplyr Functions

The _join() Functions

3. full_join()

4. inner_join()

5. left_join()

6. right_join()

Why Use the `tidyverse`

Data Manipulation in `dplyr`

`dplyr` Core Functions

1. `%>%`

2. `filter()`

3. `select()`

C. Add or remove using `select()` helper functions.

4. `arrange()`

5. `group_by()`

6. `mutate()`

7. `summarize()` / `summarise()`

`tidyr`

Key `tidyr` Functions

1. `pivot_longer()`

2. `pivot_wider()`

More `dplyr` Functions

The `_join()` Functions

3. `full_join()`

4. `inner_join()`

5. `left_join()`

6. `right_join()`