“Data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose”
Aspects of data quality
Accuracy: Do the data reflect reality / truth?
Completeness: Are the data usable or complete (no missing people, values, etc. beyond random)
Uniqueness: There is no duplicated data
Validity: Do the data have the correct properties (values, ranges, etc.)
Consistency: When integrating across multiple data sources, information should converge across sources and match reality
Timeliness: Can the data be maintained and distributed within a specified time frame
Fitness for purpose: Do the data meet your research need?
Why should we care about data quality
You aren’t responsible for poor quality data you receive, but you are responsible for the data products you work with – that is, you are responsible for improving data quality
Poor quality data threatens scientific integrity
Poor quality data are a pain for you to work with and for others to work with
What can data quality do for my career?
The virtuous cycle of data cleaning
Some people get a reputation for getting their data, analyses, etc. right
This is important for publications, grant funding, etc.
It tends to be inter-generational – you inherit some of your reputation on this from your advisor
Start paying it forward now to build your own career, whether it’s in academia or industry
“Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies,to test hypothesis and to check assumptions with the help of summary statistics”
So EDA is basically a fancy word for the descriptive statistics you’ve been learning about for years
I think about “exploratory data analysis for data quality”
Investigating values and patterns of variables from “input data”
Identifying and cleaning errors or values that need to be changed
Creating analysis variables
Checking values of analysis variables against values of input variables
How I will teach exploratory data analysis
Will teach exploratory data analysis (EDA) in two sub-sections:
Provide “Guidelines for EDA”
Less about coding, more about practices you should follow and mentality necessary to ensure high data quality
Introduce “Tools of EDA”:
Demonstrate code to investigate variables and relationship between variables
Most of these tools are just the application of programming skills you have already learned (or will learn soon!)
Guidelines for “EDA for data quality”
Assume that your goal in “EDA for data quality” is to investigate “input” data sources and create “analysis variables”
Usually, your analysis dataset will incorporate multiple sources of input data, including data you collect (primary data) and/or data collected by others (secondary data)
EDA is not a linear process, and the process will vary across people and projects Some broad steps:
Understand how input data sources were created
e.g., when working with survey data, have survey questionnaire and codebooks on hand (watch out for skip patterns!!!)
For each input data source, identify the “unit of analysis” and which combination of variables uniquely identify observations
Investigate patterns in input variables
Create analysis variable from input variable(s)
Verify that analysis variable is created correctly through descriptive statistics that compare values of input variable(s) against values of the analysis variable
It is critically important to step through EDA processes at multiple points during data cleaning, from the input / raw data to the output / analysis / clean data.
Always be aware of missing values
They will not always be coded as NA in input variables (e.g., some projects code them as 99, 99999, negative values, etc.)
“Unit of analysis” and which variables uniquely identify observations
“Unit of analysis” refers to “what does each observation represent” in an input data source
If each obs represents a trial in an experiment, you have “trial level data”
If each obs represents a participant, you have “participant level data”
If each obs represents a sample, you have “sample-level data”
If each obs represents a year, you have “year level data” (i.e. longitudinal)
How to identify unit of analysis
data documentation
investigating the data set
This is very important because we often conduct analyses that span multiple units of analysis (e.g., between- v within-person, person- v stimuli-level, etc.)
We have to be careful and thoughtful about identifiers that let us do that (important for joining data together, which will be the focus on our R workshop today)
Rules for creating new variables
Rules I follow for variable creation
Never modify “input variable”; instead create new variable based on input variable(s)
Always keep input variables used to create new variables
Investigate input variable(s) and relationship between input variables
Developing a plan for creation of analysis variable
e.g., for each possible value of input variables, what should value of analysis variable be?
Write code to create analysis variable
Run descriptive checks to verify new variables are constructed correctly
Can “comment out” these checks, but don’t delete them
Document new variables with notes and labels
DESCRIPTIVES
Data we will use
Use read_csv() function from readr (loaded with tidyverse) to import .csv dataset into R.
Never modify “input variable”; instead create new variable based on input variable(s)
Always keep input variables used to create new variables
I already did this before the data were loaded in. I renamed all the input variables with interpretable names and reshaped them so the time variable (year) is long and the other variables are wide
Rule 2
Investigate input variable(s) and relationship between input variables
We’ll talk more about this in a bit when we discuss different kinds of descriptives, but briefly let’s look at basic descriptives + zero-order correlations
Code
describe(soep_long)
This doesn’t look great because we’ve negative values where we shouldn’t, which represent flags for different kinds of missing variables. We’ll have to fix that
I’ll show you a better way later, but we haven’t learned everything to do it nicely yet. So instead, we’ll use cor.plot() from the psych package to make a simple heat map of the correlations.
We shouldn’t see that many negative correlations, which flags that we need to reverse score some items
(Note: I honestly wouldn’t normally do it like this, but we haven’t learned how to reshape data yet! Check the online materials for code on how to do this)
# A tibble: 69,492 × 7
SID gender education age trait item_num values
<chr> <int> <int> <int> <chr> <chr> <int>
1 61617 1 NA 16 A 1 2
2 61617 1 NA 16 A 2 4
3 61617 1 NA 16 A 3 3
4 61617 1 NA 16 A 4 4
5 61617 1 NA 16 A 5 4
6 61617 1 NA 16 C 1 2
7 61617 1 NA 16 C 2 3
8 61617 1 NA 16 C 3 3
# ℹ 69,484 more rows
2. pivot_wider()
(Formerly spread()) Makes wide data long, based on a key
Core arguments:
data: the data, blank if piped
names_from: name(s) of key column(s) in new long data frame (string or string vector)
names_sep: separator in column headers, if multiple keys
names_glue: specify multiple or custom separators of multiple keys
values_from: name of values in new long data frame (string)
values_fn: function applied to data with duplicate labels
Parts of Part 1 of these slides was adapted from Ozan Jaquette’s EDUC 260A at UCLA.
Source Code
---title: "Week 3 (Data Quality and `tidyr`) Workbook"author: "Emorie D Beck"format: html: code-tools: true code-copy: true code-line-numbers: true code-link: true theme: united highlight-style: tango df-print: paged code-fold: show toc: true toc-float: true self-contained: true # height: 900 footer: "PSC 203A - Data Cleaning and Management WQ26" logo: "https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/ucdavis_logo_blue.png"editor: visualeditor_options: chunk_output_type: console---```{r, echo = F}library(knitr)library(psych)library(plyr)library(tidyverse)``````{r setup, include=FALSE}knitr::opts_chunk$set(echo =TRUE, message =FALSE, warning =FALSE,results ='show',fig.width =4, fig.height =4, fig.retina =3)options(htmltools.dir.version =FALSE)```## Outline1. Welcome & Q's on homework2. Part 1: Data Quality and Descriptives3. Part 2: `tidyr`4. Problem set & Q time# DATA QUALITY## What is data quality[IBM's](https://www.ibm.com/topics/data-quality#:~:text=the%20next%20step-,What%20is%20data%20quality%3F,governance%20initiatives%20within%20an%20organization.) definition of data quality:> "Data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose"## Aspects of data quality- **Accuracy**: Do the data reflect reality / truth?- **Completeness**: Are the data usable or complete (no missing people, values, etc. beyond random)\- **Uniqueness**: There is no duplicated data- **Validity**: Do the data have the correct properties (values, ranges, etc.)- **Consistency**: When integrating across multiple data sources, information should converge across sources and match reality- **Timeliness**: Can the data be maintained and distributed within a specified time frame- **Fitness for purpose**: Do the data meet your research need?## Why should we care about data quality- You aren't responsible for poor quality data you receive, but you are responsible for the data products you work with -- that is, you are responsible for improving data quality- Poor quality data threatens scientific integrity\- Poor quality data are a pain for you to work with and for others to work with## What can data quality do for my career?- The *virtuous cycle* of data cleaning - Some people get a reputation for getting their data, analyses, etc. right - This is important for publications, grant funding, etc. - It tends to be inter-generational -- you inherit some of your reputation on this from your advisor - Start paying it forward now to build your own career, whether it's in academia or industry## How do I ensure data quality?The [Towards Data Science](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15) website has a nice definition of EDA:> "Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies,to test hypothesis and to check assumptions with the help of summary statistics"- So EDA is basically a fancy word for the descriptive statistics you've been learning about for years**I think about "exploratory data analysis for data quality"**- Investigating values and patterns of variables from "input data"- Identifying and cleaning errors or values that need to be changed- Creating analysis variables- Checking values of analysis variables against values of input variables## How I will teach exploratory data analysisWill teach exploratory data analysis (EDA) in two sub-sections:1. Provide "Guidelines for EDA" - Less about coding, more about practices you should follow and mentality necessary to ensure high data quality2. Introduce "Tools of EDA": - Demonstrate code to investigate variables and relationship between variables - Most of these tools are just the application of programming skills you have already learned (or will learn soon!)## Guidelines for "EDA for data quality"Assume that your goal in "EDA for data quality" is to investigate "input" data sources and create "analysis variables"- Usually, your analysis dataset will incorporate multiple sources of input data, including data you collect (primary data) and/or data collected by others (secondary data)EDA is not a linear process, and the process will vary across people and projects Some broad steps:1. Understand how input data sources were created - e.g., when working with survey data, have survey questionnaire and codebooks on hand (watch out for skip patterns!!!)2. For each input data source, identify the "unit of analysis" and which combination of variables uniquely identify observations3. Investigate patterns in input variables4. Create analysis variable from input variable(s)5. Verify that analysis variable is created correctly through descriptive statistics that compare values of input variable(s) against values of the analysis variable- It is critically important to step through EDA processes at multiple points during data cleaning, from the input / raw data to the output / analysis / clean data.- **Always be aware of missing values**- They will not always be coded as `NA` in input variables (e.g., some projects code them as 99, 99999, negative values, etc.)## "Unit of analysis" and which variables uniquely identify observations"Unit of analysis" refers to "what does each observation represent" in an input data source- If each obs represents a trial in an experiment, you have "trial level data"- If each obs represents a participant, you have "participant level data"- If each obs represents a sample, you have "sample-level data"- If each obs represents a year, you have "year level data" (i.e. longitudinal)How to identify unit of analysis- data documentation- investigating the data set- This is very important because we often conduct analyses that span multiple units of analysis (e.g., between- v within-person, person- v stimuli-level, etc.)- We have to be careful and thoughtful about identifiers that let us do that (important for joining data together, which will be the focus on our `R` workshop today)## Rules for creating new variables {.smaller}Rules I follow for variable creation1. Never modify "input variable"; instead create new variable based on input variable(s) - Always keep input variables used to create new variables2. Investigate input variable(s) and relationship between input variables3. Developing a plan for creation of analysis variable - e.g., for each possible value of input variables, what should value of analysis variable be?4. Write code to create analysis variable5. Run descriptive checks to verify new variables are constructed correctly - Can "comment out" these checks, but don't delete them6. Document new variables with notes and labels# DESCRIPTIVES## Data we will useUse `read_csv()` function from `readr` (loaded with `tidyverse`) to import .csv dataset into `R`.```{r, results="hide"}library(plyr)library(tidyverse)soep_long <-read_csv(file="https://github.com/emoriebeck/psc290-data-FQ23/raw/main/04-workshops/03-week3-tidyr/gsoep.csv")soep_long```Let's examine the data \[you **must** run this code chunk\]```{r, results="hide"}soep_long %>%names()soep_long %>%names() %>%str()str(soep_long) # ughstr(soep_long$LifeEvent__Married)attributes(soep_long$LifeEvent__Married)typeof(soep_long$LifeEvent__Married)class(soep_long$LifeEvent__Married)```## Rule 11. Never modify "input variable"; instead create new variable based on input variable(s) - Always keep input variables used to create new variables- I already did this before the data were loaded in. I renamed all the input variables with interpretable names and reshaped them so the time variable (year) is long and the other variables are wide## Rule 2 {.smaller}2. Investigate input variable(s) and relationship between input variables- We'll talk more about this in a bit when we discuss different kinds of descriptives, but briefly let's look at basic descriptives + zero-order correlations```{r}describe(soep_long)```This doesn't look great because we've negative values where we shouldn't, which represent flags for different kinds of missing variables. We'll have to fix that- I'll show you a better way later, but we haven't learned everything to do it nicely yet. So instead, we'll use `cor.plot()` from the `psych` package to make a simple heat map of the correlations.- We shouldn't see that many negative correlations, which flags that we need to reverse score some items```{r, fig.width = 12, fig.height = 12}soep_2005 <- soep_long %>%filter(year ==2005) %>%select(-year)cor.plot(soep_2005, diag = F)```## Rule 33. Developing a plan for creation of analysis variable - e.g., for each possible value of input variables, what should value of analysis variable be?- I do this in my codebooks, and this topic warrants a discussion in itself. This is our focal topic for next week!- In this case, we want Big Five (EACNO) composites for each wave and to create composites of life events experienced across all years## Rule 44. Write code to create analysis variable- From Rule 2, we know we need to recode missing values to NA and reverse code some items. From Rule 3, we know we need to create some composites.- Let's do that now!### Recoding:```{r}soep_long <- soep_long %>%mutate_at(vars(contains("Big5")) , ~ifelse(. <0|is.na(.), NA, .) ) %>%mutate_at(vars(contains("LifeEvent")) , ~mapvalues(., seq(-7,1), c(rep(NA, 5), 0, NA, NA, 1), warn_missing = F) )```### Reverse Coding:```{r}rev_code <-c("Big5__A_coarse", "Big5__C_lazy", "Big5__E_reserved", "Big5__N_dealStress")soep_long <- soep_long %>%mutate_at(vars(all_of(rev_code)) , ~as.numeric(reverse.code(., keys =-1, mini =1, maxi =7)) )```Let's check to make sure some correlations just reversed:```{r, fig.width = 12, fig.height = 12}soep_2005 <- soep_long %>%filter(year ==2005) %>%select(-year)cor.plot(soep_2005, diag = F)```### Create Composites:(Note: I honestly wouldn't normally do it like this, but we haven't learned how to reshape data yet! Check the online materials for code on how to do this)```{r}soep_long <- soep_long %>%group_by(year, Procedural__SID) %>%rowwise() %>%mutate(Big5__E =mean(cbind(Big5__E_reserved, Big5__E_communic, Big5__E_sociable), na.rm = T),Big5__A =mean(cbind(Big5__A_coarse, Big5__A_friendly, Big5__A_forgive), na.rm = T),Big5__C =mean(cbind(Big5__C_thorough, Big5__C_efficient, Big5__C_lazy), na.rm = T),Big5__N =mean(cbind(Big5__N_worry, Big5__N_nervous, Big5__N_dealStress), na.rm = T),Big5__O =mean(cbind(Big5__O_original, Big5__O_artistic, Big5__O_imagin), na.rm = T)) %>%group_by(Procedural__SID) %>%mutate_at(vars(contains("LifeEvent")) , lst(ever =~max(., na.rm = T)) ) %>%ungroup() %>%filter(year %in%c(2005, 2009, 2013))``````{r, echo = F}soep_clean <- soep_long %>%pivot_longer(cols =contains("Big5") , names_to =c("trait", "item") , names_pattern ="(.*)_(.*)" , values_to ="value" , values_drop_na = T ) %>%group_by(Procedural__SID, Procedural__household, Demographic__DOB, Demographic__Sex, year, trait) %>%summarize(value =mean(value)) %>%ungroup() %>%full_join( soep_long %>%pivot_longer(cols =contains("LifeEvent") , names_to =c("trait") , values_to ="value" , values_drop_na = T ) %>%group_by(Procedural__SID, Procedural__household, Demographic__DOB, Demographic__Sex, year, trait) %>%summarize(value =max(value)) %>%ungroup() ) %>%pivot_wider(names_from ="trait" , values_from ="value" )```## Rule 5 {.smaller}5. Run descriptive checks to verify new variables are constructed correctly - Can "comment out" these checks, but don't delete them```{r}soep_long %>%select(Big5__E:LifeEvent__SepPart_ever) %>%describe()```- Uh oh, `Inf` values popping up what went wrong?- `-Inf` pops up when there were no non-missing values and you use `na.rm = T`- Let's recode those as `NA````{r}soep_long <- soep_long %>%mutate_all(~ifelse(is.infinite(.) |is.nan(.), NA, .))```And check out the descriptives again```{r}soep_long %>%select(Big5__E:LifeEvent__SepPart_ever) %>%describe()``````{r, fig.width = 12, fig.height = 12}soep_long %>%filter(year ==2005) %>%select(Big5__E:LifeEvent__SepPart_ever) %>%cor.plot(., diag = F)```## Rule 66. Document new variables with notes and labels- Again, I do this in my codebooks, so more on this next week!!## EDA- **One-way descriptive analyses** (i.e,. focus on one variable) - Descriptive analyses for continuous variables - Descriptive analyses for discreet/categorical variables- **Two-way descriptive analyses** (relationship between two variables) - Categorical by categorical - Categorical by continuous - Continuous by continuous- Realistically, we've actually already covered all this above, so we'll loop back to this after learning `tidyr`------------------------------------------------------------------------::::: {.columns .v-center-container}::: {.column width="70%"}<pstyle="font-size:160%;">Data Wrangling in `tidyr`</p>:::::: {.column width="30%"}```{r, fig.align='center'}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/master/thumbs/tidyr.png")```::::::::# `tidyr`- Now, let's build off what we learned from dplyr and focus on *reshaping* and *merging* our data.- First, the reshapers:1. `pivot_longer()`, which takes a "wide" format data frame and makes it long.\2. `pivot_wider()`, which takes a "long" format data frame and makes it wide.- Next, the mergers:3. `full_join()`, which merges *all* rows in either data frame\4. `inner_join()`, which merges rows whose keys are present in *both* data frames\5. `left_join()`, which "prioritizes" the first data set\6. `right_join()`, which "prioritizes" the second data set(See also:`anti_join()` and `semi_join()`)# Key `tidyr` Functions## 1. `pivot_longer()`- (Formerly `gather()`) Makes wide data long, based on a key <fontsize="5">- Core arguments: - `data`: the data, blank if piped - `cols`: columns to be made long, selected via `select()` calls - `names_to`: name(s) of key column(s) in new long data frame (string or string vector) - `values_to`: name of values in new long data frame (string) - `names_sep`: separator in column headers, if multiple keys - `values_drop_na`: drop missing cells (similar to `na.rm = T`) </font>### Basic ApplicationLet's start with an easy one -- one key, one value:```{r, echo=TRUE}bfi %>%rownames_to_column("SID") %>%pivot_longer(cols = A1:O5 , names_to ="item" , values_to ="values" , values_drop_na = T ) %>%print(n =8)```### More Advanced ApplicationNow a harder one -- two keys, one value:```{r, echo=TRUE}bfi %>%rownames_to_column("SID") %>%pivot_longer(cols = A1:O5 , names_to =c("trait", "item_num") , names_sep =-1 , values_to ="values" , values_drop_na = T ) %>%print(n =8)```## 2. `pivot_wider()` {.smaller}- (Formerly `spread()`) Makes wide data long, based on a key <fontsize="6">- Core arguments: - `data`: the data, blank if piped - `names_from`: name(s) of key column(s) in new long data frame (string or string vector) - `names_sep`: separator in column headers, if multiple keys - `names_glue`: specify multiple or custom separators of multiple keys - `values_from`: name of values in new long data frame (string) - `values_fn`: function applied to data with duplicate labels </font>### Basic Application```{r}bfi_long <- bfi %>%rownames_to_column("SID") %>%pivot_longer(cols = A1:O5 , names_to ="item" , values_to ="values" , values_drop_na = T )```### More Advanced```{r, results = 'hide'}bfi_long <- bfi %>%rownames_to_column("SID") %>%pivot_longer(cols = A1:O5 , names_to =c("trait", "item_num") , names_sep =-1 , values_to ="values" , values_drop_na = T )``````{r, echo = T}bfi_long %>%pivot_wider(names_from =c("trait", "item_num") , values_from ="values" , names_sep ="_" )```### A Little More Advanced```{r, echo = T}bfi_long %>%select(-item_num) %>%pivot_wider(names_from ="trait" , values_from ="values" , names_sep ="_" , values_fn = mean )```# More `dplyr` Functions## The `_join()` Functions- Often we may need to pull different data from different sources- There are lots of reasons to need to do this- We don't have time to get into all the use cases here, so we'll talk about them in high level terms- We'll focus on: - `full_join()` - `inner_join()` - `left_join()` - `right_join()`- Let's separate demographic and BFI data```{r, echo = T}#| code-line-numbers: "|3"bfi_only <- bfi %>%rownames_to_column("SID") %>%select(SID, matches("[0-9]"))bfi_only %>%as_tibble() %>%print(n =6)``````{r, echo = T}#| code-line-numbers: "|3"bfi_dem <- bfi %>%rownames_to_column("SID") %>%select(SID, education, gender, age)bfi_dem %>%as_tibble() %>%print(n =6)```Before we get into it, as a reminder, this is what the data set looks like before we do any joining:```{r, echo = T}bfi %>%rownames_to_column("SID") %>%as_tibble() %>%print(n =6)```## 3. `full_join()`Most simply, we can put those back together keeping all observations.```{r, echo = T}bfi_only %>%full_join(bfi_dem) %>%as_tibble() %>%print(n =6)```## 4. `inner_join()`We can also keep all rows present in *both* data frames```{r, echo = T}#| code-line-numbers: "|1-2|4-5|3"bfi_dem %>%filter(row_number() %in%1:1700) %>%inner_join( bfi_only %>%filter(row_number() %in%1200:2800) ) %>%as_tibble() %>%print(n =6)```## 5. `left_join()`Or all rows present in the left (first) data frame, perhaps if it's a subset of people with complete data```{r, echo = T}#| code-line-numbers: "|2|3"bfi_dem %>%drop_na() %>%left_join(bfi_only) %>%as_tibble() %>%print(n =6)```## 6. `right_join()`Or all rows present in the right (second) data frame, such as I do when I join a codebook with raw data```{r, echo = T}#| code-line-numbers: "|3"bfi_dem %>%drop_na() %>%right_join(bfi_only) %>%as_tibble() %>%print(n =6)```## EDA- **One-way descriptive analyses** (i.e,. focus on one variable) - Descriptive analyses for continuous variables - Descriptive analyses for discreet/categorical variables- **Two-way descriptive analyses** (relationship between two variables) - Categorical by categorical - Categorical by continuous - Continuous by continuous## One-way descriptive analyses- These are basically what they sound like -- the focus is on single variables- Descriptive analyses for continuous variables - means, standard deviations, minima, maxima, counts- Descriptive analyses for discreet/categorical variables - counts, percentages### Continuous```{r}soep_long %>%select(Procedural__SID,year,Big5__E:Big5__O) %>%pivot_longer(cols =contains("Big5") , names_to ="trait" , values_to ="value" , values_drop_na = T ) %>%group_by(year, trait) %>%summarize_at(vars(value) , lst(mean, sd, min, max) , na.rm = T ) %>%ungroup()```### Categorical / Count```{r}soep_long %>%select(Procedural__SID, contains("_ever")) %>%distinct() %>%pivot_longer(cols =contains("LifeEvent") , names_to ="event" , values_to ="value" , values_drop_na = T ) %>%group_by(event, value) %>%tally() %>%group_by(event) %>%mutate(total =sum(n) , perc = n/total*100)```## Two-way descriptive analyses- Aims to capture relationships between variables- Categorical by categorical - cross-tabs, percentages- Categorical by continuous - means, standard deviations, etc. within categories- Continuous by continuous - correlations, covariances, etc.### Categorical x Categorical```{r}soep_long %>%select(Procedural__SID, Demographic__Sex, contains("_ever")) %>%distinct() %>%pivot_longer(cols =contains("LifeEvent") , names_to ="event" , values_to ="occurred" , values_drop_na = T ) %>%mutate(Demographic__Sex =mapvalues(Demographic__Sex, c(1,2), c("Male", "Female")) , occurred =mapvalues(occurred, c(0,1), c("No Event", "Event"))) %>%group_by(event, occurred, Demographic__Sex) %>%tally() %>%group_by(event) %>%mutate(perc = n/sum(n)*100) %>%pivot_wider(names_from =c(occurred) , values_from =c(n, perc) )```### Categorical x ContinuousSet up the data```{r}soep_twoway <- soep_long %>%filter(year ==2005) %>%select(Procedural__SID, Big5__E:Big5__O) %>%pivot_longer(cols =contains("Big5") , names_to ="trait" , values_to ="value" , values_drop_na = T ) %>%left_join( soep_long %>%select(Procedural__SID, contains("_ever")) %>%distinct() %>%pivot_longer(cols =contains("LifeEvent") , names_to ="event" , values_to ="occurred" , values_drop_na = T ) %>%mutate(occurred =mapvalues(occurred, c(0,1), c("No Event", "Event"))) )```Run the descriptives```{r}soep_twoway %>%group_by(trait, event, occurred) %>%summarize_at(vars(value) , lst(mean, sd, min, max) , na.rm = T ) %>%ungroup() %>%pivot_wider(names_from = trait , values_from =c(mean, sd, min, max) )```### Continuous x continuousHere's how I create more customizable heat maps in `ggplot2` for those who would like a reference for themselves.```{r}r <- soep_long %>%filter(year ==2005) %>%select(Big5__E:Big5__O) %>%cor(., use ="pairwise") r[lower.tri(r, diag = T)] <-NAvars <-rownames(r)r %>%data.frame() %>%rownames_to_column("V1") %>%pivot_longer(cols =-V1 , names_to ="V2" , values_to ="r" ) %>%mutate(V1 =factor(V1, levels = vars) , V2 =factor(V2, levels =rev(vars))) %>%ggplot(aes(x = V1, y = V2, fill = r)) +geom_raster() +geom_text(aes(label =round(r, 2))) +scale_fill_gradient2(limits =c(-1,1) , breaks =c(-1, -.5, 0, .5, 1) , low ="blue", high ="red" , mid ="white", na.value ="white") +labs(x =NULL , y =NULL , fill ="Zero-Order Correlation" , title ="Zero-Order Correlations Among Variables" ) +theme_classic() +theme(legend.position ="bottom" , axis.text =element_text(face ="bold") , axis.text.x =element_text(angle =45, hjust =1) , plot.title =element_text(face ="bold", hjust = .5) , plot.subtitle =element_text(face ="italic", hjust = .5) , panel.background =element_rect(color ="black", size =1) )```## AttributionsParts of Part 1 of these slides was adapted from Ozan Jaquette's [EDUC 260A](https://anyone-can-cook.github.io/rclass1/#syllabus) at UCLA.