Week 3 - Data Quality and `tidyr`

Emorie D Beck

Outline

Welcome & Q’s on homework
Part 1: Data Quality and Descriptives
Part 2: tidyr
Problem set & Q time

Outline

Welcome & Q’s on homework
Part 1: Data Quality and Descriptives
Part 2: tidyr
Problem set & Q time

Outline

~~1. Welcome & Q’s on homework~~
2. Part 1: Data Quality and Descriptives
3. Part 2: tidyr
4. Problem set & Q time

DATA QUALITY

What is data quality?

IBM’s definition of data quality:

“Data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose”

Aspects of data quality?

Accuracy: Do the data reflect reality / truth?
Completeness: Are the data usable or complete (no missing people, values, etc. beyond random)
Uniqueness: There is no duplicated data
Validity: Do the data have the correct properties (values, ranges, etc.)
Consistency: When integrating across multiple data sources, information should converge across sources and match reality
Timeliness: Can the data be maintained and distributed within a specified time frame
Fitness for purpose: Do the data meet your research need?

Why should we care about data quality?

You aren’t responsible for poor quality data you receive, but you are responsible for the data products you work with – that is, you are responsible for improving data quality
Poor quality data threatens scientific integrity
Poor quality data are a pain for you to work with and for others to work with

What can data quality do for my career?

The virtuous cycle of data cleaning
- Some people get a reputation for getting their data, analyses, etc. right
- This is important for publications, grant funding, etc.
- It tends to be inter-generational – you inherit some of your reputation on this from your advisor
- Start paying it forward now to build your own career, whether it’s in academia or industry

What can data quality do for my career?

The virtuous cycle of data cleaning

Beyond her substantive accomplishments, Dr. Beck’s dedication to open, transparent, and reproducible science stands out… I consider Dr. Beck an exemplar of how to prudently use Open Science practices without being dogmatic or berating about it… In Dr. Beck’s case, we can see high quality and high transparency coupled, which serves to enhance and elevate her accomplishments.

How do I ensure data quality?

The Towards Data Science website has a nice definition of EDA:

“Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies,to test hypothesis and to check assumptions with the help of summary statistics”

So EDA is basically a fancy word for the descriptive statistics you’ve been learning about for years

How do I ensure data quality?

I think about “exploratory data analysis for data quality”

Investigating values and patterns of variables from “input data”
Identifying and cleaning errors or values that need to be changed
Creating analysis variables
Checking values of analysis variables against values of input variables

How I will teach exploratory data analysis

Will teach exploratory data analysis (EDA) in two sub-sections:

[Provide “Guidelines for EDA”]{style=“color: green”}
- Less about coding, more about practices you should follow and mentality necessary to ensure high data quality
[Introduce “Tools of EDA”:]{style=“color: purple”}
- Demonstrate code to investigate variables and relationship between variables
- Most of these tools are just the application of programming skills you have already learned (or will learn soon!)

Guidelines for “EDA for data quality”

Assume that your goal in “EDA for data quality” is to investigate “input” data sources and create “analysis variables”

Usually, your analysis dataset will incorporate multiple sources of input data, including data you collect (primary data) and/or data collected by others (secondary data)

Guidelines for “EDA for data quality”

EDA is not a linear process, and the process will vary across people and projects Some broad steps:

Understand how input data sources were created
- e.g., when working with survey data, have survey questionnaire and codebooks on hand (watch out for skip patterns!!!)
For each input data source, identify the “unit of analysis” and which combination of variables uniquely identify observations
Investigate patterns in input variables
Create analysis variable from input variable(s)
Verify that analysis variable is created correctly through descriptive statistics that compare values of input variable(s) against values of the analysis variable

Guidelines for “EDA for data quality”

It is critically important to step through EDA processes at multiple points during data cleaning, from the input / raw data to the output / analysis / clean data.
Always be aware of missing values
They will not always be coded as NA in input variables (e.g., some projects code them as 99, 99999, negative values, etc.)

“Unit of analysis” and Unique Identifiers

“Unit of analysis” refers to “what does each observation represent” in an input data source

If each obs represents a trial in an experiment, you have “trial level data”
If each obs represents a participant, you have “participant level data”
If each obs represents a sample, you have “sample-level data”
If each obs represents a year, you have “year level data” (i.e. longitudinal)

“Unit of analysis” and Unique Identifiers

How to identify unit of analysis

data documentation
investigating the data set
This is very important because we often conduct analyses that span multiple units of analysis (e.g., between- v within-person, person- v stimuli-level, etc.)
We have to be careful and thoughtful about identifiers that let us do that (important for joining data together, which will be the focus on our R workshop today)

Rules for creating new variables

Rules I follow for variable creation

Never modify “input variable”; instead create new variable based on input variable(s)
- Always keep input variables used to create new variables
Investigate input variable(s) and relationship between input variables
Developing a plan for creation of analysis variable
- e.g., for each possible value of input variables, what should value of analysis variable be?
Write code to create analysis variable
Run descriptive checks to verify new variables are constructed correctly
- Can “comment out” these checks, but don’t delete them
Document new variables with notes and labels

DESCRIPTIVES

Data we will use

Use read_csv() function from readr (loaded with tidyverse) to import .csv dataset into R.

library(plyr) 
library(tidyverse)
soep_long <- read_csv(file="https://github.com/emoriebeck/psc290-data-FQ23/raw/main/04-workshops/03-week3-tidyr/gsoep.csv")
soep_long

# A tibble: 136,627 × 30
   Procedural__SID Procedural__household Demographic__DOB Demographic__Sex  year
             <dbl>                 <dbl>            <dbl>            <dbl> <dbl>
 1             901                    94             1951                2  2005
 2             901                    94             1951                2  2006
 3             901                    94             1951                2  2007
 4             901                    94             1951                2  2008
 5             901                    94             1951                2  2009
 6             901                    94             1951                2  2010
 7             901                    94             1951                2  2011
 8             901                    94             1951                2  2012
 9             901                    94             1951                2  2013
10             901                    94             1951                2  2014
# ℹ 136,617 more rows
# ℹ 25 more variables: Big5__C_thorough <dbl>, Big5__E_communic <dbl>,
#   Big5__A_coarse <dbl>, Big5__O_original <dbl>, Big5__N_worry <dbl>,
#   Big5__A_forgive <dbl>, Big5__C_lazy <dbl>, Big5__E_sociable <dbl>,
#   Big5__O_artistic <dbl>, Big5__N_nervous <dbl>, Big5__C_efficient <dbl>,
#   Big5__E_reserved <dbl>, Big5__A_friendly <dbl>, Big5__O_imagin <dbl>,
#   Big5__N_dealStress <dbl>, LifeEvent__ChldBrth <dbl>, …

Data we will use

Let’s examine the data [you must run this code chunk]

soep_long %>% names()
soep_long %>% names() %>% str()

str(soep_long) 

str(soep_long$LifeEvent__Married)
attributes(soep_long$LifeEvent__Married)
typeof(soep_long$LifeEvent__Married)
class(soep_long$LifeEvent__Married)

Rule 1

1. Never modify “input variable”; instead create new variable based on input variable(s)

Always keep input variables used to create new variables
I already did this before the data were loaded in. I renamed all the input variables with interpretable names and reshaped them so the time variable (year) is long and the other variables are wide

Rule 2

2. Investigate input variable(s) and relationship between input variables

We’ll talk more about this in a bit when we discuss different kinds of descriptives, but briefly let’s look at basic descriptives + zero-order correlations
This doesn’t look great because we’ve negative values where we shouldn’t, which represent flags for different kinds of missing variables. We’ll have to fix that

describe(soep_long)

                      vars      n        mean          sd  median    trimmed
Procedural__SID          1 136627 10153709.20 11108970.01 3852202 8678436.79
Procedural__household    2 136627  1013575.62  1111841.57  500496  866584.02
Demographic__DOB         3 136627     1962.40       17.58    1964    1962.81
Demographic__Sex         4 136627        1.54        0.50       2       1.54
year                     5 136627     2010.40        3.12    2011    2010.47
Big5__C_thorough         6  36322        4.28        4.26       6       5.10
Big5__E_communic         7  36322        3.73        4.09       5       4.42
Big5__A_coarse           8  36322        1.66        3.33       2       1.96
Big5__O_original         9  36322        2.97        3.80       4       3.49
Big5__N_worry           10  36322        2.85        3.82       4       3.31
Big5__A_forgive         11  36322        3.69        4.07       5       4.37
Big5__C_lazy            12  36322        1.13        3.08       1       1.34
Big5__E_sociable        13  36322        3.38        3.97       5       3.98
Big5__O_artistic        14  36322        2.59        3.79       4       3.00
Big5__N_nervous         15  36322        2.20        3.56       3       2.57
Big5__C_efficient       16  36322        3.95        4.14       6       4.69
Big5__E_reserved        17  36322        2.58        3.69       4       3.03
Big5__A_friendly        18  36322        3.96        4.13       6       4.70
Big5__O_imagin          19  36322        3.14        3.89       4       3.67
Big5__N_dealStress      20  36322        2.94        3.80       4       3.46
LifeEvent__ChldBrth     21 136627       -2.10        0.85      -2      -2.00
LifeEvent__ChldMvOut    22 136627       -2.10        0.86      -2      -2.00
LifeEvent__Divorce      23 136627       -2.16        0.74      -2      -2.00
LifeEvent__DadDied      24 136627       -2.14        0.77      -2      -2.00
LifeEvent__NewPart      25  71270       -2.08        0.88      -2      -2.00
LifeEvent__Married      26 136627       -2.12        0.81      -2      -2.00
LifeEvent__MomDied      27 136627       -2.14        0.77      -2      -2.00
LifeEvent__MoveIn       28 136627       -2.12        0.81      -2      -2.00
LifeEvent__PartDied     29 136627       -2.16        0.73      -2      -2.00
LifeEvent__SepPart      30 136627       -2.13        0.80      -2      -2.00
                             mad  min      max    range  skew kurtosis       se
Procedural__SID       5039802.18  901 35032702 35031801  0.97    -0.63 30054.22
Procedural__household  575738.06   94  3499900  3499806  0.97    -0.64  3007.98
Demographic__DOB           19.27 1909     1997       88 -0.22    -0.72     0.05
Demographic__Sex            0.00    1        2        1 -0.14    -1.98     0.00
year                        4.45 2005     2015       10 -0.20    -1.15     0.01
Big5__C_thorough            1.48   -5        7       12 -1.60     0.80     0.02
Big5__E_communic            1.48   -5        7       12 -1.45     0.56     0.02
Big5__A_coarse              1.48   -5        7       12 -0.96     0.05     0.02
Big5__O_original            1.48   -5        7       12 -1.33     0.38     0.02
Big5__N_worry               2.97   -5        7       12 -1.18     0.16     0.02
Big5__A_forgive             1.48   -5        7       12 -1.45     0.55     0.02
Big5__C_lazy                1.48   -5        7       12 -0.87     0.17     0.02
Big5__E_sociable            1.48   -5        7       12 -1.37     0.43     0.02
Big5__O_artistic            2.97   -5        7       12 -1.05    -0.05     0.02
Big5__N_nervous             2.97   -5        7       12 -1.06     0.07     0.02
Big5__C_efficient           1.48   -5        7       12 -1.54     0.70     0.02
Big5__E_reserved            2.97   -5        7       12 -1.19     0.18     0.02
Big5__A_friendly            1.48   -5        7       12 -1.56     0.76     0.02
Big5__O_imagin              2.97   -5        7       12 -1.30     0.32     0.02
Big5__N_dealStress          1.48   -5        7       12 -1.30     0.33     0.02
LifeEvent__ChldBrth         0.00   -5        1        6 -1.11     8.88     0.00
LifeEvent__ChldMvOut        0.00   -5        1        6 -1.03     8.70     0.00
LifeEvent__Divorce          0.00   -5        1        6 -2.95    11.55     0.00
LifeEvent__DadDied          0.00   -5        1        6 -2.27    10.78     0.00
LifeEvent__NewPart          0.00   -5        1        6 -0.80     8.42     0.00
LifeEvent__Married          0.00   -5        1        6 -1.61     9.81     0.00
LifeEvent__MomDied          0.00   -5        1        6 -2.27    10.78     0.00
LifeEvent__MoveIn           0.00   -5        1        6 -1.62     9.82     0.00
LifeEvent__PartDied         0.00   -5        1        6 -3.07    11.67     0.00
LifeEvent__SepPart          0.00   -5        1        6 -1.74    10.03     0.00

Rule 2

2. Investigate input variable(s) and relationship between input variables

I’ll show you a better way later, but we haven’t learned everything to do it nicely yet. So instead, we’ll use cor.plot() from the psych package to make a simple heat map of the correlations.
We shouldn’t see that many negative correlations, which flags that we need to reverse score some items

soep_2005 <- soep_long %>% filter(year == 2005) %>% select(-year)
cor.plot(soep_2005, diag = F)

Rule 3

3. Developing a plan for creation of analysis variable

e.g., for each possible value of input variables, what should value of analysis variable be?
I do this in my codebooks, and this topic warrants a discussion in itself. This is our focal topic for next week!
In this case, we want Big Five (EACNO) composites for each wave and to create composites of life events experienced across all years

Rule 4

4. Write code to create analysis variable

From Rule 2, we know we need to recode missing values to NA and reverse code some items. From Rule 3, we know we need to create some composites.
Let’s do that now!

Recoding:

soep_long <- soep_long %>%
  mutate_at(
    vars(contains("Big5"))
    , ~ifelse(. < 0 | is.na(.), NA, .)
    ) %>%
  mutate_at(
    vars(contains("LifeEvent"))
    , ~mapvalues(., seq(-7,1), c(rep(NA, 5), 0, NA, NA, 1), warn_missing = F)
    )

Rule 4

4. Write code to create analysis variable

Remember when we talked about the select() helper functions and mutate_at()?
These will frequently become hugely useful for many cases!

Reverse Coding:

rev_code <- c("Big5__A_coarse", "Big5__C_lazy", "Big5__E_reserved", "Big5__N_dealStress")
soep_long <- soep_long %>%
  mutate_at(
    vars(all_of(rev_code))
    , ~as.numeric(reverse.code(., keys = -1, mini = 1, maxi = 7))
    )

Rule 4

4. Write code to create analysis variable

Let’s check to make sure some correlations just reversed:

soep_2005 <- soep_long %>% filter(year == 2005) %>% select(-year)
cor.plot(soep_2005, diag = F)

Rule 4

4. Write code to create analysis variable

Create Composites (Note: I honestly wouldn’t normally do it like this, but we haven’t learned how to reshape data yet! Check the online materials for code on how to do this)

soep_long <- soep_long %>% 
  group_by(year, Procedural__SID) %>%
  rowwise() %>%
  mutate(
    Big5__E = mean(cbind(Big5__E_reserved, Big5__E_communic, Big5__E_sociable), na.rm = T),
    Big5__A = mean(cbind(Big5__A_coarse, Big5__A_friendly, Big5__A_forgive), na.rm = T),
    Big5__C = mean(cbind(Big5__C_thorough, Big5__C_efficient, Big5__C_lazy), na.rm = T),
    Big5__N = mean(cbind(Big5__N_worry, Big5__N_nervous, Big5__N_dealStress), na.rm = T),
    Big5__O = mean(cbind(Big5__O_original, Big5__O_artistic, Big5__O_imagin), na.rm = T)
    ) %>%
  group_by(Procedural__SID) %>%
  mutate_at(
    vars(contains("LifeEvent"))
    , lst(ever = ~max(., na.rm = T))
    ) %>%
  ungroup() %>%
  filter(year %in% c(2005, 2009, 2013))

Rule 5

5. Run descriptive checks to verify new variables are constructed correctly

Can “comment out” these checks, but don’t delete them

soep_long %>% 
  select(Big5__E:LifeEvent__SepPart_ever) %>%
  describe() %>%
  as_tibble() %>%
  print(n = 15, width = 70)

# A tibble: 15 × 13
    vars     n    mean      sd median    trimmed   mad   min   max
   <int> <dbl>   <dbl>   <dbl>  <dbl>      <dbl> <dbl> <dbl> <dbl>
 1     1 30240    4.82   1.14    5       4.85    0.988     1     7
 2     2 30248    5.40   0.976   5.33    5.44    0.988     1     7
 3     3 30232    5.85   0.943   6       5.95    0.988     1     7
 4     4 30241    3.85   1.22    3.67    3.83    1.48      1     7
 5     5 30225    4.50   1.21    4.67    4.52    1.48      1     7
 6     6 36322 -Inf    NaN       0       0.00351 0      -Inf     1
 7     7 36322 -Inf    NaN       0       0.0440  0      -Inf     1
 8     8 36322 -Inf    NaN       0       0       0      -Inf     1
 9     9 36322 -Inf    NaN       0       0       0      -Inf     1
10    10 36322 -Inf    NaN       0    -Inf       0      -Inf     1
11    11 36322 -Inf    NaN       0       0       0      -Inf     1
12    12 36322 -Inf    NaN       0       0       0      -Inf     1
13    13 36322 -Inf    NaN       0       0       0      -Inf     1
14    14 36322 -Inf    NaN       0       0       0      -Inf     1
15    15 36322 -Inf    NaN       0       0       0      -Inf     1
# ℹ 4 more variables: range <dbl>, skew <dbl>, kurtosis <dbl>,
#   se <dbl>

Rule 5

5. Run descriptive checks to verify new variables are constructed correctly

Uh oh, Inf values popping up what went wrong?
-Inf pops up when there were no non-missing values and you use na.rm = T
Let’s recode those as NA

Rule 5

5. Run descriptive checks to verify new variables are constructed correctly

soep_long <- soep_long %>%
  mutate_all(~ifelse(is.infinite(.) | is.nan(.), NA, .))

Rule 5

5. Run descriptive checks to verify new variables are constructed correctly

Let’s check again:

soep_long %>% 
  select(Big5__E:LifeEvent__SepPart_ever) %>%
  describe() %>%
  as_tibble() %>%
  print(n = 15, width = 70)

# A tibble: 15 × 13
    vars     n   mean    sd median trimmed   mad   min   max range
   <int> <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1 30240 4.82   1.14    5    4.85    0.988     1     7     6
 2     2 30248 5.40   0.976   5.33 5.44    0.988     1     7     6
 3     3 30232 5.85   0.943   6    5.95    0.988     1     7     6
 4     4 30241 3.85   1.22    3.67 3.83    1.48      1     7     6
 5     5 30225 4.50   1.21    4.67 4.52    1.48      1     7     6
 6     6 35665 0.105  0.306   0    0.00589 0         0     1     1
 7     7 35665 0.138  0.345   0    0.0471  0         0     1     1
 8     8 35665 0.0274 0.163   0    0       0         0     1     1
 9     9 35665 0.0670 0.250   0    0       0         0     1     1
10    10 28848 0.0745 0.263   0    0       0         0     1     1
11    11 35665 0.0904 0.287   0    0       0         0     1     1
12    12 35665 0.0683 0.252   0    0       0         0     1     1
13    13 35665 0.0813 0.273   0    0       0         0     1     1
14    14 35665 0.0296 0.169   0    0       0         0     1     1
15    15 35665 0.0792 0.270   0    0       0         0     1     1
# ℹ 3 more variables: skew <dbl>, kurtosis <dbl>, se <dbl>

Rule 5

5. Run descriptive checks to verify new variables are constructed correctly

Let’s check again: ::: fragment

soep_long %>%
  filter(year == 2005) %>%
  select(Big5__E:LifeEvent__SepPart_ever) %>%
  cor.plot(., diag = F)

:::::::

Rule 6

6. Document new variables with notes and labels

Again, I do this in my codebooks, so more on this next week!!

EDA

One-way descriptive analyses (i.e,. focus on one variable)
- Descriptive analyses for continuous variables
- Descriptive analyses for discreet/categorical variables
Two-way descriptive analyses (relationship between two variables)
- Categorical by categorical
- Categorical by continuous
- Continuous by continuous
Realistically, we’ve actually already covered all this above, so we’ll loop back to this after learning tidyr

Outline

~~1. Welcome & Q’s on homework~~
2. Part 1: Data Quality and Descriptives
3. Part 2: tidyr
4. Problem set & Q time

Outline

~~1. Welcome & Q’s on homework~~
~~2. Part 1: Data Quality and Descriptives~~
3. Part 2: tidyr
4. Problem set & Q time

Data Wrangling in `tidyr`

Reshaping and Merging

`tidyr`

Now, let’s build off what we learned from dplyr and focus on reshaping and merging our data.
First, the reshapers:

pivot_longer(), which takes a “wide” format data frame and makes it long.
pivot_wider(), which takes a “long” format data frame and makes it wide.

`tidyr`

Next, the mergers:

full_join(), which merges all rows in either data frame
inner_join(), which merges rows whose keys are present in both data frames
left_join(), which “prioritizes” the first data set
right_join(), which “prioritizes” the second data set

(See also:anti_join() and semi_join())

Key `tidyr` Functions

1. `pivot_longer()`

(Formerly gather()) Makes wide data long, based on a key
Core arguments:
- data: the data, blank if piped
- cols: columns to be made long, selected via select() calls
- names_to: name(s) of key column(s) in new long data frame (string or string vector)
- values_to: name of values in new long data frame (string)
- names_sep: separator in column headers, if multiple keys
- values_drop_na: drop missing cells (similar to na.rm = T)

1. `pivot_longer()`: Basic Application

Let’s start with an easy one – one key, one value:

bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = "item"
    , values_to = "values"
    , values_drop_na = T
  ) %>%
  print(n = 8)

# A tibble: 69,492 × 6
  SID   gender education   age item  values
  <chr>  <int>     <int> <int> <chr>  <int>
1 61617      1        NA    16 A1         2
2 61617      1        NA    16 A2         4
3 61617      1        NA    16 A3         3
4 61617      1        NA    16 A4         4
5 61617      1        NA    16 A5         4
6 61617      1        NA    16 C1         2
7 61617      1        NA    16 C2         3
8 61617      1        NA    16 C3         3
# ℹ 69,484 more rows

1. `pivot_longer()`: More Advanced

Now a harder one – two keys, one value:

bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = c("trait", "item_num")
    , names_sep = -1
    , values_to = "values"
    , values_drop_na = T
  ) %>%
  print(n = 8, width = 50)

# A tibble: 69,492 × 7
  SID   gender education   age trait item_num
  <chr>  <int>     <int> <int> <chr> <chr>   
1 61617      1        NA    16 A     1       
2 61617      1        NA    16 A     2       
3 61617      1        NA    16 A     3       
4 61617      1        NA    16 A     4       
5 61617      1        NA    16 A     5       
6 61617      1        NA    16 C     1       
7 61617      1        NA    16 C     2       
8 61617      1        NA    16 C     3       
# ℹ 69,484 more rows
# ℹ 1 more variable: values <int>

1. `pivot_longer()`: Practice

Let’s practice:

Pivot only the demographic data to long form (i.e. remove the Big Five responses).

Call the names column dem_item
Call the values column dem_value
Do not drop missing values

1. `pivot_longer()`: Practice

Pivot only the demographic data to long form (i.e. remove the Big Five responses).

Call the names column dem_item
Call the values column dem_value
Do not drop missing values

bfi %>%
  select(gender:age) %>%
  pivot_longer(
    cols = everything()
    , names_to = "dem_item"
    , values_to = "dem_value"
    , values_drop_na = F
  )

# A tibble: 8,400 × 2
   dem_item  dem_value
   <chr>         <int>
 1 gender            1
 2 education        NA
 3 age              16
 4 gender            2
 5 education        NA
 6 age              18
 7 gender            2
 8 education        NA
 9 age              17
10 gender            2
# ℹ 8,390 more rows

2. `pivot_wider()`

(Formerly spread()) Makes wide data long, based on a key
Core arguments:
- data: the data, blank if piped
- names_from: name(s) of key column(s) in new long data frame (string or string vector)
- names_sep: separator in column headers, if multiple keys
- names_glue: specify multiple or custom separators of multiple keys
- values_from: name of values in new long data frame (string)
- values_fn: function applied to data with duplicate labels

2. `pivot_wider()`: Basic Application

bfi_long <- bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = "item"
    , values_to = "values"
    , values_drop_na = T
  )

bfi_long %>%
  pivot_wider(
    names_from = "item"
    , values_from = "values"
  )

# A tibble: 2,800 × 29
   SID   gender education   age    A1    A2    A3    A4    A5    C1    C2    C3
   <chr>  <int>     <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1 61617      1        NA    16     2     4     3     4     4     2     3     3
 2 61618      2        NA    18     2     4     5     2     5     5     4     4
 3 61620      2        NA    17     5     4     5     4     4     4     5     4
 4 61621      2        NA    17     4     4     6     5     5     4     4     3
 5 61622      1        NA    17     2     3     3     4     5     4     4     5
 6 61623      2         3    21     6     6     5     6     5     6     6     6
 7 61624      1        NA    18     2     5     5     3     5     5     4     4
 8 61629      1         2    19     4     3     1     5     1     3     2     4
 9 61630      1         1    19     4     3     6     3     3     6     6     3
10 61633      2        NA    17     2     5     6     6     5     6     5     6
# ℹ 2,790 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
#   O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>

2. `pivot_wider()`: More Advanced

bfi_long <- bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = c("trait", "item_num")
    , names_sep = -1
    , values_to = "values"
    , values_drop_na = T
  )

bfi_long %>%
  pivot_wider(
    names_from = c("trait", "item_num")
    , values_from = "values"
    , names_sep = "_"
  ) %>%
  print(n = 7, width = 50)

# A tibble: 2,800 × 29
  SID   gender education   age   A_1   A_2   A_3
  <chr>  <int>     <int> <int> <int> <int> <int>
1 61617      1        NA    16     2     4     3
2 61618      2        NA    18     2     4     5
3 61620      2        NA    17     5     4     5
4 61621      2        NA    17     4     4     6
5 61622      1        NA    17     2     3     3
6 61623      2         3    21     6     6     5
7 61624      1        NA    18     2     5     5
# ℹ 2,793 more rows
# ℹ 22 more variables: A_4 <int>, A_5 <int>,
#   C_1 <int>, C_2 <int>, C_3 <int>, C_4 <int>,
#   C_5 <int>, E_1 <int>, E_2 <int>, E_3 <int>,
#   E_4 <int>, E_5 <int>, N_1 <int>, N_2 <int>,
#   N_3 <int>, N_4 <int>, N_5 <int>, O_1 <int>,
#   O_2 <int>, O_3 <int>, O_4 <int>, O_5 <int>

2. `pivot_wider()`: A Little More Advanced

bfi_long %>%
  select(-item_num) %>%
  pivot_wider(
    names_from = "trait"
    , values_from = "values"
    , names_sep = "_"
    , values_fn = mean
  )

# A tibble: 2,800 × 9
  SID   gender education   age     A     C     E
  <chr>  <int>     <int> <int> <dbl> <dbl> <dbl>
1 61617      1        NA    16   3.4   3.2   3.4
2 61618      2        NA    18   3.6   4     3  
3 61620      2        NA    17   4.4   4     3.8
4 61621      2        NA    17   4.8   4.2   4  
5 61622      1        NA    17   3.4   3.6   3.6
6 61623      2         3    21   5.6   4.4   4  
7 61624      1        NA    18   4     3.6   4.2
# ℹ 2,793 more rows
# ℹ 2 more variables: N <dbl>, O <dbl>

2. `pivot_wider()`: Practice

Let’s practice. Using the soep_long dataset:

Select Only the ID, demographic, year, and the New Partner variable
Pivot that wider
Drop the prefix (LifeEvent__ on the variable)

2. `pivot_wider()`: Practice

Let’s practice. Using the soep_long dataset:

Select Only the ID, demographic, year, and the New Partner variable
Pivot that wider
Make the name prefix NewPart

soep_long %>%
  select(Procedural__SID:year, contains("NewPart")) %>%
  pivot_wider(
    names_from = year
    , names_prefix = "NewPart_"
    , values_from = LifeEvent__NewPart
  )

# A tibble: 22,727 × 8
   Procedural__SID Procedural__household Demographic__DOB Demographic__Sex
             <dbl>                 <dbl>            <dbl>            <dbl>
 1             901                    94             1951                2
 2            1202                   124             1913                2
 3            2301                   230             1946                1
 4            2302                   230             1946                2
 5            2304                   230             1978                1
 6            4601                   469             1933                2
 7            4701                   477             1919                2
 8            4901                   493             1925                2
 9            5201                   523             1955                1
10            5202                   523             1956                2
# ℹ 22,717 more rows
# ℹ 4 more variables: LifeEvent__NewPart_ever <dbl>, NewPart_2005 <dbl>,
#   NewPart_2009 <dbl>, NewPart_2013 <dbl>

More `dplyr` Functions

The `_join()` Functions

Often we may need to pull different data from different sources
There are lots of reasons to need to do this
We don’t have time to get into all the use cases here, so we’ll talk about them in high level terms

We’ll focus on:
- full_join()
- inner_join()
- left_join()
- right_join()

The `_join()` Functions

Let’s separate demographic and BFI data

bfi_only <- bfi %>% 
  rownames_to_column("SID") %>%
  select(SID, matches("[0-9]"))
bfi_only %>% as_tibble() %>% print(n = 6, width = 50)

# A tibble: 2,800 × 26
  SID      A1    A2    A3    A4    A5    C1    C2
  <chr> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3
2 61618     2     4     5     2     5     5     4
3 61620     5     4     5     4     4     4     5
4 61621     4     4     6     5     5     4     4
5 61622     2     3     3     4     5     4     4
6 61623     6     6     5     6     5     6     6
# ℹ 2,794 more rows
# ℹ 18 more variables: C3 <int>, C4 <int>,
#   C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>,
#   O2 <int>, O3 <int>, O4 <int>, O5 <int>

bfi_dem <- bfi %>%
  rownames_to_column("SID") %>%
  select(SID, education, gender, age)
bfi_dem %>% as_tibble() %>% print(n = 6)

# A tibble: 2,800 × 4
  SID   education gender   age
  <chr>     <int>  <int> <int>
1 61617        NA      1    16
2 61618        NA      2    18
3 61620        NA      2    17
4 61621        NA      2    17
5 61622        NA      1    17
6 61623         3      2    21
# ℹ 2,794 more rows

3. `full_join()`

Most simply, we can put those back together keeping all observations.

bfi_only %>%
  full_join(bfi_dem) %>%
  as_tibble() %>%
  print(n = 6, width = 50)

# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2
  <chr> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3
2 61618     2     4     5     2     5     5     4
3 61620     5     4     5     4     4     4     5
4 61621     4     4     6     5     5     4     4
5 61622     2     3     3     4     5     4     4
6 61623     6     6     5     6     5     6     6
# ℹ 2,794 more rows
# ℹ 21 more variables: C3 <int>, C4 <int>,
#   C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>,
#   O2 <int>, O3 <int>, O4 <int>, O5 <int>,
#   education <int>, gender <int>, age <int>

bfi %>%
  rownames_to_column("SID") %>%
  as_tibble() %>%
  print(n = 6, width = 50)

# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2
  <chr> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3
2 61618     2     4     5     2     5     5     4
3 61620     5     4     5     4     4     4     5
4 61621     4     4     6     5     5     4     4
5 61622     2     3     3     4     5     4     4
6 61623     6     6     5     6     5     6     6
# ℹ 2,794 more rows
# ℹ 21 more variables: C3 <int>, C4 <int>,
#   C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>,
#   O2 <int>, O3 <int>, O4 <int>, O5 <int>,
#   gender <int>, education <int>, age <int>

4. `inner_join()`

We can also keep all rows present in both data frames

bfi_dem %>%
  filter(row_number() %in% 1:1700) %>%
  inner_join(
    bfi_only %>%
      filter(row_number() %in% 1200:2800)
  ) %>%
  as_tibble() %>%
  print(n = 6, width = 50)

# A tibble: 501 × 29
  SID   education gender   age    A1    A2    A3
  <chr>     <int>  <int> <int> <int> <int> <int>
1 64151         3      2    18     1     5     6
2 64152         4      2    29     1     5     6
3 64154         5      1    46     2     5     6
4 64155         5      1    58     5     4     4
5 64156         5      2    38     1     4     6
6 64158         5      2    27     2     3     1
# ℹ 495 more rows
# ℹ 22 more variables: A4 <int>, A5 <int>,
#   C1 <int>, C2 <int>, C3 <int>, C4 <int>,
#   C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>,
#   O2 <int>, O3 <int>, O4 <int>, O5 <int>

bfi %>%
  rownames_to_column("SID") %>%
  as_tibble() %>%
  print(n = 6, width = 50)

# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2
  <chr> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3
2 61618     2     4     5     2     5     5     4
3 61620     5     4     5     4     4     4     5
4 61621     4     4     6     5     5     4     4
5 61622     2     3     3     4     5     4     4
6 61623     6     6     5     6     5     6     6
# ℹ 2,794 more rows
# ℹ 21 more variables: C3 <int>, C4 <int>,
#   C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>,
#   O2 <int>, O3 <int>, O4 <int>, O5 <int>,
#   gender <int>, education <int>, age <int>

5. `left_join()`

Or all rows present in the left (first) data frame, perhaps if it’s a subset of people with complete data

bfi_dem %>%
  drop_na() %>%
  left_join(bfi_only) %>%
  as_tibble() %>%
  print(n = 6, width = 50)

# A tibble: 2,577 × 29
  SID   education gender   age    A1    A2    A3
  <chr>     <int>  <int> <int> <int> <int> <int>
1 61623         3      2    21     6     6     5
2 61629         2      1    19     4     3     1
3 61630         1      1    19     4     3     6
4 61634         1      1    21     4     4     5
5 61640         1      1    17     4     5     2
6 61661         5      1    68     1     5     6
# ℹ 2,571 more rows
# ℹ 22 more variables: A4 <int>, A5 <int>,
#   C1 <int>, C2 <int>, C3 <int>, C4 <int>,
#   C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>,
#   O2 <int>, O3 <int>, O4 <int>, O5 <int>

bfi %>%
  rownames_to_column("SID") %>%
  as_tibble() %>%
  print(n = 6, width = 50)

# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2
  <chr> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3
2 61618     2     4     5     2     5     5     4
3 61620     5     4     5     4     4     4     5
4 61621     4     4     6     5     5     4     4
5 61622     2     3     3     4     5     4     4
6 61623     6     6     5     6     5     6     6
# ℹ 2,794 more rows
# ℹ 21 more variables: C3 <int>, C4 <int>,
#   C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>,
#   O2 <int>, O3 <int>, O4 <int>, O5 <int>,
#   gender <int>, education <int>, age <int>

6. `right_join()`

Or all rows present in the right (second) data frame, such as I do when I join a codebook with raw data

bfi_dem %>%
  drop_na() %>%
  right_join(bfi_only) %>%
  as_tibble() %>%
  print(n = 6, width = 50)

# A tibble: 2,800 × 29
  SID   education gender   age    A1    A2    A3
  <chr>     <int>  <int> <int> <int> <int> <int>
1 61623         3      2    21     6     6     5
2 61629         2      1    19     4     3     1
3 61630         1      1    19     4     3     6
4 61634         1      1    21     4     4     5
5 61640         1      1    17     4     5     2
6 61661         5      1    68     1     5     6
# ℹ 2,794 more rows
# ℹ 22 more variables: A4 <int>, A5 <int>,
#   C1 <int>, C2 <int>, C3 <int>, C4 <int>,
#   C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>,
#   O2 <int>, O3 <int>, O4 <int>, O5 <int>

bfi %>%
  rownames_to_column("SID") %>%
  as_tibble() %>%
  print(n = 6, width = 50)

# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2
  <chr> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3
2 61618     2     4     5     2     5     5     4
3 61620     5     4     5     4     4     4     5
4 61621     4     4     6     5     5     4     4
5 61622     2     3     3     4     5     4     4
6 61623     6     6     5     6     5     6     6
# ℹ 2,794 more rows
# ℹ 21 more variables: C3 <int>, C4 <int>,
#   C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>,
#   O2 <int>, O3 <int>, O4 <int>, O5 <int>,
#   gender <int>, education <int>, age <int>

EDA

One-way descriptive analyses (i.e,. focus on one variable)
- Descriptive analyses for continuous variables
- Descriptive analyses for discreet/categorical variables
Two-way descriptive analyses (relationship between two variables)
- Categorical by categorical
- Categorical by continuous
- Continuous by continuous

One-way descriptive analyses

These are basically what they sound like – the focus is on single variables
Descriptive analyses for continuous variables
- means, standard deviations, minima, maxima, counts
Descriptive analyses for discreet/categorical variables
- counts, percentages

One-way descriptives: Continuous

soep_long %>% 
  select(Procedural__SID,year,Big5__E:Big5__O) %>%
  pivot_longer(
    cols = contains("Big5")
    , names_to = "trait"
    , values_to = "value"
    , values_drop_na = T
  ) %>%
  group_by(year, trait) %>%
  summarize_at(
    vars(value)
    , lst(mean, sd, min, max)
    , na.rm = T
  ) %>%
  ungroup()

# A tibble: 15 × 6
    year trait    mean    sd   min   max
   <dbl> <chr>   <dbl> <dbl> <dbl> <dbl>
 1  2005 Big5__A  5.46 0.984     1     7
 2  2005 Big5__C  5.90 0.956     1     7
 3  2005 Big5__E  4.82 1.15      1     7
 4  2005 Big5__N  3.95 1.23      1     7
 5  2005 Big5__O  4.51 1.22      1     7
 6  2009 Big5__A  5.35 0.985     1     7
 7  2009 Big5__C  5.82 0.954     1     7
 8  2009 Big5__E  4.77 1.15      1     7
 9  2009 Big5__N  3.82 1.22      1     7
10  2009 Big5__O  4.40 1.22      1     7
11  2013 Big5__A  5.41 0.955     1     7
12  2013 Big5__C  5.84 0.915     1     7
13  2013 Big5__E  4.87 1.11      1     7
14  2013 Big5__N  3.75 1.21      1     7
15  2013 Big5__O  4.60 1.18      1     7

One-way descriptives: Categorical / Count

soep_long %>%
  select(Procedural__SID, contains("_ever")) %>%
  distinct() %>%
  pivot_longer(
    cols = contains("LifeEvent")
    , names_to = "event"
    , values_to = "value"
    , values_drop_na = T
  ) %>%
  group_by(event, value) %>%
  tally() %>%
  group_by(event) %>%
  mutate(
    total = sum(n)
    , perc = n/total*100
    ) %>%
  print(n = 10, width = 50)

# A tibble: 20 × 5
# Groups:   event [10]
   event                   value     n total  perc
   <chr>                   <dbl> <int> <int> <dbl>
 1 LifeEvent__ChldBrth_ev…     0 19957 22070 90.4 
 2 LifeEvent__ChldBrth_ev…     1  2113 22070  9.57
 3 LifeEvent__ChldMvOut_e…     0 19760 22070 89.5 
 4 LifeEvent__ChldMvOut_e…     1  2310 22070 10.5 
 5 LifeEvent__DadDied_ever     0 20952 22070 94.9 
 6 LifeEvent__DadDied_ever     1  1118 22070  5.07
 7 LifeEvent__Divorce_ever     0 21569 22070 97.7 
 8 LifeEvent__Divorce_ever     1   501 22070  2.27
 9 LifeEvent__Married_ever     0 20331 22070 92.1 
10 LifeEvent__Married_ever     1  1739 22070  7.88
# ℹ 10 more rows

Two-way descriptive analyses

Aims to capture relationships between variables
Categorical by categorical
- cross-tabs, percentages
Categorical by continuous
- means, standard deviations, etc. within categories
Continuous by continuous
- correlations, covariances, etc.

Two-way: Categorical x Categorical

soep_long %>%
  select(Procedural__SID, Demographic__Sex, contains("_ever")) %>%
  distinct() %>%
  pivot_longer(
    cols = contains("LifeEvent")
    , names_to = "event"
    , values_to = "occurred"
    , values_drop_na = T
  ) %>%
  mutate(Demographic__Sex = mapvalues(Demographic__Sex, c(1,2), c("Male", "Female"))
         , occurred = mapvalues(occurred, c(0,1), c("No Event", "Event"))) %>%
  group_by(event, occurred, Demographic__Sex) %>%
  tally() %>%
  group_by(event) %>%
  mutate(perc = n/sum(n)*100) %>%
  pivot_wider(
    names_from = c(occurred)
    , values_from = c(n, perc)
  ) %>%
  print(n = 10)

# A tibble: 20 × 6
# Groups:   event [10]
   event        Demographic__Sex n_Event `n_No Event` perc_Event `perc_No Event`
   <chr>        <chr>              <int>        <int>      <dbl>           <dbl>
 1 LifeEvent__… Female              1157        10645      5.24             48.2
 2 LifeEvent__… Male                 956         9312      4.33             42.2
 3 LifeEvent__… Female              1312        10490      5.94             47.5
 4 LifeEvent__… Male                 998         9270      4.52             42.0
 5 LifeEvent__… Female               614        11188      2.78             50.7
 6 LifeEvent__… Male                 504         9764      2.28             44.2
 7 LifeEvent__… Female               296        11506      1.34             52.1
 8 LifeEvent__… Male                 205        10063      0.929            45.6
 9 LifeEvent__… Female               935        10867      4.24             49.2
10 LifeEvent__… Male                 804         9464      3.64             42.9
# ℹ 10 more rows

Two-way: Categorical x Continuous

soep_twoway <- soep_long %>% 
  filter(year == 2005) %>%
  select(Procedural__SID, Big5__E:Big5__O) %>%
  pivot_longer(
    cols = contains("Big5")
    , names_to = "trait"
    , values_to = "value"
    , values_drop_na = T
  ) %>%
  left_join(
    soep_long %>%
      select(Procedural__SID, contains("_ever")) %>%
      distinct() %>%
      pivot_longer(
        cols = contains("LifeEvent")
        , names_to = "event"
        , values_to = "occurred"
        , values_drop_na = T
      ) %>%
      mutate(occurred = mapvalues(occurred, c(0,1), c("No Event", "Event")))
  )

Two-way: Categorical x Continuous

soep_twoway %>%
  group_by(trait, event, occurred) %>%
  summarize_at(
    vars(value)
    , lst(mean, sd, min, max)
    , na.rm = T
  ) %>%
  ungroup() %>%
  pivot_wider(
    names_from = trait
    , values_from = c(mean, sd, min, max)
  )

# A tibble: 20 × 22
  event         occurred mean_Big5__A mean_Big5__C
  <chr>         <chr>           <dbl>        <dbl>
1 LifeEvent__C… Event            5.48         5.81
2 LifeEvent__C… No Event         5.46         5.91
3 LifeEvent__C… Event            5.48         6.06
4 LifeEvent__C… No Event         5.45         5.87
5 LifeEvent__D… Event            5.39         5.91
6 LifeEvent__D… No Event         5.46         5.90
7 LifeEvent__D… Event            5.32         5.90
8 LifeEvent__D… No Event         5.46         5.90
# ℹ 12 more rows
# ℹ 18 more variables: mean_Big5__E <dbl>,
#   mean_Big5__N <dbl>, mean_Big5__O <dbl>,
#   sd_Big5__A <dbl>, sd_Big5__C <dbl>,
#   sd_Big5__E <dbl>, sd_Big5__N <dbl>,
#   sd_Big5__O <dbl>, min_Big5__A <dbl>,
#   min_Big5__C <dbl>, min_Big5__E <dbl>, …

Two-way: Continuous x Continuous

r <- soep_long %>% 
  filter(year == 2005) %>%
  select(Big5__E:Big5__O) %>%
  cor(., use = "pairwise") 

r[lower.tri(r, diag = T)] <- NA
vars <- rownames(r)
r %>%
  data.frame() %>%
  rownames_to_column("V1") %>%
  pivot_longer(
    cols = -V1
    , names_to = "V2"
    , values_to = "r"
  ) %>%
  mutate(V1 = factor(V1, levels = vars)
         , V2 = factor(V2, levels = rev(vars))) %>%
  ggplot(aes(x = V1, y = V2, fill = r)) + 
    geom_raster() + 
  geom_text(aes(label = round(r, 2))) + 
  scale_fill_gradient2(
    limits = c(-1,1)
    , breaks = c(-1, -.5, 0, .5, 1)
    , low = "blue", high = "red"
    , mid = "white", na.value = "white") + 
  labs(
    x = NULL
    , y = NULL
    , fill = "Zero-Order Correlation"
    , title = "Zero-Order Correlations Among Variables"
    ) + 
  theme_classic() + 
  theme(
    legend.position = "bottom"
    , axis.text = element_text(face = "bold")
    , axis.text.x = element_text(angle = 45, hjust = 1)
    , plot.title = element_text(face = "bold", hjust = .5)
    , plot.subtitle = element_text(face = "italic", hjust = .5)
    , panel.background = element_rect(color = "black", size = 1)
  )

Practice: Putting It Together

P1: Big Five Composites:

Reload the soep_long data frame.
Filter for 2005
Select only the Big Five indicators (hint: use a select() helper function and “Big5”)
Change missing values to NA (see code earlier or try it on your own)
Reverse code “Big5__A_coarse”, “Big5__C_lazy”, “Big5__E_reserved”, “Big5__N_dealStress” (see earlier code or try it on your own)
Pivot the data longer

Use the following names_pattern: “(.*)__(.)_(.*)”
name the resulting columns “category”, “trait”, “p_item”
name the numeric values p_value

Use group_by() and summarize() to create averages for each Big Five trait
Pivot the data back wider by trait (hint: remember to use both category and trait)
Name the data that results soep_pers.

soep_long <- read_csv(file="https://github.com/emoriebeck/psc290-data-FQ23/raw/main/04-workshops/03-week3-tidyr/gsoep.csv")
soep_long

# A tibble: 136,627 × 30
   Procedural__SID Procedural__household Demographic__DOB Demographic__Sex  year
             <dbl>                 <dbl>            <dbl>            <dbl> <dbl>
 1             901                    94             1951                2  2005
 2             901                    94             1951                2  2006
 3             901                    94             1951                2  2007
 4             901                    94             1951                2  2008
 5             901                    94             1951                2  2009
 6             901                    94             1951                2  2010
 7             901                    94             1951                2  2011
 8             901                    94             1951                2  2012
 9             901                    94             1951                2  2013
10             901                    94             1951                2  2014
# ℹ 136,617 more rows
# ℹ 25 more variables: Big5__C_thorough <dbl>, Big5__E_communic <dbl>,
#   Big5__A_coarse <dbl>, Big5__O_original <dbl>, Big5__N_worry <dbl>,
#   Big5__A_forgive <dbl>, Big5__C_lazy <dbl>, Big5__E_sociable <dbl>,
#   Big5__O_artistic <dbl>, Big5__N_nervous <dbl>, Big5__C_efficient <dbl>,
#   Big5__E_reserved <dbl>, Big5__A_friendly <dbl>, Big5__O_imagin <dbl>,
#   Big5__N_dealStress <dbl>, LifeEvent__ChldBrth <dbl>, …

Practice: Putting It Together

rev_code <- c("Big5__A_coarse", "Big5__C_lazy", "Big5__E_reserved", "Big5__N_dealStress")
soep_pers <- soep_long %>%
  filter(year == 2005) %>%
  select(Procedural__SID, starts_with("Big5")) %>%
  mutate_at(
    vars(contains("Big5"))
    , ~ifelse(. < 0 | is.na(.), NA, .)
    ) %>%
  # mutate_at(
  #   vars(contains("LifeEvent"))
  #   , ~mapvalues(., seq(-7,1), c(rep(NA, 5), 0, NA, NA, 1), warn_missing = F)
  #   ) %>%
  mutate_at(
    vars(all_of(rev_code))
    , ~as.numeric(reverse.code(., keys = -1, mini = 1, maxi = 7))
    ) %>%
  pivot_longer(
    cols = -Procedural__SID
    , names_to = c("category", "trait", "p_item")
    , names_pattern = "(.*)__(.)_(.*)"
    , values_to = "p_value"
  ) %>%
  group_by(Procedural__SID, category, trait) %>%
  summarize(p_value = mean(p_value, na.rm = T)) %>%
  ungroup() %>%
  pivot_wider(
    names_from = c("category", "trait")
    , names_sep = "_"
    , values_from = "p_value"
  )

Practice: Putting It Together

Filter 2006 onward
Select only the SID, and life event variables (note: do NOT select year)
Remove missing values (see above or try on your own)
Pivot longer

Make the indicator name “category”, “le_item”
Make the value “le_value”
Drop missing values

Create a summary (max) of life event values across all years
Pivot the data back wider by le_item (hint: remember to use both category and le_item)
Assign the result to an object (soep_le)

Practice: Putting It Together

soep_le <- soep_long %>%
  filter(year > 2005) %>%
  select(Procedural__SID, starts_with("LifeEvent")) %>%
  mutate_at(
    vars(contains("LifeEvent"))
    , ~mapvalues(., seq(-7,1), c(rep(NA, 5), 0, NA, 0, 1), warn_missing = F)
    ) %>%
  pivot_longer(
    cols = -Procedural__SID
    , names_to = c("category", "le_item")
    , names_sep = "__"
    , values_to = "le_value"
    , values_drop_na = T
  ) %>%
  group_by(category, Procedural__SID, le_item) %>%
  summarize(le_value = max(le_value)) %>%
  ungroup() %>%
  pivot_wider(
    names_from = c("category", "le_item")
    , names_sep = "_"
    , values_from = "le_value"
  )

Practice: Putting It Together

Now merge the two objects. Try different mergers and see how that changes the output!

soep_pers %>%
  full_join(soep_le)

# A tibble: 25,986 × 16
   Procedural__SID Big5_A Big5_C Big5_E Big5_N Big5_O LifeEvent_ChldBrth
             <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>              <dbl>
 1             901   4.67   5      3.67   4.67   4                     0
 2            1202   7      6      4.67   5.67   5.67                  0
 3            2301   5      5.67   3.67   4.33   4.33                  0
 4            2302   3.33   6      4.67   2.33   5                     0
 5            2304   5.67   4.33   5      3.67   6.67                  0
 6            4601   6.33   6.5    3.33   6.33   3.67                  0
 7            4701   6      6.33   4.67   3      3                     0
 8            4901   5.33   5.33   6      4.33   3.33                  0
 9            5201   6.33   7      3.67   3.33   5                     0
10            5202   5      5.67   4      4.67   5.33                  0
# ℹ 25,976 more rows
# ℹ 9 more variables: LifeEvent_ChldMvOut <dbl>, LifeEvent_DadDied <dbl>,
#   LifeEvent_Divorce <dbl>, LifeEvent_Married <dbl>, LifeEvent_MomDied <dbl>,
#   LifeEvent_MoveIn <dbl>, LifeEvent_NewPart <dbl>, LifeEvent_PartDied <dbl>,
#   LifeEvent_SepPart <dbl>

Attributions

Parts of Part 1 of these slides was adapted from Ozan Jaquette’s EDUC 260A at UCLA.

Week 3 - Data Quality and tidyr

Outline

Outline

Outline

DATA QUALITY

What is data quality?

Aspects of data quality?

Why should we care about data quality?

What can data quality do for my career?

What can data quality do for my career?

How do I ensure data quality?

How do I ensure data quality?

How I will teach exploratory data analysis

Guidelines for “EDA for data quality”

Guidelines for “EDA for data quality”

Guidelines for “EDA for data quality”

“Unit of analysis” and Unique Identifiers

“Unit of analysis” and Unique Identifiers

Rules for creating new variables

DESCRIPTIVES

Data we will use

Data we will use

Rule 1

1. Never modify “input variable”; instead create new variable based on input variable(s)

Rule 2

2. Investigate input variable(s) and relationship between input variables

Rule 2

2. Investigate input variable(s) and relationship between input variables

Rule 3

3. Developing a plan for creation of analysis variable

Rule 4

4. Write code to create analysis variable

Rule 4

4. Write code to create analysis variable

Rule 4

4. Write code to create analysis variable

Rule 4

4. Write code to create analysis variable

Rule 5

5. Run descriptive checks to verify new variables are constructed correctly

Rule 5

5. Run descriptive checks to verify new variables are constructed correctly

Rule 5

5. Run descriptive checks to verify new variables are constructed correctly

Rule 5

5. Run descriptive checks to verify new variables are constructed correctly

Rule 5

5. Run descriptive checks to verify new variables are constructed correctly

Rule 6

6. Document new variables with notes and labels

EDA

Outline

Outline

Data Wrangling in tidyr

Reshaping and Merging

tidyr

tidyr

Key tidyr Functions

1. pivot_longer()

1. pivot_longer(): Basic Application

1. pivot_longer(): More Advanced

1. pivot_longer(): Practice

1. pivot_longer(): Practice

2. pivot_wider()

2. pivot_wider(): Basic Application

2. pivot_wider(): More Advanced

2. pivot_wider(): A Little More Advanced

2. pivot_wider(): Practice

2. pivot_wider(): Practice

More dplyr Functions

The _join() Functions

The _join() Functions

3. full_join()

4. inner_join()

5. left_join()

6. right_join()

EDA

One-way descriptive analyses

One-way descriptives: Continuous

One-way descriptives: Categorical / Count

Week 3 - Data Quality and `tidyr`

Data Wrangling in `tidyr`

`tidyr`

`tidyr`

Key `tidyr` Functions

1. `pivot_longer()`

1. `pivot_longer()`: Basic Application

1. `pivot_longer()`: More Advanced

1. `pivot_longer()`: Practice

1. `pivot_longer()`: Practice

2. `pivot_wider()`

2. `pivot_wider()`: Basic Application

2. `pivot_wider()`: More Advanced

2. `pivot_wider()`: A Little More Advanced

2. `pivot_wider()`: Practice

2. `pivot_wider()`: Practice

More `dplyr` Functions

The `_join()` Functions

The `_join()` Functions

3. `full_join()`

4. `inner_join()`

5. `left_join()`

6. `right_join()`