tidyr
1. Welcome & Q’s on homework
2. Part 1: Data Quality and Descriptives
3. Part 2: tidyr
4. Problem set & Q time
IBM’s definition of data quality:
“Data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose”
The Towards Data Science website has a nice definition of EDA:
“Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics”
I think about “exploratory data analysis for data quality”
Will teach exploratory data analysis (EDA) in two sub-sections:
Assume that your goal in “EDA for data quality” is to investigate “input” data sources and create “analysis variables”
EDA is not a linear process, and the process will vary across people and projects. Some broad steps:
It is critically important to step through EDA processes at multiple points during data cleaning, from the input / raw data to the output / analysis / clean data.
Always be aware of missing values
They will not always be coded as NA in input variables (e.g., some projects code them as 99, 99999, negative values, etc.)
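For example, a minimal sketch of recoding sentinel missing codes to NA before computing descriptives (the data frame raw and the specific codes are placeholders, not from these slides):
raw %>%
  mutate(across(where(is.numeric), ~ ifelse(. %in% c(99, 99999) | . < 0, NA, .)))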
“Unit of analysis” refers to “what does each observation represent” in an input data source
How to identify unit of analysis
data documentation
investigating the data set
This is very important because we often conduct analyses that span multiple units of analysis (e.g., between- v within-person, person- v stimuli-level, etc.)
We have to be careful and thoughtful about identifiers that let us do that (important for joining data together, which will be the focus of our R workshop today)
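For example, a quick check of the unit of analysis in the SOEP data loaded below (each row should be one person-year):
soep_long %>%
  count(Procedural__SID, year) %>%   # one row per person-year, ideally
  filter(n > 1)                      # any rows returned indicate duplicated person-years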
Rules I follow for variable creation
Use read_csv() from readr (loaded with tidyverse) to import the .csv dataset into R.
library(plyr)       # for mapvalues(); loaded before tidyverse so dplyr's functions are not masked
library(tidyverse)  # readr, dplyr, tidyr, ggplot2, ...
soep_long <- read_csv(file="https://github.com/emoriebeck/psc290-data-FQ23/raw/main/04-workshops/03-week3-tidyr/gsoep.csv")
soep_long
# A tibble: 136,627 × 30
Procedural__SID Procedural__household Demographic__DOB Demographic__Sex year
<dbl> <dbl> <dbl> <dbl> <dbl>
1 901 94 1951 2 2005
2 901 94 1951 2 2006
3 901 94 1951 2 2007
4 901 94 1951 2 2008
5 901 94 1951 2 2009
6 901 94 1951 2 2010
7 901 94 1951 2 2011
8 901 94 1951 2 2012
9 901 94 1951 2 2013
10 901 94 1951 2 2014
# ℹ 136,617 more rows
# ℹ 25 more variables: Big5__C_thorough <dbl>, Big5__E_communic <dbl>,
# Big5__A_coarse <dbl>, Big5__O_original <dbl>, Big5__N_worry <dbl>,
# Big5__A_forgive <dbl>, Big5__C_lazy <dbl>, Big5__E_sociable <dbl>,
# Big5__O_artistic <dbl>, Big5__N_nervous <dbl>, Big5__C_efficient <dbl>,
# Big5__E_reserved <dbl>, Big5__A_friendly <dbl>, Big5__O_imagin <dbl>,
# Big5__N_dealStress <dbl>, LifeEvent__ChldBrth <dbl>, …
Let’s examine the data [you must run this code chunk]
Always keep input variables used to create new variables
I already did this before the data were loaded in. I renamed all the input variables with interpretable names and reshaped them so the time variable (year) is long and the other variables are wide
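The chunk itself isn't reproduced here; descriptives in this format come from describe() in the psych package, e.g.:
library(psych)
describe(soep_long)   # univariate descriptives for every column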
vars n mean sd median trimmed
Procedural__SID 1 136627 10153709.20 11108970.01 3852202 8678436.79
Procedural__household 2 136627 1013575.62 1111841.57 500496 866584.02
Demographic__DOB 3 136627 1962.40 17.58 1964 1962.81
Demographic__Sex 4 136627 1.54 0.50 2 1.54
year 5 136627 2010.40 3.12 2011 2010.47
Big5__C_thorough 6 36322 4.28 4.26 6 5.10
Big5__E_communic 7 36322 3.73 4.09 5 4.42
Big5__A_coarse 8 36322 1.66 3.33 2 1.96
Big5__O_original 9 36322 2.97 3.80 4 3.49
Big5__N_worry 10 36322 2.85 3.82 4 3.31
Big5__A_forgive 11 36322 3.69 4.07 5 4.37
Big5__C_lazy 12 36322 1.13 3.08 1 1.34
Big5__E_sociable 13 36322 3.38 3.97 5 3.98
Big5__O_artistic 14 36322 2.59 3.79 4 3.00
Big5__N_nervous 15 36322 2.20 3.56 3 2.57
Big5__C_efficient 16 36322 3.95 4.14 6 4.69
Big5__E_reserved 17 36322 2.58 3.69 4 3.03
Big5__A_friendly 18 36322 3.96 4.13 6 4.70
Big5__O_imagin 19 36322 3.14 3.89 4 3.67
Big5__N_dealStress 20 36322 2.94 3.80 4 3.46
LifeEvent__ChldBrth 21 136627 -2.10 0.85 -2 -2.00
LifeEvent__ChldMvOut 22 136627 -2.10 0.86 -2 -2.00
LifeEvent__Divorce 23 136627 -2.16 0.74 -2 -2.00
LifeEvent__DadDied 24 136627 -2.14 0.77 -2 -2.00
LifeEvent__NewPart 25 71270 -2.08 0.88 -2 -2.00
LifeEvent__Married 26 136627 -2.12 0.81 -2 -2.00
LifeEvent__MomDied 27 136627 -2.14 0.77 -2 -2.00
LifeEvent__MoveIn 28 136627 -2.12 0.81 -2 -2.00
LifeEvent__PartDied 29 136627 -2.16 0.73 -2 -2.00
LifeEvent__SepPart 30 136627 -2.13 0.80 -2 -2.00
mad min max range skew kurtosis se
Procedural__SID 5039802.18 901 35032702 35031801 0.97 -0.63 30054.22
Procedural__household 575738.06 94 3499900 3499806 0.97 -0.64 3007.98
Demographic__DOB 19.27 1909 1997 88 -0.22 -0.72 0.05
Demographic__Sex 0.00 1 2 1 -0.14 -1.98 0.00
year 4.45 2005 2015 10 -0.20 -1.15 0.01
Big5__C_thorough 1.48 -5 7 12 -1.60 0.80 0.02
Big5__E_communic 1.48 -5 7 12 -1.45 0.56 0.02
Big5__A_coarse 1.48 -5 7 12 -0.96 0.05 0.02
Big5__O_original 1.48 -5 7 12 -1.33 0.38 0.02
Big5__N_worry 2.97 -5 7 12 -1.18 0.16 0.02
Big5__A_forgive 1.48 -5 7 12 -1.45 0.55 0.02
Big5__C_lazy 1.48 -5 7 12 -0.87 0.17 0.02
Big5__E_sociable 1.48 -5 7 12 -1.37 0.43 0.02
Big5__O_artistic 2.97 -5 7 12 -1.05 -0.05 0.02
Big5__N_nervous 2.97 -5 7 12 -1.06 0.07 0.02
Big5__C_efficient 1.48 -5 7 12 -1.54 0.70 0.02
Big5__E_reserved 2.97 -5 7 12 -1.19 0.18 0.02
Big5__A_friendly 1.48 -5 7 12 -1.56 0.76 0.02
Big5__O_imagin 2.97 -5 7 12 -1.30 0.32 0.02
Big5__N_dealStress 1.48 -5 7 12 -1.30 0.33 0.02
LifeEvent__ChldBrth 0.00 -5 1 6 -1.11 8.88 0.00
LifeEvent__ChldMvOut 0.00 -5 1 6 -1.03 8.70 0.00
LifeEvent__Divorce 0.00 -5 1 6 -2.95 11.55 0.00
LifeEvent__DadDied 0.00 -5 1 6 -2.27 10.78 0.00
LifeEvent__NewPart 0.00 -5 1 6 -0.80 8.42 0.00
LifeEvent__Married 0.00 -5 1 6 -1.61 9.81 0.00
LifeEvent__MomDied 0.00 -5 1 6 -2.27 10.78 0.00
LifeEvent__MoveIn 0.00 -5 1 6 -1.62 9.82 0.00
LifeEvent__PartDied 0.00 -5 1 6 -3.07 11.67 0.00
LifeEvent__SepPart 0.00 -5 1 6 -1.74 10.03 0.00
This doesn’t look great: we have negative values where we shouldn’t, which are flags for different kinds of missingness. We’ll have to fix that.
Use cor.plot() from the psych package to make a simple heat map of the correlations.
e.g., for each possible value of the input variables, what should the value of the analysis variable be?
I do this in my codebooks, and this topic warrants a discussion in itself. This is our focal topic for next week!
In this case, we want Big Five (EACNO) composites for each wave and to create composites of life events experienced across all years
Create Composites (Note: I honestly wouldn’t normally do it like this, but we haven’t learned how to reshape data yet! Check the online materials for code on how to do this)
soep_long <- soep_long %>%
  group_by(year, Procedural__SID) %>%
  rowwise() %>%   # compute each Big Five composite within a single person-year row
  mutate(
    Big5__E = mean(cbind(Big5__E_reserved, Big5__E_communic, Big5__E_sociable), na.rm = T),
    Big5__A = mean(cbind(Big5__A_coarse, Big5__A_friendly, Big5__A_forgive), na.rm = T),
    Big5__C = mean(cbind(Big5__C_thorough, Big5__C_efficient, Big5__C_lazy), na.rm = T),
    Big5__N = mean(cbind(Big5__N_worry, Big5__N_nervous, Big5__N_dealStress), na.rm = T),
    Big5__O = mean(cbind(Big5__O_original, Big5__O_artistic, Big5__O_imagin), na.rm = T)) %>%
  group_by(Procedural__SID) %>%   # "ever" indicators: max of each life event across a person's years
  mutate_at(
    vars(contains("LifeEvent"))
    , lst(ever = ~max(., na.rm = T))   # creates new columns with an "_ever" suffix
  ) %>%
  ungroup() %>%
  filter(year %in% c(2005, 2009, 2013))   # keep only the 2005, 2009, and 2013 waves
Can “comment out” these checks, but don’t delete them
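The hidden check is presumably another describe() call restricted to the new analysis variables; a sketch (column names taken from the composites created above):
soep_long %>%
  select(Big5__E, Big5__A, Big5__C, Big5__N, Big5__O, contains("_ever")) %>%
  describe()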
vars n mean sd median trimmed mad min max
Big5__E 1 30240 4.82 1.14 5.00 4.85 0.99 1 7
Big5__A 2 30248 5.40 0.98 5.33 5.44 0.99 1 7
Big5__C 3 30232 5.85 0.94 6.00 5.95 0.99 1 7
Big5__N 4 30241 3.85 1.22 3.67 3.83 1.48 1 7
Big5__O 5 30225 4.50 1.21 4.67 4.52 1.48 1 7
LifeEvent__ChldBrth_ever 6 36322 -Inf NaN 0.00 0.00 0.00 -Inf 1
LifeEvent__ChldMvOut_ever 7 36322 -Inf NaN 0.00 0.04 0.00 -Inf 1
LifeEvent__Divorce_ever 8 36322 -Inf NaN 0.00 0.00 0.00 -Inf 1
LifeEvent__DadDied_ever 9 36322 -Inf NaN 0.00 0.00 0.00 -Inf 1
LifeEvent__NewPart_ever 10 36322 -Inf NaN 0.00 -Inf 0.00 -Inf 1
LifeEvent__Married_ever 11 36322 -Inf NaN 0.00 0.00 0.00 -Inf 1
LifeEvent__MomDied_ever 12 36322 -Inf NaN 0.00 0.00 0.00 -Inf 1
LifeEvent__MoveIn_ever 13 36322 -Inf NaN 0.00 0.00 0.00 -Inf 1
LifeEvent__PartDied_ever 14 36322 -Inf NaN 0.00 0.00 0.00 -Inf 1
LifeEvent__SepPart_ever 15 36322 -Inf NaN 0.00 0.00 0.00 -Inf 1
range skew kurtosis se
Big5__E 6 -0.26 -0.17 0.01
Big5__A 6 -0.39 -0.11 0.01
Big5__C 6 -0.84 0.53 0.01
Big5__N 6 0.13 -0.31 0.01
Big5__O 6 -0.18 -0.23 0.01
LifeEvent__ChldBrth_ever Inf NaN NaN NaN
LifeEvent__ChldMvOut_ever Inf NaN NaN NaN
LifeEvent__Divorce_ever Inf NaN NaN NaN
LifeEvent__DadDied_ever Inf NaN NaN NaN
LifeEvent__NewPart_ever Inf NaN NaN NaN
LifeEvent__Married_ever Inf NaN NaN NaN
LifeEvent__MomDied_ever Inf NaN NaN NaN
LifeEvent__MoveIn_ever Inf NaN NaN NaN
LifeEvent__PartDied_ever Inf NaN NaN NaN
LifeEvent__SepPart_ever Inf NaN NaN NaN
Can “comment out” these checks, but don’t delete them
Inf values popping up – what went wrong? -Inf pops up when there were no non-missing values and you use na.rm = T; those cells should really be NA.
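One way to repair this (a sketch, not necessarily the exact code behind the corrected descriptives below) is to convert the infinite values back to NA:
soep_long <- soep_long %>%
  mutate(across(contains("_ever"), ~ ifelse(is.infinite(.), NA, .)))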
Can “comment out” these checks, but don’t delete them
vars n mean sd median trimmed mad min max
Big5__E 1 30240 4.82 1.14 5.00 4.85 0.99 1 7
Big5__A 2 30248 5.40 0.98 5.33 5.44 0.99 1 7
Big5__C 3 30232 5.85 0.94 6.00 5.95 0.99 1 7
Big5__N 4 30241 3.85 1.22 3.67 3.83 1.48 1 7
Big5__O 5 30225 4.50 1.21 4.67 4.52 1.48 1 7
LifeEvent__ChldBrth_ever 6 35665 0.10 0.31 0.00 0.01 0.00 0 1
LifeEvent__ChldMvOut_ever 7 35665 0.14 0.34 0.00 0.05 0.00 0 1
LifeEvent__Divorce_ever 8 35665 0.03 0.16 0.00 0.00 0.00 0 1
LifeEvent__DadDied_ever 9 35665 0.07 0.25 0.00 0.00 0.00 0 1
LifeEvent__NewPart_ever 10 28848 0.07 0.26 0.00 0.00 0.00 0 1
LifeEvent__Married_ever 11 35665 0.09 0.29 0.00 0.00 0.00 0 1
LifeEvent__MomDied_ever 12 35665 0.07 0.25 0.00 0.00 0.00 0 1
LifeEvent__MoveIn_ever 13 35665 0.08 0.27 0.00 0.00 0.00 0 1
LifeEvent__PartDied_ever 14 35665 0.03 0.17 0.00 0.00 0.00 0 1
LifeEvent__SepPart_ever 15 35665 0.08 0.27 0.00 0.00 0.00 0 1
range skew kurtosis se
Big5__E 6 -0.26 -0.17 0.01
Big5__A 6 -0.39 -0.11 0.01
Big5__C 6 -0.84 0.53 0.01
Big5__N 6 0.13 -0.31 0.01
Big5__O 6 -0.18 -0.23 0.01
LifeEvent__ChldBrth_ever 1 2.58 4.67 0.00
LifeEvent__ChldMvOut_ever 1 2.10 2.42 0.00
LifeEvent__Divorce_ever 1 5.79 31.49 0.00
LifeEvent__DadDied_ever 1 3.46 9.99 0.00
LifeEvent__NewPart_ever 1 3.24 8.51 0.00
LifeEvent__Married_ever 1 2.86 6.16 0.00
LifeEvent__MomDied_ever 1 3.42 9.71 0.00
LifeEvent__MoveIn_ever 1 3.06 7.39 0.00
LifeEvent__PartDied_ever 1 5.56 28.87 0.00
LifeEvent__SepPart_ever 1 3.12 7.71 0.00
Again, I do this in my codebooks, so more on this next week!!
tidyr
1. Welcome & Q’s on homework
2. Part 1: Data Quality and Descriptives
3. Part 2: tidyr
4. Problem set & Q time
tidyr
pivot_longer(), which takes a “wide” format data frame and makes it long.
pivot_wider(), which takes a “long” format data frame and makes it wide.

full_join(), which merges all rows in either data frame
inner_join(), which merges rows whose keys are present in both data frames
left_join(), which “prioritizes” the first data set
right_join(), which “prioritizes” the second data set
(See also: anti_join() and semi_join())
tidyr Functions
pivot_longer() (previously gather()): makes wide data long, based on a key
data: the data, blank if piped
cols: columns to be made long, selected via select()-style calls
names_to: name(s) of key column(s) in the new long data frame (string or string vector)
values_to: name of the values column in the new long data frame (string)
names_sep: separator in the column headers, if multiple keys
values_drop_na: drop missing cells (similar to na.rm = T)

pivot_longer(): Basic Application
Let’s start with an easy one – one key, one value:
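The chunk that produced this output isn't reproduced on the slide; a sketch of roughly what it looks like, assuming the bfi data from the psych package (the object name bfi_long_1key is mine, not the slides'):
library(psych)       # bfi: 2,800 respondents, 25 personality items plus demographics
library(tidyverse)

bfi_long_1key <- bfi %>%
  rownames_to_column("SID") %>%     # participant IDs are stored as row names
  pivot_longer(
    cols = A1:O5                    # the 25 item columns
    , names_to = "item"             # one key: the item label
    , values_to = "values"
    , values_drop_na = TRUE         # drop missing item responses
  )
bfi_long_1key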
# A tibble: 69,492 × 6
SID gender education age item values
<chr> <int> <int> <int> <chr> <int>
1 61617 1 NA 16 A1 2
2 61617 1 NA 16 A2 4
3 61617 1 NA 16 A3 3
4 61617 1 NA 16 A4 4
5 61617 1 NA 16 A5 4
6 61617 1 NA 16 C1 2
7 61617 1 NA 16 C2 3
8 61617 1 NA 16 C3 3
# ℹ 69,484 more rows
pivot_longer(): More Advanced Application
Now a harder one – two keys, one value:
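Again the slide's own chunk is hidden; one plausible sketch, where names_pattern splits an item label like "A1" into its letter and number (assigned to bfi_long to match how that object is used in the pivot_wider() examples below):
bfi_long <- bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = c("trait", "item_num")     # two keys
    , names_pattern = "([A-Z])([0-9])"      # "A1" -> trait "A", item_num "1"
    , values_to = "values"
    , values_drop_na = TRUE
  )
bfi_long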
# A tibble: 69,492 × 7
SID gender education age trait item_num values
<chr> <int> <int> <int> <chr> <chr> <int>
1 61617 1 NA 16 A 1 2
2 61617 1 NA 16 A 2 4
3 61617 1 NA 16 A 3 3
4 61617 1 NA 16 A 4 4
5 61617 1 NA 16 A 5 4
6 61617 1 NA 16 C 1 2
7 61617 1 NA 16 C 2 3
8 61617 1 NA 16 C 3 3
# ℹ 69,484 more rows
pivot_wider() (previously spread()): makes long data wide, based on a key
data: the data, blank if piped
names_from: name(s) of key column(s) whose values become the new column names (string or string vector)
names_sep: separator placed between multiple keys in the new column headers
names_glue: specify a custom way of combining multiple keys into column names
values_from: name of the column whose values fill the new wide columns (string)
values_fn: function applied to cells that have duplicate keys

pivot_wider(): Basic Application
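The chunk for this slide isn't shown; a minimal sketch, assuming the one-key long data created above (bfi_long_1key, a hypothetical name):
bfi_long_1key %>%
  pivot_wider(
    names_from = "item"        # the item labels become the new column names
    , values_from = "values"
  )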
# A tibble: 2,800 × 29
SID gender education age A1 A2 A3 A4 A5 C1 C2 C3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 1 NA 16 2 4 3 4 4 2 3 3
2 61618 2 NA 18 2 4 5 2 5 5 4 4
3 61620 2 NA 17 5 4 5 4 4 4 5 4
4 61621 2 NA 17 4 4 6 5 5 4 4 3
5 61622 1 NA 17 2 3 3 4 5 4 4 5
6 61623 2 3 21 6 6 5 6 5 6 6 6
7 61624 1 NA 18 2 5 5 3 5 5 4 4
8 61629 1 2 19 4 3 1 5 1 3 2 4
9 61630 1 1 19 4 3 6 3 3 6 6 3
10 61633 2 NA 17 2 5 6 6 5 6 5 6
# ℹ 2,790 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
# E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
# O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
pivot_wider(): More Advanced
bfi_long %>%
pivot_wider(
names_from = c("trait", "item_num")
, values_from = "values"
, names_sep = "_"
)
# A tibble: 2,800 × 29
SID gender education age A_1 A_2 A_3 A_4 A_5 C_1 C_2 C_3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 1 NA 16 2 4 3 4 4 2 3 3
2 61618 2 NA 18 2 4 5 2 5 5 4 4
3 61620 2 NA 17 5 4 5 4 4 4 5 4
4 61621 2 NA 17 4 4 6 5 5 4 4 3
5 61622 1 NA 17 2 3 3 4 5 4 4 5
6 61623 2 3 21 6 6 5 6 5 6 6 6
7 61624 1 NA 18 2 5 5 3 5 5 4 4
8 61629 1 2 19 4 3 1 5 1 3 2 4
9 61630 1 1 19 4 3 6 3 3 6 6 3
10 61633 2 NA 17 2 5 6 6 5 6 5 6
# ℹ 2,790 more rows
# ℹ 17 more variables: C_4 <int>, C_5 <int>, E_1 <int>, E_2 <int>, E_3 <int>,
# E_4 <int>, E_5 <int>, N_1 <int>, N_2 <int>, N_3 <int>, N_4 <int>,
# N_5 <int>, O_1 <int>, O_2 <int>, O_3 <int>, O_4 <int>, O_5 <int>
pivot_wider(): A Little More Advanced
bfi_long %>%
select(-item_num) %>%
pivot_wider(
names_from = "trait"
, values_from = "values"
, names_sep = "_"
, values_fn = mean
)
# A tibble: 2,800 × 9
SID gender education age A C E N O
<chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 61617 1 NA 16 3.4 3.2 3.4 2.8 3.8
2 61618 2 NA 18 3.6 4 3 3.8 3.2
3 61620 2 NA 17 4.4 4 3.8 3.6 3.6
4 61621 2 NA 17 4.8 4.2 4 2.8 3.6
5 61622 1 NA 17 3.4 3.6 3.6 3.2 3.2
6 61623 2 3 21 5.6 4.4 4 3 3.8
7 61624 1 NA 18 4 3.6 4.2 1.4 3.8
8 61629 1 2 19 2.8 3 3.2 4.2 3.4
9 61630 1 1 19 3.8 4.8 3.75 3.6 5
10 61633 2 NA 17 4.8 4 3.6 4.2 3.6
# ℹ 2,790 more rows
dplyr _join() Functions
full_join()
inner_join()
left_join()
right_join()

_join() Functions
bfi_only <- bfi %>%
  rownames_to_column("SID") %>%     # participant IDs are stored as row names
  select(SID, matches("[0-9]"))     # keep SID plus the 25 numbered item columns
bfi_only %>% as_tibble() %>% print(n = 6)
# A tibble: 2,800 × 26
SID A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 2 4 3 4 4 2 3 3 4 4 3 3
2 61618 2 4 5 2 5 5 4 4 3 4 1 1
3 61620 5 4 5 4 4 4 5 4 2 5 2 4
4 61621 4 4 6 5 5 4 4 3 5 5 5 3
5 61622 2 3 3 4 5 4 4 5 3 2 2 2
6 61623 6 6 5 6 5 6 6 6 1 3 2 1
# ℹ 2,794 more rows
# ℹ 13 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
# N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
# O5 <int>
bfi_dem <- bfi %>%
rownames_to_column("SID") %>%
select(SID, education, gender, age)
bfi_dem %>% as_tibble() %>% print(n = 6)
# A tibble: 2,800 × 4
SID education gender age
<chr> <int> <int> <int>
1 61617 NA 1 16
2 61618 NA 2 18
3 61620 NA 2 17
4 61621 NA 2 17
5 61622 NA 1 17
6 61623 3 2 21
# ℹ 2,794 more rows
full_join()
Most simply, we can put those back together keeping all observations.
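The join call itself isn't shown; a minimal sketch that would produce output like the below (the two data frames share only the SID column, so the join key is detected automatically):
bfi_only %>%
  full_join(bfi_dem) %>%   # keeps all rows from both data frames
  as_tibble() %>%
  print(n = 6)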
# A tibble: 2,800 × 29
SID A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 2 4 3 4 4 2 3 3 4 4 3 3
2 61618 2 4 5 2 5 5 4 4 3 4 1 1
3 61620 5 4 5 4 4 4 5 4 2 5 2 4
4 61621 4 4 6 5 5 4 4 3 5 5 5 3
5 61622 2 3 3 4 5 4 4 5 3 2 2 2
6 61623 6 6 5 6 5 6 6 6 1 3 2 1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
# N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
# O5 <int>, education <int>, gender <int>, age <int>
# A tibble: 2,800 × 29
SID A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 2 4 3 4 4 2 3 3 4 4 3 3
2 61618 2 4 5 2 5 5 4 4 3 4 1 1
3 61620 5 4 5 4 4 4 5 4 2 5 2 4
4 61621 4 4 6 5 5 4 4 3 5 5 5 3
5 61622 2 3 3 4 5 4 4 5 3 2 2 2
6 61623 6 6 5 6 5 6 6 6 1 3 2 1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
# N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
# O5 <int>, gender <int>, education <int>, age <int>
inner_join()
We can also keep all rows present in both data frames
bfi_dem %>%
  filter(row_number() %in% 1:1700) %>%      # demographics for rows 1-1700 only
  inner_join(
    bfi_only %>%
      filter(row_number() %in% 1200:2800)   # items for rows 1200-2800 only
  ) %>%                                     # only the 501 overlapping rows survive
  as_tibble() %>%
  print(n = 6)
# A tibble: 501 × 29
SID education gender age A1 A2 A3 A4 A5 C1 C2 C3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 64151 3 2 18 1 5 6 5 5 5 6 5
2 64152 4 2 29 1 5 6 5 5 2 1 4
3 64154 5 1 46 2 5 6 5 6 6 6 6
4 64155 5 1 58 5 4 4 4 5 4 4 5
5 64156 5 2 38 1 4 6 6 6 4 4 5
6 64158 5 2 27 2 3 1 1 1 4 2 2
# ℹ 495 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
# E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
# O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
# A tibble: 2,800 × 29
SID A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 2 4 3 4 4 2 3 3 4 4 3 3
2 61618 2 4 5 2 5 5 4 4 3 4 1 1
3 61620 5 4 5 4 4 4 5 4 2 5 2 4
4 61621 4 4 6 5 5 4 4 3 5 5 5 3
5 61622 2 3 3 4 5 4 4 5 3 2 2 2
6 61623 6 6 5 6 5 6 6 6 1 3 2 1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
# N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
# O5 <int>, gender <int>, education <int>, age <int>
left_join()
Or all rows present in the left (first) data frame, perhaps if it’s a subset of people with complete data
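Again the chunk is hidden; one plausible sketch, where drop_na() is my assumption for "a subset of people with complete data":
bfi_dem %>%
  drop_na() %>%            # assumed: keep only respondents with complete demographics
  left_join(bfi_only) %>%  # keep every row of the (filtered) left data frame
  as_tibble() %>%
  print(n = 6)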
# A tibble: 2,577 × 29
SID education gender age A1 A2 A3 A4 A5 C1 C2 C3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61623 3 2 21 6 6 5 6 5 6 6 6
2 61629 2 1 19 4 3 1 5 1 3 2 4
3 61630 1 1 19 4 3 6 3 3 6 6 3
4 61634 1 1 21 4 4 5 6 5 4 3 5
5 61640 1 1 17 4 5 2 2 1 5 5 5
6 61661 5 1 68 1 5 6 5 6 4 3 2
# ℹ 2,571 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
# E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
# O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
# A tibble: 2,800 × 29
SID A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 2 4 3 4 4 2 3 3 4 4 3 3
2 61618 2 4 5 2 5 5 4 4 3 4 1 1
3 61620 5 4 5 4 4 4 5 4 2 5 2 4
4 61621 4 4 6 5 5 4 4 3 5 5 5 3
5 61622 2 3 3 4 5 4 4 5 3 2 2 2
6 61623 6 6 5 6 5 6 6 6 1 3 2 1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
# N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
# O5 <int>, gender <int>, education <int>, age <int>
right_join()
Or all rows present in the right (second) data frame, such as I do when I join a codebook with raw data
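A parallel sketch for the right join (same assumed complete-cases subset on the left):
bfi_dem %>%
  drop_na() %>%             # assumed complete-cases subset, as above
  right_join(bfi_only) %>%  # but now every row of bfi_only (the right data frame) is kept
  as_tibble() %>%
  print(n = 6)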
# A tibble: 2,800 × 29
SID education gender age A1 A2 A3 A4 A5 C1 C2 C3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61623 3 2 21 6 6 5 6 5 6 6 6
2 61629 2 1 19 4 3 1 5 1 3 2 4
3 61630 1 1 19 4 3 6 3 3 6 6 3
4 61634 1 1 21 4 4 5 6 5 4 3 5
5 61640 1 1 17 4 5 2 2 1 5 5 5
6 61661 5 1 68 1 5 6 5 6 4 3 2
# ℹ 2,794 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
# E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
# O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
# A tibble: 2,800 × 29
SID A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 2 4 3 4 4 2 3 3 4 4 3 3
2 61618 2 4 5 2 5 5 4 4 3 4 1 1
3 61620 5 4 5 4 4 4 5 4 2 5 2 4
4 61621 4 4 6 5 5 4 4 3 5 5 5 3
5 61622 2 3 3 4 5 4 4 5 3 2 2 2
6 61623 6 6 5 6 5 6 6 6 1 3 2 1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
# N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
# O5 <int>, gender <int>, education <int>, age <int>
# A tibble: 15 × 6
year trait mean sd min max
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2005 Big5__A 5.46 0.984 1 7
2 2005 Big5__C 5.90 0.956 1 7
3 2005 Big5__E 4.82 1.15 1 7
4 2005 Big5__N 3.95 1.23 1 7
5 2005 Big5__O 4.51 1.22 1 7
6 2009 Big5__A 5.35 0.985 1 7
7 2009 Big5__C 5.82 0.954 1 7
8 2009 Big5__E 4.77 1.15 1 7
9 2009 Big5__N 3.82 1.22 1 7
10 2009 Big5__O 4.40 1.22 1 7
11 2013 Big5__A 5.41 0.955 1 7
12 2013 Big5__C 5.84 0.915 1 7
13 2013 Big5__E 4.87 1.11 1 7
14 2013 Big5__N 3.75 1.21 1 7
15 2013 Big5__O 4.60 1.18 1 7
# A tibble: 20 × 5
# Groups: event [10]
event value n total perc
<chr> <dbl> <int> <int> <dbl>
1 LifeEvent__ChldBrth_ever 0 19957 22070 90.4
2 LifeEvent__ChldBrth_ever 1 2113 22070 9.57
3 LifeEvent__ChldMvOut_ever 0 19760 22070 89.5
4 LifeEvent__ChldMvOut_ever 1 2310 22070 10.5
5 LifeEvent__DadDied_ever 0 20952 22070 94.9
6 LifeEvent__DadDied_ever 1 1118 22070 5.07
7 LifeEvent__Divorce_ever 0 21569 22070 97.7
8 LifeEvent__Divorce_ever 1 501 22070 2.27
9 LifeEvent__Married_ever 0 20331 22070 92.1
10 LifeEvent__Married_ever 1 1739 22070 7.88
11 LifeEvent__MomDied_ever 0 20956 22070 95.0
12 LifeEvent__MomDied_ever 1 1114 22070 5.05
13 LifeEvent__MoveIn_ever 0 20455 22070 92.7
14 LifeEvent__MoveIn_ever 1 1615 22070 7.32
15 LifeEvent__NewPart_ever 0 15089 16471 91.6
16 LifeEvent__NewPart_ever 1 1382 16471 8.39
17 LifeEvent__PartDied_ever 0 21609 22070 97.9
18 LifeEvent__PartDied_ever 1 461 22070 2.09
19 LifeEvent__SepPart_ever 0 20572 22070 93.2
20 LifeEvent__SepPart_ever 1 1498 22070 6.79
Two-way descriptives aim to capture relationships between variables
Categorical by categorical
Categorical by continuous
Continuous by continuous
soep_long %>%
select(Procedural__SID, Demographic__Sex, contains("_ever")) %>%
distinct() %>%
pivot_longer(
cols = contains("LifeEvent")
, names_to = "event"
, values_to = "occurred"
, values_drop_na = T
) %>%
mutate(Demographic__Sex = mapvalues(Demographic__Sex, c(1,2), c("Male", "Female"))
, occurred = mapvalues(occurred, c(0,1), c("No Event", "Event"))) %>%
group_by(event, occurred, Demographic__Sex) %>%
tally() %>%
group_by(event) %>%
mutate(perc = n/sum(n)*100) %>%
pivot_wider(
names_from = c(occurred)
, values_from = c(n, perc)
)
# A tibble: 20 × 6
# Groups: event [10]
event Demographic__Sex n_Event `n_No Event` perc_Event `perc_No Event`
<chr> <chr> <int> <int> <dbl> <dbl>
1 LifeEvent__… Female 1157 10645 5.24 48.2
2 LifeEvent__… Male 956 9312 4.33 42.2
3 LifeEvent__… Female 1312 10490 5.94 47.5
4 LifeEvent__… Male 998 9270 4.52 42.0
5 LifeEvent__… Female 614 11188 2.78 50.7
6 LifeEvent__… Male 504 9764 2.28 44.2
7 LifeEvent__… Female 296 11506 1.34 52.1
8 LifeEvent__… Male 205 10063 0.929 45.6
9 LifeEvent__… Female 935 10867 4.24 49.2
10 LifeEvent__… Male 804 9464 3.64 42.9
11 LifeEvent__… Female 589 11213 2.67 50.8
12 LifeEvent__… Male 525 9743 2.38 44.1
13 LifeEvent__… Female 875 10927 3.96 49.5
14 LifeEvent__… Male 740 9528 3.35 43.2
15 LifeEvent__… Female 820 8099 4.98 49.2
16 LifeEvent__… Male 562 6990 3.41 42.4
17 LifeEvent__… Female 333 11469 1.51 52.0
18 LifeEvent__… Male 128 10140 0.580 45.9
19 LifeEvent__… Female 887 10915 4.02 49.5
20 LifeEvent__… Male 611 9657 2.77 43.8
soep_twoway <- soep_long %>%
filter(year == 2005) %>%
select(Procedural__SID, Big5__E:Big5__O) %>%
pivot_longer(
cols = contains("Big5")
, names_to = "trait"
, values_to = "value"
, values_drop_na = T
) %>%
left_join(
soep_long %>%
select(Procedural__SID, contains("_ever")) %>%
distinct() %>%
pivot_longer(
cols = contains("LifeEvent")
, names_to = "event"
, values_to = "occurred"
, values_drop_na = T
) %>%
mutate(occurred = mapvalues(occurred, c(0,1), c("No Event", "Event")))
)
# A tibble: 20 × 22
event occurred mean_Big5__A mean_Big5__C mean_Big5__E mean_Big5__N
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 LifeEvent__Chld… Event 5.48 5.81 4.93 3.84
2 LifeEvent__Chld… No Event 5.46 5.91 4.81 3.97
3 LifeEvent__Chld… Event 5.48 6.06 4.87 3.96
4 LifeEvent__Chld… No Event 5.45 5.87 4.81 3.95
5 LifeEvent__DadD… Event 5.39 5.91 4.91 3.97
6 LifeEvent__DadD… No Event 5.46 5.90 4.81 3.95
7 LifeEvent__Divo… Event 5.32 5.90 5.01 4.08
8 LifeEvent__Divo… No Event 5.46 5.90 4.81 3.95
9 LifeEvent__Marr… Event 5.44 5.80 4.99 3.88
10 LifeEvent__Marr… No Event 5.46 5.91 4.80 3.96
11 LifeEvent__MomD… Event 5.49 6.06 4.80 4.01
12 LifeEvent__MomD… No Event 5.46 5.89 4.82 3.95
13 LifeEvent__Move… Event 5.43 5.73 5.08 3.95
14 LifeEvent__Move… No Event 5.46 5.91 4.80 3.95
15 LifeEvent__NewP… Event 5.32 5.56 5.14 3.90
16 LifeEvent__NewP… No Event 5.48 5.97 4.84 3.94
17 LifeEvent__Part… Event 5.61 6.03 4.72 4.12
18 LifeEvent__Part… No Event 5.45 5.90 4.82 3.95
19 LifeEvent__SepP… Event 5.33 5.76 5.03 4.03
20 LifeEvent__SepP… No Event 5.47 5.91 4.80 3.95
# ℹ 16 more variables: mean_Big5__O <dbl>, sd_Big5__A <dbl>, sd_Big5__C <dbl>,
# sd_Big5__E <dbl>, sd_Big5__N <dbl>, sd_Big5__O <dbl>, min_Big5__A <dbl>,
# min_Big5__C <dbl>, min_Big5__E <dbl>, min_Big5__N <dbl>, min_Big5__O <dbl>,
# max_Big5__A <dbl>, max_Big5__C <dbl>, max_Big5__E <dbl>, max_Big5__N <dbl>,
# max_Big5__O <dbl>
r <- soep_long %>%
filter(year == 2005) %>%
select(Big5__E:Big5__O) %>%
cor(., use = "pairwise")
r[lower.tri(r, diag = T)] <- NA
vars <- rownames(r)
r %>%
data.frame() %>%
rownames_to_column("V1") %>%
pivot_longer(
cols = -V1
, names_to = "V2"
, values_to = "r"
) %>%
mutate(V1 = factor(V1, levels = vars)
, V2 = factor(V2, levels = rev(vars))) %>%
ggplot(aes(x = V1, y = V2, fill = r)) +
geom_raster() +
geom_text(aes(label = round(r, 2))) +
scale_fill_gradient2(
limits = c(-1,1)
, breaks = c(-1, -.5, 0, .5, 1)
, low = "blue", high = "red"
, mid = "white", na.value = "white") +
labs(
x = NULL
, y = NULL
, fill = "Zero-Order Correlation"
, title = "Zero-Order Correlations Among Variables"
) +
theme_classic() +
theme(
legend.position = "bottom"
, axis.text = element_text(face = "bold")
, axis.text.x = element_text(angle = 45, hjust = 1)
, plot.title = element_text(face = "bold", hjust = .5)
, plot.subtitle = element_text(face = "italic", hjust = .5)
, panel.background = element_rect(color = "black", size = 1)
)
Parts of Part 1 of these slides were adapted from Ozan Jaquette’s EDUC 260A at UCLA.
PSC 290 - Data Management and Cleaning