In this case, we are going to using some data from the German Socioeconomic Panel Study (GSOEP), which is an ongoing Panel Study in Germany. Note that these data are for teaching purposes only, shared under the license for the Comprehensive SOEP teaching dataset, which I, as a contracted SOEP user, can use for teaching purposes. These data represent select cases from the full data set and should not be used for the purpose of publication. The full data are available for free at https://www.diw.de/en/diw_02.c.222829.en/access_and_ordering.html.
For this tutorial, I created the codebook for you: Download, and included what I believe are the core columns you may need. Some of these columns may not be particularly helpful for every dataset.
Here are my core columns that are based on the original data:
dataset
)category
)item
)name
)old_name
)item_text
)scale
)recode_desc
)recode
)reverse
)mini
)maxi
)year
or wave
)meta
)Here are additional columns that will make our lives easier or are applicable to some but not all data sets:
Below, I’ll load in the codebook we will use for this study, which will include all of the above columns.
The resulting codebook looks something like this:
## Codebooks - In addition, to the codebook, I also document other overarching info three other tabs: * Overview just lists what variables I’m considering as serving different functions (e.g., demographics, covariates, moderators, predictors, outcomes, indepenedent variables, dependent variables, etc.) * Key helps me create tables that I’ll be able to use in R
to help me rename things. This is super helpful for making final tables and figures!! * Sample helps other people understand how you’re using the columns. This is generally good practice and also helpful if you have research assistants or collaborators helping you out!
First, we need to load in the data. We’re going to use three waves of data from the German Socioeconomic Panel Study, which is a longitudinal study of German households that has been conducted since 1984. We’re going to use more recent data from three waves of personality data collected between 2005 and 2013.
Note: we will be using the teaching set of the GSOEP data set. I will not be pulling from the raw files as a result of this. I will also not be mirroring the format that you would usually load the GSOEP from because that is slightly more complicated and something we will return to in a later tutorial on purrr
(link) after we have more skills. I’ve left that code in the .qmd
for now, but it won’t make a lot of sense right now.
The code below first prints the sheets and then reads in the codebook page that has all the variables
[1] "codebook" "Overview" "Key" "Sample"
codebook <- readxl::read_excel(path = "codebook.xlsx", sheet = "codebook") %>%
mutate(old_name = str_to_lower(old_name))
codebook
# A tibble: 153 × 15
dataset category name item_name old_name year item_text scale recode_text
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 <NA> Procedural <NA> SID persnr 0 Never Ch… <NA> <NA>
2 <NA> Procedural <NA> household hhnr 0 househol… <NA> <NA>
3 ppfad Demograph… DOB DOB gebjahr 0 Year of … "num… <NA>
4 ppfad Demograph… sex sex sex 0 Sex "\r\… 0 = male, …
5 vp Big 5 A coarse vp12503 2005 Am somet… "\"[… Values < 1…
6 vp Big 5 A forgive vp12506 2005 Able to … "\"[… Values < 1…
7 vp Big 5 A friendly vp12513 2005 Friendly… "\"[… Values < 1…
8 zp Big 5 A coarse zp12003 2009 Am somet… "\"[… Values < 1…
9 zp Big 5 A forgive zp12006 2009 Able to … "\"[… Values < 1…
10 zp Big 5 A friendly zp12013 2009 Friendly… "\"[… Values < 1…
# ℹ 143 more rows
# ℹ 6 more variables: recode <chr>, reverse <chr>, mini <dbl>, maxi <dbl>,
# comp_rule <chr>, long_rule <chr>
Let’s read in the raw, wide-format data set using readr::read_csv()
:
vars <- codebook$old_name
soep <- read_csv(file = "soep.csv") %>%
select(one_of(vars)) # keep vars from codebook
soep
# A tibble: 28,290 × 153
persnr hhnr gebjahr sex vp12503 vp12506 vp12513 zp12003 zp12006 zp12013
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 901 94 1951 2 4 5 5 3 5 5
2 1202 124 1913 2 1 -1 7 NA NA NA
3 2301 230 1946 1 5 6 6 4 7 6
4 2302 230 1946 2 6 3 5 5 4 3
5 2304 230 1978 1 4 7 6 NA NA NA
6 2305 230 1946 2 NA NA NA NA NA NA
7 4601 469 1933 2 2 7 6 NA NA NA
8 4701 477 1919 2 1 5 6 1 5 5
9 4901 493 1925 2 2 4 6 4 2 6
10 5201 523 1955 1 2 7 6 6 7 5
# ℹ 28,280 more rows
# ℹ 143 more variables: bdp15103 <dbl>, bdp15106 <dbl>, bdp15113 <dbl>,
# vp12511 <dbl>, vp12507 <dbl>, vp12501 <dbl>, zp12011 <dbl>, zp12007 <dbl>,
# zp12001 <dbl>, bdp15111 <dbl>, bdp15107 <dbl>, bdp15101 <dbl>,
# vp12502 <dbl>, vp12512 <dbl>, vp12508 <dbl>, zp12002 <dbl>, zp12012 <dbl>,
# zp12008 <dbl>, bdp15102 <dbl>, bdp15112 <dbl>, bdp15108 <dbl>,
# vp12515 <dbl>, vp12510 <dbl>, vp12505 <dbl>, zp12015 <dbl>, …
## change data to long format
soep_long <- soep %>%
pivot_longer(
cols = c(-persnr, -hhnr) # all but SID & HHID
, names_to = "old_name" # match the codebook
, values_to = "value" # arbitrary name
, values_drop_na = T # drop missings
) %>%
rename(SID = persnr, HHID = hhnr) # rename
soep_long
# A tibble: 1,902,323 × 4
SID HHID old_name value
<dbl> <dbl> <chr> <dbl>
1 901 94 gebjahr 1951
2 901 94 sex 2
3 901 94 vp12503 4
4 901 94 vp12506 5
5 901 94 vp12513 5
6 901 94 zp12003 3
7 901 94 zp12006 5
8 901 94 zp12013 5
9 901 94 bdp15103 4
10 901 94 bdp15106 4
# ℹ 1,902,313 more rows
left_join()
here becuase we want to keep all observations in our raw soep_long
data frame.# merge in codebook
soep_long <- soep_long %>% # long data
left_join( # keep all rows in long data
codebook %>% # merge in the following variables from the codebook
select(old_name, category, name, item_name, year, recode,
reverse, mini, maxi, comp_rule, long_rule)
) %>%
select(-old_name) # get rid of old_name because we're done with it
soep_long
# A tibble: 1,902,323 × 13
SID HHID value category name item_name year recode reverse mini maxi
<dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 901 94 1951 Demograph… DOB DOB 0 ifels… no NA NA
2 901 94 2 Demograph… sex sex 0 ifels… no NA NA
3 901 94 4 Big 5 A coarse 2005 ifels… yes 1 7
4 901 94 5 Big 5 A forgive 2005 ifels… no 1 7
5 901 94 5 Big 5 A friendly 2005 ifels… no 1 7
6 901 94 3 Big 5 A coarse 2009 ifels… yes 1 7
7 901 94 5 Big 5 A forgive 2009 ifels… no 1 7
8 901 94 5 Big 5 A friendly 2009 ifels… no 1 7
9 901 94 4 Big 5 A coarse 2013 ifels… yes 1 7
10 901 94 4 Big 5 A forgive 2013 ifels… no 1 7
# ℹ 1,902,313 more rows
# ℹ 2 more variables: comp_rule <chr>, long_rule <chr>
recode
)reverse
, mini
, maxi
)comp_rule
, long_rule
)soep_recode <- soep_long %>%
group_by(recode) %>% # group by the r code rule
nest() %>% # create a nested data frame
ungroup() %>% # ungroup()
# apply the recode function
mutate(data = pmap(list(recode, data), recode_fun)) %>%
unnest(data) %>% # unnest the data to get back to a normal df
# change any negative, nan, or Inf values to NA
mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value)) %>%
select(-recode) # we're done with the recode column, so remove it
soep_recode
# A tibble: 1,902,323 × 12
SID HHID value category name item_name year reverse mini maxi
<dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 901 94 1951 Demographic DOB DOB 0 no NA NA
2 1202 124 1913 Demographic DOB DOB 0 no NA NA
3 2301 230 1946 Demographic DOB DOB 0 no NA NA
4 2302 230 1946 Demographic DOB DOB 0 no NA NA
5 2304 230 1978 Demographic DOB DOB 0 no NA NA
6 2305 230 1946 Demographic DOB DOB 0 no NA NA
7 4601 469 1933 Demographic DOB DOB 0 no NA NA
8 4701 477 1919 Demographic DOB DOB 0 no NA NA
9 4901 493 1925 Demographic DOB DOB 0 no NA NA
10 5201 523 1955 Demographic DOB DOB 0 no NA NA
# ℹ 1,902,313 more rows
# ℹ 2 more variables: comp_rule <chr>, long_rule <chr>
reverse
column to tell us which rows to reverse and what the mini
and maxi
scale values are.soep_recode <- soep_recode %>%
# if reverse = no leave along
# otherwise reverse code it according to the mini and maxi
mutate(value =
ifelse(is.na(reverse) | reverse == "no"
, value
, as.numeric(reverse.code(
-1
, value
, mini = mini
, maxi = maxi
))
)
) %>%
# get rid of reverse, mini, & maxi because we're done with them
select(-reverse, -mini, -maxi)
# A tibble: 1,902,323 × 9
SID HHID value category name item_name year comp_rule long_rule
<dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 901 94 1951 Demographic DOB DOB 0 mode select
2 1202 124 1913 Demographic DOB DOB 0 mode select
3 2301 230 1946 Demographic DOB DOB 0 mode select
4 2302 230 1946 Demographic DOB DOB 0 mode select
5 2304 230 1978 Demographic DOB DOB 0 mode select
6 2305 230 1946 Demographic DOB DOB 0 mode select
7 4601 469 1933 Demographic DOB DOB 0 mode select
8 4701 477 1919 Demographic DOB DOB 0 mode select
9 4901 493 1925 Demographic DOB DOB 0 mode select
10 5201 523 1955 Demographic DOB DOB 0 mode select
# ℹ 1,902,313 more rows
comp_rule
is average, so want to get the meanlong_rule
is select because we’ll choose what to do with each yearpurrr
and functions yetsoep_big5 <- soep_recode %>%
# keep Big Five & drop missings
filter(category == "Big 5" & !is.na(value)) %>%
# "split" the data by category, person, household, trait, item, and year
group_by(category, SID, HHID, name, item_name, year) %>%
# "apply" the mean function, collapsing within splits
summarize(value = mean(value)) %>%
# "split" the data by category, person, household, trait, and item
group_by(category, SID, HHID, name, year) %>%
# "apply" the mean function, collapsing within splits
summarize(value = mean(value)) %>%
# "combine" the data back together
ungroup()
# A tibble: 151,186 × 6
category SID HHID name year value
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 Big 5 901 94 A 2005 4.67
2 Big 5 901 94 A 2009 5
3 Big 5 901 94 A 2013 4.67
4 Big 5 901 94 C 2005 5
5 Big 5 901 94 C 2009 5
6 Big 5 901 94 C 2013 5.67
7 Big 5 901 94 E 2005 3.67
8 Big 5 901 94 E 2009 3.67
9 Big 5 901 94 E 2013 3.67
10 Big 5 901 94 N 2005 3.33
# ℹ 151,176 more rows
comp_rule
and the long_rule
are maxsoep_out <- soep_recode %>%
# keep Life events & drop missings
filter(category == "Life Event" & !is.na(value)) %>%
# "split" the data by category, person, household, event, and year
group_by(SID, HHID, category, name, year) %>%
# "apply" the max function, collapsing within splits
summarize(value = max(value)) %>%
# "split" the data by category, person, household, event
group_by(SID, HHID, category, name) %>%
# "apply" the max function, collapsing within splits
summarize(value = max(value)) %>%
# "combine" the data back together
ungroup()
# A tibble: 253,363 × 5
SID HHID category name value
<dbl> <dbl> <chr> <chr> <dbl>
1 901 94 Life Event ChldBrth 0
2 901 94 Life Event ChldMvOut 0
3 901 94 Life Event DadDied 0
4 901 94 Life Event Divorce 0
5 901 94 Life Event Married 0
6 901 94 Life Event MomDied 1
7 901 94 Life Event MoveIn 0
8 901 94 Life Event NewPart 0
9 901 94 Life Event PartDied 0
10 901 94 Life Event SepPart 0
# ℹ 253,353 more rows
comp_rule
and long_rule
columns tell us that the two variables actually have the same rule (mode), which make it easy to cleanMode <- function(x) {
ux <- unique(x)
ux <- ux[!is.na(ux)]
ux[which.max(tabulate(match(x, ux)))]
}
soep_cov <- soep_recode %>%
# keep demographics & drop missings
filter(category == "Demographic" & !is.na(value)) %>%
# "split" the data by category, person, household, covariate, and year
group_by(category, SID, HHID, name, year) %>%
# "apply" the Mode function, collapsing within splits
summarize(value = Mode(value)) %>%
# "split" the data by category, person, household, and covariate
group_by(SID, HHID, name) %>%
# "apply" the Mode function, collapsing within splits
summarize(value = Mode(value)) %>%
# "combine" the data back together
ungroup() %>%
# pivot data wider so there are separate columns for each covariate
pivot_wider(
names_from = "name"
, values_from = "value"
)
# A tibble: 28,290 × 4
SID HHID DOB sex
<dbl> <dbl> <dbl> <dbl>
1 901 94 1951 1
2 1202 124 1913 1
3 2301 230 1946 0
4 2302 230 1946 1
5 2304 230 1978 0
6 2305 230 1946 1
7 4601 469 1933 1
8 4701 477 1919 1
9 4901 493 1925 1
10 5201 523 1955 0
# ℹ 28,280 more rows
name
and value
columns to be specific to the variable categoriesSID | HHID | year | event | o_value | trait | p_value | sex | DOB |
---|
soep_clean <- soep_big5 %>%
# select key variables and rename for personality
select(SID, HHID, year, trait = name, p_value = value) %>%
# bring in matching rows (by SID and HHID)
inner_join(
soep_out %>%
# select key variables and rename for outcomes
select(SID, HHID, event = name, o_value = value)
) %>%
# bring the covariates
left_join(soep_cov)
# A tibble: 1,477,972 × 9
SID HHID year trait p_value event o_value DOB sex
<dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 901 94 2005 A 4.67 ChldBrth 0 1951 1
2 901 94 2005 A 4.67 ChldMvOut 0 1951 1
3 901 94 2005 A 4.67 DadDied 0 1951 1
4 901 94 2005 A 4.67 Divorce 0 1951 1
5 901 94 2005 A 4.67 Married 0 1951 1
6 901 94 2005 A 4.67 MomDied 1 1951 1
7 901 94 2005 A 4.67 MoveIn 0 1951 1
8 901 94 2005 A 4.67 NewPart 0 1951 1
9 901 94 2005 A 4.67 PartDied 0 1951 1
10 901 94 2005 A 4.67 SepPart 0 1951 1
# ℹ 1,477,962 more rows
soep_clean %>%
mutate(trait = factor(trait, levels = traits$name, labels = traits$long_name)
, event = factor(event, levels = outcomes$name, labels = outcomes$long_name))
# A tibble: 1,477,972 × 9
SID HHID year trait p_value event o_value DOB sex
<dbl> <dbl> <dbl> <fct> <dbl> <fct> <dbl> <dbl> <dbl>
1 901 94 2005 Agreeableness 4.67 Child Birth 0 1951 1
2 901 94 2005 Agreeableness 4.67 Child Moves Out 0 1951 1
3 901 94 2005 Agreeableness 4.67 Father Died 0 1951 1
4 901 94 2005 Agreeableness 4.67 Divorce 0 1951 1
5 901 94 2005 Agreeableness 4.67 Marriage 0 1951 1
6 901 94 2005 Agreeableness 4.67 Mother Died 1 1951 1
7 901 94 2005 Agreeableness 4.67 Move in with a P… 0 1951 1
8 901 94 2005 Agreeableness 4.67 New Partner 0 1951 1
9 901 94 2005 Agreeableness 4.67 Partner Died 0 1951 1
10 901 94 2005 Agreeableness 4.67 Separated from P… 0 1951 1
# ℹ 1,477,962 more rows
soep_clean %>%
mutate(trait = factor(trait, levels = traits$name, labels = traits$long_name)
, event = factor(event, levels = outcomes$name, labels = outcomes$long_name)
, o_value = factor(o_value, levels = c(0,1), labels = c("No Event", "Event"))) %>%
group_by(trait, event, o_value) %>%
summarize_at(vars(p_value), lst(mean, sd)) %>%
ungroup() %>%
g ggplot(aes(x = mean, y = event, shape = o_value)) +
geom_errorbar(
aes(xmin = mean - sd, xmax = mean + sd)
, width = .1
, position = position_dodge(width=.8)
) +
geom_point(position = position_dodge(width=.8)) +
facet_grid(~trait) +
theme_classic()
PSC 290 - Data Management and Cleaning