In this case, we are going to using some data from the German Socioeconomic Panel Study (GSOEP), which is an ongoing Panel Study in Germany. Note that these data are for teaching purposes only, shared under the license for the Comprehensive SOEP teaching dataset, which I, as a contracted SOEP user, can use for teaching purposes. These data represent select cases from the full data set and should not be used for the purpose of publication. The full data are available for free at https://www.diw.de/en/diw_02.c.222829.en/access_and_ordering.html.
For this tutorial, I created the codebook for you: Download, and included what I believe are the core columns you may need. Some of these columns may not be particularly helpful for every dataset.
Here are my core columns that are based on the original data:
dataset)category)item)name)old_name)item_text)scale)recode_desc)recode)reverse)mini)maxi)year or wave)meta)notes)Here are additional columns that will make our lives easier or are applicable to some but not all data sets:
Below, I’ll load in the codebook we will use for this study, which will include all of the above columns.
The resulting codebook looks something like this:
R to help me rename things. This is super helpful for making final tables and figures!!First, we need to load in the data. We’re going to use three waves of data from the German Socioeconomic Panel Study, which is a longitudinal study of German households that has been conducted since 1984. We’re going to use more recent data from three waves of personality data collected between 2005 and 2013.
Note: we will be using the teaching set of the GSOEP data set. I will not be pulling from the raw files as a result of this. I will also not be mirroring the format that you would usually load the GSOEP from because that is slightly more complicated and something we will return to in a later tutorial on purrr (link) after we have more skills. I’ve left that code in the .qmd for now, but it won’t make a lot of sense right now.
The code below first prints the sheets and then reads in the codebook page that has all the variables
[1] "codebook" "Overview" "Key" "Sample"
codebook <- readxl::read_excel(path = "codebook.xlsx", sheet = "codebook") %>%
mutate(old_name = str_to_lower(old_name))
codebook %>%
print(n = 6)# A tibble: 153 × 15
dataset category name item_name old_name year item_text scale recode_text
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 <NA> Procedural <NA> SID persnr 0 Never Ch… <NA> <NA>
2 <NA> Procedural <NA> household hhnr 0 househol… <NA> <NA>
3 ppfad Demographic DOB DOB gebjahr 0 Year of … "num… <NA>
4 ppfad Demographic sex sex sex 0 Sex "\r\… 0 = male, …
5 vp Big 5 A coarse vp12503 2005 Am somet… "\"[… Values < 1…
6 vp Big 5 A forgive vp12506 2005 Able to … "\"[… Values < 1…
# ℹ 147 more rows
# ℹ 6 more variables: recode <chr>, reverse <chr>, mini <dbl>, maxi <dbl>,
# comp_rule <chr>, long_rule <chr>
key <- readxl::read_excel(path = "codebook.xlsx", sheet = "Key") %>%
select(-Summary)
traits <- key %>% filter(category == "Big 5")
outcomes <- key %>% filter(category == "out")
covars <- key %>% filter(category == "dem")
traits %>% print(n = 3)# A tibble: 5 × 7
long_category long_name category name breaks mod mod_name
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Personality Extraversion Big 5 E <NA> <NA> <NA>
2 Personality Agreeableness Big 5 A <NA> <NA> <NA>
3 Personality Conscientiousness Big 5 C <NA> <NA> <NA>
# ℹ 2 more rows
Let’s read in the raw, wide-format data set using readr::read_csv():
vars <- codebook$old_name
soep <- read_csv(file = "soep.csv") %>%
select(one_of(vars)) # keep vars from codebook
soep %>% print(n = 8)# A tibble: 28,290 × 153
persnr hhnr gebjahr sex vp12503 vp12506 vp12513 zp12003 zp12006 zp12013
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 901 94 1951 2 4 5 5 3 5 5
2 1202 124 1913 2 1 -1 7 NA NA NA
3 2301 230 1946 1 5 6 6 4 7 6
4 2302 230 1946 2 6 3 5 5 4 3
5 2304 230 1978 1 4 7 6 NA NA NA
6 2305 230 1946 2 NA NA NA NA NA NA
7 4601 469 1933 2 2 7 6 NA NA NA
8 4701 477 1919 2 1 5 6 1 5 5
# ℹ 28,282 more rows
# ℹ 143 more variables: bdp15103 <dbl>, bdp15106 <dbl>, bdp15113 <dbl>,
# vp12511 <dbl>, vp12507 <dbl>, vp12501 <dbl>, zp12011 <dbl>, zp12007 <dbl>,
# zp12001 <dbl>, bdp15111 <dbl>, bdp15107 <dbl>, bdp15101 <dbl>,
# vp12502 <dbl>, vp12512 <dbl>, vp12508 <dbl>, zp12002 <dbl>, zp12012 <dbl>,
# zp12008 <dbl>, bdp15102 <dbl>, bdp15112 <dbl>, bdp15108 <dbl>,
# vp12515 <dbl>, vp12510 <dbl>, vp12505 <dbl>, zp12015 <dbl>, …
## change data to long format
soep_long <- soep %>%
pivot_longer(
cols = c(-persnr, -hhnr) # all but SID & HHID
, names_to = "old_name" # match the codebook
, values_to = "value" # arbitrary name
, values_drop_na = T # drop missings
) %>%
rename(SID = persnr, HHID = hhnr) # rename
soep_long# A tibble: 1,902,323 × 4
SID HHID old_name value
<dbl> <dbl> <chr> <dbl>
1 901 94 gebjahr 1951
2 901 94 sex 2
3 901 94 vp12503 4
4 901 94 vp12506 5
5 901 94 vp12513 5
6 901 94 zp12003 3
7 901 94 zp12006 5
8 901 94 zp12013 5
9 901 94 bdp15103 4
10 901 94 bdp15106 4
# ℹ 1,902,313 more rows
left_join() here becuase we want to keep all observations in our raw soep_long data frame.# merge in codebook
soep_long <- soep_long %>% # long data
left_join( # keep all rows in long data
codebook %>% # merge in the following variables from the codebook
select(
old_name, category, name, item_name, year, recode,
reverse, mini, maxi, comp_rule, long_rule
)
) %>%
select(-old_name) # get rid of old_name because we're done with it
soep_long %>% print(n =6)# A tibble: 1,902,323 × 13
SID HHID value category name item_name year recode reverse mini maxi
<dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 901 94 1951 Demographic DOB DOB 0 ifels… no NA NA
2 901 94 2 Demographic sex sex 0 ifels… no NA NA
3 901 94 4 Big 5 A coarse 2005 ifels… yes 1 7
4 901 94 5 Big 5 A forgive 2005 ifels… no 1 7
5 901 94 5 Big 5 A friendly 2005 ifels… no 1 7
6 901 94 3 Big 5 A coarse 2009 ifels… yes 1 7
# ℹ 1,902,317 more rows
# ℹ 2 more variables: comp_rule <chr>, long_rule <chr>
recode)reverse, mini, maxi)comp_rule, long_rule)soep_recode <- soep_long %>%
group_by(recode) %>% # group by the r code rule
nest() %>% # create a nested data frame
ungroup() %>% # ungroup()
# apply the recode function
mutate(data = pmap(list(recode, data), recode_fun)) %>%
unnest(data) %>% # unnest the data to get back to a normal df
# change any negative, nan, or Inf values to NA
mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value)) %>%
select(-recode) # we're done with the recode column, so remove it
soep_recode# A tibble: 1,902,323 × 12
SID HHID value category name item_name year reverse mini maxi
<dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 901 94 1951 Demographic DOB DOB 0 no NA NA
2 1202 124 1913 Demographic DOB DOB 0 no NA NA
3 2301 230 1946 Demographic DOB DOB 0 no NA NA
4 2302 230 1946 Demographic DOB DOB 0 no NA NA
5 2304 230 1978 Demographic DOB DOB 0 no NA NA
6 2305 230 1946 Demographic DOB DOB 0 no NA NA
7 4601 469 1933 Demographic DOB DOB 0 no NA NA
8 4701 477 1919 Demographic DOB DOB 0 no NA NA
9 4901 493 1925 Demographic DOB DOB 0 no NA NA
10 5201 523 1955 Demographic DOB DOB 0 no NA NA
# ℹ 1,902,313 more rows
# ℹ 2 more variables: comp_rule <chr>, long_rule <chr>
reverse, mini, and maxi columns to tell us what to reverse and how.soep_recode <- soep_recode %>%
# if reverse = no leave along
# otherwise reverse code it according to the mini and maxi
mutate(value =
ifelse(is.na(reverse) | reverse == "no"
, value
, as.numeric(reverse.code(
-1
, value
, mini = mini
, maxi = maxi
))
)
) %>%
# get rid of reverse, mini, & maxi because we're done with them
select(-reverse, -mini, -maxi) # A tibble: 1,902,323 × 9
SID HHID value category name item_name year comp_rule long_rule
<dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 901 94 1951 Demographic DOB DOB 0 mode select
2 1202 124 1913 Demographic DOB DOB 0 mode select
3 2301 230 1946 Demographic DOB DOB 0 mode select
4 2302 230 1946 Demographic DOB DOB 0 mode select
5 2304 230 1978 Demographic DOB DOB 0 mode select
6 2305 230 1946 Demographic DOB DOB 0 mode select
7 4601 469 1933 Demographic DOB DOB 0 mode select
8 4701 477 1919 Demographic DOB DOB 0 mode select
9 4901 493 1925 Demographic DOB DOB 0 mode select
10 5201 523 1955 Demographic DOB DOB 0 mode select
# ℹ 1,902,313 more rows
comp_rule is average, so want to get the meanlong_rule is select because we’ll choose what to do with each yearpurrr and functions yetsoep_big5 <- soep_recode %>%
# keep Big Five & drop missings
filter(category == "Big 5" & !is.na(value)) %>%
# "split" the data by category, person, household, trait, item, and year
group_by(category, SID, HHID, name, item_name, year) %>%
# "apply" the mean function, collapsing within splits
summarize(value = mean(value)) %>%
# "split" the data by category, person, household, trait, and item
group_by(category, SID, HHID, name, year) %>%
# "apply" the mean function, collapsing within splits
summarize(value = mean(value)) %>%
# "combine" the data back together
ungroup()# A tibble: 151,186 × 6
category SID HHID name year value
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 Big 5 901 94 A 2005 4.67
2 Big 5 901 94 A 2009 5
3 Big 5 901 94 A 2013 4.67
4 Big 5 901 94 C 2005 5
5 Big 5 901 94 C 2009 5
6 Big 5 901 94 C 2013 5.67
7 Big 5 901 94 E 2005 3.67
8 Big 5 901 94 E 2009 3.67
# ℹ 151,178 more rows
comp_rule and the long_rule are maxsoep_out <- soep_recode %>%
# keep Life events & drop missings
filter(category == "Life Event" & !is.na(value)) %>%
# "split" the data by category, person, household, event, and year
group_by(SID, HHID, category, name, year) %>%
# "apply" the max function, collapsing within splits
summarize(value = max(value)) %>%
# "split" the data by category, person, household, event
group_by(SID, HHID, category, name) %>%
# "apply" the max function, collapsing within splits
summarize(value = max(value)) %>%
# "combine" the data back together
ungroup()# A tibble: 253,363 × 5
SID HHID category name value
<dbl> <dbl> <chr> <chr> <dbl>
1 901 94 Life Event ChldBrth 0
2 901 94 Life Event ChldMvOut 0
3 901 94 Life Event DadDied 0
4 901 94 Life Event Divorce 0
5 901 94 Life Event Married 0
6 901 94 Life Event MomDied 1
7 901 94 Life Event MoveIn 0
8 901 94 Life Event NewPart 0
# ℹ 253,355 more rows
comp_rule and long_rule columns tell us that the two variables actually have the same rule (mode), which make it easy to cleanMode <- function(x) {
ux <- unique(x)
ux <- ux[!is.na(ux)]
ux[which.max(tabulate(match(x, ux)))]
}
soep_cov <- soep_recode %>%
# keep demographics & drop missings
filter(category == "Demographic" & !is.na(value)) %>%
# "split" the data by category, person, household, covariate, and year
group_by(category, SID, HHID, name, year) %>%
# "apply" the Mode function, collapsing within splits
summarize(value = Mode(value)) %>%
# "split" the data by category, person, household, and covariate
group_by(SID, HHID, name) %>%
# "apply" the Mode function, collapsing within splits
summarize(value = Mode(value)) %>%
# "combine" the data back together
ungroup() %>%
# pivot data wider so there are separate columns for each covariate
pivot_wider(
names_from = "name"
, values_from = "value"
)# A tibble: 28,290 × 4
SID HHID DOB sex
<dbl> <dbl> <dbl> <dbl>
1 901 94 1951 1
2 1202 124 1913 1
3 2301 230 1946 0
4 2302 230 1946 1
5 2304 230 1978 0
6 2305 230 1946 1
7 4601 469 1933 1
8 4701 477 1919 1
9 4901 493 1925 1
10 5201 523 1955 0
# ℹ 28,280 more rows
name and value columns to be specific to the variable categories| SID | HHID | year | event | o_value | trait | p_value | sex | DOB |
|---|
soep_clean <- soep_big5 %>%
# select key variables and rename for personality
select(SID, HHID, year, trait = name, p_value = value) %>%
# bring in matching rows (by SID and HHID)
inner_join(
soep_out %>%
# select key variables and rename for outcomes
select(SID, HHID, event = name, o_value = value)
) %>%
# bring the covariates
left_join(soep_cov)# A tibble: 1,477,972 × 9
SID HHID year trait p_value event o_value DOB sex
<dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 901 94 2005 A 4.67 ChldBrth 0 1951 1
2 901 94 2005 A 4.67 ChldMvOut 0 1951 1
3 901 94 2005 A 4.67 DadDied 0 1951 1
4 901 94 2005 A 4.67 Divorce 0 1951 1
5 901 94 2005 A 4.67 Married 0 1951 1
6 901 94 2005 A 4.67 MomDied 1 1951 1
7 901 94 2005 A 4.67 MoveIn 0 1951 1
8 901 94 2005 A 4.67 NewPart 0 1951 1
# ℹ 1,477,964 more rows
soep_clean %>%
mutate(trait = factor(trait, levels = traits$name, labels = traits$long_name)
, event = factor(event, levels = outcomes$name, labels = outcomes$long_name))# A tibble: 1,477,972 × 9
SID HHID year trait p_value event o_value DOB sex
<dbl> <dbl> <dbl> <fct> <dbl> <fct> <dbl> <dbl> <dbl>
1 901 94 2005 Agreeableness 4.67 Child Birth 0 1951 1
2 901 94 2005 Agreeableness 4.67 Child Moves Out 0 1951 1
3 901 94 2005 Agreeableness 4.67 Father Died 0 1951 1
4 901 94 2005 Agreeableness 4.67 Divorce 0 1951 1
5 901 94 2005 Agreeableness 4.67 Marriage 0 1951 1
6 901 94 2005 Agreeableness 4.67 Mother Died 1 1951 1
7 901 94 2005 Agreeableness 4.67 Move in with a P… 0 1951 1
8 901 94 2005 Agreeableness 4.67 New Partner 0 1951 1
9 901 94 2005 Agreeableness 4.67 Partner Died 0 1951 1
10 901 94 2005 Agreeableness 4.67 Separated from P… 0 1951 1
# ℹ 1,477,962 more rows
soep_clean %>%
mutate(
trait = factor(trait, levels = traits$name, labels = traits$long_name)
, event = factor(event, levels = outcomes$name, labels = outcomes$long_name)
, o_value = factor(o_value, levels = c(0,1), labels = c("No Event", "Event"))
) %>%
group_by(trait, event, o_value) %>%
summarize_at(vars(p_value), lst(mean, sd)) %>%
ungroup() %>%
ggplot(aes(x = mean, y = event, shape = o_value)) +
geom_errorbar(
aes(xmin = mean - sd, xmax = mean + sd)
, width = .1
, position = position_dodge(width=.8)
) +
geom_point(position = position_dodge(width=.8)) +
facet_grid(~trait) +
theme_classic() +
theme(legend.position = "bottom")readr and havenreadr packagehaven package for reading other kinds of data (e.g., .sav and .dat)readrCompared to the corresponding base functions, readr functions:
.csv?readr primarily works for .csv and .txt filesreadr::read_csv()Core arugments:
file: path to filecol_names: T/F (does first line contain column names); alternatively provide a vector with new column names to be appendedcol_types: what kind of data does each column contain (readr tries to guess by default)
list() or cols() (see later slide)col_select: Which columns to read in (helpful for huge datasets) using select style syntax wrapped in a c()
readr::read_csv(): col_typesreadr does pretty good at guessing column types, with the occasional character/number/logical/date issue. For these instances, you can force the column type using the col_types argument. Here’s some examples of the kinds of columns and the syntax:
col_logical() [l], containing only T, F, TRUE or FALSE
col_integer() [i], integers
col_double() [d], doubles
col_character() [c], everything else
col_factor(levels, ordered) [f], a fixed set of values
col_date(format = "") [D]: with the locale’s date_format
col_time(format = "") [t]: with the locale’s time_format
col_datetime(format = "") [T]: ISO8601 date times
readr::read_csv(): col_typesIn our cleaned data we created, for example, say that I wanted the following specific column types:
character
factor
read_csv(
file = "clean_data_2026-01-31.csv"
, col_types = cols(
SID = col_character()
, sex = col_factor()
)) %>% print(n = 5)# A tibble: 1,477,972 × 9
SID HHID year trait p_value event o_value DOB sex
<chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <fct>
1 901 94 2005 A 4.67 ChldBrth 0 1951 1
2 901 94 2005 A 4.67 ChldMvOut 0 1951 1
3 901 94 2005 A 4.67 DadDied 0 1951 1
4 901 94 2005 A 4.67 Divorce 0 1951 1
5 901 94 2005 A 4.67 Married 0 1951 1
# ℹ 1,477,967 more rows
havenhaven is another tidyverse package that supports data from
read_sas() and read_xpt())read_sav() and older read_por() files)read_dta())tibbles
labelled data, which play nice for factoring variablesread_sav() since I suspect you’re most likely to encounter that anywayhaven::read_sav()Core arguments:
file: file path and namecol_select: Which columns to read in (helpful for huge datasets) using select style syntax wrapped in a c()
encoding: possibly useful for those of you whose computers and/or work are in multiple languageshaven: Labelled datalabelled() class provides a natural representaion in R.haven has some functions to help deal with this (labelled data doesn’t always play nice)
labelled_spss(): Labelled vectors for SPSSlabelled() is.labelled(): Create a labelled vector.print_labels(): Print the labels of a labelled vectoras_factor(<data.frame>) as_factor(<haven_labelled>) as_factor(<labelled>): Convert labelled vectors to factorszap_label(): Zap variable labelszap_missing(): Zap special missings to regular R missingsFirst, you can access the GUI in your Environment panel:
Second, after selecting the appropriate type and package, you should see this window:
Third, select your data set in your browser:
Fourth, you should see a preview of your data. Depending on the data type, you also may be able to set additional options:
PSC 203A - Data Management and Cleaning