Week 4 - Codebooks and Data Documentation

Emorie D Beck

Outline

  1. Documenting your design
  2. Building a codebook
  3. Cleaning your data using codebooks
  4. Problem set and Question time
  • Note: the schedule says we were going to talk about data import today. I want to focus on codebooks, so please read R for Data Science Chapters 8 and 21 to make sure you understand those pieces

Documenting your design

Documenting your design

  • Documentation is a critical part of open science, but not one we’re really taught
  • Documentation is going to look different for different types of research, but it’s still worth thinking about the common features of documentation that projects share
  • Common Documentation:
    • Preregistration
    • Experiment Script (for standardizing across experimenters)
    • Survey / experimental files / stimuli / questions
    • Codebooks of all variables collected
    • Codebooks of variables used in a given study

Document your design

  • Today I want to touch on three things:
    • Preregistration (brief, mostly focusing on pointing you to resources)
    • Protocol and design flow
    • Codebooks of variables used in a given study (and how to use it in R)

Preregistration

  • Preregistration:
    • Specifying your study design, research questions, hypotheses, data cleaning, analytic plan, inference criteria, and robustness checks in advance
  • Why should you preregister?
    • Badges are fun
    • Preregistrations are not rigid but a chance to think through the questions you want to ask and answer and the challenges that might arise in doing so
    • Builds trust in the scientific process

Preregistration

  • Preregistration is hard
    • Specifying your plan in advance takes considerable effort and time, which can feel like very slow science
  • Preregistration is worthwhile
    • Preregistering plans, code, etc. can speed up the analytic portion of your research workflow, which builds great momentum for writing and submitting projects

What should I preregister?

  • This depends on the project; examples include study designs, individual research projects, etc.
    • Study design: A large survey is collected or a multi-part experiment is conducted. Measures, design, some research questions and hypotheses are specified a priori
    • Individual paper / project: A single-part survey or experiment is conducted or a specific piece of a multi-part study is investigated. If part of a multi-part study/experiment, should be linked to the parent preregistration

What should I preregister?

Learning More:

Protocol and Design Flow

  • Procedure sections in scientific papers are meant to map out, as concisely and simply as possible, how data were obtained (adhering to human subjects ethical codes, etc.)
  • But such sections are not sufficient to replicate or reproduce research because study designs are much more intricate and include many more details than what fits in a method section
    • e.g. measures not used because they weren’t focal, the code that underlies how data are collected, preprocessing, etc.

Protocol and Design Flow

  • As researchers, it’s our job to make sure that the work we do is documented so well that someone could replicate our studies.
  • Think of it sort of like doing your taxes. You want to keep enough information that if you were audited, you would be able to quickly and easily provide all the relevant information.

Protocol and Design Flow

  • What you need to document will depend on the kind of work you do.
  • As an example, in my ecological momentary assessment work, I do the following:
    • Preregister the design
    • Write a methods section that includes text for every measure included in any part of the study as well as an extended and detailed procedure description. This also includes information on how data will be cleaned and composited
    • Detailed codebook including all measures that were collected, regardless of whether I have research questions or hypotheses for them. This is shareable for anyone who wants to use the data
    • Make technical workflow. This documents how all documents, scripts, etc. work together to produce the final result, including what is automated, what requires researcher action, etc.
    • Comment all code and documents extensively
    • Deviations document, where I document every deviation from my initial plans after the design is complete and data begin to be collected (or analyses start)

Protocol and Design Flow

  • Extensive documentation is also an investment in future you! My measures and procedure sections basically write themselves, and my analytic plan is already written in the preregistration
  • This both means that I’m faster and more efficient at writing these and that I feel more confident about the design choices I made, which is a win-win

Codebooks

Codebooks

  • For me, codebooks are the most essential and important part of any research project
  • Codebooks allow me to:
    • parse through documentation and find all the variables I want
    • document detailed information about each of those variables
    • make cleaning and compositing choices for each (e.g., renaming, recoding, removing missings, etc.)
    • differentiate among the kind of variables I have (e.g., predictors, outcomes, covariates, manipulations, and other categories)
    • pass all this information into R to aid in data cleaning

Codebooks

In this case, we are going to use some data from the German Socioeconomic Panel Study (GSOEP), which is an ongoing panel study in Germany. Note that these data are for teaching purposes only, shared under the license for the Comprehensive SOEP teaching dataset, which I, as a contracted SOEP user, can use for teaching purposes. These data represent select cases from the full data set and should not be used for the purpose of publication. The full data are available for free at https://www.diw.de/en/diw_02.c.222829.en/access_and_ordering.html.

For this tutorial, I created the codebook for you (Download) and included what I believe are the core columns you may need. Some of these columns may not be particularly helpful for every dataset.

Codebooks

Here are my core columns that are based on the original data:

  • dataset name (dataset)
  • how I categorize the variables (category)
  • how I rename each item (item_name)
  • how I composite the variables (name)
  • original variable name (old_name)
  • original item text (item_text)
  • original item values (scale)
  • how I will recode each item (in text; recode_text)
  • how I will recode each item (in R; recode)
  • whether item is reverse coded (reverse)
  • scale minimum (mini)
  • scale maximum (maxi)
  • timeline of variable collection (year or wave)
  • meta name / never changing name (meta)

Codebooks

  1. dataset: this column indexes the name of the dataset that you will be pulling the data from. This is important because we will use this info later on (see purrr tutorial) to load and clean specific data files. Even if you don’t have multiple data sets, I believe consistency is more important and suggest including this column anyway.
  2. category: broad categories that different variables can be put into. I’m a fan of naming them things like “outcome”, “predictor”, “moderator”, “demographic”, “procedural”, etc. but sometimes use more descriptive labels like “Big 5” to indicate the model from which the measures are derived.
  3. name: this is basically one level lower than category. So if the category is Big 5, the name would be, for example, “A” for Agreeableness, “SWB” for subjective well-being, etc. This column is most important and useful when you have multiple items in a scale, so I’ll typically leave it blank when something is a standalone variable (e.g. sex, single-item scales, etc.).
  4. item_name: This is the lowest level and most descriptive variable. It indicates which item in a scale something is. So it may be “kind” for Agreeableness or “sex” for the demographic biological sex variable.
  5. old_name: this column is the name of the variable in the data you are pulling it from. This should be exact. The goal of this column is that it will allow us to select() variables from the original data file and rename them something that is more useful to us.
  6. item_text: this column is the original text that participants saw or a description of the item.
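As a taste of how the dataset column pays off later, purrr lets you read each referenced file exactly once. Everything below (the toy codebook and the file names) is a hypothetical sketch, not the course’s actual workflow:

```r
library(tidyverse)

# toy codebook with a dataset column (names are hypothetical)
codebook <- tibble(dataset = c("vp", "vp", "zp"))

# write toy csv files so this sketch is self-contained
dir <- tempdir()
walk(unique(codebook$dataset),
     \(d) write_csv(tibble(x = 1:2), file.path(dir, paste0(d, ".csv"))))

# read each unique dataset once, keeping a named list for later use
data_list <- unique(codebook$dataset) %>%
  set_names() %>%
  map(\(d) read_csv(file.path(dir, paste0(d, ".csv")), show_col_types = FALSE))

names(data_list)  # "vp" "zp"
```

Because the list is named by dataset, later cleaning steps can pull the right data frame with `data_list[["vp"]]` instead of re-reading files.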

Codebooks

  7. scale: this column tells you what the scale of the variable is: is it a numeric variable, a text variable, etc.? This is helpful for knowing the plausible range.
  8. recode_text: sometimes we want to recode variables for analyses (e.g. for categorical variables with many levels where sample sizes for some levels are too small to do anything with). I use this column to note the kind of recoding I’ll apply to a variable, for transparency.
  9. recode: in this column, I write the R code that I’ll parse after reading my codebook into R.
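To make that concrete, a cell of the recode column is just R code stored as text, which base R can run with eval(parse(...)). The rule string below is a made-up example, not one from the GSOEP codebook:

```r
# a recode rule stored as text in a codebook cell (hypothetical rule);
# here it flags sentinel codes below 1 as missing
rule <- "ifelse(x < 1, NA, x)"

x <- c(-1, 0, 3, 5)                  # raw values, with -1/0 as missing codes
recoded <- eval(parse(text = rule))  # run the stored code against x
recoded
```

This is why the recode column must contain syntactically valid R referencing an agreed-upon variable name (here `x`): the cleaning script evaluates it verbatim.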

Codebooks

Here are additional columns that will make our lives easier or are applicable to some but not all data sets:

  1. reverse: this column tells you whether items in a scale need to be reverse coded. I recommend coding this as 1 (leave alone) and -1 (reverse) for reasons that will become clear later.
  2. mini: this column represents the minimum value of scales that are numeric. Leave blank otherwise.
  3. maxi: this column represents the maximum value of scales that are numeric. Leave blank otherwise.
  4. year: for longitudinal data, we have several waves of data and the name of the same item across waves is often different, so it’s important to note to which wave an item belongs. You can do this by noting the wave (e.g. 1, 2, 3), but I prefer the actual year the data were collected (e.g. 2005, 2009, etc.)
  5. meta: Some datasets have a meta name, which essentially means a name that a variable keeps across all waves, making it clear which variables are the same. They are not always useful, as some data sets have meta names but no good way of extracting variables using them, but they’re still typically worth including in your codebook.
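To see why a 1/-1 reverse code is convenient (a sketch with made-up values; later slides use a helper from the psych package instead): with mini and maxi in hand, a single expression handles both cases, because untouched items keep their value and reversed items become mini + maxi - value.

```r
# hypothetical item responses on a 1-7 scale
value   <- c(2, 5, 7)
mini    <- 1
maxi    <- 7
reverse <- -1  # -1 = reverse code, 1 = leave alone

# one expression for both cases: when reverse == 1 this is just value;
# when reverse == -1 it is (mini + maxi) - value
recoded <- (mini + maxi) * (reverse == -1) + reverse * value
recoded  # 6 3 1
```

With a "no"/"yes" coding you would need an explicit branch instead, which is why the numeric multiplier is recommended here.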

Codebooks

Below, I’ll load in the codebook we will use for this study, which will include all of the above columns.

# set the path
wd <- "https://github.com/emoriebeck/psc290-data-FQ23/raw/main/04-workshops/04-week4-readr"

download.file(
  url      = sprintf("%s/codebook.xlsx", wd), 
  destfile = "codebook.xlsx"
  )

Codebooks

The resulting codebook looks something like this:

Codebooks

In addition to the codebook itself, I also document other overarching info in three other tabs:

  • Overview just lists what variables I’m considering as serving different functions (e.g., demographics, covariates, moderators, predictors, outcomes, independent variables, dependent variables, etc.)
  • Key helps me create tables that I’ll be able to use in R to help me rename things. This is super helpful for making final tables and figures!!
  • Sample helps other people understand how you’re using the columns. This is generally good practice and also helpful if you have research assistants or collaborators helping you out!

Codebooks: Overview

Codebooks: Key
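As a minimal sketch of what a Key tab might contain (the rows here are hypothetical), short codebook names map onto publication-ready labels that factor() can apply in one line:

```r
library(tibble)

# hypothetical Key tab: short codebook names and their long labels
key <- tibble(
  category  = c("Big 5", "Big 5", "Big 5"),
  name      = c("A", "C", "E"),
  long_name = c("Agreeableness", "Conscientiousness", "Extraversion")
)

# turn terse names in the data into readable labels for tables and figures
trait   <- c("C", "A", "E")
labeled <- factor(trait, levels = key$name, labels = key$long_name)
labeled
```

This is the same pattern the final slides use to relabel traits and events before plotting.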

Codebooks: Sample

Data

First, we need to load in the data. We’re going to use three waves of data from the German Socioeconomic Panel Study, which is a longitudinal study of German households that has been conducted since 1984. We’re going to use more recent data from three waves of personality data collected between 2005 and 2013.

Note: we will be using the teaching set of the GSOEP data set. As a result, I will not be pulling from the raw files. I will also not be mirroring the format in which you would usually load the GSOEP, because that is slightly more complicated and something we will return to in a later tutorial on purrr (link) after we have more skills. I’ve left that code in the .qmd for now, but it won’t make a lot of sense right now.

Example

Example

  • Download the following .zip file with an R project
  • We’re going to walk through this script, and then you will spend the rest of class working on your problem set, which basically does the same thing with your own data.

Example

  • The first thing I do in a script is set up my “Workspace,” which typically includes:
    • Packages
    • Any alternative paths (e.g., if data are stored online)
    • Codebooks
    • Data

Packages

library(psych)
library(plyr)
library(tidyverse)

Codebook

The code below first prints the sheets and then reads in the codebook page that has all the variables

readxl::excel_sheets(path = "codebook.xlsx")
[1] "codebook" "Overview" "Key"      "Sample"  
codebook <- readxl::read_excel(path = "codebook.xlsx", sheet = "codebook") %>%
  mutate(old_name = str_to_lower(old_name))
codebook
# A tibble: 153 × 15
   dataset category   name  item_name old_name  year item_text scale recode_text
   <chr>   <chr>      <chr> <chr>     <chr>    <dbl> <chr>     <chr> <chr>      
 1 <NA>    Procedural <NA>  SID       persnr       0 Never Ch…  <NA> <NA>       
 2 <NA>    Procedural <NA>  household hhnr         0 househol…  <NA> <NA>       
 3 ppfad   Demograph… DOB   DOB       gebjahr      0 Year of … "num… <NA>       
 4 ppfad   Demograph… sex   sex       sex          0 Sex       "\r\… 0 = male, …
 5 vp      Big 5      A     coarse    vp12503   2005 Am somet… "\"[… Values < 1…
 6 vp      Big 5      A     forgive   vp12506   2005 Able to … "\"[… Values < 1…
 7 vp      Big 5      A     friendly  vp12513   2005 Friendly… "\"[… Values < 1…
 8 zp      Big 5      A     coarse    zp12003   2009 Am somet… "\"[… Values < 1…
 9 zp      Big 5      A     forgive   zp12006   2009 Able to … "\"[… Values < 1…
10 zp      Big 5      A     friendly  zp12013   2009 Friendly… "\"[… Values < 1…
# ℹ 143 more rows
# ℹ 6 more variables: recode <chr>, reverse <chr>, mini <dbl>, maxi <dbl>,
#   comp_rule <chr>, long_rule <chr>

Codebook

  • But why did I create that key page?
  • Because I can use it to more easily rename things into readable names that look great in publications
  • Let’s create a data frame for the personality, outcome, and demographic variables we’ll reference later
key <- readxl::read_excel(path = "codebook.xlsx", sheet = "Key")
traits   <- key %>% filter(category == "Big 5")
outcomes <- key %>% filter(category == "out")
covars   <- key %>% filter(category == "dem")

Data

Let’s read in the raw, wide-format data set using readr::read_csv():

vars <- codebook$old_name
soep <- read_csv(file = "soep.csv") %>%
  select(one_of(vars)) # keep vars from codebook
soep
# A tibble: 28,290 × 153
   persnr  hhnr gebjahr   sex vp12503 vp12506 vp12513 zp12003 zp12006 zp12013
    <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1    901    94    1951     2       4       5       5       3       5       5
 2   1202   124    1913     2       1      -1       7      NA      NA      NA
 3   2301   230    1946     1       5       6       6       4       7       6
 4   2302   230    1946     2       6       3       5       5       4       3
 5   2304   230    1978     1       4       7       6      NA      NA      NA
 6   2305   230    1946     2      NA      NA      NA      NA      NA      NA
 7   4601   469    1933     2       2       7       6      NA      NA      NA
 8   4701   477    1919     2       1       5       6       1       5       5
 9   4901   493    1925     2       2       4       6       4       2       6
10   5201   523    1955     1       2       7       6       6       7       5
# ℹ 28,280 more rows
# ℹ 143 more variables: bdp15103 <dbl>, bdp15106 <dbl>, bdp15113 <dbl>,
#   vp12511 <dbl>, vp12507 <dbl>, vp12501 <dbl>, zp12011 <dbl>, zp12007 <dbl>,
#   zp12001 <dbl>, bdp15111 <dbl>, bdp15107 <dbl>, bdp15101 <dbl>,
#   vp12502 <dbl>, vp12512 <dbl>, vp12508 <dbl>, zp12002 <dbl>, zp12012 <dbl>,
#   zp12008 <dbl>, bdp15102 <dbl>, bdp15112 <dbl>, bdp15108 <dbl>,
#   vp12515 <dbl>, vp12510 <dbl>, vp12505 <dbl>, zp12015 <dbl>, …

Example

  • Once I have my workspace set up, it’s time to clean the data
  • For the sake of time, I’m going to skip the descriptives we talked about last week, but we should be doing those in a real setting!
  • I often have the following sections in my data cleaning section:
    • Rename Variables
      • Change to Long
      • Bring in Codebook
    • Recode Variables
    • Reverse Scoring
    • Predictors (In this case, personality)
    • Outcomes (In this case, life events)
    • Covariates / Demographics
    • [any other variable categories you have]
  • I like to clean different categories of variables separately because I often clean them relatively similarly within categories. Differences within categories are generally captured via columns in my codebook

Rename Variables

Change Data to Long

  • To get our codebook to play nice with the data, since our data are in wide form (including across years), we need to make the data long, with at least one column that corresponds to the codebook
  • In this case, we’ll make all the variables but the participant and household IDs long, and send the variable names to a column called old_name since it contains the original variable names.
## change data to long format
soep_long <- soep %>%
  pivot_longer(
    cols = c(-persnr, -hhnr) # all but SID & HHID
    , names_to = "old_name" # match the codebook
    , values_to = "value" # arbitrary name
    , values_drop_na = T # drop missings
  ) %>%
  rename(SID = persnr, HHID = hhnr) # rename
soep_long
# A tibble: 1,902,323 × 4
     SID  HHID old_name value
   <dbl> <dbl> <chr>    <dbl>
 1   901    94 gebjahr   1951
 2   901    94 sex          2
 3   901    94 vp12503      4
 4   901    94 vp12506      5
 5   901    94 vp12513      5
 6   901    94 zp12003      3
 7   901    94 zp12006      5
 8   901    94 zp12013      5
 9   901    94 bdp15103     4
10   901    94 bdp15106     4
# ℹ 1,902,313 more rows

Rename Variables

Merge in Codebook

  • Now, let’s actually merge in the codebook. We’ll use left_join() here because we want to keep all observations in our raw soep_long data frame.
# merge in codebook
soep_long <- soep_long %>% # long data
  left_join( # keep all rows in long data
    codebook %>% # merge in the following variables from the codebook
      select(old_name, category, name, item_name, year, recode, 
             reverse, mini, maxi, comp_rule, long_rule)
  ) %>%
  select(-old_name) # get rid of old_name because we're done with it
soep_long
# A tibble: 1,902,323 × 13
     SID  HHID value category   name  item_name  year recode reverse  mini  maxi
   <dbl> <dbl> <dbl> <chr>      <chr> <chr>     <dbl> <chr>  <chr>   <dbl> <dbl>
 1   901    94  1951 Demograph… DOB   DOB           0 ifels… no         NA    NA
 2   901    94     2 Demograph… sex   sex           0 ifels… no         NA    NA
 3   901    94     4 Big 5      A     coarse     2005 ifels… yes         1     7
 4   901    94     5 Big 5      A     forgive    2005 ifels… no          1     7
 5   901    94     5 Big 5      A     friendly   2005 ifels… no          1     7
 6   901    94     3 Big 5      A     coarse     2009 ifels… yes         1     7
 7   901    94     5 Big 5      A     forgive    2009 ifels… no          1     7
 8   901    94     5 Big 5      A     friendly   2009 ifels… no          1     7
 9   901    94     4 Big 5      A     coarse     2013 ifels… yes         1     7
10   901    94     4 Big 5      A     forgive    2013 ifels… no          1     7
# ℹ 1,902,313 more rows
# ℹ 2 more variables: comp_rule <chr>, long_rule <chr>

Example

  • Now that we’ve merged our codebook and raw data, we’re ready to use the information in the codebook to:
    • recode (recode)
    • reverse score (reverse, mini, maxi)
    • composite (comp_rule, long_rule)

Recode Variables

  • I find the easiest way to recode variables, especially in projects where I may need to recode hundreds or thousands of variables differently, is to write a little function that takes code chunks from my codebook and runs them.
  • The function looks like this:
recode_fun <- function(rule, y){
  x <- y$value # rule strings in the codebook reference the raw values as x
  if(!is.na(rule)){y$value <- eval(parse(text = rule))} # run the stored R code
  return(y)
}

Recode Variables

  • Now we’re going to apply each rule to each chunk of the data that uses the same one
  • To do this, we’re going to use some functions from the purrr package that we won’t talk about this week.
  • Don’t worry too much about this. This is code you can copy and paste or add to an R script you source across projects!
soep_recode <- soep_long %>%
  group_by(recode) %>% # group by the r code rule
  nest() %>% # create a nested data frame
  ungroup() %>% # ungroup() 
  # apply the recode function
  mutate(data = pmap(list(recode, data), recode_fun)) %>%
  unnest(data) %>% # unnest the data to get back to a normal df
  # change any negative, nan, or Inf values to NA
  mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value)) %>%
  select(-recode) # we're done with the recode column, so remove it
soep_recode
# A tibble: 1,902,323 × 12
     SID  HHID value category    name  item_name  year reverse  mini  maxi
   <dbl> <dbl> <dbl> <chr>       <chr> <chr>     <dbl> <chr>   <dbl> <dbl>
 1   901    94  1951 Demographic DOB   DOB           0 no         NA    NA
 2  1202   124  1913 Demographic DOB   DOB           0 no         NA    NA
 3  2301   230  1946 Demographic DOB   DOB           0 no         NA    NA
 4  2302   230  1946 Demographic DOB   DOB           0 no         NA    NA
 5  2304   230  1978 Demographic DOB   DOB           0 no         NA    NA
 6  2305   230  1946 Demographic DOB   DOB           0 no         NA    NA
 7  4601   469  1933 Demographic DOB   DOB           0 no         NA    NA
 8  4701   477  1919 Demographic DOB   DOB           0 no         NA    NA
 9  4901   493  1925 Demographic DOB   DOB           0 no         NA    NA
10  5201   523  1955 Demographic DOB   DOB           0 no         NA    NA
# ℹ 1,902,313 more rows
# ℹ 2 more variables: comp_rule <chr>, long_rule <chr>

Reverse Coding

  • Now we can reverse code our data. We’ll use similar code to what I showed last week, but we’re working on long-format data, so we’ll use the reverse column to tell us which rows to reverse and what the mini and maxi scale values are.
soep_recode <- soep_recode %>%
  # if reverse is missing or "no", leave the value alone
  # otherwise reverse code it according to the mini and maxi
  mutate(value = 
           ifelse(is.na(reverse) | reverse == "no"
            , value
            , as.numeric(reverse.code(
              -1
              , value
              , mini = mini
              , maxi = maxi
              ))
      )
  ) %>%
  # get rid of reverse, mini, & maxi because we're done with them
  select(-reverse, -mini, -maxi) 
# A tibble: 1,902,323 × 9
     SID  HHID value category    name  item_name  year comp_rule long_rule
   <dbl> <dbl> <dbl> <chr>       <chr> <chr>     <dbl> <chr>     <chr>    
 1   901    94  1951 Demographic DOB   DOB           0 mode      select   
 2  1202   124  1913 Demographic DOB   DOB           0 mode      select   
 3  2301   230  1946 Demographic DOB   DOB           0 mode      select   
 4  2302   230  1946 Demographic DOB   DOB           0 mode      select   
 5  2304   230  1978 Demographic DOB   DOB           0 mode      select   
 6  2305   230  1946 Demographic DOB   DOB           0 mode      select   
 7  4601   469  1933 Demographic DOB   DOB           0 mode      select   
 8  4701   477  1919 Demographic DOB   DOB           0 mode      select   
 9  4901   493  1925 Demographic DOB   DOB           0 mode      select   
10  5201   523  1955 Demographic DOB   DOB           0 mode      select   
# ℹ 1,902,313 more rows

Composite Items

  • Now that our data are recoded and reverse scored, we can clean each category of data:
    • Personality
    • Outcomes / Life Events
    • Demographics / Covariates
  • Then we’ll merge them together

Personality

  • For the Big Five, we want to get composites within years (2005, 2009, 2013)
  • the comp_rule is average, so we want to get the mean
  • the long_rule is select because we’ll choose what to do with each year
  • In the workbook, you’ll see an alternate way to do this that uses a function
    • We won’t go over this today because we haven’t learned purrr and functions yet
    • But keep this code and see if you can figure it out because it’s much more flexible and super useful in the case that you need to apply different rules to different variables (especially useful for covariates and moderators!!)

Personality

soep_big5 <- soep_recode %>%
  # keep Big Five & drop missings
  filter(category == "Big 5" & !is.na(value)) %>%
  # "split" the data by category, person, household, trait, item, and year
  group_by(category, SID, HHID, name, item_name, year) %>%
  # "apply" the mean function, collapsing within splits
  summarize(value = mean(value)) %>%
  # "split" the data by category, person, household, trait, and item
  group_by(category, SID, HHID, name, year) %>%
  # "apply" the mean function, collapsing within splits
  summarize(value = mean(value)) %>%
  # "combine" the data back together
  ungroup()
# A tibble: 151,186 × 6
   category   SID  HHID name   year value
   <chr>    <dbl> <dbl> <chr> <dbl> <dbl>
 1 Big 5      901    94 A      2005  4.67
 2 Big 5      901    94 A      2009  5   
 3 Big 5      901    94 A      2013  4.67
 4 Big 5      901    94 C      2005  5   
 5 Big 5      901    94 C      2009  5   
 6 Big 5      901    94 C      2013  5.67
 7 Big 5      901    94 E      2005  3.67
 8 Big 5      901    94 E      2009  3.67
 9 Big 5      901    94 E      2013  3.67
10 Big 5      901    94 N      2005  3.33
# ℹ 151,176 more rows

Outcomes

  • Now onto life events
  • For these data we want to get a single composite for each life event across all years (i.e. did they ever experience each event)
  • Both the comp_rule and the long_rule are max
  • As before, the more flexible way is in the workbook

Outcomes

soep_out <- soep_recode %>%
  # keep Life events & drop missings
  filter(category == "Life Event" & !is.na(value)) %>%
  # "split" the data by category, person, household, event, and year
  group_by(SID, HHID, category, name, year) %>% 
  # "apply" the max function, collapsing within splits
  summarize(value = max(value)) %>%
  # "split" the data by category, person, household, event
  group_by(SID, HHID, category, name) %>% 
  # "apply" the max function, collapsing within splits
  summarize(value = max(value)) %>%
  # "combine" the data back together
  ungroup()
# A tibble: 253,363 × 5
     SID  HHID category   name      value
   <dbl> <dbl> <chr>      <chr>     <dbl>
 1   901    94 Life Event ChldBrth      0
 2   901    94 Life Event ChldMvOut     0
 3   901    94 Life Event DadDied       0
 4   901    94 Life Event Divorce       0
 5   901    94 Life Event Married       0
 6   901    94 Life Event MomDied       1
 7   901    94 Life Event MoveIn        0
 8   901    94 Life Event NewPart       0
 9   901    94 Life Event PartDied      0
10   901    94 Life Event SepPart       0
# ℹ 253,353 more rows

Covariates

  • Now let’s do the covariates
  • The comp_rule and long_rule columns tell us that the two variables actually have the same rule (mode), which makes them easy to clean
  • As before, the more flexible way is in the workbook, which is particularly useful for demographics and covariates that may be on super different scales

Covariates

Mode <- function(x) {
  ux <- unique(x)
  ux <- ux[!is.na(ux)]
  ux[which.max(tabulate(match(x, ux)))]
}

soep_cov <- soep_recode %>%
  # keep demographics & drop missings
  filter(category == "Demographic" & !is.na(value)) %>%
  # "split" the data by category, person, household, covariate, and year
  group_by(category, SID, HHID, name, year) %>%
  # "apply" the Mode function, collapsing within splits
  summarize(value = Mode(value)) %>%
  # "split" the data by category, person, household, and covariate
  group_by(SID, HHID, name) %>%
  # "apply" the Mode function, collapsing within splits
  summarize(value = Mode(value)) %>%
  # "combine" the data back together
  ungroup() %>%
  # pivot data wider so there are separate columns for each covariate
  pivot_wider(
    names_from = "name"
    , values_from = "value"
    )
# A tibble: 28,290 × 4
     SID  HHID   DOB   sex
   <dbl> <dbl> <dbl> <dbl>
 1   901    94  1951     1
 2  1202   124  1913     1
 3  2301   230  1946     0
 4  2302   230  1946     1
 5  2304   230  1978     0
 6  2305   230  1946     1
 7  4601   469  1933     1
 8  4701   477  1919     1
 9  4901   493  1925     1
10  5201   523  1955     0
# ℹ 28,280 more rows

Merge Data

  • Lastly, let’s re-merge the data to bring the information back together
  • Because we want the crossings of traits and life events, we’ll need to change the name and value columns to be specific to the variable categories
  • We want the data to look like this:
SID HHID year event o_value trait p_value sex DOB

Merge Data

soep_clean <- soep_big5 %>%
  # select key variables and rename for personality
  select(SID, HHID, year, trait = name, p_value = value) %>%
  # bring in matching rows (by SID and HHID)
  inner_join(
    soep_out %>%
      # select key variables and rename for outcomes
      select(SID, HHID, event = name, o_value = value)
    ) %>%
  # bring in the covariates
  left_join(soep_cov)
# A tibble: 1,477,972 × 9
     SID  HHID  year trait p_value event     o_value   DOB   sex
   <dbl> <dbl> <dbl> <chr>   <dbl> <chr>       <dbl> <dbl> <dbl>
 1   901    94  2005 A        4.67 ChldBrth        0  1951     1
 2   901    94  2005 A        4.67 ChldMvOut       0  1951     1
 3   901    94  2005 A        4.67 DadDied         0  1951     1
 4   901    94  2005 A        4.67 Divorce         0  1951     1
 5   901    94  2005 A        4.67 Married         0  1951     1
 6   901    94  2005 A        4.67 MomDied         1  1951     1
 7   901    94  2005 A        4.67 MoveIn          0  1951     1
 8   901    94  2005 A        4.67 NewPart         0  1951     1
 9   901    94  2005 A        4.67 PartDied        0  1951     1
10   901    94  2005 A        4.67 SepPart         0  1951     1
# ℹ 1,477,962 more rows

Save Your Data

write.csv(
  x = soep_clean
  , file = sprintf("clean_data_%s.csv", Sys.Date())
  , row.names = F
  )

Note: Why did we do the keys?

soep_clean %>%
  mutate(trait = factor(trait, levels = traits$name, labels = traits$long_name)
         , event = factor(event, levels = outcomes$name, labels = outcomes$long_name))
# A tibble: 1,477,972 × 9
     SID  HHID  year trait         p_value event             o_value   DOB   sex
   <dbl> <dbl> <dbl> <fct>           <dbl> <fct>               <dbl> <dbl> <dbl>
 1   901    94  2005 Agreeableness    4.67 Child Birth             0  1951     1
 2   901    94  2005 Agreeableness    4.67 Child Moves Out         0  1951     1
 3   901    94  2005 Agreeableness    4.67 Father Died             0  1951     1
 4   901    94  2005 Agreeableness    4.67 Divorce                 0  1951     1
 5   901    94  2005 Agreeableness    4.67 Marriage                0  1951     1
 6   901    94  2005 Agreeableness    4.67 Mother Died             1  1951     1
 7   901    94  2005 Agreeableness    4.67 Move in with a P…       0  1951     1
 8   901    94  2005 Agreeableness    4.67 New Partner             0  1951     1
 9   901    94  2005 Agreeableness    4.67 Partner Died            0  1951     1
10   901    94  2005 Agreeableness    4.67 Separated from P…       0  1951     1
# ℹ 1,477,962 more rows

Note: Why did we do the keys?

soep_clean %>%
  mutate(trait = factor(trait, levels = traits$name, labels = traits$long_name)
         , event = factor(event, levels = outcomes$name, labels = outcomes$long_name)
         , o_value = factor(o_value, levels = c(0,1), labels = c("No Event", "Event"))) %>%
  group_by(trait, event, o_value) %>%
  summarize_at(vars(p_value), lst(mean, sd)) %>%
  ungroup() %>%
  ggplot(aes(x = mean, y = event, shape = o_value)) + 
    geom_errorbar(
      aes(xmin = mean - sd, xmax = mean + sd)
      , width = .1
      , position = position_dodge(width=.8)
      ) + 
  geom_point(position = position_dodge(width=.8)) + 
  facet_grid(~trait) + 
  theme_classic()