Week 4 - Codebooks and Data Documentation

Emorie D Beck

Outline

Documenting your design
Building a codebook
Cleaning your data using codebooks
Problem set and Question time

Note: the schedule says we were going to talk about data import today. I want to focus on codebooks, so please read R for Data Science Chapters 8 and 21 to make sure you understand those pieces

Documenting your design

Documentation is a critical part of open science, but not one we’re really taught
Documentation is going to look different for different types of research, but it’s not a hopeless cause to think about common features of documentation
Common Documentation:
- Preregistration
- Experiment Script (for standardizing across experimenters)
- Survey / experimental files / stimuli / questions
- Codebooks of all variables collected
- Codebooks of variables used in a given study

Document your design

Today I want to touch on three things:
- Preregistration (brief, mostly focusing on pointing you to resources)
- Protocol and design flow
- Codebooks of variables used in a given study (and how to use it in R)

Preregistration

Preregistration:
- Specifying your study design, research questions, hypotheses, data cleaning, analytic plan, inference criteria, and robustness checks in advance
Why should you preregister?
- Badges are fun
- Preregistrations are not rigid but a chance to think through the questions you want to ask and answer and the challenges that might arise in doing so
- Builds trust in the scientific process

Preregistration

Preregistration is hard
- Specifying your plan in advance takes considerable effort and time, which can feel like very slow science
Preregistration is worthwhile
- But preregistering plans, code, etc. can speed up the analytic portion of your research workflow, which builds great momentum for writing and submitting projects

What should I preregister?

Depends on the project, some examples include study design, individual research projects, etc.
- Study design: A large survey is collected or a multi-part experiment is conducted. Measures, design, some research questions and hypotheses are specified a priori
- Individual paper / project: A single-part survey or experiment is conducted or a specific piece of a multi-part study is investigated. If part of a multi-part study/experiment, should be linked to the parent preregistration

What should I preregister?

Learning More:

Protocol and Design Flow

Procedure sections in scientific papers are meant to map out, as concisely and simply as possible, how data were obtained (adhering to human subjects ethical codes, etc.)
But such sections are not sufficient to replicate or reproduce research because study designs are much more intricate and include many more details than what fits in a method section
- e.g. measures not used because they weren’t focal, the code that underlies how data are collected, preprocessing, journal space restrictions, etc.

Protocol and Design Flow

As researchers, it’s our job to make sure that the work we do is documented so well that someone could replicate our studies.
Think of it sort of like doing your taxes. You want to keep enough information that if you were audited, you would be able to quickly and easily provide all the relevant information.

Protocol and Design Flow

What you need to document will depend on the kind of work you do.
As an example, in my ecological momentary assessment work, I do the following:
- Preregister the design
- Write a methods section that includes text for every measure included in any part of the study as well as an extended and detailed procedure description. This also includes information on how data will be cleaned and composited
- Detailed codebook including all measures that were collected, regardless of whether I have research questions or hypotheses for them. This is shareable for anyone who wants to use the data
- Make technical workflow. This documents how all documents, scripts, etc. work together to produce the final result, including what is automated, what requires researcher action, etc.
- Comment all code and documents extensively
- Deviations document, where I document every deviation from my initial plans after the design is complete and data begin to be collected (or analyses start)

Protocol and Design Flow

Extensive documentation is also an investment in future you! My measures and procedures section basically write themselves, and my analytic plan is written in the preregistration
This both means that I’m faster and more efficient at writing these and that I feel more confident about the design choices I made, which is a win-win

Codebooks

For me, codebooks are the most essential and important part of any research project
Codebooks allow me to:
- parse through documentation and find all the variables I want
- document detailed information about each of those variables
- make cleaning and compositing choices for each (e.g., renaming, recoding, removing missings, etc.)
- differentiate among the kind of variables I have (e.g., predictors, outcomes, covariates, manipulations, and other categories)
- Pass all this information into R to aid in data cleaning

Codebooks

In this case, we are going to using some data from the German Socioeconomic Panel Study (GSOEP), which is an ongoing Panel Study in Germany. Note that these data are for teaching purposes only, shared under the license for the Comprehensive SOEP teaching dataset, which I, as a contracted SOEP user, can use for teaching purposes. These data represent select cases from the full data set and should not be used for the purpose of publication. The full data are available for free at https://www.diw.de/en/diw_02.c.222829.en/access_and_ordering.html.

For this tutorial, I created the codebook for you: Download, and included what I believe are the core columns you may need. Some of these columns may not be particularly helpful for every dataset.

Codebooks

Here are my core columns that are based on the original data:

dataset name (dataset)
how I categorize the variables (category)
how I rename each item (item)
how I composite the variables (name)
original variable name (old_name)
original item text (item_text)
original item values (scale)

how I will recode each item (in text; recode_desc)
how I will recode each item (in R; recode)
whether item is reverse coded (reverse)
scale minimum (mini)
scale maximum (maxi)
timeline of variable collection (year or wave)
meta name / never changing name (meta)
any notes on the variable (notes)

Codebooks

dataset: this column indexes the name of the dataset that you will be pulling the data from. This is important because we will use this info later on (see purrr tutorial) to load and clean specific data files. Even if you don’t have multiple data sets, I believe consistency is more important and suggest using this.
category: broad categories that different variables can be put into. I’m a fan of naming them things like “outcome”, “predictor”, “moderator”, “demographic”, “procedural”, etc. but sometimes use more descriptive labels like “Big 5” to indicate the model from which the measures are derived.
name: label is basically one level lower than category. So if the category is Big 5, the label would be, or example, “A” for Agreeableness, “SWB” for subjective well-being, etc. This column is most important and useful when you have multiple items in a scales, so I’ll typically leave this blank when something is a standalone variable (e.g. sex, single-item scales, etc.).
item_name: This is the lowest level and most descriptive variable. It indicates which item in scale something is. So it may be “kind” for Agreebleness or “sex” for the demographic biological sex variable.
old_name: this column is the name of the variable in the data you are pulling it from. This should be exact. The goal of this column is that it will allow us to select() variables from the original data file and rename them something that is more useful to us.
item_text: this column is the original text that participants saw or a description of the item.

Codebooks

scale: this column tells you what the scale of the variable is. Is it a numeric variable, a text variable, etc. This is helpful for knowing the plausible range.
recode_text: sometimes, we want to recode variables for analyses (e.g. for categorical variables with many levels where sample sizes for some levels are too small to actually do anything with it). I use this column to note the kind of recoding I’ll do to a variable for transparency.
recode: I write the R code I’ll parse by reading my codebook into R into this column.

Codebooks

Here are additional columns that will make our lives easier or are applicable to some but not all data sets:

reverse: this column tells you whether items in a scale need to be reverse coded. I recommend coding this as 1 (leave alone) and -1 (reverse) for reasons that will become clear later.
mini: this column represents the minimum value of scales that are numeric. Leave blank otherwise.
maxi: this column represents the maximum value of scales that are numeric. Leave blank otherwise.
year: for longitudinal data, we have several waves of data and the name of the same item across waves is often different, so it’s important to note to which wave an item belongs. You can do this by noting the wave (e.g. 1, 2, 3), but I prefer the actual year the data were collected (e.g. 2005, 2009, etc.)
meta: Some datasets have a meta name, which essentially means a name that variable has across all waves to make it clear which variables are the same. They are not always useful as some data sets have meta names but no great way of extracting variables using them. But they’re still typically useful to include in your codebook regardless.
notes: Sometimes things come up that you might want to note about the data, such as ceiling or floor effects, little oddities that happen due to mistakes, etc. that are helpful to document.

Codebooks

Below, I’ll load in the codebook we will use for this study, which will include all of the above columns.

# set the path
wd <- "https://github.com/emoriebeck/psc290-data-FQ23/raw/main/04-workshops/04-week4-readr"

download.file(
  url      = sprintf("%s//codebook.xlsx", wd), 
  destfile = "codebook.xlsx"
  )

Codebooks

The resulting codebook looks something like this:

Codebooks

In addition, to the codebook, I also document other overarching info three other tabs:
- Overview just lists what variables I’m considering as serving different functions (e.g., demographics, covariates, moderators, predictors, outcomes, indepenedent variables, dependent variables, etc.)
- Key helps me create tables that I’ll be able to use in R to help me rename things. This is super helpful for making final tables and figures!!
- Sample helps other people understand how you’re using the columns. This is generally good practice and also helpful if you have research assistants or collaborators helping you out!

Codebooks: Overview

Codebooks: Key

Codebooks: Sample

Data

First, we need to load in the data. We’re going to use three waves of data from the German Socioeconomic Panel Study, which is a longitudinal study of German households that has been conducted since 1984. We’re going to use more recent data from three waves of personality data collected between 2005 and 2013.

Note: we will be using the teaching set of the GSOEP data set. I will not be pulling from the raw files as a result of this. I will also not be mirroring the format that you would usually load the GSOEP from because that is slightly more complicated and something we will return to in a later tutorial on purrr (link) after we have more skills. I’ve left that code in the .qmd for now, but it won’t make a lot of sense right now.

Example

Download the following .zip file with an R project
We’re going to walk through this script, and then you will spend the rest of class working on your problem set, which basically does the same with your own data.

Example

The first thing I do in a script is setup my “Workspace,” which typically includes:
- Packages
- Any alternative paths (e.g., if data are stored online)
- Codebooks
- Data

Packages

library(psych)
library(plyr)
library(tidyverse)

Codebook

The code below first prints the sheets and then reads in the codebook page that has all the variables

readxl::excel_sheets(path = "codebook.xlsx")

[1] "codebook" "Overview" "Key"      "Sample"

codebook <- readxl::read_excel(path = "codebook.xlsx", sheet = "codebook") %>%
  mutate(old_name = str_to_lower(old_name))
codebook %>%
  print(n = 6)

# A tibble: 153 × 15
  dataset category    name  item_name old_name  year item_text scale recode_text
  <chr>   <chr>       <chr> <chr>     <chr>    <dbl> <chr>     <chr> <chr>      
1 <NA>    Procedural  <NA>  SID       persnr       0 Never Ch…  <NA> <NA>       
2 <NA>    Procedural  <NA>  household hhnr         0 househol…  <NA> <NA>       
3 ppfad   Demographic DOB   DOB       gebjahr      0 Year of … "num… <NA>       
4 ppfad   Demographic sex   sex       sex          0 Sex       "\r\… 0 = male, …
5 vp      Big 5       A     coarse    vp12503   2005 Am somet… "\"[… Values < 1…
6 vp      Big 5       A     forgive   vp12506   2005 Able to … "\"[… Values < 1…
# ℹ 147 more rows
# ℹ 6 more variables: recode <chr>, reverse <chr>, mini <dbl>, maxi <dbl>,
#   comp_rule <chr>, long_rule <chr>

Codebook

The “Key” page in the codebook is helpful to rename things more easily into readable names that are great in publications
Let’s create a data frame for the personality, outcome, and demographic variables we’ll reference later

key <- readxl::read_excel(path = "codebook.xlsx", sheet = "Key") %>%
  select(-Summary)
traits   <- key %>% filter(category == "Big 5")
outcomes <- key %>% filter(category == "out")
covars   <- key %>% filter(category == "dem")
traits %>% print(n = 3)

# A tibble: 5 × 7
  long_category long_name         category name  breaks mod   mod_name
  <chr>         <chr>             <chr>    <chr> <chr>  <chr> <chr>   
1 Personality   Extraversion      Big 5    E     <NA>   <NA>  <NA>    
2 Personality   Agreeableness     Big 5    A     <NA>   <NA>  <NA>    
3 Personality   Conscientiousness Big 5    C     <NA>   <NA>  <NA>    
# ℹ 2 more rows

Data

Let’s read in the raw, wide-format data set using readr::read_csv():

vars <- codebook$old_name
soep <- read_csv(file = "soep.csv") %>%
  select(one_of(vars)) # keep vars from codebook
soep %>% print(n = 8)

# A tibble: 28,290 × 153
  persnr  hhnr gebjahr   sex vp12503 vp12506 vp12513 zp12003 zp12006 zp12013
   <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    901    94    1951     2       4       5       5       3       5       5
2   1202   124    1913     2       1      -1       7      NA      NA      NA
3   2301   230    1946     1       5       6       6       4       7       6
4   2302   230    1946     2       6       3       5       5       4       3
5   2304   230    1978     1       4       7       6      NA      NA      NA
6   2305   230    1946     2      NA      NA      NA      NA      NA      NA
7   4601   469    1933     2       2       7       6      NA      NA      NA
8   4701   477    1919     2       1       5       6       1       5       5
# ℹ 28,282 more rows
# ℹ 143 more variables: bdp15103 <dbl>, bdp15106 <dbl>, bdp15113 <dbl>,
#   vp12511 <dbl>, vp12507 <dbl>, vp12501 <dbl>, zp12011 <dbl>, zp12007 <dbl>,
#   zp12001 <dbl>, bdp15111 <dbl>, bdp15107 <dbl>, bdp15101 <dbl>,
#   vp12502 <dbl>, vp12512 <dbl>, vp12508 <dbl>, zp12002 <dbl>, zp12012 <dbl>,
#   zp12008 <dbl>, bdp15102 <dbl>, bdp15112 <dbl>, bdp15108 <dbl>,
#   vp12515 <dbl>, vp12510 <dbl>, vp12505 <dbl>, zp12015 <dbl>, …

Example

Once I have my workspace setup, it’s time to clean the data
For the sake of time, I’m going to skip the descriptives we talked about last week, but we should be doing those in a real setting!
I often have the following sections in my data cleaning section:
- Rename Variables
  - Change to Long
  - Bring in Codebook
- Recode Variables
- Reverse Scoring
- Predictors (In this case, personality)
- Outcomes (In this case, life events)
- Covariates / Demographics
- [any other variable categories you have]
I like to clean different categories of variables separately because I often clean them relatively similarly within categories. Differences within categories are generally captured via columns in my codebook

Rename Variables

Change Data to Long

To get our codebook to play nice with the data since our data are in wide form (including across years), we need to make the data long with at least one name that corresponds to the codebook
In this case, we’ll all the variables but the participant and household ID’s long, and change the item name to old_name since it contains the original variable names.

## change data to long format
soep_long <- soep %>%
  pivot_longer(
    cols = c(-persnr, -hhnr) # all but SID & HHID
    , names_to = "old_name" # match the codebook
    , values_to = "value" # arbitrary name
    , values_drop_na = T # drop missings
  ) %>%
  rename(SID = persnr, HHID = hhnr) # rename
soep_long

# A tibble: 1,902,323 × 4
     SID  HHID old_name value
   <dbl> <dbl> <chr>    <dbl>
 1   901    94 gebjahr   1951
 2   901    94 sex          2
 3   901    94 vp12503      4
 4   901    94 vp12506      5
 5   901    94 vp12513      5
 6   901    94 zp12003      3
 7   901    94 zp12006      5
 8   901    94 zp12013      5
 9   901    94 bdp15103     4
10   901    94 bdp15106     4
# ℹ 1,902,313 more rows

Rename Variables

Merge in Codebook

Now, let’s actually merge in the codebook. We’ll use left_join() here becuase we want to keep all observations in our raw soep_long data frame.

# merge in codebook
soep_long <- soep_long %>% # long data
  left_join( # keep all rows in long data
    codebook %>% # merge in the following variables from the codebook
      select(
        old_name, category, name, item_name, year, recode, 
        reverse, mini, maxi, comp_rule, long_rule
        )
  ) %>%
  select(-old_name) # get rid of old_name because we're done with it
soep_long %>% print(n =6)

# A tibble: 1,902,323 × 13
    SID  HHID value category    name  item_name  year recode reverse  mini  maxi
  <dbl> <dbl> <dbl> <chr>       <chr> <chr>     <dbl> <chr>  <chr>   <dbl> <dbl>
1   901    94  1951 Demographic DOB   DOB           0 ifels… no         NA    NA
2   901    94     2 Demographic sex   sex           0 ifels… no         NA    NA
3   901    94     4 Big 5       A     coarse     2005 ifels… yes         1     7
4   901    94     5 Big 5       A     forgive    2005 ifels… no          1     7
5   901    94     5 Big 5       A     friendly   2005 ifels… no          1     7
6   901    94     3 Big 5       A     coarse     2009 ifels… yes         1     7
# ℹ 1,902,317 more rows
# ℹ 2 more variables: comp_rule <chr>, long_rule <chr>

Example

Now that we’ve merged our codebook and raw data, we’re ready to use the information in the codebook to:
- recode (recode)
- reverse score (reverse, mini, maxi)
- composite (comp_rule, long_rule)

Recode Variables

I find the easiest way to recode variables, especially in projects where I may need to differently recode hundreds or thousands of different variables differently, is to write a little function that takes code chunks from my codebook and runs them.
The function looks like this:

recode_fun <- function(rule, y){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

Recode Variables

Now we’re going to apply each rule to each chunk of the data that uses the same one
To do this, we’re going to use some functions from the purrr package that we won’t talk about it this week.
Don’t worry too much about this. This is code you can copy paste or add to an R script you source across projects!

soep_recode <- soep_long %>%
  group_by(recode) %>% # group by the r code rule
  nest() %>% # create a nested data frame
  ungroup() %>% # ungroup() 
  # apply the recode function
  mutate(data = pmap(list(recode, data), recode_fun)) %>%
  unnest(data) %>% # unnest the data to get back to a normal df
  # change any negative, nan, or Inf values to NA
  mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value)) %>%
  select(-recode) # we're done with the recode column, so remove it
soep_recode

# A tibble: 1,902,323 × 12
     SID  HHID value category    name  item_name  year reverse  mini  maxi
   <dbl> <dbl> <dbl> <chr>       <chr> <chr>     <dbl> <chr>   <dbl> <dbl>
 1   901    94  1951 Demographic DOB   DOB           0 no         NA    NA
 2  1202   124  1913 Demographic DOB   DOB           0 no         NA    NA
 3  2301   230  1946 Demographic DOB   DOB           0 no         NA    NA
 4  2302   230  1946 Demographic DOB   DOB           0 no         NA    NA
 5  2304   230  1978 Demographic DOB   DOB           0 no         NA    NA
 6  2305   230  1946 Demographic DOB   DOB           0 no         NA    NA
 7  4601   469  1933 Demographic DOB   DOB           0 no         NA    NA
 8  4701   477  1919 Demographic DOB   DOB           0 no         NA    NA
 9  4901   493  1925 Demographic DOB   DOB           0 no         NA    NA
10  5201   523  1955 Demographic DOB   DOB           0 no         NA    NA
# ℹ 1,902,313 more rows
# ℹ 2 more variables: comp_rule <chr>, long_rule <chr>

Reverse Coding

Now we can reverse code our data. We’ll use similar code to what I showed last week, but we’re working on long-format data, so we’ll use the reverse, mini, and maxi columns to tell us what to reverse and how.

soep_recode <- soep_recode %>%
  # if reverse = no leave along
  # otherwise reverse code it according to the mini and maxi
  mutate(value = 
           ifelse(is.na(reverse) | reverse == "no"
            , value
            , as.numeric(reverse.code(
              -1
              , value
              , mini = mini
              , maxi = maxi
              ))
      )
  ) %>%
  # get rid of reverse, mini, & maxi because we're done with them
  select(-reverse, -mini, -maxi)

# A tibble: 1,902,323 × 9
     SID  HHID value category    name  item_name  year comp_rule long_rule
   <dbl> <dbl> <dbl> <chr>       <chr> <chr>     <dbl> <chr>     <chr>    
 1   901    94  1951 Demographic DOB   DOB           0 mode      select   
 2  1202   124  1913 Demographic DOB   DOB           0 mode      select   
 3  2301   230  1946 Demographic DOB   DOB           0 mode      select   
 4  2302   230  1946 Demographic DOB   DOB           0 mode      select   
 5  2304   230  1978 Demographic DOB   DOB           0 mode      select   
 6  2305   230  1946 Demographic DOB   DOB           0 mode      select   
 7  4601   469  1933 Demographic DOB   DOB           0 mode      select   
 8  4701   477  1919 Demographic DOB   DOB           0 mode      select   
 9  4901   493  1925 Demographic DOB   DOB           0 mode      select   
10  5201   523  1955 Demographic DOB   DOB           0 mode      select   
# ℹ 1,902,313 more rows

Composite Items

Now that our data are recoded and reverse scored, we can clean each category of data:
- Personality
- Outcomes / Life Events
- Demographics / Covariates
Then we’ll merge them together

Personality

For the Big Five, we want to get composites within years (2005, 2009, 2013)
the comp_rule is average, so want to get the mean
the long_rule is select because we’ll choose what to do with each year
In the workbook, you’ll see an alternate way to do this that uses a function
- We won’t go over this today because we haven’t learned purrr and functions yet
- But keep this code and see if you can figure it out because it’s much more flexible and super useful in the case that you need to apply different rules to different variables (especially useful for covariates and moderators!!)

Personality

soep_big5 <- soep_recode %>%
  # keep Big Five & drop missings
  filter(category == "Big 5" & !is.na(value)) %>%
  # "split" the data by category, person, household, trait, item, and year
  group_by(category, SID, HHID, name, item_name, year) %>%
  # "apply" the mean function, collapsing within splits
  summarize(value = mean(value)) %>%
  # "split" the data by category, person, household, trait, and item
  group_by(category, SID, HHID, name, year) %>%
  # "apply" the mean function, collapsing within splits
  summarize(value = mean(value)) %>%
  # "combine" the data back together
  ungroup()

# A tibble: 151,186 × 6
  category   SID  HHID name   year value
  <chr>    <dbl> <dbl> <chr> <dbl> <dbl>
1 Big 5      901    94 A      2005  4.67
2 Big 5      901    94 A      2009  5   
3 Big 5      901    94 A      2013  4.67
4 Big 5      901    94 C      2005  5   
5 Big 5      901    94 C      2009  5   
6 Big 5      901    94 C      2013  5.67
7 Big 5      901    94 E      2005  3.67
8 Big 5      901    94 E      2009  3.67
# ℹ 151,178 more rows

Outcomes

Now onto life events
For these data we want to get a single composite for each life event across all years (i.e. did they ever experience each event)
Both the comp_rule and the long_rule are max
As before, the more flexible way is in the workbook

Outcomes

soep_out <- soep_recode %>%
  # keep Life events & drop missings
  filter(category == "Life Event" & !is.na(value)) %>%
  # "split" the data by category, person, household, event, and year
  group_by(SID, HHID, category, name, year) %>% 
  # "apply" the max function, collapsing within splits
  summarize(value = max(value)) %>%
  # "split" the data by category, person, household, event
  group_by(SID, HHID, category, name) %>% 
  # "apply" the max function, collapsing within splits
  summarize(value = max(value)) %>%
  # "combine" the data back together
  ungroup()

# A tibble: 253,363 × 5
    SID  HHID category   name      value
  <dbl> <dbl> <chr>      <chr>     <dbl>
1   901    94 Life Event ChldBrth      0
2   901    94 Life Event ChldMvOut     0
3   901    94 Life Event DadDied       0
4   901    94 Life Event Divorce       0
5   901    94 Life Event Married       0
6   901    94 Life Event MomDied       1
7   901    94 Life Event MoveIn        0
8   901    94 Life Event NewPart       0
# ℹ 253,355 more rows

Covariates

Now let’s do the covariates
The comp_rule and long_rule columns tell us that the two variables actually have the same rule (mode), which make it easy to clean
As before, the more flexible way is in the workbook, which is particularly useful for demographics and covariates that may be on super different scales

Covariates

Mode <- function(x) {
  ux <- unique(x)
  ux <- ux[!is.na(ux)]
  ux[which.max(tabulate(match(x, ux)))]
}

soep_cov <- soep_recode %>%
  # keep demographics & drop missings
  filter(category == "Demographic" & !is.na(value)) %>%
  # "split" the data by category, person, household, covariate, and year
  group_by(category, SID, HHID, name, year) %>%
  # "apply" the Mode function, collapsing within splits
  summarize(value = Mode(value)) %>%
  # "split" the data by category, person, household, and covariate
  group_by(SID, HHID, name) %>%
  # "apply" the Mode function, collapsing within splits
  summarize(value = Mode(value)) %>%
  # "combine" the data back together
  ungroup() %>%
  # pivot data wider so there are separate columns for each covariate
  pivot_wider(
    names_from = "name"
    , values_from = "value"
    )

# A tibble: 28,290 × 4
     SID  HHID   DOB   sex
   <dbl> <dbl> <dbl> <dbl>
 1   901    94  1951     1
 2  1202   124  1913     1
 3  2301   230  1946     0
 4  2302   230  1946     1
 5  2304   230  1978     0
 6  2305   230  1946     1
 7  4601   469  1933     1
 8  4701   477  1919     1
 9  4901   493  1925     1
10  5201   523  1955     0
# ℹ 28,280 more rows

Merge Data

Lastly, let’s re-merge the data to bring the information back together
Because we want the crossings of traits and life events, we’ll need to change the name and value columns to be specific to the variable categories
We want the data to look like this:

SID	HHID	year	event	o_value	trait	p_value	sex	DOB

Merge Data

soep_clean <- soep_big5 %>%
  # select key variables and rename for personality
  select(SID, HHID, year, trait = name, p_value = value) %>%
  # bring in matching rows (by SID and HHID)
  inner_join(
    soep_out %>%
      # select key variables and rename for outcomes
      select(SID, HHID, event = name, o_value = value)
    ) %>%
  # bring the covariates
  left_join(soep_cov)

# A tibble: 1,477,972 × 9
    SID  HHID  year trait p_value event     o_value   DOB   sex
  <dbl> <dbl> <dbl> <chr>   <dbl> <chr>       <dbl> <dbl> <dbl>
1   901    94  2005 A        4.67 ChldBrth        0  1951     1
2   901    94  2005 A        4.67 ChldMvOut       0  1951     1
3   901    94  2005 A        4.67 DadDied         0  1951     1
4   901    94  2005 A        4.67 Divorce         0  1951     1
5   901    94  2005 A        4.67 Married         0  1951     1
6   901    94  2005 A        4.67 MomDied         1  1951     1
7   901    94  2005 A        4.67 MoveIn          0  1951     1
8   901    94  2005 A        4.67 NewPart         0  1951     1
# ℹ 1,477,964 more rows

Save Your Data

write.csv(
  x = soep_clean
  , file = sprintf("clean_data_%s.csv", Sys.Date())
  , row.names = F
  )

Note: Why did we do the keys?

soep_clean %>%
  mutate(trait = factor(trait, levels = traits$name, labels = traits$long_name)
         , event = factor(event, levels = outcomes$name, labels = outcomes$long_name))

# A tibble: 1,477,972 × 9
     SID  HHID  year trait         p_value event             o_value   DOB   sex
   <dbl> <dbl> <dbl> <fct>           <dbl> <fct>               <dbl> <dbl> <dbl>
 1   901    94  2005 Agreeableness    4.67 Child Birth             0  1951     1
 2   901    94  2005 Agreeableness    4.67 Child Moves Out         0  1951     1
 3   901    94  2005 Agreeableness    4.67 Father Died             0  1951     1
 4   901    94  2005 Agreeableness    4.67 Divorce                 0  1951     1
 5   901    94  2005 Agreeableness    4.67 Marriage                0  1951     1
 6   901    94  2005 Agreeableness    4.67 Mother Died             1  1951     1
 7   901    94  2005 Agreeableness    4.67 Move in with a P…       0  1951     1
 8   901    94  2005 Agreeableness    4.67 New Partner             0  1951     1
 9   901    94  2005 Agreeableness    4.67 Partner Died            0  1951     1
10   901    94  2005 Agreeableness    4.67 Separated from P…       0  1951     1
# ℹ 1,477,962 more rows

Note: Why did we do the keys?

soep_clean %>%
  mutate(
    trait = factor(trait, levels = traits$name, labels = traits$long_name)
    , event = factor(event, levels = outcomes$name, labels = outcomes$long_name)
    , o_value = factor(o_value, levels = c(0,1), labels = c("No Event", "Event"))
    ) %>%
  group_by(trait, event, o_value) %>%
  summarize_at(vars(p_value), lst(mean, sd)) %>%
  ungroup() %>%
  ggplot(aes(x = mean, y = event, shape = o_value)) + 
    geom_errorbar(
      aes(xmin = mean - sd, xmax = mean + sd)
      , width = .1
      , position = position_dodge(width=.8)
      ) + 
  geom_point(position = position_dodge(width=.8)) + 
  facet_grid(~trait) + 
  theme_classic() + 
  theme(legend.position = "bottom")

Reading data

Reading data with `readr` and `haven`

Above, I demonstrated how to read data from csv files.
Now, I want to briefly walk you through some of the additional functions and arguments in the readr package
Then, I’ll briefly introduce the haven package for reading other kinds of data (e.g., .sav and .dat)

Benefits of `readr`

Compared to the corresponding base functions, readr functions:

Use a consistent naming scheme for the parameters (e.g. col_names and col_types not header and colClasses).
Are generally much faster (up to 10x-100x) depending on the dataset.
Leave strings as is by default, and automatically parse common date/time formats.
Have a helpful progress bar if loading is going to take a while.
All functions work exactly the same way regardless of the current locale. To override the US-centric defaults, use locale().

Why `.csv`?

readr primarily works for .csv and .txt files
Relative to other file types, these are quite primitive and won’t save meta-data information, etc., so why use them?
- Not platform-specific: .csv’s can be read by virtually all programs
- Simple square data: plays nicely on sites like GitHub, etc.
- Primitive core: some types of data become unreadable following updates (e.g., a warning about not being able to read a file with magic number x)
In practice, I often save my data as both a .csv (for sharing) and .RData (for personal use)

`readr::read_csv()`

Core arugments:

file: path to file
col_names: T/F (does first line contain column names); alternatively provide a vector with new column names to be appended
col_types: what kind of data does each column contain (readr tries to guess by default)
- Specifications are created by a list() or cols() (see later slide)
col_select: Which columns to read in (helpful for huge datasets) using select style syntax wrapped in a c()

`readr::read_csv()`: `col_types`

readr does pretty good at guessing column types, with the occasional character/number/logical/date issue. For these instances, you can force the column type using the col_types argument. Here’s some examples of the kinds of columns and the syntax:
col_logical() [l], containing only T, F, TRUE or FALSE
col_integer() [i], integers
col_double() [d], doubles
col_character() [c], everything else
col_factor(levels, ordered) [f], a fixed set of values
col_date(format = "") [D]: with the locale’s date_format
col_time(format = "") [t]: with the locale’s time_format
col_datetime(format = "") [T]: ISO8601 date times

`readr::read_csv()`: `col_types`

In our cleaned data we created, for example, say that I wanted the following specific column types:

SID as a character
sex as a factor

read_csv(
  file = "clean_data_2026-01-31.csv"
  , col_types = cols(
    SID = col_character()
    , sex = col_factor()
  )) %>% print(n = 5)

# A tibble: 1,477,972 × 9
  SID    HHID  year trait p_value event     o_value   DOB sex  
  <chr> <dbl> <dbl> <chr>   <dbl> <chr>       <dbl> <dbl> <fct>
1 901      94  2005 A        4.67 ChldBrth        0  1951 1    
2 901      94  2005 A        4.67 ChldMvOut       0  1951 1    
3 901      94  2005 A        4.67 DadDied         0  1951 1    
4 901      94  2005 A        4.67 Divorce         0  1951 1    
5 901      94  2005 A        4.67 Married         0  1951 1    
# ℹ 1,477,967 more rows

`haven`

haven is another tidyverse package that supports data from
- SAS (read_sas() and read_xpt())
- SPSS (read_sav() and older read_por() files)
- Stata (read_dta())
all are read in as tibbles
Made to work with labelled data, which play nice for factoring variables
These work close to the same, so let’s practice with read_sav() since I suspect you’re most likely to encounter that anyway

`haven::read_sav()`

Core arguments:

file: file path and name
col_select: Which columns to read in (helpful for huge datasets) using select style syntax wrapped in a c()
encoding: possibly useful for those of you whose computers and/or work are in multiple languages

`haven`: Labelled data

SAS, SPSS, and Stata share a simply concept of “labelled” vectors, which are similar to factors but a little more general. The labelled() class provides a natural representaion in R.
haven has some functions to help deal with this (labelled data doesn’t always play nice)
- labelled_spss(): Labelled vectors for SPSS
- labelled() is.labelled(): Create a labelled vector.
- print_labels(): Print the labels of a labelled vector
- as_factor(<data.frame>) as_factor(<haven_labelled>) as_factor(<labelled>): Convert labelled vectors to factors
- zap_label(): Zap variable labels
- zap_missing(): Zap special missings to regular R missings

Using the Import Dataset GUI

RStudio actually has a great GUI for reading in data
Although it’s not good practice to use it for reproducibility, it’s great when initially setting that up and will provide you with the code to read it in.

Using the Import Dataset GUI

First, you can access the GUI in your Environment panel:

Using the Import Dataset GUI

Second, after selecting the appropriate type and package, you should see this window:

Using the Import Dataset GUI

Third, select your data set in your browser:

Using the Import Dataset GUI

Fourth, you should see a preview of your data. Depending on the data type, you also may be able to set additional options:

Week 4 - Codebooks and Data Documentation

Outline

Documenting your design

Documenting your design

Document your design

Preregistration

Preregistration

What should I preregister?

What should I preregister?

Learning More:

Protocol and Design Flow

Protocol and Design Flow

Protocol and Design Flow

Protocol and Design Flow

Codebooks

Codebooks

Codebooks

Codebooks

Codebooks

Codebooks

Codebooks

Codebooks

Codebooks

Codebooks

Codebooks: Overview

Codebooks: Key

Codebooks: Sample

Data

Example

Example

Example

Packages

Codebook

Codebook

Data

Example

Rename Variables

Change Data to Long

Rename Variables

Merge in Codebook

Example

Recode Variables

Recode Variables

Reverse Coding

Composite Items

Personality

Personality

Outcomes

Outcomes

Covariates

Covariates

Merge Data

Merge Data

Save Your Data

Note: Why did we do the keys?

Note: Why did we do the keys?

Reading data

Reading data with readr and haven

Benefits of readr

Why .csv?

readr::read_csv()

readr::read_csv(): col_types

readr::read_csv(): col_types

haven

haven::read_sav()

haven: Labelled data

Using the Import Dataset GUI

Using the Import Dataset GUI

Using the Import Dataset GUI

Using the Import Dataset GUI

Using the Import Dataset GUI

Reading data with `readr` and `haven`

Benefits of `readr`

Why `.csv`?

`readr::read_csv()`

`readr::read_csv()`: `col_types`

`readr::read_csv()`: `col_types`

`haven`

`haven::read_sav()`

`haven`: Labelled data