Chapter 2 Data Cleaning

# Modal value of a vector: drop NAs, return the first mode in case of ties
Mode <- function(x) {
  ux <- unique(x)
  ux <- ux[!is.na(ux)]
  ux[which.max(tabulate(match(x, ux)))]
}

The data preparation procedure was nearly identical across samples and methods, with the exception of whether data were pooled across samples (Methods 1A-2B) or kept separate (Methods 3 and 4). First, before preregistration, data preparation began by cataloging the available data in each sample in a comprehensive codebook, with the goal of compiling all of the information necessary for harmonizing across samples. For each sample, we compiled information on (1) the original variable name in the raw data, (2) the question text, (3) the scale (e.g., numeric, Likert-like scale points, labels for nominal variables), and (4) the available waves or years for each of the target variables. We also included (5) the category of the variable (e.g., personality, outcome), (6) a higher-order construct name (e.g., Agreeableness, crystallized cognitive ability) that was matched across samples, (7) a lower-order variable name that indexed unique items (e.g., friendly, WAIS Information), (8) a note on whether an item was reverse scored along with its (9) minimum and (10) maximum values, (11) a string of R code for recoding scales (e.g., recoding missing values as NA, changing levels of nominal variables to harmonize across samples), and two rules for how composites were to be created, either (12) within a wave (e.g., compositing a personality or cognitive scale) or (13) longitudinally across waves (e.g., for covariates indexing participants’ highest level of education prior to baseline).

Second, once the codebook was created and all the data were received, the codebook and data files were sequentially loaded into R, and data were renamed, reverse-scored, and rescaled as documented in the codebook. Then, for each of the categories of data (covariates, personality, and cognitive ability), data were composited as documented in Table 2. Finally, these three sources of data were combined into a single data set within each sample separately. The full data cleaning procedure can be seen in Chapter 1 (Data Cleaning) in the online materials.
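The recoding step described above works by storing each recode rule in the codebook as a string of R code that references the raw values as `x`, then evaluating that string against the data (this is what the `recode_fun()` helpers do in the chunks that follow). A minimal sketch with a hypothetical rule (the specific missing-value codes are illustrative, not taken from any sample):

```r
# Codebook recode rules are strings of R code that reference raw values as `x`.
# The rule below is hypothetical, for illustration only.
rule <- "ifelse(x %in% c(-1, 97, 98, 99), NA, x)"
x <- c(1, 3, 97, -1, 5)
eval(parse(text = rule))  # 1, 3, NA, NA, 5
```

Storing rules as strings keeps all sample-specific recoding logic in the codebook itself, so the cleaning scripts stay generic across samples.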

Third, for Methods 1 (A and B) and 2 (A and B), data from each sample were combined into a single data set with harmonized labels and scales. For methods that include sample-level moderators (Methods 1A, 2B, and 3), this also involved combining the personality, cognitive, and covariate data with sample-level data. In Methods 1A and 2B, this happens in the initial step, while in Method 3, these are added after estimating the models for each sample separately (see online materials).

Fourth, we harmonized the continuous personality and cognitive ability measures using POMP (percent of maximum possible) scoring (range 0-10). Continuous variables with easily harmonized scales (i.e., age and years of education) were centered, with age centered at 60 years and education centered at 12 years. Nominal variables were harmonized during the earlier data cleaning steps such that 0 = “male” and 1 = “female” for gender.
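POMP scoring rescales a variable relative to its possible minimum and maximum. A minimal sketch of a helper on the 0-10 range used here (illustrative only; the `pomp()` call in the cleaning scripts is defined in the online materials, and the default observed-range bounds below are an assumption):

```r
# POMP (percent of maximum possible) rescaled to 0-10.
# NOTE: illustrative sketch -- the cleaning scripts define their own pomp().
pomp <- function(x, mini = min(x, na.rm = TRUE), maxi = max(x, na.rm = TRUE)) {
  (x - mini) / (maxi - mini) * 10
}

pomp(c(1, 3, 5), mini = 1, maxi = 5)  # 0, 5, 10
```

Supplying the scale's theoretical `mini` and `maxi` (rather than the observed range) keeps scores comparable across samples with different observed distributions.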

2.1 Berlin Aging Study (BASE-I)

The Berlin Aging Study (I) is a longitudinal interdisciplinary study of aging that began in 1990. The data are available, by application, from https://www.base-berlin.mpg.de/en/system/files/media/pdf/a97295f241d06bf496b5eeae4251a34e/base-data-transfer_form_online.pdf.

Participants included 516 individuals over 70, stratified on age and gender, recruited from a sample of the Berlin City Registry. To date, most participants are deceased, with most attrition from the study due to mortality or severe health complications. Since 1990, there have been seven follow-up measurement occasions (eight total).

Sample sizes vary over time, from 516 (T1) to 23 (T8). To ensure adequate sample sizes, the present study will use data up to the fourth wave, which included 164 participants with full data. This provides 99% power to detect a correlation effect size of ~.32 at a two-tailed alpha of .05.
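Figures like this one can be approximated in base R with the Fisher z transformation (a sketch of the calculation, not the preregistered power analysis itself):

```r
# Smallest correlation detectable at a given n, power, and two-tailed alpha,
# via the Fisher z approximation: |atanh(r)| = (z_crit + z_power) / sqrt(n - 3)
min_detectable_r <- function(n, power = .99, alpha = .05) {
  tanh((qnorm(1 - alpha / 2) + qnorm(power)) / sqrt(n - 3))
}

min_detectable_r(164)  # roughly .32, consistent with the figure above
```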

2.1.1 Load Data

(basei_codebook <- (codebook %>% filter(study == "BASE-I"))$codebook[[1]] %>%
  mutate(orig_itemname = str_to_lower(orig_itemname)))
## # A tibble: 138 × 17
##    study  dataset   category name  itemname  wave  year stem  orig_itemname description scale reverse_code recode  mini  maxi
##    <chr>  <chr>     <chr>    <chr> <chr>    <dbl> <dbl> <chr> <chr>         <chr>       <chr> <chr>        <chr>  <dbl> <dbl>
##  1 base-i social    covaria… educ… eduYears     1  1990 <NA>  s1kon28       Years of E… "int… <NA>         ifels…    NA    NA
##  2 base-i social    covaria… educ… eduYears     1  1990 <NA>  z1e10         General ed… "1 =… <NA>         ifels…    NA    NA
##  3 base-i demograp… covaria… gend… sex          1  1990 esex  z1esex        Sex of par…  <NA> <NA>         ifels…    NA    NA
##  4 base-i demograp… covaria… gend… sex          2  1993 esex  z2esex        Sex of par…  <NA> <NA>         ifels…    NA    NA
##  5 base-i demograp… covaria… gend… sex          3  1995 esex  z3esex        Sex of par…  <NA> <NA>         ifels…    NA    NA
##  6 base-i demograp… covaria… gend… sex          4  1997 esex  z4esex        Sex of par…  <NA> <NA>         ifels…    NA    NA
##  7 base-i demograp… covaria… gend… sex          5  2000 esex  z5esex        Sex of par…  <NA> <NA>         ifels…    NA    NA
##  8 base-i demograp… covaria… gend… sex          6  2004 esex  z6esex        Sex of par…  <NA> <NA>         ifels…    NA    NA
##  9 base-i demograp… covaria… gend… sex          7  2005 esex  z7esex        Sex of par…  <NA> <NA>         ifels…    NA    NA
## 10 base-i demograp… covaria… SRhe… SRhealth     1  1990 E13X… z1e13x1r      How would … "1 =… yes          ifels…     1     5
## # ℹ 128 more rows
## # ℹ 2 more variables: comp_rule <chr>, long_rule <chr>
basei <- sprintf("%s/base-i/Beck_202010.sav", data_path) %>% read_sav() %>%
  full_join(sprintf("%s/base-i/Beck_202010_2.sav", data_path) %>% read_sav())

2.1.2 Recoding & Reverse Scoring

# pull a target variable's raw items and join in its codebook metadata
rename_fun <- function(cb, var){
  old.names <- unique((basei_codebook %>% filter(name == var))$orig_itemname)
  df <- basei %>% 
    select(SID = id, one_of(old.names)) %>%
    gather(key = orig_itemname, value = value, -SID, na.rm=T)
  if(length(old.names) > 1){
      df %>% left_join(cb %>% select(itemname, wave, orig_itemname, reverse_code:long_rule))
  } else {
    df %>% left_join(cb %>% select(itemname, orig_itemname, reverse_code:long_rule) %>% distinct()) %>% mutate(wave = "0")
  }
}

# join data with recoding info 
basei_recode <- basei_codebook %>%
  select(category, name, itemname, wave, orig_itemname, reverse_code:long_rule) %>%
  filter(category %in% c("pers", "outcome", "covariates")) %>%
  group_by(category, name) %>%
  nest() %>% 
  ungroup() %>%
  mutate(data = map2(data, name, rename_fun))

# recode 
# apply a codebook recode rule (a string of R code that references the raw
# values as `x`) to a nested data frame of values
recode_fun <- function(rule, y, year){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

basei_recode <- basei_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    mutate(wave = as.numeric(wave)) %>%
    group_by(recode, wave) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, wave), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))


# reverse code 
basei_recode <- basei_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
}

2.1.3 Covariates

# composite WITHIN years 
basei_cov <- basei_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  mutate(data = map(data, ~(.) %>% 
      filter(!is.na(value)) %>%
      mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
      group_by(comp_rule, SID, wave, long_rule) %>%
      nest() %>%
      ungroup() %>%
      mutate(data = map2(data, comp_rule, ~fun_call((.x)$value, .y))) %>%
      unnest(data) %>%
      filter(wave <= 1)
      ))

comp_fun <- function(d, rule){
  d %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(data, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
basei_cov <- basei_cov %>%
  unnest(data) %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  group_by(long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.1.4 Personality Variables

basei_pers <- basei_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>%
  filter(wave == 1 & !is.na(value)) %>%
  group_by(SID, wave, name, itemname, comp_rule) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

# Cronbach's alphas
basei_alpha <- basei_pers %>%
  filter(!is.na(value)) %>%
  select(name, itemname, wave, SID, value) %>%
  group_by(name, wave) %>%
  nest() %>%
  mutate(data = map(data, ~(.) %>% distinct() %>% pivot_wider(names_from = itemname, values_from = value, values_fn = list(mean))),
         alpha = map(data, possibly(~psych::alpha((.) %>% select(-SID), check.keys = T), NA_real_)))

comp_fun <- function(df, rule){
  df %>%
    group_by(SID) %>%
    summarize(value = fun_call(value, rule)) %>%
    ungroup() %>%
    mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
}

# create composites
basei_pers <- basei_pers %>%
  group_by(name, comp_rule, wave) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map2(data, comp_rule, comp_fun)) %>%
  unnest(data) %>%
  select(-comp_rule)

2.1.5 Outcome Variables

basei_cog_waves <- basei_recode %>%
  filter(category == "outcome") %>%
  unnest(data) %>%
  group_by(SID, name) %>%
  summarize(o_year = max(wave)) %>%
  ungroup()

# composite within years
basei_out <- basei_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  right_join(basei_cog_waves) %>%
  filter(wave == o_year & wave > 1) %>%
  group_by(name, wave, SID) %>%
  summarize(value = sum(value, na.rm = T)) %>%
  ungroup()

2.1.6 Combine Data

basei_combined <- basei_pers %>% 
  rename(Trait = name, p_value = value, p_year = wave) %>%
  full_join(
    basei_out %>% 
      rename(Outcome = name, o_value = value, o_year = wave)
    ) %>% full_join(basei_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(basei_cov, basei_alpha, basei_pers, basei_out, basei_combined,
     file = sprintf("%s/data/clean/base-i_cleaned.RData", local_path))
rm(list = ls()[grepl("basei", ls())])

2.2 Einstein Aging Study (EAS)

The Einstein Aging Study is an ongoing longitudinal study of older adults in the United States that began in 1980. The data are available on a project-by-project basis by submitting a concept proposal at https://www.einstein.yu.edu/departments/neurology/clinical-research-program/eas/data-sharing.aspx.

Since 1993, the EAS has systematically recruited a representative aging sample in the Bronx, New York. As of 2017, 2,600 participants were enrolled in the study. As of 2010, approximately 200 of the enrolled participants had autopsy data. More information on the study can be found at http://www.einstein.yu.edu/departments/neurology/clinical-research-program/EAS/.

Sample sizes vary over time, with ranges across waves not publicly available. However, we expect approximately 2,000 participants to have basic personality and cognitive ability data. This provides 99% power to detect a zero-order correlation effect size of .10 at a two-tailed alpha of .05.

2.2.1 Load Data

(eas_codebook <- (codebook %>% filter(study == "EAS"))$codebook[[1]])
## # A tibble: 16 × 15
##    study dataset category name       itemname year  orig_itemname description scale reverse_code recode mini  maxi  comp_rule
##    <chr> <chr>   <chr>    <chr>      <chr>    <chr> <chr>         <chr>       <chr> <lgl>        <chr>  <lgl> <lgl> <chr>    
##  1 EAS   <NA>    match    education  educati… base… Educyrs       Education   <NA>  NA           <NA>   NA    NA    skip     
##  2 EAS   <NA>    match    gender     gender   annu… Gender        Gender      <NA>  NA           ifels… NA    NA    skip     
##  3 EAS   <NA>    match    SRhealth   SRhealth annu… <NA>          Self-rated… <NA>  NA           code … NA    NA    skip     
##  4 EAS   <NA>    match    age        age      long… AgeAtWave     Raw age     inte… NA           ifels… NA    NA    skip     
##  5 EAS   <NA>    outcome  crystalli… BosNami… annu… BostonFree    Word Fluen… inte… NA           ifels… NA    NA    sum      
##  6 EAS   <NA>    outcome  crystalli… waisInfo annu… Infraw        WAIS III: … inte… NA           ifels… NA    NA    sum      
##  7 EAS   dataset pers     A          ipipApos 1     Agreeablenes… IPIP NEO A  5 to… NA           ifels… NA    NA    average  
##  8 EAS   dataset pers     C          ipipCpos 1     Conscientiou… IPIP NEO C  5 to… NA           ifels… NA    NA    average  
##  9 EAS   dataset pers     E          ipipEpos 1     Extraversion… IPIP NEO E  5 to… NA           ifels… NA    NA    average  
## 10 EAS   dataset pers     N          ipipNpos 1     NeuroticismP… IPIP NEO N  5 to… NA           ifels… NA    NA    average  
## 11 EAS   dataset pers     O          ipipOpos 1     OpenToExperi… IPIP NEO O  5 to… NA           ifels… NA    NA    average  
## 12 EAS   dataset pers     A          ipipAneg 1     Agreeablenes… IPIP NEO A  5 to… NA           ifels… NA    NA    average  
## 13 EAS   dataset pers     C          ipipCneg 1     Conscientiou… IPIP NEO C  5 to… NA           ifels… NA    NA    average  
## 14 EAS   dataset pers     E          ipipEneg 1     Extraversion… IPIP NEO E  5 to… NA           ifels… NA    NA    average  
## 15 EAS   dataset pers     N          ipipNneg 1     NeuroticismN… IPIP NEO N  5 to… NA           ifels… NA    NA    average  
## 16 EAS   dataset pers     O          ipipOneg 1     OpenToExperi… IPIP NEO O  5 to… NA           ifels… NA    NA    average  
## # ℹ 1 more variable: long_rule <chr>
# old.names1 <- unique((eas_codebook %>% filter(dataset == "jrp"))$orig_itemname)
# old.names2 <- unique((eas_codebook %>% filter(dataset == "dataset"))$orig_itemname)
old.names <- unique(eas_codebook$orig_itemname)
eas <- sprintf("%s/eas/Behavior_with_Master_Data_2021_09_24.xlsx", data_path) %>% 
  read_excel(.) %>% 
  select(SID = Id, wave = Wave, year = BehaviorDate, one_of(old.names)) %>%
  mutate(year = lubridate::year(year)
        , Gender = as.numeric(mapvalues(Gender, c("M", "F"), c(0,1))))
  # full_join(
  #   sprintf("%s/eas/dataset.xlsx", data_path) %>% 
  #     read_excel(.) %>% 
  #     select(SID = id, one_of(old.names2))
  # )

eas_long <- eas %>%
  pivot_longer(values_to = "value"
               , names_to = "orig_itemname"
               , cols = c(-SID, -wave, -year)
               , values_drop_na = T)

2.2.2 Recoding & Reverse Scoring

# join data with recoding info 
eas_recode <- eas_codebook %>%
  filter(category %in% c("pers", "outcome", "covariates") & !is.na(orig_itemname)) %>%
  select(category, name, itemname, orig_itemname, reverse_code:long_rule) %>%
  group_by(category, name) %>% 
  nest() %>%
  ungroup() %>%
  mutate(data = map(data, ~(.) %>% left_join(eas_long))) 

# recode 
recode_fun <- function(rule, y, year){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

eas_recode <- eas_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    group_by(recode, year) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, year), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))


# reverse code 
eas_recode <- eas_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
}
eas_waves <- eas_recode %>% 
  filter(category == "pers") %>%
  unnest(data) %>% 
  select(SID, p_year = year) %>%
  distinct() %>%
  group_by(SID) %>%
  filter(p_year == min(p_year)) %>%
  ungroup()

2.2.3 Covariates

# composite WITHIN years 
eas_cov <- eas_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  mutate(data = map(data, ~(.) %>% 
      filter(!is.na(value)) %>%
      left_join(eas_waves) %>%
      mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
      group_by(comp_rule, SID, year, p_year, long_rule) %>%
      nest() %>%
      ungroup() %>%
      mutate(data = map2(data, comp_rule, ~fun_call((.x)$value, .y))) %>%
      unnest(data) %>%
      filter(year <= p_year)
      # summarize(value = fun_call(value, comp_rule)) %>%
      # ungroup() %>% 
      # mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
      ))

comp_fun <- function(d, rule, p_year){
  print(paste(rule, p_year))
  d %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(data, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
eas_cov <- eas_cov %>%
  unnest(data) %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  filter(year <= p_year) %>%
  group_by(p_year, long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule, p_year), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.2.4 Personality Variables

eas_pers <- eas_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>%
  left_join(eas_waves) %>%
  group_by(SID, p_year, year, name, itemname, comp_rule) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

# Cronbach's alphas
eas_alpha <- eas_pers %>%
  filter(!is.na(value)) %>%
  select(name, itemname, year, SID, value) %>%
  group_by(name, year) %>%
  nest() %>%
  mutate(data = map(data, ~(.) %>% distinct() %>% pivot_wider(names_from = itemname, values_from = value, values_fn = list(mean))),
         alpha = map(data, possibly(~psych::alpha((.) %>% select(-SID)), NA_real_)))

comp_fun <- function(df, rule){
  df %>%
    group_by(SID) %>%
    summarize(value = fun_call(value, rule)) %>%
    ungroup() %>%
    mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
}

# create composites
eas_pers <- eas_pers %>%
  group_by(name, comp_rule, year) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map2(data, comp_rule, comp_fun)) %>%
  unnest(data) %>%
  select(-comp_rule)

2.2.5 Outcome Variables

# composite within years
eas_out <- eas_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  group_by(SID, name) %>%
  filter(year == max(year)) %>%
  filter(!is.na(value)) %>%
  group_by(name, itemname, year) %>%
  mutate(value = pomp(value)) %>%
  group_by(name, year, SID) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup() 

2.2.6 Combine Data

eas_combined <- eas_pers %>% 
  rename(Trait = name, p_value = value, p_year = year) %>%
  full_join(eas_out %>% rename(Outcome = name, o_value = value, o_year = year)) %>%
  full_join(eas_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(eas_cov, eas_alpha, eas_pers, eas_out, eas_combined,
     file = sprintf("%s/data/clean/eas_cleaned.RData", local_path))
rm(list = ls()[grepl("eas", ls())])

2.3 German Socioeconomic Panel (GSOEP)

The German Socioeconomic Panel Study (GSOEP; Socio-Economic Panel, 2017) is an ongoing longitudinal study of Germans collected by the German Institute of Economic Research (DIW Berlin). The data are freely available at https://www.diw.de/soep by application.

Data have been collected annually since 1984 (the latest data release includes data up to 2017). Participants have been recruited from more than 11,000 households, which are nationally representative of private German households. On average, about 20,000 individuals are sampled each year. It is critical to note that the GSOEP samples households, not individuals, and the households consist of individuals living in both the “old” and “new” federal states (the former West and East Germany), foreigners, and recent immigrants to Germany.

Sample size varies by year, ranging from approximately 10,000 (1989) to 31,000 (2013). This provides 99% power to detect a zero-order correlation effect size of ~.06 at a two-tailed alpha of .05.

2.3.1 Load Data

gsoep_read_fun <- function(Year, WL){
  old.names <- (gsoep_codebook %>% filter(year == Year | category == "proc"))$orig_itemname 
  p <- sprintf("%s/gsoep/%sp.sav", data_path, WL) %>% haven::read_sav(.) %>%
    full_join(sprintf("%s/gsoep/%skind.sav", data_path, WL) %>% haven::read_sav(.)) %>%
    full_join(sprintf("%s/gsoep/%spequiv.sav", data_path, WL) %>% haven::read_sav(.)) %>%
    full_join(sprintf("%s/gsoep/%spgen.sav", data_path, WL) %>% haven::read_sav(.)) %>%
    full_join(sprintf("%s/gsoep/%spkal.sav", data_path, WL) %>% haven::read_sav(.)) %>%
    select(one_of(old.names)) %>%
    gather(key = orig_itemname, value = value, -persnr, -hhnr, na.rm = T)
  
 sprintf("%s/gsoep/%shbrutto.sav", data_path, WL) %>% haven::read_sav(.) %>%
    full_join(sprintf("%s/gsoep/%sh.sav", data_path, WL) %>% haven::read_sav(.)) %>%
    select(one_of(old.names)) %>%
    gather(key = orig_itemname, value = value, -hhnr, na.rm = T) %>%
    full_join(p %>% select(persnr, hhnr) %>% distinct()) %>%
    full_join(p) 
}
gsoep_codebook <- (codebook %>% filter(study == "GSOEP"))$codebook[[1]] %>%
  mutate(orig_itemname = str_to_lower(orig_itemname))
gsoep_codebook
## # A tibble: 250 × 17
##    study dataset category   name  itemname  wave waveletter  year orig_itemname description   scale reverse_code recode  mini
##    <chr> <chr>   <chr>      <chr> <chr>    <dbl> <chr>      <dbl> <chr>         <chr>         <chr> <chr>        <chr>  <dbl>
##  1 gsoep apequiv covariates age   age          1 a           1984 d1110184      Age of Indiv… <NA>  <NA>         ifels…    NA
##  2 gsoep bpequiv covariates age   age          2 b           1985 d1110185      Age of Indiv… <NA>  <NA>         ifels…    NA
##  3 gsoep cpequiv covariates age   age          3 c           1986 d1110186      Age of Indiv… <NA>  <NA>         ifels…    NA
##  4 gsoep dpequiv covariates age   age          4 d           1987 d1110187      Age of Indiv… <NA>  <NA>         ifels…    NA
##  5 gsoep epequiv covariates age   age          5 e           1988 d1110188      Age of Indiv… <NA>  <NA>         ifels…    NA
##  6 gsoep fpequiv covariates age   age          6 f           1989 d1110189      Age of Indiv… <NA>  <NA>         ifels…    NA
##  7 gsoep gpequiv covariates age   age          7 g           1990 d1110190      Age of Indiv… <NA>  <NA>         ifels…    NA
##  8 gsoep hpequiv covariates age   age          8 h           1991 d1110191      Age of Indiv… <NA>  <NA>         ifels…    NA
##  9 gsoep ipequiv covariates age   age          9 i           1992 d1110192      Age of Indiv… <NA>  <NA>         ifels…    NA
## 10 gsoep jpequiv covariates age   age         10 j           1993 d1110193      Age of Indiv… <NA>  <NA>         ifels…    NA
## # ℹ 240 more rows
## # ℹ 3 more variables: maxi <dbl>, comp_rule <chr>, long_rule <chr>
gsoep <- gsoep_codebook %>% 
  select(wave, waveletter, year) %>%
  filter(complete.cases(.)) %>%
  distinct() %>%
  arrange(year) %>%
  filter(year != "2018") %>%
  mutate(data = map2(year, waveletter, gsoep_read_fun)) 

old.names <- unique(gsoep_codebook$orig_itemname)
gsoep_cog <- sprintf("%s/gsoep/cognit.sav", data_path) %>% haven::read_sav(.) %>%
  select(persnr, hhnr, year = syear, one_of(old.names)) %>%
  filter(year == 2012) %>%
  haven::zap_labels(.) %>%
  select(-hhnr) %>%
  gather(key = orig_itemname, value = value, -persnr, -year, na.rm = T)

gsoep_cog_subs <- unique(gsoep_cog$persnr)

gsoep_long <- gsoep %>% unnest(data) %>%
  select(-hhnr) %>%
  filter(persnr %in% gsoep_cog_subs) %>%
  full_join(gsoep_cog) %>%
  select(-wave, -waveletter) %>%
  rename(SID = persnr)

save(gsoep, file = sprintf("%s/data/clean/gsoep_raw.RData", local_path))
rm(gsoep)

2.3.2 Recoding & Reverse Scoring

gsoep_waves <- p_waves %>% filter(Study == "GSOEP") %>% select(Used) %>% distinct()

# join data with recoding info 
gsoep_recode <- gsoep_codebook %>%
  filter(category %in% c("pers", "outcome", "covariates")) %>%
  select(category, name, itemname, wave, year, orig_itemname, reverse_code:long_rule) %>%
  group_by(category, name) %>% 
  nest() %>%
  ungroup() %>%
  mutate(data = map(data, ~(.) %>% left_join(gsoep_long))) 

# recode 
recode_fun <- function(rule, y, year, p_year){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

gsoep_recode <- gsoep_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    mutate(p_year = 2005) %>%
    group_by(recode, year, p_year) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, year, p_year), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))


# reverse code 
gsoep_recode <- gsoep_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
}

2.3.3 Covariates

# composite WITHIN years 
gsoep_cov <- gsoep_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  mutate(data = map(data, ~(.) %>% 
      filter(!is.na(value)) %>%
      mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
      group_by(comp_rule, SID, year, p_year, long_rule) %>%
      nest() %>%
      ungroup() %>%
      mutate(data = map2(data, comp_rule, ~fun_call((.x)$value, .y))) %>%
      unnest(data) %>%
      filter(year <= p_year)
      # summarize(value = fun_call(value, comp_rule)) %>%
      # ungroup() %>% 
      # mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
      ))

comp_fun <- function(d, rule, p_year){
  print(paste(rule, p_year))
  d %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(data, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
gsoep_cov <- gsoep_cov %>%
  unnest(data) %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  filter(year <= p_year) %>%
  group_by(p_year, long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule, p_year), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.3.4 Personality Variables

gsoep_pers <- gsoep_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>%
  filter(year == "2005" & !is.na(value)) %>%
  group_by(SID, p_year, year, name, itemname, comp_rule) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

# Cronbach's alphas
gsoep_alpha <- gsoep_pers %>%
  filter(!is.na(value)) %>%
  select(name, itemname, year, SID, value) %>%
  group_by(name, year) %>%
  nest() %>%
  mutate(data = map(data, ~(.) %>% distinct() %>% pivot_wider(names_from = itemname, values_from = value, values_fn = list(mean))),
         alpha = map(data, possibly(~psych::alpha((.) %>% select(-SID)), NA_real_)))

comp_fun <- function(df, rule){
  df %>%
    group_by(SID) %>%
    summarize(value = fun_call(value, rule)) %>%
    ungroup() %>%
    mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
}

# create composites
gsoep_pers <- gsoep_pers %>%
  group_by(name, comp_rule, year) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map2(data, comp_rule, comp_fun)) %>%
  unnest(data) %>%
  select(-comp_rule)

2.3.5 Outcome Variables

# composite within years
gsoep_out <- gsoep_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  group_by(name, year, SID) %>%
  summarize(value = sum(value, na.rm = T)) %>%
  ungroup()

2.3.6 Combine Data

gsoep_combined <- gsoep_pers %>% 
  rename(Trait = name, p_value = value, p_year = year) %>%
  full_join(gsoep_out %>% rename(Outcome = name, o_value = value, o_year = year)) %>%
  full_join(gsoep_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(gsoep_cov, gsoep_alpha, gsoep_pers, gsoep_out, gsoep_combined,
     file = sprintf("%s/data/clean/gsoep_cleaned.RData", local_path))
rm(list = ls()[grepl("gsoep", ls())])

2.4 Household, Income, and Labour Dynamics in Australia (HILDA)

The Household Income and Labour Dynamics in Australia (HILDA; Wilkins, Laß, Butterworth, & Vera-Toscano, 2019) study is an ongoing longitudinal study of Australian households. These data are available through application from https://melbourneinstitute.unimelb.edu.au/hilda/for-data-users.

The sample comprises more than 17,000 individuals. Data have been collected annually since 2001. The latest data release includes 17 waves of data from 2001 to 2017. More documentation can be found in the HILDA data dictionary at https://www.online.fbe.unimelb.edu.au/HILDAodd/srchSubjectAreas.aspx.

Sample sizes vary by year, ranging from 12,408 (2004) to 17,693 (2016). This provides 99% power to detect a zero-order correlation effect size of ~.03 at a two-tailed alpha of .05.

2.4.1 Load Data

hilda_read_fun <- function(Year, WL){
  old.names <- (hilda_codebook %>% filter(year == Year | year == 0))$orig_itemname
  sprintf("%s/hilda/Combined %s180c.sav", data_path, WL) %>% haven::read_sav(.) %>%
    select(one_of(old.names)) 
}
hilda_codebook <- (codebook %>% filter(study == "HILDA"))$codebook[[1]]
hilda_codebook
## # A tibble: 217 × 17
##    study dataset category   name   itemname wave_letter ...7   year orig_itemname description scale reverse_code recode  mini
##    <chr> <lgl>   <chr>      <chr>  <chr>    <chr>       <chr> <dbl> <chr>         <chr>       <chr> <chr>        <chr>  <dbl>
##  1 hilda NA      covariates educa… educati… A           01     2001 aedhigh1      Highest ed… 1=po… <NA>         ifels…    NA
##  2 hilda NA      covariates educa… educati… B           02     2002 bedhigh1      Highest ed… 1=po… <NA>         ifels…    NA
##  3 hilda NA      covariates educa… educati… C           03     2003 cedhigh1      Highest ed… 1=po… <NA>         ifels…    NA
##  4 hilda NA      covariates educa… educati… D           04     2004 dedhigh1      Highest ed… 1=po… <NA>         ifels…    NA
##  5 hilda NA      covariates educa… educati… E           05     2005 eedhigh1      Highest ed… 1=po… <NA>         ifels…    NA
##  6 hilda NA      covariates educa… educati… F           06     2006 fedhigh1      Highest ed… 1=po… <NA>         ifels…    NA
##  7 hilda NA      covariates educa… educati… G           07     2007 gedhigh1      Highest ed… 1=po… <NA>         ifels…    NA
##  8 hilda NA      covariates educa… educati… H           08     2008 hedhigh1      Highest ed… 1=po… <NA>         ifels…    NA
##  9 hilda NA      covariates educa… educati… I           09     2009 iedhigh1      Highest ed… 1=po… <NA>         ifels…    NA
## 10 hilda NA      covariates educa… educati… J           10     2010 jedhigh1      Highest ed… 1=po… <NA>         ifels…    NA
## # ℹ 207 more rows
## # ℹ 3 more variables: maxi <dbl>, comp_rule <chr>, long_rule <chr>
hilda <- hilda_codebook %>% 
  select(year, wave_letter) %>%
  filter(!is.na(wave_letter)) %>%
  distinct() %>%
  mutate(wave_letter = tolower(wave_letter), 
         data = map2(year, wave_letter, hilda_read_fun)) 

save(hilda, file = sprintf("%s/data/clean/hilda_raw.RData", local_path))

2.4.2 Recoding & Reverse Scoring

hilda_waves <- p_waves %>% filter(Study == "HILDA") %>% select(Used) %>% distinct()

rename_fun <- function(df, Year){
  df <- df %>%
    haven::zap_labels(.) %>%
    gather(key = orig_itemname, value = value, -xwaveid, na.rm=T) %>%
    mutate(value=as.numeric(value))
  hilda_codebook %>% 
    select(category, name, itemname, year, orig_itemname, reverse_code:long_rule) %>%
    filter(year == Year & category %in% c("pers", "outcome", "covariates")) %>%
    full_join(df)
}

# join data with recoding info 
hilda_recode <- hilda %>%
  mutate(data = map2(data, year, rename_fun)) %>% 
  select(-year, -wave_letter) %>%
  unnest(data) %>%
  distinct() %>%
  rename(SID = xwaveid) %>%
  group_by(category, name) %>%
  nest() %>% 
  ungroup()

# recode 
recode_fun <- function(rule, y, year, p_year){
  # rule strings stored in the codebook refer to the raw values as `x`
  # and may also reference `year` and `p_year`
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}
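Concretely, each entry in the codebook's `recode` column is a string of R code that refers to the raw values as `x` and is executed with `eval(parse())`. A hypothetical example (the rule string here is invented for illustration; real rules come from the codebook):

```r
# Hypothetical codebook recode rule: -1 codes a missing response.
# Real rules live in the codebook's `recode` column, not this string.
rule <- "ifelse(x == -1, NA, x)"
x <- c(3, -1, 5)          # raw values, as bound inside recode_fun
eval(parse(text = rule))  # -1 becomes NA; other values pass through
```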

hilda_recode <- hilda_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    mutate(p_year = 2005) %>%
    group_by(recode, year, p_year) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, year, p_year), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))


# reverse code 
hilda_recode <- hilda_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
  }

2.4.3 Covariates

# composite WITHIN years 
hilda_cov <- hilda_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  mutate(data = map(data, ~(.) %>% 
      filter(!is.na(value)) %>%
      mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
      group_by(comp_rule, SID, year, p_year, long_rule) %>%
      nest() %>%
      ungroup() %>%
      mutate(data = map2(data, comp_rule, ~fun_call((.x)$value, .y))) %>%
      unnest(data) %>%
      filter(year <= p_year)
      # summarize(value = fun_call(value, comp_rule)) %>%
      # ungroup() %>% 
      # mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
      ))

comp_fun <- function(d, rule, p_year){
  print(paste(rule, p_year))
  d %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(data, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
hilda_cov <- hilda_cov %>%
  unnest(data) %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  filter(year <= p_year) %>%
  group_by(p_year, long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule, p_year), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.4.4 Personality Variables

hilda_pers <- hilda_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>%
  filter(year == "2005" & !is.na(value)) %>%
  group_by(SID, p_year, year, name, itemname, comp_rule) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

# Cronbach's alphas
hilda_alpha <- hilda_pers %>%
  filter(!is.na(value)) %>%
  select(name, itemname, year, SID, value) %>%
  group_by(name, year) %>%
  nest() %>%
  mutate(data = map(data, ~(.) %>% distinct() %>% pivot_wider(names_from = itemname, values_from = value, values_fn = list(mean))),
         alpha = map(data, possibly(~psych::alpha((.) %>% select(-SID)), NA_real_)))

comp_fun <- function(df, rule){
  df %>%
    group_by(SID) %>%
    summarize(value = fun_call(value, rule)) %>%
    ungroup() %>%
    mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
}

# create composites
hilda_pers <- hilda_pers %>%
  group_by(name, comp_rule, year) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map2(data, comp_rule, comp_fun)) %>%
  unnest(data) %>%
  select(-comp_rule)

2.4.5 Outcome Variables

# composite within years
hilda_out <- hilda_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  group_by(name, year, SID) %>%
  summarize(value = sum(value, na.rm = T)) %>%
  ungroup()

2.4.6 Combine Data

hilda_combined <- hilda_pers %>% 
  rename(Trait = name, p_value = value, p_year = year) %>%
  full_join(hilda_out %>% rename(Outcome = name, o_value = value, o_year = year)) %>%
  full_join(hilda_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(hilda_cov, hilda_alpha, hilda_pers, hilda_out, hilda_combined,
     file = sprintf("%s/data/clean/hilda_cleaned.RData", local_path))
rm(list = ls()[grepl("hilda", ls())])

2.5 Health and Retirement Study (HRS)

The Health and Retirement Study [HRS; (juster1995overview?)] is an ongoing longitudinal study of households in the United States. These data are available at https://hrs.isr.umich.edu by creating a free account.

More than 35,000 individuals have been recruited from the financial households of people born between 1931 and 1941 in the US. Data have been collected biennially since 1992, and the latest data release includes data up to 2016. On average, roughly 10,000 individuals are sampled each wave. More information on the HRS can be found at https://hrs.isr.umich.edu/documentation/survey-design, but, in short, the HRS is a nationally representative sample of adults over 50 in the US. It is critical to note that the HRS samples households of the original cohort and follows individuals and their spouses or partners until their death.

Sample sizes vary by year, ranging from approximately 7,500 (2014) to 15,500 (1992) (response rates: https://hrs.isr.umich.edu/sites/default/files/biblio/ResponseRates_2017.pdf). This provides 99% power to detect a zero-order correlation effect size of ~.04, two-tailed at alpha .05.

2.5.1 Load Data

hrs_read_fun <- function(year) {
  read_da <- function(da, dct, Year){
    print(paste(da, dct, Year, sep = " "))
    data.file <- sprintf("%s/hrs/%s/%s", data_path, Year, da)
    # Set path to the dictionary file "*.DCT"
    dict.file <- sprintf("%s/hrs/%s/%s", data_path, Year, dct)
    # Read the dictionary file
    df.dict <- read.table(dict.file, skip = 1, fill = TRUE, stringsAsFactors = FALSE)
    # Set column names for dictionary dataframe
    colnames(df.dict) <- c("col.num","col.type","col.name","col.width","col.lbl")
    # Remove last row which only contains a closing }
    row <- which(df.dict$col.name == "HHID")
    df.dict <- df.dict[-nrow(df.dict),]
    if(row == 2){df.dict <- df.dict[-1,]}
    # Extract numeric value from column width field
    df.dict$col.width <- as.integer(sapply(df.dict$col.width, gsub, pattern = "[^0-9\\.]", replacement = ""))
    # Convert column types to format to be used with read_fwf function
    df.dict$col.type <- sapply(df.dict$col.type, function(x) ifelse(x %in% c("int","byte","long"), "i", ifelse(x == "float", "n", ifelse(x == "double", "d", "c"))))
    # Read the data file into a dataframe
    df <- read_fwf(file = data.file, fwf_widths(widths = df.dict$col.width, col_names = df.dict$col.name), col_types = paste(df.dict$col.type, collapse = ""))
    # Add column labels to headers
    attributes(df)$variable.labels <- df.dict$col.lbl
    old.names <- (hrs_codebook %>% filter(year == Year))$orig_itemname
    if(any(c("PN", "HHID") %in% colnames(df)) & any(old.names %in% colnames(df))){
    # if(any(c("PN", "HHID") %in% colnames(df))){
      df <- df %>%
      mutate(hhidpn = 1000*as.numeric(HHID) + as.numeric(PN)) %>%
      select(one_of(c("PN", "HHID")), one_of(old.names)) %>%
        distinct()
      # gather(key = item, value = value, -hhidpn)
    } else {df <- NA}
    return(df)
  }
    # Set path to the data file "*.DA"
  files <- list.files(sprintf("%s/hrs/%s", data_path, year))
  df2 <- tibble(
    da = files[grepl(".da",  files) | grepl(".DA", files)],
    dct = files[grepl(".dct", files) | grepl(".DCT", files)]
  ) %>%
    mutate(data = map2(da, dct, possibly(~read_da(.x, .y, year), NA_real_))) %>%
    filter(!is.na(data)) %>%
    select(-da, -dct) 
  if(nrow(df2) != 0){df2$data %>% reduce(full_join) %>% distinct()} else {NA}
}
hrs_codebook <- (codebook %>% filter(study == "HRS"))$codebook[[1]] %>% 
  mutate(orig_itemname = str_to_upper(orig_itemname)) %>%
  mutate_at(vars(orig_itemname, name, itemname), ~str_remove_all(., "[[:space:]]"))
hrs_codebook
## # A tibble: 183 × 17
##    study dataset category   name    itemname  wave waveletter  year orig_itemname description scale reverse_code recode  mini
##    <chr> <chr>   <chr>      <chr>   <chr>    <dbl> <lgl>      <dbl> <chr>         <chr>       <chr> <chr>        <chr>  <dbl>
##  1 hrs   Rand    covariates educat… eduYears     1 NA          1992 RAEDYRS       R Years of… ".M=… <NA>         ifels…    NA
##  2 hrs   Rand    covariates gender  demgend…     1 NA          1992 RAGENDER      R Gender    ".M=… <NA>         ifels…    NA
##  3 hrs   Rand    covariates SRheal… rHlthSR      1 NA          1992 R1SHLT        W1 Self-re… "1.E… yes          ifels…     1
##  4 hrs   Rand    covariates SRheal… rHlthSR      2 NA          1994 R2SHLT        W2 Self-re… "1.E… yes          ifels…     1
##  5 hrs   Rand    covariates SRheal… rHlthSR      3 NA          1996 R3SHLT        W3 Self-re… "1.E… yes          ifels…     1
##  6 hrs   Rand    covariates SRheal… rHlthSR      4 NA          1998 R4SHLT        W4 Self-re… "1.E… yes          ifels…     1
##  7 hrs   Rand    covariates SRheal… rHlthSR      5 NA          2000 R5SHLT        W5 Self-re… "1.E… yes          ifels…     1
##  8 hrs   Rand    covariates SRheal… rHlthSR      6 NA          2002 R6SHLT        W6 Self-re… "1.E… yes          ifels…     1
##  9 hrs   Rand    covariates SRheal… rHlthSR      7 NA          2004 R7SHLT        W7 Self-re… "1.E… yes          ifels…     1
## 10 hrs   Rand    covariates SRheal… rHlthSR      8 NA          2006 R8SHLT        W8 Self-re… "1.E… yes          ifels…     1
## # ℹ 173 more rows
## # ℹ 3 more variables: maxi <dbl>, comp_rule <chr>, long_rule <chr>
old.names <- unique(hrs_codebook$orig_itemname)
hrs.paq <- tibble(
  year = sprintf("%s/hrs", data_path) %>% list.files(., pattern = "^[0-9]")
  , data = map(year, hrs_read_fun)
  , names = map(data, colnames)
  ) %>%
  filter(!is.na(data))

old.names <- unique((hrs_codebook %>% filter(dataset == "Rand"))$orig_itemname)
hrs.rand <- sprintf("%s/hrs/randhrs1992_2016v1.sav", data_path) %>% 
  haven::read_sav(.) %>%
  haven::zap_labels(.) %>%
  select(SID = HHIDPN, one_of(old.names)) %>%
  gather(key = orig_itemname, value = value, -SID, na.rm = T)


hrs_long <- hrs.paq %>%
  mutate(data = map(data, ~(.) %>% 
      gather(key = orig_itemname, value = value, -HHID, -PN))) %>%
  select(-names, -year) %>%
  unnest(data) %>%
  mutate(SID = 1000*as.numeric(HHID) + as.numeric(PN)) %>%
  select(-PN, -HHID) 

hrs.subs <- unique(hrs_long$SID)[unique(hrs_long$SID) %in% unique(hrs.rand$SID)]
hrs_long <- hrs_long %>%
  bind_rows(hrs.rand %>% select(orig_itemname, value, SID)) %>%
  filter(SID %in% hrs.subs)

save(hrs.rand, hrs.paq, file = sprintf("%s/data/clean/hrs_raw.RData", local_path))
rm(list = c("hrs.paq", "hrs.rand"))

2.5.2 Recoding & Reverse Scoring

hrs_waves <- p_waves %>% filter(Study == "HRS") %>% select(Used) %>% distinct()

# join data with recoding info 
hrs_recode <- hrs_codebook %>%
  filter(category %in% c("pers", "outcome", "covariates")) %>%
  select(category, name, itemname, wave, year, orig_itemname, reverse_code:long_rule) %>%
  group_by(category, name) %>% 
  nest() %>%
  ungroup() %>%
  mutate(data = map(data, ~(.) %>% left_join(hrs_long))) 

# recode 
recode_fun <- function(rule, y, year, p_year){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

hrs_recode <- hrs_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    mutate(year = mapvalues(year, seq(2006, 2016, 2), rep(c(2006, 2010, 2014), each = 2)),
           p_year = 2006) %>%
    group_by(recode, year, p_year) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, year, p_year), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))


# reverse code 
hrs_recode <- hrs_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
  }

2.5.3 Covariates

# compositing within years
year_comp_fun <- function(df, rule){
  df %>%
    # group by person and item (collapse across age)
    group_by(SID, name, year, p_year, long_rule) %>% 
    summarize(value = fun_call(value, rule)) %>%
    ungroup() %>% 
    mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
}

# composite WITHIN years 
hrs_cov <- hrs_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  unnest(data) %>%
  mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
  group_by(comp_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map2(data, comp_rule, year_comp_fun)) %>%
  unnest(data)

comp_fun <- function(d, rule, p_year){
  d %>%
    filter(year <= p_year) %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(value, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
hrs_cov <- hrs_cov %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  group_by(p_year, long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule, p_year), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.5.4 Personality Variables

hrs_pers <- hrs_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>%
  filter(year == "2006" & !is.na(value)) %>%
  group_by(SID, p_year, year, name, itemname, comp_rule) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

# Cronbach's alphas
hrs_alpha <- hrs_pers %>%
  filter(!is.na(value)) %>%
  select(name, itemname, year, SID, value) %>%
  group_by(name, year) %>%
  nest() %>%
  mutate(data = map(data, ~(.) %>% distinct() %>% pivot_wider(names_from = itemname, values_from = value, values_fn = list(mean))),
         alpha = map(data, possibly(~psych::alpha((.) %>% select(-SID)), NA_real_)))

comp_fun <- function(df, rule){
  df %>%
    group_by(SID) %>%
    summarize(value = fun_call(value, rule)) %>%
    ungroup() %>%
    mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
}

# create composites
hrs_pers <- hrs_pers %>%
  group_by(name, comp_rule, year) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map2(data, comp_rule, comp_fun)) %>%
  unnest(data) %>%
  select(-comp_rule)

2.5.5 Outcome Variables

# composite within years
hrs_out <- hrs_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  group_by(name, year, SID, long_rule) %>%
  summarize(value = sum(value, na.rm = T)) %>%
  filter(year == 2010) %>%
  ungroup() %>%
  select(-long_rule)

2.5.6 Combine Data

hrs_combined <- hrs_pers %>% 
  rename(Trait = name, p_value = value, p_year = year) %>%
  full_join(hrs_out %>% rename(Outcome = name, o_value = value, o_year = year)) %>%
  full_join(hrs_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(hrs_cov, hrs_alpha, hrs_pers, hrs_out, hrs_combined,
     file = sprintf("%s/data/clean/hrs_cleaned.RData", local_path))
rm(list = ls()[grepl("hrs", ls())])

2.6 Longitudinal Aging Study Amsterdam (LASA)

The Longitudinal Aging Study Amsterdam (LASA) is an ongoing longitudinal study that began in 1992. These data are available through application at https://www.lasa-vu.nl/data/availability_data/availability_data.htm.

Participants who were between 55 and 84 years old at the start of the study were recruited from the NESTOR study on Living Arrangements and Social Networks of older adults, which was randomly sampled from municipality registers in 1992, with an oversampling of the oldest old and of men. Data are collected approximately every three years (1992, 1995, 1998, 2001, 2005, 2008, 2011, 2015, 2018). New cohorts are introduced every 10 years, with Cohorts 2 and 3 recruited in 2002 and 2012, respectively. Additional information and documentation are available at https://www.lasa-vu.nl/index.htm.

Sample sizes vary by year, ranging from 1,522 (Wave H; 763 Cohort 1, 759 Cohort 2) to 3,107 (Wave B, Cohort 1). This provides 99% power to detect a zero-order correlation effect size of ~.10, two-tailed at alpha .05.

2.6.1 Load Data

lasa_read_fun <- function(x){
  d <- sprintf("%s/lasa/%s", data_path, x) %>% 
    read_sav(.) %>% 
    select(one_of(lasa_sids),
           one_of(old.names), 
           one_of(str_to_lower(old.names)),
           one_of(str_to_upper(old.names))) %>%
    as_tibble() %>%
    haven::zap_labels(.)
  colnames(d) <- str_to_lower(colnames(d))
  return(d)
}
lasa_codebook <- (codebook %>% filter(study == "LASA"))$codebook[[1]]
lasa_codebook
## # A tibble: 105 × 17
##    study dataset  category   name  itemname wave  wave_letter  year orig_itemname description scale reverse_code recode  mini
##    <chr> <chr>    <chr>      <chr> <chr>    <lgl> <chr>       <dbl> <chr>         <chr>       <chr> <chr>        <chr>  <dbl>
##  1 LASA  LASAz004 covariates educ… eduYears NA    b            1992 aedu          education … "no … <NA>         ifels…    NA
##  2 LASA  LASAz004 covariates gend… sex      NA    b            1992 sex           sex respon… "mal… <NA>         ifels…    NA
##  3 LASA  LASAb036 covariates SRhe… SRhealth NA    b            1992 bsubhea1      Self-perce… "na,… yes          ifels…     1
##  4 LASA  LASAc036 covariates SRhe… SRhealth NA    c            1995 csubhea1      Self-perce… "na,… yes          ifels…     1
##  5 LASA  LASAd036 covariates SRhe… SRhealth NA    d            1998 dsubhea1      Self-perce… "na,… yes          ifels…     1
##  6 LASA  LASAe036 covariates SRhe… SRhealth NA    e            2001 esubhea1      Self-perce… "na,… yes          ifels…     1
##  7 LASA  LASAf036 covariates SRhe… SRhealth NA    f            2005 fsubhea1      Self-perce… "na,… yes          ifels…     1
##  8 LASA  LASAg036 covariates SRhe… SRhealth NA    g            2008 gsubhea1      Self-perce… "na,… yes          ifels…     1
##  9 LASA  LASAh036 covariates SRhe… SRhealth NA    h            2011 hsubhea1      Self-perce… "na,… yes          ifels…     1
## 10 LASA  LASAi036 covariates SRhe… SRhealth NA    i            2015 isubhea1      Self-perce… "na,… yes          ifels…     1
## # ℹ 95 more rows
## # ℹ 3 more variables: maxi <dbl>, comp_rule <chr>, long_rule <chr>
lasa_sids <- c("respnr", "RESPNR", "RespNr", "Respnr")
old.names <- unique(lasa_codebook$orig_itemname)
datasets <- sprintf("%s/lasa", data_path) %>% list.files(., pattern = "SAV")
lasa <- tibble(datasets = datasets) %>%
  mutate(data = map(datasets, lasa_read_fun),
         ncol = map_dbl(data, ncol)) %>%
  filter(ncol > 1)

lasa <- reduce(lasa$data, full_join) %>% 
  haven::zap_labels(.) 

lasa_long <- lasa %>%
  rename(SID = respnr) %>%
  pivot_longer(-SID
               , names_to = "orig_itemname"
               , values_to = "value"
               , values_drop_na = T)

save(lasa, file = sprintf("%s/data/clean/lasa_raw.RData", local_path))

2.6.2 Recoding & Reverse Scoring

# join data with recoding info 
lasa_recode <- lasa_codebook %>%
  select(category, name, itemname, year, orig_itemname, reverse_code:long_rule) %>%
  mutate(orig_itemname = str_to_lower(orig_itemname)) %>%
  filter(category %in% c("pers", "outcome", "covariates")) %>%
  group_by(category, name) %>%
  nest() %>% 
  ungroup() %>%
  mutate(data = map(data, ~(.) %>% left_join(lasa_long)))

# recode 
recode_fun <- function(rule, y, year){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

lasa_recode <- lasa_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    group_by(recode, year) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, year), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))


# reverse code 
lasa_recode <- lasa_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
}

lasa_waves <- lasa_recode %>%
  filter(category == "pers") %>%
  unnest(data) %>%
  group_by(SID) %>%
  summarize(p_year = min(year)) %>%
  ungroup()

2.6.3 Covariates

# composite WITHIN years 
lasa_cov <- lasa_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  mutate(data = map(data, ~(.) %>% 
      filter(!is.na(value)) %>%
      left_join(lasa_waves) %>%
      mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
      group_by(comp_rule, SID, year, p_year, long_rule) %>%
      nest() %>%
      ungroup() %>%
      mutate(data = map2(data, comp_rule, ~fun_call((.x)$value, .y))) %>%
      unnest(data) %>%
      filter(year <= p_year)
      ))

comp_fun <- function(d, rule, p_year){
  print(paste(rule, p_year))
  d %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(data, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
lasa_cov <- lasa_cov %>%
  unnest(data) %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  filter(year <= p_year) %>%
  group_by(p_year, long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule, p_year), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.6.4 Personality Variables

lasa_pers <- lasa_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>%
  left_join(lasa_waves) %>%
  filter(year == p_year) %>%
  filter(!is.na(value)) %>%
  group_by(SID, year, p_year, name, itemname, comp_rule) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

# Cronbach's alphas
lasa_alpha <- lasa_pers %>%
  filter(!is.na(value)) %>%
  select(name, itemname, year, SID, value) %>%
  group_by(name) %>%
  nest() %>%
  mutate(data = map(data, ~(.) %>% distinct() %>% pivot_wider(names_from = itemname, values_from = value, values_fn = list(mean))),
         alpha = map(data, possibly(~psych::alpha((.) %>% select(-SID)), NA_real_)))

comp_fun <- function(df, rule){
  df %>%
    group_by(SID) %>%
    summarize(value = fun_call(value, rule)) %>%
    ungroup() %>%
    mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
}

# create composites
lasa_pers <- lasa_pers %>%
  group_by(name, comp_rule, year) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map2(data, comp_rule, comp_fun)) %>%
  unnest(data) %>%
  select(-comp_rule)

2.6.5 Outcome Variables

lasa_cog_waves <- lasa_recode %>%
  filter(category == "outcome") %>%
  unnest(data) %>%
  group_by(name, SID) %>%
  summarize(o_year = max(year)) %>%
  ungroup()
# composite within years
lasa_out <- lasa_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  right_join(lasa_cog_waves) %>%
  filter(year == o_year) %>%
  group_by(name, year, SID) %>%
  summarize(value = sum(value, na.rm = T)) %>%
  ungroup()

2.6.6 Combine Data

lasa_combined <- lasa_pers %>% 
  rename(Trait = name, p_value = value, p_year = year) %>%
  full_join(
    lasa_out %>% 
      rename(Outcome = name, o_value = value, o_year = year)
    ) %>% full_join(lasa_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(lasa_cov, lasa_alpha, lasa_pers, lasa_out, lasa_combined,
     file = sprintf("%s/data/clean/lasa_cleaned.RData", local_path))
rm(list = ls()[grepl("lasa", ls())])

2.7 Memory and Aging Project (MAP)

The RUSH Memory and Aging Project (MAP) is an ongoing longitudinal study that began in 1997 (a2012overview?). These data are available through application at https://www.radc.rush.edu/requests.htm.

Participants who were 65 and older were recruited from retirement communities and subsidized senior housing facilities throughout Chicagoland and northeastern Illinois beginning in 1997. Data are collected annually, and all participants are organ donors. Additional participants are recruited each year. Additional information and documentation on the data can be found at https://www.radc.rush.edu/docs/var/variables.htm.

Sample sizes vary by year, ranging from 52 (1997) to 2,205 participants. This provides 99% power to detect a zero-order correlation effect size of ~.09, two-tailed at alpha .05.

2.7.1 Load Data

(map_codebook <- (codebook %>% filter(study == "RADC-MAP"))$codebook[[1]] %>%
  mutate(orig_itemname = str_to_lower(orig_itemname)))
## # A tibble: 7 × 15
##   study    dataset category   name   itemname year  orig_itemname description scale reverse_code recode mini  maxi  comp_rule
##   <chr>    <lgl>   <chr>      <chr>  <chr>    <chr> <chr>         <chr>       <chr> <chr>        <chr>  <lgl> <lgl> <chr>    
## 1 RADC-MAP NA      outcome    cryst… bosNami… long… cts_bname     Boston nam…  <NA> no           ifels… NA    NA    sum      
## 2 RADC-MAP NA      covariates age    ageBase… 0     age_bl        The age at…  <NA> no           ifels… NA    NA    skip     
## 3 RADC-MAP NA      covariates educa… eduYears 0     educ          The years …  <NA> no           ifels… NA    NA    skip     
## 4 RADC-MAP NA      covariates gender sex      0     msex          Self-repor… "1 =… no           ifels… NA    NA    skip     
## 5 RADC-MAP NA      pers       C      C        0     conscientiou… The variab… "Res… no           ifels… NA    NA    average  
## 6 RADC-MAP NA      pers       E      E        0     extraversion… The variab… "Res… no           ifels… NA    NA    average  
## 7 RADC-MAP NA      pers       N      N        0     neuroticism_… The variab… "Res… no           ifels… NA    NA    average  
## # ℹ 1 more variable: long_rule <chr>
map <- sprintf("%s/rush-radc/dataset_1034_long_03-25-2021.xlsx", data_path) %>% read_excel() %>%
  full_join(sprintf("%s/rush-radc/dataset_1034_basic_03-25-2021.xlsx", data_path) %>% read_excel()) %>%
  filter(study == "MAP")

2.7.2 Recoding & Reverse Scoring

rename_fun <- function(cb, var){
  old.names <- unique((map_codebook %>% filter(name == var))$orig_itemname)
  df <- map %>% 
    select(SID = projid, wave = fu_year, one_of(old.names)) %>%
    gather(key = orig_itemname, value = value, -SID, -wave, na.rm = T) %>%
    left_join(cb %>% select(itemname, orig_itemname, reverse_code:long_rule))
}

# join data with recoding info 
map_recode <- map_codebook %>%
  select(category, name, itemname, year, orig_itemname, reverse_code:long_rule) %>%
  filter(category %in% c("pers", "outcome", "covariates")) %>%
  group_by(category, name) %>%
  nest() %>% 
  ungroup() %>%
  mutate(data = map2(data, name, rename_fun))

# recode 
recode_fun <- function(rule, y, year){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

map_recode <- map_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    mutate(wave = as.numeric(wave)) %>%
    group_by(recode, wave) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, wave), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))

# reverse code 
map_recode <- map_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
}

2.7.3 Covariates

# composite WITHIN years 
map_cov <- map_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  mutate(data = map(data, ~(.) %>% 
      filter(!is.na(value)) %>%
      mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
      group_by(comp_rule, SID, wave, long_rule) %>%
      nest() %>%
      ungroup() %>%
      mutate(data = map2(data, comp_rule, ~fun_call((.x)$value, .y))) %>%
      unnest(data) %>%
      filter(wave == 0)
      ))

comp_fun <- function(d, rule){
  d %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(data, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
map_cov <- map_cov %>%
  unnest(data) %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  group_by(long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.7.4 Personality Variables

map_pers <- map_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>% 
  distinct() %>%
  group_by(SID, name) %>%
  filter(wave == min(wave)) %>%
  ungroup() %>%
  select(SID, name, wave, value)

2.7.5 Outcome Variables

map_cog_waves <- map_recode %>%
  filter(category == "outcome") %>%
  unnest(data) %>%
  group_by(SID, name) %>%
  summarize(o_year = max(wave)) %>%
  ungroup()

# composite within years
map_out <- map_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  right_join(map_cog_waves) %>%
  filter(wave == o_year & wave > 0) %>%
  filter(!is.na(value)) %>%
  group_by(name, itemname, wave) %>%
  mutate(value = pomp(value)) %>%
  group_by(name, wave, SID) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

2.7.6 Combine Data

map_combined <- map_pers %>% 
  rename(Trait = name, p_value = value, p_year = wave) %>%
  full_join(
    map_out %>% 
      rename(Outcome = name, o_value = value, o_year = wave)
    ) %>% full_join(map_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(map_cov, map_pers, map_out, map_combined,
     file = sprintf("%s/data/clean/map_cleaned.RData", local_path))
rm(list = ls()[grepl("map", ls())])

2.8 Minority Aging Research Study (MARS)

The RUSH Minority Aging Research Study (MARS) is an ongoing longitudinal study that began in 2004. These data are available through application at https://www.radc.rush.edu/requests.htm.

Participants were Black individuals 65 and older who were recruited from community locations in the Chicago Metropolitan area and suburbs beginning in 2004. Data are collected annually. Additional participants are recruited each year. Additional information and documentation on the data can be found at https://www.radc.rush.edu/docs/var/variables.htm.

Sample sizes vary by year, ranging from 80 (2004) to 790 (2020). This provides 99% power to detect a zero-order correlation effect size of ~.17, two-tailed at alpha .05.
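Power statements like this one can be checked with a standard power analysis for a correlation. A sketch using the `pwr` package (an assumption here; these materials do not load `pwr` themselves):

```r
library(pwr)

# Achieved power to detect r = .17 with the largest MARS wave (n = 790),
# two-tailed, alpha = .05; the result exceeds .99.
pwr.r.test(n = 790, r = .17, sig.level = .05,
           alternative = "two.sided")$power
```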

(mars_codebook <- (codebook %>% filter(study == "MARS"))$codebook[[1]] %>%
  mutate(orig_itemname = str_to_lower(orig_itemname)))
## # A tibble: 5 × 15
##   study dataset category   name      itemname year  orig_itemname description scale reverse_code recode mini  maxi  comp_rule
##   <chr> <lgl>   <chr>      <chr>     <chr>    <chr> <chr>         <chr>       <chr> <chr>        <chr>  <lgl> <lgl> <chr>    
## 1 MARS  NA      outcome    crystall… bosNami… long… cts_bname     Boston nam…  <NA> no           ifels… NA    NA    sum      
## 2 MARS  NA      covariates age       ageBase… base… age_bl        The age at…  <NA> no           ifels… NA    NA    skip     
## 3 MARS  NA      covariates education eduYears base… educ          The years …  <NA> no           ifels… NA    NA    skip     
## 4 MARS  NA      covariates gender    sex      base… msex          Self-repor… "1 =… no           ifels… NA    NA    skip     
## 5 MARS  NA      pers       N         N        base… neuroticism_… The variab… "Res… no           ifels… NA    NA    average  
## # ℹ 1 more variable: long_rule <chr>
mars <- sprintf("%s/rush-radc/dataset_1034_long_03-26-2021.xlsx", data_path) %>% read_excel() %>%
  full_join(sprintf("%s/rush-radc/dataset_1034_basic_03-26-2021.xlsx", data_path) %>% read_excel()) %>%
  filter(study == "MARS")

2.8.1 Recoding & Reverse Scoring

rename_fun <- function(cb, var){
  old.names <- unique((mars_codebook %>% filter(name == var))$orig_itemname)
  df <- mars %>% 
    select(SID = projid, wave = fu_year, one_of(old.names)) %>%
    gather(key = orig_itemname, value = value, -SID, -wave, na.rm = T) %>%
    left_join(cb %>% select(itemname, orig_itemname, reverse_code:long_rule))
}

# join data with recoding info 
mars_recode <- mars_codebook %>%
  select(category, name, itemname, year, orig_itemname, reverse_code:long_rule) %>%
  filter(category %in% c("pers", "outcome", "covariates")) %>%
  group_by(category, name) %>%
  nest() %>% 
  ungroup() %>%
  mutate(data = map2(data, name, rename_fun))

# recode 
recode_fun <- function(rule, y, year){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

mars_recode <- mars_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    mutate(wave = as.numeric(wave)) %>%
    group_by(recode, wave) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, wave), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))

# reverse code 
mars_recode <- mars_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))
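The reverse-scoring step relies on `psych::reverse.code()`, which recodes a flagged item as `mini + maxi - value`. A small illustration for a 1-5 response scale:

```r
library(psych)

# keys = -1 flags the single item for reversal, so each value v
# becomes mini + maxi - v (e.g., 1 -> 5, 2 -> 4, 5 -> 1).
reverse.code(keys = -1, items = c(1, 2, 5), mini = 1, maxi = 5)
```

Note that `reverse.code()` returns its result as a (one-column) matrix of reversed scores.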

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
}
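`fun_call()` simply dispatches each codebook compositing rule to the matching summary function (with `Mode()` as defined at the top of this chapter). For example:

```r
x <- c(2, 2, 3, NA)
fun_call(x, "average")  # 2.33 (mean, NAs removed)
fun_call(x, "mode")     # 2
fun_call(x, "sum")      # 7
fun_call(x, "skip")     # 2 (first unique value, used for single-item variables)
```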

2.8.2 Covariates

# composite WITHIN years 
mars_cov <- mars_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  mutate(data = map(data, ~(.) %>% 
      filter(!is.na(value)) %>%
      mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
      group_by(comp_rule, SID, wave, long_rule) %>%
      nest() %>%
      ungroup() %>%
      mutate(data = map2(data, comp_rule, ~fun_call((.x)$value, .y))) %>%
      unnest(data) %>%
      filter(wave == 0)
      ))

comp_fun <- function(d, rule){
  d %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(data, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
mars_cov <- mars_cov %>%
  unnest(data) %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  group_by(long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.8.3 Personality Variables

mars_pers <- mars_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>% 
  distinct() %>%
  group_by(SID, name) %>%
  filter(wave == min(wave)) %>%
  ungroup() %>%
  select(SID, name, wave, value)

2.8.4 Outcome Variables

mars_cog_waves <- mars_recode %>%
  filter(category == "outcome") %>%
  unnest(data) %>%
  group_by(SID, name) %>%
  summarize(o_year = max(wave)) %>%
  ungroup()

# composite within years
mars_out <- mars_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  right_join(mars_cog_waves) %>%
  filter(wave == o_year & wave > 0) %>%
  filter(!is.na(value)) %>%
  group_by(name, itemname, wave) %>%
  mutate(value = pomp(value)) %>%
  group_by(name, wave, SID) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

2.8.5 Combine Data

mars_combined <- mars_pers %>% 
  rename(Trait = name, p_value = value, p_year = wave) %>%
  full_join(
    mars_out %>% 
      rename(Outcome = name, o_value = value, o_year = wave)
    ) %>% full_join(mars_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(mars_cov, mars_pers, mars_out, mars_combined,
     file = sprintf("%s/data/clean/mars_cleaned.RData", local_path))
rm(list = ls()[grepl("mars", ls())])

2.9 ROS

The RUSH Religious Orders Study (ROS) is an ongoing longitudinal study that began in 1994. These data are available, through application, from https://www.radc.rush.edu/requests.htm.

Older (65 and above) Catholic nuns, priests, and brothers with no prior dementia diagnosis who agreed to annual evaluations and eventual organ donation were recruited from more than 40 groups across the United States. Additional participants are recruited each year. Additional information and documentation on the data can be found at https://www.radc.rush.edu/docs/var/variables.htm.

Sample sizes vary by year, from 353 participants (1994) to 1487 participants, including 797 deceased participants with autopsy data (2019, 2020). This provides 99% power to detect a zero-order correlation effect size of ~.11, two-tailed at alpha .05.

2.9.1 Load Data

(ros_codebook <- (codebook %>% filter(study == "ROS"))$codebook[[1]] %>%
  mutate(orig_itemname = str_to_lower(orig_itemname)))
## # A tibble: 9 × 15
##   study dataset category   name      itemname year  orig_itemname description scale reverse_code recode mini  maxi  comp_rule
##   <chr> <lgl>   <chr>      <chr>     <chr>    <chr> <chr>         <chr>       <chr> <chr>        <chr>  <lgl> <lgl> <chr>    
## 1 ROS   NA      outcome    crystall… bosNami… long… cts_bname     Boston nam…  <NA> no           ifels… NA    NA    sum      
## 2 ROS   NA      covariates age       ageBase… base… age_bl        The age at…  <NA> no           ifels… NA    NA    skip     
## 3 ROS   NA      covariates education eduYears base… educ          The years …  <NA> no           ifels… NA    NA    skip     
## 4 ROS   NA      covariates gender    sex      base… msex          Self-repor… "1 =… no           ifels… NA    NA    skip     
## 5 ROS   NA      pers       A         A        base… agreeableness The variab… "Res… no           ifels… NA    NA    average  
## 6 ROS   NA      pers       C         C        base… conscientiou… The variab… "Res… no           ifels… NA    NA    average  
## 7 ROS   NA      pers       E         E        base… extraversion… The variab… "Res… no           ifels… NA    NA    average  
## 8 ROS   NA      pers       N         N        base… neuroticism_… The variab… "Res… no           ifels… NA    NA    average  
## 9 ROS   NA      pers       O         O        base… openness      The variab… "Res… no           ifels… NA    NA    average  
## # ℹ 1 more variable: long_rule <chr>
ros <- sprintf("%s/rush-radc/dataset_1034_long_03-25-2021.xlsx", data_path) %>% read_excel() %>%
  full_join(sprintf("%s/rush-radc/dataset_1034_basic_03-25-2021.xlsx", data_path) %>% read_excel()) %>%
  filter(study == "ROS")

2.9.2 Recoding & Reverse Scoring

rename_fun <- function(cb, var){
  old.names <- unique((ros_codebook %>% filter(name == var))$orig_itemname)
  df <- ros %>% 
    select(SID = projid, wave = fu_year, one_of(old.names)) %>%
    gather(key = orig_itemname, value = value, -SID, -wave, na.rm = T) %>%
    left_join(cb %>% select(itemname, orig_itemname, reverse_code:long_rule))
}

# join data with recoding info 
ros_recode <- ros_codebook %>%
  select(category, name, itemname, year, orig_itemname, reverse_code:long_rule) %>%
  filter(category %in% c("pers", "outcome", "covariates")) %>%
  group_by(category, name) %>%
  nest() %>% 
  ungroup() %>%
  mutate(data = map2(data, name, rename_fun))

# recode 
recode_fun <- function(rule, y, year){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

ros_recode <- ros_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    mutate(wave = as.numeric(wave)) %>%
    group_by(recode, wave) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, wave), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))

# reverse code 
ros_recode <- ros_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
}

2.9.3 Covariates

# composite WITHIN years 
ros_cov <- ros_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  mutate(data = map(data, ~(.) %>% 
      filter(!is.na(value)) %>%
      mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
      group_by(comp_rule, SID, wave, long_rule) %>%
      nest() %>%
      ungroup() %>%
      mutate(data = map2(data, comp_rule, ~fun_call((.x)$value, .y))) %>%
      unnest(data) %>%
      filter(wave == 0)
      ))

comp_fun <- function(d, rule){
  d %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(data, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
ros_cov <- ros_cov %>%
  unnest(data) %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  group_by(long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.9.4 Personality Variables

ros_pers <- ros_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>% 
  distinct() %>%
  group_by(SID, name) %>%
  filter(wave == min(wave)) %>%
  ungroup() %>%
  select(SID, name, wave, value)

2.9.5 Outcome Variables

ros_cog_waves <- ros_recode %>%
  filter(category == "outcome") %>%
  unnest(data) %>%
  group_by(SID, name) %>%
  summarize(o_year = max(wave)) %>%
  ungroup()

# composite within years
ros_out <- ros_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  right_join(ros_cog_waves) %>%
  filter(wave == o_year & wave > 0) %>%
  filter(!is.na(value)) %>%
  group_by(name, itemname, wave) %>%
  mutate(value = pomp(value)) %>%
  group_by(name, wave, SID) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

2.9.6 Combine Data

ros_combined <- ros_pers %>% 
  rename(Trait = name, p_value = value, p_year = wave) %>%
  full_join(
    ros_out %>% 
      rename(Outcome = name, o_value = value, o_year = wave)
    ) %>% full_join(ros_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(ros_cov, ros_pers, ros_out, ros_combined,
     file = sprintf("%s/data/clean/ros_cleaned.RData", local_path))
rm(list = ls()[grepl("ros", ls())])

2.10 Origins of the Variances of the Oldest-Old: Octogenarian Twins (OCTO-TWIN)

The Origins of the Variances of the Oldest-Old: Octogenarian Twins (OCTO-TWIN) study was a longitudinal study of twin-pairs in Sweden born before 1913 that began in 1991. Data are available, by application, from the study leaders at the Karolinska Institute.

Twin-pairs from the Swedish Twin Registry who were born before 1913 were invited to be part of the study in 1991. Additional follow-ups were collected in 1993, 1995, 1997, and 1999 on all surviving twins. The initial sample included 351 twin pairs (149 monozygotic and 202 same-sex dizygotic pairs). More information on the study can be found at https://webcache.googleusercontent.com/search?q=cache:HcheF2l9zc0J:https://psy.gu.se/digitalAssets/1469/1469717_octo-twin-brief-presentation.pdf+&cd=1&hl=en&ct=clnk&gl=us&client=safari.

Sample sizes vary by year, from 222 (wave 5; 43 pairs) to 702 (wave 1; 351 pairs), which yields 99% power to detect a correlation effect size of ~.19, two-tailed at alpha .05.

2.10.1 Load Data

(octotwin_codebook <- (codebook %>% filter(study == "OCTO-TWIN"))$codebook[[1]] %>%
  mutate(orig_itemname = str_to_lower(orig_itemname)))
## # A tibble: 99 × 16
##    study     dataset    category   name  itemname  wave  year orig_itemname description scale reverse_code recode  mini  maxi
##    <chr>     <chr>      <chr>      <chr> <chr>    <dbl> <dbl> <chr>         <chr>       <chr> <chr>        <chr>  <dbl> <dbl>
##  1 octo-twin OCTO_Wave1 covariates educ… eduYears     1  1991 v130          years scho… "int… <NA>         ifels…    NA    NA
##  2 octo-twin OCTO_Wave1 covariates gend… sex          1  1991 sex           Respondent…  <NA> <NA>         ifels…    NA    NA
##  3 octo-twin OCTO_Wave1 covariates gend… sex2         1  1991 medrec_sex    Sex          <NA> <NA>         ifels…    NA    NA
##  4 octo-twin OCTO_Wave2 covariates gend… sex2         2  1993 medrec_sex    Sex          <NA> <NA>         ifels…    NA    NA
##  5 octo-twin OCTO_Wave3 covariates gend… sex2         3  1995 medrec_sex    Sex          <NA> <NA>         ifels…    NA    NA
##  6 octo-twin OCTO_Wave4 covariates gend… sex2         4  1997 medrec_sex    Sex          <NA> <NA>         ifels…    NA    NA
##  7 octo-twin OCTO_Wave5 covariates gend… sex2         5  2000 medrec_sex    Sex          <NA> <NA>         ifels…    NA    NA
##  8 octo-twin OCTO_Wave1 covariates SRhe… overall…     1  1991 v201          overall he… "Nam… yes          ifels…     1     3
##  9 octo-twin OCTO_Wave2 covariates SRhe… overall…     2  1993 b201          overall he… "Nam… yes          ifels…     1     3
## 10 octo-twin OCTO_Wave3 covariates SRhe… overall…     3  1995 c201          overall he… "Nam… yes          ifels…     1     3
## # ℹ 89 more rows
## # ℹ 2 more variables: comp_rule <chr>, long_rule <chr>
octotwin <- sprintf("%s/octo-twin/OCTO_Twin_Emorie.sav", data_path) %>% 
  read_sav() %>%
  haven::zap_labels(.)

2.10.2 Recoding & Reverse Scoring

rename_fun <- function(cb, var){
  old.names <- unique((octotwin_codebook %>% filter(name == var))$orig_itemname)
  df <- octotwin %>% 
    mutate(SID = paste0(v1, v2)) %>%
    select(SID, one_of(old.names)) %>%
    gather(key = orig_itemname, value = value, -SID, na.rm=T)
  if(length(old.names) > 1){
      df %>% left_join(cb %>% select(itemname, wave, orig_itemname, reverse_code:long_rule))
  } else {
    df %>% left_join(cb %>% select(itemname, orig_itemname, reverse_code:long_rule) %>% distinct()) %>% mutate(wave = "0")
  }
}

# join data with recoding info 
octotwin_recode <- octotwin_codebook %>%
  select(category, name, itemname, wave, orig_itemname, reverse_code:long_rule) %>%
  filter(category %in% c("pers", "outcome", "covariates")) %>%
  group_by(category, name) %>%
  nest() %>% 
  ungroup() %>%
  mutate(data = map2(data, name, possibly(rename_fun, NA_real_)))

# recode 
recode_fun <- function(rule, y, year){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

octotwin_recode <- octotwin_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    mutate(wave = as.numeric(wave)) %>%
    group_by(recode, wave) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, wave), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))


# reverse code 
octotwin_recode <- octotwin_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
}

2.10.3 Covariates

# composite WITHIN years 
octotwin_cov <- octotwin_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  mutate(data = map(data, ~(.) %>% 
      filter(!is.na(value)) %>%
      mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
      group_by(comp_rule, SID, wave, long_rule) %>%
      nest() %>%
      ungroup() %>%
      mutate(data = map2(data, comp_rule, ~fun_call((.x)$value, .y))) %>%
      unnest(data) %>%
      filter(wave <= 1)
      ))

comp_fun <- function(d, rule){
  d %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(data, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
octotwin_cov <- octotwin_cov %>%
  unnest(data) %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  group_by(long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.10.4 Personality Variables

octotwin_pers <- octotwin_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>%
  filter(wave == 1 & !is.na(value)) %>%
  group_by(SID, wave, name, itemname, comp_rule) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

# Cronbach's alphas
octotwin_alpha <- octotwin_pers %>%
  filter(!is.na(value)) %>%
  select(name, itemname, wave, SID, value) %>%
  group_by(name, wave) %>%
  nest() %>%
  mutate(data = map(data, ~(.) %>% distinct() %>% pivot_wider(names_from = itemname, values_from = value, values_fn = list(mean))),
         alpha = map(data, possibly(~psych::alpha((.) %>% select(-SID), check.keys = T), NA_real_)))

comp_fun <- function(df, rule){
  df %>%
    group_by(SID) %>%
    summarize(value = fun_call(value, rule)) %>%
    ungroup() %>%
    mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
}

# create composites
octotwin_pers <- octotwin_pers %>%
  group_by(name, comp_rule, wave) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map2(data, comp_rule, comp_fun)) %>%
  unnest(data) %>%
  select(-comp_rule)

2.10.5 Outcome Variables

octotwin_cog_waves <- octotwin_recode %>%
  filter(category == "outcome") %>%
  unnest(data) %>%
  group_by(SID, name) %>%
  summarize(o_year = max(wave)) %>%
  ungroup()

# composite within years
octotwin_out <- octotwin_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  right_join(octotwin_cog_waves) %>%
  filter(wave == o_year & wave > 1) %>%
  group_by(name, wave, SID) %>%
  summarize(value = sum(value, na.rm = T)) %>%
  ungroup()

2.10.6 Combine Data

octotwin_combined <- octotwin_pers %>% 
  rename(Trait = name, p_value = value, p_year = wave) %>%
  full_join(
    octotwin_out %>% 
      rename(Outcome = name, o_value = value, o_year = wave)
    ) %>% full_join(octotwin_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(octotwin_cov, octotwin_alpha, octotwin_pers, octotwin_out, octotwin_combined,
     file = sprintf("%s/data/clean/octo-twin_cleaned.RData", local_path))
rm(list = ls()[grepl("octotwin", ls())])

2.11 Swedish Adoption Twin Study of Aging (SATSA)

The Swedish Adoption Twin Study of Aging (SATSA) is a longitudinal study of twin pairs from the Swedish Twin Registry that began in 1984. Data are available through the ICPSR database at https://www.icpsr.umich.edu/web/ICPSR/studies/3843.

All twin-pairs on the Swedish Twin Registry who were separated at an early age were invited to be a part of the study in 1984. A control sample of twins reared together was also included. Additional waves of data on all participants were collected in 1987, 1990, 1993, 2004, 2007, 2010, 2012, and 2014. More information, including codebooks, scales, and variable search functions, can be found at https://www.maelstrom-research.org/mica/individual-study/satsa/#.

Sample sizes vary by wave, ranging from 379 participants (IPT7) to 2018 participants at baseline (1984). Given that the target measures were collected at baseline, this provides 99% power to detect a zero-order correlation effect size of ~.10, two-tailed at alpha .05.

2.11.1 Load Data

loadRData <- function(fileName){
#loads an RData file, and returns it
    load(fileName)
    get(ls()[ls() != "fileName"])
}
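`load()` restores objects under the names they were saved with, so the object name must normally be known in advance; `loadRData()` sidesteps this by loading into the function's local environment and returning whatever single object appears there. A quick illustration:

```r
# Save an object under an arbitrary name, then retrieve it by file alone.
scores <- c(10, 20, 30)
tmp <- tempfile(fileext = ".RData")
save(scores, file = tmp)

loadRData(tmp)  # returns 10 20 30 without needing to know the name "scores"
```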

satsa_read_fun <- function(x){
  # prob_vars <- c("FHEART", "FPARKIN", "FSTROKE")
  y <- sprintf("%s/satsa/%s", data_path, x) %>% loadRData(.) %>% 
    select(SID = TWINNR, one_of(old.names)) %>%
    as_tibble()
  # if(any(prob_vars %in% colnames(y))){
  #   y <- y %>% mutate_at(vars(one_of(prob_vars)), ~as.numeric(as.character(.)))
  #   }
  return(y)
}
satsa_codebook <- (codebook %>% filter(study == "SATSA"))$codebook[[1]] %>%
  mutate_at(vars(orig_itemname), str_to_upper)
satsa_codebook
## # A tibble: 302 × 18
##    study dataset    category   name   itemname wave  wave_letter  year item_stem orig_itemname description scale reverse_code
##    <chr> <chr>      <chr>      <chr>  <chr>    <lgl> <chr>       <dbl> <chr>     <chr>         <chr>       <chr> <chr>       
##  1 satsa SATSA_Q1   covariates educa… educati… NA    <NA>         1984 EDUC      EDUC          Education   "1\t… <NA>        
##  2 satsa SATSA_Q1   covariates gender sex      NA    <NA>         1984 <NA>      SEX           SEX         "1 =… <NA>        
##  3 satsa SATSA_IPT1 covariates gender sex      NA    <NA>         1985 <NA>      SEX           SEX         "1 =… <NA>        
##  4 satsa SATSA_Q2   covariates gender sex      NA    <NA>         1987 <NA>      SEX           SEX         "1 =… <NA>        
##  5 satsa SATSA_IPT2 covariates gender sex      NA    <NA>         1989 <NA>      SEX           SEX         "1 =… <NA>        
##  6 satsa SATSA_Q3   covariates gender sex      NA    <NA>         1990 <NA>      SEX           SEX         "1 =… <NA>        
##  7 satsa SATSA_IPT3 covariates gender sex      NA    <NA>         1992 <NA>      SEX           SEX         "1 =… <NA>        
##  8 satsa SATSA_Q4   covariates gender sex      NA    <NA>         1993 <NA>      SEX           SEX         "1 =… <NA>        
##  9 satsa SATSA_IPT4 covariates gender sex      NA    <NA>         1995 <NA>      SEX           SEX         "1 =… <NA>        
## 10 satsa SATSA_IPT5 covariates gender sex      NA    <NA>         1999 <NA>      SEX           SEX         "1 =… <NA>        
## # ℹ 292 more rows
## # ℹ 5 more variables: recode <chr>, mini <dbl>, maxi <dbl>, comp_rule <chr>, long_rule <chr>
old.names <- unique(satsa_codebook$orig_itemname) %>% str_to_upper
datasets <- sprintf("%s/satsa", data_path) %>% list.files(., pattern = ".rda")
satsa <- tibble(datasets = datasets) %>%
  mutate(data = map(datasets, satsa_read_fun),
         ncol = map_dbl(data, ncol)) %>%
  filter(ncol != 0)

satsa <- reduce(satsa$data, full_join) %>% haven::zap_labels(.)
satsa <- satsa %>%
  mutate_if(is.factor, ~as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", .))) 
satsa_long <- satsa %>% 
  gather(key = orig_itemname, value = value, -SID, na.rm = T)
save(satsa, file = sprintf("%s/data/clean/satsa_raw.RData", local_path))
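The `mutate_if(is.factor, ...)` step above strips SPSS-style value labels of the form `(01) Label` down to their numeric codes. A quick check of that regular expression on hypothetical labels:

```r
labs <- c("(01) Strongly agree", "(2) Disagree")
as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", labs))  # 1 2
```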

2.11.2 Recoding & Reverse Scoring

satsa_waves <- p_waves %>% filter(Study == "SATSA") %>% select(Used) %>% distinct()

# join data with recoding info 
satsa_recode <- satsa_codebook %>%
  select(category, name, itemname, year, orig_itemname, reverse_code:long_rule) %>%
  filter(category %in% c("pers", "outcome", "covariates")) %>%
  group_by(category, name) %>%
  nest() %>% 
  ungroup() %>%
  mutate(data = map(data, ~(.) %>% left_join(satsa_long)))

# recode 
recode_fun <- function(rule, y, year, p_year){
  x <- y$value
  if(!is.na(rule)){y$value <- eval(parse(text = rule))}
  return(y)
}

satsa_recode <- satsa_recode %>% 
  mutate(data = map(data, ~(.) %>% 
    mutate(p_year = 1984) %>%
    group_by(recode, year, p_year) %>%
    nest() %>%
    ungroup() %>%
    mutate(data = pmap(list(recode, data, year, p_year), recode_fun)) %>%
    unnest(data) %>%
    mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value))))


# reverse code 
satsa_recode <- satsa_recode %>%
  mutate(data = map(data, ~(.) %>%
    mutate(value = ifelse(reverse_code == "no" | is.na(reverse_code), value, 
                        reverse.code(-1, value, mini = mini, maxi = maxi)))))

fun_call <- function(x, rule){
    switch(rule,
           average = mean(x, na.rm = T),
           mode = Mode(x)[1],
           sum = sum(x, na.rm = T),
           skip = unique(x)[1],
           select = unique(x)[1],
           max = max(x, na.rm = T),
           min = min(x, na.rm = T))
  }

2.11.3 Covariates

# composite WITHIN years 
satsa_cov <- satsa_recode %>%
  filter(category == "covariates") %>%
  select(-category) %>%
  mutate(data = map(data, ~(.) %>% 
      filter(!is.na(value)) %>%
      mutate(comp_rule = ifelse(is.na(comp_rule), "skip", comp_rule)) %>%
      group_by(comp_rule, SID, year, p_year, long_rule) %>%
      nest() %>%
      ungroup() %>%
      mutate(data = map2(data, comp_rule, ~fun_call((.x)$value, .y))) %>%
      unnest(data) %>%
      filter(year <= p_year)
      ))

comp_fun <- function(d, rule, p_year){
  print(paste(rule, p_year))
  d %>%
    group_by(SID, name) %>%
    summarize(value = fun_call(data, rule)) %>%
    ungroup() %>%
    distinct()
}

# composite ACROSS years
satsa_cov <- satsa_cov %>%
  unnest(data) %>%
  mutate(long_rule = ifelse(is.na(long_rule), "skip", long_rule)) %>%
  filter(year <= p_year) %>%
  group_by(p_year, long_rule) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = pmap(list(data, long_rule, p_year), comp_fun)) %>%
  unnest(data) %>%
  mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value)) %>%
  select(-long_rule) %>%
  spread(name, value) %>%
  filter(!is.na(SID))

2.11.4 Personality Variables

satsa_pers <- satsa_recode %>%
  filter(category == "pers") %>%
  select(-category) %>% 
  unnest(data) %>%
  filter(year == 1984 & !is.na(value)) %>%
  group_by(SID, p_year, year, name, itemname, comp_rule) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

# Cronbach's alphas
satsa_alpha <- satsa_pers %>%
  filter(!is.na(value)) %>%
  select(name, itemname, year, SID, value) %>%
  group_by(name, year) %>%
  nest() %>%
  mutate(data = map(data, ~(.) %>% distinct() %>% pivot_wider(names_from = itemname, values_from = value, values_fn = list(mean))),
         alpha = map(data, possibly(~psych::alpha((.) %>% select(-SID)), NA_real_)))

comp_fun <- function(df, rule){
  df %>%
    group_by(SID) %>%
    summarize(value = fun_call(value, rule)) %>%
    ungroup() %>%
    mutate(value = ifelse(is.infinite(value) | is.nan(value), NA, value))
}

# create composites
satsa_pers <- satsa_pers %>%
  group_by(name, comp_rule, year) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map2(data, comp_rule, comp_fun)) %>%
  unnest(data) %>%
  select(-comp_rule)

2.11.5 Outcome Variables

# composite within years
satsa_out <- satsa_recode %>%
  filter(category == "outcome") %>%
  select(-category) %>%
  unnest(data) %>%
  filter(year == 1999) %>%
  group_by(name, year, SID) %>%
  summarize(value = sum(value, na.rm = T)) %>%
  ungroup()

2.11.6 Combine Data

satsa_combined <- satsa_pers %>% 
  rename(Trait = name, p_value = value, p_year = year) %>%
  full_join(satsa_out %>% rename(Outcome = name, o_value = value, o_year = year)) %>%
  full_join(satsa_cov) %>%
  filter(!is.na(p_value) & !is.na(o_value))
save(satsa_cov, satsa_alpha, satsa_pers, satsa_out, satsa_combined,
     file = sprintf("%s/data/clean/satsa_cleaned.RData", local_path))
rm(list = ls()[grepl("satsa", ls())])

2.12 Seattle Longitudinal Study (SLS)

The Seattle Longitudinal Study (SLS) is an ongoing longitudinal study of adult Seattle residents that began in 1956. Data are available, by request, from https://sls.psychiatry.uw.edu/researchers/public-access-data-sets/.

The first cohort of participants was recruited from HMO plan members of the Group Health Cooperative of Puget Sound in Seattle. Seven years later, the first cohort and a new cohort of similarly aged participants were solicited again. Over the next half century, this procedure was repeated every seven years and is ongoing. More information on the measures collected can be found at https://sls.psychiatry.uw.edu/researchers/measures/.

Sample sizes vary by year, from 302 (1956) to 421 (2005). This provides 99% power to detect a correlation effect size of ~.23, two-tailed at alpha .05.

2.13 Descriptives of All Studies

loadRData <- function(fileName, type){
#loads an RData file, and returns it
    path <- sprintf("%s/data/clean/%s_cleaned.RData", local_path, fileName)
    load(path)
    get(ls()[grepl(type, ls())])
}

ipd_data <- tibble(
  study = studies[!studies %in% c("CNLSY", "SLS")]
  , data = map(str_to_lower(study), ~loadRData(., "combined"))
  ) %>% mutate(
    data = map(data, ~(.) %>% 
                 ungroup() %>% 
                 mutate(SID = as.character(SID)))
    , study = mapvalues(study, studies, studies_long)
  ) %>%
  unnest(data) %>%
  mutate(age = ifelse(is.na(age), p_year - yearBrth, age))

ipd_data <- ipd_data %>% 
  group_by(study, Trait, SID) %>%
  filter(p_year == min(p_year)) %>%
  ungroup() %>%
  select(-p_year, -o_year, -Outcome) %>%
  rename(crystallized = o_value) %>%
  distinct() %>%
  group_by(study) %>%
  nest() %>%
  ungroup() %>%
  mutate(wide_data = map(data, ~(.) %>% pivot_wider(names_from = "Trait", values_from = "p_value")))
cln <- c("Study", "E", "A", "C", "N", "O", "Crystallized / Knowledge", "Age (Years)", "Education (Years)", "% Women", "Valid N (Range)")

desc_tab <- ipd_data %>%
  select(-p_year, -o_year, -yearBrth, -SRhealth, -gender) %>%
  group_by(study, Trait, Outcome) %>%
  mutate_at(vars(p_value, o_value), 
            ~((. - min(., na.rm = T))/(max(., na.rm = T) - min(., na.rm = T))*10)) %>%
  pivot_wider(names_from = Outcome, values_from = o_value, values_fn = mean) %>%
  pivot_wider(names_from = Trait, values_from = p_value, values_fn = mean) %>%
  pivot_longer(cols = c(education:C), values_to = "value", names_to = "item", values_drop_na = T) %>%
  group_by(study, item) %>%
  summarize(est = sprintf("%.2f (%.2f)", mean(value, na.rm = T), sd(value, na.rm = T))) %>%
  ungroup() %>%
  pivot_wider(names_from = item, values_from = est) %>%
  full_join(
    ipd_data %>% 
      select(study, Trait, p_value, Outcome, SID, o_value) %>%
      distinct() %>%
      group_by(study, Trait) %>% 
      tally() %>%
      group_by(study) %>%
      summarize(n = sprintf("%i - %i", min(n), max(n)))
    ) %>%
  full_join(
    ipd_data %>% 
      select(study, SID, gender) %>% 
      distinct()  %>% 
      group_by(study) %>% 
      summarize(gender = mean(gender, na.rm = T)*100) %>% 
      ungroup()
    ) %>%
  select(study, E, A, C, N, O, crystallized, age, education, gender, n) %>%
  kable(., "html"
        , digits = 2
        , col.names = cln
        , caption = "<strong>Table 3</strong><br><em>Descriptive Statistics of All Harmonized Measures Across Samples</em>"
        , align = c("l", rep("c",9))) %>%
  kable_classic(full_width = F, html_font = "Times New Roman") %>%
  add_header_above(c(" " = 1, "Personality Characteristics" = 5, " " = 5)) %>% 
  add_footnote(notation = "none", label = "<em>Note.</em> E = Extraversion; A = Agreeableness; C = Conscientiousness; N = Neuroticism; O = Openness; Age, education, and gender were assessed at the first baseline personality assessment. Valid N (Range) indicates the range of valid observations with complete personality trait and outcome data across different trait measures.", escape = F)
desc_tab
save_kable(desc_tab, file = sprintf("%s/results/tables/tab-3-desc.html", local_path))
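Before computing descriptives, the pipeline above min–max rescales each measure to a common 0–10 range within study and measure. That transformation can be sketched as a standalone function (the example vector is hypothetical):

```r
# min-max rescale to 0-10, mirroring the transformation in the descriptives pipeline
rescale_0_10 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 10
}

x <- c(1, 2, 3, NA, 5)
rescale_0_10(x)  # 0.0 2.5 5.0 NA 10.0
```

Missing values are ignored when finding the range but are preserved in the output, so participants with incomplete data are not silently dropped.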
r_fun <- function(d, study){
  # order and label the variables for the correlation table
  cols <- c("E", "A", "C", "N", "O", "crystallized", "age", "gender", "education")
  cols_long <- mapvalues(cols, c(traits$short_name, outcomes$short_name, covars$short_name)
                         , c(traits$long_name, outcomes$long_name, covars$long_name))
  colns <- paste(seq_along(cols), ". ", cols_long, sep = "")
  # pairwise zero-order correlations among all study variables
  r <- d %>% 
    select(all_of(cols)) %>%
    cor(., use = "pairwise") 
  # format to two decimals, blank the upper triangle, and mark the diagonal
  r <- apply(r, c(1,2), function(x) ifelse(is.na(x), "", ifelse(x == 1, "--", sprintf("%.2f", x))))
  r[upper.tri(r)] <- ""
  diag(r) <- "--"
  rownames(r) <- colns; colnames(r) <- seq_along(cols)
  tab <- r %>% 
    as.data.frame() %>%
    rownames_to_column(" ") %>%
    kable(.
          , "html"
          , escape = F
          , align = c("l", rep("c", length(cols)))
          , caption = sprintf("<strong>Table SX</strong><br><em>Zero-Order Correlations Between All Study Variables in %s</em>", study)
          ) %>%
    kable_classic(full_width = F, html_font = "Times New Roman")
  save_kable(tab, file = sprintf("%s/results/tables/zero-order-cors/%s.html", local_path, study))
  return(tab)
}

loadRData <- function(fileName, type){
  # loads a cleaned .RData file and returns the object whose name matches `type`
  path <- sprintf("%s/data/clean/%s_cleaned.RData", local_path, fileName)
  print(path)
  load(path)
  get(ls()[grepl(type, ls())])
}
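`loadRData()` relies on `load()` restoring saved objects under their original names and `get()` retrieving the one whose name matches `type`. A self-contained sketch of the same pattern, using an explicit environment and a temporary file rather than the study data (the object name `gsoep_combined` is illustrative only):

```r
# load an .RData file into a fresh environment and return the object matching `type`
load_matching <- function(fileName, type) {
  e <- new.env()
  load(fileName, envir = e)                   # restore saved objects by their original names
  get(ls(e)[grepl(type, ls(e))], envir = e)   # return the one whose name matches `type`
}

# demo: save a toy object to a temporary .RData file, then retrieve it by pattern
gsoep_combined <- data.frame(SID = c("1", "2"))
tmp <- tempfile(fileext = ".RData")
save(gsoep_combined, file = tmp)
load_matching(tmp, "combined")  # returns the saved data frame
```

Loading into a dedicated environment avoids name clashes with the function's own variables; the version above loads into the function environment directly, which works as long as no argument or local name happens to match `type`.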

ipd3_meta_data <- tibble(
  study = studies[!studies %in% c("CNLSY", "SLS")]
  , data = map(str_to_lower(study), ~loadRData(., "combined"))
  ) %>% mutate(
    data = map(data, ~(.) %>% 
                 ungroup() %>% 
                 mutate(SID = as.character(SID)))
    , study = mapvalues(study, studies, studies_long)
  ) %>%
  unnest(data) %>%
  mutate(age = ifelse(is.na(age), p_year - yearBrth, age))
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/base-i_cleaned.RData"
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/eas_cleaned.RData"
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/gsoep_cleaned.RData"
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/hilda_cleaned.RData"
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/hrs_cleaned.RData"
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/lasa_cleaned.RData"
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/map_cleaned.RData"
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/mars_cleaned.RData"
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/octo-twin_cleaned.RData"
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/ros_cleaned.RData"
## [1] "~/Documents/projects/data synthesis/crystallized/data/clean/satsa_cleaned.RData"
ipd3_meta_data
## # A tibble: 119,597 × 13
##    study Trait p_value p_year SID   Outcome      o_value o_year education gender SRhealth yearBrth   age
##    <chr> <chr>   <dbl>  <dbl> <chr> <chr>          <dbl>  <dbl>     <dbl>  <dbl>    <dbl>    <dbl> <dbl>
##  1 BASE  E        3.83   1990 10004 crystallized   2.17    2000        17      0        4     1918    72
##  2 BASE  E        3.83   1990 10010 crystallized   3.08    1995        11      0        2     1911    79
##  3 BASE  E        2.5    1990 10033 crystallized   0.909   1997         8      1        5     1910    80
##  4 BASE  E        3      1990 10034 crystallized   6.54    1995        10      0        5     1914    76
##  5 BASE  E        2.5    1990 10111 crystallized   6.54    1995        13      1        4     1901    89
##  6 BASE  E        3.17   1990 10115 crystallized  10       1995        11      0        2     1918    72
##  7 BASE  E        3      1990 10116 crystallized   3.85    1995        10      0        4     1917    73
##  8 BASE  E        3      1990 10145 crystallized   6.36    1997        10      1        5     1897    93
##  9 BASE  E        3.33   1990 10175 crystallized   2.31    1995         8      1        3     1911    79
## 10 BASE  E        3.33   1990 10188 crystallized   5.77    2004        10      1        2     1903    87
## # ℹ 119,587 more rows
ipd_cors <- ipd3_meta_data %>%
  select(-p_year, -o_year, -yearBrth, -SRhealth) %>%
  group_by(study, Trait, Outcome) %>%
  mutate(across(c(p_value, o_value), 
                ~((.x - min(.x, na.rm = T))/(max(.x, na.rm = T) - min(.x, na.rm = T))*10))) %>%
  pivot_wider(names_from = Outcome, values_from = o_value, values_fn = mean) %>%
  pivot_wider(names_from = Trait, values_from = p_value, values_fn = mean) %>%
  group_by(study) %>% 
  nest() %>% 
  ungroup() %>%
  mutate(rtab = map2(data, study, r_fun))