Author

Emorie D Beck

This is lavaan 0.6-15
lavaan is FREE software! Please report any bugs.

Attaching package: 'lavaan'
The following object is masked from 'package:psych':

    cor2cov
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::%+%()     masks psych::%+%()
✖ ggplot2::alpha()   masks psych::alpha()
✖ dplyr::arrange()   masks plyr::arrange()
✖ purrr::compact()   masks plyr::compact()
✖ dplyr::count()     masks plyr::count()
✖ dplyr::desc()      masks plyr::desc()
✖ dplyr::failwith()  masks plyr::failwith()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::id()        masks plyr::id()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::mutate()    masks plyr::mutate()
✖ dplyr::rename()    masks plyr::rename()
✖ dplyr::summarise() masks plyr::summarise()
✖ dplyr::summarize() masks plyr::summarize()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Week 10 - Review & Reflection

Topics

  1. Intro to Base R
  2. dplyr: Manipulating Data
  3. tidyr: Reshaping and Transforming Data
  4. Codebooks
  5. purrr & Functions
  6. Review
  7. Strings, dates, & regex
  8. Functional Tables & Figures
  9. GitHub & Parallelization
  10. Today

Lessons & Takeaways

Lesson 1

Always load tidyverse last

  • Always load all packages at the beginning of a script
Code
library(psych)
library(ggdist)
library(knitr)
library(kableExtra)
library(brms)
library(broom)
library(broom.mixed)
library(patchwork)
library(plyr)
library(tidyverse)
library(furrr)
  • Note: tidyverse loads: dplyr, forcats (factors), ggplot2, lubrdiate, purrr, readr, stringr, tibble, and tidyr
  • This is good! It reduces the number of packages you have to load and ensures there’s no order issues

Deal with Conflicts

  • Use the conflicts() function to figure out what conflicts you have
  • Use package::fnName() to call a function directly without loading a package / to override conflicts
Code
kable(tab) %>%
  kable_classic(html_font = "Times") %>%
  kableExtra::group_rows("Header", 1, 3)

Lesson 2

There is no single way to do anything

  • The best way to do something is a way that you understand or you can introduce the mistakes you’re trying to prevent
Code
bfi %>%
  mutate(
    sid = 1:n(),
    E = rowMeans(pick(matches("E\\d")), na.rm = T), 
    A = rowMeans(pick(matches("A\\d")), na.rm = T), 
    C = rowMeans(pick(matches("C\\d")), na.rm = T), 
    N = rowMeans(pick(matches("N\\d")), na.rm = T), 
    O = rowMeans(pick(matches("O\\d")), na.rm = T)
  ) %>%
  ungroup() %>%
  select(sid, E:O)
Code
bfi %>% 
  mutate(sid = 1:n()) %>%
  pivot_longer(
    cols = c(-sid, -gender, -education, -age)
    , names_to = c("trait", "item")
    , names_sep = -1
    , values_to = "value"
  ) %>%
  group_by(sid, trait) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  pivot_wider(names_from = "trait", values_from = "value") %>%
  ungroup()

Lesson 3

Start at the end

  • What do you want your data to look like?
  • What do they look like now?
  • Now fill in the middle

HLM / MLM / MEM: RE ex: time, trial, stimuli, group, study, day, w/in person conditions FE ex: gender, baseline age, b/w subject conditions, country, etc. (can also be an RE)

ID | RE1 | RE2 | DV | FE1
1 | 1 | 1 | 4 | 3
1 | 1 | 2 | 3 | 3
1 | 2 | 1 | 2 | 3
1 | 2 | 2 | 1 | 3
2 | 1 | 1 | 5 | 1
2 | 1 | 2 | 3 | 1
2 | 2 | 1 | 1 | 1
2 | 2 | 2 | 2 | 1


Start at the end

ID | RE_1_1 | RE_1_2 | RE+2_1 | RE_2_2 | FE2 1 | 4 | 3 | 2 | 1 | 3
2 | 5 | 3 | 1 | 2 | 1

  • pivot_longer(): RE_1_1:RE_2_2
    • names_to = c(“RE1”, “RE2”)
    • values_to = “DV”
    • names_sep = “_”
    • names_prefix = “RE_”
  • Note this is only possible / easy because of the naming scheme! If we had them named “RE1_1”, this would not have been possible / would have been CONSIDERABLY more difficult

Lesson 4

Don’t be afraid to split your data into chunks

  • In alignment with starting at the end, a key strategy is knowing how you can chunk your data
  • No right or wrong way to chunk, but some examples are:
    • items / values from the same scale / task (e.g., DV across trials / conditions)
    • baseline items (from other survey or from baseline wave)
    • outcome variables
    • descriptive variables
    • item-level variables v. composites

Lesson 5

Joining data requires a key, so be thoughtful and you’ll always be able to put the pieces together

  • The most important thing when splitting data into chunks is to make sure you can put it back together
  • This requires one (e.g., participant ID) or more (e.g., participant ID, wave) keys that allows R to match the right values together
  • This should be the last thing you do
    • Please don’t create mega datasets where you tack things on to the raw data as you go
    • This will eat RAM and make your life harder (and sometimes could end up in you accidentally sharing identifying information!!)

Lesson 6

One of the most important skills is getting comfortable making data move flexibly from wide to long

  • Remember our example above? Without knowledge of how to do so, it would have been almost impossible
  • It’s okay if it takes more than one step! That’s better than manually moving stuff in excel and not creating a reproducible path!
Code
bfi %>% 
  mutate(sid = 1:n()) %>%
  pivot_longer(
    cols = c(-sid, -gender, -education, -age)
    , names_to = c("trait", "item")
    , names_sep = -1
    , values_to = "value"
  ) %>%
  group_by(sid, trait) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  pivot_wider(names_from = "trait", values_from = "value") %>%
  ungroup()

Lesson 7

Establish a consistent naming scheme

  • label objects relative to a stage or research question
    • e.g., nested_RQ1, RQ1_mods, raw_df
    • this will help you clear your environment of clutter
  • use temporary objects repeatedly
    • e.g., if you need to use an object as an intermediary step, call it tmp and overwrite it as many times as is useful
    • You can always remove it using rm(tmp)

Lesson 8

You won’t remember details about raw variables or variables you create, document them

  • Clearly document your raw data and planned transformations before (preregistration) or as (deviations or just reactive responses to aspects of the data) you clean your data
  • Clearly document all new variables you create, including their scale, etc.

Lesson 9

Reference data frames are great keys for ordering and renaming

  • Clearly documenting all new variables you create also creates the opportunity to create reference data frames, which can include variable names in the data, category information, longer names for the variables, descriptions of the scales, and the link function / column names e.g.,

cat | name | scale | long_name | lab
Outcome | dementia | 0/1 | Clinical Dementia | OR [CI] Outcome | braak | 1-5 | Braak Stage | est. [CI] Predictor | E | 0-10 | Extraversion |
Predictor | C | 0-10 | Conscientiousness |
Moderator | age | num | Baseline Age | Moderator | ses | 1-7 | Baseline SES |


Reference data frames are great keys for ordering and renaming

Code
out <- tribble(
  ~cat,     ~name,        ~scale,      ~long_name,              ~lab,   
"Outcome",  "dementia",   "0/1",       "Clinical Dementia",      "OR [CI]",
"Outcome",  "braak",      "1-5",       "Braak Stage",            "est. [CI]" 
)

tab %>%
  left_join(out %>% select(outcome = name, long_out = long_name, lab)) 

Lesson 10

Reorder everything using factors

  • Often, there are specific orders we want / need strings to be in; this is where factors come in
  • There’s a whole package for this called forcats.
  • Reference data frames are a great way to order your variables
Code
out <- tribble(
  ~cat,     ~name,        ~scale,      ~long_name,              ~lab,   
"Outcome",  "dementia",   "0/1",       "Clinical Dementia",      "OR [CI]",
"Outcome",  "braak",      "1-5",       "Braak Stage",            "est. [CI]" 
)

tab %>%
  left_join(out %>% select(outcome = name, long_out = long_name, lab)) %>%
  mutate(long_out = factor(long_out, levels = out$long_name))
# mutate(long_out = factor(oucome, levels = out$name, labels = out$long_name))

Reorder everything using factors

  • Remember this?
Code
terms <- tribble(
  ~path,     ~new,                         ~level
  "i~1",     "Intercept",                  "Fixed",
  "s~1",     "Slope",                      "Fixed",
  "i~~i",    "Intercept Variance",         "Random",
  "s~~s",    "Slope Variance",             "Random",
  "i~~s",    "Intercept-Slope Covariance", "Random"
)

extract_fun <- function(m, trait){
  p <- parameterEstimates(m) %>%
    data.frame()
  # saveRDS(p, file = sprintf("results/summary/%s.RDS", trait))
  p %>%
    unite(path, lhs, op, rhs, sep = "") %>%
    filter(path %in% terms$path) %>%
    left_join(terms) %>%
    select(term = new, est, ci.lower, ci.upper, pvalue) %>%
    mutate(term = factor(term, levels = terms$new)) %>%
    arrange(term)
}

Lesson 11

Long lists of anything are asking for trouble

  • It took me way too long to even make the short examples above using tribble()
  • Using a spreadsheet is an easier way to compile that information
  • The googlesheets4 package is also a package dedicated to helping you to read, write, and parse Google Sheets
  • It’s easy to load files stored on GitHub
  • Spreadsheets are user friendly even for those who aren’t code literate
  • It’s way easier to reorder a spreadsheet (cut-insert cut rows) than to have to move rows around in an R script
  • f%#*ing commas and quotes

Lesson 12

File structure and organization are your most important data cleaning & management tools

  • Your data will never be clean if you don’t know where your files are!
  • No one wants to have to rerun things repeatedly
    • Store large files (models, bootstrapped resamples, bayesian samples, etc.) using a clear, machine readable, parseable file structure (e.g., dementia-E-age-unadj.RDS)
    • These can then be read in like:
Code
nested_res <- tibble(
  file = list.files("models"),
  mod = map(file, \(x) readRDS(sprintf("models/%s", x)))
  ) %>%
  separate(file, c("outcome", "trait", "moderator", "adj"), sep = "-")

File structure and organization are your most important data cleaning & management tools

  • Same thing goes for smaller objects
  • Save those small ones, like summaries (e.g., from broom::tidy(), coef(), etc.), predicted values, random effects, etc. using the same file structure, and you always have everything at your fingertips
  • Plus you can merge them more easily!
  • This organization also transfers to GitHub for easy loading via raw links!

Lesson 13

Some things / functions are portable across projects, some need modification

  • Some functions are portable:
Code
z_scale <- function(x) (x - mean(x, na.rm = T))/sd(x, na.rm = T)
pomp_score <- function(x){
  rng <- range(x, na.rm = T)
  (x - rng[1])/(rng[2] - rng[1])*100
}

Some things / functions are portable across projects, some need modification

  • Some are not:
  • This function works for lavaan. With slight modifications, it could also work for broom::tidy() output
Code
format_fun <- function(d){
  d %>%
    mutate(sig = ifelse(pvalue < .05, "sig", "ns")) %>%
    rowwise() %>%
    mutate_at(vars(est, ci.lower, ci.upper), round_fun) %>%
    mutate_at(vars(pvalue), pround_fun) %>%
    ungroup() %>%
    mutate(CI = sprintf("[%s,%s]", ci.lower, ci.upper)) %>%
    mutate_at(vars(est, CI, pvalue), ~ifelse(sig == "sig" & !is.na(sig), sprintf("<strong>%s</strong>", .), .)) 
}

Some things / functions are portable across projects, some need modification

  • One possibility is to create an .R script that you can “source” (source("custom_functions.R"))
    • You could have general functions (e.g., z_scale(), pomp_score()) and use case specific ones (e.g., lavaan_format_fun() or broom_format_fun())
    • I often like to copy these into my R workflow because it means that everything is included in the scripts (even though the .R script can be included in the repo)

Lesson 14

Resources are finite, so be aware of how you’re using them

  • Using grid view in your Environment tab is a great way to track resources
  • The environment below came after running 95 separate models, all of which were held in memory


Resources are finite, so be aware of how you’re using them

  • Using grid view in your Environment tab is a great way to track resources
  • The environment below came from reloading smaller summary objects rather than keeping all the models in working memory


Resources are finite, so be aware of how you’re using them

  • Activity Monitor (Mac) or Process Monitor (Windows) is another great way to track general system usage across many programs, not just R
  • I use this in particular when I’m doing parallelization
    • Sometimes threads stall (drop to 0% CPU or Memory)
    • Sometimes threads use way too much memory (and you start using swap)
    • It’s great to track this so you can interrupt

Lesson 15

Data frames are your friend

  • They are the easiest objects to work with in R because of the number of dedicated tools and functions for working with them
  • But they can get unwieldy (e.g., printing a data frame with hundreds of columns and thousands of rows)
  • tibbles help with this but don’t always play nice
    • You can’t go from some classes to tibble directly
    • Instead go data.frame -> tibble
Code
r <- cor(df, use = "pairwise")
r[upper.tri(r, diag = T)] <- NA 
r %>%
  data.frame() %>%
  as_tibble()

Lesson 16

RStudio / Pivot Cheat Sheets

Posit Cheatsheets


RStudio / Pivot Cheat Sheets

Hacks