Week 5 - Functions and purrr

Emorie D Beck

Outline

  1. Questions on Homework
  2. Functions
  3. purrr
  4. Problem Set and Question Time

Functions

Functions

  • A function or subroutine is a sequence of program instructions that performs a specific task, packaged as a unit.
  • Functions typically take some input and return some specified output

Why Write Your Own Functions?

  • Automate your workflow
  • Prevent copy-paste errors (only need to update the function, not update the same task in a bunch of areas)
  • Saves you time by providing tools that port to new projects
  • Functions make you think, which improves the quality of our work

When Should You Write Your Own Functions?

  • Hadley Wickham’s rule of thumb (which is common in a lot of CS and DS circles) is that if you have to copy-paste something more than twice, you should write a function
  • But really you can write one whenever you’d like
  • I often write functions when a link of code starts getting because it’s usually a good signal that I’m trying to do too much at once.

Types of Functions

  • Vector functions take one or more vectors as input and return a vector as output.
  • Data frame functions take a data frame as input and return a data frame as output.
  • Plot functions that take a data frame as input and return a plot as output.
  • and so on…
  • We’ll talk extensively about this again when we talk about building figures and tables into your workflow in Week 8

Examples

Vector Functions

  • In my work, I often want to POMP (Percentage Of Maximum Possible) score or z-score (standardize) variables, which allows me to get data on a standard and interpretable scale (percentage or standard deviations)
  • I often want to do this with lots of variables, so to do each one separately would take forever and introduce chances for error.
bfi %>%
  select(matches("1$")) %>%
  mutate(
    E1 = (E1 - min(E1, na.rm = T))/(max(E1, na.rm = T) - min(E1, na.rm = T))*100
    , A1 = (A1 - min(A1, na.rm = T))/(max(A1, na.rm = T) - min(A1, na.rm = T))*100
    , C1 = (C1 - min(C1, na.rm = T))/(max(C1, na.rm = T) - min(C1, na.rm = T))*100
    , N1 = (N1 - min(N1, na.rm = T))/(max(N1, na.rm = T) - min(N1, na.rm = T))*100
    , O1 = (O1 - min(O1, na.rm = T))/(max(O1, na.rm = T) - min(O1, na.rm = T))*100
  ) %>%
  head(10)
       A1  C1 E1  N1  O1
61617  20  20 40  40  40
61618  20  80  0  40  60
61620  80  60 20  60  60
61621  60  60 80  20  40
61622  20  60 20  20  40
61623 100 100 20  40  60
61624  20  80 60   0  80
61629  60  40 40 100  40
61630  60 100 80  80 100
61633  20 100 20  80  80

Examples

Vector Functions

  • Instead we could make this a function because I don’t want to talk about how long it took me to do that copy-pasting 😭
pomp <- function(x, na) (x - min(x, na.rm = na))/(max(x, na.rm = na) - min(x, na.rm = na))*100
bfi %>%
  select(matches("1$")) %>%
  mutate(
    E1 = pomp(E1, T)
    , A1 =  pomp(A1, T)
    , C1 =  pomp(C1, T)
    , N1 =  pomp(N1, T)
    , O1 =  pomp(O1, T)
  ) %>%
  head(10)
       A1  C1 E1  N1  O1
61617  20  20 40  40  40
61618  20  80  0  40  60
61620  80  60 20  60  60
61621  60  60 80  20  40
61622  20  60 20  20  40
61623 100 100 20  40  60
61624  20  80 60   0  80
61629  60  40 40 100  40
61630  60 100 80  80 100
61633  20 100 20  80  80

Examples

Vector Functions

  • And remember, we could simplify this even more with mutate_at()!
  • And I know in this example it saved us about four lines of code, but having a function like this that you can just reference, especially in conjunction with mutate_at() saves you time in the short and long run.1
pomp <- function(x, na) (x - min(x, na.rm = na))/(max(x, na.rm = na) - min(x, na.rm = na))*100
bfi %>%
  select(matches("1$")) %>%
  mutate_at(vars(matches("1$")), pomp, na = T) %>%
  head(10)
       A1  C1 E1  N1  O1
61617  20  20 40  40  40
61618  20  80  0  40  60
61620  80  60 20  60  60
61621  60  60 80  20  40
61622  20  60 20  20  40
61623 100 100 20  40  60
61624  20  80 60   0  80
61629  60  40 40 100  40
61630  60 100 80  80 100
61633  20 100 20  80  80

Writing Functions

  • The first step in writing a function is to find the pattern:
  • Let’s look back at the code above:
E1 = (E1 - min(E1, na.rm = T))/(max(E1, na.rm = T) - min(E1, na.rm = T))*100
A1 = (A1 - min(A1, na.rm = T))/(max(A1, na.rm = T) - min(A1, na.rm = T))*100
C1 = (C1 - min(C1, na.rm = T))/(max(C1, na.rm = T) - min(C1, na.rm = T))*100
N1 = (N1 - min(N1, na.rm = T))/(max(N1, na.rm = T) - min(N1, na.rm = T))*100
O1 = (O1 - min(O1, na.rm = T))/(max(O1, na.rm = T) - min(O1, na.rm = T))*100
  • Do you see the pattern?
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))

Writing Functions

We need three things to turn this into a function:

(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
  1. A name. Here we’ll use rescale01 because this function rescales a vector to lie between 0 and 1.
  2. The arguments. The arguments are things that vary across calls and our analysis above tells us that we have just one. We’ll call it x because this is the conventional name for a numeric vector.
  3. The body. The body is the code that’s repeated across all the calls.

Writing Functions

  • The template of a function looks something like this:
name <- function(arguments){
  body
}

Writing Functions

  • Although for local functions, I suggest something like this:
name <- function(
    argument1 # description of first argument
    , argument2 # description of second argument
    ){
  body
}
  • That’s not going to fit well on my slides, so I will minimally do it in class.

Writing Functions

  • Following this, we return to our function:
pomp <- function(x, na) {
  (x - min(x, na.rm = na))/(max(x, na.rm = na) - min(x, na.rm = na))*100
}

Function Efficiency

  • For most of our purposes, we may not care that much about how fast our code is
  • Unless you’re working with huge data, things typically run fast
  • But with a function like POMP, I may need to run it on 100+ variables each with 10,000+ observations, so speed may be a consideration
  • So how do you speed up your functions?
    1. Avoid computing something twice
    2. Try to anticipate where things could go wrong

Function Efficiency

  • So in our function, we calculate min() twice and min() once
  • We could just calculate range() instead, which reduces that computation time
  • Watch (note you have to run this in your console to see the difference)!
pomp <- function(x) {
  (x - min(x, na.rm = T))/(max(x, na.rm = T) - min(x, na.rm = T))*100
}

start <- Sys.time()
tmp <- bfi %>%
  select(matches("1$")) %>%
  mutate_at(vars(matches("1$")), pomp)
(diff1 <- Sys.time() - start)
Time difference of 0.01944995 secs
pomp <- function(x) {
  rng <- range(x, na.rm = T)
  (x - rng[1])/(rng[2] - rng[1])*100
}

start <- Sys.time()
tmp <- bfi %>%
  select(matches("1$")) %>%
  mutate_at(vars(matches("1$")), pomp)
(diff2 <- Sys.time() - start)
Time difference of 0.01314497 secs

Function Efficiency

  • But those differences are really minimal right? Just 0.0063 seconds.
  • But across 150 variables that’s 0.9457 seconds.

Cautionary note on outputs

  • The function we wrote above could be a “mutate()” function because the output is the same length / size as the input.
  • If you write a function that drops cases, you will get an error because R can’t make the resulting vector fit back into the data frame
  • Or, if you write a function that returns length 1 within mutate, it will just repeat that value for every row
  • So, just be careful to think about what the output is and make sure that it matches the use case you are working with!

More notes on outputs

  • You can output anything you want from a function!
  • data frames
  • vectors
  • single values
  • strings
  • lists (very common for base R and package functions!!)
  • model objects
  • plots
  • tables
  • etc.!

You Try:

Try turning the following into functions:

mean(is.na(x))
mean(is.na(y))
mean(is.na(z))

x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)

round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)

(x - mean(x, na.rm = T))/sd(x, na.rm = T)
(y - mean(y, na.rm = T))/sd(y, na.rm = T)
(z - mean(z, na.rm = T))/sd(z, na.rm = T)

You Try (Solutions):

prop_missing <- function(x) mean(is.na(x))

prop_total   <- function(x) x / sum(x, na.rm = T)

round_prop   <- function(x) round(prop_total(x)*100, 1)

z_score      <- function(x) (x - mean(x, na.rm = T))/sd(x, na.rm = T)

Data Frame Functions

  • All the functions we’ve covered so far are vector functions (i.e. they input a vector, not a matrix or data frame and work well within mutate() calls)
  • But if you have a long string of dplyr verbs, it can be useful to put these into a data frame function where you provide a data frame input and flexible way of naming the columns
  • I’m going to give a brief example here, but suggest that you check out more in R for Data Science.

Data Frame Functions

  • Let’s start with a brief example. To do it, let’s first make the bfi data frame long
bfi_long <- bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = c(-SID, -education, -gender, -age)
    , names_to = c("trait", "item")
    , names_sep = -1
    , values_to = "value"
    , values_drop_na = T
  )
head(bfi_long, 8)
# A tibble: 8 × 7
  SID   gender education   age trait item  value
  <chr>  <int>     <int> <int> <chr> <chr> <int>
1 61617      1        NA    16 A     1         2
2 61617      1        NA    16 A     2         4
3 61617      1        NA    16 A     3         3
4 61617      1        NA    16 A     4         4
5 61617      1        NA    16 A     5         4
6 61617      1        NA    16 C     1         2
7 61617      1        NA    16 C     2         3
8 61617      1        NA    16 C     3         3

Data Frame Functions

  • So maybe I want to get means for each of the Big Five from this, so I write a function like this:
grouped_mean <- function(df, group_var, mean_var) {
  df %>%
    group_by(group_var) %>%
    summarize(mean(mean_var))
}

bfi_long %>% grouped_mean(trait, value)

Data Frame Functions

  • That didn’t work because we can’t specify grouping variables like that
  • To get around it, we have to use something called embracing: We have to wrap the variables like this {{ trait }}
  • This just nudges the dplyr functions to look inside the data frame for the column you specify
tidy_describe <- function(df, var) {
  df %>%
    summarize(
      mean   = mean({{ var }},   na.rm = TRUE),
      sd     = sd({{ var }},     na.rm = TRUE),
      median = median({{ var }}, na.rm = TRUE),
      min    = min({{ var }},    na.rm = TRUE),
      max    = max({{ var }},    na.rm = TRUE),
      n      = n(),
      n_miss = sum(is.na({{ var }})),
      .groups = "drop"
      )
}

bfi_long %>% 
  group_by(trait) %>%
  tidy_describe(value)
# A tibble: 5 × 8
  trait  mean    sd median   min   max     n n_miss
  <chr> <dbl> <dbl>  <dbl> <int> <int> <int>  <int>
1 A      4.22  1.61      5     1     6 13896      0
2 C      3.81  1.57      4     1     6 13893      0
3 E      3.79  1.61      4     1     6 13906      0
4 N      3.16  1.59      3     1     6 13881      0
5 O      3.87  1.67      4     1     6 13916      0

Data Frame Functions

  • Or remember when I had us getting counts and proportions for continuous x categorical relationships?
count_prop <- function(df, var, sort = FALSE) {
  df %>%
    count({{ var }}, sort = sort) %>%
    mutate(prop = n / sum(n))
}

bfi_long %>% 
  count_prop(education)
# A tibble: 6 × 3
  education     n   prop
      <int> <int>  <dbl>
1         1  5565 0.0801
2         2  7248 0.104 
3         3 30988 0.446 
4         4  9797 0.141 
5         5 10381 0.149 
6        NA  5513 0.0793

Wrap-Up

  • This is just a brief introduction to functions.
  • Functions are absolutely essential part of workflows, and you’ll see them pop up in every lesson from here on out (and you already saw them pop up in previous lessons)
  • As we continue to see them, I’ll ramp up their complexity, showing you how to write functions for estimating models, model predictions, figures, tables, and more

Iteration and purrr

Iteration

Iteration is everywhere. It underpins much of mathematics and statistics. If you’ve ever seen the \(\Sigma\) symbol, then you’ve seen (and probably used) iteration.

Reasons for iteration:
- reading in multiple files from a directory
- running the same operation multiple times
- running different combinations of the same model
- creating similar figures / tables / outputs

for loops

Enter for loops. for loops are the “OG” form of iteration in computer science. The basic syntax is below. Basically, we can use a for loop to loop through and print a series of things.

for(i in letters[1:5]){
  print(i)
}
[1] "a"
[1] "b"
[1] "c"
[1] "d"
[1] "e"
  • The code above “loops” through 5 times, printing the iteration letter.

_apply() family

  • A somewhat faster version of for loops comes from the _apply() family of functions, including:
lapply(letters[1:5], print)
[1] "a"
[1] "b"
[1] "c"
[1] "d"
[1] "e"
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
[1] "c"

[[4]]
[1] "d"

[[5]]
[1] "e"
sapply(letters[1:5], print)
[1] "a"
[1] "b"
[1] "c"
[1] "d"
[1] "e"
  a   b   c   d   e 
"a" "b" "c" "d" "e" 
mapply(print, letters[1:5])
[1] "a"
[1] "b"
[1] "c"
[1] "d"
[1] "e"
  a   b   c   d   e 
"a" "b" "c" "d" "e" 

purrr and _map_() functions

  • Today, though, we’ll focus on the map() family of functions, which is the functions through which purrr iterates.
map(letters[1:5], print)
[1] "a"
[1] "b"
[1] "c"
[1] "d"
[1] "e"
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
[1] "c"

[[4]]
[1] "d"

[[5]]
[1] "e"
  • For a more thorough comparison of for loops, the _apply() family, and _map_() functions, see https://jennybc.github.io/purrr-tutorial/

purrr and _map_() predicates

  • Today, though, we’ll focus on the map() family of functions, which is the functions through which purrr iterates.
map(letters[1:5], print)
  • Note that this returns a list, which we may not always want.
  • With purrr, we can change the kind of output of map() by adding a predicate, like lgl, dbl, chr, and df.
  • So in the example above, we may have wanted just the characters to print. To do that we’d call map_chr():
map_chr(letters[1:5], print)
[1] "a"
[1] "b"
[1] "c"
[1] "d"
[1] "e"
[1] "a" "b" "c" "d" "e"
  • Note that it also returns the concatenated character vector as well as printing each letter individually (i.e. iteratively).

purrr and _map_() antecedents

  • How many mappings?
    • Single mapping: map_()
    • Parallel (2) mapping(s): map2_()
    • 3 or more mappings: pmap_()
map2_chr(letters[1:5], 1:5, paste)
[1] "a 1" "b 2" "c 3" "d 4" "e 5"
  • Note here that we can use map2() and pmap() with the predicates from above.

Use Cases

  1. Reading Data
  2. Cleaning Data
  3. Running Models
  4. (Plotting Figures - Week 8)
  5. (Creating Tables - Week 8)

Reading Data

There are a number of different cases where purrr and map() maybe useful for reading in data including:

  • subject-level files for an experimental task
  • subject- and task-level files for an experimental task
  • EMA data
  • longitudinal data
  • web scraping and text mining

Reading Data: Participant-Level EMA

For this first example, I’ll show you how this would look with a for loop before I show you how it looks with purrr.

Assuming you have all the data in a single folder and the format is reasonably similar, you have the following basic syntax:

files <- list.files(data_path)
data <- list()
for(i in files){
  data[[i]] <- read_csv(i)
}
data <- combine(data)

Reading Data: Participant-Level EMA

  • This works fine in this simple case, but where purrr really shines in when you need to make modifications to your data before combining, whether this be recoding, removing missing cases, or renaming variables.

  • But first, the simple case of reading data. The code below will download a .zip file when you run it. Once, you do, navigate to your Desktop to unzip the folder. Open the R Project and you should be able to run the rest of the code

data_source <- "https://github.com/emoriebeck/psc290-data-FQ23/raw/main/04-workshops/05-week5-purrr/05-week5-purrr.zip"
data_dest <- "~/Desktop/05-week5-purrr.zip"
download.file(data_source, data_dest)

Reading Data: Participant-Level EMA

df1 <- tibble(
  ID = list.files("data/example_1")
  ) %>%
  mutate(data = map(ID, ~read_csv(sprintf("data/example_1/%s", .)))) %>%
  unnest(data) 
  • The code above creates a list of ID’s from the data path (files named for each person), reads the data in using the map() function from purrr, removes the “.csv” from the ID variable, then unnests the data, resulting in a data frame for each person.

List Columns and the Power of purrr

  • On the previous slide, we saw a data frame inside of a data frame. This is called a list column within a nested data frame.

  • In this case, we created a list column using map, but one of the best things about purrr is how it combines with the nest() and unnest() functions from the tidyr package.

  • We’ll return to nest() later to demonstrate how anything you would iterate across is also something we can nest() by in long format data frames.

Exercise: Reading in Subject-Level Data

  • Now, we’re going to combine with what we learned about last time with codebooks.
codebook <- read_csv("data/codebook_ex1.csv")
codebook
# A tibble: 11 × 7
   old_name     name         short_name item_name description scale reverse_code
   <chr>        <chr>        <chr>      <chr>     <chr>       <chr> <chr>       
 1 O_AesSens    Openness     O          O_1       Value arti… "lik… no          
 2 E_Assert     Extraversion E          E_1       Assertive   "lik… no          
 3 N_Depr       Neuroticism  N          N_1       Depressed   "lik… no          
 4 N_EmoVol     Neuroticism  N          N_2       Feel emoti… "lik… no          
 5 O_IntCur     Openness     O          O_2       Curious, o… "lik… no          
 6 C_Org        Conscientio… C          C_1       Organized   "lik… no          
 7 A_Rspct      Agreeablene… A          A_1       Resepct ot… "lik… no          
 8 C_Rspnbl     Conscientio… C          C_2       Responsible "lik… no          
 9 A_Cmpn       Agreeablene… A          A_2       Compassion… "lik… no          
10 E_EnerLev    Extraversion E          E_2       Energetic   "lik… no          
11 satisfaction Life Satisf… SWL        satisfac… How satisf… "lik… no          

Exercise: What next?

  • Now, that we have a codebook, what are the next steps?
df1 <- tibble(
  ID = list.files("data/example_1")
  ) %>%
  mutate(data = map(ID, ~read_csv(sprintf("data/example_1/%s", .)))) %>%
  unnest(data) 
head(df1, 6)
# A tibble: 6 × 18
  ID       O_AesSens E_Assert N_Depr N_EmoVol O_IntCur A_Rspct A_Cmpn O_CrtvImag
  <chr>        <dbl>    <dbl>  <dbl>    <dbl>    <dbl>   <dbl>  <dbl>      <dbl>
1 031501.…      3.98     3.82   2        3.89     5       3      2.63       2   
2 031501.…      4.00     2      1        3        2.90    2.16   3.54       1.22
3 031501.…      4        3      3.5      4        4       4.69   4          2.87
4 031501.…      3        3      3        4        3.06    4      3.27       4   
5 031501.…      3.35     3.50   2        2        2.01    4      4          3.5 
6 031501.…      3.89     3      2.33     2.38     3.5     4      4.25       2.09
# ℹ 9 more variables: E_EnerLev <dbl>, A_Trust <dbl>, N_Anxty <dbl>,
#   E_Scblty <dbl>, count <dbl>, satisfaction <dbl>, C_Org <dbl>,
#   C_Rspnbl <dbl>, C_Prdctv <dbl>
  1. pull old names in raw data from codebook
  2. pull new names from codebook
  3. select columns from codebook in loaded data
  4. rename columns with new names

Exercise: Reading in Subject-Level Data

old.names <- codebook$old_name # pull old names in raw data from codebook  
new.names <- codebook$item_name # pull new names from codebook  
df1 <- tibble(
  ID = list.files("data/example_1")
  ) %>%
  mutate(data = map(ID, ~read_csv(sprintf("data/example_1/%s", .)))
         , ID = str_remove_all(ID, ".csv")) %>%
  unnest(data) %>%
  select(ID, count, all_of(old.names)) %>% # select columns from codebook in loaded data  
  setNames(c("ID", "count", new.names)) # rename columns with new names  
head(df1, 6)
# A tibble: 6 × 13
  ID     count   O_1   E_1   N_1   N_2   O_2   C_1   A_1   C_2   A_2   E_2
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 031501     1  3.98  3.82  2     3.89  5       NA  3       NA  2.63  4   
2 031501     2  4.00  2     1     3     2.90    NA  2.16    NA  3.54  3   
3 031501     3  4     3     3.5   4     4       NA  4.69    NA  4     3   
4 031501     4  3     3     3     4     3.06    NA  4       NA  3.27  3.46
5 031501     5  3.35  3.50  2     2     2.01    NA  4       NA  4     4   
6 031501     6  3.89  3     2.33  2.38  3.5     NA  4       NA  4.25  4   
# ℹ 1 more variable: satisfaction <dbl>

Descriptives: Within-Person Correlations

With these kinds of data, the first thing, we may want to do is look at within-person correlations, which we can do with purrr.

cor_fun <- function(x) cor(x %>% select(-count), use = "pairwise")

nested_r <- df1 %>%
  group_by(ID) %>%
  nest() %>%
  ungroup() %>%
  mutate(r = map(data, cor_fun))
head(nested_r, 6)
# A tibble: 6 × 3
  ID     data               r              
  <chr>  <list>             <list>         
1 031501 <tibble [64 × 12]> <dbl [11 × 11]>
2 031502 <tibble [39 × 12]> <dbl [11 × 11]>
3 032201 <tibble [68 × 12]> <dbl [11 × 11]>
4 032202 <tibble [65 × 12]> <dbl [11 × 11]>
5 032203 <tibble [61 × 12]> <dbl [11 × 11]>
6 032901 <tibble [79 × 12]> <dbl [11 × 11]>

Descriptives: Within-Person Correlations

We can access it like a list:

nested_r$data[[1]]
# A tibble: 64 × 12
   count   O_1   E_1   N_1   N_2   O_2   C_1   A_1   C_2   A_2   E_2
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1  3.98  3.82  2     3.89  5       NA  3       NA  2.63  4   
 2     2  4.00  2     1     3     2.90    NA  2.16    NA  3.54  3   
 3     3  4     3     3.5   4     4       NA  4.69    NA  4     3   
 4     4  3     3     3     4     3.06    NA  4       NA  3.27  3.46
 5     5  3.35  3.50  2     2     2.01    NA  4       NA  4     4   
 6     6  3.89  3     2.33  2.38  3.5     NA  4       NA  4.25  4   
 7     7  3.01  3     2.5   2     5.06    NA  4       NA  4     4.27
 8     8  3.00  3     2.5   2     3.56    NA  4       NA  4     4   
 9     9  4.00  2.5   2     3     4       NA  4       NA  3.95  3.65
10    10  4.20  3.45  2     2     3       NA  3.87    NA  4     4   
# ℹ 54 more rows
# ℹ 1 more variable: satisfaction <dbl>

Descriptives: Within-Person Correlations

  • But we can’t easily (well, nicely) just unnest() a matrix
  • We’ve lost a lot of information along the way
  • So what do we do?
nested_r %>%
  select(-data) %>%
  unnest(r)
# A tibble: 649 × 2
   ID        r[,1]    [,2]    [,3]    [,4]    [,5]  [,6]     [,7]  [,8]    [,9]
   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>    <dbl> <dbl>   <dbl>
 1 031501  1       -0.0503 -0.133   0.0490  0.0849    NA  0.139      NA  0.197 
 2 031501 -0.0503   1       0.135   0.0467 -0.0439    NA  0.0169     NA -0.171 
 3 031501 -0.133    0.135   1       0.294   0.0122    NA  0.0835     NA -0.0669
 4 031501  0.0490   0.0467  0.294   1       0.248     NA -0.0752     NA -0.114 
 5 031501  0.0849  -0.0439  0.0122  0.248   1         NA -0.0523     NA  0.0653
 6 031501 NA       NA      NA      NA      NA         NA NA          NA NA     
 7 031501  0.139    0.0169  0.0835 -0.0752 -0.0523    NA  1          NA  0.303 
 8 031501 NA       NA      NA      NA      NA         NA NA          NA NA     
 9 031501  0.197   -0.171  -0.0669 -0.114   0.0653    NA  0.303      NA  1     
10 031501  0.00547 -0.201  -0.150  -0.0782  0.229     NA -0.00452    NA  0.236 
# ℹ 639 more rows
# ℹ 1 more variable: r[10:11] <dbl>

Descriptives: Within-Person Correlations

  • Write a function, of course!
cor_fun <- function(x) {
  r <- cor(x %>% select(-count), use = "pairwise")
  r %>% 
    data.frame() %>%
    rownames_to_column("var") %>%
    as_tibble()
}

nested_r <- nested_r %>%
  mutate(r = map(data, cor_fun))
# A tibble: 10 × 3
   ID     data               r                 
   <chr>  <list>             <list>            
 1 031501 <tibble [64 × 12]> <tibble [11 × 12]>
 2 031502 <tibble [39 × 12]> <tibble [11 × 12]>
 3 032201 <tibble [68 × 12]> <tibble [11 × 12]>
 4 032202 <tibble [65 × 12]> <tibble [11 × 12]>
 5 032203 <tibble [61 × 12]> <tibble [11 × 12]>
 6 032901 <tibble [79 × 12]> <tibble [11 × 12]>
 7 040501 <tibble [48 × 12]> <tibble [11 × 12]>
 8 040502 <tibble [41 × 12]> <tibble [11 × 12]>
 9 040503 <tibble [53 × 12]> <tibble [11 × 12]>
10 040504 <tibble [41 × 12]> <tibble [11 × 12]>

Descriptives: Within-Person Correlations

  • Let’s try unnesting again:
nested_r %>%
  select(-data) %>%
  unnest(r) %>% 
  arrange(desc(var))
# A tibble: 649 × 13
   ID     var             O_1     E_1     N_1      N_2      O_2     C_1      A_1
   <chr>  <chr>         <dbl>   <dbl>   <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
 1 031501 satisfact…  0.103    0.0994 -0.261  -0.00209  0.0765  NA       5.65e-2
 2 031502 satisfact…  0.0929   0.126  NA      NA       -0.0482   0.0672 -3.81e-3
 3 032201 satisfact… -0.00857 -0.0182 -0.190   0.00409  0.108   -0.157  NA      
 4 032202 satisfact…  0.186    0.0614 -0.348  -0.190   -0.00751  0.102  NA      
 5 032203 satisfact… -0.00329  0.122  -0.0633  0.113   -0.210   -0.139  NA      
 6 032901 satisfact…  0.137   NA      -0.189  -0.148   -0.0821  -0.0344  3.96e-2
 7 040501 satisfact… -0.0527  -0.137  NA      NA       -0.145    0.0694  6.01e-2
 8 040502 satisfact…  0.273   NA       0.275  -0.0369  -0.00917 -0.0725  2.03e-1
 9 040503 satisfact…  0.0595  NA       0.0272  0.0926  -0.0266  -0.108  -8.85e-4
10 040504 satisfact…  0.339   NA      -0.305   0.0471   0.144    0.0378  3.53e-1
# ℹ 639 more rows
# ℹ 4 more variables: C_2 <dbl>, A_2 <dbl>, E_2 <dbl>, satisfaction <dbl>

Descriptives: Within-Person Correlations

  • There’s more I would usually do here to format a correlation table for each participant and output the file as a PDF or html so I can post it on GitHub / OSF
  • But we’ll get there in Week 8/9!

Descriptives: Means, sds, etc.

tidy_describe <- function(df) {
  df %>%
    pivot_longer(
      -count
      , names_to = "item"
      , values_to = "value"
      , values_drop_na = T
      ) %>%
    group_by(item) %>%
    summarize(
      mean   = mean(value,   na.rm = TRUE),
      sd     = sd(value,     na.rm = TRUE),
      median = median(value, na.rm = TRUE),
      min    = min(value,    na.rm = TRUE),
      max    = max(value,    na.rm = TRUE),
      n      = n(),
      n_miss = sum(is.na(value)),
      .groups = "drop"
      )
}
nested_r <- nested_r %>%
  mutate(desc = map(data, tidy_describe)) 
nested_r
# A tibble: 59 × 4
   ID     data               r                  desc            
   <chr>  <list>             <list>             <list>          
 1 031501 <tibble [64 × 12]> <tibble [11 × 12]> <tibble [9 × 8]>
 2 031502 <tibble [39 × 12]> <tibble [11 × 12]> <tibble [9 × 8]>
 3 032201 <tibble [68 × 12]> <tibble [11 × 12]> <tibble [9 × 8]>
 4 032202 <tibble [65 × 12]> <tibble [11 × 12]> <tibble [9 × 8]>
 5 032203 <tibble [61 × 12]> <tibble [11 × 12]> <tibble [9 × 8]>
 6 032901 <tibble [79 × 12]> <tibble [11 × 12]> <tibble [9 × 8]>
 7 040501 <tibble [48 × 12]> <tibble [11 × 12]> <tibble [9 × 8]>
 8 040502 <tibble [41 × 12]> <tibble [11 × 12]> <tibble [9 × 8]>
 9 040503 <tibble [53 × 12]> <tibble [11 × 12]> <tibble [9 × 8]>
10 040504 <tibble [41 × 12]> <tibble [11 × 12]> <tibble [9 × 8]>
# ℹ 49 more rows

Descriptives: Means, sds, etc.

nested_r %>%
  select(-data, -r) %>%
  unnest(desc)
# A tibble: 531 × 9
   ID     item          mean    sd median   min   max     n n_miss
   <chr>  <chr>        <dbl> <dbl>  <dbl> <dbl> <dbl> <int>  <int>
 1 031501 A_1           3.87 0.616   4    2      5.24    64      0
 2 031501 A_2           3.71 0.658   4    2      5.29    64      0
 3 031501 E_1           2.89 0.798   3    0.346  4.62    64      0
 4 031501 E_2           3.31 0.773   3.50 1.24   4.27    64      0
 5 031501 N_1           2.27 0.503   2    1      4       64      0
 6 031501 N_2           2.51 0.749   2.28 0.581  4       64      0
 7 031501 O_1           3.72 0.749   4.00 1.15   6.12    64      0
 8 031501 O_2           3.44 0.835   3.42 1.35   5.94    64      0
 9 031501 satisfaction  3.16 1.10    3.09 1.05   4.97    64      0
10 031502 A_1           4.07 0.540   4    3      5       39      0
# ℹ 521 more rows

Models

  • We can put essentially anything into a nested data frame.
  • The magic happens because everything is indexed by the other columns in the data frame, so we can keep track of it
  • And unlike a normal list, we aren’t stuck with nested list structures that are really hard to parse and navigate through
  • Next, I’m going to show you how to use purrr with models
  • Modeling is not a focus of this class, but I want to demonstrate this as a workflow because it completely revolutionized mine!

Models

  • But first, we need to format our data:
df_long <- df1 %>%
  select(-satisfaction) %>%
  pivot_longer(
    cols = c(-count, -ID)
    , names_to = c("trait", "item")
    , names_sep = "_"
    , values_to = "value"
    , values_drop_na = T
    )
df_long
# A tibble: 22,104 × 5
   ID     count trait item  value
   <chr>  <dbl> <chr> <chr> <dbl>
 1 031501     1 O     1      3.98
 2 031501     1 E     1      3.82
 3 031501     1 N     1      2   
 4 031501     1 N     2      3.89
 5 031501     1 O     2      5   
 6 031501     1 A     1      3   
 7 031501     1 A     2      2.63
 8 031501     1 E     2      4   
 9 031501     2 O     1      4.00
10 031501     2 E     1      2   
# ℹ 22,094 more rows

Models

To create composites, we’ll:
1. separate traits from items 2. group_by() trait, count, and ID 3. calculate the composites using summarize()

df_long <- df_long %>%
  group_by(ID, count, trait) %>%
  summarize(value = mean(value)) %>%
  ungroup() %>%
  left_join(df1 %>% select(ID, count, satisfaction))
df_long
# A tibble: 11,052 × 5
   ID     count trait value satisfaction
   <chr>  <dbl> <chr> <dbl>        <dbl>
 1 031501     1 A      2.81         4.96
 2 031501     1 E      3.91         4.96
 3 031501     1 N      2.94         4.96
 4 031501     1 O      4.49         4.96
 5 031501     2 A      2.85         4.79
 6 031501     2 E      2.5          4.79
 7 031501     2 N      2            4.79
 8 031501     2 O      3.45         4.79
 9 031501     3 A      4.35         1.06
10 031501     3 E      3            1.06
# ℹ 11,042 more rows

Models

Then we’ll get within-person centered values using our own little function!

center <- function(x) x - mean(x, na.rm = T)

df_long <- df_long %>%
  group_by(ID, trait) %>%
  mutate(value_c = center(value)) %>%
  ungroup()
df_long
# A tibble: 11,052 × 6
   ID     count trait value satisfaction value_c
   <chr>  <dbl> <chr> <dbl>        <dbl>   <dbl>
 1 031501     1 A      2.81         4.96  -0.976
 2 031501     1 E      3.91         4.96   0.807
 3 031501     1 N      2.94         4.96   0.552
 4 031501     1 O      4.49         4.96   0.907
 5 031501     2 A      2.85         4.79  -0.939
 6 031501     2 E      2.5          4.79  -0.604
 7 031501     2 N      2            4.79  -0.393
 8 031501     2 O      3.45         4.79  -0.128
 9 031501     3 A      4.35         1.06   0.557
10 031501     3 E      3            1.06  -0.104
# ℹ 11,042 more rows

Models

And grand-mean centered within-person averages

df_long <- df_long %>%
  group_by(ID, trait) %>%
  mutate(value_gmc = mean(value)) %>%
  group_by(trait) %>%
  mutate(value_gmc = center(value_gmc)) %>%
  ungroup()
df_long
# A tibble: 11,052 × 7
   ID     count trait value satisfaction value_c value_gmc
   <chr>  <dbl> <chr> <dbl>        <dbl>   <dbl>     <dbl>
 1 031501     1 A      2.81         4.96  -0.976   -0.107 
 2 031501     1 E      3.91         4.96   0.807   -0.0644
 3 031501     1 N      2.94         4.96   0.552    0.0443
 4 031501     1 O      4.49         4.96   0.907    0.301 
 5 031501     2 A      2.85         4.79  -0.939   -0.107 
 6 031501     2 E      2.5          4.79  -0.604   -0.0644
 7 031501     2 N      2            4.79  -0.393    0.0443
 8 031501     2 O      3.45         4.79  -0.128    0.301 
 9 031501     3 A      4.35         1.06   0.557   -0.107 
10 031501     3 E      3            1.06  -0.104   -0.0644
# ℹ 11,042 more rows

Models

And now we are ready to run our models. But first, we’ll nest() our data.

nested_mods <- df_long %>%
  group_by(trait) %>%
  nest() %>%
  ungroup() 
nested_mods
# A tibble: 5 × 2
  trait data                
  <chr> <list>              
1 A     <tibble [2,278 × 6]>
2 E     <tibble [2,146 × 6]>
3 N     <tibble [2,184 × 6]>
4 O     <tibble [2,393 × 6]>
5 C     <tibble [2,051 × 6]>

Models

And now run the models.

run_model <- function(d) lmer(satisfaction ~ value_c * value_gmc + (1 | ID), data = d)

nested_mods <- df_long %>%
  group_by(trait) %>%
  nest() %>%
  ungroup() %>% 
  mutate(model = map(data, run_model))
nested_mods
# A tibble: 5 × 3
  trait data                 model    
  <chr> <list>               <list>   
1 A     <tibble [2,278 × 6]> <lmerMod>
2 E     <tibble [2,146 × 6]> <lmerMod>
3 N     <tibble [2,184 × 6]> <lmerMod>
4 O     <tibble [2,393 × 6]> <lmerMod>
5 C     <tibble [2,051 × 6]> <lmerMod>

Models

And get data frames of the results:

sprintfna <- function(x) ifelse(is.na(x), NA_character_, sprintf("%.2f", x))

tidy_tab <- function(m){
  tidy(m, conf.int = T) %>%
    mutate(pval = pnorm(abs(estimate/`std.error`), lower.tail = FALSE),
           p = round(pval, digits = 3),
           p = ifelse(pval < .001, "p &lt; .001", paste0("p = ", p))) %>%
    mutate_at(vars(estimate, conf.low, conf.high), sprintfna) %>%
    mutate(CI = ifelse(is.na(conf.low), "", sprintf("[%s,%s]", conf.low, conf.high))) %>%
    dplyr::select(term, estimate, CI, p)
}

nested_mods <- nested_mods %>%
  mutate(tidy = map(model, tidy_tab))
nested_mods
# A tibble: 5 × 4
  trait data                 model     tidy            
  <chr> <list>               <list>    <list>          
1 A     <tibble [2,278 × 6]> <lmerMod> <tibble [6 × 4]>
2 E     <tibble [2,146 × 6]> <lmerMod> <tibble [6 × 4]>
3 N     <tibble [2,184 × 6]> <lmerMod> <tibble [6 × 4]>
4 O     <tibble [2,393 × 6]> <lmerMod> <tibble [6 × 4]>
5 C     <tibble [2,051 × 6]> <lmerMod> <tibble [6 × 4]>

Models: Unnesting

Which we can print into pretty data frames

nested_mods %>%
  select(trait, tidy) %>%
  unnest(tidy)
# A tibble: 30 × 5
   trait term              estimate CI             p          
   <chr> <chr>             <chr>    <chr>          <chr>      
 1 A     (Intercept)       2.98     "[2.93,3.03]"  p &lt; .001
 2 A     value_c           -0.02    "[-0.10,0.05]" p = 0.279  
 3 A     value_gmc         0.13     "[-0.05,0.31]" p = 0.081  
 4 A     value_c:value_gmc 0.16     "[-0.12,0.44]" p = 0.127  
 5 A     sd__(Intercept)   0.04     ""             <NA>       
 6 A     sd__Observation   1.15     ""             <NA>       
 7 E     (Intercept)       2.98     "[2.93,3.03]"  p &lt; .001
 8 E     value_c           -0.07    "[-0.14,0.00]" p = 0.029  
 9 E     value_gmc         0.17     "[0.03,0.32]"  p = 0.008  
10 E     value_c:value_gmc 0.02     "[-0.17,0.22]" p = 0.399  
# ℹ 20 more rows

Models: Unnesting & Tabling

Which we can pretty easily turn into tables:

mod_tab <- nested_mods %>%
  select(trait, tidy) %>%
  unnest(tidy) %>%
  pivot_wider(
    names_from = "trait"
    , names_glue = "{trait}_{.value}"
    , values_from = c("estimate", "CI", "p")
  ) %>%
  select(term, starts_with("E"), starts_with("A"), starts_with("C"), starts_with("N"), starts_with("O"))

mod_tab
# A tibble: 6 × 16
  term      E_estimate E_CI  E_p   A_estimate A_CI  A_p   C_estimate C_CI  C_p  
  <chr>     <chr>      <chr> <chr> <chr>      <chr> <chr> <chr>      <chr> <chr>
1 (Interce… 2.98       "[2.… p &l… 2.98       "[2.… p &l… 2.97       "[2.… p &l…
2 value_c   -0.07      "[-0… p = … -0.02      "[-0… p = … 0.03       "[-0… p = …
3 value_gmc 0.17       "[0.… p = … 0.13       "[-0… p = … 0.19       "[0.… p = …
4 value_c:… 0.02       "[-0… p = … 0.16       "[-0… p = … 0.15       "[-0… p = …
5 sd__(Int… 0.00       ""    <NA>  0.04       ""    <NA>  0.00       ""    <NA> 
6 sd__Obse… 1.16       ""    <NA>  1.15       ""    <NA>  1.17       ""    <NA> 
# ℹ 6 more variables: N_estimate <chr>, N_CI <chr>, N_p <chr>,
#   O_estimate <chr>, O_CI <chr>, O_p <chr>

Models: Unnesting & Tabling

hdr <- c(1, rep(3, 5))
names(hdr) <- c(" ", "Extraversion", "Agreeableness", "Conscientiousness", "Neuroticism", "Openness")

mod_tab <- mod_tab %>%
  kable(.
        , "html"
        , escape = F
        , col.names = c("Term", rep(c("<em>b</em>", "CI", "<em>p</em>"), times = 5))
        , align = c("r", rep("c", 15))
        , caption = "<strong>Table 1</strong><br><em>Multilevel Model Estimates of Between- and Within-Person Big Five-State Satisfaction Associations"
        ) %>%
  kable_classic(full_width = F, html_font = "Times", font_size = 15) %>%
  add_header_above(hdr)

Models: Unnesting & Tabling

mod_tab
Table 1
Multilevel Model Estimates of Between- and Within-Person Big Five-State Satisfaction Associations
Extraversion
Agreeableness
Conscientiousness
Neuroticism
Openness
Term b CI p b CI p b CI p b CI p b CI p
(Intercept) 2.98 [2.93,3.03] p < .001 2.98 [2.93,3.03] p < .001 2.97 [2.92,3.02] p < .001 2.99 [2.94,3.03] p < .001 2.97 [2.93,3.02] p < .001
value_c -0.07 [-0.14,0.00] p = 0.029 -0.02 [-0.10,0.05] p = 0.279 0.03 [-0.05,0.10] p = 0.231 -0.01 [-0.08,0.06] p = 0.378 0.02 [-0.05,0.09] p = 0.29
value_gmc 0.17 [0.03,0.32] p = 0.008 0.13 [-0.05,0.31] p = 0.081 0.19 [0.06,0.32] p = 0.003 -0.09 [-0.19,0.01] p = 0.032 0.14 [-0.04,0.33] p = 0.067
value_c:value_gmc 0.02 [-0.17,0.22] p = 0.399 0.16 [-0.12,0.44] p = 0.127 0.15 [-0.04,0.34] p = 0.061 -0.19 [-0.34,-0.03] p = 0.009 0.25 [0.01,0.49] p = 0.019
sd__(Intercept) 0.00 0.04 0.00 0.00 0.03
sd__Observation 1.16 1.15 1.17 1.15 1.16

Appendix

Appendix: Sourcing Functions