Problem Set #2

Author

INSERT YOUR NAME HERE

Published

Invalid Date

Code

library(broom)

Warning: package 'broom' was built under R version 4.3.3

Code

library(broom.mixed)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Overview:

In this problem set, you will be using the ggplot2 package (part of tidyverse) to practice (1) visualizing proportions and (2) visualizing associations. This is the first of the problem sets where you’re encouraged to use your own data, so instructions here will be slightly more vague.

In Part 1, you’ll visualize proportions at least two ways (from the approximately 6 we discussed in class). In Part 2, you’ll run and visualize some sort of association or series of associations.

If you don’t have your own data, we’ll use the data we used in class.

Code

load(url("https://github.com/emoriebeck/psc290-data-viz-2022/blob/main/04-week4-associations/04-data/week4-data.RData?raw=true"))
pred_data

Part 1: Visualizing Proportions

In class, we discussed the following plots:

Pie Charts (one discrete variable)
Stacked Bar Charts (multiple discrete variables)
Side-by-Side Bar Charts (multiple discrete variables)
Bar Charts Across Continuous Variables (1+ discrete and 1+ continuous)
Density Across Continuous Variables (1+ discrete and 1+ continuous [or ordinal])
Mosaic Plots (nested discrete or multiple discrete)
Tree Maps (nested discrete)

Question 1: One discrete variable

First, we’ll visualize proportions of a single discrete variable. If you’re using the provided data

Plot the proportion of people/trials/observations/etc. across a single discrete variable as a pie chart.
Make sure to:
- Directly label the groups
- Directly label the actual proportions
- Use colorblind friendly palettes that helps you tell a story
- Adjust axis and plot labels and titles

If you’re using the provided data, plot parental education (1 = “no college degree”; 2 = “college degree”; 3 = “graduate or professional degree”).

Code

d_q1 <- pred_data %>% 
  filter(study == "Study1")
d_q1 %>%
  group_by(parEdu) %>%
  tally() %>%
  mutate(parEdu = factor(parEdu, 1:3, c("No college Degree", "College Degree", "Graduate Degree"))) %>%
  arrange(desc(parEdu)) %>%
  mutate(prop = n / sum(n) * 100
         , ypos = cumsum(prop)- 0.5*prop) %>%
  ggplot(aes(x = "", y = prop, fill = parEdu)) + 
    geom_bar(stat = "identity", width = 1, color = "black") + 
    geom_label(
      aes(y = ypos, label = sprintf("%s\n%.1f%%", parEdu, prop))
      , color = "white"
      , size = 6
      , fontface = 2) +
    scale_fill_manual(values = c("green4", "grey65", "grey80")) +
    coord_polar("y", start = 0) + 
    labs(
      title = "A majority of participants had parents who\ndid not complete postsecondary education"
    ) + 
    theme_void() + 
    theme(
      legend.position = "none"
      , plot.title = element_text(face = "bold.italic", size = rel(1.4), hjust = .5)
      )

Code

## Your code here ##

Question 2: Multiple Discrete Variables

Now, let’s consider the case where we have multiple discrete or ordinal variables or a combination of discrete and continuous variables.

Plot the proportions using any of the following:

Stacked Bar Charts (multiple discrete variables)
Side-by-Side Bar Charts (multiple discrete variables)
Bar Charts Across Continuous Variables (1+ discrete and 1+ continuous)
Density Across Continuous Variables (1+ discrete and 1+ continuous [or ordinal])

Make sure to make aesthetic adjustments similar to those we made in class!

If you’re using provided data, look at the frequency of smoking across samples.

Code

pred_data %>%
  filter(!is.na(smokes)) %>%
  group_by(study, smokes) %>%
  tally() %>%
  group_by(study) %>%
  mutate(prop = n/sum(n)*100) %>%
  ungroup() %>%
  mutate(smokes = factor(smokes, c(0,1), c("Nonsmoker", "Smoker")),
         study = factor(study, c("Study1", "Study6", "Study4", "Study2", "Study3"))) %>%
  ggplot(aes(x = smokes, y = prop, fill = smokes)) + 
    geom_bar(stat = "identity", color = "black", position = "dodge") + 
    scale_y_continuous(
      limits = c(0,70), breaks = seq(0,70, 20), labels = c("0%", "20%", "40%", "60%")
    ) +
    scale_fill_manual(values = c("black", "white")) +
    facet_grid(~study) + 
    labs(
      x = NULL
      , y = "Percentage of Participants"
      , title = "Studies differ widely in the proportion of smokers"
      ) + 
    theme_classic() + 
    theme(
      legend.position = "none"
      , axis.text = element_text(face = "bold", size = rel(1.2))
      , axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1))
      , axis.title = element_text(face = "bold", size = rel(1.2))
      , strip.background = element_rect(fill = "grey90", color = "black")
      , strip.text = element_text(face = "bold", size = rel(1.2))
      , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
      )

Part 2: Associations and Models

Next, you’ll practice Week 4 skills, specifically how to visualize from model objects. In class, we discussed multiple ways to visualize associations including:

heat maps (multivariate)
correlelograms (multivariate)
scatterplots (bi- or multivariate)
forest plots (multiple effect sizes from models)
prediction plots (association between variables)

Question 1: Multivariate associations

First, look at the between-person associations between variables in your data set. Some hints:

Initially, this is easiest if your data are in wide form (e.g., if you have a “condition” with multiple conditions, you’ll eventually want to move that into columns like “Condition 1”, “Condition 2”, etc.)
If you have longitudinal or repeated measures data, you need to get the person means (if your data are in wide form, consider using mutate_at())
Copy the code from class for creating the correlation matrix itself but feel free to take liberties in the construction of the heat map itself

If you’re using the provided data, look at the correlations between the following variables in Study 2 (or in all studies if you’re feeling adventurous!):

p_value
age
grsWages
SRhealth
exercise
BMI
parOccPrstg

Code

r_reshape_fun <- function(r){
  coln <- colnames(r)
  # remove lower tri and diagonal
  r[lower.tri(r, diag = T)] <- NA
  r %>% data.frame() %>%
    rownames_to_column("V1") %>%
    pivot_longer(
      cols = -V1
      , values_to = "r"
      , names_to = "V2"
    ) %>%
    mutate(V1 = factor(V1, coln)
           , V2 = factor(V2, rev(coln)))
}

r_nested <- pred_data %>%
  select(study, p_value, age, grsWages, SRhealth, exercise, BMI, parOccPrstg) %>%
  mutate_if(is.factor, ~as.numeric(as.character(.))) %>%
  group_by(study) %>%
  nest() %>%
  ungroup() %>%
  mutate(r = map(data, ~cor(., use = "pairwise")),
         r_long = map(r, r_reshape_fun))

r_nested %>%
  select(study, r_long) %>%
  unnest(r_long) %>%
  ggplot(aes(x = V1, y = V2, fill = r)) + 
  geom_raster() + 
  geom_text(aes(label = round(r, 2))) + 
  scale_fill_gradient2(limits = c(-1,1)
    , breaks = c(-1, -.5, 0, .5, 1)
    , low = "blue", high = "red"
    , mid = "white", na.value = "white") + 
  labs(
    x = NULL
    , y = NULL
    , fill = "Zero-Order Correlation"
    , title = "Zero-Order Correlations Among Variables"
    ) + 
  facet_wrap(~study, nrow = 2) + 
  theme_classic() + 
  theme(
    legend.position = "bottom"
    , axis.text = element_text(face = "bold")
    , axis.text.x = element_text(angle = 45, hjust = 1)
    , plot.title = element_text(face = "bold", hjust = .5)
    , plot.subtitle = element_text(face = "italic", hjust = .5)
    , panel.background = element_rect(color = "black", size = 1)
    , strip.background = element_rect(color = "black", fill = "black")
    , strip.text = element_text(face = "bold", color = "white")
  )

Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

Warning: Removed 168 rows containing missing values or values outside the scale range
(`geom_text()`).

Question 2: Plotting model objects

Choose an association between an outcome variable and one or more predictor variables in your dataset. Almost any type of model is supported by broom, but if you use bayesian models or multilevel models, you’ll need to install and load broom.mixed and if you use SEM, you can’t use broom.

Note that if you have a saved model from a different project, you can just directly load it into this assignment.

If you’re using class data, we’ll look at the association between personality (p_value), gender (gender), and education (education) and physical health events (physhlthevnt). You can do it for all studies or one study.

First, get the model(s) set up and run. Run broom/broom.mixed::tidy on the model (don’t forget to get confidence intervals [unless you’re using MLM, then don’t bother]).

Code

m_fun <- function(x){
  glm(physhlthevnt ~ p_value + gender + education, data = x, family = binomial(link = "logit"))
}
nested_mods <- pred_data %>%
  select(study, SID, p_value, gender, smokes, education, physhlthevnt) %>%
  filter(physhlthevnt %in% c(1,0)) %>%
  group_by(study) %>%
  nest() %>%
  ungroup() %>%
  mutate(m = map(data, m_fun), 
         tidy = map(m, ~tidy(., conf.int = T)))

Plot the effect sizes as a forest plot.

Code

nested_mods %>%
  select(-data, -m) %>%
  unnest(tidy) %>%
  ggplot(aes(y = term, x = estimate)) + 
    geom_vline(aes(xintercept = 0), linetype = "dashed") +
    geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = .1) + 
    geom_point(aes(fill = study), color = "black", shape = 22, size = 3) + 
    facet_wrap(~study, nrow = 2) +
    labs(
      x = "Estimate (CI) in OR"
      , y = NULL
      , title = "Education Has a Protective Effect on Health"
      ) + 
    theme_bw() + 
    theme(
      legend.position = "none"
      , axis.text = element_text(face = "bold", size = rel(1.1))
      , axis.title = element_text(face = "bold", size = rel(1.2))
      , axis.line = element_blank()
      , strip.text = element_text(face = "bold", size = rel(1.1), color = "white")
      , strip.background = element_rect(fill = "black")
      , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
      , plot.subtitle = element_text(face = "italic", size = rel(1.1))
      , panel.border = element_rect(color = "black", fill = NA, size = 1)
    )

Last, plot the predicted values (i.e. the fitted association). Some hints:

You need to choose two focal variables and to ignore the others.
Make sure the kind of plot is appropriate for the kind of data you’re plotting (i.e. not a scatterplot for binary variables)
Remember the note in class that you can’t use augment unless you don’t have covariates. If you have more than one predictor, use predict() or fitted() instead.

Code

pred_fun <- function(m){
  d <- m$data
  gweights <- d %>% summarize_at(vars(gender, education), ~mean(as.numeric(as.character(.))))
  crossing(
    p_value = seq(0,10,.5)
    , gender = unique(as.character(d$gender))
    , education = unique(as.character(d$education))
    ) %>%
  bind_cols(predict(m, newdata = ., se.fit = T)) %>%
    group_by(p_value) %>%
    summarize_at(vars(fit, se.fit), mean) %>%
    ungroup()
}

nested_mods <- nested_mods %>%
  mutate(pred = map(m, pred_fun))

Code

nested_mods %>%
  select(study, pred) %>%
  unnest(pred) %>%
  mutate_at(vars(fit, se.fit), exp) %>%
  ggplot(aes(x = p_value, y = fit, fill = study)) + 
    geom_ribbon(aes(ymin = fit-1.96*se.fit, ymax = fit+1.96*se.fit), alpha = .5) + 
    geom_line(aes(y = fit-1.96*se.fit), linetype = "dotted") + 
    geom_line(aes(y = fit+1.96*se.fit), linetype = "dotted") + 
    geom_line() + 
    facet_wrap(~study) + 
    theme_bw() + 
    theme(legend.position = "none")

Render to html and submit problem set

Render to html by clicking the “Render” button near the top of your RStudio window (icon with blue arrow)

Go to the Canvas –> Assignments –> Problem Set 2
Submit both .qmd and .html files
Use this naming convention “lastname_firstname_ps#” for your .qmd and html files (e.g. beck_emorie_ps2.qmd & beck_emorie_ps2.html)