Warning: package 'broom' was built under R version 4.3.3
Code
library(broom.mixed)library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Overview:
In this problem set, you will be using the ggplot2 package (part of tidyverse) to practice (1) visualizing proportions and (2) visualizing associations. This is the first of the problem sets where you’re encouraged to use your own data, so instructions here will be slightly more vague.
In Part 1, you’ll visualize proportions at least two ways (from the approximately 6 we discussed in class). In Part 2, you’ll run and visualize some sort of association or series of associations.
If you don’t have your own data, we’ll use the data we used in class.
Next, you’ll practice Week 4 skills, specifically how to visualize from model objects. In class, we discussed multiple ways to visualize associations including:
heat maps (multivariate)
correlelograms (multivariate)
scatterplots (bi- or multivariate)
forest plots (multiple effect sizes from models)
prediction plots (association between variables)
Question 1: Multivariate associations
First, look at the between-person associations between variables in your data set. Some hints:
Initially, this is easiest if your data are in wide form (e.g., if you have a “condition” with multiple conditions, you’ll eventually want to move that into columns like “Condition 1”, “Condition 2”, etc.)
If you have longitudinal or repeated measures data, you need to get the person means (if your data are in wide form, consider using mutate_at())
Copy the code from class for creating the correlation matrix itself but feel free to take liberties in the construction of the heat map itself
If you’re using the provided data, look at the correlations between the following variables in Study 2 (or in all studies if you’re feeling adventurous!):
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.
Warning: Removed 168 rows containing missing values or values outside the scale range
(`geom_text()`).
Question 2: Plotting model objects
Choose an association between an outcome variable and one or more predictor variables in your dataset. Almost any type of model is supported by broom, but if you use bayesian models or multilevel models, you’ll need to install and load broom.mixed and if you use SEM, you can’t use broom.
Note that if you have a saved model from a different project, you can just directly load it into this assignment.
If you’re using class data, we’ll look at the association between personality (p_value), gender (gender), and education (education) and physical health events (physhlthevnt). You can do it for all studies or one study.
First, get the model(s) set up and run. Run broom/broom.mixed::tidy on the model (don’t forget to get confidence intervals [unless you’re using MLM, then don’t bother]).
Last, plot the predicted values (i.e. the fitted association). Some hints:
You need to choose two focal variables and to ignore the others.
Make sure the kind of plot is appropriate for the kind of data you’re plotting (i.e. not a scatterplot for binary variables)
Remember the note in class that you can’t use augment unless you don’t have covariates. If you have more than one predictor, use predict() or fitted() instead.
Render to html by clicking the “Render” button near the top of your RStudio window (icon with blue arrow)
Go to the Canvas –> Assignments –> Problem Set 2
Submit both .qmd and .html files
Use this naming convention “lastname_firstname_ps#” for your .qmd and html files (e.g. beck_emorie_ps2.qmd & beck_emorie_ps2.html)
Source Code
---title: "Problem Set #2"author: "INSERT YOUR NAME HERE"date: "insert date here"format: html: code-tools: true code-copy: true code-line-numbers: true code-link: true theme: united highlight-style: tango df-print: paged code-fold: show toc: true toc-float: true self-contained: trueeditor_options: chunk_output_type: console---```{r packages}library(broom)library(broom.mixed)library(tidyverse)```# Overview:In this problem set, you will be using the **ggplot2** package (part of tidyverse) to practice (1) visualizing proportions and (2) visualizing associations. This is the first of the problem sets where you're encouraged to use your own data, so instructions here will be slightly more vague. In Part 1, you'll visualize proportions at least two ways (from the approximately 6 we discussed in class). In Part 2, you'll run and visualize some sort of association or series of associations.If you don't have your own data, we'll use the data we used in class. ```{r data}load(url("https://github.com/emoriebeck/psc290-data-viz-2022/blob/main/04-week4-associations/04-data/week4-data.RData?raw=true"))pred_data```# Part 1: Visualizing Proportions In class, we discussed the following plots: - Pie Charts (one discrete variable)- Stacked Bar Charts (multiple discrete variables)- Side-by-Side Bar Charts (multiple discrete variables)- Bar Charts Across Continuous Variables (1+ discrete and 1+ continuous)- Density Across Continuous Variables (1+ discrete and 1+ continuous [or ordinal])- Mosaic Plots (nested discrete or multiple discrete)- Tree Maps (nested discrete)## Question 1: One discrete variable First, we'll visualize proportions of a single discrete variable. If you're using the provided data1. Plot the proportion of people/trials/observations/etc. across a single discrete variable as a pie chart. 2. Make sure to: + Directly label the groups + Directly label the actual proportions + Use colorblind friendly palettes that helps you tell a story + Adjust axis and plot labels and titles If you're using the provided data, plot parental education (1 = "no college degree"; 2 = "college degree"; 3 = "graduate or professional degree"). ```{r q1.1}d_q1 <- pred_data %>%filter(study =="Study1")d_q1 %>%group_by(parEdu) %>%tally() %>%mutate(parEdu =factor(parEdu, 1:3, c("No college Degree", "College Degree", "Graduate Degree"))) %>%arrange(desc(parEdu)) %>%mutate(prop = n /sum(n) *100 , ypos =cumsum(prop)-0.5*prop) %>%ggplot(aes(x ="", y = prop, fill = parEdu)) +geom_bar(stat ="identity", width =1, color ="black") +geom_label(aes(y = ypos, label =sprintf("%s\n%.1f%%", parEdu, prop)) , color ="white" , size =6 , fontface =2) +scale_fill_manual(values =c("green4", "grey65", "grey80")) +coord_polar("y", start =0) +labs(title ="A majority of participants had parents who\ndid not complete postsecondary education" ) +theme_void() +theme(legend.position ="none" , plot.title =element_text(face ="bold.italic", size =rel(1.4), hjust = .5) )## Your code here ##```## Question 2: Multiple Discrete VariablesNow, let's consider the case where we have multiple discrete or ordinal variables or a combination of discrete and continuous variables. Plot the proportions using any of the following: - Stacked Bar Charts (multiple discrete variables)- Side-by-Side Bar Charts (multiple discrete variables)- Bar Charts Across Continuous Variables (1+ discrete and 1+ continuous)- Density Across Continuous Variables (1+ discrete and 1+ continuous [or ordinal])Make sure to make aesthetic adjustments similar to those we made in class!If you're using provided data, look at the frequency of smoking across samples. ```{r q2.1, warning=F}pred_data %>%filter(!is.na(smokes)) %>%group_by(study, smokes) %>%tally() %>%group_by(study) %>%mutate(prop = n/sum(n)*100) %>%ungroup() %>%mutate(smokes =factor(smokes, c(0,1), c("Nonsmoker", "Smoker")),study =factor(study, c("Study1", "Study6", "Study4", "Study2", "Study3"))) %>%ggplot(aes(x = smokes, y = prop, fill = smokes)) +geom_bar(stat ="identity", color ="black", position ="dodge") +scale_y_continuous(limits =c(0,70), breaks =seq(0,70, 20), labels =c("0%", "20%", "40%", "60%") ) +scale_fill_manual(values =c("black", "white")) +facet_grid(~study) +labs(x =NULL , y ="Percentage of Participants" , title ="Studies differ widely in the proportion of smokers" ) +theme_classic() +theme(legend.position ="none" , axis.text =element_text(face ="bold", size =rel(1.2)) , axis.text.x =element_text(angle =45, hjust =1, size =rel(1)) , axis.title =element_text(face ="bold", size =rel(1.2)) , strip.background =element_rect(fill ="grey90", color ="black") , strip.text =element_text(face ="bold", size =rel(1.2)) , plot.title =element_text(face ="bold", size =rel(1.1), hjust = .5) ) ```# Part 2: Associations and Models Next, you'll practice Week 4 skills, specifically how to visualize from model objects. In class, we discussed multiple ways to visualize associations including: - heat maps (multivariate)- correlelograms (multivariate) - scatterplots (bi- or multivariate) - forest plots (multiple effect sizes from models)- prediction plots (association between variables)## Question 1: Multivariate associations First, look at the between-person associations between variables in your data set. Some hints: - Initially, this is easiest if your data are in wide form (e.g., if you have a "condition" with multiple conditions, you'll eventually want to move that into columns like "Condition 1", "Condition 2", etc.) - If you have longitudinal or repeated measures data, you need to get the person means (if your data are in wide form, consider using `mutate_at()`) - Copy the code from class for creating the correlation matrix itself but feel free to take liberties in the construction of the heat map itself If you're using the provided data, look at the correlations between the following variables in Study 2 (or in all studies if you're feeling adventurous!): - `p_value`- `age`- `grsWages`- `SRhealth`- `exercise`- `BMI`- `parOccPrstg````{r}r_reshape_fun <-function(r){ coln <-colnames(r)# remove lower tri and diagonal r[lower.tri(r, diag = T)] <-NA r %>%data.frame() %>%rownames_to_column("V1") %>%pivot_longer(cols =-V1 , values_to ="r" , names_to ="V2" ) %>%mutate(V1 =factor(V1, coln) , V2 =factor(V2, rev(coln)))}r_nested <- pred_data %>%select(study, p_value, age, grsWages, SRhealth, exercise, BMI, parOccPrstg) %>%mutate_if(is.factor, ~as.numeric(as.character(.))) %>%group_by(study) %>%nest() %>%ungroup() %>%mutate(r =map(data, ~cor(., use ="pairwise")),r_long =map(r, r_reshape_fun))r_nested %>%select(study, r_long) %>%unnest(r_long) %>%ggplot(aes(x = V1, y = V2, fill = r)) +geom_raster() +geom_text(aes(label =round(r, 2))) +scale_fill_gradient2(limits =c(-1,1) , breaks =c(-1, -.5, 0, .5, 1) , low ="blue", high ="red" , mid ="white", na.value ="white") +labs(x =NULL , y =NULL , fill ="Zero-Order Correlation" , title ="Zero-Order Correlations Among Variables" ) +facet_wrap(~study, nrow =2) +theme_classic() +theme(legend.position ="bottom" , axis.text =element_text(face ="bold") , axis.text.x =element_text(angle =45, hjust =1) , plot.title =element_text(face ="bold", hjust = .5) , plot.subtitle =element_text(face ="italic", hjust = .5) , panel.background =element_rect(color ="black", size =1) , strip.background =element_rect(color ="black", fill ="black") , strip.text =element_text(face ="bold", color ="white") )```## Question 2: Plotting model objects Choose an association between an outcome variable and one or more predictor variables in your dataset. Almost any type of model is supported by broom, but if you use bayesian models or multilevel models, you'll need to install and load `broom.mixed` and if you use SEM, you can't use broom. Note that if you have a saved model from a different project, you can just directly load it into this assignment. If you're using class data, we'll look at the association between personality (`p_value`), gender (`gender`), and education (`education`) and physical health events (`physhlthevnt`). You can do it for all studies or one study. 1. First, get the model(s) set up and run. Run broom/broom.mixed::tidy on the model (don't forget to get confidence intervals [unless you're using MLM, then don't bother]). ```{r q1.1.1}m_fun <-function(x){glm(physhlthevnt ~ p_value + gender + education, data = x, family =binomial(link ="logit"))}nested_mods <- pred_data %>%select(study, SID, p_value, gender, smokes, education, physhlthevnt) %>%filter(physhlthevnt %in%c(1,0)) %>%group_by(study) %>%nest() %>%ungroup() %>%mutate(m =map(data, m_fun), tidy =map(m, ~tidy(., conf.int = T)))```2. Plot the effect sizes as a forest plot. ```{r}nested_mods %>%select(-data, -m) %>%unnest(tidy) %>%ggplot(aes(y = term, x = estimate)) +geom_vline(aes(xintercept =0), linetype ="dashed") +geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = .1) +geom_point(aes(fill = study), color ="black", shape =22, size =3) +facet_wrap(~study, nrow =2) +labs(x ="Estimate (CI) in OR" , y =NULL , title ="Education Has a Protective Effect on Health" ) +theme_bw() +theme(legend.position ="none" , axis.text =element_text(face ="bold", size =rel(1.1)) , axis.title =element_text(face ="bold", size =rel(1.2)) , axis.line =element_blank() , strip.text =element_text(face ="bold", size =rel(1.1), color ="white") , strip.background =element_rect(fill ="black") , plot.title =element_text(face ="bold", size =rel(1.1), hjust = .5) , plot.subtitle =element_text(face ="italic", size =rel(1.1)) , panel.border =element_rect(color ="black", fill =NA, size =1) )```3. Last, plot the predicted values (i.e. the fitted association). Some hints: - You need to choose two focal variables and to ignore the others. - Make sure the kind of plot is appropriate for the kind of data you're plotting (i.e. not a scatterplot for binary variables) - Remember the note in class that you can't use augment unless you don't have covariates. If you have more than one predictor, use `predict()` or `fitted()` instead. ```{r}pred_fun <-function(m){ d <- m$data gweights <- d %>%summarize_at(vars(gender, education), ~mean(as.numeric(as.character(.))))crossing(p_value =seq(0,10,.5) , gender =unique(as.character(d$gender)) , education =unique(as.character(d$education)) ) %>%bind_cols(predict(m, newdata = ., se.fit = T)) %>%group_by(p_value) %>%summarize_at(vars(fit, se.fit), mean) %>%ungroup()}nested_mods <- nested_mods %>%mutate(pred =map(m, pred_fun))``````{r}nested_mods %>%select(study, pred) %>%unnest(pred) %>%mutate_at(vars(fit, se.fit), exp) %>%ggplot(aes(x = p_value, y = fit, fill = study)) +geom_ribbon(aes(ymin = fit-1.96*se.fit, ymax = fit+1.96*se.fit), alpha = .5) +geom_line(aes(y = fit-1.96*se.fit), linetype ="dotted") +geom_line(aes(y = fit+1.96*se.fit), linetype ="dotted") +geom_line() +facet_wrap(~study) +theme_bw() +theme(legend.position ="none")```# Render to html and submit problem set**Render to html** by clicking the "Render" button near the top of your RStudio window (icon with blue arrow)- Go to the Canvas --\> Assignments --\> Problem Set 2- Submit both .qmd and .html files\- Use this naming convention "lastname_firstname_ps#" for your .qmd and html files (e.g. beck_emorie_ps2.qmd & beck_emorie_ps2.html)