Week 3: Proportions

Emorie D Beck

Quick Review

Review

What are the core elements of ggplot2 grammar?

From last week:

  • Mappings: base layer
    • ggplot() and aes()
  • Scales: control and modify your mappings
    • e.g., scale_x_continuous() and scale_fill_manual()
  • Geoms: plot elements
    • e.g., geom_point() and geom_line()
  • Facets: panel your plot
    • facet_wrap() and facet_grid()
  • Themes: style your figure
    • Built-in: e.g., theme_classic()
    • Manual: theme() (legend, strip, axis, plot, panel)

Quick Review

Colorblindness and accessible plots

  • Adding in a colorblind-friendly palette from Wong (2011)
cbsafe_pal <- tribble(
  ~name, ~rgb
  , "black", c(0, 0, 0)
  , "sky blue", c(86, 180, 233)
  , "bluish green", c(0, 158, 115)
  , "yellow", c(240, 228, 66)
  , "orange", c(230, 159, 0)
  , "blue", c(0, 114, 178)
  , "vermillion", c(213, 94, 0)
  , "reddish purple", c(204, 121, 167)
) %>%
  mutate(hex = map_chr(rgb, function(x) rgb(x[1], x[2], x[3], maxColorValue = 255)))
cbsafe_pal
# A tibble: 8 × 3
  name           rgb       hex    
  <chr>          <list>    <chr>  
1 black          <dbl [3]> #000000
2 sky blue       <dbl [3]> #56B4E9
3 bluish green   <dbl [3]> #009E73
4 yellow         <dbl [3]> #F0E442
5 orange         <dbl [3]> #E69F00
6 blue           <dbl [3]> #0072B2
7 vermillion     <dbl [3]> #D55E00
8 reddish purple <dbl [3]> #CC79A7

Proportions

Visualizating Proportions

  • Proportions are often important in our research
  • From describing sample-level differences to describing the frequency of behaviors / events / experiences, etc., we often reach toward describing amounts relative to the whole
  • But the goals we are trying to achieve are varied, which necesssitates the use of different graphics

Agenda

  • We will cover X kinds of ways of visualizations, all of which were covered in your readings
  • We will cover both when to use them and how to create them
    • Pie Charts
    • Bar Charts (Stacked)
    • Bar Charts (Side-by-Side)
    • Bar Charts and Density Across Continuous Variables
    • Mosaic Plots
    • Parallel Sets

But First, Our Data

  • Today, we’ll use the teaching sample from the German Socioeconomic Panel Study (GSOEP)
  • GSOEP is an ongoing longitudinal panel study that began in 1984 (26 waves of data!)
  • ~20,000 people are sampled each year
  • Samples households in Germany
  • Has additional sub-projects (e.g., innovation studies, migrant panel, etc.)
  • The data are publicly available via application
load(url("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/03-week3-proportions/04-data/gsoep.RData"))
gsoep
# A tibble: 480,798 × 10
    year   SID SRhealth marital chldbrth   gender yearBrth mortality   job   age
   <dbl> <dbl>    <dbl>   <dbl>    <dbl> <hvn_lb>    <dbl>     <dbl> <dbl> <dbl>
 1  1984   901       NA      NA       NA        2     1951         0    NA    33
 2  1984  1001       NA      NA       NA        2     1913         0    NA    71
 3  1984  1101       NA      NA       NA        2     1906         0    NA    78
 4  1984  1201       NA      NA       NA        1     1911         0    NA    73
 5  1984  1202       NA      NA       NA        2     1913         0    NA    71
 6  1984  1301       NA      NA       NA        2     1943         0    NA    41
 7  1984  1302       NA      NA       NA        1     1965         0    NA    19
 8  1984  1901       NA      NA       NA        2     1948         0    NA    36
 9  1984  2001       NA      NA       NA        2     1949         0    NA    35
10  1984  2002       NA      NA       NA        1     1952         0    NA    32
# ℹ 480,788 more rows

Pie Charts

  • You may be wondering if you should ever use a pie chart
  • The answer is, of course, it depends
  • Pie charts are great when:
    • What you want to visualize is simple (e.g., basic fractions)
    • You want to clearly emphasize proportion relative to the whole
    • You have a small data set

Pie Charts

  • In our data, we have a few variables that follow this, but we’ll focus on one for pie charts:
    • marital status (4 groups)
  • ggplot2 doesn’t specifically support pie charts
  • Why? Because it’s a layered grammar of graphics and an explicit function for it would be redundant with some of the built in coordinates
    • specifically, coord_polar()
  • So to make a pie chart, we’ll use geom_bar() + coord_polar()

Pie Charts

Basic Syntax

gsoep %>%
  filter(year == 2009 & !is.na(marital)) %>% # random
  group_by(marital) %>%
  tally() %>%
  mutate(marital = factor(
    marital
    , 1:4
    , c("Married", "Separated", "Widowed", "Never Married")
    )) %>%
  ggplot(aes(x = "", y = n, fill = marital)) + 
    geom_bar(stat = "identity", width = 1, color = "white") + 
    coord_polar("y", start = 0) + 
    theme_void()

Pie Charts

Improvements: Directly label

gsoep %>%
  filter(year == 2009 & !is.na(marital)) %>% # random
  group_by(marital) %>%
  tally() %>%
  mutate(marital = factor(
    marital
    , 1:4
    , c("Married", "Separated", "Widowed", "Never Married")
    )) %>%
  arrange(desc(marital)) %>%
  mutate(prop = n / sum(n) * 100
         , ypos = cumsum(prop)- 0.5*prop) %>%
  ggplot(aes(x = "", y = prop, fill = marital)) + 
    geom_bar(stat = "identity", width = 1, color = "white") + 
    geom_text(
      aes(y = ypos, label = marital)
      , color = "white"
      , size=4
      ) +
    coord_polar("y", start = 0) + 
    theme_void()

Pie Charts

Improvements: Your Turn

  1. Add a color-blind friendly palette (hint: use scale_fill_manual the the palette we created earlier)
  2. Remove the legend.
  3. Add the percentages in each category.

Pie Charts

Improvements: Your Turn

gsoep %>%
  filter(year == 2009 & !is.na(marital)) %>% # random
  group_by(marital) %>%
  tally() %>%
  mutate(marital = factor(
    marital
    , 1:4
    , c("Married", "Separated", "Widowed", "Never Married")
    )) %>%
  arrange(desc(marital)) %>%
  mutate(prop = n / sum(n) * 100
         , ypos = cumsum(prop)- 0.5*prop) %>%
  ggplot(aes(x = "", y = prop, fill = marital)) + 
    geom_bar(stat = "identity", width = 1, color = "white") + 
    geom_text(
      aes(y = ypos, label = sprintf("%s\n%.1f%%", marital, prop))
      , color = "white"
      , size=4
      ) +
    scale_fill_manual(values = cbsafe_pal$hex[c(2, 8, 3, 4)]) + 
    coord_polar("y", start = 0) + 
    theme_void() + 
    theme(legend.position = "none")

Pie Charts

More Improvements

  • Remember that we want to tell a story with our data. Thus, with pie charts, it often makes sense to use a color palette that varies in saturation or lightness (v. hue) with the exception of the focal category. Let’s tell a story with our plot:
  1. Add a title (hint: use the labs() function) to the plot that makes it clear what you want the reader to see.
  2. Change the color palette so that the focal category is a different color than the other categories (which should be the same hue but at different levels of saturation or lightness).
  3. Make your labels easier to see by swapping out geom_text() for geom_label().

Pie Charts

More Improvements

gsoep %>%
  filter(year == 2009 & !is.na(marital)) %>% # random
  group_by(marital) %>%
  tally() %>%
  mutate(marital = factor(marital, 1:4, c("Married", "Separated", "Widowed", "Never Married"))) %>%
  arrange(desc(marital)) %>%
  mutate(prop = n / sum(n) * 100
         , ypos = cumsum(prop)- 0.5*prop) %>%
  ggplot(aes(x = "", y = prop, fill = marital)) + 
    geom_bar(stat = "identity", width = 1, color = "black") + 
    geom_label(
      aes(y = ypos, label = sprintf("%s\n%.1f%%", marital, prop))
      , color = "white"
      , size = 6
      , fontface = 2) +
    scale_fill_manual(values = c(rev(brewer.pal(9,"Greens")[c(4,6,8)]), "grey60")) + 
    coord_polar("y", start = 0) + 
    labs(
      title = "In 2009, the majority of GSOEP participants\nwere or had been married/partnered"
    ) + 
    theme_void() + 
    theme(
      legend.position = "none"
      , plot.title = element_text(face = "bold.italic", size = rel(1.4), hjust = .5)
      )

Stacked Bar Charts

  • Like pie charts, stacked bar charts have their time and place
  • In particular:
    • Show proportions relative to the total
    • Can be used to show changes over time
  • To demonstrate, let’s look at marital status across emerging adulthood (18-26) where we expect changes in the proportion of martial status across time.

Stacked Bar Charts

Basic Syntax

gsoep %>%
  filter(age %in% 18:26 & !is.na(marital)) %>%
  group_by(age, marital) %>%
  tally() %>%
  group_by(age) %>%
  mutate(
    marital = factor(
      marital
      , 1:4
      , c("Married", "Separated", "Widowed", "Never Married")
      )
    , age = factor(age)
    , prop = n/sum(n)
    ) %>%
  ggplot(aes(x = age, y = prop, fill = marital)) + 
    geom_col(color = "black") + # geom_bar(stat = "identity", color = "black")
    theme_classic()

Stacked Bar Charts

Improvements: Color, Labels, and Scales Exercise

Let’s improve how this looks. Let’s start by adjusting the color and labels, like last time. Do the following:

  1. Change the color palette to tell the story you see in the data.
  2. Use scale_y_continuous() to improve the labels on the y-axis (i.e. change them to intuitive percentages)
  3. Use labs() to add/improve the title, subtitle, x, and y labels on the plot
  4. Remove the title for fill (hint: use the labs() function) and move the legend to the bottom

Stacked Bar Charts

Improvements: Color, Labels, and Scales Exercise

gsoep %>%
  filter(age %in% 18:26 & !is.na(marital)) %>%
  group_by(age, marital) %>%
  tally() %>%
  group_by(age) %>%
  mutate(
    marital = factor(
      marital
      , seq(4,1,-1)
      , rev(c("Married", "Separated", "Widowed", "Never Married"))
      )
    , age = factor(age)
    , prop = n/sum(n)
    ) %>%
  ggplot(aes(x = age, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black") + 
    scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + 
    scale_y_continuous(
      limits = c(0,1)
      , breaks = seq(0, 1, .25)
      , labels = c("0%", "25%", "50%", "75%", "100%")
      ) + 
    labs(
      x = "Age"
      , y = "Percent of Sample"
      , title = "Rates of relationships increase in emerging adulthood"
      , subtitle = "But most remain unpartnered by 26"
      , fill = NULL
      ) +
    theme_classic() + 
    theme(legend.position = "bottom")

Stacked Bar Charts

Improvements: Directly label

gsoep %>%
  filter(age %in% 18:26 & !is.na(marital)) %>%
  group_by(age, marital) %>%
  tally() %>%
  group_by(age) %>%
  mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married")))
         , age = factor(age)
         , prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black") + 
    scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + 
    scale_y_continuous(
      limits = c(0,1)
      , breaks = seq(0, 1, .25)
      , labels = c("0%", "25%", "50%", "75%", "100%")
      ) + 
    annotate("text", x = "26", y = .60, label = "Never Married", angle = 90) + 
    annotate("text", x = "26", y = .13, label = "Married", angle = 90, color = "white") + 
    labs(
      x = "Age"
      , y = "Percent of Sample"
      , title = "Rates of relationships increase in emerging adulthood"
      , subtitle = "But most remain unpartnered by 26"
      , fill = NULL
      ) +
    theme_classic() + 
    theme(legend.position = "bottom")

Stacked Bar Charts

Improvements: Theme Elements Exercise

  1. Bold axis text and increase size
  2. Bold axis titles and increase size
  3. Bold title and subtitle and center (hint, you may also need to wrap the title text using \n)

Stacked Bar Charts

Improvements: Theme Elements Exercise

gsoep %>%
  filter(age %in% 18:26 & !is.na(marital)) %>%
  group_by(age, marital) %>%
  tally() %>%
  group_by(age) %>%
  mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married")))
         , age = factor(age)
         , prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black") + 
    scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + 
    scale_y_continuous(
      limits = c(0,1)
      , breaks = seq(0, 1, .25)
      , labels = c("0%", "25%", "50%", "75%", "100%")
      ) + 
    annotate("text", x = "26", y = .60, label = "Never Married", angle = 90) + 
    annotate("text", x = "26", y = .13, label = "Married", angle = 90, color = "white") + 
    labs(
      x = "Age"
      , y = "Percent of Sample"
      , title = "Rates of relationships increase in\nemerging adulthood"
      , subtitle = "But most remain unpartnered by 26"
      , fill = NULL
      ) +
    theme_classic() + 
    theme(
      legend.position = "bottom"
      , axis.text = element_text(face = "bold", size = rel(1.1))
      , axis.title = element_text(face = "bold", size = rel(1.1))
      , plot.title = element_text(face = "bold", size = rel(1.2), hjust = .5)
      , plot.subtitle = element_text(face = "italic", size = rel(1.1), hjust = .5)
      )

Side-by-Side Bar Charts

  • Stacked bar charts are great for showing sequences but can make it difficult to compare within a stack
  • Side-by-side bar charts make it much easier to compare across categories and work well when broken into many categories
  • But they can be difficult to understand across sequences
  • To demonstrate, let’s look at marriage rates across three waves

Side-by-Side Bar Charts

Basic Syntax

gsoep %>%
  filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
  group_by(year, marital) %>%
  tally() %>%
  mutate(marital = factor(marital, 1:4, c("Married", "Separated", "Widowed", "Never Married"))) %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = year, y = prop, fill = marital)) + 
    geom_col(position = "dodge", color = "black") + # geom_bar(stat = "identity", color = "black", position =
    theme_classic()

Side-by-Side Bar Charts

Improvements: Order

gsoep %>%
  filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
  group_by(year, marital) %>%
  tally() %>%
  mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = year, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black", position = "dodge") + 
    theme_classic()

Side-by-Side Bar Charts

Improvements: Axis Labels and Titles

We could label the bars, but let’s label the axes instead. To do so:

  1. Use theme(axis.text.x = element_text()) to adjust the angle of the x-axis labels (hint: use angle and hjust).
  2. Adjust the y-axis scale using scale_y_continuous() to interpretable percentages.
  3. Using labs():
  • Remove the x-axis title
  • Change the y-axis title to a complete phrase
  • Add an informative title

Side-by-Side Bar Charts

Improvements: Axis Labels and Titles

gsoep %>%
  filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
  group_by(year, marital) %>%
  tally() %>%
  mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = marital, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black", position = "dodge") + 
    scale_y_continuous(
      limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
    ) +
    facet_grid(~year) + 
    labs(
      x = NULL
      , y = "Percentage of Participants"
      , title = "Marital Status Has Remained Consistent Throughout the 21st Century"
      ) + 
    theme_classic() + 
    theme(
      legend.position = "none"
      , axis.text.x = element_text(angle = 45, hjust = 1)
      ) 

Side-by-Side Bar Charts

Improvements: Theme Elements Exercise

As with the last plot type, let’s use theme() to improve the appearance of our plot. Do the following:

  1. Bold axis text and increase size
  2. Bold axis titles and increase size
  3. Bold title and center (hint, you may also need to wrap the title text using \n)
  4. Change the background color of the facet labels (hint: use theme(strip.background = element_rect(fill = [your argument])). Note that if you use a dark color, you will also have to change the text color.)

Side-by-Side Bar Charts

Improvements: Theme Elements Exercise

gsoep %>%
  filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
  group_by(year, marital) %>%
  tally() %>%
  mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = marital, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black", position = "dodge") + 
    scale_y_continuous(
      limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
    ) +
    facet_grid(~year) + 
    labs(
      x = NULL
      , y = "Percentage of Participants"
      , title = "Marital Status Has Remained Consistent\nThroughout the 21st Century"
      ) + 
    theme_classic() + 
    theme(
      legend.position = "none"
      , axis.text = element_text(face = "bold", size = rel(1.2))
      , axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1))
      , axis.title = element_text(face = "bold", size = rel(1.2))
      , strip.background = element_rect(fill = "grey90", color = "black")
      , strip.text = element_text(face = "bold", size = rel(1.2))
      , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
      ) 

Side-by-Side Bar Charts

Improvements: Colors

Exercise:

  • Improve the colors by making them:
    • Colorblind-friendly
    • Match the goal of the plot (see title)

Side-by-Side Bar Charts

Improvements: Colors

gsoep %>%
  filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
  group_by(year, marital) %>%
  tally() %>%
  mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = marital, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black", position = "dodge") + 
    scale_y_continuous(
      limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
    ) +
    scale_fill_manual(values = cbsafe_pal$hex[2:5]) +
    facet_grid(~year) + 
    labs(
      x = NULL
      , y = "Percentage of Participants"
      , title = "Marital Status Has Remained Consistent\nThroughout the 21st Century"
      ) + 
    theme_classic() + 
    theme(
      legend.position = "none"
      , axis.text = element_text(face = "bold", size = rel(1.2))
      , axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1))
      , axis.title = element_text(face = "bold", size = rel(1.2))
      , strip.background = element_rect(fill = "grey90", color = "black")
      , strip.text = element_text(face = "bold", size = rel(1.2))
      , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
      ) 

Bar Charts and Density Across Continuous Variables

  • One challenge with stacked bar charts is that when there are more than two categories, it can be very difficult to track the visualized trend
  • Relative to side-by-side bar charts, it’s easy to see any category relative to the total but somewhat more difficult to also account for differing numbers of people in different categories or across time
  • One possible solution to this is to look at densities across time and groups or relative to the total
  • Let’s do both now

Bar Charts and Density Across Continuous Variables

gsoep %>%
  filter(age %in% c(20, 30, 40, 50, 60, 70, 80) & !is.na(SRhealth)) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(
    SRhealth
    , seq(5,1,-1)
    , c("Very good", "Good", "Satisfactory", "Poor", "Bad")
    )) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_bar(stat = "identity", color = "black") + 
    scale_fill_manual(values = cbsafe_pal$hex[2:6]) +
    theme_classic()

Stacked Area Charts

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(
    SRhealth
    , seq(5,1,-1)
    , c("Very good", "Good", "Satisfactory", "Poor", "Bad")
    )) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_area() + 
    theme_classic()

Stacked Area Charts

Improvements: Color

Let’s improve the coloring of this figure. Since we have ordinal data, we want these to be a somewhat coherent gradient to communicate amount. But we don’t want to a full gradient because it’s ordinal!

  1. Apply the built-in viridis scale (hint: it’s a variant of scale_fill_viridis_[type]()).
  2. Add boundaries to the areas (hint: use white) and reduce the intensity of the colors (hint: use alpha)

Stacked Area Charts

Improvements: Color

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(
    SRhealth
    , seq(5,1,-1)
    , c("Very good", "Good", "Satisfactory", "Poor", "Bad")
    )) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_area(color = "white", alpha = .6) + 
    scale_fill_viridis_d() +
    theme_classic()

Stacked Area Charts

Improvements: Color Labels

Remember, legends tax working memory, and these plots are not the exception to the rule. Let’s add the labels to the plot directly using annotate(). (Note, we could also use geom_text() but the setup would be a pain and dubiously worth the hassle.)

  • The annotate() function is a manual powerhouse.
  • It let’s you add grobs to a plot that match any kind of geom (e.g., lines, dots, rectangles, text, labels, and more).

Basic syntax

annotate(
  geom,
  x = NULL,
  y = NULL,
  xmin = NULL,
  xmax = NULL,
  ymin = NULL,
  ymax = NULL,
  xend = NULL,
  yend = NULL,
  ...,
  na.rm = FALSE
)

Stacked Area Charts

Improvements: Color Labels

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(
    SRhealth
    , 1:5
    , rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad"))
    )) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_area(color = "white", alpha = .6) + 
    annotate("text", x = 85, y = .95, label = "Bad"         , color = "white", fontface = 2) + 
    annotate("text", x = 75, y = .80, label = "Poor"        , color = "white", fontface = 2) + 
    annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) + 
    annotate("text", x = 43, y = .3 , label = "Good"        , color = "black", fontface = 2) + 
    annotate("text", x = 30, y = .07, label = "Very Good"   , color = "black", fontface = 2) + 
    scale_fill_viridis_d() +
    theme_classic() + 
    theme(legend.position = "none")

Stacked Area Charts

Improvements: Labels, Titles, and Themes

Exercise:

  1. Add plot title
  2. Change x and y scale labels and titles
  3. Bold axis text and increase size
  4. Bold axis titles and increase size
  5. Bold title and center (hint, you may also need to wrap the title text using \n)

Stacked Area Charts

Improvements: Labels and Title

Note how much clearer this is than if I’d just plotted the mean of self-rated health across groups!!

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_area(color = "white", alpha = .6) + 
    annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) + 
    annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) + 
    annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) + 
    annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) + 
    annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) + 
    scale_x_continuous(limits = c(18, 100), breaks = seq(20, 100, 10)) + 
    scale_y_continuous(limits = c(0,1), breaks = seq(0,1, .25), labels = c("0%", "25%", "50%", "75%", "100%")) + 
    scale_fill_viridis_d() +
    labs(
      x = "Age (Years)"
      , y = "Percentage of Participants"
      , title = "Levels of Self-Rated Health Decrease Across the Lifespan"
    ) + 
    theme_classic() + 
    theme(legend.position = "none"
          , axis.text = element_text(face = "bold", size = rel(1.1))
          , axis.title = element_text(face = "bold", size = rel(1.1))
          , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
    )

Total Density Plots

  • Let’s revisit these data but also demonstrating how sample size changes across the lifespan
  • To do this, we need two pieces of information:
    • sample size in each self-rated health category at each age group
    • total in each age group

Total Density Plots

Let’s start by using stat_smooth() to get a smoothed geom_area() of the total sample size onto the figure

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(total_n = sum(n))  %>%
  ggplot(aes(x = age, y = n)) + 
    stat_smooth(
        aes(y = total_n)
        , geom = 'area'
        , method = 'loess'
        , span = 1/3
        , alpha = .8
        , fill = "grey"
        ) + 
    facet_grid(~SRhealth) + 
    theme_classic()

Total Density Plots

Now let’s add each of the ordinal values

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(total_n = sum(n))  %>%
  ggplot(aes(x = age, y = n)) + 
    stat_smooth(
      aes(y = total_n)
      , geom = 'area'
      , method = 'loess'
      , span = 1/3
      , alpha = .8
      , fill = "grey"
      ) + 
    stat_smooth(
        aes(fill = SRhealth)
        , geom = 'area'
        , method = 'loess'
        , span = 1/3
        , alpha = .8
        ) + 
    annotate("text", x = 45, y = 3000, label = "Total") + 
    facet_grid(~SRhealth) + 
    theme_classic() + 
    theme(legend.position = "none")

Total Density Plots

Let’s not belabor this too much.

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(total_n = sum(n))  %>%
  ggplot(aes(x = age, y = n)) + 
    stat_smooth(aes(y = total_n), geom = 'area', method = 'loess'
        , span = 1/3, alpha = .8, fill = "grey") + 
    stat_smooth(aes(fill = SRhealth), geom = 'area', method = 'loess'
        , span = 1/3, alpha = .8) + 
    scale_x_continuous(limits = c(18, 100), breaks = seq(20, 100, 20)) + 
    scale_fill_viridis_d() +
    annotate("text", x = 45, y = 3000, label = "Total") + 
    labs(
      x = "Age (Years)"
      , y = "Number of People"
      , title = "Good Self-Rated Health Decreases Across the Lifespan"
      , subtitle = "But bad decreases less, likely because all-cause sample drop-out"
      ) + 
    facet_grid(~SRhealth) + 
    theme_classic() + 
    theme(legend.position = "none"
          , axis.text = element_text(face = "bold", size = rel(1.1))
          , axis.title = element_text(face = "bold", size = rel(1.1))
          , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
          , plot.subtitle = element_text(face = "italic", size = rel(1), hjust = .5)
          , strip.background = element_rect(fill = "grey90", color = "black")
          , strip.text = element_text(face = "bold", size = rel(1.2))
          )

Total Density Plots

Perfectly fine but may not communicate what we want to show as well as other methods we’ve seen (at least in this instance).

Nested Proportions

  • Sometimes, the proportions that we want to visualize are more complex and can’t just be simply binned
  • In such cases, there may be hierarchical relationships among the categories
  • Today, we’ll cover two core nested proportion plots:
    • Mosaic plots
    • Parallel Sets
  • To do this, we’ll use two categorical variables: mortality and marital status

Nested Proportions

  • To do this, we’ll use 2-Digit NACE Industry Sector codes from participants’ last reported jobs in the SOEP, which I’ve broken down into 9 higher-order categories
  • This is a lot of categories, so we’ll further eventually exclude categories that don’t have at least 2% of the share of participants
2-Digit NACE Industry Sector Codes and Categories
Category Job Code
Agriculture Agriculture Hunting Rel.Serv.Activities 1
Forestry Logging Rel.Service activities 2
Fishing Fish Hatcheries Fish Farms 5
Energy and Utilities Mining Coal Lignite; Extraction Of Peat 10
Extraction Crude Petroleum Natural Gas 11
Mining Of Uranium And Thorium Ores 12
Mining Of Metal Ores 13
Other Mining And Quarrying 14
Recycling 37
Electricity Gas Steam Hot Water Supply 40
Sewage Refuse Disposal Sanitationa.a.Re 90
Finance and Tech Financ.Intermediat. Exc.Insur. Pens.Fund 65
Insurance Pens.Funding Ex.Compuls.SocSe 66
Activ.Aux.To Financial Intermediation 67
Computer And Related Activities 72
Research And Development 73
Other Business Activities 74
Industry - NEC 96
Manufacturing Manuf Food Products And Beverages 15
Manuf Tobacco Products 16
Manuf Textiles 17
Manuf Wear. Apparel; Dressing Dyeing Fur 18
Tanning Dress.Leather; luggage Footwear 19
Manuf Wood Products Except Furniture 20
Manuf Pulp Paper And Paper Products 21
Manuf Coke Ref.Petroleum Nuclear Fuel 23
Manuf Chemicals And Chemical Products 24
Manuf Rubber And Plastic Products 25
Manuf Other Non-metallic Mineral Product 26
Manuf Basic Metals 27
Manuf Fabric.Metal Prod. Ex.Machin. Equi 28
Manuf Machinery And Equipment NEC 29
Manuf Office Machinery And Computers 30
Manuf Electrical Machinery Apparatus NE 31
Manuf Radio Television Communic.Equipmen 32
Manuf Medical Precision Optical Instrum. 33
Manuf Motor Vehicles Trailers Semi-tr. 34
Manuf Other Transport Equipment 35
Manuf Furniture; Manufacturing NEC 36
Collection Purification Distrib.Of Water 41
Handcraft Trade - NEC 97
Manufacturing - NEC 100
Other Private Households With Employed Persons 95
Extra-territorial Organizations.a.Bodies 99
Public Service Publ.Administr. Defense; Compuls.SocSec 75
Education 80
Health And Social Work 85
Activit.of.Membership Organizations NEC. 91
Other Service Activities 93
Sales and Service Publishing Printing Recorded Media 22
Construction 45
Sale Maint Rep.Mot.Vehicles;Ret.Sale Fue 50
Wholesale Commission Trade Exc.Mot.Vehic 51
Retail Trade Exc.Mot.Vehic;Mot.Cyc Repai 52
Hotels And Restaurants 55
Post And Telecommunications 64
Real Estate Property Activities 70
Rent.Machinery Equip Wo.Oper. P. HH Good 71
Recreational Cultural Sporting Activity 92
Services - NEC 98
Transportation Land Transport; Transport Via Pipelines 60
Water Transport 61
Air Transport 62
Supporting Aux.Transp.Activ;Trav.Agencie 63

Mosaic Plots

  • Unlike bar charts, mosaic plots allow us to index relative areas, sizes, proportions, etc. relative to two dimensions (so not just amount)
  • So in our example, this will let us see relative differences within categories vertically and across categories horizontally
  • To build this, we will finally leave the basic ggplot2 package and use the ggmosaic package
  • There are other packages, but we’ll use this one because (1) it’s great and (2) it let’s us still use everything we’ve learned about ggplot

Mosaic Plots

Wrangle the Data

(Review this on your own time!)

if(!"ggmosaic" %in% installed.packages()) install.packages("ggmosaic")
library(ggmosaic)

gsoep_jobs <- gsoep %>%
  mutate(age_gr = mapvalues(age, 20:99, rep(seq(20, 90, 10), each = 10))) %>%
  filter(!is.na(age_gr) & age >= 20 & age < 100) %>%
  group_by(SID) %>%
  filter(!is.na(job)) %>%
  filter(age_gr == max(age_gr)) %>%
  group_by(SID, age_gr) %>%
  summarize(job = max(job)) %>%
  ungroup() %>%
  rename(code = job) %>%
  left_join(jobs %>% rename(code = old)) %>%
  group_by(code) %>%
  filter(n() / nrow(.) >= .02) %>%
  ungroup() 
gsoep_jobs
# A tibble: 4,326 × 5
      SID age_gr  code job                                      cat             
    <dbl>  <dbl> <dbl> <chr>                                    <chr>           
 1  12303     40    85 Health And Social Work                   Public Service  
 2  13603     60    80 Education                                Public Service  
 3  35905     60    85 Health And Social Work                   Public Service  
 4  71704     30    28 Manuf Fabric.Metal Prod. Ex.Machin. Equi Manufacturing   
 5  73605     80    29 Manuf Machinery And Equipment NEC        Manufacturing   
 6  79603     20    55 Hotels And Restaurants                   Sales and Servi…
 7  84805     70    45 Construction                             Sales and Servi…
 8  85307     30    74 Other Business Activities                Finance and Tech
 9  89405     70    74 Other Business Activities                Finance and Tech
10 112505     30    65 Financ.Intermediat. Exc.Insur. Pens.Fund Finance and Tech
# ℹ 4,316 more rows

Mosaic Plots

  • Let’s say, for example, that we think that functional limitations and more may age people out of some professions
  • We could look at this simply as a stacked bar chart, but it wouldn’t clarify that there are different proportions of people in each job category / age group
  • We’ve already had a bunch of practice today improving plot aesthetics and seen somewhat similar plots, so we’re going to skip that for this exercise
gsoep_jobs %>%
  ggplot() + 
    geom_mosaic(aes(x = product(age_gr), fill = cat)) + 
    theme_classic() + 
    theme(legend.position = "none")

Treemap

  • Mosaic plots are sort of just fancy stacked bar plots that let you also index by size
  • Treemaps are helpful when we have nested categorical (and sometimes, to a lesser degree continuous) variables
  • We’ll use the example of our jobs data, but this could be used for lots of other types of variables
    • Crossed conditions in an experiment
    • Intergenerational data
    • Average scores on variables within categories
    • Brain activation across broader and narrower brain regions
    • Political affiliation across states, demographic groups, and more

Treemap

Wrangle the Data

gsoep_tm <- gsoep %>%
  group_by(SID) %>%
  filter(!is.na(job)) %>%
  group_by(SID) %>%
  summarize(job = max(job)) %>%
  ungroup() %>%
  rename(code = job) %>%
  left_join(jobs %>% rename(code = old)) %>%
  group_by(code, cat, job) %>%
  tally()  %>%
  ungroup() %>%
  filter(n/sum(n) > .02) %>%
  mutate(job = str_wrap(job, 15))
gsoep_tm
# A tibble: 15 × 4
    code cat               job                                               n
   <dbl> <chr>             <chr>                                         <int>
 1     1 Agriculture       "Agriculture\nHunting\nRel.Serv.Activities"     272
 2    15 Manufacturing     "Manuf Food\nProducts And\nBeverages"           172
 3    17 Manufacturing     "Manuf Textiles"                                122
 4    28 Manufacturing     "Manuf\nFabric.Metal\nProd.\nEx.Machin. Equi"   188
 5    29 Manufacturing     "Manuf Machinery\nAnd Equipment\nNEC"           144
 6    31 Manufacturing     "Manuf\nElectrical\nMachinery\nApparatus NE"    123
 7    45 Sales and Service "Construction"                                  370
 8    52 Sales and Service "Retail Trade\nExc.Mot.Vehic;Mot.Cyc\nRepai"    718
 9    55 Sales and Service "Hotels And\nRestaurants"                       305
10    60 Transportation    "Land Transport;\nTransport Via\nPipelines"     162
11    65 Finance and Tech  "Financ.Intermediat.\nExc.Insur.\nPens.Fund"    123
12    74 Finance and Tech  "Other Business\nActivities"                    286
13    75 Public Service    "Publ.Administr.\nDefense;\nCompuls.SocSec"     421
14    80 Public Service    "Education"                                     339
15    85 Public Service    "Health And\nSocial Work"                       587

Treemap

Basic

if(!"treemapify" %in% installed.packages()) install.packages("treemapify")
library(treemapify)

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap(color = "white", size = 3) 

Treemap

Improvements: Remove Legend and Add Labels

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap(color = "white", size = 3) +
  geom_treemap_text(
    colour = "white"
    , place = "centre"
    , size = 15
    , grow = FALSE
    ) +
  theme(legend.position = "none")

Treemap

Improvements: Add Subgroup Text

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap(color = "white", size = 3) +
  geom_treemap_text(
    colour = c(rep("white", 11), rep("black",4))
    , place = "centre"
    , size = 15
    , grow = FALSE
    ) +
  geom_treemap_subgroup_text(
    place = "bottom"
    , grow = TRUE
    , alpha = 0.4
    , colour = "white"
    , fontface = "italic"
    ) +
  theme(legend.position = "none")

Treemap

Improvements: Color Palette

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap(color = "white", size = 3) +
  geom_treemap_text(
    colour = "white"
    , place = "centre"
    , size = 15
    , grow = FALSE
    ) +
  geom_treemap_subgroup_text(
    place = "bottom"
    , grow = TRUE
    , alpha = 0.4
    , colour = "white"
    , fontface = "italic"
    ) +
  scale_fill_viridis_d()  +
  theme(legend.position = "none")

Treemap

Improvements: Group and Subgroup Borders + Text Color

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap(color = "white", size = 3) +
  geom_treemap_text(
    colour = c(rep("white", 11), rep("black",4))
    , place = "centre"
    , size = 15
    , grow = FALSE
    ) +
  geom_treemap_subgroup_text(
    place = "bottom"
    , grow = TRUE
    , alpha = 0.4
    , colour = c(rep("white", 11), rep("black",4))
    , fontface = "italic"
    ) +
  geom_treemap_subgroup_border(
    colour = "white"
    , size = 5
    ) +
  scale_fill_viridis_d()  +
  theme(legend.position = "none")

Treemap

Improvements: Title

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap(color = "white", size = 3) +
  geom_treemap_text(
    colour = c(rep("white", 11), rep("black",4))
    , place = "centre"
    , size = 15
    , grow = FALSE
    ) +
  geom_treemap_subgroup_text(
    place = "bottom"
    , grow = TRUE
    , alpha = 0.4
    , colour = c(rep("white", 11), rep("black",4))
    , fontface = "italic"
    ) +
  geom_treemap_subgroup_border(
    colour = "white"
    , size = 5
    ) +
  scale_fill_viridis_d()  +
  labs(title = "White Collar Public Service, Sales, and\nFinance Jobs Far Outnumber Blue Collar Jobs") + 
  theme(legend.position = "none"
        , plot.title = element_text(face = "bold", hjust = .5))

Activity / Lab Time

Activity / Lab Time (If time)

  • For the rest of class, I want you to take some of your own data or the data provided in the workbook version of these slides to create some visualization of proportions.
  • Don’t overthink this, this is just your chance to apply what you’ve learned as we move into the “content area” portion of the course. * This is not a graded exercise but a chance to try to use your skills while also having the opportunity to ask questions of the instructors or your peers
  • You can work individually or in small groups, but I suggest at least trying to create the visualization independently, even if you work in small groups to come up with the idea and debug