Week 3 (Workbook) - Proportions
Quick Review
What are the core elements of ggplot2 grammar?
From last week:
-
Mappings: base layer
-
ggplot()
andaes()
-
-
Scales: control and modify your mappings
- e.g.,
scale_x_continuous()
andscale_fill_manual()
- e.g.,
-
Geoms: plot elements
- e.g.,
geom_point()
andgeom_line()
- e.g.,
-
Facets: panel your plot
-
facet_wrap()
andfacet_grid()
-
-
Themes: style your figure
- Built-in: e.g.,
theme_classic()
- Manual:
theme()
(legend
,strip
,axis
,plot
,panel
)
- Built-in: e.g.,
Colorblindness and accessible plots
- Adding in a colorblind-friendly palette from Wong (2011)
Code
cbsafe_pal <- tribble(
~name, ~rgb
, "black", c(0, 0, 0)
, "sky blue", c(86, 180, 233)
, "bluish green", c(0, 158, 115)
, "yellow", c(240, 228, 66)
, "orange", c(230, 159, 0)
, "blue", c(0, 114, 178)
, "vermillion", c(213, 94, 0)
, "reddish purple", c(204, 121, 167)
) %>%
mutate(hex = map_chr(rgb, function(x) rgb(x[1], x[2], x[3], maxColorValue = 255)))
cbsafe_pal
Visualizating Proportions
- Proportions are often important in our research
- From describing sample-level differences to describing the frequency of behaviors / events / experiences, etc., we often reach toward describing amounts relative to the whole
- But the goals we are trying to achieve are varied, which necesssitates the use of different graphics
Agenda
- We will cover X kinds of ways of visualizations, all of which were covered in your readings
- We will cover both when to use them and how to create them
- Pie Charts
- Bar Charts (Stacked)
- Bar Charts (Side-by-Side)
- Bar Charts and Density Across Continuous Variables
- Mosaic Plots
Parallel Sets
But First, Our Data
- Today, we’ll use the teaching sample from the German Socioeconomic Panel Study (GSOEP)
- GSOEP is an ongoing longitudinal panel study that began in 1984 (26 waves of data!)
- ~20,000 people are sampled each year
- Samples households in Germany
- Has additional sub-projects (e.g., innovation studies, migrant panel, etc.)
- The data are publicly available via application
Pie Charts
- You may be wondering if you should ever use a pie chart
- The answer is, of course, it depends
- Pie charts are great when:
- What you want to visualize is simple (e.g., basic fractions)
- You want to clearly emphasize proportion relative to the whole
- You have a small data set
- In our data, we have a few variables that follow this, but we’ll focus on one for pie charts:
- marital status (4 groups)
-
ggplot2
doesn’t specifically support pie charts - Why? Because it’s a layered grammar of graphics and an explicit function for it would be redundant with some of the built in coordinates
- specifically,
coord_polar()
- specifically,
- So to make a pie chart, we’ll use
geom_bar() + coord_polar()
Basic Syntax
Code
gsoep %>%
filter(year == 2009 & !is.na(marital)) %>% # random
group_by(marital) %>%
tally() %>%
mutate(marital = factor(
marital
, 1:4
, c("Married", "Separated", "Widowed", "Never Married")
)) %>%
ggplot(aes(x = "", y = n, fill = marital)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
theme_void()
Improvements: Directly label
Code
gsoep %>%
filter(year == 2009 & !is.na(marital)) %>% # random
group_by(marital) %>%
tally() %>%
mutate(marital = factor(
marital
, 1:4
, c("Married", "Separated", "Widowed", "Never Married")
)) %>%
arrange(desc(marital)) %>%
mutate(prop = n / sum(n) * 100
, ypos = cumsum(prop)- 0.5*prop) %>%
ggplot(aes(x = "", y = prop, fill = marital)) +
geom_bar(stat = "identity", width = 1, color = "white") +
geom_text(
aes(y = ypos, label = marital)
, color = "white"
, size=4
) +
coord_polar("y", start = 0) +
theme_void()
Improvements: Your Turn
- Add a color-blind friendly palette (hint: use scale_fill_manual the the palette we created earlier)
- Remove the legend.
- Add the percentages in each category.
Code
gsoep %>%
filter(year == 2009 & !is.na(marital)) %>% # random
group_by(marital) %>%
tally() %>%
mutate(marital = factor(
marital
, 1:4
, c("Married", "Separated", "Widowed", "Never Married")
)) %>%
arrange(desc(marital)) %>%
mutate(prop = n / sum(n) * 100
, ypos = cumsum(prop)- 0.5*prop) %>%
ggplot(aes(x = "", y = prop, fill = marital)) +
geom_bar(stat = "identity", width = 1, color = "white") +
geom_text(
aes(y = ypos, label = sprintf("%s\n%.1f%%", marital, prop))
, color = "white"
, size=4
) +
scale_fill_manual(values = cbsafe_pal$hex[c(2, 8, 3, 4)]) +
coord_polar("y", start = 0) +
theme_void() +
theme(legend.position = "none")
More Improvements
- Remember that we want to tell a story with our data. Thus, with pie charts, it often makes sense to use a color palette that varies in saturation or lightness (v. hue) with the exception of the focal category. Let’s tell a story with our plot:
- Add a title (hint: use the
labs()
function) to the plot that makes it clear what you want the reader to see. - Change the color palette so that the focal category is a different color than the other categories (which should be the same hue but at different levels of saturation or lightness).
- Make your labels easier to see by swapping out
geom_text()
forgeom_label()
.
Code
gsoep %>%
filter(year == 2009 & !is.na(marital)) %>% # random
group_by(marital) %>%
tally() %>%
mutate(marital = factor(marital, 1:4, c("Married", "Separated", "Widowed", "Never Married"))) %>%
arrange(desc(marital)) %>%
mutate(prop = n / sum(n) * 100
, ypos = cumsum(prop)- 0.5*prop) %>%
ggplot(aes(x = "", y = prop, fill = marital)) +
geom_bar(stat = "identity", width = 1, color = "black") +
geom_label(
aes(y = ypos, label = sprintf("%s\n%.1f%%", marital, prop))
, color = "white"
, size = 6
, fontface = 2) +
scale_fill_manual(values = c(rev(brewer.pal(9,"Greens")[c(4,6,8)]), "grey60")) +
coord_polar("y", start = 0) +
labs(
title = "In 2009, the majority of GSOEP participants\nwere or had been married/partnered"
) +
theme_void() +
theme(
legend.position = "none"
, plot.title = element_text(face = "bold.italic", size = rel(1.4), hjust = .5)
)
Stacked Bar Charts
- Like pie charts, stacked bar charts have their time and place
- In particular:
- Show proportions relative to the total
- Can be used to show changes over time
- To demonstrate, let’s look at marital status across emerging adulthood (18-26) where we expect changes in the proportion of martial status across time.
Basic Syntax
Code
gsoep %>%
filter(age %in% 18:26 & !is.na(marital)) %>%
group_by(age, marital) %>%
tally() %>%
group_by(age) %>%
mutate(
marital = factor(
marital
, 1:4
, c("Married", "Separated", "Widowed", "Never Married")
)
, age = factor(age)
, prop = n/sum(n)
) %>%
ggplot(aes(x = age, y = prop, fill = marital)) +
geom_col(color = "black") + # geom_bar(stat = "identity", color = "black")
theme_classic()
Improvements: Color, Labels, and Scales Exercise
Let’s improve how this looks. Let’s start by adjusting the color and labels, like last time. Do the following:
- Change the color palette to tell the story you see in the data.
- Use
scale_y_continuous()
to improve the labels on the y-axis (i.e. change them to intuitive percentages) - Use
labs()
to add/improve thetitle
,subtitle
,x
, andy
labels on the plot - Remove the title for fill (hint: use the
labs()
function) and move the legend to the bottom
Code
gsoep %>%
filter(age %in% 18:26 & !is.na(marital)) %>%
group_by(age, marital) %>%
tally() %>%
group_by(age) %>%
mutate(
marital = factor(
marital
, seq(4,1,-1)
, rev(c("Married", "Separated", "Widowed", "Never Married"))
)
, age = factor(age)
, prop = n/sum(n)
) %>%
ggplot(aes(x = age, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) +
scale_y_continuous(
limits = c(0,1)
, breaks = seq(0, 1, .25)
, labels = c("0%", "25%", "50%", "75%", "100%")
) +
labs(
x = "Age"
, y = "Percent of Sample"
, title = "Rates of relationships increase in emerging adulthood"
, subtitle = "But most remain unpartnered by 26"
, fill = NULL
) +
theme_classic() +
theme(legend.position = "bottom")
Improvements: Directly label
Code
gsoep %>%
filter(age %in% 18:26 & !is.na(marital)) %>%
group_by(age, marital) %>%
tally() %>%
group_by(age) %>%
mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married")))
, age = factor(age)
, prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) +
scale_y_continuous(
limits = c(0,1)
, breaks = seq(0, 1, .25)
, labels = c("0%", "25%", "50%", "75%", "100%")
) +
annotate("text", x = "26", y = .60, label = "Never Married", angle = 90) +
annotate("text", x = "26", y = .13, label = "Married", angle = 90, color = "white") +
labs(
x = "Age"
, y = "Percent of Sample"
, title = "Rates of relationships increase in emerging adulthood"
, subtitle = "But most remain unpartnered by 26"
, fill = NULL
) +
theme_classic() +
theme(legend.position = "bottom")
Improvements: Theme Elements Exercise
- Bold axis text and increase size
- Bold axis titles and increase size
- Bold title and subtitle and center (hint, you may also need to wrap the title text using
\n
)
Code
gsoep %>%
filter(age %in% 18:26 & !is.na(marital)) %>%
group_by(age, marital) %>%
tally() %>%
group_by(age) %>%
mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married")))
, age = factor(age)
, prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) +
scale_y_continuous(
limits = c(0,1)
, breaks = seq(0, 1, .25)
, labels = c("0%", "25%", "50%", "75%", "100%")
) +
annotate("text", x = "26", y = .60, label = "Never Married", angle = 90) +
annotate("text", x = "26", y = .13, label = "Married", angle = 90, color = "white") +
labs(
x = "Age"
, y = "Percent of Sample"
, title = "Rates of relationships increase in\nemerging adulthood"
, subtitle = "But most remain unpartnered by 26"
, fill = NULL
) +
theme_classic() +
theme(
legend.position = "bottom"
, axis.text = element_text(face = "bold", size = rel(1.1))
, axis.title = element_text(face = "bold", size = rel(1.1))
, plot.title = element_text(face = "bold", size = rel(1.2), hjust = .5)
, plot.subtitle = element_text(face = "italic", size = rel(1.1), hjust = .5)
)
Side-by-Side Bar Charts
- Stacked bar charts are great for showing sequences but can make it difficult to compare within a stack
- Side-by-side bar charts make it much easier to compare across categories and work well when broken into many categories
- But they can be difficult to understand across sequences
- To demonstrate, let’s look at marriage rates across three waves
Basic Syntax
Code
gsoep %>%
filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
group_by(year, marital) %>%
tally() %>%
mutate(marital = factor(marital, 1:4, c("Married", "Separated", "Widowed", "Never Married"))) %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = year, y = prop, fill = marital)) +
geom_col(position = "dodge", color = "black") + # geom_bar(stat = "identity", color = "black", position =
theme_classic()
Improvements: Order
Code
gsoep %>%
filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
group_by(year, marital) %>%
tally() %>%
mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = year, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
theme_classic()
Improvements: Axis Labels and Titles
We could label the bars, but let’s label the axes instead. To do so:
- Use
theme(axis.text.x = element_text()
) to adjust the angle of the x-axis labels (hint: useangle
andhjust
). - Adjust the y-axis scale using
scale_y_continuous()
to interpretable percentages. - Using
labs()
:
- Remove the x-axis title
- Change the y-axis title to a complete phrase
- Add an informative title
Code
gsoep %>%
filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
group_by(year, marital) %>%
tally() %>%
mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = marital, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
scale_y_continuous(
limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
) +
facet_grid(~year) +
labs(
x = NULL
, y = "Percentage of Participants"
, title = "Marital Status Has Remained Consistent Throughout the 21st Century"
) +
theme_classic() +
theme(
legend.position = "none"
, axis.text.x = element_text(angle = 45, hjust = 1)
)
Improvements: Theme Elements Exercise
As with the last plot type, let’s use theme() to improve the appearance of our plot. Do the following:
- Bold axis text and increase size
- Bold axis titles and increase size
- Bold title and center (hint, you may also need to wrap the title text using
\n
) - Change the background color of the facet labels (hint: use
theme(strip.background = element_rect(fill = [your argument]))
. Note that if you use a dark color, you will also have to change the text color.)
Code
gsoep %>%
filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
group_by(year, marital) %>%
tally() %>%
mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = marital, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
scale_y_continuous(
limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
) +
facet_grid(~year) +
labs(
x = NULL
, y = "Percentage of Participants"
, title = "Marital Status Has Remained Consistent\nThroughout the 21st Century"
) +
theme_classic() +
theme(
legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.2))
, axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1))
, axis.title = element_text(face = "bold", size = rel(1.2))
, strip.background = element_rect(fill = "grey90", color = "black")
, strip.text = element_text(face = "bold", size = rel(1.2))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
)
Improvements: Colors Exercise
- Improve the colors by making them:
- Colorblind-friendly
- Match the goal of the plot (see title)
Code
gsoep %>%
filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
group_by(year, marital) %>%
tally() %>%
mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = marital, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
scale_y_continuous(
limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
) +
scale_fill_manual(values = cbsafe_pal$hex[2:5]) +
facet_grid(~year) +
labs(
x = NULL
, y = "Percentage of Participants"
, title = "Marital Status Has Remained Consistent\nThroughout the 21st Century"
) +
theme_classic() +
theme(
legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.2))
, axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1))
, axis.title = element_text(face = "bold", size = rel(1.2))
, strip.background = element_rect(fill = "grey90", color = "black")
, strip.text = element_text(face = "bold", size = rel(1.2))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
)
Bar Charts and Density Across Continuous Variables
- One challenge with stacked bar charts is that when there are more than two categories, it can be very difficult to track the visualized trend
- Relative to side-by-side bar charts, it’s easy to see any category relative to the total but somewhat more difficult to also account for differing numbers of people in different categories or across time
- One possible solution to this is to look at densities across time and groups or relative to the total
- Let’s do both now
Code
gsoep %>%
filter(age %in% c(20, 30, 40, 50, 60, 70, 80) & !is.na(SRhealth)) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(
SRhealth
, seq(5,1,-1)
, c("Very good", "Good", "Satisfactory", "Poor", "Bad")
)) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = cbsafe_pal$hex[2:6]) +
theme_classic()
Stacked Area Charts
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(
SRhealth
, seq(5,1,-1)
, c("Very good", "Good", "Satisfactory", "Poor", "Bad")
)) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_area() +
theme_classic()
Improvements: Color
Let’s improve the coloring of this figure. Since we have ordinal data, we want these to be a somewhat coherent gradient to communicate amount. But we don’t want to a full gradient because it’s ordinal!
- Apply the built-in viridis scale (hint: it’s a variant of
scale_fill_viridis_[type]()
). - Add boundaries to the areas (hint: use white) and reduce the intensity of the colors (hint: use alpha)
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(
SRhealth
, seq(5,1,-1)
, c("Very good", "Good", "Satisfactory", "Poor", "Bad")
)) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_area(color = "white", alpha = .6) +
scale_fill_viridis_d() +
theme_classic()
Improvements: Color Labels
Remember, legends tax working memory, and these plots are not the exception to the rule. Let’s add the labels to the plot directly using annotate(). (Note, we could also use geom_text() but the setup would be a pain and dubiously worth the hassle.)
- The
annotate()
function is a manual powerhouse. - It let’s you add
grobs
to a plot that match any kind ofgeom
(e.g., lines, dots, rectangles, text, labels, and more).
Basic syntax
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(
SRhealth
, 1:5
, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad"))
)) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_area(color = "white", alpha = .6) +
annotate("text", x = 85, y = .95, label = "Bad" , color = "white", fontface = 2) +
annotate("text", x = 75, y = .80, label = "Poor" , color = "white", fontface = 2) +
annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) +
annotate("text", x = 43, y = .3 , label = "Good" , color = "black", fontface = 2) +
annotate("text", x = 30, y = .07, label = "Very Good" , color = "black", fontface = 2) +
scale_fill_viridis_d() +
theme_classic() +
theme(legend.position = "none")
Improvements: Labels, Titles, and Themes
Exercise:
- Add plot title
- Change
x
andy
scale labels and titles - Bold axis text and increase size
- Bold axis titles and increase size
- Bold title and center (hint, you may also need to wrap the title text using
\n
)
Note how much clearer this is than if I’d just plotted the mean of self-rated health across groups!!
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_area(color = "white", alpha = .6) +
annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) +
annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) +
annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) +
annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) +
annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) +
scale_x_continuous(limits = c(18, 100), breaks = seq(20, 100, 10)) +
scale_y_continuous(limits = c(0,1), breaks = seq(0,1, .25), labels = c("0%", "25%", "50%", "75%", "100%")) +
scale_fill_viridis_d() +
labs(
x = "Age (Years)"
, y = "Percentage of Participants"
, title = "Levels of Self-Rated Health Decrease Across the Lifespan"
) +
theme_classic() +
theme(legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.1))
, axis.title = element_text(face = "bold", size = rel(1.1))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
)
Total Density Plots
- Let’s revisit these data but also demonstrating how sample size changes across the lifespan
- To do this, we need two pieces of information:
- sample size in each self-rated health category at each age group
- total in each age group
Let’s start by using stat_smooth()
to get a smoothed geom_area()
of the total sample size onto the figure
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(total_n = sum(n)) %>%
ggplot(aes(x = age, y = n)) +
stat_smooth(
aes(y = total_n)
, geom = 'area'
, method = 'loess'
, span = 1/3
, alpha = .8
, fill = "grey"
) +
facet_grid(~SRhealth) +
theme_classic()
Now let’s add each of the ordinal values
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(total_n = sum(n)) %>%
ggplot(aes(x = age, y = n)) +
stat_smooth(
aes(y = total_n)
, geom = 'area'
, method = 'loess'
, span = 1/3
, alpha = .8
, fill = "grey"
) +
stat_smooth(
aes(fill = SRhealth)
, geom = 'area'
, method = 'loess'
, span = 1/3
, alpha = .8
) +
annotate("text", x = 45, y = 3000, label = "Total") +
facet_grid(~SRhealth) +
theme_classic() +
theme(legend.position = "none")
Let’s not belabor this too much.
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(total_n = sum(n)) %>%
ggplot(aes(x = age, y = n)) +
stat_smooth(aes(y = total_n), geom = 'area', method = 'loess'
, span = 1/3, alpha = .8, fill = "grey") +
stat_smooth(aes(fill = SRhealth), geom = 'area', method = 'loess'
, span = 1/3, alpha = .8) +
scale_x_continuous(limits = c(18, 100), breaks = seq(20, 100, 20)) +
scale_fill_viridis_d() +
annotate("text", x = 45, y = 3000, label = "Total") +
labs(
x = "Age (Years)"
, y = "Number of People"
, title = "Good Self-Rated Health Decreases Across the Lifespan"
, subtitle = "But bad decreases less, likely because all-cause sample drop-out"
) +
facet_grid(~SRhealth) +
theme_classic() +
theme(legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.1))
, axis.title = element_text(face = "bold", size = rel(1.1))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
, plot.subtitle = element_text(face = "italic", size = rel(1), hjust = .5)
, strip.background = element_rect(fill = "grey90", color = "black")
, strip.text = element_text(face = "bold", size = rel(1.2))
)
Perfectly fine but may not communicate what we want to show as well as other methods we’ve seen (at least in this instance).
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(total_n = sum(n)) %>%
ggplot(aes(x = age, y = n)) +
stat_smooth(aes(y = total_n), geom = 'area', method = 'loess'
, span = 1/3, alpha = .8, fill = "grey") +
stat_smooth(aes(fill = SRhealth), geom = 'area', method = 'loess'
, span = 1/3, alpha = .8) +
scale_x_continuous(limits = c(15, 105), breaks = seq(20, 100, 20)) +
scale_fill_viridis_d() +
annotate("text", x = 45, y = 3000, label = "Total") +
labs(
x = "Age (Years)"
, y = "Number of People"
, title = "Good Self-Rated Health Decreases Across the Lifespan"
, subtitle = "But bad decreases less, likely because all-cause sample drop-out"
) +
facet_grid(~SRhealth) +
theme_classic() +
theme(legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.1))
, axis.title = element_text(face = "bold", size = rel(1.1))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
, plot.subtitle = element_text(face = "italic", size = rel(1), hjust = .5)
, strip.background = element_rect(fill = "grey90", color = "black")
, strip.text = element_text(face = "bold", size = rel(1.2))
)
Nested Proportions
- Sometimes, the proportions that we want to visualize are more complex and can’t just be simply binned
- In such cases, there may be hierarchical relationships among the categories
- Today, we’ll cover two core nested proportion plots:
- Mosaic plots
- Parallel Sets
- To do this, we’ll use two categorical variables: mortality and marital status
Example Data
- To do this, we’ll use 2-Digit NACE Industry Sector codes from participants’ last reported jobs in the SOEP, which I’ve broken down into 9 higher-order categories
- This is a lot of categories, so we’ll further eventually exclude categories that don’t have at least 2% of the share of participants
Category | Job | Code |
---|---|---|
Agriculture | Agriculture Hunting Rel.Serv.Activities | 1 |
Forestry Logging Rel.Service activities | 2 | |
Fishing Fish Hatcheries Fish Farms | 5 | |
Energy and Utilities | Mining Coal Lignite; Extraction Of Peat | 10 |
Extraction Crude Petroleum Natural Gas | 11 | |
Mining Of Uranium And Thorium Ores | 12 | |
Mining Of Metal Ores | 13 | |
Other Mining And Quarrying | 14 | |
Recycling | 37 | |
Electricity Gas Steam Hot Water Supply | 40 | |
Sewage Refuse Disposal Sanitationa.a.Re | 90 | |
Finance and Tech | Financ.Intermediat. Exc.Insur. Pens.Fund | 65 |
Insurance Pens.Funding Ex.Compuls.SocSe | 66 | |
Activ.Aux.To Financial Intermediation | 67 | |
Computer And Related Activities | 72 | |
Research And Development | 73 | |
Other Business Activities | 74 | |
Industry - NEC | 96 | |
Manufacturing | Manuf Food Products And Beverages | 15 |
Manuf Tobacco Products | 16 | |
Manuf Textiles | 17 | |
Manuf Wear. Apparel; Dressing Dyeing Fur | 18 | |
Tanning Dress.Leather; luggage Footwear | 19 | |
Manuf Wood Products Except Furniture | 20 | |
Manuf Pulp Paper And Paper Products | 21 | |
Manuf Coke Ref.Petroleum Nuclear Fuel | 23 | |
Manuf Chemicals And Chemical Products | 24 | |
Manuf Rubber And Plastic Products | 25 | |
Manuf Other Non-metallic Mineral Product | 26 | |
Manuf Basic Metals | 27 | |
Manuf Fabric.Metal Prod. Ex.Machin. Equi | 28 | |
Manuf Machinery And Equipment NEC | 29 | |
Manuf Office Machinery And Computers | 30 | |
Manuf Electrical Machinery Apparatus NE | 31 | |
Manuf Radio Television Communic.Equipmen | 32 | |
Manuf Medical Precision Optical Instrum. | 33 | |
Manuf Motor Vehicles Trailers Semi-tr. | 34 | |
Manuf Other Transport Equipment | 35 | |
Manuf Furniture; Manufacturing NEC | 36 | |
Collection Purification Distrib.Of Water | 41 | |
Handcraft Trade - NEC | 97 | |
Manufacturing - NEC | 100 | |
Other | Private Households With Employed Persons | 95 |
Extra-territorial Organizations.a.Bodies | 99 | |
Public Service | Publ.Administr. Defense; Compuls.SocSec | 75 |
Education | 80 | |
Health And Social Work | 85 | |
Activit.of.Membership Organizations NEC. | 91 | |
Other Service Activities | 93 | |
Sales and Service | Publishing Printing Recorded Media | 22 |
Construction | 45 | |
Sale Maint Rep.Mot.Vehicles;Ret.Sale Fue | 50 | |
Wholesale Commission Trade Exc.Mot.Vehic | 51 | |
Retail Trade Exc.Mot.Vehic;Mot.Cyc Repai | 52 | |
Hotels And Restaurants | 55 | |
Post And Telecommunications | 64 | |
Real Estate Property Activities | 70 | |
Rent.Machinery Equip Wo.Oper. P. HH Good | 71 | |
Recreational Cultural Sporting Activity | 92 | |
Services - NEC | 98 | |
Transportation | Land Transport; Transport Via Pipelines | 60 |
Water Transport | 61 | |
Air Transport | 62 | |
Supporting Aux.Transp.Activ;Trav.Agencie | 63 |
Mosaic Plots
- Unlike bar charts, mosaic plots allow us to index relative areas, sizes, proportions, etc. relative to two dimensions (so not just amount)
- So in our example, this will let us see relative differences within categories vertically and across categories horizontally
- To build this, we will finally leave the basic
ggplot2
package and use theggmosaic
package - There are other packages, but we’ll use this one because (1) it’s great and (2) it let’s us still use everything we’ve learned about ggplot
Wrangle the Data
Code
if(!"ggmosaic" %in% installed.packages()) install.packages("ggmosaic")
library(ggmosaic)
gsoep_jobs <- gsoep %>%
mutate(age_gr = mapvalues(age, 20:99, rep(seq(20, 90, 10), each = 10))) %>%
filter(!is.na(age_gr) & age >= 20 & age < 100) %>%
group_by(SID) %>%
filter(!is.na(job)) %>%
filter(age_gr == max(age_gr)) %>%
group_by(SID, age_gr) %>%
summarize(job = max(job)) %>%
ungroup() %>%
rename(code = job) %>%
left_join(jobs %>% rename(code = old)) %>%
group_by(code) %>%
filter(n() / nrow(.) >= .02) %>%
ungroup()
gsoep_jobs
- Let’s say, for example, that we think that functional limitations and more may age people out of some professions
- We could look at this simply as a stacked bar chart, but it wouldn’t clarify that there are different proportions of people in each job category / age group
- We’ve already had a bunch of practice today improving plot aesthetics and seen somewhat similar plots, so we’re going to skip that for this exercise
Treemap
- Mosaic plots are sort of just fancy stacked bar plots that let you also index by size
- Treemaps are helpful when we have nested categorical (and sometimes, to a lesser degree continuous) variables
- We’ll use the example of our jobs data, but this could be used for lots of other types of variables
- Crossed conditions in an experiment
- Intergenerational data
- Average scores on variables within categories
- Brain activation across broader and narrower brain regions
- Political affiliation across states, demographic groups, and more
Wrangle the Data
Code
gsoep_tm <- gsoep %>%
group_by(SID) %>%
filter(!is.na(job)) %>%
group_by(SID) %>%
summarize(job = max(job)) %>%
ungroup() %>%
rename(code = job) %>%
left_join(jobs %>% rename(code = old)) %>%
group_by(code, cat, job) %>%
tally() %>%
ungroup() %>%
filter(n/sum(n) > .02) %>%
mutate(job = str_wrap(job, 15))
gsoep_tm
Basic
Improvements: Remove Legend and Add Labels
Improvements: Add Subgroup Text
Code
gsoep_tm %>%
arrange(cat, code) %>%
ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
geom_treemap(color = "white", size = 3) +
geom_treemap_text(
colour = c(rep("white", 11), rep("black",4))
, place = "centre"
, size = 15
, grow = FALSE
) +
geom_treemap_subgroup_text(
place = "bottom"
, grow = TRUE
, alpha = 0.4
, colour = "white"
, fontface = "italic"
) +
theme(legend.position = "none")
Improvements: Color Palette
Code
gsoep_tm %>%
arrange(cat, code) %>%
ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
geom_treemap(color = "white", size = 3) +
geom_treemap_text(
colour = "white"
, place = "centre"
, size = 15
, grow = FALSE
) +
geom_treemap_subgroup_text(
place = "bottom"
, grow = TRUE
, alpha = 0.4
, colour = "white"
, fontface = "italic"
) +
scale_fill_viridis_d() +
theme(legend.position = "none")
Improvements: Group and Subgroup Borders + Text Color
Code
gsoep_tm %>%
arrange(cat, code) %>%
ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
geom_treemap(color = "white", size = 3) +
geom_treemap_text(
colour = c(rep("white", 11), rep("black",4))
, place = "centre"
, size = 15
, grow = FALSE
) +
geom_treemap_subgroup_text(
place = "bottom"
, grow = TRUE
, alpha = 0.4
, colour = c(rep("white", 11), rep("black",4))
, fontface = "italic"
) +
geom_treemap_subgroup_border(
colour = "white"
, size = 5
) +
scale_fill_viridis_d() +
theme(legend.position = "none")
Improvements: Title
Code
gsoep_tm %>%
arrange(cat, code) %>%
ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
geom_treemap(color = "white", size = 3) +
geom_treemap_text(
colour = c(rep("white", 11), rep("black",4))
, place = "centre"
, size = 15
, grow = FALSE
) +
geom_treemap_subgroup_text(
place = "bottom"
, grow = TRUE
, alpha = 0.4
, colour = c(rep("white", 11), rep("black",4))
, fontface = "italic"
) +
geom_treemap_subgroup_border(
colour = "white"
, size = 5
) +
scale_fill_viridis_d() +
labs(title = "White Collar Public Service, Sales, and\nFinance Jobs Far Outnumber Blue Collar Jobs") +
theme(legend.position = "none"
, plot.title = element_text(face = "bold", hjust = .5))