Week 8: Odds, Ends, and Polishing Visualizations

Emorie D Beck

Polishing & Hacking Your Visualizations

Packages

library(RColorBrewer)
library(knitr)
library(kableExtra)
library(plyr)
library(broom)
library(modelr)
library(lme4)
library(broom.mixed)
library(tidyverse)
library(ggdist)
library(patchwork)
library(cowplot)
library(DiagrammeR)
library(wordcloud)
library(tidytext)
library(ggExtra)
library(distributional)
library(gganimate)

Custom Theme:

my_theme <- function(){
  theme_classic() + 
  theme(
    legend.position = "bottom"
    , legend.title = element_text(face = "bold", size = rel(1))
    , legend.text = element_text(face = "italic", size = rel(1))
    , axis.text = element_text(face = "bold", size = rel(1.1), color = "black")
    , axis.title = element_text(face = "bold", size = rel(1.2))
    , plot.title = element_text(face = "bold", size = rel(1.2), hjust = .5)
    , plot.subtitle = element_text(face = "italic", size = rel(1.2), hjust = .5)
    , strip.text = element_text(face = "bold", size = rel(1.1), color = "white")
    , strip.background = element_rect(fill = "black")
    )
}

Diagrams

Diagrams

  • In research, we often need to make diagrams at all points in our workflow, from
    • conceptualizing study flow
    • mapping measures
    • mapping verbal models
    • SEM models
    • and more

DiagrammeR

  • DiagrammeR is a unique interface because it brings together multiple ways of building diagrams in R and tries to unite them with consistent syntax
  • We could spend a whole course, not just part of one class, parsing through the DiagrammeR package, so I’m going to make a strong assumption based on my knowledge of your ongoing interests and research:
    • SEM plots
    • network visualizations
    • combinations of both

DiagrammeR: Graphviz

  • Let’s just jump in!
[strict] (graph | digraph) [ID] '{' stmt_list '}'
  1. strict determines whether duplicate edges between the same pair of nodes are allowed (a strict graph collapses them into one)
  2. We have to tell Graphviz whether we want a directed [digraph] or undirected [graph] graph.
  3. [ID] is what you want to name your graph object
  4. '{' stmt_list '}' is where you specify the nodes and edges of the graph (more on this next)

DiagrammeR: Graphviz

grViz("
digraph ex1 {

  # a 'graph' statement
  graph [overlap = true, fontsize = 10]

  # several 'node' statements
  node [shape = box,
        fontname = Helvetica]
  A; B; C; D; E; F
}"
)
  • digraph says we want the graph to be directed
  • graph lets us control elements of the graph in the []
    • overlap = true means nodes can overlap
  • node means we’re about to specify some nodes (and their properties in [])

DiagrammeR: Graphviz

We can control lots of properties of nodes (either as groups or individually):

  • color
  • fillcolor
  • fontcolor
  • alpha
  • shape
  • style (like linestyle)
  • sides
  • peripheries
  • fixedsize
  • height
  • width
  • distortion
  • penwidth
  • x
  • y
  • tooltip
  • fontname
  • fontsize
  • icon
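
We can also set attributes for an individual node by giving it its own [] after its name. A minimal sketch (not from the original slides):

grViz("
digraph ex {
  node [shape = box, fontname = Helvetica]
  A [style = filled, fillcolor = lightblue]
  B [shape = circle, penwidth = 2]
  A; B; C
}"
)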

DiagrammeR: Graphviz

But we also want to add edges

grViz("
digraph ex1 {

  # a 'graph' statement
  graph [overlap = true, fontsize = 10]

  # several 'node' statements
  node [shape = box,
        fontname = Helvetica]
  A; B; C; D; E; F
  
  # several 'edge' statements
  A->B B->C C->D D->E E->F
}"
)
  • -> indicates directed edges
  • -- indicates undirected edges
  • A->{B,C} is the same as A->B A->C

DiagrammeR: Graphviz

Edge properties can be defined like node properties:

  • arrowsize
  • arrowhead
  • arrowtail
  • dir
  • color
  • alpha
  • headport
  • tailport
  • fontname
  • fontsize
  • fontcolor
  • penwidth
  • minlen
  • tooltip
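
Edge attributes work the same way, set in [] after an edge statement. A minimal sketch:

grViz("
digraph ex {
  A; B; C
  A->B [color = red, penwidth = 2, arrowhead = vee]
  B->C [style = dashed, dir = both]
}"
)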

DiagrammeR: Graphviz

  • Let’s do the Big Five because why not?
grViz("
digraph b5 {

  # a 'graph' statement
  graph [overlap = true, fontsize = 10]

  # def latent Big Five
  node [shape = circle]
  E; A; C; N; O
  
  # def observed indicators
  node [shape = square]
  e1; e2; e3
  a1; a2; a3
  c1; c2; c3
  n1; n2; n3
  o1; o2; o3
  
  # several 'edge' statements
  E->{e1,e2,e3}
  A->{a1,a2,a3}
  C->{c1,c2,c3}
  N->{n1,n2,n3}
  O->{o1,o2,o3}
}"
)
  • But they aren’t orthogonal, so we need to let the factors correlate.

DiagrammeR: Graphviz

But they aren’t orthogonal, so we need to let the factors correlate.

grViz("
digraph b5 {

  # a 'graph' statement
  graph [overlap = true, fontsize = 10]

  # def latent Big Five
  node [shape = circle]
  E; A; C; N; O
  
  # def observed indicators
  node [shape = square]
  e1; e2; e3
  a1; a2; a3
  c1; c2; c3
  n1; n2; n3
  o1; o2; o3
  
  # several 'edge' statements
  E->{e1,e2,e3}
  A->{a1,a2,a3}
  C->{c1,c2,c3}
  N->{n1,n2,n3}
  O->{o1,o2,o3}
  
  E->{A,C,N,O} [dir = both]
  A->{C,N,O} [dir = both]
  C->{N,O} [dir = both]
  N->{O} [dir = both]
}"
)

DiagrammeR: Graphviz

But they aren’t orthogonal, so we need to let the factors correlate. Let’s change the layout to neato:

grViz("
digraph b5 {

  # a 'graph' statement
  graph [overlap = true, fontsize = 10, layout = neato]

  # def latent Big Five
  node [shape = circle]
  E; A; C; N; O
  
  # def observed indicators
  node [shape = square,
        fixedsize = true,
        width = 0.25]
  e1; e2; e3
  a1; a2; a3
  c1; c2; c3
  n1; n2; n3
  o1; o2; o3
  
  # several 'edge' statements
  E->{e1,e2,e3}
  A->{a1,a2,a3}
  C->{c1,c2,c3}
  N->{n1,n2,n3}
  O->{o1,o2,o3}
  
  E->{A,C,N,O} [dir = both]
  A->{C,N,O} [dir = both]
  C->{N,O} [dir = both]
  N->{O} [dir = both]
}"
)

DiagrammeR: Graphviz

  • That was all very lavaan, wasn’t it?
  • Well, sometimes we want to create diagrams using code or pipelines, which isn’t easy or intuitive using the syntax we’ve been using
  • So instead, we can create the same visualizations using create_graph() and accompanying functions
  • Unfortunately, we don’t have time for that today, but there’s a great tutorial online
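  • As a teaser, here’s a minimal sketch of that pipeline: build node and edge data frames, combine them into a graph object, and render it (see the DiagrammeR docs for the full set of arguments)

ndf <- create_node_df(n = 3, label = c("A", "B", "C"), shape = "box")
edf <- create_edge_df(from = c(1, 2), to = c(2, 3))
create_graph(nodes_df = ndf, edges_df = edf) %>%
  render_graph()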

Basic Text Visualization

Basic Text Visualization

  • In some ways, the hardest part of text visualization is getting the text into R.
  • Once text is in R, there are lots of great tools for tokenizing, basic sentiment analysis, and more
  • We’ll be relying on Tidy Text Analysis in R
  • Today, we’ll use some data from an ongoing project of mine that applies NLP to Letters from Jenny (Anonymous, 1942), which were published in the Journal of Abnormal and Social Psychology
  • The PDFs have been converted to a .txt file

Basic Text Visualization

text_df <- read.table("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/08-week8-polishing/01-data/part2_pymupdf.txt", sep = "\n") %>%
  setNames("text") %>%
  mutate(line = 1:n()) %>%
  as_tibble() %>%
  mutate(text = str_remove_all(text, "[0-9]"))
text_df
# A tibble: 1,521 × 2
   text                                             line
   <chr>                                           <int>
 1 "CASE REPORTS"                                      1
 2 "LETTERS FROM JENNY (continued)"                    2
 3 "ANONYMOUS"                                         3
 4 " (continued)"                                      4
 5 "N.Y.C. Sunday /"                                   5
 6 "My dearest Boy and Girl:"                          6
 7 "This is not a regular letter, but even if it"      7
 8 "were I could never begin to express my"            8
 9 "gratitude to you. I believe that when two"         9
10 "persons really love each other in the highest"    10
# ℹ 1,511 more rows

Basic Text Visualization

text_df$text[1:10]
 [1] "CASE REPORTS"                                 
 [2] "LETTERS FROM JENNY (continued)"               
 [3] "ANONYMOUS"                                    
 [4] " (continued)"                                 
 [5] "N.Y.C. Sunday /"                              
 [6] "My dearest Boy and Girl:"                     
 [7] "This is not a regular letter, but even if it" 
 [8] "were I could never begin to express my"       
 [9] "gratitude to you. I believe that when two"    
[10] "persons really love each other in the highest"

Basic Text Visualization

  • The first step with text data is to clean and tokenize it.
  • Cleaning basically means making sure that everything parsed correctly
  • Tokenizing means that we break the text down into tokens that we can then analyze
  • We tokenize for lots of reasons. It lets us:
    • Remove filler words
    • Group words in different forms and tenses
    • Get rid of punctuation, etc.
    • And more

A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens (Silge & Robinson, Tidy Text Mining in R)
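
A hedged aside: tokens don’t have to be single words. unnest_tokens() can also emit n-grams, which lets us count word pairs instead of single words:

text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)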

Basic Text Visualization: Tokens

tidy_text <- text_df %>%
  unnest_tokens(word, text)
tidy_text
# A tibble: 22,561 × 2
    line word     
   <int> <chr>    
 1     1 case     
 2     1 reports  
 3     2 letters  
 4     2 from     
 5     2 jenny    
 6     2 continued
 7     3 anonymous
 8     4 continued
 9     5 n.y.c    
10     5 sunday   
# ℹ 22,551 more rows

Basic Text Visualization: Tokens

data(stop_words)
stop_words
# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
# ℹ 1,139 more rows

Basic Text Visualization: Tokens

tidy_text <- tidy_text %>%
  anti_join(stop_words)
tidy_text
# A tibble: 7,119 × 2
    line word     
   <int> <chr>    
 1     1 reports  
 2     2 letters  
 3     2 jenny    
 4     2 continued
 5     3 anonymous
 6     4 continued
 7     5 n.y.c    
 8     5 sunday   
 9     6 dearest  
10     6 boy      
# ℹ 7,109 more rows

Basic Text Visualization: Tokens

tidy_text %>%
  count(word, sort = T)
# A tibble: 2,651 × 2
   word        n
   <chr>   <int>
 1 lady      121
 2 dearest    95
 3 day        81
 4 love       53
 5 time       51
 6 girl       46
 7 ross       44
 8 house      42
 9 boy        41
10 prison     41
# ℹ 2,641 more rows

Basic Text Visualization: Tokens

Let’s visualize the count:

tidy_text %>%
  count(word, sort = T) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL) + 
  my_theme()

Basic Text Visualization: Tokens

How negative is Jenny?

tidy_text %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, sort = T)
# A tibble: 2 × 2
  sentiment     n
  <chr>     <int>
1 negative    742
2 positive    514

Basic Text Visualization: Tokens

Does her negativity change over time?

tidy_text %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, index = line%/%100, sort = T)
# A tibble: 32 × 3
   sentiment index     n
   <chr>     <dbl> <int>
 1 positive      4    97
 2 negative      4    94
 3 negative     11    89
 4 negative      7    63
 5 negative     10    60
 6 positive      7    60
 7 negative     12    53
 8 negative      1    52
 9 negative      8    42
10 negative      2    39
# ℹ 22 more rows

Basic Text Visualization: Tokens

Does her negativity change over time?

p <- tidy_text %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, index = line%/%100, sort = T) %>%
  ggplot(aes(x = index, y = n, color = sentiment)) + 
    geom_line() + 
    geom_point() + 
    my_theme()
p

Basic Text Visualization: Tokens

Does her negativity change over time?

p + 
  scale_color_manual(
    values = c("grey40", "goldenrod")
    ) + 
  scale_x_continuous(
    limits = c(0,18)
    , breaks = seq(0,15,5)
    ) + 
  annotate("label"
           , label = "negative"
           , y = 32
           , x = 15.5
           , hjust = 0
           , fill = "grey40"
           , color = "white") + 
  annotate("label"
           , label = "positive"
           , y = 13
           , x = 15.5
           , hjust = 0
           , fill = "goldenrod")  +
  labs(x = "Chunk", y = "Count") + 
  theme(legend.position = "none")

Basic Text Visualization: Tokens

We can also look at most common negative and positive words:

tidy_text %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = T) %>%
  group_by(sentiment) %>%
  top_n(10)
# A tibble: 21 × 3
# Groups:   sentiment [2]
   sentiment word         n
   <chr>     <chr>    <int>
 1 positive  love        53
 2 negative  prison      41
 3 negative  dead        20
 4 positive  fine        20
 5 positive  lovely      18
 6 positive  pretty      16
 7 negative  death       15
 8 negative  terrible    15
 9 positive  nice        14
10 negative  damn        13
# ℹ 11 more rows

Basic Text Visualization: Tokens

We can also look at most common negative and positive words:

p <- tidy_text %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = T) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = sentiment)) +
  geom_col() +
  labs(y = NULL) + 
  facet_wrap(~sentiment, scales = "free_y") +
  my_theme()
p

Basic Text Visualization: Tokens

p + 
  scale_fill_manual(
    values = c("grey40", "goldenrod")
    ) + 
  theme(legend.position = "none")

Basic Text Visualization: Word Clouds

tidy_text %>%
  count(word) %>%
  with(wordcloud(
    word
    , n
    , max.words = 100)
    )

Basic Text Visualization: Word Clouds

pal <- brewer.pal(6,"Dark2")
tidy_text %>%
  count(word) %>%
  with(wordcloud(
    word
    , n
    , max.words = 100
    , colors = pal)
    )

Basic Text Visualization: Word Clouds

par(mar = c(0, 0, 0, 0), mfrow = c(1,2))
tidy_text %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = T) %>%
  filter(sentiment == "negative") %>%
  with(wordcloud(
    word
    , n
    , max.words = 100
    , colors = "grey40")
    )
title("Negative", line = -2)

tidy_text %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = T) %>%
  filter(sentiment == "positive") %>%
  with(wordcloud(
    word
    , n
    , max.words = 100
    , colors = "goldenrod")
    )
title("Positive", line = -2)

Basic Text Visualization: Word Clouds

ggplot2 hacks

Data

  • Data cleaning is often the hardest, most time-consuming part of our research flow
  • Whether we are cleaning raw data or cleaning data that come out of a model object, we have to be able to wrangle it into the shape we need for whatever program we’re using
  • Beyond having lots of tools in your toolbox for reshaping (see Week 1), the biggest data cleaning hack I have has nothing to do with cleaning, per se
  • Specifically, it requires two things:
    • You have to know what the output you want is (in our case, plots)
    • You have to know what the data need to look like to produce that output

Data

  • Let’s consider an example, going back to when we wanted to make correlograms / heat maps.
  • Here’s the plot we wanted to create:

Data

  • Let’s consider an example, going back to when we wanted to make correlograms / heat maps.
  • Here’s the plot we wanted to create:
  • This seems like it should be straightforward because we’re taking a correlation matrix and… visualizing it as a matrix
  • But ggplot2 doesn’t communicate with correlation matrices because they are wide-format matrices, not the long-format data frames ggplot2 expects

Data

  • So we need to figure out how to make the correlation matrix long format in a way that gives us:
    • Variables on the x-axis
    • Variables on the y-axis
    • Correlations for fill
    • Correlations (rounded) for text
    • no double dipping on values
  • If you remember nothing else from this course, please remember this:
    • AESTHETIC MAPPINGS CORRESPOND TO COLUMNS IN THE DATA FRAME YOU ARE PLOTTING
  • So if we want all of the above, we need the following columns:
    • V1 (x)
    • V2 (y)
    • r (fill, text)

Data

  • But what do we currently have?
    • A p*p correlation matrix
    • ggplot2 wants a data frame
  • Where are the variable labels (our eventual V1 [x] and V2 [y])?
  • Where are our correlations?
    • In wide format (unindexed by explicit columns)
r_data$r[[1]]
               p_value          age       gender    SRhealth       smokes
p_value    1.000000000 -0.005224085  0.053627861  0.15917525 -0.069013463
age       -0.005224085  1.000000000 -0.057243245 -0.22438335 -0.078788619
gender     0.053627861 -0.057243245  1.000000000 -0.03182278  0.022275557
SRhealth   0.159175251 -0.224383351 -0.031822781  1.00000000 -0.129241536
smokes    -0.069013463 -0.078788619  0.022275557 -0.12924154  1.000000000
exercise   0.048576025 -0.361768736  0.061659017  0.34546038 -0.155018841
BMI       -0.019741798  0.036151816  0.012217132 -0.09340105 -0.037713371
education  0.001465775 -0.173399716 -0.001603648  0.11008540 -0.096936630
parEdu     0.019871078 -0.374733606  0.055468171  0.08273023  0.005215303
mortality -0.089637524  0.627069166 -0.092109448 -0.31142292  0.035759332
             exercise         BMI    education       parEdu   mortality
p_value    0.04857602 -0.01974180  0.001465775  0.019871078 -0.08963752
age       -0.36176874  0.03615182 -0.173399716 -0.374733606  0.62706917
gender     0.06165902  0.01221713 -0.001603648  0.055468171 -0.09210945
SRhealth   0.34546038 -0.09340105  0.110085399  0.082730234 -0.31142292
smokes    -0.15501884 -0.03771337 -0.096936630  0.005215303  0.03575933
exercise   1.00000000 -0.06217297  0.210204022  0.176766791 -0.32138385
BMI       -0.06217297  1.00000000 -0.048914825 -0.075000576  0.01643219
education  0.21020402 -0.04891483  1.000000000  0.232321970 -0.17215791
parEdu     0.17676679 -0.07500058  0.232321970  1.000000000 -0.18796244
mortality -0.32138385  0.01643219 -0.172157913 -0.187962436  1.00000000
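
If you don’t have r_data from the earlier weeks, any correlation matrix works as a stand-in for the steps that follow:

r <- cor(mtcars, use = "pairwise.complete.obs")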

Data

  • As a reminder, here’s our criteria for what we want our data to look like to plot:
    • V1 (x)
    • V2 (y)
    • r (fill, text)
    • no double dipping on values
    • Must be a data frame
  • But these aren’t in the right order

Data

  • It should be these steps:
    • no double dipping on values
    • Must be a data frame
    • V1 (x)
    • V2 (y); r (fill, text)

Data

  • Last but not least, we have also been learning lots about ggplot2’s default behavior, and one of those things is that it will treat columns of class() character as something that should be ordered alphabetically via scale_[map]_discrete()
    • If we don’t want it to, we need to make it a factor with levels and/or labels we provide
    • For a heat map / correlogram, it is imperative that this order is the same order you gave cor() with the raw data.

Data

You can see that order by looking at the row and column names:

r_data$r[[1]]
               p_value          age       gender    SRhealth       smokes
p_value    1.000000000 -0.005224085  0.053627861  0.15917525 -0.069013463
age       -0.005224085  1.000000000 -0.057243245 -0.22438335 -0.078788619
gender     0.053627861 -0.057243245  1.000000000 -0.03182278  0.022275557
SRhealth   0.159175251 -0.224383351 -0.031822781  1.00000000 -0.129241536
smokes    -0.069013463 -0.078788619  0.022275557 -0.12924154  1.000000000
exercise   0.048576025 -0.361768736  0.061659017  0.34546038 -0.155018841
BMI       -0.019741798  0.036151816  0.012217132 -0.09340105 -0.037713371
education  0.001465775 -0.173399716 -0.001603648  0.11008540 -0.096936630
parEdu     0.019871078 -0.374733606  0.055468171  0.08273023  0.005215303
mortality -0.089637524  0.627069166 -0.092109448 -0.31142292  0.035759332
             exercise         BMI    education       parEdu   mortality
p_value    0.04857602 -0.01974180  0.001465775  0.019871078 -0.08963752
age       -0.36176874  0.03615182 -0.173399716 -0.374733606  0.62706917
gender     0.06165902  0.01221713 -0.001603648  0.055468171 -0.09210945
SRhealth   0.34546038 -0.09340105  0.110085399  0.082730234 -0.31142292
smokes    -0.15501884 -0.03771337 -0.096936630  0.005215303  0.03575933
exercise   1.00000000 -0.06217297  0.210204022  0.176766791 -0.32138385
BMI       -0.06217297  1.00000000 -0.048914825 -0.075000576  0.01643219
education  0.21020402 -0.04891483  1.000000000  0.232321970 -0.17215791
parEdu     0.17676679 -0.07500058  0.232321970  1.000000000 -0.18796244
mortality -0.32138385  0.01643219 -0.172157913 -0.187962436  1.00000000

Data

  • It should be these steps:
    • Get variable order from correlation matrix
    • no double dipping on values
    • Must be a data frame
    • V1 (x)
    • V2 (y); r (fill, text)
    • Preserve variable order through factors
r <- r_data$r[[1]]
coln <- colnames(r)
coln
 [1] "p_value"   "age"       "gender"    "SRhealth"  "smokes"    "exercise" 
 [7] "BMI"       "education" "parEdu"    "mortality"

Data

  • It should be these steps:
    • Get variable order from correlation matrix
    • no double dipping on values
    • Must be a data frame
    • V1 (x)
    • V2 (y); r (fill, text)
    • Preserve variable order through factors
r <- r_data$r[[1]]
coln <- colnames(r)
r[lower.tri(r, diag = T)] <- NA
r
          p_value          age      gender    SRhealth      smokes    exercise
p_value        NA -0.005224085  0.05362786  0.15917525 -0.06901346  0.04857602
age            NA           NA -0.05724324 -0.22438335 -0.07878862 -0.36176874
gender         NA           NA          NA -0.03182278  0.02227556  0.06165902
SRhealth       NA           NA          NA          NA -0.12924154  0.34546038
smokes         NA           NA          NA          NA          NA -0.15501884
exercise       NA           NA          NA          NA          NA          NA
BMI            NA           NA          NA          NA          NA          NA
education      NA           NA          NA          NA          NA          NA
parEdu         NA           NA          NA          NA          NA          NA
mortality      NA           NA          NA          NA          NA          NA
                  BMI    education       parEdu   mortality
p_value   -0.01974180  0.001465775  0.019871078 -0.08963752
age        0.03615182 -0.173399716 -0.374733606  0.62706917
gender     0.01221713 -0.001603648  0.055468171 -0.09210945
SRhealth  -0.09340105  0.110085399  0.082730234 -0.31142292
smokes    -0.03771337 -0.096936630  0.005215303  0.03575933
exercise  -0.06217297  0.210204022  0.176766791 -0.32138385
BMI                NA -0.048914825 -0.075000576  0.01643219
education          NA           NA  0.232321970 -0.17215791
parEdu             NA           NA           NA -0.18796244
mortality          NA           NA           NA          NA

Data

  • It should be these steps:
    • Get variable order from correlation matrix
    • no double dipping on values
    • Must be a data frame
    • V1 (x)
    • V2 (y); r (fill, text)
    • Preserve variable order through factors
r <- r_data$r[[1]]
coln <- colnames(r)
r[lower.tri(r, diag = T)] <- NA
r %>% data.frame()
          p_value          age      gender    SRhealth      smokes    exercise
p_value        NA -0.005224085  0.05362786  0.15917525 -0.06901346  0.04857602
age            NA           NA -0.05724324 -0.22438335 -0.07878862 -0.36176874
gender         NA           NA          NA -0.03182278  0.02227556  0.06165902
SRhealth       NA           NA          NA          NA -0.12924154  0.34546038
smokes         NA           NA          NA          NA          NA -0.15501884
exercise       NA           NA          NA          NA          NA          NA
BMI            NA           NA          NA          NA          NA          NA
education      NA           NA          NA          NA          NA          NA
parEdu         NA           NA          NA          NA          NA          NA
mortality      NA           NA          NA          NA          NA          NA
                  BMI    education       parEdu   mortality
p_value   -0.01974180  0.001465775  0.019871078 -0.08963752
age        0.03615182 -0.173399716 -0.374733606  0.62706917
gender     0.01221713 -0.001603648  0.055468171 -0.09210945
SRhealth  -0.09340105  0.110085399  0.082730234 -0.31142292
smokes    -0.03771337 -0.096936630  0.005215303  0.03575933
exercise  -0.06217297  0.210204022  0.176766791 -0.32138385
BMI                NA -0.048914825 -0.075000576  0.01643219
education          NA           NA  0.232321970 -0.17215791
parEdu             NA           NA           NA -0.18796244
mortality          NA           NA           NA          NA

Data

  • It should be these steps:
    • Get variable order from correlation matrix
    • no double dipping on values
    • Must be a data frame
    • V1 (x)
    • V2 (y); r (fill, text)
    • Preserve variable order through factors
r <- r_data$r[[1]]
coln <- colnames(r)
r[lower.tri(r, diag = T)] <- NA
r %>% data.frame() %>%
  rownames_to_column("V1")
          V1 p_value          age      gender    SRhealth      smokes
1    p_value      NA -0.005224085  0.05362786  0.15917525 -0.06901346
2        age      NA           NA -0.05724324 -0.22438335 -0.07878862
3     gender      NA           NA          NA -0.03182278  0.02227556
4   SRhealth      NA           NA          NA          NA -0.12924154
5     smokes      NA           NA          NA          NA          NA
6   exercise      NA           NA          NA          NA          NA
7        BMI      NA           NA          NA          NA          NA
8  education      NA           NA          NA          NA          NA
9     parEdu      NA           NA          NA          NA          NA
10 mortality      NA           NA          NA          NA          NA
      exercise         BMI    education       parEdu   mortality
1   0.04857602 -0.01974180  0.001465775  0.019871078 -0.08963752
2  -0.36176874  0.03615182 -0.173399716 -0.374733606  0.62706917
3   0.06165902  0.01221713 -0.001603648  0.055468171 -0.09210945
4   0.34546038 -0.09340105  0.110085399  0.082730234 -0.31142292
5  -0.15501884 -0.03771337 -0.096936630  0.005215303  0.03575933
6           NA -0.06217297  0.210204022  0.176766791 -0.32138385
7           NA          NA -0.048914825 -0.075000576  0.01643219
8           NA          NA           NA  0.232321970 -0.17215791
9           NA          NA           NA           NA -0.18796244
10          NA          NA           NA           NA          NA

Data

  • It should be these steps:
    • Get variable order from correlation matrix
    • no double dipping on values
    • Must be a data frame
    • V1 (x)
    • V2 (y); r (fill, text)
    • Preserve variable order through factors
r <- r_data$r[[1]]
coln <- colnames(r)
r[lower.tri(r, diag = T)] <- NA
r %>% data.frame() %>%
  rownames_to_column("V1") %>%
  pivot_longer(
    cols = -V1
    , values_to = "r"
    , names_to = "V2"
  )
# A tibble: 100 × 3
   V1      V2               r
   <chr>   <chr>        <dbl>
 1 p_value p_value   NA      
 2 p_value age       -0.00522
 3 p_value gender     0.0536 
 4 p_value SRhealth   0.159  
 5 p_value smokes    -0.0690 
 6 p_value exercise   0.0486 
 7 p_value BMI       -0.0197 
 8 p_value education  0.00147
 9 p_value parEdu     0.0199 
10 p_value mortality -0.0896 
# ℹ 90 more rows

Data

  • It should be these steps:
    • Get variable order from correlation matrix
    • no double dipping on values
    • Must be a data frame
    • V1 (x)
    • V2 (y); r (fill, text)
    • Preserve variable order through factors
r <- r_data$r[[1]]
coln <- colnames(r)
r[lower.tri(r, diag = T)] <- NA
r %>% data.frame() %>%
  rownames_to_column("V1") %>%
  pivot_longer(
    cols = -V1
    , values_to = "r"
    , names_to = "V2"
  ) %>%
  mutate(V1 = factor(V1, levels = rev(coln))
         , V2 = factor(V2, levels = coln))
# A tibble: 100 × 3
   V1      V2               r
   <fct>   <fct>        <dbl>
 1 p_value p_value   NA      
 2 p_value age       -0.00522
 3 p_value gender     0.0536 
 4 p_value SRhealth   0.159  
 5 p_value smokes    -0.0690 
 6 p_value exercise   0.0486 
 7 p_value BMI       -0.0197 
 8 p_value education  0.00147
 9 p_value parEdu     0.0199 
10 p_value mortality -0.0896 
# ℹ 90 more rows
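
That’s the shape we need. To close the loop, here’s a minimal sketch of the correlogram this feeds, assuming we save the pipeline’s result as r_long (fill and the rounded text labels both map to the r column):

r_long %>%
  filter(!is.na(r)) %>%
  ggplot(aes(x = V2, y = V1, fill = r)) + 
    geom_tile(color = "white") + 
    geom_text(aes(label = sprintf("%.2f", r)), size = 3) + 
    scale_fill_gradient2(low = "blue", mid = "white", high = "red", limits = c(-1, 1)) + 
    my_theme()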

Data Final Words

  • Data cleaning is anxiety-provoking for lots of really valid reasons
  • You probably outline your writing, so why not outline your data cleaning? It’s writing, too
  • Start by figuring out three things:
    • What do your data look like now?
    • What’s your final product (table, visualization, etc.)?
    • What do your data need to look like to be able to feed into that final product?
  • Then, start filling out the middle:
    • How do you get to that end point?

Axes

Axes

  • Remember when we talked about bar charts?
  • When we measure things, we are careful about scales, wording, etc.
  • But when we plot our measures, we sometimes fail to give it the same thoughtfulness
  • Our axes should be representative of our measures!
load(url("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/05-week5-time-series/01-data/ipcs_data.RData"))
ipcs_data
# A tibble: 4,222 × 70
   SID   Full_Date   afraid angry attentive content excited goaldir guilty happy
   <chr> <chr>        <dbl> <dbl>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl>
 1 02    2018-10-22…      1     2         4       4       2       5      2     3
 2 02    2018-10-22…      1     1         4       3       2       5      1     3
 3 02    2018-10-23…      2     1         2       3       1       2      2     3
 4 02    2018-10-23…      2     2         4       3       2       4      1     3
 5 02    2018-10-23…      2     1         4       4       3       4      1     3
 6 02    2018-10-24…      2     1         4       4       2       4      1     3
 7 02    2018-10-24…      2     1         4       3       2       4      1     3
 8 02    2018-10-24…      2     1         4       4       4       4      1     4
 9 02    2018-10-24…      2     2         3       3       3       3      2     2
10 02    2018-10-25…      2     1         4       4       3       3      2     4
# ℹ 4,212 more rows
# ℹ 60 more variables: proud <dbl>, purposeful <dbl>,
#   agreeableness_Compassion <dbl>, agreeableness_Respectfulness <dbl>,
#   agreeableness_Trust <dbl>, conscientiousness_Organization <dbl>,
#   conscientiousness_Productiveness <dbl>,
#   conscientiousness_Responsibility <dbl>, extraversion_Assertiveness <dbl>,
#   extraversion_Energy.Level <dbl>, extraversion_Sociability <dbl>, …

Axes: Bar Charts

ipcs_long <- ipcs_data %>%
  filter(SID == "02") %>%
  select(SID:purposeful) %>%
  pivot_longer(
    cols = c(-SID, -Full_Date)
    , values_to = "value"
    , names_to = "var"
    , values_drop_na = T
  ) %>%
  mutate(valence = ifelse(var %in% c("afraid", "angry", "guilty"), "Negative", "Positive"))
ipcs_long
# A tibble: 480 × 5
   SID   Full_Date        var        value valence 
   <chr> <chr>            <chr>      <dbl> <chr>   
 1 02    2018-10-22 13:23 afraid         1 Negative
 2 02    2018-10-22 13:23 angry          2 Negative
 3 02    2018-10-22 13:23 attentive      4 Positive
 4 02    2018-10-22 13:23 content        4 Positive
 5 02    2018-10-22 13:23 excited        2 Positive
 6 02    2018-10-22 13:23 goaldir        5 Positive
 7 02    2018-10-22 13:23 guilty         2 Negative
 8 02    2018-10-22 13:23 happy          3 Positive
 9 02    2018-10-22 13:23 proud          4 Positive
10 02    2018-10-22 13:23 purposeful     4 Positive
# ℹ 470 more rows

Axes: Bar Charts

ipcs_long %>%
  group_by(var, valence) %>%
  summarize_at(vars(value), lst(mean, sd)) %>%
  ungroup() %>%
  ggplot(aes(x = var, y = mean, fill = valence)) + 
    geom_bar(
      stat = "identity"
      , position = "dodge"
      ) + 
    geom_errorbar(
      aes(ymin = mean - sd, ymax = mean + sd)
      , width = .1
      ) +
    facet_grid(~valence, scales = "free_x", space = "free_x") + 
    my_theme()

Axes: Bar Charts

ipcs_long %>%
  group_by(var, valence) %>%
  summarize_at(vars(value), lst(mean, sd)) %>%
  ungroup() %>%
  ggplot(aes(x = var, y = mean - 1, fill = valence)) + 
    geom_bar(
      stat = "identity"
      , position = "dodge"
      ) + 
    geom_errorbar(
      aes(ymin = mean - 1 - sd, ymax = mean - 1 + sd)
      , width = .1
      ) +
    scale_y_continuous(limits = c(0,4), breaks = seq(0,4,1), labels = 1:5) + 
    facet_grid(~valence, scales = "free_x", space = "free_x") + 
    my_theme()

But our scale doesn’t start at 0, it starts at 1, so we shift the bars down by 1 and relabel the axis!

Axes: Bar Charts

p <- ipcs_long %>%
  group_by(var, valence) %>%
  summarize_at(vars(value), lst(mean, sd)) %>%
  ungroup() %>%
  ggplot(aes(x = var, y = mean - 1, fill = valence)) + 
    geom_bar(
      stat = "identity"
      , position = "dodge"
      ) + 
    geom_jitter(
      data = ipcs_long
      , aes(y = value - 1, fill = valence)
      , color = "black"
      , shape = 21
      , alpha = .5
      , width = .2
      , height = .1
    ) + 
    geom_errorbar(
      aes(ymin = mean - 1 - sd, ymax = mean - 1 + sd)
      , width = .1
      ) +
    scale_y_continuous(limits = c(-.1,4), breaks = seq(0,4,1), labels = 1:5) +
    facet_grid(~valence, scales = "free_x", space = "free_x") + 
    my_theme()
p

Axes: Bar Charts

p + 
  labs(
    x = NULL
    , y = "Mean Rating (1-5) + SD"
  ) + 
  theme(
    legend.position = "none"
    , axis.text.x = element_text(angle = 45, hjust = 1)
    )

Axes: Another Example

  • Here’s a plot I made for my NSF grant, demonstrating different mean-level patterns of a behavior across situations from 1 to n.
  • Note the … in the axis, which is standard notation for indicating an arbitrary number of intermediate values.
  • How would we create this?

Axes: Another Example

Here’s the data:

tibble(
  p = as.character(rep(1, 4))
  , x = paste0("S", c(1,2,3,"p"))
  , y = c(1, 2, 4, 3)
  ) 
# A tibble: 4 × 3
  p     x         y
  <chr> <chr> <dbl>
1 1     S1        1
2 1     S2        2
3 1     S3        4
4 1     Sp        3

Axes: Another Example

Let’s add the core ggplot code:

tibble(
  p = as.character(rep(1, 4))
  , x = paste0("S", c(1,2,3,"p"))
  , y = c(1, 2, 4, 3)
  ) %>%
  ggplot(aes(x = x, y = y, group = p))

Axes: Another Example

And our geoms, labs, and theme:

tibble(
  p = as.character(rep(1, 4))
  , x = paste0("S", c(1,2,3,"p"))
  , y = c(1, 2, 4, 3)
  ) %>%
  ggplot(aes(x = x, y = y, group = p)) + 
    geom_line(size = 1, color = "#8cdbbe") + 
    geom_point(
      size = 2.5
      , color = "black"
      , shape = "square"
      ) + 
    labs(
      x = "Situation"
      , y = "Mean Response"
      , title = "Intraindividual Variability"
      , subtitle = "Person 1"
      ) + 
    my_theme()

But how do we add the …?

Axes: Another Example

Let’s switch to a continuous scale, then we can use labels to add it!

tibble(
  p = as.character(rep(1, 4))
  , x = paste0("S", c(1,2,3,"p"))
  , x2 = 1:4
  , y = c(1, 2, 4, 3)
  ) %>%
  ggplot(aes(x = x2, y = y, group = p)) + 
    geom_line(size = 1, color = "#8cdbbe") + 
    geom_point(
      size = 2.5
      , color = "black"
      , shape = "square"
      ) + 
    labs(
      x = "Situation"
      , y = "Mean Response"
      , title = "Intraindividual Variability"
      , subtitle = "Person 1"
      ) + 
    my_theme()

Axes: Another Example

Let’s switch to a continuous scale, then we can use labels to add it!

tibble(
  p = as.character(rep(1, 4))
  , x = paste0("S", c(1,2,3,"p"))
  , x2 = 1:4
  , y = c(1, 2, 4, 3)
  ) %>%
  ggplot(aes(x = x2, y = y, group = p)) + 
    geom_line(size = 1, color = "#8cdbbe") + 
    geom_point(
      size = 2.5
      , color = "black"
      , shape = "square"
      ) + 
    scale_x_continuous(
      limits = c(.9, 4.1)
      , breaks = c(1,2,3,3.5,4)
      , labels = c("S1", "S2", "S3", "...", "S4")
      ) + 
    labs(
      x = "Situation"
      , y = "Mean Response"
      , title = "Intraindividual Variability"
      , subtitle = "Person 1"
      ) + 
    my_theme()

Almost there, but we don’t want the tick mark at …

Axes: Another Example

We can actually supply a vector to axis.ticks.x with one element per break, specifying the size of each tick (a size of 0 hides that tick)!

tibble(
  p = as.character(rep(1, 4))
  , x = paste0("S", c(1,2,3,"p"))
  , x2 = 1:4
  , y = c(1, 2, 4, 3)
  ) %>%
  ggplot(aes(x = x2, y = y, group = p)) + 
    geom_line(size = 1, color = "#8cdbbe") + 
    geom_point(
      size = 2.5
      , color = "black"
      , shape = "square"
      ) + 
    scale_x_continuous(
      limits = c(.9, 4.1)
      , breaks = c(1,2,3,3.5,4)
      , labels = c("S1", "S2", "S3", "...", "Sn")
      ) + 
    labs(
      x = "Situation"
      , y = "Mean Response"
      , title = "Intraindividual Variability"
      , subtitle = "Person 1"
      ) + 
    my_theme() + 
    theme(axis.ticks.x = element_line(size = c(rep(.5, 3), 0, .5)))

Scales

ggplot2’s coord_*() functions control the coordinate system your scales are drawn onto:

  • coord_cartesian(): the default and what you’ll use most of the time
  • coord_polar(): remember Trig and Calculus?
  • coord_quickmap(): sets you up to plot maps
  • coord_trans(): apply transformations to the coordinate plane (see the sketch after this list)
  • coord_flip(): flip x and y
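
For example, a minimal sketch of coord_trans(): the data and stats are left untouched; only the coordinate plane is transformed after the fact:

mtcars %>%
  ggplot(aes(x = disp, y = mpg)) + 
    geom_point() + 
    coord_trans(x = "log10") + 
    my_theme()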

Scales: coord_polar()

ipcs_m <- ipcs_data %>% 
  filter(SID %in% c(216, 211, 174)) %>%
  select(SID, Full_Date, afraid:purposeful, Adversity:Sociability)
ipcs_m
# A tibble: 310 × 20
   SID   Full_Date   afraid angry attentive content excited goaldir guilty happy
   <chr> <chr>        <dbl> <dbl>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl>
 1 174   2019-10-23…      1     1         3       3       3       3      1     4
 2 174   2019-10-23…      1     1         4       3       3       3      1     3
 3 174   2019-10-23…      1     1         4       4       3       3      1     4
 4 174   2019-10-23…      1     1         3       3       3       3      1     3
 5 174   2019-10-24…      1     1         3       3       3       3      1     3
 6 174   2019-10-24…      1     1         3       3       3       3      1     3
 7 174   2019-10-24…      1     1         3       3       3       3      1     3
 8 174   2019-10-24…      1     1         3       3       3       3      1     3
 9 174   2019-10-25…      1     1         3       3       3       3      1     3
10 174   2019-10-25…      1     1         3       3       3       3      1     3
# ℹ 300 more rows
# ℹ 10 more variables: proud <dbl>, purposeful <dbl>, Adversity <dbl>,
#   Deception <dbl>, Duty <dbl>, Intellect <dbl>, Mating <dbl>,
#   Negativity <dbl>, pOsitivity <dbl>, Sociability <dbl>

Scales: coord_polar()

vars <- colnames(ipcs_m)[c(-1, -2)]
ipcs_m <- ipcs_m %>%
  pivot_longer(
    cols = c(-SID, -Full_Date)
    , values_to = "value"
    , names_to = "var"
    , values_drop_na = T
  ) %>%
  group_by(SID, var) %>%
  summarize(m = mean(value)
         , sd = sd(value)) %>%
  ungroup()
ipcs_m
# A tibble: 54 × 4
   SID   var             m    sd
   <chr> <chr>       <dbl> <dbl>
 1 174   Adversity    1    0    
 2 174   Deception    1    0    
 3 174   Duty         1.93 0.482
 4 174   Intellect    1.76 0.432
 5 174   Mating       2.03 0.739
 6 174   Negativity   1.03 0.173
 7 174   Sociability  1.96 0.591
 8 174   afraid       1.01 0.101
 9 174   angry        1.05 0.362
10 174   attentive    3.07 0.296
# ℹ 44 more rows

Scales: coord_polar()

vars <- tibble(
  var = vars
  , cat = c(rep("Emotion", 10), rep("Situation", 8))
  , num = 1:length(vars)
)

ipcs_m <- ipcs_m %>%
  left_join(vars %>% rename(var2 = num)) 

p <- ipcs_m %>%
  ggplot(aes(x = var2, y = m, fill = cat)) + 
    geom_bar(
      stat = "identity"
      , position = "dodge"
      ) +
    my_theme() +
    facet_wrap(~SID)
p

Scales: coord_polar()

p <- p + 
  scale_fill_brewer(palette = "Set2")
p

Scales: coord_polar()

p <- p + 
  coord_polar()
p

Scales: coord_polar()

angle <- 90 - 360 * (ipcs_m$var2-0.5) / nrow(vars)

p <- p + 
  geom_text(aes(label = var, y = m + .5), angle = angle
            , hjust = 0, size = 3, alpha = .6) + 
  scale_y_continuous(limits = c(-2, 6.5))
p

Scales: coord_polar()

p <- p + 
  labs(
    fill = "Feature Category"
    , title = "Relative Differences in Intraindividual Means"
    , subtitle = "Across Emotions and Situation Perceptions"
    ) + 
  theme(
    axis.line = element_blank()
    , axis.text = element_blank()
    , axis.ticks = element_blank()
    , axis.title = element_blank()
    , panel.background = element_rect(color = "black", fill = NA, size = 1)
  ) 
p

Scales: coord_polar()

p

Points

You can use any text character as a point shape:

pred_data %>%
  filter(study == "Study1") %>%
  ggplot(aes(x = p_value, y = SRhealth)) + 
    geom_point(
      aes(shape = gender, color = gender)
      , size = 3
      , alpha = .75
      ) + 
    scale_shape_manual(
      values = c("M", "W")
      ) + 
    scale_color_manual(
      values = c("blue", "red")
      ) + 
    my_theme()

Annotations

Text

  • You’ve already seen lots of examples of using annotate("text", ...)
  • But we can also use annotate("text", label = "mu", parse = T) or annotate("text", label = "mu[i]", parse = T) to produce math text (plotmath, e.g. Greek letters and subscripts) in our plots

Text

Here’s another figure from a grant I’m working on that uses several of the features we’ve been discussing:

set.seed(11)

dist_df = tibble(
  dist = dist_normal(3,0.75),
  dist_name = format(dist)
)

dist_df %>%
  ggplot(aes(y = 1, xdist = dist)) +
  stat_slab(fill = "#8cdbbe") + 
  annotate("point", x = 3, y = 1, size = 3) +
  annotate("text", label = "mu", x = 3, y = .92, parse = T, size = 8) + 
  annotate("text", label = "people", x = 2, y = .95) + 
  annotate("segment", size = 1, x = 2.8, xend = 1.2, y = .98, yend = .98
           , arrow = arrow(type = "closed", length=unit(2, "mm"))) + 
  annotate("text", label = "people", x = 4, y = .95) + 
  annotate("segment", size = 1, x = 3.2, xend = 4.8, y = .98, yend = .98
           , arrow = arrow(type = "closed", length=unit(2, "mm"))) + 
  labs(title = "Between-Person Differences") + 
  theme_void() + 
  theme(
    plot.title = element_text(face = "bold", size = rel(1.2), hjust = .5)
    )

Legends

  • There are several ways to control legends:
    • use theme(legend.position = [arg]) to change its position
    • use labs([mappings] = "[titles]") to control legend titles
    • use guides() to do just about everything else

Legends: theme()

  • legend.position takes two kinds of arguments
    • text: "none", "left", "right" (default), "bottom", "top"
    • vector: x and y position (e.g. c(1,1))
hmp + 
  theme(legend.position = "right")

Legends: theme()

  • legend.position takes two kinds of arguments
    • text: "none", "left", "right" (default), "bottom", "top"
    • vector: x and y position (e.g. c(1,1))
hmp + 
  theme(legend.position = c(.8, .35))

Legends: labs

  • I won’t spend too much time here. We’ve seen this a lot
  • Say that you set color and fill equal to variable V1
  • Unless you specify differently, that will be the axis title
  • You can change this using labs(fill = "My Title", color = "My Title")
  • But make sure you
    • Set both
    • Make the labels the same, or they will not be combined into a single legend (see the sketch below)
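
A minimal sketch with a made-up data frame: identical titles are what let ggplot2 merge the color and fill keys into one legend:

tibble(
  x = 1:6
  , y = rnorm(6)
  , V1 = rep(c("a", "b"), 3)
  ) %>%
  ggplot(aes(x = x, y = y, color = V1, fill = V1)) + 
    geom_point(shape = 21, size = 3) + 
    labs(color = "My Title", fill = "My Title") + 
    my_theme()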

Legends: guides()

  • theme() lets you control the position of the legend and how it appears
  • labs() lets you control its titles
  • scale_[map]_[type] lets you control limits, breaks, and labels
  • guides() lets you control individual legend components

Legends: guides()

Remember correlograms? Do we need the size legend?

Legends: guides()

Remember correlograms? Do we need the size legend?

p + 
  guides(size = "none")

Legends: guides()

Remember correlograms? Do we need the size legend?

p + 
  guides(
    size = "none"
    , fill = guide_legend(
      direction = "vertical"
      , ncol = 2
      )
    ) + 
  theme(legend.position = c(.7,.3))

AMA