Week 1 Workbook
Week 1 - Getting Situated in R & Quarto
You can download the code for this workbook and the rendered workbook here.
Goals for Today
- Course Overview
- What Is a Workflow?
- Fundamentals of R
- Brief Quarto Overview
Course Overview
Course Goals & Learning Outcomes
After successful completion of this course, you will be able to:
Build your own research workflow that can be ported to future projects.
Learn new programming skills that will help you efficiently, accurately, and deliberately clean and manage your data.
Create a bank of code and tools that can be used for a variety of types of research.
Course Expectations
- ~50% of the course will be in R
- You will get the most from this course if you:
- have your own data you can apply course content to
- know how to clean, transform, and manage that data
- today’s workshop is a good litmus test for this
Course Materials
- All materials (required and optional) are free and online
- Wickham & Grolemund: R for Data Science https://r4ds.had.co.nz
- Wickham: Advanced R http://adv-r.had.co.nz
- Data Camp: All paid content unlocked
Assignments
Assignment Weights | Percent |
---|---|
Class Participation | 20% |
Problem Sets | 40% |
Final Project Proposal | 10%* |
Class Presentation | 10%* |
Final Project | 20%* |
Total | 100% |
Class Participation
- There are lots of ways to participate, both in and outside class meetings
- Classes will be technologically hybrid
- The goal of this is accessibility and creating recordings
- If you need to miss 2+ classes (i.e. 20+% of total class time), maybe consider taking the course in a different year
Problem Sets
- The main homework in the course is weekly problem sets
- The goal is to let you apply concepts from that week to your own data (or whatever data you’ll focus on for the class)
- Problem sets will be posted on Mondays before class
- Due 12:01 AM each Monday (starting next Monday and not including the last day of the course)
Final Projects
- Final project replaces final exam (there are no exams)
- This is a bring your own data class, so the goal of the course is to apply what you’re learning to your own research throughout the term
- Details of the final project TBD, but will generally include
- Stage 1: Proposals (due 11/13/23)
- Stage 2: In-class presentations (12/04/23)
- Stage 3: Final project submission (Due day and time of scheduled final; which I can’t access because ScheduleBuilder thinks I need a CRN for my own course and no one emails me back 🙃)
Extra Credit
Participate in a TidyTuesday (https://www.tidytuesday.com).
2 pt extra credit for each one you participate in (max 6 pt total).
Can post on Twitter or just create a document with the code and output
- Submit on Canvas
- If posting, link the post in the Canvas submission
- If not posting, attach the knitted file on Canvas
Grading Scale
92.5% - 100% = A; 89.5% - 92.4% = A-
87.5% - 89.4% = B+; 82.5% - 87.4% = B; 79.5% - 82.4% = B-
77.5% - 79.4% = C+; 72.5% - 77.4% = C; 69.5% - 72.4% = C-
67.5% - 69.4% = D+; 62.5% - 67.4% = D; 59.5% - 62.4% = D-
0% - 59.4% = F
Schedule
- Week 1: Intro & Basics
- Week 2: Reproducibility & dplyr
- Week 3: Data Quality & tidyr
- Week 4: Codebooks & importing data
- Week 5: Data structures & transformation
- Week 6: Versioning & purrr
- Week 7: Efficient R & parallelization
- Week 8: TBD & tables and figures in R
- Week 9: Odds and ends & help with projects
- Week 10: Presentations
What is a workflow?
- Dictionary definition: “the sequence of industrial, administrative, or other processes through which a piece of work passes from initiation to completion”
- Research Workflow: “The process of conducting research from conceptualization to dissemination”
Why Should I Care?
- Whether you like it or not, you have a workflow
- You have ways you go about doing a project that you maybe haven’t thought too much about
- Issues arise when
- A workflow has missing steps
- Your workflow is inconsistent across projects
- Your workflow is inefficient, which can lead to mistakes
- A workflow is a work in progress. If it no longer serves you, let it go
How Do I Build a Workflow?
- Building a good workflow is both top-down (i.e. big steps to smaller ones) and bottom-up (i.e. necessary smaller steps make certain larger ones necessary)
- What?
Example: New Data Collection
1. Conceptualization
2. Funding acquisition
3. Preregistration
4. Project Building
5. Data Collection
6. Data Cleaning
7. Data Analysis
8. Writing (and rewriting)
9. Submission
10. Revision (and possibly crying)
11. ACCEPTANCE
Example: Secondary Data
1. Conceptualization
2. Data search
3. Project Building
4. Data documentation
5. Preregistration
6. Data Cleaning
7. Data Analysis
8. Writing (and rewriting)
9. Submission
10. Revision (and possibly crying)
11. ACCEPTANCE
- Workflows Are Hierarchical: Example – Data Cleaning Steps
Experimental Data
1. Gather all data files
2. Quality checks for each file
3. Load all files
4. Merge all files
5. Check all descriptives
6. Scoring, coding, and data transformation
7. Recheck all descriptives
8. Correlations and visualization
9. Restructure data for analyses
Secondary Data
1. Gather all data files
2. Load each file
3. Extract variables used
4. Rename variables, possibly deal with time variables
5. Merge all files
6. Check all descriptives
7. Scoring, coding, and data transformation
8. Recheck all descriptives
9. Correlations and visualization
10. Restructure data for analyses
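The first few cleaning steps above (gather, load, merge, check descriptives) can be sketched in base R. This is only an illustration under stated assumptions: the folder name, file names, and columns are hypothetical stand-ins written to a temporary directory, not course data.

```r
# Hypothetical example data: two small files written to a temporary folder
data_dir <- file.path(tempdir(), "workflow_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, age = c(21, 34, 29)),
          file.path(data_dir, "demographics.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3, score = c(4.5, 3.0, 5.0)),
          file.path(data_dir, "survey.csv"), row.names = FALSE)

# Gather all data files
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Load each file into a list of data frames
loaded <- lapply(files, read.csv)

# Merge all files on the shared id column
merged <- Reduce(function(x, y) merge(x, y, by = "id"), loaded)

# Check descriptives
summary(merged)
```

Later weeks cover tidyverse tools for these same steps; the point here is only that each step in the list maps to a line or two of code.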
Workflows: Overview of the Course
In this class, we will focus on building tools for:
Documenting Data (both before and after collection)
File management (how do I build a machine and human navigable directory)
Loading data files
All steps of cleaning data
Restructuring Data
DESCRIPTIVES DESCRIPTIVES DESCRIPTIVES
Efficient Programming (plz stop copy-pasting)
This class does not focus on modeling but rather how you get your data set up to run models (Weeks 1-5/6) AND how to extract and present data after you run them (Weeks 6/7-9)
We will focus on classes of models in R you will most likely encounter (lm(), glm(), lmer(), nlme(), lavaan, brms)
If you run other kinds of models, most tools we will use are portable to many packages and other object classes
By the end of this class, my goal is that you:
- Have a documented workflow for the kind of research you work on
- Have a set of tools and skills that apply to each piece of that workflow
- Have a skillset that will allow you to adapt and build new workflows for different kinds of research
Fundamentals of R
What is R? Why R?
- An “open source” programming language and software that provide collections of interrelated “functions”
- “open source” means that R is free and created by the user community. The user community can modify basic things about R and add new capabilities to what R can do
- a “function” is usually something that takes in some “input,” processes this input in some way, and creates some “output”
- e.g., the max() function takes as input a collection of numbers (e.g., 3,5,6) and returns as output the number with the maximum value
- e.g., the lm() function takes in as inputs a dataset and a statistical model you specify within the function, and returns as output the results of the regression model
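A small sketch of the two functions just described. The dataset (mtcars, which ships with R) and the model mpg ~ wt are assumptions chosen only for illustration.

```r
# max(): input is a collection of numbers, output is the largest one
largest <- max(c(3, 5, 6))
largest

# lm(): inputs are a model formula and a dataset, output is a fitted model
# mtcars is a built-in dataset; mpg ~ wt is an illustrative model
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```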
Base R vs. R packages
Base R
- When you install R, you automatically install the “Base R” set of functions
- Example of a few of the functions in Base R:
- as.character() function
- print() function
- setwd() function
R packages
- an R “package” (or “library”) is a collection of (related) functions developed by the R community
- Examples of R packages:
- tidyverse package for manipulating and visualizing data
- igraph package for network analyses
- leaflet package for mapping
- rvest package for webscraping
- rtweet package for streaming and downloading data from Twitter
- All R packages are free!
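As a rough sketch of how packages are used in practice: you install once, then load in each session. The install line is commented out because it needs an internet connection; stats is used as the example because it ships with R.

```r
# Installing is done once per machine (needs an internet connection):
# install.packages("tidyverse")

# Loading is done once per R session; stats ships with R, so this always works
library(stats)

# After loading, call a function by name ...
sd_loaded <- sd(c(2, 4, 6))

# ... or use package::function() without loading the package first
sd_prefixed <- stats::sd(c(2, 4, 6))
```

The package::function() form is also a handy way to be explicit when two loaded packages define a function with the same name.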
Why Use RStudio (Posit)
- Also free
- Basically a GUI for R
- Organize files, import data, etc. with ease
- RMarkdown, Quarto, and more are powerful tools (they were used to create these slides!)
- Lots of new features and support
Why Use the tidyverse
- Maintained by RStudio (Posit)
- No one should have to use a for loop to change data from long to wide
- Tons of integrated tools for data cleaning, manipulation, transformation, and visualization
- Even increasing support for modeling (e.g., tidymodels)
Why use Quarto
- These slides
- The course website
- Your homework
- All written in Quarto
Some R Basics
Executing R commands
Three ways to execute commands in R
- Type/copy commands directly into the “console”
- “Code chunks” in RMarkdown (.Rmd files)
- Cmd/Ctrl + Enter: execute highlighted line(s) within chunk
- Cmd/Ctrl + Shift + k: “knit” entire document
- R scripts (.R files)
- Cmd/Ctrl + Enter: execute highlighted line(s)
- Cmd/Ctrl + Shift + Enter (without highlighting any lines): run entire script
Assignment
Assignment refers to creating an “object” and assigning values to it
- The object may be a variable, a dataset, a bit of text that reads “la la la”
- <- is the assignment operator
- in other languages = is the assignment operator
- general syntax: object_name <- object_values
- good practice to put a space before and after assignment operator
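A minimal illustration of the assignment syntax above. The object names and values are made up for the example.

```r
# object_name <- object_values, with a space on each side of the operator
my_number <- 5
my_text <- "la la la"

# Typing an object's name prints its value
my_number
my_text

# = also assigns at the top level, but <- is the R convention
also_works = 10
```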
Objects
R is an “object-oriented” programming language (like Python, JavaScript). So, what is an “object”?
- formal computer science definitions are confusing because they require knowledge of concepts we haven’t introduced yet
- More intuitively, I think of objects as anything I assign values to
- For example, a and b are the names of objects once I assign values to them
Ben Skinner says “Objects are like boxes in which we can put things: data, functions, and even other objects.”
Many commercial statistical software packages (e.g., SPSS, Stata) operate on datasets, which consist of rows of observations and columns of variables
Usually, these packages can open only one dataset at a time
By contrast, in R everything is an object and there is no limit to the number of objects R can hold (except memory)
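A sketch of the "boxes" idea: the objects a and b mentioned above, plus a function and a list stored as objects, and two datasets held at the same time. The specific values and the names add_one, box, cars_data, and iris_data are assumptions for illustration; mtcars and iris are datasets that ship with R.

```r
a <- 5                         # a number
b <- "yay!"                    # a bit of text
add_one <- function(x) x + 1   # a function is an object too
box <- list(a, b, add_one)     # an object that holds other objects

# Unlike one-dataset-at-a-time software, several datasets can coexist
cars_data <- mtcars
iris_data <- iris

# ls() lists the names of the objects currently in the workspace
ls()
```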
Vectors
The fundamental data structure in R is the “vector”
A vector is a collection of values
The individual values within a vector are called “elements”
Values in a vector can be numeric, character (e.g., “Apple”), or some other type
Below we use the combine function c() to create a numeric vector that contains three elements
Help file says that c() “combines values into a vector or list”
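The original code for this step did not survive in this copy of the workbook. A likely version, assuming the vector is named x and holds the values 4, 7, and 9, is:

```r
# Combine three numbers into one numeric vector
x <- c(4, 7, 9)

# Print the vector
x

# Number of elements in the vector
length(x)
```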
[1] 4 7 9
Vector where the elements are characters
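A minimal sketch of a character vector. The name fruits and its values are assumptions chosen to match the “Apple” example earlier.

```r
# Each element is text, so each is wrapped in quotes
fruits <- c("Apple", "Banana", "Cherry")
fruits

# class() reports what kind of values the vector holds
class(fruits)
```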
EXERCISE
Either in the R console or within the R markdown file, do the following:
- Create a vector called v1 with three elements, where all the elements are numbers. Then print the values.
- Create a vector called v2 with four elements, where all the elements are characters (i.e., enclosed in single '' or double "" quotes). Then print the values.
- Create a vector called v3 with five elements, where some elements are numeric and some elements are characters. Then print the values.
Solution to Exercise
[1] 1 2 3
[1] "a" "b" "c" "d"
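One possible solution, consistent with the output shown above for v1 and v2. The values chosen for v3 are an assumption; any mix of numbers and text would do.

```r
# v1: three numeric elements
v1 <- c(1, 2, 3)
v1

# v2: four character elements
v2 <- c("a", "b", "c", "d")
v2

# v3: mixing numbers and text; R converts the numbers to text
v3 <- c(1, 2, "c", "d", "e")
v3
class(v3)
```

Note that v3 ends up as a character vector: because all elements of a vector must share one type, the numbers are stored as the text "1" and "2".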
Formal classification of vectors in R
Here, I introduce the classification of vectors by Grolemund and Wickham
There are two broad types of vectors
- Atomic vectors. An object that contains elements. Six “types” of atomic vectors:
- logical, integer, double, character, complex, and raw.
- Integer and double vectors are collectively known as numeric vectors.
- Lists. Like atomic vectors, lists are objects that contain elements
- elements within a list may be atomic vectors
- elements within a list may also be other lists; that is lists can contain other lists
One difference between atomic vectors and lists: homogeneous vs. heterogeneous elements
- atomic vectors are homogeneous: all elements within atomic vector must be of the same type
- lists can be heterogeneous: e.g., one element can be an integer and another element can be character
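The homogeneous vs. heterogeneous distinction can be checked with typeof(). This is a small sketch; the object names are made up for the example.

```r
# Four of the six atomic types
typeof(TRUE)      # logical
typeof(2L)        # integer
typeof(2.5)       # double
typeof("Apple")   # character

# Atomic vectors are homogeneous: the integer is converted to character
mixed_vector <- c(1L, "Apple")
typeof(mixed_vector)

# Lists are heterogeneous and can contain other lists
mixed_list <- list(1L, "Apple", list(TRUE, 2.5))
element_types <- sapply(mixed_list, typeof)
element_types
```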
Problem with this classification:
- Not conceptually intuitive
- Technically, lists are a type of vector, but people often think of atomic vectors and lists as fundamentally different things
Classification used by Ben Skinner:
- data type: logical, numeric (integer and double), character, etc.
- data structure: vector, list, matrix, etc.
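A short sketch of the type vs. structure distinction: the same integers arranged as a vector and as a matrix. The names values and grid are assumptions for illustration.

```r
# Same data type (integer), two different data structures
values <- 1:6
grid <- matrix(values, nrow = 2)

typeof(values)    # data type of the vector
typeof(grid)      # data type of the matrix

is.vector(values) # structure check
is.matrix(grid)   # structure check
dim(grid)         # 2 rows, 3 columns
```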
Using R functions
What are functions
Functions are pre-written bits of code that accomplish some task.
Functions generally follow three sequential steps:
- take in an input object(s)
- process the input.
- return (A) a new object or (B) a visualization (e.g., plot)
- For example, the sum() function calculates the sum of elements in a vector
- input. takes in a vector of elements (numeric or logical)
- processing. Calculates the sum of elements
- return. Returns numeric vector of length=1; value is sum of input vector
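The three steps for sum() can be seen directly. The input values here are arbitrary examples.

```r
# Input: a numeric vector; return: a vector of length 1 holding the sum
total <- sum(c(1, 2, 3))
total
length(total)

# Logical input also works: TRUE counts as 1 and FALSE as 0
n_true <- sum(c(TRUE, FALSE, TRUE))
n_true
```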
Function syntax
Components of a function
- function name (e.g., sum(), length(), seq())
- function arguments
- Inputs that the function takes, which determine what function does
- can be vectors, data frames, logical statements, etc.
- In “function call” you specify values to assign to these function arguments
- e.g., sum(c(1,2,3))
- Separate arguments with a comma ,
- e.g., seq(10,15)
- Example: the sequence function, seq()
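A few function calls that show the pieces named above: a function name, one or more arguments, and commas between arguments. The use of round() is an added illustration, not part of the original examples.

```r
# Two arguments separated by a comma: the integers from 10 to 15
ten_to_fifteen <- seq(10, 15)
ten_to_fifteen

# One argument that is itself the result of another function call
n_elements <- length(c(1, 2, 3))
n_elements

# Two arguments: the number to round and the number of decimal places
rounded <- round(3.14159, 2)
rounded
```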
Function syntax: More on function arguments
Usually, function arguments have names
- e.g., the seq() function includes the arguments from, to, by
- when you call the function, you need to assign values to these arguments; but you usually don’t have to specify the name of the argument
Many function arguments have “default values”, set by whoever wrote the function
- if you don’t specify a value for that argument, the default value is inserted
- e.g., partial list of default values for seq(): seq(from = 1, to = 1); when by is not supplied, the step size is worked out from the other arguments (a step of 1 in seq(10, 15))
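A sketch of named vs. unnamed arguments with seq(). The object names are made up for the example.

```r
# Arguments matched by position: first is from, second is to
by_position <- seq(10, 15)

# The same call with every argument named
by_name <- seq(from = 10, to = 15, by = 1)

# Named arguments can be given in any order
reordered <- seq(to = 15, from = 10)

# Overriding the step size with by
by_fives <- seq(from = 10, to = 20, by = 5)
by_fives
```

Naming arguments is optional but makes calls easier to read, especially for functions with many arguments.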
Help files for functions
Contents of help files
- Description. What the function does
- Usage. Syntax, including default values for arguments
- Arguments. Description of function arguments
- Details. Details and idiosyncrasies of how the function works.
- Value. What (object) the function “returns”
- e.g., sum() returns vector of length 1 whose value is sum of input vector
- References. Additional reading
- See Also. Related functions
- Examples. Examples of function in action
- Bottom of help file identifies the package the function comes from
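A brief sketch of how to reach a help file from code. In the console, typing ?sum does the same thing as the help() call below. Inspecting seq.default (the default method behind seq()) is an added illustration of how to list argument names without opening the help file.

```r
# Look up the help file for sum(); printing h (or typing ?sum) displays it
h <- utils::help("sum")
class(h)

# List the argument names of the default seq() method
arg_names <- names(formals(seq.default))
arg_names
```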
Brief Quarto Overview
What is Quarto
- Quarto documents embed R code, output associated with R code, and text into one document
- A Quarto document is a “‘living’ document that updates every time you compile [‘Render’] it”
- Quarto documents have the extension .qmd
- Can think of them as text files with the extension .qmd rather than .txt
- At top of .qmd file you specify the output “format”, which dictates what kind of formatted document will be created
- e.g., html or pdf (this document was created with revealjs)
- When you compile [“Render”] a .qmd file, the resulting formatted document can be an HTML document, a PDF document, an MS Word document, or many other types
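To make the pieces concrete, the sketch below uses R to write a minimal .qmd file: a header at the top naming the format, a line of text, and one R code chunk. The file name, title, and contents are assumptions for illustration. The render line is commented out because it needs the Quarto command-line tool and the quarto R package installed.

```r
# Three backticks open and close a code chunk inside a .qmd file
fence <- strrep("`", 3)

qmd_lines <- c(
  "---",                                   # start of the header
  "title: \"My first Quarto document\"",
  "format: html",                          # the output format
  "---",                                   # end of the header
  "",
  "Some text that appears in the rendered document.",
  "",
  paste0(fence, "{r}"),                    # start of an R code chunk
  "x <- c(4, 7, 9)",
  "mean(x)",
  fence                                    # end of the code chunk
)

# Save as a plain text file with the .qmd extension
qmd_path <- file.path(tempdir(), "example.qmd")
writeLines(qmd_lines, qmd_path)

# Rendering (requires Quarto and the quarto R package):
# quarto::quarto_render(qmd_path)
```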
Creating Quarto documents
Do this with a partner
Approach for creating a Quarto document.
- Point-and-click from within RStudio
- Click on File >> New File >> Quarto Document… >> choose HTML >> click OK
- Optional: add title (this is not the file name, just what appears at the top of document)
- Optional: add author name
- Save the .qmd file; File >> Save As
- Any file name
- Recommend you save it in same folder you saved this lecture
- “Render” the entire .qmd file
- Point-and-click OR shortcut: Cmd/Ctrl + Shift + k
Creating and Formatting Quarto Documents
Take a few minutes to peruse the Quarto site to build familiarity (I still access it all the time when I forget how to do specific things)
I especially want you to take some time to peruse documents on YAML headers:
Course Reminders
- Problem set 1 due next Monday at 12:01 AM (grace period until 9 AM)
- Make sure to check out the readings
- There are exercises at the end that can be helpful to do. You can even download the directory of the bookdown/quarto book from GitHub (link in book)
- Next time:
- Bring your data, ideally loaded into R (or at least a piece of it)
- Part 1: Reproducibility and Using Workflows to Reflect Your Values
- Part 2: Data Manipulation: dplyr