Problem Set #2
Part 1: Data Manipulation
In the last problem set, you used tidyverse functions such as filter(), arrange(), and select() to perform data manipulations.
In this problem set, you’ll also practice variable creation, using mutate() in combination with if_else() and case_when(), as well as the base R approach.
Question 1: Data manipulation using pipes
In the code chunk below, complete the following:
- Load the tidyverse library
- Use the load() and url() functions to download the df_school_all dataframe from the url: https://github.com/emoriebeck/psc203a-data-WQ26/raw/main/05-assignments/03-ps3/recruit_school_allvars.RData
  - Each row in df_school_all represents a high school (includes both public and private)
  - There are columns (e.g., visit_by_100751) indicating the number of times a university visited that high school
  - The variable total_visits identifies the number of visits the high school received from all (16) public research universities in this data collection sample
Use the functions arrange(), select(), and head() to do the following:
- Sort df_school_all descending by total_visits
- Select the following variables: name, state_code, city, school_type, total_visits, med_inc, pct_white, pct_black, pct_hispanic, pct_asian, pct_amerindian
- Show the first 10 rows of the dataframe, which represents the top 10 most visited schools by the 16 universities
Complete this using pipes (%>%) using 1 line of code
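As an illustration of how these verbs chain together, here is a minimal sketch on made-up data. The tibble toy_books and its columns are hypothetical and are not part of the problem set data; the sketch only assumes dplyr (loaded with the tidyverse) is installed.

```r
library(dplyr)  # part of the tidyverse; provides %>%, arrange(), select()

# Hypothetical toy data, NOT the problem set data
toy_books <- tibble(
  title  = c("A", "B", "C", "D"),
  genre  = c("fiction", "poetry", "fiction", "essay"),
  copies = c(12, 40, 7, 25)
)

# One pipeline: sort descending, keep two columns, show the first 2 rows
top_books <- toy_books %>% arrange(desc(copies)) %>% select(title, copies) %>% head(2)
top_books
```

Each verb receives the output of the previous one, so the steps read left to right in the order they are applied.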
Using pipes (‘%>%’):
Building upon the previous question, use the functions arrange(), select(), filter(), and head() to do the following (select same variables as above):
- Top 10 most visited public high schools in California
- Top 10 most visited private high schools in California
Complete this using pipes (%>%) using 1 line of code each
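A filter() step can sit at the front of the same kind of pipeline. The sketch below uses a hypothetical tibble (toy_trails) with made-up columns, and shows one way to combine two conditions before sorting; it is an illustration, not the expected answer.

```r
library(dplyr)

# Hypothetical toy data, NOT the problem set data
toy_trails <- tibble(
  trail   = c("Ridge", "Creek", "Summit", "Meadow", "Canyon"),
  park    = c("North", "North", "South", "North", "South"),
  surface = c("paved", "dirt", "dirt", "dirt", "paved"),
  hikers  = c(300, 120, 450, 200, 90)
)

# Conditions separated by commas in filter() must all be TRUE for a row to be kept
top_north_dirt <- toy_trails %>% filter(park == "North", surface == "dirt") %>% arrange(desc(hikers)) %>% select(trail, hikers) %>% head(1)
top_north_dirt
```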
Using pipes (‘%>%’):
Question 2: Variable creation using tidyverse’s mutate()
Above, you used a data set provided for the homework set. For the rest of the questions, I invite you to load your own data and use it when appropriate and feasible. In some cases, you may have a variable (e.g., race, which is used below) that you can directly sub in from your own data. In other cases, you will not, so I invite you to use a different variable on the same scale (e.g., numeric, count, categorical).
If you find this isn’t possible, don’t worry. Next week, your homework will be different: it will ask you to take everything you’ve learned so far and clean your data, turning in a script that does everything from reading in your data to descriptives. For that homework, which variables are of interest is up to you, and you won’t have the same level of detail in instructions.
Often before creating new “analysis” variables, you may want to investigate the values of “input” variables. Here are some examples of checking variable values using count():
# Counts the total number of observations (i.e., rows) in `df_school_all`
df_school_all %>% count()
# Counts the number of observations that have missing values for the variable `med_inc`
df_school_all %>% filter(is.na(med_inc)) %>% count()
# Frequency count of the variable `school_type`
df_school_all %>% count(school_type)
Use mutate() with if_else() to create a 0/1 indicator and then use count() to generate the following frequency tables:
- Create a 0/1 indicator called ca_school for whether the high school is in California and generate the frequency table for the values of ca_school
- Create a 0/1 indicator called ca_pub_school for whether the high school is a public school in California and generate the frequency table for the values of ca_pub_school
Note: You do not need to assign/retain the indicator variables in the df_school_all dataframe.
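The general pattern of building an indicator and tabulating it without saving it can be sketched on made-up data. The tibble toy_pets and its columns are hypothetical; the results are assigned to names here only so they can be inspected.

```r
library(dplyr)

# Hypothetical toy data, NOT the problem set data
toy_pets <- tibble(
  species = c("dog", "cat", "dog", "bird", "cat"),
  indoor  = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# 0/1 indicator from a single condition, followed by its frequency table
dog_counts <- toy_pets %>%
  mutate(is_dog = if_else(species == "dog", 1, 0)) %>%
  count(is_dog)

# 0/1 indicator from two conditions joined with & (both must hold)
indoor_cat_counts <- toy_pets %>%
  mutate(indoor_cat = if_else(species == "cat" & indoor, 1, 0)) %>%
  count(indoor_cat)

dog_counts
indoor_cat_counts
```

Because the mutate() result is piped straight into count(), toy_pets itself is left unchanged.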
Complete the following steps to create an analysis variable using mutate() and if_else():
- First, use select() to select the variables name, pct_black, pct_hispanic, pct_amerindian from df_school_all, and assign the resulting dataframe to df_race. You’ll be using df_race for the remaining bullet points below.
- Use filter(), is.na(), and count() to investigate whether or not the following variables have missing values: pct_black, pct_hispanic, pct_amerindian
- Use mutate() to create a new variable pct_bl_hisp_nat in df_race that is the sum of pct_black, pct_hispanic, and pct_amerindian. Remember to assign to df_race.
- Create a 0/1 indicator called gt50pct_bl_hisp_nat for whether more than 50% of students identify as Black, Latinx, or Native American and generate a frequency table for the values of gt50pct_bl_hisp_nat
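Checking inputs for missing values matters because NA propagates through sums and comparisons. The sketch below shows this on a hypothetical tibble (toy_scores, with made-up columns quiz and exam); it is an illustration under those assumptions rather than the solution.

```r
library(dplyr)

# Hypothetical toy data, NOT the problem set data
toy_scores <- tibble(
  id   = 1:4,
  quiz = c(20, NA, 35, 10),
  exam = c(40, 30, 30, 15)
)

# How many rows are missing `quiz`?
n_missing_quiz <- toy_scores %>% filter(is.na(quiz)) %>% count()

# Sum of two inputs; a row with a missing input gets NA for the sum
toy_scores <- toy_scores %>% mutate(total = quiz + exam)

# The indicator is also NA wherever `total` is NA, and count() shows NA as its own row
total_counts <- toy_scores %>%
  mutate(gt50 = if_else(total > 50, 1, 0)) %>%
  count(gt50)

n_missing_quiz
total_counts
```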
Complete the following steps to create an analysis variable using mutate() and case_when():
- First, use select() to select the variables name and state_code from df_school_all, and assign the resulting dataframe to df_schools
- Use case_when() to create a new variable in df_schools called region whose values are:
  - Northeast, if state_code is in: 'CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA'
  - Midwest, if state_code is in: 'IN', 'IL', 'MI', 'OH', 'WI', 'IA', 'KS', 'MN', 'MO', 'NE', 'ND', 'SD'
  - West, if state_code is in: 'AZ', 'CO', 'ID', 'NM', 'MT', 'UT', 'NV', 'WY', 'AK', 'CA', 'HI', 'OR', 'WA'
  - South, if state_code is not any of the above states (Hint: Use TRUE as the condition to specify default value. You can see an example here.)
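For reference, here is a minimal sketch of case_when() with %in% conditions and a TRUE fallback, using a hypothetical tibble (toy_months) rather than the problem set data.

```r
library(dplyr)

# Hypothetical toy data, NOT the problem set data
toy_months <- tibble(month = c("Jan", "Apr", "Jul", "Oct", "Dec"))

toy_months <- toy_months %>%
  mutate(season = case_when(
    month %in% c("Dec", "Jan", "Feb") ~ "Winter",
    month %in% c("Mar", "Apr", "May") ~ "Spring",
    month %in% c("Jun", "Jul", "Aug") ~ "Summer",
    TRUE ~ "Fall"  # default: anything not matched above
  ))

toy_months
```

Conditions are checked in order and the first match wins, which is why the TRUE line goes last.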
Complete the following steps to recode variables using mutate() and recode():
- In the df_schools dataframe, replace the values of the region variable as follows:
  - Change Northeast to NE
  - Change Midwest to MW
  - Change West to W
  - Change South to S
- In the df_schools dataframe, create a new variable state_name whose value is:
  - California, if state_code is CA
  - New York, if state_code is NY
  - Choose another state of your choice to recode
  - Other, if state_code is any other state (Hint: Use .default to specify the default value)
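The two recode() patterns above (replacing values in place, and creating a new variable with a default) can be sketched as follows. The tibble toy_shirts and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, NOT the problem set data
toy_shirts <- tibble(
  size = c("small", "medium", "large", "medium"),
  code = c("S", "M", "L", "XL")
)

toy_shirts <- toy_shirts %>%
  mutate(
    # Replace values in place: old value on the left, new value on the right
    size = recode(size, "small" = "S", "medium" = "M", "large" = "L"),
    # New variable: values not listed fall back to .default
    label = recode(code, "S" = "Small", "M" = "Medium", .default = "Other")
  )

toy_shirts
```

Without .default, values that are not listed are left unchanged.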
Question 3: Grouping and summarizing
Now, we will use group_by() in conjunction with summarise() to calculate summary results for public and private schools in each state. First, group by state (state_code) and type (school_type) and calculate the following statistics for each combination:
- The total number of students (total_students)
- The percentage of students who identify as each of the following race/ethnicity categories:
  - pct_white
  - pct_black
  - pct_hispanic
  - pct_asian
  - pct_amerindian
  - pct_other
Lastly, sort by the number of students per state-type combination in descending order, and answer the following question.
In one or two sentences, what is something you find interesting about these results?
- ANSWER:
Next, we will look at the students’ median household income (med_inc) by state and type. Group by state (state_code) and type (school_type) and calculate the following statistics for each state-type combination:
- The total number of students
- The total number of visits where med_inc is missing
- The average median household income of students
- The maximum median household income of students
- The minimum median household income of students
Lastly, sort by the number of students per state in descending order.
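A minimal sketch of grouped summaries on made-up data, assuming a hypothetical tibble (toy_shops) with two grouping columns and a numeric column that contains a missing value:

```r
library(dplyr)

# Hypothetical toy data, NOT the problem set data
toy_shops <- tibble(
  city      = c("X", "X", "Y", "Y", "Y"),
  type      = c("cafe", "cafe", "cafe", "bar", "bar"),
  customers = c(100, 50, 80, 30, 20),
  rent      = c(10, NA, 8, 5, 7)
)

shop_summary <- toy_shops %>%
  group_by(city, type) %>%
  summarise(
    total_customers = sum(customers),
    n_missing_rent  = sum(is.na(rent)),        # TRUE counts as 1
    mean_rent       = mean(rent, na.rm = TRUE), # ignore NA when averaging
    max_rent        = max(rent, na.rm = TRUE),
    min_rent        = min(rent, na.rm = TRUE),
    .groups = "drop"                            # return an ungrouped tibble
  ) %>%
  arrange(desc(total_customers))

shop_summary
```

Without na.rm = TRUE, any group containing a missing value would return NA for mean(), max(), and min().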
Question 4: Fun with YAML Headers
Explore the documentation on YAML headers for html on Quarto’s website. Also look at the defaults I’ve set.
Test out at least three options (but as many as you want) not currently in the YAML header from the Quarto website and note them here (If you test more than three, just list the three you thought were the coolest or where you learned the most):
Next, change the options for at least three things already in the YAML header and note what happens. How does the appearance change? How does the directory structure change (if at all)?
In the first parts of the problem set, you used data I provided to complete your assignment. Now, I’d like you to use your own data. This problem set is more open ended. For time, I’m going to have you focus on creating a codebook. You’ll use this for your next problem set, which will be a broad review of course content so far.
Part 2: Codebooks
Data Overview
Provide an overview of your data set. What is it? How was it collected?
Codebook
Using the codebook provided in class as a reference, document the variables (or a subset of the variables) for an ongoing (or completed) research project you are involved in. Please also make the “Key” and “Overview” sheets providing overviews of your study.
As a reminder, here’s a description of the column names I traditionally use. Feel free to omit or add columns as you see fit:
- dataset: this column indexes the name of the dataset that you will be pulling the data from. This is important because we will use this info later on (see purrr tutorial) to load and clean specific data files. Even if you don’t have multiple data sets, I believe consistency is more important and suggest using this.
- category: broad categories that different variables can be put into. I’m a fan of naming them things like “outcome”, “predictor”, “moderator”, “demographic”, “procedural”, etc. but sometimes use more descriptive labels like “Big 5” to indicate the model from which the measures are derived.
- name: label is basically one level lower than category. So if the category is Big 5, the label would be, for example, “A” for Agreeableness, “SWB” for subjective well-being, etc. This column is most important and useful when you have multiple items in a scale, so I’ll typically leave this blank when something is a standalone variable (e.g. sex, single-item scales, etc.).
- item_name: This is the lowest level and most descriptive variable. It indicates which item in a scale something is. So it may be “kind” for Agreeableness or “sex” for the demographic biological sex variable.
- old_name: this column is the name of the variable in the data you are pulling it from. This should be exact. The goal of this column is that it will allow us to select() variables from the original data file and rename them something that is more useful to us.
- item_text: this column is the original text that participants saw or a description of the item.
- scale: this column tells you what the scale of the variable is. Is it a numeric variable, a text variable, etc. This is helpful for knowing the plausible range.
- recode_text: sometimes, we want to recode variables for analyses (e.g. for categorical variables with many levels where sample sizes for some levels are too small to actually do anything with them). I use this column to note the kind of recoding I’ll do to a variable for transparency.
- recode: In this column, I write the R code that I’ll parse after reading my codebook into R.
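To make the role of old_name and item_name concrete, here is one possible sketch of using a codebook to select and rename variables. The objects toy_codebook and toy_raw and their contents are hypothetical, and this is only one way to do it, not necessarily the approach used in class.

```r
library(dplyr)

# Hypothetical two-column slice of a codebook
toy_codebook <- tibble(
  old_name  = c("Q1_a", "Q1_b", "dem_01"),
  item_name = c("kind", "warm", "sex")
)

# Hypothetical raw data with one column that is not in the codebook
toy_raw <- tibble(
  Q1_a   = c(4, 2),
  Q1_b   = c(5, 3),
  dem_01 = c("F", "M"),
  junk   = c(0, 0)
)

# Named vector: names are the new names, values are the old names
lookup <- setNames(toy_codebook$old_name, toy_codebook$item_name)

# all_of() with a named vector selects the old columns and renames them
toy_clean <- toy_raw %>% select(all_of(lookup))
toy_clean
```

This only works if old_name matches the raw column names exactly, which is why that column needs to be exact.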
Here are additional columns that will make our lives easier or are applicable to some but not all data sets:
- reverse: this column tells you whether items in a scale need to be reverse coded. I recommend coding this as 1 (leave alone) and -1 (reverse) for reasons that will become clear later.
- mini: this column represents the minimum value of scales that are numeric. Leave blank otherwise.
- maxi: this column represents the maximum value of scales that are numeric. Leave blank otherwise.
- year: for longitudinal data, we have several waves of data and the name of the same item across waves is often different, so it’s important to note to which wave an item belongs. You can do this by noting the wave (e.g. 1, 2, 3), but I prefer the actual year the data were collected (e.g. 2005, 2009, etc.)
- meta: Some datasets have a meta name, which essentially means a name that variable has across all waves to make it clear which variables are the same. They are not always useful as some data sets have meta names but no great way of extracting variables using them. But they’re still typically useful to include in your codebook regardless.
- notes: Add any notes you may have about the variables that you think will be helpful for future you or others.
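As one possible illustration of how reverse, mini, and maxi could work together, the sketch below reverse-codes items flagged with -1 by subtracting each value from mini + maxi. The tibble toy_items is hypothetical, and this formula is an assumption about a common way to reverse-code, not necessarily the approach covered later in the course.

```r
library(dplyr)

# Hypothetical long-format responses already joined to codebook columns
toy_items <- tibble(
  item_name = c("kind", "rude", "kind", "rude"),
  value     = c(5, 1, 3, 4),
  reverse   = c(1, -1, 1, -1),  # 1 = leave alone, -1 = reverse
  mini      = 1,                # scale minimum (recycled to all rows)
  maxi      = 5                 # scale maximum (recycled to all rows)
)

# Assumed reverse-coding rule: new value = mini + maxi - old value
toy_items <- toy_items %>%
  mutate(value_keyed = if_else(reverse == -1, mini + maxi - value, value))

toy_items
```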
Render to html and submit problem set
- Render to html by clicking the “Render” button near the top of your RStudio window (icon with blue arrow)
- Go to Canvas –> Assignments –> Problem Set 3
- Submit both .qmd and .html files
- Use this naming convention “lastname_firstname_ps#” for your .qmd and html files (e.g. beck_emorie_ps3.qmd & beck_emorie_ps3.html)