Problem Set #4
Part 1: Using Functions
Functions are really useful when you have to do something many times.
Practicing Function Writing
Write a function to z-standardize data ((observed - mean)/standard deviation). Apply it to at least 2 variables in your data. (Note: if your data are in long form, you will need either to make your data wide by some key variable (stimuli, question, etc.) or to use your function on a grouped data frame.)
Now look at your descriptives. Do they suggest your function worked correctly? (Hint: if your data are in long form, make sure you’re looking at the grouped descriptives.)
Part 2: Iteration
In class, we talked about iteration as for loops, lapply(), and purrr::map(). But we’ve actually been doing iteration for weeks using functions like mutate_at() and mutate_all(). Another alternative is mutate(across()), which works similarly to mutate_at() but is more general. For example, the code below reverse scores the BFI items that are negatively keyed:
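A minimal sketch of this kind of reverse scoring with mutate(across()). The toy data frame bfi_toy, the item names, the list of negatively keyed items, and the 1 to 6 response scale are all assumptions made for illustration, and this is not necessarily the code shown in class:

```r
library(dplyr)

# Hypothetical toy data: three items on an assumed 1-6 response scale
bfi_toy <- tibble(
  A1 = c(1, 6, 3),
  C4 = c(2, 5, 4),
  E1 = c(6, 1, 2)
)

# Hypothetical set of negatively keyed items
reverse_items <- c("A1", "C4")

# On a 1-6 scale, reverse scoring maps x to 7 - x;
# across() applies that to every selected column
bfi_rev <- bfi_toy %>%
  mutate(across(all_of(reverse_items), ~ 7 - .x))

bfi_rev
```

Columns not named in reverse_items (here E1) are left unchanged.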
- Using at least two methods of iteration, apply a function to multiple columns or subsets (e.g., participants, stimuli, waves, etc.). For example, as above, use two methods to mutate multiple columns (hint: see ?apply or ?map_dbl) or to calculate descriptives, correlations, etc. (hint: see ?lapply and ?map). Do you get the same results? If not, why?
- Write a function that estimates multiple descriptives (mean, median, sd, min, max, n, n missing). Using any form of iteration, apply that function to all continuous variables in your data frame. (Hints: you could (1) pivot your data to long and group_by() item, either nesting and applying your function or writing a data frame function, or (2) use a function like apply() or across() to estimate them. Note the format challenges you experience [e.g., errors, your data are super wide].) Ultimately, you want to end up with a data frame with items / indicators, etc. as rows (indexed by a column) and columns for each of the descriptives.
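For the first task above, one way to compare two iteration methods is to run the same summary through a base R function and a purrr function and check that the results agree. This is a rough sketch on a hypothetical two-column data frame called toy, not a template for your own data:

```r
library(purrr)

# Hypothetical toy data with one missing value
toy <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))

# Method 1: base R, one mean per column
means_base <- sapply(toy, mean, na.rm = TRUE)

# Method 2: purrr, always returns a double vector
means_purrr <- map_dbl(toy, ~ mean(.x, na.rm = TRUE))

# Compare the two named vectors
all.equal(means_base, means_purrr)
```

If two methods disagree on your own data, the usual suspects are different handling of missing values or different return types.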
Part 3: Strings
Working with strings
- Load the following packages in the code chunk below: tidyverse and lubridate.
- Using str_c() and the following objects as input, create the string: "Roses are red, Violets are blue"
  - We encourage you to first sketch out what you want to do on some scratch paper.
  - Recall from the lecture example on “Using str_c() on vectors of different lengths”, when multiple vectors of different length are provided in the str_c() function, the elements of shorter vectors are recycled. See below.
  - Now try it yourself.
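The recycling behaviour recalled above can be illustrated with a small sketch. The vectors prefix and ids are hypothetical and unrelated to the objects used in the question:

```r
library(stringr)

# Hypothetical inputs: one length-1 vector and one length-3 vector
prefix <- "item"
ids <- c("a", "b", "c")

# The length-1 pieces ("item" and "_") are recycled to match ids
labels <- str_c(prefix, "_", ids)

labels
```

The result has one element per element of the longest input.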
- Pig Latin is a language game in which the first consonant of each word is moved to the end of the word, then "ay" is appended to create a suffix. For example, the word "Wikipedia" would become "Ikipediaway".
  - Using str_c() and str_sub(), turn the given pig_latin vector into the string: "igpay atinlay"
  - We encourage you to first sketch out what you want to do on some scratch paper.
  - First, think about what the final outcome will look like.
  - Then, think about how you can get there. Play around with the str_sub() function. What happens when you include different values in the str_sub() function?
  - This is low-key the trickiest question in the problem set. If you get stuck, ask your group or post a question on GitHub, move on, and come back to it later.
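As a starting point for experimenting with str_sub(), here is a small sketch of how its start and end arguments behave, including negative positions that count from the end of the string. The word used is an arbitrary example, not the pig_latin vector from the question:

```r
library(stringr)

word <- "banana"  # arbitrary example word

first_letter <- str_sub(word, 1, 1)    # first character only
rest         <- str_sub(word, 2, -1)   # second character through the last
last_three   <- str_sub(word, -3, -1)  # negative positions count from the end

# Pieces can be glued back together with str_c()
rebuilt <- str_c(first_letter, rest)

c(first_letter, rest, last_three, rebuilt)
```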
- Using str_c() and str_sub(), decode the given secret_message. Your output should be a string.
  - Follow the same logic from above.
  - Sketch out what you want to do on some scratch paper. Break it down step by step. Play around with different values for the str_sub() function.
Working with Twitter data
- You will be using Twitter data we fetched from the following Twitter handles: UniNoticias, FoxNews, and CNN.
  - This data has been saved as an RData file.
  - Use the load() and url() functions to download the news_df dataframe from the url: https://github.com/emoriebeck/psc203a-data-FQ26/raw/main/05-assignments/04-ps4/twitter_news.RData
  - Report the dimensions of the news_df data frame (rows and columns). Use the dim() function.
- Subset your dataframe news_df and create a new dataframe called news_df2 keeping only the following variables: user_id, status_id, created_at, screen_name, text, followers_count, profile_expanded_url.
  - Note that the following questions ask you to create new columns, which means you have to assign (<-) the changes you make back to the existing dataframe news_df2. Ex. news_df2 <- news_df2 %>% mutate(newvar = mean(oldvar))
- Create a new column in news_df2 called text_len that contains the length of the character variable text.
  - What is the class and type of this new column? Make sure to include your code in the code chunk below.
    - ANSWER:
- Create an additional column in news_df2 called handle_followers that stores the twitter handle and the number of followers associated with that twitter handle in a string. For example, the entries in the handle_followers column should look like this: @[twitter_handle] has [number] followers.
  - What is the class and type of this new column? Make sure to include your code in the code chunk below.
    - ANSWER:
- Lastly, create a column in news_df2 called short_web that contains a short version of the profile_expanded_url without the http://www. part of the url. For example, the entries in that column should look something like this: nytimes.com.
Part 4: Dates
Working with dates/times
- Using the column created_at, create a new column in news_df2 called dt_chr that is a character version of created_at.
  - What is the class of the created_at and dt_chr columns? Make sure to include your code in the code chunk below.
    - ANSWER:
  - Create another column in news_df2 called dt_len that stores the length of dt_chr.
- Next, create additional columns in news_df2 for each of the following date/time components:
  - Create a new column date_chr for date (e.g. 2020-03-26) using the column dt_chr and the str_sub() function.
  - Do the same for year yr_chr (e.g. 2020).
  - Do the same for month mth_chr (e.g. 03).
  - Do the same for day day_chr (e.g. 26).
  - Do the same for time time_chr (e.g. 22:41:09).
- Using the column we created in the previous question, time_chr, create additional columns in news_df2 for the following time components:
  - Create a new column hr_chr for hour (e.g. 22) using the column time_chr and the str_sub() function.
  - Do the same for minutes min_chr (e.g. 41).
  - Do the same for seconds sec_chr (e.g. 09).
- Now let’s get some practice with the lubridate package.
  - Using the year() function from the lubridate package, create a new column in news_df2 called yr_num that contains the year (e.g. 2020) extracted from date_chr.
  - Do the same for month mth_num.
  - Do the same for day day_num.
  - Do the same for hour hr_num, but extract from the created_at column instead of date_chr.
  - Do the same for minutes min_num.
  - Do the same for seconds sec_num.
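The lubridate accessor functions named above can be tried on a single value before applying them to a whole column. A small sketch, assuming a hypothetical timestamp stamp built from the example values used in this section and an assumed UTC time zone:

```r
library(lubridate)

# Hypothetical single timestamp; the time zone is an assumption
stamp <- ymd_hms("2020-03-26 22:41:09", tz = "UTC")

# Each accessor returns one numeric component
parts <- c(
  year   = year(stamp),
  month  = month(stamp),
  day    = day(stamp),
  hour   = hour(stamp),
  minute = minute(stamp),
  second = second(stamp)
)

parts
```

Inside mutate(), the same functions take a column name in place of stamp.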
- Using the new numeric columns (e.g. day_num, mth_num) you’ve created in the previous step, reconstruct the date and datetime columns. Namely, add the following columns to news_df2:
  - Use make_date() to create a new column called my_date that contains the date (year, month, day).
  - Use make_datetime() to create a new column called my_datetime that contains the datetime (year, month, day, hour, minutes, seconds).
  - What is the class of your my_date and my_datetime columns? Make sure to include your code in the code chunk below.
    - ANSWER:
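To see how make_date() and make_datetime() assemble a value from numeric pieces, here is a brief sketch using literal numbers taken from the example values in this section. In the problem set you would pass your own numeric columns instead:

```r
library(lubridate)

# Build a date from year, month, day
d <- make_date(year = 2020, month = 3, day = 26)

# Build a datetime from the same pieces plus hour, minute, second;
# make_datetime() uses UTC unless another tz is supplied
dt <- make_datetime(year = 2020, month = 3, day = 26,
                    hour = 22, min = 41, sec = 9)

format(d)
format(dt, "%Y-%m-%d %H:%M:%S")
```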
Part 5: Regex
The purpose of this part is for you to understand how the backslash escape character (\) works in R strings, as well as to practice writing regular expressions. You will be using the str_view_all() function to see all the matches from your regex. You’ll get practice combining character classes, quantifiers, anchors, ranges, groups, and more to build your regular expressions for each question.
Backslash (\) escape character
In this section, you will practice working with strings that include backslashes, such as for escaping characters or for writing special characters. You will be using both the print() and writeLines() functions to print out your string and compare the difference. This section does not involve regular expressions.
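A quick way to see the two functions side by side is to run them on one short string. The sketch below uses a hypothetical string, path, that is not one of the strings the questions ask for:

```r
# Hypothetical string: the source code contains one escaped backslash
path <- "C:\\temp"

print(path)       # prints the string as R source code would write it
writeLines(path)  # writes the characters actually stored in the string

# Counting characters shows how many the string really holds
n_chars <- nchar(path)
n_chars
```

Comparing the two printed lines with the character count is a useful check whenever an escape sequence behaves unexpectedly.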
- Create a short string (could be a phrase or sentence) that contains both the single quote (') and double quote (") inside your string, and save it as an object called string_with_quotes. Use both print() and writeLines() to print out your string.
  Hint: You will need to use a backslash to escape either the single quote (') or the double quote (") depending on if you used single or double quotes to enclose your string.
- Create a short string (could be a phrase or sentence) that contains both the tab and newline special characters, and save it to string_with_spchars. Use both print() and writeLines() to print out your string.
- Create a string that contains your first name where each letter is separated by a backslash (e.g., y\o\u\r\n\a\m\e), and save it to string_with_backslashes. Use both print() and writeLines() to print out your string.
  Hint: Your writeLines() output should show single backslashes between each letter of your name.
- With respect to the previous questions, explain in general why the output created by the print() function differs from the output created by the writeLines() function.
Matching characters
In this section and the next, you will practice writing regular expressions to match specific text. Use str_view_all() for all the following questions to show the matches.
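Before writing patterns for the questions, it can help to test one on a tiny vector. The sketch below matches a literal period, a character that is special in regex and therefore has to be escaped. The vector x is hypothetical, and str_extract_all() and str_count() are used here only to make the matches easy to inspect; the same pattern string can be passed to str_view_all():

```r
library(stringr)

# Hypothetical test strings
x <- c("v1.2", "no dots")

# "\\." in R source is the regex \. which matches a literal period
dots <- str_extract_all(x, "\\.")
n_dots <- str_count(x, "\\.")

dots
n_dots
```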
- Show all matches to single quotes (') in the string_with_quotes that you created in the previous section.
- Show all matches to double quotes (") in string_with_quotes.
- Show all matches to tab characters in string_with_spchars.
- Show all matches to newline characters in string_with_spchars.
- Show all matches to backslashes (\) in string_with_backslashes.
Regular expressions
- Copy the following code to create the character vector text:
- Show all matches to a capital I at the beginning of the string.
- Show all matches to a period at the end of the string.
- Show all matches to 1 or more digits.
- Show all matches to all dollar amounts, including the dollar sign and k if there is one (i.e., $50, $100, $1k)
- Show all matches to ellipses (...)
- Show all matches to parentheses, including the contents between the parentheses if there are any.
- Show all matches to words (define words as containing only letters, upper or lowercase)
- Show all matches to either a word that’s 4 or more letters long or ellipses.
- Show all matches to any digit or vowel (upper or lowercase) that repeats 2 times in a row (i.e., the same digit or vowel repeated twice in a row)
Render to html and submit problem set
- Render to html by clicking the “Render” button near the top of your RStudio window (icon with blue arrow)
- Go to Canvas –> Assignments –> Problem Set 4
- Submit both .qmd and .html files
- Use this naming convention “lastname_firstname_ps#” for your .qmd and html files (e.g. beck_emorie_ps4.qmd & beck_emorie_ps4.html)