We will use rtweet to pull Twitter data from the PAC-12 universities. We will use the university admissions Twitter handle if there is one, or the main Twitter handle for the university if there isn’t one:
Rows: 328
Columns: 90
$ user_id <chr> "22080148", "22080148", "22080148", "22080148"…
$ status_id <chr> "1254177694599675904", "1253431405993840646", …
$ created_at <dttm> 2020-04-25 22:37:18, 2020-04-23 21:11:49, 202…
$ screen_name <chr> "WSUPullman", "WSUPullman", "WSUPullman", "WSU…
$ text <chr> "Big Dez is headed to Indy!\n\n#GoCougs | #NFL…
$ source <chr> "Twitter for iPhone", "Twitter Web App", "Twit…
$ display_text_width <dbl> 125, 58, 246, 83, 56, 64, 156, 271, 69, 140, 4…
$ reply_to_status_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, "1252615862659…
$ reply_to_user_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, "22080148", NA…
$ reply_to_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, "WSUPullman", …
$ is_quote <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ is_retweet <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ favorite_count <int> 0, 322, 30, 55, 186, 53, 22, 44, 11, 0, 69, 42…
$ retweet_count <int> 230, 32, 1, 5, 0, 3, 2, 6, 2, 6, 3, 4, 5, 5, 2…
$ quote_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ reply_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hashtags <list> <"GoCougs", "NFLDraft2020", "NFLCougs">, <"WS…
$ symbols <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ urls_url <list> NA, NA, NA, NA, NA, NA, NA, "commencement.wsu…
$ urls_t.co <list> NA, NA, NA, NA, NA, NA, NA, "https://t.co/RR4…
$ urls_expanded_url <list> NA, NA, NA, NA, NA, NA, NA, "https://commence…
$ media_url <list> "http://pbs.twimg.com/ext_tw_video_thumb/1254…
$ media_t.co <list> "https://t.co/NdGsvXnij7", "https://t.co/0OWG…
$ media_expanded_url <list> "https://twitter.com/WSUCougarFB/status/12541…
$ media_type <list> "photo", "photo", "photo", "photo", "photo", …
$ ext_media_url <list> "http://pbs.twimg.com/ext_tw_video_thumb/1254…
$ ext_media_t.co <list> "https://t.co/NdGsvXnij7", "https://t.co/0OWG…
$ ext_media_expanded_url <list> "https://twitter.com/WSUCougarFB/status/12541…
$ ext_media_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ mentions_user_id <list> <"1250265324", "1409024796", "180884045">, NA…
$ mentions_screen_name <list> <"WSUCougarFB", "dadpat7", "Colts">, NA, "WSU…
$ lang <chr> "en", "en", "en", "en", "en", "en", "en", "en"…
$ quoted_status_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "12529…
$ quoted_text <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "My WS…
$ quoted_created_at <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2020-…
$ quoted_source <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Twitt…
$ quoted_favorite_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 209, N…
$ quoted_retweet_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 6, NA,…
$ quoted_user_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "43947…
$ quoted_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "maddd…
$ quoted_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Maddy…
$ quoted_followers_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 629, N…
$ quoted_friends_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 382, N…
$ quoted_statuses_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 8881, …
$ quoted_location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Seatt…
$ quoted_description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "WSU A…
$ quoted_verified <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, FALSE,…
$ retweet_status_id <chr> "1254159118996127746", NA, NA, NA, NA, NA, NA,…
$ retweet_text <chr> "Big Dez is headed to Indy!\n\n#GoCougs | #NFL…
$ retweet_created_at <dttm> 2020-04-25 21:23:29, NA, NA, NA, NA, NA, NA, …
$ retweet_source <chr> "Twitter for iPhone", NA, NA, NA, NA, NA, NA, …
$ retweet_favorite_count <int> 1402, NA, NA, NA, NA, NA, NA, NA, NA, 26, NA, …
$ retweet_retweet_count <int> 230, NA, NA, NA, NA, NA, NA, NA, NA, 6, NA, NA…
$ retweet_user_id <chr> "1250265324", NA, NA, NA, NA, NA, NA, NA, NA, …
$ retweet_screen_name <chr> "WSUCougarFB", NA, NA, NA, NA, NA, NA, NA, NA,…
$ retweet_name <chr> "Washington State Football", NA, NA, NA, NA, N…
$ retweet_followers_count <int> 77527, NA, NA, NA, NA, NA, NA, NA, NA, 996, NA…
$ retweet_friends_count <int> 1448, NA, NA, NA, NA, NA, NA, NA, NA, 316, NA,…
$ retweet_statuses_count <int> 15363, NA, NA, NA, NA, NA, NA, NA, NA, 1666, N…
$ retweet_location <chr> "Pullman, WA", NA, NA, NA, NA, NA, NA, NA, NA,…
$ retweet_description <chr> "Official Twitter home of Washington State Cou…
$ retweet_verified <lgl> TRUE, NA, NA, NA, NA, NA, NA, NA, NA, FALSE, N…
$ place_url <chr> NA, NA, NA, NA, NA, "https://api.twitter.com/1…
$ place_name <chr> NA, NA, NA, NA, NA, "Pullman", NA, NA, NA, NA,…
$ place_full_name <chr> NA, NA, NA, NA, NA, "Pullman, WA", NA, NA, NA,…
$ place_type <chr> NA, NA, NA, NA, NA, "city", NA, NA, NA, NA, "c…
$ country <chr> NA, NA, NA, NA, NA, "United States", NA, NA, N…
$ country_code <chr> NA, NA, NA, NA, NA, "US", NA, NA, NA, NA, "US"…
$ geo_coords <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
$ coords_coords <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
$ bbox_coords <list> <NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, NA…
$ status_url <chr> "https://twitter.com/WSUPullman/status/1254177…
$ name <chr> "WSU Pullman", "WSU Pullman", "WSU Pullman", "…
$ location <chr> "Pullman, Washington USA", "Pullman, Washingto…
$ description <chr> "We are an award-winning research university i…
$ url <chr> "http://t.co/VxKZH9BuMS", "http://t.co/VxKZH9B…
$ protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ followers_count <int> 43914, 43914, 43914, 43914, 43914, 43914, 4391…
$ friends_count <int> 9717, 9717, 9717, 9717, 9717, 9717, 9717, 9717…
$ listed_count <int> 556, 556, 556, 556, 556, 556, 556, 556, 556, 5…
$ statuses_count <int> 15234, 15234, 15234, 15234, 15234, 15234, 1523…
$ favourites_count <int> 20124, 20124, 20124, 20124, 20124, 20124, 2012…
$ account_created_at <dttm> 2009-02-26 23:39:34, 2009-02-26 23:39:34, 200…
$ verified <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ profile_url <chr> "http://t.co/VxKZH9BuMS", "http://t.co/VxKZH9B…
$ profile_expanded_url <chr> "http://www.wsu.edu", "http://www.wsu.edu", "h…
$ account_lang <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ profile_banner_url <chr> "https://pbs.twimg.com/profile_banners/2208014…
$ profile_background_url <chr> "http://abs.twimg.com/images/themes/theme5/bg.…
$ profile_image_url <chr> "http://pbs.twimg.com/profile_images/576502906…
# A tibble: 6 × 5
user_id created_at screen_name text location
<chr> <dttm> <chr> <chr> <chr>
1 22080148 2020-04-25 22:37:18 WSUPullman "Big Dez is headed to Indy!… Pullman…
2 22080148 2020-04-23 21:11:49 WSUPullman "Cougar Cheese. That's it. … Pullman…
3 22080148 2020-04-21 04:00:00 WSUPullman "Darien McLaughlin '19, and… Pullman…
4 22080148 2020-04-24 03:00:00 WSUPullman "6 houses, one pick. Cougs,… Pullman…
5 22080148 2020-04-20 19:00:21 WSUPullman "Why did you choose to atte… Pullman…
6 22080148 2020-04-20 02:20:01 WSUPullman "Tell us one of your Bryan … Pullman…
What are strings?
Notice how R stores strings using double quotes internally:
Note: To include quotes as part of the string, we can either use the other type of quotes to surround the string (i.e., ' or ") or escape the quote using a backslash (\).
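The note above can be sketched with a made-up string (the variable names are illustrative only). Comparing print() with writeLines() shows the difference between how R stores a string and how the string is displayed:

```r
# Three ways to build a string that contains quote characters
single_inside <- "She said 'hello'"     # single quotes inside double quotes
double_inside <- 'She said "hello"'     # double quotes inside single quotes
escaped       <- "She said \"hello\""   # double quotes escaped with a backslash

print(escaped)       # shows the internal representation, with escapes
writeLines(escaped)  # shows the string as it would be displayed
```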
stringr package
“A consistent, simple and easy to use set of wrappers around the fantastic stringi package. All function and argument names (and positions) are consistent, all functions deal with NA’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another.”
Credit: stringr R documentation
The stringr package:
- The stringr package is based off the stringi package and is part of Tidyverse
- stringr contains functions to work with strings
- For many functions in the stringr package, there are equivalent “base R” functions
- stringr functions all follow the same rules, while rules often differ across different “base R” string functions, so we will focus exclusively on stringr functions
- Most stringr functions start with str_ (e.g., str_length)
str_length()
- string: Character vector (or vector coercible to character)
- str_length() calculates the length of a string, whereas the length() function (which is not part of the stringr package) calculates the number of elements in an object
Using str_length() on string
Compare to length(), which treats the string as a single object:
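A minimal sketch of this comparison, using a made-up string and assuming stringr is installed:

```r
library(stringr)

s <- "Go Cougs"

str_length(s)  # number of characters in the string
length(s)      # number of elements: a single string is one element
```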
Using str_length() on character vector
Compare to length(), which finds the number of elements in the vector:
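As a sketch, the same comparison on a character vector (the handles are taken from the data shown later; the vector name is illustrative):

```r
library(stringr)

handles <- c("WSUPullman", "UW", "uoregon")

str_length(handles)  # one length per element
length(handles)      # number of elements in the vector
```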
Using str_length() on other vectors coercible to character
Logical vectors can be coerced to character vectors:
Numeric vectors can be coerced to character vectors:
Integer vectors can be coerced to character vectors:
Using str_length() on dataframe column
Recall that the columns in a dataframe are just vectors, so we can use str_length() as long as the vector is coercible to character type.
# A tibble: 11 × 2
screen_name screen_name_len
<chr> <int>
1 WSUPullman 10
2 CalAdmissions 13
3 UW 2
4 USCAdmission 12
5 uoregon 7
6 FutureSunDevils 15
7 UCLAAdmission 13
8 UtahAdmissions 14
9 futurebuffs 11
10 uaadmissions 12
11 BeaverVIP 9
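The coercion cases and the dataframe-column case described above can be sketched as follows. The small data frame here is made up for illustration and is not p12_df:

```r
library(stringr)

# Each vector is coerced to character before its lengths are computed
str_length(c(TRUE, FALSE))    # lengths of "TRUE" and "FALSE"
str_length(c(1.5, 100))       # lengths of "1.5" and "100"
str_length(c(1L, 10L, 100L))  # lengths of "1", "10", "100"

# A dataframe column is just a vector, so the same call works on it
df <- data.frame(id = c(7L, 42L, 1234L))
df$id_len <- str_length(df$id)
df
```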
str_c()
The str_c() function:
- sep: String to insert between input vectors
- collapse: Optional string used to combine input vectors into single string
Using str_c() on one vector
Since we only provided one input vector, it has nothing to concatenate with, so str_c() will just return the same vector:
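A minimal sketch of the single-vector case, reusing the c("a", "b", "c") vector that appears in the later examples:

```r
library(stringr)

x <- c("a", "b", "c")

# With a single input vector there is nothing to concatenate with
str_c(x)
```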
Using str_c() on one vector
Note that specifying the sep argument will also not have any effect because we only have one input vector, and sep is the separator between multiple vectors:
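As a sketch, the separator "~" below is an arbitrary choice for illustration:

```r
library(stringr)

x <- c("a", "b", "c")

# `sep` separates input vectors, so it has no effect with only one vector
str_c(x, sep = "~")
```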
Using str_c() on one vector
str_c() returns a vector by default (because the default value for the collapse argument is NULL). We can specify collapse in order to collapse the elements of the output vector into a single string:
str_c(c("a", "b", "c"), collapse = "|")
[1] "a|b|c"
# Check length: Output vector of length 3 is collapsed into a single string
str_c(c("a", "b", "c"), collapse = "|") %>% length()
[1] 1
# Check str_length: This gives the length of the collapsed string, which is 5 characters long
str_c(c("a", "b", "c"), collapse = "|") %>% str_length()
[1] 5
Using str_c() on more than one vector
When we provide multiple input vectors, we can see that the vectors get concatenated element-wise (i.e., the 1st elements of each vector are concatenated, the 2nd elements of each vector are concatenated, etc.):
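A sketch of element-wise concatenation, using the same three vectors that appear in the collapse example further down:

```r
library(stringr)

v1 <- c("a", "b", "c")
v2 <- c("x", "y", "z")
v3 <- c("!", "?", ";")

# The i-th elements of each vector are joined into the i-th output element
res <- str_c(v1, v2, v3)
res
```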
Using str_c() on more than one vector
The default separator for each element-wise concatenation is an empty string (""), but we can customize that by specifying the sep argument:
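A sketch with the separator "~", the same separator used in the sep and collapse example further down:

```r
library(stringr)

# `sep` is inserted between the elements taken from each input vector
res_sep <- str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~")
res_sep
```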
Using str_c() on more than one vector
Again, we can specify the collapse argument in order to collapse the elements of the output vector into a single string:
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), collapse = "|")
[1] "ax!|by?|cz;"
# Check length: Output vector of length 3 is collapsed into a single string
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), collapse = "|") %>% length()
[1] 1
# Specifying both `sep` and `collapse`
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~", collapse = "|")
[1] "a~x~!|b~y~?|c~z~;"
str_sub()
The str_sub() function:
- string: Character vector (or vector coercible to character)
- start: Position of first character to be included in substring (default: 1)
- end: Position of last character to be included in substring (default: -1)
- omit_na: If TRUE, missing values in any of the arguments provided will result in an unchanged input
- When str_sub() is used in the assignment form, you can replace the subsetted part of the string with a value of your choice
- If an element is too short for the specified positions, value will be concatenated to the end of that element
Using str_sub() to subset strings
If no start and end positions are specified, str_sub() will by default return the entire (original) string:
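A minimal sketch of the default behavior, using a made-up vector:

```r
library(stringr)

x <- c("abcdefg", "123", "TRUE")

# Defaults are start = 1 and end = -1, so each string is returned whole
str_sub(x)
```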
Note that if an element is shorter than the specified end (i.e., 123 in the example below), it will just include all the available characters that it does have:
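As a sketch with a made-up vector, where "123" has only 3 characters but end is 4:

```r
library(stringr)

x <- c("abcdefg", "123", "TRUE")

# Characters 2 through 4; "123" only has characters 2 and 3 available
sub_2_4 <- str_sub(x, start = 2, end = 4)
sub_2_4
```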
Using str_sub() to subset strings
Remember we can also use a negative index to count the position starting from the back:
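A sketch of negative indexing on the same made-up vector:

```r
library(stringr)

x <- c("abcdefg", "123", "TRUE")

# From the 3rd-to-last character through the 2nd-to-last character
sub_neg <- str_sub(x, start = -3, end = -2)
sub_neg
```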
Using str_sub() to replace strings
If no start and end positions are specified, str_sub() will by default return the original string, so the entire string would be replaced:
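A sketch of the assignment form. The vectors and the replacement value "*" are made up for illustration:

```r
library(stringr)

# No start/end given: the whole of every element is replaced
x <- c("abcdefg", "123", "TRUE")
str_sub(x) <- "*"
x

# With start/end given: only characters 2 through 4 are replaced
y <- c("abcdefg", "123", "TRUE")
str_sub(y, start = 2, end = 4) <- "*"
y
```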
Using str_sub() on dataframe column
We can use as.character() to turn the created_at value to a string, then use str_sub() to extract various date/time components from the string:
p12_datetime_df <- p12_df %>% select(created_at) %>%
mutate(
dt_chr = as.character(created_at),
date_chr = str_sub(dt_chr, 1, 10),
yr_chr = str_sub(dt_chr, 1, 4),
mth_chr = str_sub(dt_chr, 6, 7),
day_chr = str_sub(dt_chr, 9, 10),
hr_chr = str_sub(dt_chr, -8, -7),
min_chr = str_sub(dt_chr, -5, -4),
sec_chr = str_sub(dt_chr, -2, -1)
)
# A tibble: 328 × 9
created_at dt_chr date_chr yr_chr mth_chr day_chr hr_chr min_chr
<dttm> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2020-04-25 22:37:18 2020-04-2… 2020-04… 2020 04 25 22 37
2 2020-04-23 21:11:49 2020-04-2… 2020-04… 2020 04 23 21 11
3 2020-04-21 04:00:00 2020-04-2… 2020-04… 2020 04 21 04 00
4 2020-04-24 03:00:00 2020-04-2… 2020-04… 2020 04 24 03 00
5 2020-04-20 19:00:21 2020-04-2… 2020-04… 2020 04 20 19 00
6 2020-04-20 02:20:01 2020-04-2… 2020-04… 2020 04 20 02 20
7 2020-04-22 04:00:00 2020-04-2… 2020-04… 2020 04 22 04 00
8 2020-04-25 17:00:00 2020-04-2… 2020-04… 2020 04 25 17 00
9 2020-04-21 15:13:06 2020-04-2… 2020-04… 2020 04 21 15 13
10 2020-04-21 17:52:47 2020-04-2… 2020-04… 2020 04 21 17 52
# ℹ 318 more rows
# ℹ 1 more variable: sec_chr <chr>
Other stringr functions
Other useful stringr functions:
- str_to_upper(): Turn strings to uppercase
- str_to_lower(): Turn strings to lowercase
- str_sort(): Sort a character vector
- str_trim(): Trim whitespace from strings (including \n, \t, etc.)
- str_pad(): Pad strings with specified character
Using str_to_upper() to turn strings to uppercase
Turn column names of p12_df to uppercase:
Using str_to_lower() to turn strings to lowercase
Turn column names of p12_df to lowercase:
Using str_sort() to sort character vector
Sort the vector of p12_df column names:
[1] "user_id" "created_at" "screen_name" "text" "location"
[1] "created_at" "location" "screen_name" "text" "user_id"
[1] "user_id" "text" "screen_name" "location" "created_at"
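The three functions above can be sketched on the five column names shown in the output. Here the names are typed into a vector by hand instead of being read from p12_df, and the `decreasing` argument is used for the reverse sort:

```r
library(stringr)

cols <- c("user_id", "created_at", "screen_name", "text", "location")

str_to_upper(cols)                   # all uppercase
str_to_lower(str_to_upper(cols))     # back to lowercase
str_sort(cols)                       # alphabetical order
str_sort(cols, decreasing = TRUE)    # reverse alphabetical order
```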
Using str_trim() to trim whitespace from string
[1] "ABC" "XYZ"
[1] "ABC " "XYZ\t"
[1] "\nABC" " XYZ"
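The calls that produced the output above are not shown. One input consistent with all three lines is sketched below; the input vector itself is an assumption:

```r
library(stringr)

x <- c("\nABC ", " XYZ\t")

str_trim(x)                  # trim both sides (the default)
str_trim(x, side = "left")   # trim leading whitespace only
str_trim(x, side = "right")  # trim trailing whitespace only
```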
Using str_pad() to pad string with character
Let’s say we have a vector of zip codes that has lost all leading 0’s. We can use str_pad() to add them back in:
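A minimal sketch with made-up zip codes stored as strings:

```r
library(stringr)

zips <- c("90210", "2138", "501")

# Pad on the left with "0" until every string is 5 characters wide
padded <- str_pad(zips, width = 5, side = "left", pad = "0")
padded
```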
Suppose we want to match the times (10 AM and 1 PM) in the following string:
"Class starts at 10 AM and ends at 1 PM."
The regular expression \d+ [AP]M can!
How \d+ [AP]M works:
- \d+ matches 1 or more digits in a row
  - \d means match all numeric digits (i.e., 0-9)
  - + means match 1 or more of the preceding element
- [AP]M matches either AM or PM
  - [AP] means match either an A or P at that position
  - M means match a literal M
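The pattern can be tried on the example string with str_extract_all(). This is a sketch, and the choice of extraction function is ours. Note the doubled backslash that R requires inside the string:

```r
library(stringr)

x <- "Class starts at 10 AM and ends at 1 PM."

# \d+ : one or more digits, then a space, then A or P, then a literal M
times <- str_extract_all(x, "\\d+ [AP]M")[[1]]
times
```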
Credit: DaveChild Regular Expression Cheat Sheet
STRING | REGEX | MATCHES |
---|---|---|
"\\d" | \d | any digit |
"\\D" | \D | any non-digit |
"\\s" | \s | any whitespace |
"\\S" | \S | any non-whitespace |
"\\w" | \w | any word character |
"\\W" | \W | any non-word character |
Credit: Working with strings in stringr Cheat sheet
- \d is used to match any digit (i.e., number)
- \s is used to match any whitespace (i.e., space, tab, or newline character)
- \w is used to match any word character (i.e., alphanumeric character or underscore)
“But wait… there’s more! Before a regex is interpreted as a regular expression, it is also interpreted by R as a string. And backslash is used to escape there as well. So, in the end, you need to preprend two backslashes…”
Credit: Escaping sequences from Stat 545
This means in R, when we want to use regular expression patterns \d, \s, \w, etc. to match to strings, we must write out the regex patterns as "\\d", "\\s", "\\w", etc.
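A short sketch of why the backslash is doubled: the R string "\\d" holds only two characters, a backslash and a d, which is exactly the regex \d. The test strings are made up:

```r
library(stringr)

pattern <- "\\d"

nchar(pattern)       # the string holds 2 characters
writeLines(pattern)  # displays the regex as the regex engine sees it

# Does each string contain at least one digit?
has_digit <- str_detect(c("Covid-19", "labs"), pattern)
has_digit
```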
Using \d & \D to match digits & non-digits
Goal: write a regular expression pattern that matches any digit in the string p12_df$text[119]
We can use \d to match all instances of a digit (i.e., number):
# Match any instances of a digit
str_view_all(
string = p12_df$text[119]
, pattern = "\\d"
)
[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-<1><9> in our labs and hospitals."
│
│ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/<4>YSf<4>SpPe<0>
- Note the pattern argument above; this is our “regex object”
- The regular expression is \d, which matches any digit
- In R, we write it as the string "\\d" rather than "\d"
We can use \D
to match all instances of a non-digit character:[1] │ <"><I>< ><s><t><a><n><d>< ><w><i><t><h>< ><m><y>< ><c><o><l><l><e><a><g><u><e><s>< ><a><t>< ><@><U><W>< ><a><n><d>< ><A><m><e><r><i><c><a><'><s>< ><l><e><a><d><i><n><g>< ><r><e><s><e><a><r><c><h>< ><u><n><i><v><e><r><s><i><t><i><e><s>< ><a><s>< ><t><h><e><y>< ><t><a><k><e>< ><f><i><g><h><t>< ><t><o>< ><C><o><v><i><d><->19< ><i><n>< ><o><u><r>< ><l><a><b><s>< ><a><n><d>< ><h><o><s><p><i><t><a><l><s><.><"><
│ ><
│ ><#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m>< ><x>< ><#><A><l><w><a><y><s><C><o><m><p><e><t><e>< ><x>< ><#><G><o><H><u><s><k><i><e><s>< ><h><t><t><p><s><:></></><t><.><c><o></>4<Y><S><f>4<S><p><P><e>0
Using \s & \S to match whitespace & non-whitespace
We can use \s to match all instances of a whitespace (i.e., space, tab, or newline character):
[1] │ "I< >stand< >with< >my< >colleagues< >at< >@UW< >and< >America's< >leading< >research< >universities< >as< >they< >take< >fight< >to< >Covid-19< >in< >our< >labs< >and< >hospitals."<
│ ><
│ >#ProudToBeOnTheirTeam< >x< >#AlwaysCompete< >x< >#GoHuskies< >https://t.co/4YSf4SpPe0
We can use \S to match all instances of a non-whitespace character:
# Match any instances of a non-whitespace
str_view_all(
string = p12_df$text[119]
, pattern = "\\S"
)
[1] │ <"><I> <s><t><a><n><d> <w><i><t><h> <m><y> <c><o><l><l><e><a><g><u><e><s> <a><t> <@><U><W> <a><n><d> <A><m><e><r><i><c><a><'><s> <l><e><a><d><i><n><g> <r><e><s><e><a><r><c><h> <u><n><i><v><e><r><s><i><t><i><e><s> <a><s> <t><h><e><y> <t><a><k><e> <f><i><g><h><t> <t><o> <C><o><v><i><d><-><1><9> <i><n> <o><u><r> <l><a><b><s> <a><n><d> <h><o><s><p><i><t><a><l><s><.><">
│
│ <#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m> <x> <#><A><l><w><a><y><s><C><o><m><p><e><t><e> <x> <#><G><o><H><u><s><k><i><e><s> <h><t><t><p><s><:></></><t><.><c><o></><4><Y><S><f><4><S><p><P><e><0>
Using \w & \W to match words & non-words
We can use \w to match all instances of a word character (i.e., alphanumeric character or underscore):
# Match any instances of a word character
str_view_all(
string = p12_df$text[119]
, pattern = "\\w"
)
[1] │ "<I> <s><t><a><n><d> <w><i><t><h> <m><y> <c><o><l><l><e><a><g><u><e><s> <a><t> @<U><W> <a><n><d> <A><m><e><r><i><c><a>'<s> <l><e><a><d><i><n><g> <r><e><s><e><a><r><c><h> <u><n><i><v><e><r><s><i><t><i><e><s> <a><s> <t><h><e><y> <t><a><k><e> <f><i><g><h><t> <t><o> <C><o><v><i><d>-<1><9> <i><n> <o><u><r> <l><a><b><s> <a><n><d> <h><o><s><p><i><t><a><l><s>."
│
│ #<P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m> <x> #<A><l><w><a><y><s><C><o><m><p><e><t><e> <x> #<G><o><H><u><s><k><i><e><s> <h><t><t><p><s>://<t>.<c><o>/<4><Y><S><f><4><S><p><P><e><0>
We can use \W to match all instances of a non-word character:
# Match any instances of a non-word character
str_view_all(
string = p12_df$text[119]
, pattern = "\\W"
)
[1] │ <">I< >stand< >with< >my< >colleagues< >at< ><@>UW< >and< >America<'>s< >leading< >research< >universities< >as< >they< >take< >fight< >to< >Covid<->19< >in< >our< >labs< >and< >hospitals<.><"><
│ ><
│ ><#>ProudToBeOnTheirTeam< >x< ><#>AlwaysCompete< >x< ><#>GoHuskies< >https<:></></>t<.>co</>4YSf4SpPe0
Using \w & \W to match words & non-words
This matches all instances of 3-letter words:
[1] │ "I stand with my colleagues at @UW< and >America's leading research universities as they take fight to Covid-19 in< our >labs< and >hospitals."
│
│ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0
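The pattern behind the output above is not shown. One pattern consistent with the highlighted matches, which include the surrounding spaces, is \W\w{3}\W. The sketch below tries that assumed pattern on a fragment of the tweet and compares it with a word-boundary version:

```r
library(stringr)

x <- "in our labs and hospitals"

# Assumed pattern: a non-word character, exactly 3 word characters, a non-word character
with_spaces <- str_extract_all(x, "\\W\\w{3}\\W")[[1]]
with_spaces

# Alternative using word boundaries, which does not consume the spaces
words_only <- str_extract_all(x, "\\b\\w{3}\\b")[[1]]
words_only
```

Because the first pattern consumes the surrounding non-word characters, it can miss a 3-letter word that directly follows another match. The \b version does not have that limitation.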
Other uses of the backslash include special characters like \n and \t, as well as using backslash to escape characters that have special meanings in regex, like . or ? (as we will soon see). To match these characters literally, we use the regex \. and \?, or strings "\\." and "\\?" in R.
STRING | REGEX | MATCHES |
---|---|---|
"\\n" | \n | newline |
"\\t" | \t | tab |
"\\\\" | \\ | \ |
"\\." | \. | . |
"\\?" | \? | ? |
"\\(" | \( | ( |
"\\)" | \) | ) |
"\\{" | \{ | { |
"\\}" | \} | } |
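A sketch contrasting an escaped and an unescaped special character, on a made-up string:

```r
library(stringr)

x <- "Go Cougs. Ready? (yes)"

str_count(x, ".")     # unescaped: wildcard, counts every character
str_count(x, "\\.")   # escaped: counts literal periods only
str_extract_all(x, "\\?")[[1]]        # the literal question mark
str_extract_all(x, "\\(|\\)")[[1]]    # literal parentheses
```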
Character | Description |
---|---|
* | 0 or more |
? | 0 or 1 |
+ | 1 or more |
{3} | Exactly 3 |
{3,} | 3 or more |
{3,5} | 3, 4, or 5 |
- For example, s? matches 0 or 1 s and \d{4} matches exactly 4 digits.
- ^ matches the start of the string, $ matches the end of the string (notice how we do not need to escape these characters).
- \b can be used to help detect word boundaries, and \B can be used to help match characters within a word.
String | Character | Description |
---|---|---|
"^" | ^ | Start of string, or start of line in multi-line pattern |
"$" | $ | End of string, or end of line in multi-line pattern |
"\\b" | \b | Word boundary |
"\\B" | \B | Non-word boundary |
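The quantifiers and anchors in the tables above can be sketched on a few made-up strings:

```r
library(stringr)

# s? : the "s" is optional
q_opt <- str_extract_all("cat cats", "cats?")[[1]]

# \d{4} : exactly 4 digits in a row
q_four <- str_extract_all("founded in 1890 and 1861", "\\d{4}")[[1]]

# ^ and $ : anchor the match to the start or end of the string
a_start <- str_detect(c("labs and hospitals", "our labs"), "^labs")
a_end   <- str_detect(c("labs and hospitals", "our labs"), "labs$")

# \b : word boundaries keep the match to whole 3-letter words
b_words <- str_extract_all("in our labs and hospitals", "\\b\\w{3}\\b")[[1]]
```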
Using ^ & $ to match start & end of string
We can use ^ to match the start of a string:
# Matches only the quotation mark at the start of the text and not the end quote
str_view_all(string = p12_df$text[119], pattern = '^"')
[1] │ <">I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
│
│ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0
Using ^ & $ to match start & end of string
We can use $ to match the end of a string:
# Matches only the number at the end of the text and not any other numbers
str_view_all(string = p12_df$text[119], pattern = "\\d$")
[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
│
│ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe<0>
Character | Description |
---|---|
. | Match any character except newline (\n) |
a|b | Match a or b |
[abc] | Match either a, b, or c |
[^abc] | Match anything except a, b, or c |
[a-z] | Match range of lowercase letters from a to z |
[A-Z] | Match range of uppercase letters from A to Z |
[0-9] | Match range of numbers from 0 to 9 |
- The period . acts as a wildcard to match any character except newline.
- The vertical bar | is similar to an OR operator.
- Square brackets [...] can be used to specify a set or range of characters to match (or not to match).
Using . as a wildcard
We can use . to match any character except newline (\n):
[1] │ <"><I>< ><s><t><a><n><d>< ><w><i><t><h>< ><m><y>< ><c><o><l><l><e><a><g><u><e><s>< ><a><t>< ><@><U><W>< ><a><n><d>< ><A><m><e><r><i><c><a><'><s>< ><l><e><a><d><i><n><g>< ><r><e><s><e><a><r><c><h>< ><u><n><i><v><e><r><s><i><t><i><e><s>< ><a><s>< ><t><h><e><y>< ><t><a><k><e>< ><f><i><g><h><t>< ><t><o>< ><C><o><v><i><d><-><1><9>< ><i><n>< ><o><u><r>< ><l><a><b><s>< ><a><n><d>< ><h><o><s><p><i><t><a><l><s><.><">
│
│ <#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m>< ><x>< ><#><A><l><w><a><y><s><C><o><m><p><e><t><e>< ><x>< ><#><G><o><H><u><s><k><i><e><s>< ><h><t><t><p><s><:></></><t><.><c><o></><4><Y><S><f><4><S><p><P><e><0>
Using . as a wildcard
We can confirm there is a newline in the tweet above by using writeLines() or print():
"I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
#ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0
[1] "\"I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals.\"\n\n#ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0"
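A sketch, on a made-up two-line string, showing that the wildcard skips the newline: the count of . matches is one less than the number of characters.

```r
library(stringr)

x <- "line one\nline two"

n_chars <- str_length(x)      # counts every character, including the newline
n_dot   <- str_count(x, ".")  # counts every character except the newline

n_chars - n_dot               # the one character that . did not match
str_detect(x, "\\n")          # the newline can still be matched explicitly
```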
Using | as an OR operator
We can use | to match either one of multiple patterns:
# Matches `research`, `fight`, or `labs`
str_view_all(string = p12_df$text[119], pattern = "research|fight|labs")
[1] │ "I stand with my colleagues at @UW and America's leading <research> universities as they take <fight> to Covid-19 in our <labs> and hospitals."
│
│ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0
[1] │ "I stand with my colleagues at <@UW> and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
│
│ <#ProudToBeOnTheirTeam> x <#AlwaysCompete> x <#GoHuskies> https://t.co/4YSf4SpPe0
Using [...] to match (or not match) a set or range of characters
We can use [...] to match any set of characters:
[1] │ "I stand with my colleagues at <@UW> and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
│
│ <#ProudToBeOnTheirTeam> x <#AlwaysCompete> x <#GoHuskies> https://t.co/4YSf4SpPe0
# Matches any 2 consecutive vowels
str_view_all(string = p12_df$text[119], pattern = "[aeiouAEIOU]{2}")
[1] │ "I stand with my coll<ea>g<ue>s at @UW and America's l<ea>ding res<ea>rch universit<ie>s as they take fight to Covid-19 in <ou>r labs and hospitals."
│
│ #Pr<ou>dToB<eO>nTh<ei>rT<ea>m x #AlwaysCompete x #GoHusk<ie>s https://t.co/4YSf4SpPe0
Using [...] to match (or not match) a set or range of characters
We can also use [...] to match any range of alpha or numeric characters:
# Matches only lowercase x through z or uppercase A through C
str_view_all(string = p12_df$text[119], pattern = "[x-zA-C]")
[1] │ "I stand with m<y> colleagues at @UW and <A>merica's leading research universities as the<y> take fight to <C>ovid-19 in our labs and hospitals."
│
│ #ProudTo<B>eOnTheirTeam <x> #<A>lwa<y>s<C>ompete <x> #GoHuskies https://t.co/4YSf4SpPe0
# Matches only numbers 1 through 4 or the pound sign
str_view_all(string = p12_df$text[119], pattern = "[1-4#]")
[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-<1>9 in our labs and hospitals."
│
│ <#>ProudToBeOnTheirTeam x <#>AlwaysCompete x <#>GoHuskies https://t.co/<4>YSf<4>SpPe0
Using [...] to match (or not match) a set or range of characters
We can use [^...] to indicate we do not want to match the provided set or range of characters:
[1] │ "<I> st<a>nd w<i>th my c<o>ll<e><a>g<u><e>s <a>t @<U>W <a>nd <A>m<e>r<i>c<a>'s l<e><a>d<i>ng r<e>s<e><a>rch <u>n<i>v<e>rs<i>t<i><e>s <a>s th<e>y t<a>k<e> f<i>ght t<o> C<o>v<i>d-19 <i>n <o><u>r l<a>bs <a>nd h<o>sp<i>t<a>ls."
│
│ #Pr<o><u>dT<o>B<e><O>nTh<e><i>rT<e><a>m x #<A>lw<a>ysC<o>mp<e>t<e> x #G<o>H<u>sk<i><e>s https://t.c<o>/4YSf4SpP<e>0
[1] │ <">I< ><s><t>a<n><d>< ><w>i<t><h>< ><m><y>< ><c>o<l><l>ea<g>ue<s>< >a<t>< ><@>U<W>< >a<n><d>< >A<m>e<r>i<c>a<'><s>< ><l>ea<d>i<n><g>< ><r>e<s>ea<r><c><h>< >u<n>i<v>e<r><s>i<t>ie<s>< >a<s>< ><t><h>e<y>< ><t>a<k>e< ><f>i<g><h><t>< ><t>o< ><C>o<v>i<d><-><1><9>< >i<n>< >ou<r>< ><l>a<b><s>< >a<n><d>< ><h>o<s><p>i<t>a<l><s><.><"><
│ ><
│ ><#><P><r>ou<d><T>o<B>eO<n><T><h>ei<r><T>ea<m>< ><x>< ><#>A<l><w>a<y><s><C>o<m><p>e<t>e< ><x>< ><#><G>o<H>u<s><k>ie<s>< ><h><t><t><p><s><:></></><t><.><c>o</><4><Y><S><f><4><S><p><P>e<0>
# Matches anything that's not uppercase letters
str_view_all(string = p12_df$text[119], pattern = "[^A-Z]+")
[1] │ <">I< stand with my colleagues at @>UW< and >A<merica's leading research universities as they take fight to >C<ovid-19 in our labs and hospitals."
│
│ #>P<roud>T<o>B<e>O<n>T<heir>T<eam x #>A<lways>C<ompete x #>G<o>H<uskies https://t.co/4>YS<f4>S<p>P<e0>
Using [...] to match (or not match) a set or range of characters
Notice that [...] only matches a single character (see second to last example above). We need to use quantifiers if we want to match a stretch of characters (see last example above).
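The single-character behavior can be sketched on a short made-up string. The patterns below are illustrative and are not necessarily the ones used for the outputs above:

```r
library(stringr)

x <- "#GoHuskies @UW 19"

one_digit  <- str_extract_all(x, "[0-9]")[[1]]         # one character per match
digit_run  <- str_extract_all(x, "[0-9]+")[[1]]        # quantifier: a run of digits
tags       <- str_extract_all(x, "[@#]\\w+")[[1]]      # @ or #, then word characters
not_listed <- str_extract_all(x, "[^A-Za-z0-9 ]")[[1]] # anything outside the listed set
```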
“Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time related quirks, and R lacks these capabilities in some situations. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.”
Credit: lubridate documentation
How are dates and times stored in R? (From Dates and Times in R)
- The Date class is used for storing dates
  - “Date objects are stored as the number of days since January 1, 1970, using negative numbers for earlier dates. The as.numeric() function can be used to convert a Date object to its internal form.”
- “POSIXct class stores date/time values as the number of seconds since January 1, 1970”
- “POSIXlt class stores date/time values as a list of components (hour, min, sec, mon, etc.) making it easy to extract these parts”
Why use date/time objects?
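The internal forms described above can be checked in base R. This is a sketch with dates chosen close to January 1, 1970 so the numbers are easy to read:

```r
# Date: number of days since 1970-01-01 (negative for earlier dates)
days_after  <- as.numeric(as.Date("1970-01-10"))
days_before <- as.numeric(as.Date("1969-12-31"))

# POSIXct: number of seconds since 1970-01-01 00:00:00 UTC
ct <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
secs <- as.numeric(ct)

# POSIXlt: a list of components that can be extracted by name
lt <- as.POSIXlt(ct)
lt$min
lt$hour
```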
Functions that create date/time objects by parsing character or numeric input:
- Create Date object: ymd(), ydm(), mdy(), myd(), dmy(), dym()
  - y stands for year, m stands for month, d stands for day
  - Choose the function whose name matches the order of the components in your input, and it will return a Date object
- Create POSIXct object: ymd_h(), ymd_hm(), ymd_hms(), etc.
  - h stands for hour, m stands for minute, s stands for second
  - Add h, hm, or hms to the function name if you want to provide additional time information in order to create a POSIXct object
  - To create a POSIXct object without providing any time information, you can just provide a timezone (using tz) to one of the date functions and it will assume midnight as the time
  - You can use Sys.timezone() to get the timezone for your location
Creating Date object from character or numeric input
The lubridate functions are flexible and can parse dates in various formats:
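A sketch of this flexibility, assuming lubridate is installed. The inputs are made up, and all of them describe the same day:

```r
library(lubridate)

d1 <- ymd("2021-01-20")
d2 <- ymd("2021/01/20")
d3 <- ymd(20210120)            # numeric input
d4 <- mdy("January 20, 2021")  # month name, day, year
d5 <- dmy("20-Jan-2021")       # day, abbreviated month, year

c(d1, d2, d3, d4, d5)
```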
Creating Date object from character or numeric input
Investigate the Date object:
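A minimal sketch of inspecting a Date object created with ymd(). The date is made up:

```r
library(lubridate)

d <- ymd("2021-01-20")

class(d)       # the object's class
typeof(d)      # the underlying storage type
as.numeric(d)  # internal form: days since 1970-01-01
```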
Creating POSIXct object from character or numeric input
The lubridate functions are flexible and can parse AM/PM in various formats:
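A sketch comparing a 12-hour input with an AM/PM marker to the equivalent 24-hour input. The timestamps are made up, and the sketch assumes an English locale for the AM/PM marker:

```r
library(lubridate)

dt_pm  <- ymd_hms("2021-01-20 11:59:59 PM")  # 12-hour clock with PM
dt_24h <- ymd_hms("2021-01-20 23:59:59")     # 24-hour clock

dt_pm
dt_24h
dt_pm == dt_24h  # both inputs describe the same instant
```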
Creating POSIXct object from character or numeric input
Investigate the POSIXct object:
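A minimal sketch of inspecting a POSIXct object created with ymd_hms(). The timestamp is made up:

```r
library(lubridate)

dt <- ymd_hms("2021-01-20 23:59:59")

class(dt)       # the object's classes
typeof(dt)      # the underlying storage type
as.numeric(dt)  # internal form: seconds since 1970-01-01 00:00:00 UTC
tz(dt)          # time zone attached to the object
```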
Creating POSIXct object from character or numeric input
We can also create a POSIXct object from a date function by providing a timezone. The time would default to midnight:
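As a sketch, "UTC" is used here so the result is reproducible; Sys.timezone() could be passed instead to use the local timezone:

```r
library(lubridate)

# Supplying `tz` to a date function returns a POSIXct object at midnight
dt_midnight <- ymd("2021-01-20", tz = "UTC")

dt_midnight
class(dt_midnight)
format(dt_midnight, "%H:%M:%S")
```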
Creating Date objects from dataframe column
Using the p12_datetime_df we created earlier, we can create Date objects from the date_chr column:
# Use `ymd()` to parse the string stored in the `date_chr` column
p12_datetime_df %>% select(created_at, dt_chr, date_chr) %>%
mutate(date_ymd = ymd(date_chr))
# A tibble: 328 × 4
created_at dt_chr date_chr date_ymd
<dttm> <chr> <chr> <date>
1 2020-04-25 22:37:18 2020-04-25 22:37:18 2020-04-25 2020-04-25
2 2020-04-23 21:11:49 2020-04-23 21:11:49 2020-04-23 2020-04-23
3 2020-04-21 04:00:00 2020-04-21 04:00:00 2020-04-21 2020-04-21
4 2020-04-24 03:00:00 2020-04-24 03:00:00 2020-04-24 2020-04-24
5 2020-04-20 19:00:21 2020-04-20 19:00:21 2020-04-20 2020-04-20
6 2020-04-20 02:20:01 2020-04-20 02:20:01 2020-04-20 2020-04-20
7 2020-04-22 04:00:00 2020-04-22 04:00:00 2020-04-22 2020-04-22
8 2020-04-25 17:00:00 2020-04-25 17:00:00 2020-04-25 2020-04-25
9 2020-04-21 15:13:06 2020-04-21 15:13:06 2020-04-21 2020-04-21
10 2020-04-21 17:52:47 2020-04-21 17:52:47 2020-04-21 2020-04-21
# ℹ 318 more rows
Creating POSIXct objects from dataframe column
Using the p12_datetime_df we created earlier, we can recreate the created_at column (class POSIXct) from the dt_chr column (class character):
# Use `ymd_hms()` to parse the string stored in the `dt_chr` column
p12_datetime_df %>% select(created_at, dt_chr) %>%
mutate(datetime_ymd_hms = ymd_hms(dt_chr))
# A tibble: 328 × 3
created_at dt_chr datetime_ymd_hms
<dttm> <chr> <dttm>
1 2020-04-25 22:37:18 2020-04-25 22:37:18 2020-04-25 22:37:18
2 2020-04-23 21:11:49 2020-04-23 21:11:49 2020-04-23 21:11:49
3 2020-04-21 04:00:00 2020-04-21 04:00:00 2020-04-21 04:00:00
4 2020-04-24 03:00:00 2020-04-24 03:00:00 2020-04-24 03:00:00
5 2020-04-20 19:00:21 2020-04-20 19:00:21 2020-04-20 19:00:21
6 2020-04-20 02:20:01 2020-04-20 02:20:01 2020-04-20 02:20:01
7 2020-04-22 04:00:00 2020-04-22 04:00:00 2020-04-22 04:00:00
8 2020-04-25 17:00:00 2020-04-25 17:00:00 2020-04-25 17:00:00
9 2020-04-21 15:13:06 2020-04-21 15:13:06 2020-04-21 15:13:06
10 2020-04-21 17:52:47 2020-04-21 17:52:47 2020-04-21 17:52:47
# ℹ 318 more rows
Functions that create date/time objects from various date/time components:
- Create Date object: make_date()
  - make_date(year = 1970L, month = 1L, day = 1L)
- Create POSIXct object: make_datetime()
  - make_datetime(year = 1970L, month = 1L, day = 1L, hour = 0L, min = 0L, sec = 0, tz = "UTC")
Creating Date object from individual components
There are various ways to pass in the inputs to create the same Date object:
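A sketch of a few equivalent calls, with made-up components, followed by the analogous make_datetime() call:

```r
library(lubridate)

# Positional, named, and integer inputs all describe the same date
d_pos   <- make_date(2021, 1, 20)
d_named <- make_date(year = 2021, month = 1, day = 20)
d_int   <- make_date(2021L, 1L, 20L)

# make_datetime() adds hour, minute, and second components (default tz is "UTC")
dt_parts <- make_datetime(2021, 1, 20, 23, 59, 59)

d_pos
dt_parts
```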
Creating POSIXct object from individual components
Creating Date objects from dataframe columns
Using the p12_datetime_df we created earlier, we can create Date objects from the various date component columns:
# Use `make_date()` to create a `Date` object from the `yr_chr`, `mth_chr`, `day_chr` fields
p12_datetime_df %>% select(created_at, dt_chr, yr_chr, mth_chr, day_chr) %>%
mutate(date_make_date = make_date(year = yr_chr, month = mth_chr, day = day_chr))
# A tibble: 328 × 6
created_at dt_chr yr_chr mth_chr day_chr date_make_date
<dttm> <chr> <chr> <chr> <chr> <date>
1 2020-04-25 22:37:18 2020-04-25 22:37:18 2020 04 25 2020-04-25
2 2020-04-23 21:11:49 2020-04-23 21:11:49 2020 04 23 2020-04-23
3 2020-04-21 04:00:00 2020-04-21 04:00:00 2020 04 21 2020-04-21
4 2020-04-24 03:00:00 2020-04-24 03:00:00 2020 04 24 2020-04-24
5 2020-04-20 19:00:21 2020-04-20 19:00:21 2020 04 20 2020-04-20
6 2020-04-20 02:20:01 2020-04-20 02:20:01 2020 04 20 2020-04-20
7 2020-04-22 04:00:00 2020-04-22 04:00:00 2020 04 22 2020-04-22
8 2020-04-25 17:00:00 2020-04-25 17:00:00 2020 04 25 2020-04-25
9 2020-04-21 15:13:06 2020-04-21 15:13:06 2020 04 21 2020-04-21
10 2020-04-21 17:52:47 2020-04-21 17:52:47 2020 04 21 2020-04-21
# ℹ 318 more rows
POSIXct objects from dataframe columns
Using the p12_datetime_df we created earlier, we can recreate the created_at column (class POSIXct) from the various date and time component columns (class character):
# Use `make_datetime()` to create a `POSIXct` object from the `yr_chr`, `mth_chr`, `day_chr`, `hr_chr`, `min_chr`, `sec_chr` fields
# Convert inputs to integers first
p12_datetime_df %>%
mutate(datetime_make_datetime = make_datetime(
as.integer(yr_chr), as.integer(mth_chr), as.integer(day_chr),
as.integer(hr_chr), as.integer(min_chr), as.integer(sec_chr)
)) %>%
select(datetime_make_datetime, yr_chr, mth_chr, day_chr, hr_chr, min_chr, sec_chr)
# A tibble: 328 × 7
datetime_make_datetime yr_chr mth_chr day_chr hr_chr min_chr sec_chr
<dttm> <chr> <chr> <chr> <chr> <chr> <chr>
1 2020-04-25 22:37:18 2020 04 25 22 37 18
2 2020-04-23 21:11:49 2020 04 23 21 11 49
3 2020-04-21 04:00:00 2020 04 21 04 00 00
4 2020-04-24 03:00:00 2020 04 24 03 00 00
5 2020-04-20 19:00:21 2020 04 20 19 00 21
6 2020-04-20 02:20:01 2020 04 20 02 20 01
7 2020-04-22 04:00:00 2020 04 22 04 00 00
8 2020-04-25 17:00:00 2020 04 25 17 00 00
9 2020-04-21 15:13:06 2020 04 21 15 13 06
10 2020-04-21 17:52:47 2020 04 21 17 52 47
# ℹ 318 more rows
Storing data using date/time objects makes it easier to get and set the various date/time components.
Basic accessor functions:
date(): Date component
year(): Year
month(): Month
day(): Day
hour(): Hour
minute(): Minute
second(): Second
week(): Week of the year
wday(): Day of the week (1 for Sunday to 7 for Saturday)
am(): Is it in the am? (returns TRUE or FALSE)
pm(): Is it in the pm? (returns TRUE or FALSE)
To get a component: accessor_function(<date/time_object>)
To set a component: accessor_function(<date/time_object>) <- "new_component"
am() and pm() can’t be set. Modify the time components instead.
[1] "2019-12-31 23:59:59 UTC"
[1] "POSIXct" "POSIXt"
[1] "2019-12-31"
[1] 23
[1] TRUE
[1] 3
[1] 2019
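The printed values above are consistent with accessor calls on the datetime 2019-12-31 23:59:59 UTC. The following is a sketch of what such calls could look like, not the original slide code. The names dt and dt_new are hypothetical.

```r
library(lubridate)

dt <- ymd_hms("2019-12-31 23:59:59")  # POSIXct in UTC
class(dt)

# Getting components
date(dt)   # date portion, returned as a Date
hour(dt)   # 23
pm(dt)     # TRUE, since the hour is 12 or later
wday(dt)   # 3, i.e. Tuesday when Sunday is counted as 1
year(dt)   # 2019

# Setting components: assign through the accessor
dt_new <- dt
year(dt_new) <- 2020
# am()/pm() cannot be assigned, so change the hour to move into the morning
hour(dt_new) <- 0
dt_new     # 2020-12-31 00:59:59 UTC
```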
Using the p12_datetime_df we created earlier, we can isolate the various date/time components from the POSIXct object in the created_at column:
# The extracted date/time components will be of numeric type
p12_datetime_df %>% select(created_at) %>%
mutate(
yr_num = year(created_at),
mth_num = month(created_at),
day_num = day(created_at),
hr_num = hour(created_at),
min_num = minute(created_at),
sec_num = second(created_at),
ampm = ifelse(am(created_at), 'AM', 'PM') # am()/pm() returns TRUE/FALSE
)
# A tibble: 328 × 8
created_at yr_num mth_num day_num hr_num min_num sec_num ampm
<dttm> <dbl> <dbl> <int> <int> <int> <dbl> <chr>
1 2020-04-25 22:37:18 2020 4 25 22 37 18 PM
2 2020-04-23 21:11:49 2020 4 23 21 11 49 PM
3 2020-04-21 04:00:00 2020 4 21 4 0 0 AM
4 2020-04-24 03:00:00 2020 4 24 3 0 0 AM
5 2020-04-20 19:00:21 2020 4 20 19 0 21 PM
6 2020-04-20 02:20:01 2020 4 20 2 20 1 AM
7 2020-04-22 04:00:00 2020 4 22 4 0 0 AM
8 2020-04-25 17:00:00 2020 4 25 17 0 0 PM
9 2020-04-21 15:13:06 2020 4 21 15 13 6 PM
10 2020-04-21 17:52:47 2020 4 21 17 52 47 PM
# ℹ 318 more rows
3 ways to represent time spans (From lubridate cheatsheet)
lubridate: Durations
Functions that create durations are prefixed with d (e.g., dyears(), dweeks(), ddays(), dhours(), dminutes(), dseconds()).
ddays(1) creates a duration of 86400s, using the standard conversion of 60 seconds in a minute, 60 minutes in an hour, and 24 hours in a day.
When printed, the duration is shown as "86400s (~1 days)": it is only approximately 1 day, since not all days have 24 hours.
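The ddays(1) behavior described here can be checked with a short sketch. The name one_day is hypothetical.

```r
library(lubridate)

one_day <- ddays(1)
one_day              # prints the length in seconds with an approximate unit
class(one_day)       # "Duration"

# A duration is stored as a number of seconds
as.numeric(one_day)  # 86400

# Same value from the standard conversion: 60 s/min * 60 min/hr * 24 hr/day
60 * 60 * 24
```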
In the case of daylight saving time, one particular day may have 25 hours, so the duration of that day is 90000s (one standard 24-hour day plus 1 hour) rather than 86400s.
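One way to write such a 25-hour duration is to add an hour to a standard day. This is a sketch under that assumption, since the original expression is not shown here. The name long_day is hypothetical.

```r
library(lubridate)

# A 25-hour day: one standard 24-hour day plus the extra hour
long_day <- ddays(1) + dhours(1)
long_day

# 86400 + 3600 seconds
as.numeric(long_day)  # 90000

# Equivalent to a duration of 25 hours
as.numeric(long_day) == as.numeric(dhours(25))
```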
as.duration() to get duration of an interval
If we use as.duration() to get the duration of scorpio_interval, we see that it is a duration of 2595600 seconds. It takes into account the extra 1 hour gained when daylight saving time ends:
# The output below assumes a US Pacific system timezone, where daylight saving time ended on 2019-11-03
scorpio_start <- ymd("2019-10-23", tz = Sys.timezone())
scorpio_end <- ymd("2019-11-22", tz = Sys.timezone())
# `%--%` is shorthand for `interval(scorpio_start, scorpio_end)`
scorpio_interval <- scorpio_start %--% scorpio_end
# Duration is 2595600 seconds, which is equivalent to 30 24-hr days + 1 additional hour
scorpio_duration <- as.duration(scorpio_interval)
scorpio_duration
[1] "2595600s (~4.29 weeks)"
class(scorpio_duration)
[1] "Duration"
attr(,"package")
[1] "lubridate"
# Using the standard 60s/min, 60min/hr, 24hr/day conversion,
# confirm duration is slightly more than 30 "standard" (i.e., 24-hr) days
2595600 / (60 * 60 * 24)
[1] 30.04167
# Specifically, it is 30 days + 1 hour, if we define a day to have 24 hours
seconds_to_period(scorpio_duration)
[1] "30d 1H 0M 0S"
Because durations work with physical time, when we add a duration of 30 days to the scorpio_start datetime object, we do not get the end datetime we’d expect:
scorpio_start
[1] "2019-10-23 PDT"
# After adding 30 day duration, we do not get the expected end datetime
# `ddays(30)` adds the number of seconds in 30 standard 24-hr days, but one of the days has 25 hours
scorpio_start + ddays(30)
[1] "2019-11-21 23:00:00 PST"
# We need to add the additional 1 hour of physical time that elapsed during this time span
scorpio_start + ddays(30) + dhours(1)
[1] "2019-11-22 PST"
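Because the code above uses Sys.timezone(), its results depend on the machine it runs on. The sketch below pins the timezone to "America/Los_Angeles", which is an assumption chosen to match the PDT/PST output shown, so the same arithmetic can be reproduced anywhere. The names tz_la, start_la, end_la, and dur_la are hypothetical.

```r
library(lubridate)

tz_la <- "America/Los_Angeles"  # assumption: matches the PDT/PST output in these notes

start_la <- ymd("2019-10-23", tz = tz_la)
end_la <- ymd("2019-11-22", tz = tz_la)

# Physical time elapsed between the two midnights
dur_la <- as.duration(interval(start_la, end_la))
as.numeric(dur_la)  # 30 * 86400 + 3600 = 2595600 seconds

# Adding 30 standard days stops one hour short of the end datetime
start_la + ddays(30)

# Adding the extra hour reaches the end datetime
start_la + ddays(30) + dhours(1) == end_la
```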
These materials were adapted from Ozan Jaquette’s EDUC 260A Course: Introduction to Programming and Data Management, Strings & Dates and EDUC 260B Course: Fundamentals of Programming, Strings & Regex
PSC 290 - Data Management and Cleaning