Week 7 Workbook

Author

Emorie D Beck


Week 7 - Strings & Dates

Outline

Questions on Homework
Strings
Regex
Dates

Dataset we will use

We will use rtweet to pull Twitter data from the PAC-12 universities. We will use the university admissions Twitter handle if there is one, or the main Twitter handle for the university if there isn’t one:

Code

# Load previously pulled Twitter data
p12_full_df <- readRDS("week7-data.RDS")
glimpse(p12_full_df)

Rows: 328
Columns: 90
$ user_id                 <chr> "22080148", "22080148", "22080148", "22080148"…
$ status_id               <chr> "1254177694599675904", "1253431405993840646", …
$ created_at              <dttm> 2020-04-25 22:37:18, 2020-04-23 21:11:49, 202…
$ screen_name             <chr> "WSUPullman", "WSUPullman", "WSUPullman", "WSU…
$ text                    <chr> "Big Dez is headed to Indy!\n\n#GoCougs | #NFL…
$ source                  <chr> "Twitter for iPhone", "Twitter Web App", "Twit…
$ display_text_width      <dbl> 125, 58, 246, 83, 56, 64, 156, 271, 69, 140, 4…
$ reply_to_status_id      <chr> NA, NA, NA, NA, NA, NA, NA, NA, "1252615862659…
$ reply_to_user_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, "22080148", NA…
$ reply_to_screen_name    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "WSUPullman", …
$ is_quote                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ is_retweet              <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ favorite_count          <int> 0, 322, 30, 55, 186, 53, 22, 44, 11, 0, 69, 42…
$ retweet_count           <int> 230, 32, 1, 5, 0, 3, 2, 6, 2, 6, 3, 4, 5, 5, 2…
$ quote_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ reply_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hashtags                <list> <"GoCougs", "NFLDraft2020", "NFLCougs">, <"WS…
$ symbols                 <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ urls_url                <list> NA, NA, NA, NA, NA, NA, NA, "commencement.wsu…
$ urls_t.co               <list> NA, NA, NA, NA, NA, NA, NA, "https://t.co/RR4…
$ urls_expanded_url       <list> NA, NA, NA, NA, NA, NA, NA, "https://commence…
$ media_url               <list> "http://pbs.twimg.com/ext_tw_video_thumb/1254…
$ media_t.co              <list> "https://t.co/NdGsvXnij7", "https://t.co/0OWG…
$ media_expanded_url      <list> "https://twitter.com/WSUCougarFB/status/12541…
$ media_type              <list> "photo", "photo", "photo", "photo", "photo", …
$ ext_media_url           <list> "http://pbs.twimg.com/ext_tw_video_thumb/1254…
$ ext_media_t.co          <list> "https://t.co/NdGsvXnij7", "https://t.co/0OWG…
$ ext_media_expanded_url  <list> "https://twitter.com/WSUCougarFB/status/12541…
$ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ mentions_user_id        <list> <"1250265324", "1409024796", "180884045">, NA…
$ mentions_screen_name    <list> <"WSUCougarFB", "dadpat7", "Colts">, NA, "WSU…
$ lang                    <chr> "en", "en", "en", "en", "en", "en", "en", "en"…
$ quoted_status_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "12529…
$ quoted_text             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "My WS…
$ quoted_created_at       <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2020-…
$ quoted_source           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Twitt…
$ quoted_favorite_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 209, N…
$ quoted_retweet_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 6, NA,…
$ quoted_user_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "43947…
$ quoted_screen_name      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "maddd…
$ quoted_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Maddy…
$ quoted_followers_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 629, N…
$ quoted_friends_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 382, N…
$ quoted_statuses_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 8881, …
$ quoted_location         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Seatt…
$ quoted_description      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "WSU A…
$ quoted_verified         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, FALSE,…
$ retweet_status_id       <chr> "1254159118996127746", NA, NA, NA, NA, NA, NA,…
$ retweet_text            <chr> "Big Dez is headed to Indy!\n\n#GoCougs | #NFL…
$ retweet_created_at      <dttm> 2020-04-25 21:23:29, NA, NA, NA, NA, NA, NA, …
$ retweet_source          <chr> "Twitter for iPhone", NA, NA, NA, NA, NA, NA, …
$ retweet_favorite_count  <int> 1402, NA, NA, NA, NA, NA, NA, NA, NA, 26, NA, …
$ retweet_retweet_count   <int> 230, NA, NA, NA, NA, NA, NA, NA, NA, 6, NA, NA…
$ retweet_user_id         <chr> "1250265324", NA, NA, NA, NA, NA, NA, NA, NA, …
$ retweet_screen_name     <chr> "WSUCougarFB", NA, NA, NA, NA, NA, NA, NA, NA,…
$ retweet_name            <chr> "Washington State Football", NA, NA, NA, NA, N…
$ retweet_followers_count <int> 77527, NA, NA, NA, NA, NA, NA, NA, NA, 996, NA…
$ retweet_friends_count   <int> 1448, NA, NA, NA, NA, NA, NA, NA, NA, 316, NA,…
$ retweet_statuses_count  <int> 15363, NA, NA, NA, NA, NA, NA, NA, NA, 1666, N…
$ retweet_location        <chr> "Pullman, WA", NA, NA, NA, NA, NA, NA, NA, NA,…
$ retweet_description     <chr> "Official Twitter home of Washington State Cou…
$ retweet_verified        <lgl> TRUE, NA, NA, NA, NA, NA, NA, NA, NA, FALSE, N…
$ place_url               <chr> NA, NA, NA, NA, NA, "https://api.twitter.com/1…
$ place_name              <chr> NA, NA, NA, NA, NA, "Pullman", NA, NA, NA, NA,…
$ place_full_name         <chr> NA, NA, NA, NA, NA, "Pullman, WA", NA, NA, NA,…
$ place_type              <chr> NA, NA, NA, NA, NA, "city", NA, NA, NA, NA, "c…
$ country                 <chr> NA, NA, NA, NA, NA, "United States", NA, NA, N…
$ country_code            <chr> NA, NA, NA, NA, NA, "US", NA, NA, NA, NA, "US"…
$ geo_coords              <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
$ coords_coords           <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
$ bbox_coords             <list> <NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, NA…
$ status_url              <chr> "https://twitter.com/WSUPullman/status/1254177…
$ name                    <chr> "WSU Pullman", "WSU Pullman", "WSU Pullman", "…
$ location                <chr> "Pullman, Washington USA", "Pullman, Washingto…
$ description             <chr> "We are an award-winning research university i…
$ url                     <chr> "http://t.co/VxKZH9BuMS", "http://t.co/VxKZH9B…
$ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ followers_count         <int> 43914, 43914, 43914, 43914, 43914, 43914, 4391…
$ friends_count           <int> 9717, 9717, 9717, 9717, 9717, 9717, 9717, 9717…
$ listed_count            <int> 556, 556, 556, 556, 556, 556, 556, 556, 556, 5…
$ statuses_count          <int> 15234, 15234, 15234, 15234, 15234, 15234, 1523…
$ favourites_count        <int> 20124, 20124, 20124, 20124, 20124, 20124, 2012…
$ account_created_at      <dttm> 2009-02-26 23:39:34, 2009-02-26 23:39:34, 200…
$ verified                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ profile_url             <chr> "http://t.co/VxKZH9BuMS", "http://t.co/VxKZH9B…
$ profile_expanded_url    <chr> "http://www.wsu.edu", "http://www.wsu.edu", "h…
$ account_lang            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/2208014…
$ profile_background_url  <chr> "http://abs.twimg.com/images/themes/theme5/bg.…
$ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/576502906…

Code

p12_df <- p12_full_df |> 
  select(user_id, created_at, screen_name, text, location)
head(p12_df)

String basics

What are strings?

A string is a type of data in R
You can create strings using either single quotes (') or double quotes (")
- Whichever you use, R displays strings using double quotes
The class() and typeof() of a string are both character

Creating Strings

Creating string using single quotes

Notice how R displays the string using double quotes:

Code

my_string <- 'This is a string'
my_string

[1] "This is a string"

Creating string using double quotes

Code

my_string <- "Strings can also contain numbers: 123"
my_string

[1] "Strings can also contain numbers: 123"

Checking class and type of strings

Code

class(my_string)

[1] "character"

Code

typeof(my_string)

[1] "character"

Quotes in quotes

Note: To include quotes as part of the string, we can either use the other type of quotes to surround the string (i.e., ' or ") or escape the quote using a backslash (\).

Code

# Include quote by using the other type of quotes to surround the string 
my_string <- "There's no issues with this string."
my_string

[1] "There's no issues with this string."

Code

# Include quote of the same type by escaping it with a backslash
my_string <- 'There\'s no issues with this string.'
my_string

[1] "There's no issues with this string."

Code

# This would not work
my_string <- 'There's an issue with this string.'
my_string
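To see what an escaped quote actually stores, it can help to compare print() with writeLines(). The following is a minimal sketch; the variable name quoted and its contents are made up for illustration:

```r
library(stringr)

# Hypothetical example string that contains double quotes
quoted <- "She said \"hi\""

print(quoted)       # print() shows the escaping backslashes
writeLines(quoted)  # writeLines() shows the raw contents of the string
str_length(quoted)  # the backslashes are not counted as characters
```

The backslashes are only part of how the string is typed and printed; they are not stored in the string itself.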

`stringr` package

“A consistent, simple and easy to use set of wrappers around the fantastic stringi package. All function and argument names (and positions) are consistent, all functions deal with NA’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another.”

Credit: stringr R documentation

The stringr package is built on top of the stringi package and is part of the tidyverse
stringr contains functions to work with strings
For many functions in the stringr package, there are equivalent “base R” functions
But stringr functions all follow the same rules, while rules often differ across different “base R” string functions, so we will focus exclusively on stringr functions
Most stringr functions start with str_ (e.g., str_length)
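As one illustration of how the rules can differ, here is a small sketch comparing how base R and stringr treat a missing value. The specific comparison (nchar() and paste() versus str_length() and str_c()) was chosen for this example and is not taken from the stringr documentation quoted above:

```r
library(stringr)

# Base R: nchar() and paste() treat a logical NA as the text "NA"
nchar(NA)       # 2
paste("a", NA)  # "a NA"

# stringr: missing values stay missing
str_length(NA)  # NA
str_c("a", NA)  # NA
```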

`str_length()`

The str_length() function:

Function: Find string length

Code

?str_length

Code

# SYNTAX
str_length(string)

Arguments:
- string: Character vector (or vector coercible to character)

Note that str_length() calculates the number of characters in each string, whereas the length() function (which is not part of the stringr package) calculates the number of elements in an object

Using str_length() on string

Code

str_length("cats")

[1] 4

Compare to length(), which treats the string as a single object:

Code

length("cats")

[1] 1

str_length() on character vector

Code

str_length(c("cats", "in", "hat"))

[1] 4 2 3

Compare to length(), which finds the number of elements in the vector:

Code

length(c("cats", "in", "hat"))

[1] 3

Using str_length() on other vectors coercible to character

Logical vectors can be coerced to character vectors:

Code

str_length(c(TRUE, FALSE))

[1] 4 5

Numeric vectors can be coerced to character vectors:

Code

str_length(c(1, 2.5, 3000))

[1] 1 3 4

Integer vectors can be coerced to character vectors:

Code

str_length(c(2L, 100L))

[1] 1 3

Using str_length() on dataframe column

Recall that the columns in a dataframe are just vectors, so we can use str_length() as long as the vector is coercible to character type.

Code

str_length(p12_df$screen_name[1:20])

 [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

Code

p12_df %>% select(screen_name) %>% unique() %>% 
  mutate(screen_name_len = str_length(screen_name))

`str_c()`

The str_c() function:

Function: Concatenate strings between vectors (element-wise)

Code

?str_c

# SYNTAX AND DEFAULT VALUES
str_c(..., sep = "", collapse = NULL)

Arguments:
- The input is one or more character vectors (or vectors coercible to character)
  - Zero length arguments are removed
  - Short arguments are recycled to the length of the longest
- sep: String to insert between input vectors
- collapse: Optional string used to combine input vectors into single string
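The argument rules above can be sketched with a short example. The handles vector below is illustrative only; it is not pulled from p12_df:

```r
library(stringr)

# Illustrative values (not read from the workbook data)
handles <- c("WSUPullman", "UW")

# The length-1 string "@" is recycled to match the longer vector
with_at <- str_c("@", handles)
with_at

# NULL is a zero-length argument, so it is dropped
no_null <- str_c("a", NULL, "b")
no_null

# sep goes between the input vectors; collapse then joins the results
one_string <- str_c("@", handles, sep = "", collapse = ", ")
one_string
```

This sketch only recycles a length-1 argument, which behaves the same way across stringr versions.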

Using str_c() on one vector

Since we only provided one input vector, it has nothing to concatenate with, so str_c() will just return the same vector:

Code

str_c(c("a", "b", "c"))

[1] "a" "b" "c"

Note that specifying the sep argument will also not have any effect because we only have one input vector, and sep is the separator between multiple vectors:

Code

str_c(c("a", "b", "c"), sep = "~")

[1] "a" "b" "c"

Code

# Check length: Output is the original vector of 3 elements
str_c(c("a", "b", "c")) %>% length()

[1] 3

As seen above, str_c() returns a vector by default (because the default value for the collapse argument is NULL).
But we can specify a string for collapse in order to collapse the elements of the output vector into a single string:

Code

str_c(c("a", "b", "c"), collapse = "|")

[1] "a|b|c"

Code

# Check length: Output vector of length 3 is collapsed into a single string
str_c(c("a", "b", "c"), collapse = "|") %>% length()

[1] 1

Code

# Check str_length: This gives the length of the collapsed string, which is 5 characters long
str_c(c("a", "b", "c"), collapse = "|") %>% str_length()

[1] 5

Using str_c() on more than one vector

When we provide multiple input vectors, the vectors are concatenated element-wise (i.e., the 1st elements of each vector are concatenated, the 2nd elements of each vector are concatenated, etc.):

Code

str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"))

[1] "ax!" "by?" "cz;"

The default separator for each element-wise concatenation is an empty string (""), but we can customize that by specifying the sep argument:

Code

str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~")

[1] "a~x~!" "b~y~?" "c~z~;"

Code

# Check length: Output vector is same length as input vectors
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~") %>% length()

[1] 3

Again, we can specify the collapse argument in order to collapse the elements of the output vector into a single string:

Code

str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), collapse = "|")

[1] "ax!|by?|cz;"

Code

# Check length: Output vector of length 3 is collapsed into a single string
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), collapse = "|") %>% length()

[1] 1

Code

# Specifying both `sep` and `collapse`
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~", collapse = "|")

[1] "a~x~!|b~y~?|c~z~;"

`str_sub()`

The str_sub() function:

Function: Subset strings
Arguments:
- string: Character vector (or vector coercible to character)
- start: Position of first character to be included in substring (default: 1)
- end: Position of last character to be included in substring (default: -1)
  - Negative index = counting backwards
- omit_na: If TRUE, missing values in any of the arguments provided will result in an unchanged input

Code

?str_sub

# SYNTAX AND DEFAULT VALUES
str_sub(string, start = 1L, end = -1L)
str_sub(string, start = 1L, end = -1L, omit_na = FALSE) <- value

When str_sub() is used in the assignment form, you can replace the subsetted part of the string with a value of your choice
- If an element in the vector is too short to meet the subset specification, the replacement value will be concatenated to the end of that element
- Note that this modifies your input vector directly, so you must have the vector saved to a variable (see example below)

Using str_sub() to subset strings

If no start and end positions are specified, str_sub() will by default return the entire (original) string:

Code

str_sub(string = c("abcdefg", 123, TRUE))

[1] "abcdefg" "123"     "TRUE"

Note that if an element is shorter than the specified end (i.e., 123 in the example below), it will just include all the available characters that it does have:

Code

str_sub(string = c("abcdefg", 123, TRUE), start = 2, end = 4)

[1] "bcd" "23"  "RUE"

Remember that we can also use a negative index to count positions from the end of the string:

Code

str_sub(c("abcdefg", 123, TRUE), start = 2, end = -2)

[1] "bcdef" "2"     "RU"

Using str_sub() to replace strings

With the default start and end positions (1 and -1), str_sub() selects the entire original string, so the assignment form replaces the entire string:

Code

v <- c("A", "AB", "ABC", "ABCD", "ABCDE")
str_sub(v, start = 1, end = -1)

[1] "A"     "AB"    "ABC"   "ABCD"  "ABCDE"

Code

str_sub(v, start = 1, end = -1) <- "*"
v

[1] "*" "*" "*" "*" "*"
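Replacing only part of each string shows what happens when an element is too short for the requested positions. This is a minimal sketch that reuses the same vector with start and end positions picked for illustration:

```r
library(stringr)

v <- c("A", "AB", "ABC", "ABCD", "ABCDE")

# Replace characters 2 through 3 of each element with "*"
str_sub(v, start = 2, end = 3) <- "*"

# "A" has no 2nd character, so "*" is appended to the end of it
v
```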

Using str_sub() on dataframe column

We can use as.character() to turn the created_at value to a string, then use str_sub() to extract out various date/time components from the string:

Code

p12_datetime_df <- p12_df %>% select(created_at) %>%
  mutate(
      dt_chr = as.character(created_at),
      date_chr = str_sub(dt_chr, 1, 10),
      yr_chr = str_sub(dt_chr, 1, 4),
      mth_chr = str_sub(dt_chr, 6, 7),
      day_chr = str_sub(dt_chr, 9, 10),
      hr_chr = str_sub(dt_chr, -8, -7),
      min_chr = str_sub(dt_chr, -5, -4),
      sec_chr = str_sub(dt_chr, -2, -1)
    )

p12_datetime_df

Other `stringr` functions

Other useful stringr functions:

str_to_upper(): Turn strings to uppercase
str_to_lower(): Turn strings to lowercase
str_sort(): Sort a character vector
str_trim(): Trim whitespace from strings (including \n, \t, etc.)
str_pad(): Pad strings with specified character

Using `str_to_upper()` to turn strings to uppercase

Turn column names of p12_df to uppercase:

Code

# Column names are originally lowercase
names(p12_df)

[1] "user_id"     "created_at"  "screen_name" "text"        "location"

Code

# Turn column names to uppercase
names(p12_df) <- str_to_upper(names(p12_df))
names(p12_df)

[1] "USER_ID"     "CREATED_AT"  "SCREEN_NAME" "TEXT"        "LOCATION"

Using `str_to_lower()` to turn strings to lowercase

Turn column names of p12_df to lowercase:

Code

# Column names are originally uppercase
names(p12_df)

[1] "USER_ID"     "CREATED_AT"  "SCREEN_NAME" "TEXT"        "LOCATION"

Code

# Turn column names to lowercase
names(p12_df) <- str_to_lower(names(p12_df))
names(p12_df)

[1] "user_id"     "created_at"  "screen_name" "text"        "location"

Using `str_sort()` to sort character vector

Sort the vector of p12_df column names:

Code

# Before sort
names(p12_df)

[1] "user_id"     "created_at"  "screen_name" "text"        "location"

Code

# Sort alphabetically (default)
str_sort(names(p12_df))

[1] "created_at"  "location"    "screen_name" "text"        "user_id"

Code

# Sort reverse alphabetically
str_sort(names(p12_df), decreasing = TRUE)

[1] "user_id"     "text"        "screen_name" "location"    "created_at"

Using `str_trim()` to trim whitespace from string

Code

# Trim whitespace from both left and right sides (default)
str_trim(c("\nABC ", " XYZ\t"))

[1] "ABC" "XYZ"

Code

# Trim whitespace from left side
str_trim(c("\nABC ", " XYZ\t"), side = "left")

[1] "ABC "  "XYZ\t"

Code

# Trim whitespace from right side
str_trim(c("\nABC ", " XYZ\t"), side = "right")

[1] "\nABC" " XYZ"

Using `str_pad()` to pad string with character

Let’s say we have a vector of zip codes that has lost all leading 0’s. We can use str_pad() to add that back in:

Code

# Pad the left side of strings with "0" until width of 5 is reached
str_pad(c(95035, 90024, 5009, 5030), width = 5, side = "left", pad = "0")

[1] "95035" "90024" "05009" "05030"

Regular expression basics

Example of using regular expression in action:

How can we match all occurrences of times in the following string? (i.e., 10 AM and 1 PM)
- "Class starts at 10 AM and ends at 1 PM."
The regular expression \d+ [AP]M can!

Code

my_string <- "Class starts at 10 AM and ends at 1 PM."
my_regex <- "\\d+ [AP]M"

# The escaped string "\\d" results in the regex \d
print(my_regex)

[1] "\\d+ [AP]M"

Code

writeLines(my_regex)

\d+ [AP]M

Code

# View matches for our regular expression
# (as of stringr 1.5.0, str_view() also shows all matches and str_view_all() is deprecated)
str_view_all(string = my_string, pattern = my_regex)

[1] │ Class starts at <10 AM> and ends at <1 PM>.

How the regular expression \d+ [AP]M works:
- \d+ matches 1 or more digits in a row
  - \d means match all numeric digits (i.e., 0-9)
  - + means match 1 or more of
- " " (a single space) matches a literal space
- [AP]M matches either AM or PM
  - [AP] means match either an A or P at that position
  - M means match a literal M
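The same pattern can also be used to pull the matches out of the string rather than just view them. A brief sketch, using str_extract_all() and str_count() from stringr:

```r
library(stringr)

my_string <- "Class starts at 10 AM and ends at 1 PM."
my_regex <- "\\d+ [AP]M"

# str_extract_all() returns a list with one element per input string
times <- str_extract_all(my_string, my_regex)[[1]]
times

# str_count() returns the number of matches in each string
n_times <- str_count(my_string, my_regex)
n_times
```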

Some common regular expression patterns include (not exhaustive):

Character classes
Quantifiers
Anchors
Sets and ranges
Groups and backreferences

Credit: DaveChild Regular Expression Cheat Sheet

Character classes

STRING	REGEX	MATCHES
`"\\d"`	`\d`	any digit
`"\\D"`	`\D`	any non-digit
`"\\s"`	`\s`	any whitespace
`"\\S"`	`\S`	any non-whitespace
`"\\w"`	`\w`	any word character
`"\\W"`	`\W`	any non-word character

Credit: Working with strings in stringr Cheat sheet

There are certain character classes in regular expression that have special meaning. For example:
- \d is used to match any digit (i.e., number)
- \s is used to match any whitespace (i.e., space, tab, or newline character)
- \w is used to match any word character (i.e., alphanumeric character or underscore)
“But wait… there’s more! Before a regex is interpreted as a regular expression, it is also interpreted by R as a string. And backslash is used to escape there as well. So, in the end, you need to preprend two backslashes…”
Credit: Escaping sequences from Stat 545
This means that in R, when we want to use the regular expression patterns \d, \s, \w, etc. to match strings, we must write them as the strings "\\d", "\\s", "\\w", etc.
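A quick way to check the double-backslash rule is to look at what the string actually contains. A small sketch (the variable name regex_digit and the test strings are chosen for illustration):

```r
library(stringr)

regex_digit <- "\\d"

writeLines(regex_digit)  # shows the regex as the regex engine sees it
str_length(regex_digit)  # 2 characters: one backslash plus the letter d

# The two-character regex \d matches any digit
has_digit <- str_detect(c("Covid-19", "GoHuskies"), regex_digit)
has_digit
```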

Using `\d` & `\D` to match digits & non-digits

Goal: write a regular expression pattern that matches any digit in the string p12_df$text[119]
We can use \d to match all instances of a digit (i.e., number):

Code

# Match any instances of a digit
str_view_all(string = p12_df$text[119], pattern = "\\d")

[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-<1><9> in our labs and hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/<4>YSf<4>SpPe<0>

KEY POINT WITH REGEX

Our regular expression is the value we specify for the pattern argument above; this is our “regex object”
We want our regex object to include the regular expression \d, which matches any digit
We specify our regex object as "\\d" rather than "\d"

Use regular expression `\D` to match all instances of a non-digit character:

Code

# Match any instances of a non-digit
str_view_all(string = p12_df$text[119], pattern = "\\D")

[1] │ <"><I>< ><s><t><a><n><d>< ><w><i><t><h>< ><m><y>< ><c><o><l><l><e><a><g><u><e><s>< ><a><t>< ><@><U><W>< ><a><n><d>< ><A><m><e><r><i><c><a><'><s>< ><l><e><a><d><i><n><g>< ><r><e><s><e><a><r><c><h>< ><u><n><i><v><e><r><s><i><t><i><e><s>< ><a><s>< ><t><h><e><y>< ><t><a><k><e>< ><f><i><g><h><t>< ><t><o>< ><C><o><v><i><d><->19< ><i><n>< ><o><u><r>< ><l><a><b><s>< ><a><n><d>< ><h><o><s><p><i><t><a><l><s><.><"><
    │ ><
    │ ><#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m>< ><x>< ><#><A><l><w><a><y><s><C><o><m><p><e><t><e>< ><x>< ><#><G><o><H><u><s><k><i><e><s>< ><h><t><t><p><s><:></></><t><.><c><o></>4<Y><S><f>4<S><p><P><e>0

Match all instances of a digit followed by a non-digit character:

Code

str_view_all(string = p12_df$text[119], pattern = "\\d\\D")

[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-1<9 >in our labs and hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/<4Y>Sf<4S>pPe0

Using `\s` & `\S` to match whitespace & non-whitespace

We can use \s to match all instances of a whitespace (i.e., space, tab, or newline character):

Code

# Match any instances of a whitespace
str_view_all(
  string = p12_df$text[119]
  , pattern = "\\s"
  )

[1] │ "I< >stand< >with< >my< >colleagues< >at< >@UW< >and< >America's< >leading< >research< >universities< >as< >they< >take< >fight< >to< >Covid-19< >in< >our< >labs< >and< >hospitals."<
    │ ><
    │ >#ProudToBeOnTheirTeam< >x< >#AlwaysCompete< >x< >#GoHuskies< >https://t.co/4YSf4SpPe0

We can use \S to match all instances of a non-whitespace character:

Code

# Match any instances of a non-whitespace
str_view_all(
  string = p12_df$text[119]
  , pattern = "\\S"
  )

[1] │ <"><I> <s><t><a><n><d> <w><i><t><h> <m><y> <c><o><l><l><e><a><g><u><e><s> <a><t> <@><U><W> <a><n><d> <A><m><e><r><i><c><a><'><s> <l><e><a><d><i><n><g> <r><e><s><e><a><r><c><h> <u><n><i><v><e><r><s><i><t><i><e><s> <a><s> <t><h><e><y> <t><a><k><e> <f><i><g><h><t> <t><o> <C><o><v><i><d><-><1><9> <i><n> <o><u><r> <l><a><b><s> <a><n><d> <h><o><s><p><i><t><a><l><s><.><">
    │ 
    │ <#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m> <x> <#><A><l><w><a><y><s><C><o><m><p><e><t><e> <x> <#><G><o><H><u><s><k><i><e><s> <h><t><t><p><s><:></></><t><.><c><o></><4><Y><S><f><4><S><p><P><e><0>

Using `\w` & `\W` to match words & non-words

We can use \w to match all instances of a word character (i.e., alphanumeric character or underscore):

Code

# Match any instances of a word character
str_view_all(
  string = p12_df$text[119]
  , pattern = "\\w"
  )

[1] │ "<I> <s><t><a><n><d> <w><i><t><h> <m><y> <c><o><l><l><e><a><g><u><e><s> <a><t> @<U><W> <a><n><d> <A><m><e><r><i><c><a>'<s> <l><e><a><d><i><n><g> <r><e><s><e><a><r><c><h> <u><n><i><v><e><r><s><i><t><i><e><s> <a><s> <t><h><e><y> <t><a><k><e> <f><i><g><h><t> <t><o> <C><o><v><i><d>-<1><9> <i><n> <o><u><r> <l><a><b><s> <a><n><d> <h><o><s><p><i><t><a><l><s>."
    │ 
    │ #<P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m> <x> #<A><l><w><a><y><s><C><o><m><p><e><t><e> <x> #<G><o><H><u><s><k><i><e><s> <h><t><t><p><s>://<t>.<c><o>/<4><Y><S><f><4><S><p><P><e><0>

We can use \W to match all instances of a non-word character:

Code

# Match any instances of a non-word character
str_view_all(
  string = p12_df$text[119]
  , pattern = "\\W"
  )

[1] │ <">I< >stand< >with< >my< >colleagues< >at< ><@>UW< >and< >America<'>s< >leading< >research< >universities< >as< >they< >take< >fight< >to< >Covid<->19< >in< >our< >labs< >and< >hospitals<.><"><
    │ ><
    │ ><#>ProudToBeOnTheirTeam< >x< ><#>AlwaysCompete< >x< ><#>GoHuskies< >https<:></></>t<.>co</>4YSf4SpPe0

This matches all instances of 3-letter words, together with the non-word character on each side:

Code

str_view_all(
  string = p12_df$text[119]
  , pattern = "\\W\\w\\w\\w\\W"
  )

[1] │ "I stand with my colleagues at @UW< and >America's leading research universities as they take fight to Covid-19 in< our >labs< and >hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

Wrap-Up: Character Classes

The table below shows other regular expressions involving backslashes.
This includes special characters like \n and \t, as well as using a backslash to escape characters that have special meanings in regex, like . or ? (as we will soon see).
So to match a literal period or question mark, we need to use the regex \. and \?, or the strings "\\." and "\\?" in R.

STRING	REGEX	MATCHES
`"\\n"`	`\n`	newline
`"\\t"`	`\t`	tab
`"\\\\"`	`\\`	`\`
`"\\."`	`\.`	`.`
`"\\?"`	`\?`	`?`
`"\\("`	`\(`	`(`
`"\\)"`	`\)`	`)`
`"\\{"`	`\{`	`{`
`"\\}"`	`\}`	`}`
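To illustrate why escaping matters, the sketch below counts matches of an unescaped and an escaped period in the URL from the example tweet. The use of fixed() as an alternative to escaping is an addition for this example:

```r
library(stringr)

url <- "https://t.co/4YSf4SpPe0"

# Unescaped, . is a special character that matches every character here
n_any <- str_count(url, ".")

# Escaped, \. matches only a literal period
n_period <- str_count(url, "\\.")

# fixed() treats the pattern as literal text, so no escaping is needed
n_fixed <- str_count(url, fixed("."))

c(n_any, n_period, n_fixed)
```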

Quantifiers

Character	Description
`*`	0 or more
`?`	0 or 1
`+`	1 or more
`{3}`	Exactly 3
`{3,}`	3 or more
`{3,5}`	3, 4, or 5

We can use quantifiers to specify the amount of a certain character or expression to match.
The quantifier should directly follow the pattern you want to quantify.
For example, s? matches 0 or 1 s and \d{4} matches exactly 4 digits.
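The two patterns mentioned above can be tried on small inputs. A minimal sketch, with input strings chosen for illustration (the timestamp mirrors the format of account_created_at):

```r
library(stringr)

# s? makes the s optional, so both the singular and plural forms match in full
optional_s <- str_extract(c("lab", "labs"), "labs?")
optional_s

# \d{4} matches exactly 4 digits in a row
year <- str_extract("2009-02-26 23:39:34", "\\d{4}")
year
```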

Using the `*`, `?`, and `+` quantifiers

We can use * to match 0 or more of a pattern:

Code

# Matches all instances of `s` followed by 0 or more non-word characters
str_view_all(string = p12_df$text[119], pattern = "s\\W*")

[1] │ "I <s>tand with my colleague<s >at @UW and America'<s >leading re<s>earch univer<s>itie<s >a<s >they take fight to Covid-19 in our lab<s >and ho<s>pital<s."
    │ 
    │ #>ProudToBeOnTheirTeam x #Alway<s>Compete x #GoHu<s>kie<s >http<s://>t.co/4YSf4SpPe0

We can use ? to match 0 or 1 of a pattern:

Code

# Matches all instances of `s` followed by 0 or 1 non-word character
str_view_all(string = p12_df$text[119], pattern = "s\\W?")

[1] │ "I <s>tand with my colleague<s >at @UW and America'<s >leading re<s>earch univer<s>itie<s >a<s >they take fight to Covid-19 in our lab<s >and ho<s>pital<s.>"
    │ 
    │ #ProudToBeOnTheirTeam x #Alway<s>Compete x #GoHu<s>kie<s >http<s:>//t.co/4YSf4SpPe0

We can use + to match 1 or more of a pattern:

Code

# Matches all instances of `s` followed by 1 or more non-word characters
str_view_all(string = p12_df$text[119], pattern = "s\\W+")

[1] │ "I stand with my colleague<s >at @UW and America'<s >leading research universitie<s >a<s >they take fight to Covid-19 in our lab<s >and hospital<s."
    │ 
    │ #>ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskie<s >http<s://>t.co/4YSf4SpPe0

Code

# Match all Twitter hashtags
  # A hashtag is defined as the # character followed by 1 or more word characters
str_view_all(string = p12_df$text[119], pattern = "#\\w+")

[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
    │ 
    │ <#ProudToBeOnTheirTeam> x <#AlwaysCompete> x <#GoHuskies> https://t.co/4YSf4SpPe0
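The hashtag pattern can also be used to extract or count hashtags. A short sketch, using the second line of the example tweet typed in by hand rather than read from p12_df:

```r
library(stringr)

# Second line of the example tweet, entered manually for this sketch
tweet <- "#ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0"

# Extract every hashtag as a character vector
tags <- str_extract_all(tweet, "#\\w+")[[1]]
tags

# Count the hashtags in the tweet
n_tags <- str_count(tweet, "#\\w+")
n_tags
```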

Using `{...}` to specify how many occurrences to match

We can use {n} to specify the exact number of characters or expressions to match:

Code

# Matches words with exactly 3 letters
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3}\\s")

[1] │ "I stand with my colleagues at @UW< and >America's leading research universities as they take fight to Covid-19 in< our >labs< and >hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

We can use {n,} to specify n as the minimum amount to match:

Code

# Matches words with 3 or more letters
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3,}\\s")

[1] │ "I< stand >with my< colleagues >at @UW< and >America's< leading >research< universities >as< they >take< fight >to Covid-19 in< our >labs< and >hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

We can use {n,m} to specify that we want to match between n and m occurrences (inclusive):

Code

# Matches words with between 3 and 5 letters (inclusive)
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3,5}\\s")

[1] │ "I< stand >with my colleagues at @UW< and >America's leading research universities as< they >take< fight >to Covid-19 in< our >labs< and >hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0
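
The three `{...}` forms are easier to compare on a small input. The sketch below uses a made-up vector `x` (not part of the tweet data) and `str_extract()`, which returns the first match in each string, or `NA` when there is no match:

```r
library(stringr)

# Hypothetical toy vector: runs of 2, 3, and 6 "a" characters
x <- c("aa", "aaa", "aaaaaa")

exactly_3   <- str_extract(x, "a{3}")    # NA "aaa" "aaa"
at_least_3  <- str_extract(x, "a{3,}")   # NA "aaa" "aaaaaa"
from_3_to_5 <- str_extract(x, "a{3,5}")  # NA "aaa" "aaaaa"

exactly_3
at_least_3
from_3_to_5
```

Quantifiers are greedy by default, which is why `a{3,5}` takes 5 of the 6 available characters rather than stopping at 3.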

Anchors

We can use anchors to indicate which part of the string to match.
For example, ^ matches the start of the string, $ matches the end of the string (Notice how we do not need to escape these characters).
\b can be used to help detect word boundaries, and \B can be used to help match characters within a word.

String	Character	Description
`"^"`	`^`	Start of string, or start of line in multi-line pattern
`"$"`	`$`	End of string, or end of line in multi-line pattern
`"\\b"`	`\b`	Word boundary
`"\\B"`	`\B`	Non-word boundary

Using `^` & `$` to match start & end of string

We can use ^ to match the start of a string:

Code

# Matches only the quotation mark at the start of the text and not the end quote
str_view_all(string = p12_df$text[119], pattern = '^"')

[1] │ <">I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

Using `^` & `$` to match start & end of string

We can use $ to match the end of a string:

Code

# Matches only the number at the end of the text and not any other numbers
str_view_all(string = p12_df$text[119], pattern = "\\d$")

[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe<0>

Using `\b` & `\B` to match word boundary & non-word boundary

We can use \b to help detect word boundary:

Code

# Matches all word boundaries
str_view_all(string = p12_df$text[119], pattern = "\\b")

[1] │ "<>I<> <>stand<> <>with<> <>my<> <>colleagues<> <>at<> @<>UW<> <>and<> <>America<>'<>s<> <>leading<> <>research<> <>universities<> <>as<> <>they<> <>take<> <>fight<> <>to<> <>Covid<>-<>19<> <>in<> <>our<> <>labs<> <>and<> <>hospitals<>."
    │ 
    │ #<>ProudToBeOnTheirTeam<> <>x<> #<>AlwaysCompete<> <>x<> #<>GoHuskies<> <>https<>://<>t<>.<>co<>/<>4YSf4SpPe0<>

Code

# Matches words with 3 or more letters using \b
str_view_all(string = p12_df$text[119], pattern = "\\b\\w{3,}\\b")

[1] │ "I <stand> <with> my <colleagues> at @UW <and> <America>'s <leading> <research> <universities> as <they> <take> <fight> to <Covid>-19 in <our> <labs> <and> <hospitals>."
    │ 
    │ #<ProudToBeOnTheirTeam> x #<AlwaysCompete> x #<GoHuskies> <https>://t.co/<4YSf4SpPe0>

Notice how this is much more flexible than using whitespace (\s) to determine word boundaries:

Code

# Matches words with 3 or more letters using \s
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3,}\\s")

[1] │ "I< stand >with my< colleagues >at @UW< and >America's< leading >research< universities >as< they >take< fight >to Covid-19 in< our >labs< and >hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0
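
The difference comes from `\b` being zero-width while `\s` consumes a character. A minimal sketch on a made-up sentence (the string `x` is an assumption for illustration) shows why the `\s` version skips some words:

```r
library(stringr)

# Hypothetical example sentence
x <- "the cat sat on the mat"

# \b matches a position, not a character, so every 3-letter word is found,
# including the first and last words of the string
with_b <- str_extract_all(x, "\\b\\w{3}\\b")[[1]]
with_b

# \s must match an actual whitespace character on both sides. The first and
# last words have no whitespace on one side, and the space after "cat" is
# consumed by the first match, so "sat" cannot match either
with_s <- str_extract_all(x, "\\s\\w{3}\\s")[[1]]
with_s
```

Here `with_b` holds five words, while `with_s` holds only `" cat "` and `" the "`, each with its surrounding spaces included in the match.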

The regular expression \B matches a “non-word boundary”; what does that mean?

Code

str_view_all(string = p12_df$text[119], pattern = "\\B")

[1] │ <>"I s<>t<>a<>n<>d w<>i<>t<>h m<>y c<>o<>l<>l<>e<>a<>g<>u<>e<>s a<>t <>@U<>W a<>n<>d A<>m<>e<>r<>i<>c<>a's l<>e<>a<>d<>i<>n<>g r<>e<>s<>e<>a<>r<>c<>h u<>n<>i<>v<>e<>r<>s<>i<>t<>i<>e<>s a<>s t<>h<>e<>y t<>a<>k<>e f<>i<>g<>h<>t t<>o C<>o<>v<>i<>d-1<>9 i<>n o<>u<>r l<>a<>b<>s a<>n<>d h<>o<>s<>p<>i<>t<>a<>l<>s.<>"<>
    │ <>
    │ <>#P<>r<>o<>u<>d<>T<>o<>B<>e<>O<>n<>T<>h<>e<>i<>r<>T<>e<>a<>m x <>#A<>l<>w<>a<>y<>s<>C<>o<>m<>p<>e<>t<>e x <>#G<>o<>H<>u<>s<>k<>i<>e<>s h<>t<>t<>p<>s:<>/<>/t.c<>o/4<>Y<>S<>f<>4<>S<>p<>P<>e<>0

We can use \B to help match characters within a word:

Code

# Matches only the letter `s` within a word and not at the start or end
str_view_all(string = p12_df$text[119], pattern = "\\Bs\\B")

[1] │ "I stand with my colleagues at @UW and America's leading re<s>earch univer<s>ities as they take fight to Covid-19 in our labs and ho<s>pitals."
    │ 
    │ #ProudToBeOnTheirTeam x #Alway<s>Compete x #GoHu<s>kies https://t.co/4YSf4SpPe0

Sets and ranges

Character	Description
`.`	Match any character except newline (`\n`)
`a\|b`	Match `a` or `b`
`[abc]`	Match either `a`, `b`, or `c`
`[^abc]`	Match anything except `a`, `b`, or `c`
`[a-z]`	Match range of lowercase letters from `a` to `z`
`[A-Z]`	Match range of uppercase letters from `A` to `Z`
`[0-9]`	Match range of numbers from `0` to `9`

The table lists more ways regular expressions give us flexibility in choosing what to match.
The period . acts as a wildcard to match any character except newline.
The vertical bar | is similar to an OR operator. Square brackets [...] can be used to specify a set or range of characters to match (or not to match).

Using `.` as a wildcard

We can use . to match any character except newline (\n):

Code

# Matches any character except newline
str_view_all(string = p12_df$text[119], pattern = ".")

[1] │ <"><I>< ><s><t><a><n><d>< ><w><i><t><h>< ><m><y>< ><c><o><l><l><e><a><g><u><e><s>< ><a><t>< ><@><U><W>< ><a><n><d>< ><A><m><e><r><i><c><a><'><s>< ><l><e><a><d><i><n><g>< ><r><e><s><e><a><r><c><h>< ><u><n><i><v><e><r><s><i><t><i><e><s>< ><a><s>< ><t><h><e><y>< ><t><a><k><e>< ><f><i><g><h><t>< ><t><o>< ><C><o><v><i><d><-><1><9>< ><i><n>< ><o><u><r>< ><l><a><b><s>< ><a><n><d>< ><h><o><s><p><i><t><a><l><s><.><">
    │ 
    │ <#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m>< ><x>< ><#><A><l><w><a><y><s><C><o><m><p><e><t><e>< ><x>< ><#><G><o><H><u><s><k><i><e><s>< ><h><t><t><p><s><:></></><t><.><c><o></><4><Y><S><f><4><S><p><P><e><0>

We can confirm there is a newline in the tweet above by using writeLines() or print():

Code

writeLines(p12_df$text[119])

"I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."

#ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

Code

print(p12_df$text[119])

[1] "\"I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals.\"\n\n#ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0"
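
Because `.` stops at a newline, a pattern such as `.+` only reaches the end of the current line. The sketch below uses a made-up two-line string; `regex(..., dotall = TRUE)` is the stringr modifier that lets `.` match newlines as well:

```r
library(stringr)

# Hypothetical two-line string
x <- "line one\nline two"

# Default behaviour: `.` does not match "\n", so the match ends at the line break
first_line <- str_extract(x, ".+")
first_line

# With dotall = TRUE, `.` also matches "\n" and the match spans both lines
both_lines <- str_extract(x, regex(".+", dotall = TRUE))
both_lines
```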

Using `|` as an OR operator

We can use | to match either one of multiple patterns:

Code

# Matches `research`, `fight`, or `labs`
str_view_all(string = p12_df$text[119], pattern = "research|fight|labs")

[1] │ "I stand with my colleagues at @UW and America's leading <research> universities as they take <fight> to Covid-19 in our <labs> and hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

Code

# Matches hashtags or handles
str_view_all(string = p12_df$text[119], pattern = "@\\w+|#\\w+")

[1] │ "I stand with my colleagues at <@UW> and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
    │ 
    │ <#ProudToBeOnTheirTeam> x <#AlwaysCompete> x <#GoHuskies> https://t.co/4YSf4SpPe0

Using `[...]` to match (or not match) a set or range of characters

We can use [...] to match any set of characters:

Code

# Matches hashtags or handles
str_view_all(string = p12_df$text[119], pattern = "[@#]\\w+")

[1] │ "I stand with my colleagues at <@UW> and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
    │ 
    │ <#ProudToBeOnTheirTeam> x <#AlwaysCompete> x <#GoHuskies> https://t.co/4YSf4SpPe0
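
`str_view_all()` only displays matches. To work with them as data, the same patterns can be passed to `str_extract_all()` or `str_count()`. A small sketch, assuming a made-up tweet string rather than a row of `p12_df`:

```r
library(stringr)

# Hypothetical tweet text
tweet <- "Congrats @UW! #GoHuskies #AlwaysCompete"

# The set `[@#]` and the alternation `@\\w+|#\\w+` find the same matches
tags_set <- str_extract_all(tweet, "[@#]\\w+")[[1]]
tags_or  <- str_extract_all(tweet, "@\\w+|#\\w+")[[1]]
tags_set
identical(tags_set, tags_or)

# Count hashtags only
n_hashtags <- str_count(tweet, "#\\w+")
n_hashtags
```

`str_extract_all()` returns a list with one element per input string, hence the `[[1]]` for a single string.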

Code

# Matches any 2 consecutive vowels
str_view_all(string = p12_df$text[119], pattern = "[aeiouAEIOU]{2}")

[1] │ "I stand with my coll<ea>g<ue>s at @UW and America's l<ea>ding res<ea>rch universit<ie>s as they take fight to Covid-19 in <ou>r labs and hospitals."
    │ 
    │ #Pr<ou>dToB<eO>nTh<ei>rT<ea>m x #AlwaysCompete x #GoHusk<ie>s https://t.co/4YSf4SpPe0

We can also use [...] to match any range of alpha or numeric characters:

Code

# Matches only lowercase x through z or uppercase A through C
str_view_all(string = p12_df$text[119], pattern = "[x-zA-C]")

[1] │ "I stand with m<y> colleagues at @UW and <A>merica's leading research universities as the<y> take fight to <C>ovid-19 in our labs and hospitals."
    │ 
    │ #ProudTo<B>eOnTheirTeam <x> #<A>lwa<y>s<C>ompete <x> #GoHuskies https://t.co/4YSf4SpPe0

Code

# Matches only numbers 1 through 4 or the pound sign
str_view_all(string = p12_df$text[119], pattern = "[1-4#]")

[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-<1>9 in our labs and hospitals."
    │ 
    │ <#>ProudToBeOnTheirTeam x <#>AlwaysCompete x <#>GoHuskies https://t.co/<4>YSf<4>SpPe0

We can use [^...] to indicate we do not want to match the provided set or range of characters:

Code

# Matches any vowels
str_view_all(string = p12_df$text[119], pattern = "[aeiouAEIOU]")

[1] │ "<I> st<a>nd w<i>th my c<o>ll<e><a>g<u><e>s <a>t @<U>W <a>nd <A>m<e>r<i>c<a>'s l<e><a>d<i>ng r<e>s<e><a>rch <u>n<i>v<e>rs<i>t<i><e>s <a>s th<e>y t<a>k<e> f<i>ght t<o> C<o>v<i>d-19 <i>n <o><u>r l<a>bs <a>nd h<o>sp<i>t<a>ls."
    │ 
    │ #Pr<o><u>dT<o>B<e><O>nTh<e><i>rT<e><a>m x #<A>lw<a>ysC<o>mp<e>t<e> x #G<o>H<u>sk<i><e>s https://t.c<o>/4YSf4SpP<e>0

Code

# Matches anything except vowels
str_view_all(string = p12_df$text[119], pattern = "[^aeiouAEIOU]")

[1] │ <">I< ><s><t>a<n><d>< ><w>i<t><h>< ><m><y>< ><c>o<l><l>ea<g>ue<s>< >a<t>< ><@>U<W>< >a<n><d>< >A<m>e<r>i<c>a<'><s>< ><l>ea<d>i<n><g>< ><r>e<s>ea<r><c><h>< >u<n>i<v>e<r><s>i<t>ie<s>< >a<s>< ><t><h>e<y>< ><t>a<k>e< ><f>i<g><h><t>< ><t>o< ><C>o<v>i<d><-><1><9>< >i<n>< >ou<r>< ><l>a<b><s>< >a<n><d>< ><h>o<s><p>i<t>a<l><s><.><"><
    │ ><
    │ ><#><P><r>ou<d><T>o<B>eO<n><T><h>ei<r><T>ea<m>< ><x>< ><#>A<l><w>a<y><s><C>o<m><p>e<t>e< ><x>< ><#><G>o<H>u<s><k>ie<s>< ><h><t><t><p><s><:></></><t><.><c>o</><4><Y><S><f><4><S><p><P>e<0>

Code

# Matches anything that's not uppercase letters
str_view_all(string = p12_df$text[119], pattern = "[^A-Z]+")

[1] │ <">I< stand with my colleagues at @>UW< and >A<merica's leading research universities as they take fight to >C<ovid-19 in our labs and hospitals."
    │ 
    │ #>P<roud>T<o>B<e>O<n>T<heir>T<eam x #>A<lways>C<ompete x #>G<o>H<uskies https://t.co/4>YS<f4>S<p>P<e0>

Notice that [...] only matches a single character (see second to last example above). We need to use quantifiers if we want to match a stretch of characters (see last example above).
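
The single-character behavior is easy to check by counting matches. This sketch uses a made-up string with two phone-like numbers (an assumption for illustration):

```r
library(stringr)

# Hypothetical string
x <- "Call 555-0199 or 555-0123"

# Without a quantifier, each digit is its own match
single_digits <- str_extract_all(x, "[0-9]")[[1]]
length(single_digits)

# With `+`, consecutive digits are grouped into one match
digit_runs <- str_extract_all(x, "[0-9]+")[[1]]
digit_runs

# Negated set with `+`: runs of anything that is not a digit or a space
non_digit_runs <- str_extract_all(x, "[^0-9 ]+")[[1]]
non_digit_runs
```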

Dates and times

“Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time related quirks, and R lacks these capabilities in some situations. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.”

Credit: lubridate documentation

How are dates and times stored in R? (From Dates and Times in R)

The Date class is used for storing dates
- “Internally, Date objects are stored as the number of days since January 1, 1970, using negative numbers for earlier dates. The as.numeric() function can be used to convert a Date object to its internal form.”
POSIX classes can be used for storing date plus times
- “The POSIXct class stores date/time values as the number of seconds since January 1, 1970”
- “The POSIXlt class stores date/time values as a list of components (hour, min, sec, mon, etc.) making it easy to extract these parts”
There is no native R class for storing only time
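
These internal representations can be inspected with base R alone. The sketch below uses dates close to the 1970 origin so the stored numbers are easy to verify by hand; the specific values are chosen for illustration:

```r
# Date: stored as days since 1970-01-01
d <- as.Date("1970-01-10")
days_since_origin <- as.numeric(d)   # 9
days_since_origin

# POSIXct: stored as seconds since 1970-01-01 00:00:00 UTC
ct <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
secs_since_origin <- as.numeric(ct)  # 60
secs_since_origin

# POSIXlt: a list of components that can be accessed by name
lt <- as.POSIXlt(ct)
lt$min          # 1
lt$hour         # 0
lt$mon + 1      # months are stored 0-11, so add 1
lt$year + 1900  # years are stored as years since 1900
```

Note the offsets in `POSIXlt`: `mon` counts from 0 and `year` counts from 1900, which is one reason the lubridate accessors shown later are more convenient.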

Why use date/time objects?

Using date/time objects makes it easier to fetch or modify various date/time components (e.g., year, month, day, day of the week)
- Compared to if the date/time is just stored in a string, these components are not as readily accessible and need to be parsed
You can perform arithmetic with date/time objects (e.g., find the “difference” between two date/time points)
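
A short sketch of both points, using two made-up dates: pulling the month out of a string requires slicing and converting by hand, while a `Date` object supports accessors and subtraction directly.

```r
library(lubridate)

# Hypothetical dates stored as strings
start_chr <- "2020-02-01"
end_chr   <- "2020-03-01"

# As a string, the month has to be cut out by position and converted
month_from_string <- as.integer(substr(end_chr, 6, 7))
month_from_string

# As Date objects, components and differences are available directly
start <- ymd(start_chr)
end   <- ymd(end_chr)

month(end)                         # 3
wday(end)                          # 1 (Sunday)
n_days <- as.numeric(end - start)  # 29, because 2020 is a leap year
n_days
```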

Creating date/time objects

Functions that create date/time objects by parsing character or numeric input:

Create Date object: ymd(), ydm(), mdy(), myd(), dmy(), dym()
- y stands for year, m stands for month, d stands for day
- Select the function that represents the order in which your date input is formatted, and the function will be able to parse your input and create a Date object

Creating POSIXct objects

Create POSIXct object: ymd_h(), ymd_hm(), ymd_hms(), etc.
- h stands for hour, m stands for minute, s stands for second
- For any of the previous 6 date functions, you can append h, hm, or hms if you want to provide additional time information in order to create a POSIXct object
- To force a POSIXct object without providing any time information, you can just provide a timezone (using tz) to one of the date functions and it will assume midnight as the time
- You can use Sys.timezone() to get the timezone for your location

Creating `Date` object from character or numeric input

The lubridate functions are flexible and can parse dates in various formats:

Code

d <- mdy("1/1/2020"); d

[1] "2020-01-01"

Code

d <- mdy("1-1-2020"); d

[1] "2020-01-01"

Code

d <- mdy("Jan. 1, 2020"); d

[1] "2020-01-01"

Code

d <- ymd(20200101); d

[1] "2020-01-01"

Creating `Date` object from character or numeric input

Investigate the Date object:

Code

class(d)

[1] "Date"

Code

typeof(d)

[1] "double"

Code

# Number of days since January 1, 1970
as.numeric(d)

[1] 18262

Creating `POSIXct` object from character or numeric input

The lubridate functions are flexible and can parse AM/PM in various formats:

Code

dt <- mdy_h("12/31/2019 11pm"); dt

[1] "2019-12-31 23:00:00 UTC"

Code

dt <- mdy_hm("12/31/2019 11:59 pm"); dt

[1] "2019-12-31 23:59:00 UTC"

Code

dt <- mdy_hms("12/31/2019 11:59:59 PM"); dt

[1] "2019-12-31 23:59:59 UTC"

Code

dt <- ymd_hms(20191231235959); dt

[1] "2019-12-31 23:59:59 UTC"

Investigate the POSIXct object:

Code

class(dt)

[1] "POSIXct" "POSIXt"

Code

typeof(dt)

[1] "double"

Code

# Number of seconds since January 1, 1970
as.numeric(dt)

[1] 1577836799

We can also create a POSIXct object from a date function by providing a timezone. The time would default to midnight:

Code

dt <- mdy("1/1/2020", tz = "UTC")
dt

[1] "2020-01-01 UTC"

Code

# Number of seconds since January 1, 1970
as.numeric(dt)  # Note that this is indeed 1 sec after the previous example

[1] 1577836800

Creating `Date` objects from dataframe column

Using the p12_datetime_df we created earlier, we can create Date objects from the date_chr column:

Code

# Use `ymd()` to parse the string stored in the `date_chr` column
p12_datetime_df %>% select(created_at, dt_chr, date_chr) %>%
  mutate(date_ymd = ymd(date_chr))

Creating `POSIXct` objects from dataframe column

Using the p12_datetime_df we created earlier, we can recreate the created_at column (class POSIXct) from the dt_chr column (class character):

Code

# Use `ymd_hms()` to parse the string stored in the `dt_chr` column
p12_datetime_df %>% select(created_at, dt_chr) %>%
  mutate(datetime_ymd_hms = ymd_hms(dt_chr))

Creating date/time objects from individual components

Functions that create date/time objects from various date/time components:

Create Date object: make_date()
- Syntax and default values: make_date(year = 1970L, month = 1L, day = 1L)
- All inputs are coerced to integer
Create POSIXct object: make_datetime()
- Syntax and default values: make_datetime(year = 1970L, month = 1L, day = 1L, hour = 0L, min = 0L, sec = 0, tz = "UTC")

There are various ways to pass in the inputs to create the same Date object:

Code

d <- make_date(2020, 1, 1); d

[1] "2020-01-01"

Code

# Characters can be coerced to integers
d <- make_date("2020", "01", "01"); d

[1] "2020-01-01"

Code

# Remember that the default values for month and day would be 1L
d <- make_date(2020); d

[1] "2020-01-01"

Creating `POSIXct` object from individual components

Code

# Inputs should be numeric
d <- make_datetime(2019, 12, 31, 23, 59, 59)
d

[1] "2019-12-31 23:59:59 UTC"

Creating `Date` objects from dataframe columns

Using the p12_datetime_df we created earlier, we can create Date objects from the various date component columns:

Code

# Use `make_date()` to create a `Date` object from the `yr_chr`, `mth_chr`, `day_chr` fields
p12_datetime_df %>% select(created_at, dt_chr, yr_chr, mth_chr, day_chr) %>%
  mutate(date_make_date = make_date(year = yr_chr, month = mth_chr, day = day_chr))

Creating `POSIXct` objects from dataframe columns

Using the p12_datetime_df we created earlier, we can recreate the created_at column (class POSIXct) from the various date and time component columns (class character):

Code

# Use `make_datetime()` to create a `POSIXct` object from the `yr_chr`, `mth_chr`, `day_chr`, `hr_chr`, `min_chr`, `sec_chr` fields
# Convert inputs to integers first
p12_datetime_df %>%
  mutate(datetime_make_datetime = make_datetime(
    as.integer(yr_chr), as.integer(mth_chr), as.integer(day_chr), 
    as.integer(hr_chr), as.integer(min_chr), as.integer(sec_chr)
  )) %>%
  select(datetime_make_datetime, yr_chr, mth_chr, day_chr, hr_chr, min_chr, sec_chr)
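
Since `p12_datetime_df` is built earlier in the workbook, here is a self-contained sketch of the same idea on a stand-in tibble. The name `toy_df` and its two rows are assumptions for illustration; the values mirror the first two `created_at` timestamps shown in the data preview.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for p12_datetime_df: components stored as character
toy_df <- tibble(
  yr_chr  = c("2020", "2020"),
  mth_chr = c("04", "04"),
  day_chr = c("25", "23"),
  hr_chr  = c("22", "21"),
  min_chr = c("37", "11"),
  sec_chr = c("18", "49")
)

# Convert each component to integer, then assemble a POSIXct value (UTC by default)
toy_df <- toy_df %>%
  mutate(datetime = make_datetime(
    as.integer(yr_chr), as.integer(mth_chr), as.integer(day_chr),
    as.integer(hr_chr), as.integer(min_chr), as.integer(sec_chr)
  ))

toy_df$datetime
```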

Date/time object components

Storing data using date/time objects makes it easier to get and set the various date/time components.

Basic accessor functions:

date(): Date component
year(): Year
month(): Month
day(): Day
hour(): Hour
minute(): Minute
second(): Second
week(): Week of the year
wday(): Day of the week (1 for Sunday to 7 for Saturday)
am(): Is it in the am? (returns TRUE or FALSE)
pm(): Is it in the pm? (returns TRUE or FALSE)
To get a date/time component, you can simply pass a date/time object to the function
- Syntax: accessor_function(<date/time_object>)
To set a date/time component, you can assign into the accessor function to change the component
- Syntax: accessor_function(<date/time_object>) <- "new_component"
- Note that am() and pm() can’t be set. Modify the time components instead.

Code

# Create datetime for New Year's Eve
dt <- make_datetime(2019, 12, 31, 23, 59, 59)
dt

[1] "2019-12-31 23:59:59 UTC"

Code

dt %>% class()

[1] "POSIXct" "POSIXt"

Code

date(dt) # Get date

[1] "2019-12-31"

Code

hour(dt) # Get hour

[1] 23

Code

pm(dt)   # Is it pm?

[1] TRUE

Code

wday(dt) # Day of the week (3 = Tuesday)

[1] 3

Code

year(dt) # Get year

[1] 2019

Setting date/time components

Code

week(dt) # Get week of year

[1] 53

Code

# Set week of year (move back 1 week)
week(dt) <- week(dt) - 1
dt

[1] "2019-12-24 23:59:59 UTC"

Code

day(dt) <- 25 # Set day to Christmas Day
dt

[1] "2019-12-25 23:59:59 UTC"

Getting date/time components from dataframe column

Using the p12_datetime_df we created earlier, we can isolate the various date/time components from the POSIXct object in the created_at column:

Code

# The extracted date/time components will be of numeric type
p12_datetime_df %>% select(created_at) %>%
  mutate(
    yr_num = year(created_at),
    mth_num = month(created_at),
    day_num = day(created_at),
    hr_num = hour(created_at),
    min_num = minute(created_at),
    sec_num = second(created_at),
    ampm = ifelse(am(created_at), 'AM', 'PM')  # am()/pm() returns TRUE/FALSE
  )

Time spans

3 ways to represent time spans (From lubridate cheatsheet)

Intervals represent specific intervals of the timeline, bounded by start and end date-times
- Example: People with birthdays between the interval October 23 to November 22 are Scorpios
Periods track changes in clock times, which ignore time line irregularities
- Example: Daylight savings time ends at the beginning of November and we gain an hour - this extra hour is ignored when determining the period between October 23 to November 22
Durations track the passage of physical time, which deviates from clock time when irregularities occur
- Example: Daylight savings time ends at the beginning of November and we gain an hour - this extra hour is added when determining the duration between October 23 to November 22

Using the lubridate package for time spans:

Interval
- Create an interval using interval() or %--%
  - Syntax: interval(<date/time_object1>, <date/time_object2>) or <date/time_object1> %--% <date/time_object2>

Time spans using `lubridate`: Periods

“Periods are time spans but don’t have a fixed length in seconds, instead they work with ‘human’ times, like days and months.” (From R for Data Science)
Create periods using functions whose name is the time unit pluralized (e.g., years(), months(), weeks(), days(), hours(), minutes(), seconds())
You can add and subtract periods
You can also use as.period() to get period of an interval

Code

days(1)

[1] "1d 0H 0M 0S"
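
A brief sketch of period arithmetic, using made-up dates. Because periods follow the calendar, adding days across the end of February gives a different result in a leap year than in a regular year:

```r
library(lubridate)

# Periods can be combined with + and -
p_sum  <- days(1) + hours(12)  # 1 day and 12 hours
p_diff <- weeks(2) - days(1)   # 13 days
p_sum
p_diff

# Adding a period to a date follows the calendar
leap    <- ymd("2020-02-28") + days(2)  # 2020 has a Feb 29
regular <- ymd("2021-02-28") + days(2)  # 2021 does not
leap
regular

next_year <- ymd("2020-01-01") + years(1)
next_year
```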

Time spans using `lubridate`: Durations

Durations keep track of the physical amount of time elapsed, so it is “stored as seconds, the only time unit with a consistent length” (From lubridate cheatsheet)
Create durations using functions whose name is the time unit prefixed with a d (e.g., dyears(), dweeks(), ddays(), dhours(), dminutes(), dseconds())
Example: ddays(1) creates a duration of 86400s, using the standard conversion of 60 seconds in a minute, 60 minutes in an hour, and 24 hours in a day:

Code

ddays(1)

[1] "86400s (~1 days)"

Notice that the output says this is equivalent to approximately 1 day, since it acknowledges that not all days have 24 hours.

In the case of daylight savings, one particular day may have 25 hours, so the duration of that day should be represented as:

Code

ddays(1) + dhours(1)

[1] "90000s (~1.04 days)"

You can add and subtract durations
You can also use as.duration() to get duration of an interval

Working with interval

Code

# Use `Sys.timezone()` to get timezone for your location (time is midnight by default)
scorpio_start <- ymd("2019-10-23", tz = Sys.timezone())
scorpio_end <- ymd("2019-11-22", tz = Sys.timezone())

scorpio_start

[1] "2019-10-23 PDT"

Code

# These datetime objects have class `POSIXct`
class(scorpio_start)

[1] "POSIXct" "POSIXt"

Code

# Create interval for the datetimes
scorpio_interval <- scorpio_start %--% scorpio_end
# Equivalent: scorpio_interval <- interval(scorpio_start, scorpio_end)
scorpio_interval

[1] 2019-10-23 PDT--2019-11-22 PST

Code

# The object has class `Interval`
class(scorpio_interval)

[1] "Interval"
attr(,"package")
[1] "lubridate"

Code

as.numeric(scorpio_interval)

[1] 2595600
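
The length above depends on the time zone returned by `Sys.timezone()`. As a point of comparison, this sketch builds the same interval in UTC, which has no daylight saving shift, and uses lubridate's `int_length()` and `%within%` to query it. The name `scorpio_utc` and the test dates are assumptions for illustration:

```r
library(lubridate)

# Same calendar dates, but in UTC
scorpio_utc <- interval(
  ymd("2019-10-23", tz = "UTC"),
  ymd("2019-11-22", tz = "UTC")
)

# Length in seconds: exactly 30 * 86400, with no extra hour
len_utc <- int_length(scorpio_utc)
len_utc

# Check whether a date-time falls inside the interval
in_range  <- ymd("2019-11-01", tz = "UTC") %within% scorpio_utc
out_range <- ymd("2019-12-01", tz = "UTC") %within% scorpio_utc
in_range
out_range
```

The 3600-second gap between this length and the 2595600 seconds shown above is the hour gained when daylight saving time ended in the Pacific time zone.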

Working with period

If we use as.period() to get the period of scorpio_interval, we see that it is a period of 30 days. We do not worry about the extra 1 hour gained due to daylight savings ending:

Code

# Period is 30 days
scorpio_period <- as.period(scorpio_interval)
scorpio_period

[1] "30d 0H 0M 0S"

Code

# The object has class `Period`
class(scorpio_period)

[1] "Period"
attr(,"package")
[1] "lubridate"

Because periods work with “human” times like days, they are more intuitive. For example, if we add a period of 30 days to the scorpio_start datetime object, we get the expected end datetime that is 30 days later:

Code

# Start datetime for Scorpio birthdays (time is midnight)
scorpio_start

[1] "2019-10-23 PDT"

Code

# After adding 30 day period, we get the expected end datetime (time is midnight)
scorpio_start + days(30)

[1] "2019-11-22 PST"

Working with duration

If we use as.duration() to get the duration of scorpio_interval, we see that it is a duration of 2595600 seconds. It takes into account the extra 1 hour gained due to daylight savings ending:

Code

# Duration is 2595600 seconds, which is equivalent to 30 24-hr days + 1 additional hour
scorpio_duration <- as.duration(scorpio_interval)
scorpio_duration

[1] "2595600s (~4.29 weeks)"

Code

# The object has class `Duration`
class(scorpio_duration)

[1] "Duration"
attr(,"package")
[1] "lubridate"

Code

# Using the standard 60s/min, 60min/hr, 24hr/day conversion,
# confirm duration is slightly more than 30 "standard" (ie. 24-hr) days
2595600 / (60 * 60 * 24)

[1] 30.04167

Code

# Specifically, it is 30 days + 1 hour, if we define a day to have 24 hours
seconds_to_period(scorpio_duration)

[1] "30d 1H 0M 0S"

Because durations work with physical time, when we add a duration of 30 days to the scorpio_start datetime object, we do not get the end datetime we’d expect:

Code

# Start datetime for Scorpio birthdays (time is midnight)
scorpio_start

[1] "2019-10-23 PDT"

Code

# After adding 30 day duration, we do not get the expected end datetime
# `ddays(30)` adds the number of seconds in 30 standard 24-hr days, but one of the days has 25 hours
scorpio_start + ddays(30)

[1] "2019-11-21 23:00:00 PST"

Code

# We need to add the additional 1 hour of physical time that elapsed during this time span
scorpio_start + ddays(30) + dhours(1)

[1] "2019-11-22 PST"
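
The outputs in this section were produced on a machine whose `Sys.timezone()` is in the Pacific time zone. To reproduce the period versus duration contrast on any machine, the time zone can be set explicitly. A minimal sketch, assuming the `"America/Los_Angeles"` zone (where daylight saving time ended on November 3, 2019):

```r
library(lubridate)

tz_la <- "America/Los_Angeles"
start <- ymd("2019-10-23", tz = tz_la)

# Period: 30 calendar days later, still at midnight local time
end_period <- start + days(30)
end_period

# Duration: exactly 30 * 86400 seconds later, which lands one hour
# short of midnight because one day in the span had 25 hours
end_duration <- start + ddays(30)
end_duration

# The two results differ by exactly one hour of physical time
gap_hours <- as.numeric(difftime(end_period, end_duration, units = "hours"))
gap_hours
```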

Attributions

These materials were adapted from Ozan Jaquette’s EDUC 260A Course: Introduction to Programming and Data Management, Strings & Dates and EDUC 260B Course: Fundamentals of Programming, Strings & Regex.
