Week 7 - Strings & Dates

Emorie D Beck

Outline

Questions on Homework
Strings
Regex
Dates

Dataset we will use

We will use rtweet to pull Twitter data from the PAC-12 universities. We will use the university admissions Twitter handle if there is one, or the main Twitter handle for the university if there isn’t one:

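# Load the tidyverse, which provides glimpse(), select(), and the stringr functions used below
library(tidyverse)

# Load previously pulled Twitter data
p12_full_df <- readRDS("week7-data.RDS")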
glimpse(p12_full_df)

Rows: 328
Columns: 90
$ user_id                 <chr> "22080148", "22080148", "22080148", "22080148"…
$ status_id               <chr> "1254177694599675904", "1253431405993840646", …
$ created_at              <dttm> 2020-04-25 22:37:18, 2020-04-23 21:11:49, 202…
$ screen_name             <chr> "WSUPullman", "WSUPullman", "WSUPullman", "WSU…
$ text                    <chr> "Big Dez is headed to Indy!\n\n#GoCougs | #NFL…
$ source                  <chr> "Twitter for iPhone", "Twitter Web App", "Twit…
$ display_text_width      <dbl> 125, 58, 246, 83, 56, 64, 156, 271, 69, 140, 4…
$ reply_to_status_id      <chr> NA, NA, NA, NA, NA, NA, NA, NA, "1252615862659…
$ reply_to_user_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, "22080148", NA…
$ reply_to_screen_name    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "WSUPullman", …
$ is_quote                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ is_retweet              <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ favorite_count          <int> 0, 322, 30, 55, 186, 53, 22, 44, 11, 0, 69, 42…
$ retweet_count           <int> 230, 32, 1, 5, 0, 3, 2, 6, 2, 6, 3, 4, 5, 5, 2…
$ quote_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ reply_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hashtags                <list> <"GoCougs", "NFLDraft2020", "NFLCougs">, <"WS…
$ symbols                 <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ urls_url                <list> NA, NA, NA, NA, NA, NA, NA, "commencement.wsu…
$ urls_t.co               <list> NA, NA, NA, NA, NA, NA, NA, "https://t.co/RR4…
$ urls_expanded_url       <list> NA, NA, NA, NA, NA, NA, NA, "https://commence…
$ media_url               <list> "http://pbs.twimg.com/ext_tw_video_thumb/1254…
$ media_t.co              <list> "https://t.co/NdGsvXnij7", "https://t.co/0OWG…
$ media_expanded_url      <list> "https://twitter.com/WSUCougarFB/status/12541…
$ media_type              <list> "photo", "photo", "photo", "photo", "photo", …
$ ext_media_url           <list> "http://pbs.twimg.com/ext_tw_video_thumb/1254…
$ ext_media_t.co          <list> "https://t.co/NdGsvXnij7", "https://t.co/0OWG…
$ ext_media_expanded_url  <list> "https://twitter.com/WSUCougarFB/status/12541…
$ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ mentions_user_id        <list> <"1250265324", "1409024796", "180884045">, NA…
$ mentions_screen_name    <list> <"WSUCougarFB", "dadpat7", "Colts">, NA, "WSU…
$ lang                    <chr> "en", "en", "en", "en", "en", "en", "en", "en"…
$ quoted_status_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "12529…
$ quoted_text             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "My WS…
$ quoted_created_at       <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2020-…
$ quoted_source           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Twitt…
$ quoted_favorite_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 209, N…
$ quoted_retweet_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 6, NA,…
$ quoted_user_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "43947…
$ quoted_screen_name      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "maddd…
$ quoted_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Maddy…
$ quoted_followers_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 629, N…
$ quoted_friends_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 382, N…
$ quoted_statuses_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 8881, …
$ quoted_location         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Seatt…
$ quoted_description      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "WSU A…
$ quoted_verified         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, FALSE,…
$ retweet_status_id       <chr> "1254159118996127746", NA, NA, NA, NA, NA, NA,…
$ retweet_text            <chr> "Big Dez is headed to Indy!\n\n#GoCougs | #NFL…
$ retweet_created_at      <dttm> 2020-04-25 21:23:29, NA, NA, NA, NA, NA, NA, …
$ retweet_source          <chr> "Twitter for iPhone", NA, NA, NA, NA, NA, NA, …
$ retweet_favorite_count  <int> 1402, NA, NA, NA, NA, NA, NA, NA, NA, 26, NA, …
$ retweet_retweet_count   <int> 230, NA, NA, NA, NA, NA, NA, NA, NA, 6, NA, NA…
$ retweet_user_id         <chr> "1250265324", NA, NA, NA, NA, NA, NA, NA, NA, …
$ retweet_screen_name     <chr> "WSUCougarFB", NA, NA, NA, NA, NA, NA, NA, NA,…
$ retweet_name            <chr> "Washington State Football", NA, NA, NA, NA, N…
$ retweet_followers_count <int> 77527, NA, NA, NA, NA, NA, NA, NA, NA, 996, NA…
$ retweet_friends_count   <int> 1448, NA, NA, NA, NA, NA, NA, NA, NA, 316, NA,…
$ retweet_statuses_count  <int> 15363, NA, NA, NA, NA, NA, NA, NA, NA, 1666, N…
$ retweet_location        <chr> "Pullman, WA", NA, NA, NA, NA, NA, NA, NA, NA,…
$ retweet_description     <chr> "Official Twitter home of Washington State Cou…
$ retweet_verified        <lgl> TRUE, NA, NA, NA, NA, NA, NA, NA, NA, FALSE, N…
$ place_url               <chr> NA, NA, NA, NA, NA, "https://api.twitter.com/1…
$ place_name              <chr> NA, NA, NA, NA, NA, "Pullman", NA, NA, NA, NA,…
$ place_full_name         <chr> NA, NA, NA, NA, NA, "Pullman, WA", NA, NA, NA,…
$ place_type              <chr> NA, NA, NA, NA, NA, "city", NA, NA, NA, NA, "c…
$ country                 <chr> NA, NA, NA, NA, NA, "United States", NA, NA, N…
$ country_code            <chr> NA, NA, NA, NA, NA, "US", NA, NA, NA, NA, "US"…
$ geo_coords              <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
$ coords_coords           <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
$ bbox_coords             <list> <NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, NA…
$ status_url              <chr> "https://twitter.com/WSUPullman/status/1254177…
$ name                    <chr> "WSU Pullman", "WSU Pullman", "WSU Pullman", "…
$ location                <chr> "Pullman, Washington USA", "Pullman, Washingto…
$ description             <chr> "We are an award-winning research university i…
$ url                     <chr> "http://t.co/VxKZH9BuMS", "http://t.co/VxKZH9B…
$ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ followers_count         <int> 43914, 43914, 43914, 43914, 43914, 43914, 4391…
$ friends_count           <int> 9717, 9717, 9717, 9717, 9717, 9717, 9717, 9717…
$ listed_count            <int> 556, 556, 556, 556, 556, 556, 556, 556, 556, 5…
$ statuses_count          <int> 15234, 15234, 15234, 15234, 15234, 15234, 1523…
$ favourites_count        <int> 20124, 20124, 20124, 20124, 20124, 20124, 2012…
$ account_created_at      <dttm> 2009-02-26 23:39:34, 2009-02-26 23:39:34, 200…
$ verified                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ profile_url             <chr> "http://t.co/VxKZH9BuMS", "http://t.co/VxKZH9B…
$ profile_expanded_url    <chr> "http://www.wsu.edu", "http://www.wsu.edu", "h…
$ account_lang            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/2208014…
$ profile_background_url  <chr> "http://abs.twimg.com/images/themes/theme5/bg.…
$ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/576502906…

p12_df <- p12_full_df |> 
  select(user_id, created_at, screen_name, text, location)
head(p12_df)

# A tibble: 6 × 5
  user_id  created_at          screen_name text                         location
  <chr>    <dttm>              <chr>       <chr>                        <chr>   
1 22080148 2020-04-25 22:37:18 WSUPullman  "Big Dez is headed to Indy!… Pullman…
2 22080148 2020-04-23 21:11:49 WSUPullman  "Cougar Cheese. That's it. … Pullman…
3 22080148 2020-04-21 04:00:00 WSUPullman  "Darien McLaughlin '19, and… Pullman…
4 22080148 2020-04-24 03:00:00 WSUPullman  "6 houses, one pick. Cougs,… Pullman…
5 22080148 2020-04-20 19:00:21 WSUPullman  "Why did you choose to atte… Pullman…
6 22080148 2020-04-20 02:20:01 WSUPullman  "Tell us one of your Bryan … Pullman…

String basics

What are strings?

A string is a type of data in R
You can create strings using either single quotes (') or double quotes (")
- R prints strings using double quotes, regardless of which type of quote you used to create them
The class() and typeof() of a string are both character

Creating Strings

Creating string using single quotes

Notice how R prints the string using double quotes:

my_string <- 'This is a string'
my_string

[1] "This is a string"

Creating string using double quotes

my_string <- "Strings can also contain numbers: 123"
my_string

[1] "Strings can also contain numbers: 123"

Checking class and type of strings

class(my_string)

[1] "character"

typeof(my_string)

[1] "character"

Quotes in quotes

Note: To include quotes as part of the string, we can either use the other type of quotes to surround the string (i.e., ' or ") or escape the quote using a backslash (\).

# Include quote by using the other type of quotes to surround the string 
my_string <- "There's no issues with this string."
my_string

[1] "There's no issues with this string."

# Include quote of the same type by escaping it with a backslash
my_string <- 'There\'s no issues with this string.'
my_string

[1] "There's no issues with this string."

# This would not work
my_string <- 'There's an issue with this string.'
my_string

`stringr` package

“A consistent, simple and easy to use set of wrappers around the fantastic stringi package. All function and argument names (and positions) are consistent, all functions deal with NA’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another.”

Credit: stringr R documentation

The `stringr` package:

The stringr package is built on top of the stringi package and is part of the tidyverse
stringr contains functions to work with strings
For many functions in the stringr package, there are equivalent “base R” functions
But stringr functions all follow the same rules, while rules often differ across different “base R” string functions, so we will focus exclusively on stringr functions
Most stringr functions start with str_ (e.g., str_length)
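
As a rough illustration of how the rules can differ, the sketch below compares str_c() and str_sub() (both covered later in these slides) with the base R functions paste0() and substr(). The example values are made up for illustration.

```r
library(stringr)

# Missing values: paste0() turns NA into the text "NA",
# while str_c() propagates the missing value
base_paste    <- paste0("user_", NA)   # "user_NA"
stringr_paste <- str_c("user_", NA)    # NA

# Negative positions: str_sub() counts from the end of the string;
# base substr() does not interpret negative positions this way
last_three <- str_sub("abcdef", start = -3, end = -1)   # "def"
```

Because every stringr function treats NA the same way, a missing value in the input stays visible in the output instead of silently becoming text.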

`str_length()`

The str_length() function:

Function: Find string length

?str_length

# SYNTAX
str_length(string)

Arguments:
- string: Character vector (or vector coercible to character)
Note that str_length() calculates the number of characters in a string, whereas length() (a base R function that is not part of the stringr package) calculates the number of elements in an object

Using `str_length()` on string

str_length("cats")

[1] 4

Compare to length(), which treats the string as a single object:

length("cats")

[1] 1

`str_length()` on character vector

str_length(c("cats", "in", "hat"))

[1] 4 2 3

Compare to length(), which finds the number of elements in the vector:

length(c("cats", "in", "hat"))

[1] 3

Using `str_length()` on other vectors coercible to character

Logical vectors can be coerced to character vectors:

str_length(c(TRUE, FALSE))

[1] 4 5

Numeric vectors can be coerced to character vectors:

str_length(c(1, 2.5, 3000))

[1] 1 3 4

Integer vectors can be coerced to character vectors:

str_length(c(2L, 100L))

[1] 1 3

Using `str_length()` on dataframe column

Recall that the columns in a dataframe are just vectors, so we can use str_length() as long as the vector is coercible to character type.

str_length(p12_df$screen_name[1:20])

 [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

p12_df %>% select(screen_name) %>% unique() %>% 
  mutate(screen_name_len = str_length(screen_name))

# A tibble: 11 × 2
   screen_name     screen_name_len
   <chr>                     <int>
 1 WSUPullman                   10
 2 CalAdmissions                13
 3 UW                            2
 4 USCAdmission                 12
 5 uoregon                       7
 6 FutureSunDevils              15
 7 UCLAAdmission                13
 8 UtahAdmissions               14
 9 futurebuffs                  11
10 uaadmissions                 12
11 BeaverVIP                     9

`str_c()`

The str_c() function:

Function: Concatenate strings between vectors (element-wise)

?str_c

# SYNTAX AND DEFAULT VALUES
str_c(..., sep = "", collapse = NULL)

Arguments:
- The input is one or more character vectors (or vectors coercible to character)
  - Zero length arguments are removed
  - Short arguments are recycled to the length of the longest
- sep: String to insert between input vectors
- collapse: Optional string used to combine input vectors into single string
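
To make the recycling rule concrete, here is a small sketch using a few of the screen names from the dataset. The handles vector and the URL prefix are typed in by hand for illustration rather than pulled from p12_df.

```r
library(stringr)

# A few screen names, typed by hand for illustration
handles <- c("WSUPullman", "CalAdmissions", "UW")

# The length-1 prefix is recycled to match the three handles
profile_urls <- str_c("https://twitter.com/", handles)
profile_urls

# The "@" is recycled in the same way; collapse then joins the results
handle_list <- str_c("@", handles, collapse = ", ")
handle_list   # "@WSUPullman, @CalAdmissions, @UW"
```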

Using `str_c()` on one vector

Since we only provided one input vector, it has nothing to concatenate with, so str_c() will just return the same vector:

str_c(c("a", "b", "c"))

[1] "a" "b" "c"

Using `str_c()` on one vector

Note that specifying the sep argument will also not have any effect because we only have one input vector, and sep is the separator between multiple vectors:

str_c(c("a", "b", "c"), sep = "~")

[1] "a" "b" "c"

# Check length: Output is the original vector of 3 elements
str_c(c("a", "b", "c")) %>% length()

[1] 3

Using `str_c()` on one vector

As seen on the previous slide, str_c() returns a vector by default (because the default value for the collapse argument is NULL).
But we can specify a string for collapse in order to collapse the elements of the output vector into a single string:

str_c(c("a", "b", "c"), collapse = "|")

[1] "a|b|c"

# Check length: Output vector of length 3 is collapsed into a single string
str_c(c("a", "b", "c"), collapse = "|") %>% length()

[1] 1

# Check str_length: This gives the length of the collapsed string, which is 5 characters long
str_c(c("a", "b", "c"), collapse = "|") %>% str_length()

[1] 5

Using `str_c()` on more than one vector

When we provide multiple input vectors, the vectors get concatenated element-wise (i.e., the 1st elements of each vector are concatenated, the 2nd elements of each vector are concatenated, etc.):

str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"))

[1] "ax!" "by?" "cz;"

Using `str_c()` on more than one vector

The default separator for each element-wise concatenation is an empty string (""), but we can customize that by specifying the sep argument:

str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~")

[1] "a~x~!" "b~y~?" "c~z~;"

# Check length: Output vector is same length as input vectors
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~") %>% length()

[1] 3

Using `str_c()` on more than one vector

Again, we can specify the collapse argument in order to collapse the elements of the output vector into a single string:

str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), collapse = "|")

[1] "ax!|by?|cz;"

# Check length: Output vector of length 3 is collapsed into a single string
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), collapse = "|") %>% length()

[1] 1

# Specifying both `sep` and `collapse`
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~", collapse = "|")

[1] "a~x~!|b~y~?|c~z~;"

`str_sub()`

The str_sub() function:

Function: Subset strings
Arguments:
- string: Character vector (or vector coercible to character)
- start: Position of first character to be included in substring (default: 1)
- end: Position of last character to be included in substring (default: -1)
  - Negative index = counting backwards
- omit_na: If TRUE, missing values in any of the arguments provided will result in an unchanged input

?str_sub

# SYNTAX AND DEFAULT VALUES
str_sub(string, start = 1L, end = -1L)
str_sub(string, start = 1L, end = -1L, omit_na = FALSE) <- value

When str_sub() is used in the assignment form, you can replace the subsetted part of the string with a value of your choice
- If an element in the vector is too short to meet the subset specification, the replacement value will be concatenated to the end of that element
- Note that this modifies your input vector directly, so you must have the vector saved to a variable (see example below)
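
A minimal sketch of the assignment form, using made-up vectors v and short. Only part of each string is replaced here, and the last lines show an element that is too short for the requested positions.

```r
library(stringr)

v <- c("AB", "ABCDE")

# Replace the characters in positions 2 through 3 with "*"
# "AB" only has a position 2, so just that character is replaced
str_sub(v, start = 2, end = 3) <- "*"
v   # "A*" "A*DE"

# An element that is too short to reach position 2
short <- "A"
str_sub(short, start = 2, end = 3) <- "*"
short
```

Because the assignment form modifies its input in place, v and short must already exist as variables before the call.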

Using `str_sub()` to subset strings

If no start and end positions are specified, str_sub() will by default return the entire (original) string:

str_sub(string = c("abcdefg", 123, TRUE))

[1] "abcdefg" "123"     "TRUE"

Note that if an element is shorter than the specified end (e.g., 123 in the example below), str_sub() will just return all of the characters that are available:

str_sub(string = c("abcdefg", 123, TRUE), start = 2, end = 4)

[1] "bcd" "23"  "RUE"

Using `str_sub()` to subset strings

Remember that we can also use a negative index to count positions from the end of the string:

str_sub(c("abcdefg", 123, TRUE), start = 2, end = -2)

[1] "bcdef" "2"     "RU"

Using `str_sub()` to replace strings

With the default start and end positions (1 and -1), str_sub() selects the entire original string, so the assignment form replaces the entire string:

v <- c("A", "AB", "ABC", "ABCD", "ABCDE")
str_sub(v, start = 1, end = -1)

[1] "A"     "AB"    "ABC"   "ABCD"  "ABCDE"

str_sub(v, start = 1, end = -1) <- "*"
v

[1] "*" "*" "*" "*" "*"

Using `str_sub()` on dataframe column

We can use as.character() to turn the created_at value to a string, then use str_sub() to extract out various date/time components from the string:

p12_datetime_df <- p12_df %>% select(created_at) %>%
  mutate(
      dt_chr = as.character(created_at),
      date_chr = str_sub(dt_chr, 1, 10),
      yr_chr = str_sub(dt_chr, 1, 4),
      mth_chr = str_sub(dt_chr, 6, 7),
      day_chr = str_sub(dt_chr, 9, 10),
      hr_chr = str_sub(dt_chr, -8, -7),
      min_chr = str_sub(dt_chr, -5, -4),
      sec_chr = str_sub(dt_chr, -2, -1)
    )

p12_datetime_df

# A tibble: 328 × 9
   created_at          dt_chr     date_chr yr_chr mth_chr day_chr hr_chr min_chr
   <dttm>              <chr>      <chr>    <chr>  <chr>   <chr>   <chr>  <chr>  
 1 2020-04-25 22:37:18 2020-04-2… 2020-04… 2020   04      25      22     37     
 2 2020-04-23 21:11:49 2020-04-2… 2020-04… 2020   04      23      21     11     
 3 2020-04-21 04:00:00 2020-04-2… 2020-04… 2020   04      21      04     00     
 4 2020-04-24 03:00:00 2020-04-2… 2020-04… 2020   04      24      03     00     
 5 2020-04-20 19:00:21 2020-04-2… 2020-04… 2020   04      20      19     00     
 6 2020-04-20 02:20:01 2020-04-2… 2020-04… 2020   04      20      02     20     
 7 2020-04-22 04:00:00 2020-04-2… 2020-04… 2020   04      22      04     00     
 8 2020-04-25 17:00:00 2020-04-2… 2020-04… 2020   04      25      17     00     
 9 2020-04-21 15:13:06 2020-04-2… 2020-04… 2020   04      21      15     13     
10 2020-04-21 17:52:47 2020-04-2… 2020-04… 2020   04      21      17     52     
# ℹ 318 more rows
# ℹ 1 more variable: sec_chr <chr>

Other `stringr` functions

Other useful stringr functions:

str_to_upper(): Turn strings to uppercase
str_to_lower(): Turn strings to lowercase
str_sort(): Sort a character vector
str_trim(): Trim whitespace from strings (including \n, \t, etc.)
str_pad(): Pad strings with specified character

Using `str_to_upper()` to turn strings to uppercase

Turn column names of p12_df to uppercase:

# Column names are originally lowercase
names(p12_df)

[1] "user_id"     "created_at"  "screen_name" "text"        "location"

# Turn column names to uppercase
names(p12_df) <- str_to_upper(names(p12_df))
names(p12_df)

[1] "USER_ID"     "CREATED_AT"  "SCREEN_NAME" "TEXT"        "LOCATION"

Using `str_to_lower()` to turn strings to lowercase

Turn column names of p12_df to lowercase:

# Column names are originally uppercase
names(p12_df)

[1] "USER_ID"     "CREATED_AT"  "SCREEN_NAME" "TEXT"        "LOCATION"

# Turn column names to lowercase
names(p12_df) <- str_to_lower(names(p12_df))
names(p12_df)

[1] "user_id"     "created_at"  "screen_name" "text"        "location"

Using `str_sort()` to sort character vector

Sort the vector of p12_df column names:

# Before sort
names(p12_df)

[1] "user_id"     "created_at"  "screen_name" "text"        "location"

# Sort alphabetically (default)
str_sort(names(p12_df))

[1] "created_at"  "location"    "screen_name" "text"        "user_id"

# Sort reverse alphabetically
str_sort(names(p12_df), decreasing = TRUE)

[1] "user_id"     "text"        "screen_name" "location"    "created_at"

Using `str_trim()` to trim whitespace from string

# Trim whitespace from both left and right sides (default)
str_trim(c("\nABC ", " XYZ\t"))

[1] "ABC" "XYZ"

# Trim whitespace from left side
str_trim(c("\nABC ", " XYZ\t"), side = "left")

[1] "ABC "  "XYZ\t"

# Trim whitespace from right side
str_trim(c("\nABC ", " XYZ\t"), side = "right")

[1] "\nABC" " XYZ"

Using `str_pad()` to pad string with character

Let’s say we have a vector of zip codes that has lost all leading 0’s. We can use str_pad() to add that back in:

# Pad the left side of strings with "0" until width of 5 is reached
str_pad(c(95035, 90024, 5009, 5030), width = 5, side = "left", pad = "0")

[1] "95035" "90024" "05009" "05030"

Regular expression basics

Example of using regular expression in action:

How can we match all occurrences of times in the following string? (i.e., 10 AM and 1 PM)
- "Class starts at 10 AM and ends at 1 PM."
The regular expression \d+ [AP]M can!

my_string <- "Class starts at 10 AM and ends at 1 PM."
my_regex <- "\\d+ [AP]M"

# The escaped string "\\d" results in the regex \d
print(my_regex)

[1] "\\d+ [AP]M"

writeLines(my_regex)

\d+ [AP]M

# View matches for our regular expression
str_view_all(string = my_string, pattern = my_regex)

[1] │ Class starts at <10 AM> and ends at <1 PM>.

Example of using regular expression in action:

How the regular expression \d+ [AP]M works:
- \d+ matches 1 or more digits in a row
  - \d means match any numeric digit (i.e., 0-9)
  - + means match 1 or more of the preceding pattern
- The space between \d+ and [AP]M matches a literal space
- [AP]M matches either AM or PM
  - [AP] means match either an A or P at that position
  - M means match a literal M
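
To check the breakdown above, the sketch below pulls the matches out of the string with str_extract_all(), a stringr function not otherwise covered in this section. The second pattern drops the + quantifier to show what it contributes.

```r
library(stringr)

my_string <- "Class starts at 10 AM and ends at 1 PM."

# str_extract_all() returns a list with one element per input string
times <- str_extract_all(my_string, "\\d+ [AP]M")[[1]]
times          # "10 AM" "1 PM"

# Without the + quantifier, only a single digit can precede the space
single_digit <- str_extract_all(my_string, "\\d [AP]M")[[1]]
single_digit   # "0 AM" "1 PM"
```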

Some common regular expression patterns include (not an exhaustive list):

Character classes
Quantifiers
Anchors
Sets and ranges
Groups and backreferences

Credit: DaveChild Regular Expression Cheat Sheet

Character classes

STRING	REGEX	MATCHES
`"\\d"`	`\d`	any digit
`"\\D"`	`\D`	any non-digit
`"\\s"`	`\s`	any whitespace
`"\\S"`	`\S`	any non-whitespace
`"\\w"`	`\w`	any word character
`"\\W"`	`\W`	any non-word character

Credit: Working with strings in stringr Cheat sheet

Character classes

Certain character classes in regular expressions have special meaning. For example:
- \d is used to match any digit (i.e., number)
- \s is used to match any whitespace (i.e., space, tab, or newline character)
- \w is used to match any word character (i.e., alphanumeric character or underscore)

Character classes

“But wait… there’s more! Before a regex is interpreted as a regular expression, it is also interpreted by R as a string. And backslash is used to escape there as well. So, in the end, you need to preprend two backslashes…”
Credit: Escaping sequences from Stat 545
This means that in R, when we want to use regular expression patterns such as \d, \s, \w, etc. to match strings, we must write the patterns as the strings "\\d", "\\s", "\\w", etc.
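
A short sketch of the difference between the R string and the regex it produces. The variable names and the two test strings are chosen for illustration, and str_detect() is a stringr function not otherwise covered in this section.

```r
library(stringr)

digit_pattern <- "\\d"

# The R string holds two characters: a backslash and the letter d
str_length(digit_pattern)   # 2

# writeLines() shows the pattern as the regex engine receives it: \d
writeLines(digit_pattern)

# str_detect() returns TRUE when the pattern matches anywhere in the string
has_digit <- str_detect(c("Covid-19", "GoHuskies"), digit_pattern)
has_digit                   # TRUE FALSE
```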

Using `\d` & `\D` to match digits & non-digits

Goal: write a regular expression pattern that matches any digit in the string p12_df$text[119]
We can use \d to match all instances of a digit (i.e., number):

# Match any instances of a digit
str_view_all(string = p12_df$text[119], pattern = "\\d")

[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-<1><9> in our labs and hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/<4>YSf<4>SpPe<0>

KEY POINT WITH REGEX

Our regular expression is the value we specify for the pattern argument above; this is our “regex object”
We want our regex object to include the regular expression \d, which matches any digit
We specify our regex object as "\\d" rather than "\d"

Use regular expression `\D` to match all instances of a non-digit character:

# Match any instances of a non-digit
str_view_all(string = p12_df$text[119], pattern = "\\D")

[1] │ <"><I>< ><s><t><a><n><d>< ><w><i><t><h>< ><m><y>< ><c><o><l><l><e><a><g><u><e><s>< ><a><t>< ><@><U><W>< ><a><n><d>< ><A><m><e><r><i><c><a><'><s>< ><l><e><a><d><i><n><g>< ><r><e><s><e><a><r><c><h>< ><u><n><i><v><e><r><s><i><t><i><e><s>< ><a><s>< ><t><h><e><y>< ><t><a><k><e>< ><f><i><g><h><t>< ><t><o>< ><C><o><v><i><d><->19< ><i><n>< ><o><u><r>< ><l><a><b><s>< ><a><n><d>< ><h><o><s><p><i><t><a><l><s><.><"><
    │ ><
    │ ><#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m>< ><x>< ><#><A><l><w><a><y><s><C><o><m><p><e><t><e>< ><x>< ><#><G><o><H><u><s><k><i><e><s>< ><h><t><t><p><s><:></></><t><.><c><o></>4<Y><S><f>4<S><p><P><e>0

Match all instances of a digit followed by a non-digit character:

str_view_all(string = p12_df$text[119], pattern = "\\d\\D")

[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-1<9 >in our labs and hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/<4Y>Sf<4S>pPe0

Using `\s` & `\S` to match whitespace & non-whitespace

We can use \s to match all instances of a whitespace (i.e., space, tab, or newline character):

# Match any instances of a whitespace
str_view_all(
  string = p12_df$text[119]
  , pattern = "\\s"
  )

[1] │ "I< >stand< >with< >my< >colleagues< >at< >@UW< >and< >America's< >leading< >research< >universities< >as< >they< >take< >fight< >to< >Covid-19< >in< >our< >labs< >and< >hospitals."<
    │ ><
    │ >#ProudToBeOnTheirTeam< >x< >#AlwaysCompete< >x< >#GoHuskies< >https://t.co/4YSf4SpPe0

We can use \S to match all instances of a non-whitespace character:

# Match any instances of a non-whitespace
str_view_all(
  string = p12_df$text[119]
  , pattern = "\\S"
  )

[1] │ <"><I> <s><t><a><n><d> <w><i><t><h> <m><y> <c><o><l><l><e><a><g><u><e><s> <a><t> <@><U><W> <a><n><d> <A><m><e><r><i><c><a><'><s> <l><e><a><d><i><n><g> <r><e><s><e><a><r><c><h> <u><n><i><v><e><r><s><i><t><i><e><s> <a><s> <t><h><e><y> <t><a><k><e> <f><i><g><h><t> <t><o> <C><o><v><i><d><-><1><9> <i><n> <o><u><r> <l><a><b><s> <a><n><d> <h><o><s><p><i><t><a><l><s><.><">
    │ 
    │ <#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m> <x> <#><A><l><w><a><y><s><C><o><m><p><e><t><e> <x> <#><G><o><H><u><s><k><i><e><s> <h><t><t><p><s><:></></><t><.><c><o></><4><Y><S><f><4><S><p><P><e><0>

Using `\w` & `\W` to match words & non-words

We can use \w to match all instances of a word character (i.e., alphanumeric character or underscore):

# Match any instances of a word character
str_view_all(
  string = p12_df$text[119]
  , pattern = "\\w"
  )

[1] │ "<I> <s><t><a><n><d> <w><i><t><h> <m><y> <c><o><l><l><e><a><g><u><e><s> <a><t> @<U><W> <a><n><d> <A><m><e><r><i><c><a>'<s> <l><e><a><d><i><n><g> <r><e><s><e><a><r><c><h> <u><n><i><v><e><r><s><i><t><i><e><s> <a><s> <t><h><e><y> <t><a><k><e> <f><i><g><h><t> <t><o> <C><o><v><i><d>-<1><9> <i><n> <o><u><r> <l><a><b><s> <a><n><d> <h><o><s><p><i><t><a><l><s>."
    │ 
    │ #<P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m> <x> #<A><l><w><a><y><s><C><o><m><p><e><t><e> <x> #<G><o><H><u><s><k><i><e><s> <h><t><t><p><s>://<t>.<c><o>/<4><Y><S><f><4><S><p><P><e><0>

We can use \W to match all instances of a non-word character:

# Match any instances of a non-word character
str_view_all(
  string = p12_df$text[119]
  , pattern = "\\W"
  )

[1] │ <">I< >stand< >with< >my< >colleagues< >at< ><@>UW< >and< >America<'>s< >leading< >research< >universities< >as< >they< >take< >fight< >to< >Covid<->19< >in< >our< >labs< >and< >hospitals<.><"><
    │ ><
    │ ><#>ProudToBeOnTheirTeam< >x< ><#>AlwaysCompete< >x< ><#>GoHuskies< >https<:></></>t<.>co</>4YSf4SpPe0

Using `\w` & `\W` to match words & non-words

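This matches 3-letter words that are surrounded by non-word characters (note that each match includes the surrounding non-word characters, and that a 3-letter word at the very start or end of the string would not be matched):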

str_view_all(
  string = p12_df$text[119]
  , pattern = "\\W\\w\\w\\w\\W"
  )

[1] │ "I stand with my colleagues at @UW< and >America's leading research universities as they take fight to Covid-19 in< our >labs< and >hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

Wrap-Up: Character Classes

The table below shows other regular expressions involving backslashes.
This includes special characters like \n and \t, as well as using a backslash to escape characters that have special meanings in regex, like . or ? (as we will soon see).
So to match a literal period or question mark, we need to use the regex \. and \?, or the strings "\\." and "\\?" in R.

STRING	REGEX	MATCHES
`"\\n"`	`\n`	newline
`"\\t"`	`\t`	tab
`"\\\\"`	`\\`	`\`
`"\\."`	`\.`	`.`
`"\\?"`	`\?`	`?`
`"\\("`	`\(`	`(`
`"\\)"`	`\)`	`)`
`"\\{"`	`\{`	`{`
`"\\}"`	`\}`	`}`
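
The escaping rule for special characters can be checked with a short sketch. The link below is the URL from the example tweet, and str_count() and fixed() are stringr functions not otherwise covered in this section.

```r
library(stringr)

link <- "https://t.co/4YSf4SpPe0"

# An unescaped . is a wildcard, so it matches every character in the link
n_wildcard <- str_count(link, ".")

# An escaped \. matches only the literal period in "t.co"
n_period <- str_count(link, "\\.")

# fixed() treats the whole pattern as literal text, so no escaping is needed
n_fixed <- str_count(link, fixed("."))
```

When a pattern is meant as literal text from start to finish, fixed() can be easier to read than a string full of doubled backslashes.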

Quantifiers

Character	Description
`*`	0 or more
`?`	0 or 1
`+`	1 or more
`{3}`	Exactly 3
`{3,}`	3 or more
`{3,5}`	3, 4, or 5

We can use quantifiers to specify the amount of a certain character or expression to match.
The quantifier should directly follow the pattern you want to quantify.
For example, s? matches 0 or 1 s and \d{4} matches exactly 4 digits.
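
A few quantifiers in action, as a sketch. The timestamp is the first created_at value from the dataset typed in as a string; the other inputs are made up for illustration.

```r
library(stringr)

timestamp <- "2020-04-25 22:37:18"

# \d{4} matches exactly 4 digits in a row: the year
year <- str_extract(timestamp, "\\d{4}")                   # "2020"

# \d{2} matches 2 digits at a time, so the year is split into "20" and "20"
pairs <- str_extract_all(timestamp, "\\d{2}")[[1]]

# s? makes the s optional, so both the singular and the plural match
labs <- str_extract_all("1 lab, 2 labs", "labs?")[[1]]     # "lab" "labs"

# \d{2,} matches runs of 2 or more digits and skips the lone "1"
runs <- str_extract_all("a1 b22 c333", "\\d{2,}")[[1]]     # "22" "333"
```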

Anchors

We can use anchors to indicate which part of the string to match.
For example, ^ matches the start of the string and $ matches the end of the string (notice how we do not need to escape these characters).
\b can be used to help detect word boundaries, and \B can be used to help match characters within a word.
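
A minimal sketch of \b and \B, using a made-up sentence and the {3} quantifier from the Quantifiers section. Unlike ^ and $, these anchors contain a backslash, so they are written as "\\b" and "\\B" in R.

```r
library(stringr)

sentence <- "the cat sat on a mat by the door"

# \b\w{3}\b matches exactly 3 word characters with a boundary on each side
three_letter <- str_extract_all(sentence, "\\b\\w{3}\\b")[[1]]
three_letter   # "the" "cat" "sat" "mat" "the"

# \bcat requires "cat" to start a word; \Bcat requires it to start mid-word
starts_word <- str_detect(c("cat", "concat"), "\\bcat")   # TRUE FALSE
inside_word <- str_detect(c("cat", "concat"), "\\Bcat")   # FALSE TRUE
```

Because \b matches a position rather than a character, the matches do not include the surrounding spaces, and words at the very start or end of the string are still found.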

Anchors

String	Character	Description
`"^"`	`^`	Start of string, or start of line in multi-line pattern
`"$"`	`$`	End of string, or end of line in multi-line pattern
`"\\b"`	`\b`	Word boundary
`"\\B"`	`\B`	Non-word boundary

Using `^` & `$` to match start & end of string

We can use ^ to match the start of a string:

# Matches only the quotation mark at the start of the text and not the end quote
str_view_all(string = p12_df$text[119], pattern = '^"')

[1] │ <">I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

Using `^` & `$` to match start & end of string

We can use $ to match the end of a string:

# Matches only the number at the end of the text and not any other numbers
str_view_all(string = p12_df$text[119], pattern = "\\d$")

[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe<0>

Sets and ranges

Character	Description
`.`	Match any character except newline (`\n`)
`a|b`	Match `a` or `b`
`[abc]`	Match either `a`, `b`, or `c`
`[^abc]`	Match anything except `a`, `b`, or `c`
`[a-z]`	Match range of lowercase letters from `a` to `z`
`[A-Z]`	Match range of uppercase letters from `A` to `Z`
`[0-9]`	Match range of numbers from `0` to `9`

The table lists some more ways that regular expressions give us flexibility in what we want to match.
The period . acts as a wildcard to match any character except newline.
The vertical bar | is similar to an OR operator. Square brackets [...] can be used to specify a set or range of characters to match (or not to match).

Using `.` as a wildcard

We can use . to match any character except newline (\n):

# Matches any character except newline
str_view_all(string = p12_df$text[119], pattern = ".")

[1] │ <"><I>< ><s><t><a><n><d>< ><w><i><t><h>< ><m><y>< ><c><o><l><l><e><a><g><u><e><s>< ><a><t>< ><@><U><W>< ><a><n><d>< ><A><m><e><r><i><c><a><'><s>< ><l><e><a><d><i><n><g>< ><r><e><s><e><a><r><c><h>< ><u><n><i><v><e><r><s><i><t><i><e><s>< ><a><s>< ><t><h><e><y>< ><t><a><k><e>< ><f><i><g><h><t>< ><t><o>< ><C><o><v><i><d><-><1><9>< ><i><n>< ><o><u><r>< ><l><a><b><s>< ><a><n><d>< ><h><o><s><p><i><t><a><l><s><.><">
    │ 
    │ <#><P><r><o><u><d><T><o><B><e><O><n><T><h><e><i><r><T><e><a><m>< ><x>< ><#><A><l><w><a><y><s><C><o><m><p><e><t><e>< ><x>< ><#><G><o><H><u><s><k><i><e><s>< ><h><t><t><p><s><:></></><t><.><c><o></><4><Y><S><f><4><S><p><P><e><0>

Using `.` as a wildcard

We can confirm there is a newline in the tweet above by using writeLines() or print():

writeLines(p12_df$text[119])

"I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."

#ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

print(p12_df$text[119])

[1] "\"I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals.\"\n\n#ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0"
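To see how the newline affects the wildcard, here is a minimal sketch on a made-up two-line string (`x` is hypothetical). It assumes stringr's `regex()` helper with `dotall = TRUE`, which lets `.` match the newline as well:

```r
library(stringr)

# Hypothetical two-line string standing in for a tweet
x <- "line one\nline two"

# By default `.` stops at the newline, so `.+` yields one match per line
per_line <- str_extract_all(x, ".+")[[1]]

# With dotall = TRUE, `.` also matches "\n", so `.+` spans the whole string
whole <- str_extract_all(x, regex(".+", dotall = TRUE))[[1]]

per_line             # "line one" "line two"
identical(whole, x)  # TRUE
```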

Using `|` as an OR operator

We can use | to match either one of multiple patterns:

# Matches `research`, `fight`, or `labs`
str_view_all(string = p12_df$text[119], pattern = "research|fight|labs")

[1] │ "I stand with my colleagues at @UW and America's leading <research> universities as they take <fight> to Covid-19 in our <labs> and hospitals."
    │ 
    │ #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

# Matches hashtags or handles
str_view_all(string = p12_df$text[119], pattern = "@\\w+|#\\w+")

[1] │ "I stand with my colleagues at <@UW> and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
    │ 
    │ <#ProudToBeOnTheirTeam> x <#AlwaysCompete> x <#GoHuskies> https://t.co/4YSf4SpPe0

Using `[...]` to match (or not match) a set or range of characters

We can use [...] to match any set of characters:

# Matches hashtags or handles
str_view_all(string = p12_df$text[119], pattern = "[@#]\\w+")

[1] │ "I stand with my colleagues at <@UW> and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
    │ 
    │ <#ProudToBeOnTheirTeam> x <#AlwaysCompete> x <#GoHuskies> https://t.co/4YSf4SpPe0
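The `|` pattern from the previous slide and the `[...]` pattern here pick out the same text. A small check on a made-up string (`x` is hypothetical), using `str_extract_all()` to return the matches rather than display them:

```r
library(stringr)

# Hypothetical tweet-like string
x <- "Thanks @UW! #GoHuskies #AlwaysCompete"

with_or  <- str_extract_all(x, "@\\w+|#\\w+")[[1]]
with_set <- str_extract_all(x, "[@#]\\w+")[[1]]

with_set                       # "@UW" "#GoHuskies" "#AlwaysCompete"
identical(with_or, with_set)   # TRUE
```

The set version is shorter when the alternatives differ by only a single character.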

# Matches any 2 consecutive vowels
str_view_all(string = p12_df$text[119], pattern = "[aeiouAEIOU]{2}")

[1] │ "I stand with my coll<ea>g<ue>s at @UW and America's l<ea>ding res<ea>rch universit<ie>s as they take fight to Covid-19 in <ou>r labs and hospitals."
    │ 
    │ #Pr<ou>dToB<eO>nTh<ei>rT<ea>m x #AlwaysCompete x #GoHusk<ie>s https://t.co/4YSf4SpPe0

Using `[...]` to match (or not match) a set or range of characters

We can also use [...] to match any range of alpha or numeric characters:

# Matches only lowercase x through z or uppercase A through C
str_view_all(string = p12_df$text[119], pattern = "[x-zA-C]")

[1] │ "I stand with m<y> colleagues at @UW and <A>merica's leading research universities as the<y> take fight to <C>ovid-19 in our labs and hospitals."
    │ 
    │ #ProudTo<B>eOnTheirTeam <x> #<A>lwa<y>s<C>ompete <x> #GoHuskies https://t.co/4YSf4SpPe0

# Matches only numbers 1 through 4 or the pound sign
str_view_all(string = p12_df$text[119], pattern = "[1-4#]")

[1] │ "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-<1>9 in our labs and hospitals."
    │ 
    │ <#>ProudToBeOnTheirTeam x <#>AlwaysCompete x <#>GoHuskies https://t.co/<4>YSf<4>SpPe0

Using `[...]` to match (or not match) a set or range of characters

We can use [^...] to indicate we do not want to match the provided set or range of characters:

# Matches any vowels
str_view_all(string = p12_df$text[119], pattern = "[aeiouAEIOU]")

[1] │ "<I> st<a>nd w<i>th my c<o>ll<e><a>g<u><e>s <a>t @<U>W <a>nd <A>m<e>r<i>c<a>'s l<e><a>d<i>ng r<e>s<e><a>rch <u>n<i>v<e>rs<i>t<i><e>s <a>s th<e>y t<a>k<e> f<i>ght t<o> C<o>v<i>d-19 <i>n <o><u>r l<a>bs <a>nd h<o>sp<i>t<a>ls."
    │ 
    │ #Pr<o><u>dT<o>B<e><O>nTh<e><i>rT<e><a>m x #<A>lw<a>ysC<o>mp<e>t<e> x #G<o>H<u>sk<i><e>s https://t.c<o>/4YSf4SpP<e>0

# Matches anything except vowels
str_view_all(string = p12_df$text[119], pattern = "[^aeiouAEIOU]")

[1] │ <">I< ><s><t>a<n><d>< ><w>i<t><h>< ><m><y>< ><c>o<l><l>ea<g>ue<s>< >a<t>< ><@>U<W>< >a<n><d>< >A<m>e<r>i<c>a<'><s>< ><l>ea<d>i<n><g>< ><r>e<s>ea<r><c><h>< >u<n>i<v>e<r><s>i<t>ie<s>< >a<s>< ><t><h>e<y>< ><t>a<k>e< ><f>i<g><h><t>< ><t>o< ><C>o<v>i<d><-><1><9>< >i<n>< >ou<r>< ><l>a<b><s>< >a<n><d>< ><h>o<s><p>i<t>a<l><s><.><"><
    │ ><
    │ ><#><P><r>ou<d><T>o<B>eO<n><T><h>ei<r><T>ea<m>< ><x>< ><#>A<l><w>a<y><s><C>o<m><p>e<t>e< ><x>< ><#><G>o<H>u<s><k>ie<s>< ><h><t><t><p><s><:></></><t><.><c>o</><4><Y><S><f><4><S><p><P>e<0>

# Matches anything that's not uppercase letters
str_view_all(string = p12_df$text[119], pattern = "[^A-Z]+")

[1] │ <">I< stand with my colleagues at @>UW< and >A<merica's leading research universities as they take fight to >C<ovid-19 in our labs and hospitals."
    │ 
    │ #>P<roud>T<o>B<e>O<n>T<heir>T<eam x #>A<lways>C<ompete x #>G<o>H<uskies https://t.co/4>YS<f4>S<p>P<e0>

Using `[...]` to match (or not match) a set or range of characters

Notice that [...] only matches a single character (see second to last example above). We need to use quantifiers if we want to match a stretch of characters (see last example above).
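The difference is easy to count on a short made-up string (`x` is hypothetical), again using `str_extract_all()` to return the matches:

```r
library(stringr)

x <- "GoHuskies"   # hypothetical input

# Without a quantifier, each non-uppercase character is its own match
single <- str_extract_all(x, "[^A-Z]")[[1]]

# With `+`, consecutive non-uppercase characters are grouped into one match
runs <- str_extract_all(x, "[^A-Z]+")[[1]]

length(single)  # 7
runs            # "o" "uskies"
```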

Dates and times

“Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time related quirks, and R lacks these capabilities in some situations. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.”

Credit: lubridate documentation

Dates and times

How are dates and times stored in R? (From Dates and Times in R)

The Date class is used for storing dates
- “Internally, Date objects are stored as the number of days since January 1, 1970, using negative numbers for earlier dates. The as.numeric() function can be used to convert a Date object to its internal form.”
POSIX classes can be used for storing date plus times
- “The POSIXct class stores date/time values as the number of seconds since January 1, 1970”
- “The POSIXlt class stores date/time values as a list of components (hour, min, sec, mon, etc.) making it easy to extract these parts”
There is no native R class for storing only time
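A quick way to check these storage rules, using only base R (the specific dates are arbitrary examples chosen so the numbers are easy to verify by hand):

```r
# Base R only; no packages needed
d <- as.Date("1970-01-10")
as.numeric(d)                       # 9 days since 1970-01-01

as.numeric(as.Date("1969-12-31"))   # -1: earlier dates are negative

ct <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(ct)                      # 60 seconds since 1970-01-01 00:00:00 UTC

lt <- as.POSIXlt(ct)
lt$min                              # 1: components are stored by name
lt$year + 1900                      # 1970: `year` is stored as years since 1900
```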

Dates and times

Why use date/time objects?

Using date/time objects makes it easier to fetch or modify various date/time components (e.g., year, month, day, day of the week)
- If the date/time is stored only as a string, these components are not as readily accessible and must be parsed out first
You can perform certain arithmetic with date/time objects (e.g., find the “difference” between date/time points)
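A small sketch of both points, assuming lubridate is loaded. The variable names are made up for illustration, and the accessor functions used here are covered in more detail later:

```r
library(lubridate)

date_chr <- "2020-04-25"    # date stored as a plain string
date_obj <- ymd(date_chr)   # same date stored as a Date object

# From the string, the month has to be cut out and converted by hand
month_from_chr <- as.integer(substr(date_chr, 6, 7))

# From the Date object, an accessor function returns it directly
month_from_obj <- month(date_obj)

# The day of the week cannot be read off the string at all
wday(date_obj, label = TRUE)

# Subtraction works on Date objects but not on strings
date_obj - ymd("2020-04-20")   # Time difference of 5 days
```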

Creating date/time objects

Functions that create date/time objects by parsing character or numeric input:

Create Date object: ymd(), ydm(), mdy(), myd(), dmy(), dym()
- y stands for year, m stands for month, d stands for day
- Select the function that represents the order in which your date input is formatted, and the function will be able to parse your input and create a Date object
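Choosing the right function matters because the same input can be read in more than one way. A minimal illustration with a made-up, ambiguous string:

```r
library(lubridate)

x <- "01/02/2020"   # ambiguous on its own

as_mdy <- mdy(x)    # read as month/day/year
as_dmy <- dmy(x)    # read as day/month/year

as_mdy  # "2020-01-02"
as_dmy  # "2020-02-01"
```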

Creating POSIXct objects

Create POSIXct object: ymd_h(), ymd_hm(), ymd_hms(), etc.
- h stands for hour, m stands for minute, s stands for second
- For any of the previous 6 date functions, you can append h, hm, or hms if you want to provide additional time information in order to create a POSIXct object
- To force a POSIXct object without providing any time information, you can just provide a timezone (using tz) to one of the date functions and it will assume midnight as the time
- You can use Sys.timezone() to get the timezone for your location

Creating `Date` object from character or numeric input

The lubridate functions are flexible and can parse dates in various formats:

d <- mdy("1/1/2020"); d

[1] "2020-01-01"

d <- mdy("1-1-2020"); d

[1] "2020-01-01"

d <- mdy("Jan. 1, 2020"); d

[1] "2020-01-01"

d <- ymd(20200101); d

[1] "2020-01-01"

Creating `Date` object from character or numeric input

Investigate the Date object:

class(d)

[1] "Date"

typeof(d)

[1] "double"

# Number of days since January 1, 1970
as.numeric(d)

[1] 18262

Creating `POSIXct` object from character or numeric input

The lubridate functions are flexible and can parse AM/PM in various formats:

dt <- mdy_h("12/31/2019 11pm"); dt

[1] "2019-12-31 23:00:00 UTC"

dt <- mdy_hm("12/31/2019 11:59 pm"); dt

[1] "2019-12-31 23:59:00 UTC"

dt <- mdy_hms("12/31/2019 11:59:59 PM"); dt

[1] "2019-12-31 23:59:59 UTC"

dt <- ymd_hms(20191231235959); dt

[1] "2019-12-31 23:59:59 UTC"

Creating `POSIXct` object from character or numeric input

Investigate the POSIXct object:

class(dt)

[1] "POSIXct" "POSIXt"

typeof(dt)

[1] "double"

# Number of seconds since January 1, 1970
as.numeric(dt)

[1] 1577836799

Creating `POSIXct` object from character or numeric input

We can also create a POSIXct object from a date function by providing a timezone. The time would default to midnight:

dt <- mdy("1/1/2020", tz = "UTC")
dt

[1] "2020-01-01 UTC"

# Number of seconds since January 1, 1970
as.numeric(dt)  # Note that this is indeed 1 sec after the previous example

[1] 1577836800

Creating `Date` objects from dataframe column

Using the p12_datetime_df we created earlier, we can create Date objects from the date_chr column:

# Use `ymd()` to parse the string stored in the `date_chr` column
p12_datetime_df %>% select(created_at, dt_chr, date_chr) %>%
  mutate(date_ymd = ymd(date_chr))

# A tibble: 328 × 4
   created_at          dt_chr              date_chr   date_ymd  
   <dttm>              <chr>               <chr>      <date>    
 1 2020-04-25 22:37:18 2020-04-25 22:37:18 2020-04-25 2020-04-25
 2 2020-04-23 21:11:49 2020-04-23 21:11:49 2020-04-23 2020-04-23
 3 2020-04-21 04:00:00 2020-04-21 04:00:00 2020-04-21 2020-04-21
 4 2020-04-24 03:00:00 2020-04-24 03:00:00 2020-04-24 2020-04-24
 5 2020-04-20 19:00:21 2020-04-20 19:00:21 2020-04-20 2020-04-20
 6 2020-04-20 02:20:01 2020-04-20 02:20:01 2020-04-20 2020-04-20
 7 2020-04-22 04:00:00 2020-04-22 04:00:00 2020-04-22 2020-04-22
 8 2020-04-25 17:00:00 2020-04-25 17:00:00 2020-04-25 2020-04-25
 9 2020-04-21 15:13:06 2020-04-21 15:13:06 2020-04-21 2020-04-21
10 2020-04-21 17:52:47 2020-04-21 17:52:47 2020-04-21 2020-04-21
# ℹ 318 more rows

Creating `POSIXct` objects from dataframe column

Using the p12_datetime_df we created earlier, we can recreate the created_at column (class POSIXct) from the dt_chr column (class character):

# Use `ymd_hms()` to parse the string stored in the `dt_chr` column
p12_datetime_df %>% select(created_at, dt_chr) %>%
  mutate(datetime_ymd_hms = ymd_hms(dt_chr))

# A tibble: 328 × 3
   created_at          dt_chr              datetime_ymd_hms   
   <dttm>              <chr>               <dttm>             
 1 2020-04-25 22:37:18 2020-04-25 22:37:18 2020-04-25 22:37:18
 2 2020-04-23 21:11:49 2020-04-23 21:11:49 2020-04-23 21:11:49
 3 2020-04-21 04:00:00 2020-04-21 04:00:00 2020-04-21 04:00:00
 4 2020-04-24 03:00:00 2020-04-24 03:00:00 2020-04-24 03:00:00
 5 2020-04-20 19:00:21 2020-04-20 19:00:21 2020-04-20 19:00:21
 6 2020-04-20 02:20:01 2020-04-20 02:20:01 2020-04-20 02:20:01
 7 2020-04-22 04:00:00 2020-04-22 04:00:00 2020-04-22 04:00:00
 8 2020-04-25 17:00:00 2020-04-25 17:00:00 2020-04-25 17:00:00
 9 2020-04-21 15:13:06 2020-04-21 15:13:06 2020-04-21 15:13:06
10 2020-04-21 17:52:47 2020-04-21 17:52:47 2020-04-21 17:52:47
# ℹ 318 more rows

Creating date/time objects from individual components

Functions that create date/time objects from various date/time components:

Create Date object: make_date()
- Syntax and default values: make_date(year = 1970L, month = 1L, day = 1L)
- All inputs are coerced to integer
Create POSIXct object: make_datetime()
- Syntax and default values: make_datetime(year = 1970L, month = 1L, day = 1L, hour = 0L, min = 0L, sec = 0, tz = "UTC")

Creating `Date` object from individual components

There are various ways to pass in the inputs to create the same Date object:

d <- make_date(2020, 1, 1); d

[1] "2020-01-01"

# Characters can be coerced to integers
d <- make_date("2020", "01", "01"); d

[1] "2020-01-01"

# Remember that the default values for month and day would be 1L
d <- make_date(2020); d

[1] "2020-01-01"

Creating `POSIXct` object from individual components

# Inputs should be numeric
d <- make_datetime(2019, 12, 31, 23, 59, 59)
d

[1] "2019-12-31 23:59:59 UTC"

Creating `Date` objects from dataframe columns

Using the p12_datetime_df we created earlier, we can create Date objects from the various date component columns:

# Use `make_date()` to create a `Date` object from the `yr_chr`, `mth_chr`, `day_chr` fields
p12_datetime_df %>% select(created_at, dt_chr, yr_chr, mth_chr, day_chr) %>%
  mutate(date_make_date = make_date(year = yr_chr, month = mth_chr, day = day_chr))

# A tibble: 328 × 6
   created_at          dt_chr              yr_chr mth_chr day_chr date_make_date
   <dttm>              <chr>               <chr>  <chr>   <chr>   <date>        
 1 2020-04-25 22:37:18 2020-04-25 22:37:18 2020   04      25      2020-04-25    
 2 2020-04-23 21:11:49 2020-04-23 21:11:49 2020   04      23      2020-04-23    
 3 2020-04-21 04:00:00 2020-04-21 04:00:00 2020   04      21      2020-04-21    
 4 2020-04-24 03:00:00 2020-04-24 03:00:00 2020   04      24      2020-04-24    
 5 2020-04-20 19:00:21 2020-04-20 19:00:21 2020   04      20      2020-04-20    
 6 2020-04-20 02:20:01 2020-04-20 02:20:01 2020   04      20      2020-04-20    
 7 2020-04-22 04:00:00 2020-04-22 04:00:00 2020   04      22      2020-04-22    
 8 2020-04-25 17:00:00 2020-04-25 17:00:00 2020   04      25      2020-04-25    
 9 2020-04-21 15:13:06 2020-04-21 15:13:06 2020   04      21      2020-04-21    
10 2020-04-21 17:52:47 2020-04-21 17:52:47 2020   04      21      2020-04-21    
# ℹ 318 more rows

Creating `POSIXct` objects from dataframe columns

Using the p12_datetime_df we created earlier, we can recreate the created_at column (class POSIXct) from the various date and time component columns (class character):

# Use `make_datetime()` to create a `POSIXct` object from the `yr_chr`, `mth_chr`, `day_chr`, `hr_chr`, `min_chr`, `sec_chr` fields
# Convert inputs to integers first
p12_datetime_df %>%
  mutate(datetime_make_datetime = make_datetime(
    as.integer(yr_chr), as.integer(mth_chr), as.integer(day_chr), 
    as.integer(hr_chr), as.integer(min_chr), as.integer(sec_chr)
  )) %>%
  select(datetime_make_datetime, yr_chr, mth_chr, day_chr, hr_chr, min_chr, sec_chr)

# A tibble: 328 × 7
   datetime_make_datetime yr_chr mth_chr day_chr hr_chr min_chr sec_chr
   <dttm>                 <chr>  <chr>   <chr>   <chr>  <chr>   <chr>  
 1 2020-04-25 22:37:18    2020   04      25      22     37      18     
 2 2020-04-23 21:11:49    2020   04      23      21     11      49     
 3 2020-04-21 04:00:00    2020   04      21      04     00      00     
 4 2020-04-24 03:00:00    2020   04      24      03     00      00     
 5 2020-04-20 19:00:21    2020   04      20      19     00      21     
 6 2020-04-20 02:20:01    2020   04      20      02     20      01     
 7 2020-04-22 04:00:00    2020   04      22      04     00      00     
 8 2020-04-25 17:00:00    2020   04      25      17     00      00     
 9 2020-04-21 15:13:06    2020   04      21      15     13      06     
10 2020-04-21 17:52:47    2020   04      21      17     52      47     
# ℹ 318 more rows

Date/time object components

Storing data using date/time objects makes it easier to get and set the various date/time components.

Basic accessor functions:

date(): Date component
year(): Year
month(): Month
day(): Day
hour(): Hour
minute(): Minute
second(): Second
week(): Week of the year
wday(): Day of the week (1 for Sunday to 7 for Saturday)
am(): Is it in the am? (returns TRUE or FALSE)
pm(): Is it in the pm? (returns TRUE or FALSE)

Date/time object components

To get a date/time component, you can simply pass a date/time object to the function
- Syntax: accessor_function(<date/time_object>)
To set a date/time component, you can assign into the accessor function to change the component
- Syntax: accessor_function(<date/time_object>) <- "new_component"
- Note that am() and pm() can’t be set. Modify the time components instead.

Getting date/time components

# Create datetime for New Year's Eve
dt <- make_datetime(2019, 12, 31, 23, 59, 59)
dt

[1] "2019-12-31 23:59:59 UTC"

dt %>% class()

[1] "POSIXct" "POSIXt"

date(dt) # Get date

[1] "2019-12-31"

hour(dt) # Get hour

[1] 23

pm(dt)   # Is it pm?

[1] TRUE

wday(dt) # Day of the week (3 = Tuesday)

[1] 3

year(dt) # Get year

[1] 2019

Setting date/time components

week(dt) # Get week of year

[1] 53

# Set week of year (move back 1 week)
week(dt) <- week(dt) - 1
dt

[1] "2019-12-24 23:59:59 UTC"

day(dt) <- 25 # Set day to Christmas Day
dt

[1] "2019-12-25 23:59:59 UTC"

Getting date/time components from dataframe column

Using the p12_datetime_df we created earlier, we can isolate the various date/time components from the POSIXct object in the created_at column:

# The extracted date/time components will be of numeric type
p12_datetime_df %>% select(created_at) %>%
  mutate(
    yr_num = year(created_at),
    mth_num = month(created_at),
    day_num = day(created_at),
    hr_num = hour(created_at),
    min_num = minute(created_at),
    sec_num = second(created_at),
    ampm = ifelse(am(created_at), 'AM', 'PM')  # am() and pm() return TRUE/FALSE
  )

# A tibble: 328 × 8
   created_at          yr_num mth_num day_num hr_num min_num sec_num ampm 
   <dttm>               <dbl>   <dbl>   <int>  <int>   <int>   <dbl> <chr>
 1 2020-04-25 22:37:18   2020       4      25     22      37      18 PM   
 2 2020-04-23 21:11:49   2020       4      23     21      11      49 PM   
 3 2020-04-21 04:00:00   2020       4      21      4       0       0 AM   
 4 2020-04-24 03:00:00   2020       4      24      3       0       0 AM   
 5 2020-04-20 19:00:21   2020       4      20     19       0      21 PM   
 6 2020-04-20 02:20:01   2020       4      20      2      20       1 AM   
 7 2020-04-22 04:00:00   2020       4      22      4       0       0 AM   
 8 2020-04-25 17:00:00   2020       4      25     17       0       0 PM   
 9 2020-04-21 15:13:06   2020       4      21     15      13       6 PM   
10 2020-04-21 17:52:47   2020       4      21     17      52      47 PM   
# ℹ 318 more rows

Time spans

3 ways to represent time spans (From lubridate cheatsheet)

Intervals represent specific intervals of the timeline, bounded by start and end date-times
- Example: People with birthdays in the interval from October 23 to November 22 are Scorpios
Periods track changes in clock times, which ignore time line irregularities
- Example: Daylight saving time ends at the beginning of November and we gain an hour; this extra hour is ignored when determining the period from October 23 to November 22
Durations track the passage of physical time, which deviates from clock time when irregularities occur
- Example: Daylight saving time ends at the beginning of November and we gain an hour; this extra hour is included when determining the duration from October 23 to November 22
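The three representations can be compared on the Scorpio example. This sketch fixes the time zone to "America/Los_Angeles" (an assumption, chosen so the result does not depend on the machine's settings) and uses lubridate's `interval()`, `as.period()`, and `as.duration()`:

```r
library(lubridate)

# Fixed time zone so the result does not depend on the machine's settings
start <- ymd("2019-10-23", tz = "America/Los_Angeles")
end   <- ymd("2019-11-22", tz = "America/Los_Angeles")

span <- interval(start, end)   # anchored to two specific instants

span_period   <- as.period(span)     # clock time: 30 days, extra hour ignored
span_duration <- as.duration(span)   # physical time: 30 days plus 1 hour

span_period
as.numeric(span_duration)   # 2595600 seconds
```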

Time spans using `lubridate`: Durations

Durations keep track of the physical amount of time elapsed, so they are “stored as seconds, the only time unit with a consistent length” (From lubridate cheatsheet)
Create durations using functions whose name is the time unit prefixed with a d (e.g., dyears(), dweeks(), ddays(), dhours(), dminutes(), dseconds())

Time spans using `lubridate`: Durations

Example: ddays(1) creates a duration of 86400s, using the standard conversion of 60 seconds in a minute, 60 minutes in an hour, and 24 hours in a day:

ddays(1)

[1] "86400s (~1 days)"

Notice that the output says this is equivalent to approximately 1 day, since it acknowledges that not all days have 24 hours.

Time spans using `lubridate`: Durations

When daylight saving time ends, that particular day has 25 hours, so the duration of that day should be represented as:

ddays(1) + dhours(1)

[1] "90000s (~1.04 days)"

You can add and subtract durations
You can also use as.duration() to get duration of an interval
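A few quick checks of duration arithmetic, as a sketch. The interval endpoints are arbitrary example times in the default UTC time zone:

```r
library(lubridate)

# Durations are numbers of seconds, so arithmetic is exact
dweeks(1) == ddays(7)                  # TRUE
ddays(1) - dhours(1)                   # 82800 seconds, i.e. a 23-hour day
as.numeric(dhours(2) + dminutes(30))   # 9000 seconds

# as.duration() on an interval gives the seconds elapsed between its endpoints
span <- interval(ymd_hms("2020-01-01 00:00:00"), ymd_hms("2020-01-01 06:00:00"))
six_hours <- as.duration(span)
as.numeric(six_hours)                  # 21600 seconds
```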

Working with duration

If we use as.duration() to get the duration of scorpio_interval, we see that it is a duration of 2595600 seconds. In a time zone that observes daylight saving time (such as the US Pacific zone used here), this includes the extra hour gained when daylight saving time ends:

scorpio_start <- ymd("2019-10-23", tz = Sys.timezone()) 
scorpio_end <- ymd("2019-11-22", tz = Sys.timezone()) 

# Create the interval; `scorpio_start %--% scorpio_end` is an equivalent shorthand
scorpio_interval <- interval(scorpio_start, scorpio_end)

# Duration is 2595600 seconds, which is equivalent to 30 24-hr days + 1 additional hour
scorpio_duration <- as.duration(scorpio_interval)
scorpio_duration

[1] "2595600s (~4.29 weeks)"

# The object has class `Duration`
class(scorpio_duration)

[1] "Duration"
attr(,"package")
[1] "lubridate"

# Using the standard 60s/min, 60min/hr, 24hr/day conversion,
# confirm duration is slightly more than 30 "standard" (i.e., 24-hr) days
2595600 / (60 * 60 * 24)

[1] 30.04167

# Specifically, it is 30 days + 1 hour, if we define a day to have 24 hours
seconds_to_period(scorpio_duration)

[1] "30d 1H 0M 0S"

Working with duration

Because durations work with physical time, when we add a duration of 30 days to the scorpio_start datetime object, we do not get the end datetime we’d expect:

# Start datetime for Scorpio birthdays (time is midnight)
scorpio_start

[1] "2019-10-23 PDT"

# After adding 30 day duration, we do not get the expected end datetime
# `ddays(30)` adds the number of seconds in 30 standard 24-hr days, but one of the days has 25 hours
scorpio_start + ddays(30)

[1] "2019-11-21 23:00:00 PST"

# We need to add the additional 1 hour of physical time that elapsed during this time span
scorpio_start + ddays(30) + dhours(1)

[1] "2019-11-22 PST"
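If the goal is "same clock time, 30 calendar days later", a period may be the more natural tool. The sketch below contrasts `ddays()` with lubridate's period constructor `days()`. It fixes the time zone to "America/Los_Angeles" as an assumption, so the result does not depend on `Sys.timezone()`:

```r
library(lubridate)

start <- ymd("2019-10-23", tz = "America/Los_Angeles")

# Duration: exactly 30 * 86400 seconds later, which lands at 23:00 on Nov 21
by_duration <- start + ddays(30)

# Period: same clock time 30 calendar days later, which lands at midnight on Nov 22
by_period <- start + days(30)

by_duration
by_period
```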

Attributions

These materials were adapted from Ozan Jaquette’s EDUC 260A Course: Introduction to Programming and Data Management, Strings & Dates and EDUC 260B Course: Fundamentals of Programming, Strings & Regex
