Appendix

library(DBI)
library(tidyverse)
library(dbplyr)
library(knitr)
library(duckdb)
db <- dbConnect(duckdb::duckdb())

Date, Datetime, and Time Manipulations

INSTALL 'icu';
LOAD 'icu';

Time Zone Conversions

Tanimura (2021) points out that often “timestamps in the database are not encoded with the time zone, and you will need to consult with the source or developer to figure out how your data was stored.” When pushing data to a PostgreSQL database, I use the timestamp with time zone type as much as possible.
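
One way to check how a timestamp value is actually being stored is to ask DuckDB for its type with typeof(), as in the following small sketch (for an existing table, DESCRIBE shows the column types).

# The first expression lacks time zone information; the second carries it.
dbGetQuery(db, "
  SELECT typeof(timestamp '2020-09-01 00:00:00')      AS without_tz,
         typeof(timestamptz '2020-09-01 00:00:00+00') AS with_tz")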

Tanimura (2021) provides the following example, which is interesting because the west coast of the United States would not be on the PST time zone at that time of year. Instead, it would be on PDT.

SELECT timestamptz '2020-09-01 00:00:00 -0' AT TIME ZONE 'PST' AS time;
1 records
time
2020-08-31 17:00:00

In PostgreSQL, we would get a different answer with the following query, but in DuckDB it seems that PDT is not recognized as a time zone abbreviation at all, so we just get the original UTC timestamp back.

SELECT timestamptz '2020-09-01 00:00:00 -0' AT TIME ZONE 'PDT' AS time;
1 records
time
2020-09-01
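
If in doubt about whether DuckDB recognizes a given abbreviation, one option (a small sketch using the pg_timezone_names() function discussed just below) is to search for it directly; an empty result suggests that the abbreviation is not recognized.

dbGetQuery(db, "
  SELECT name, abbrev
  FROM pg_timezone_names()
  WHERE name = 'PDT' OR abbrev = 'PDT'")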

I think most people barely know the difference between PST and PDT, and even fewer would know the exact dates on which one switches to the other. A better approach is to use a time zone name that encodes when PDT applies and when PST applies. In PostgreSQL and DuckDB, the function pg_timezone_names() returns a table with the information that we need.

However, it seems that there are inconsistencies between PostgreSQL and DuckDB in terms of the abbreviations used. As such, it's probably safer to use the name form (e.g., US/Pacific) whenever possible. Given the widespread confusion about the meaning of terms like EST, EDT, and so on, just use US/Eastern, etc.

SELECT name, abbrev
FROM pg_timezone_names()
WHERE regexp_matches(name, '^US/');
-- use WHERE name ~ '^US/'; in PostgreSQL
Displaying records 1 - 10
name                abbrev
US/Alaska           AST
US/Aleutian         US/Aleutian
US/Arizona          PNT
US/Central          CST
US/East-Indiana     IET
US/Eastern          US/Eastern
US/Hawaii           US/Hawaii
US/Indiana-Starke   US/Indiana-Starke
US/Michigan         US/Michigan
US/Mountain         Navajo

The following queries demonstrate that daylight saving time information is encoded in the database.

SELECT '2020-09-01 17:00:01 US/Pacific'::timestamptz AS t1,
       '2020-09-02 10:00:01 Australia/Melbourne'::timestamptz  AS t2;
1 records
t1 t2
2020-09-02 00:00:01 2020-09-02 00:00:01
SELECT '2020-12-01 16:00:01 US/Pacific'::timestamptz AS t1,
       '2020-12-02 11:00:01 Australia/Melbourne'::timestamptz  AS t2;
1 records
t1 t2
2020-12-02 00:00:01 2020-12-02 00:00:01
sql <-
  "(SELECT     
    '2020-12-01 16:00:01 US/Pacific'::timestamptz AS t1)"

a_time <- tbl(db, sql(sql))
a_time |>
  kable()
t1
2020-12-02 00:00:01
a_time_r <-
  a_time |>
  select(t1) |>
  pull()

print(a_time_r, tz = "UTC")
[1] "2020-12-02 00:00:01 UTC"
print(a_time_r, tz = "US/Pacific")
[1] "2020-12-01 16:00:01 PST"
Sys.timezone()
[1] "America/New_York"
print(a_time_r, tz = Sys.timezone())
[1] "2020-12-01 19:00:01 EST"

The above examples illustrate a few key ideas.

First, while we supply the literal form '2020-09-01 17:00:01 US/Pacific'::timestamptz, it seems that once a value has been encoded as TIMESTAMP WITH TIME ZONE, it is effectively stored as a UTC timestamp, with only the displayed time differing.

Second, columns of type TIMESTAMP WITH TIME ZONE come into R with the associated time-zone information, which is what we want (especially if we will later put timestamp data back into PostgreSQL).

Third, we can see that we can choose to display information in a different time zone without changing the underlying data.
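
As a small check of the first point, the two literals from the December example above, though written for different time zones, compare equal because both represent the same UTC instant.

dbGetQuery(db, "
  SELECT '2020-12-01 16:00:01 US/Pacific'::timestamptz =
         '2020-12-02 11:00:01 Australia/Melbourne'::timestamptz AS same_instant")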

Some care is needed with timestamp data. I think the AT TIME ZONE queries provided in Tanimura (2021) are actually pretty dangerous, as can be seen in the following query. While we supply 2020-09-01 00:00:01 as UTC and then render it AT TIME ZONE 'US/Pacific', the returned value is a TIMESTAMP WITHOUT TIME ZONE, and subsequent queries lead to confusing behaviour. In the query below, the second application of AT TIME ZONE interprets the TIMESTAMP WITHOUT TIME ZONE as though it came from the stated time zone, so AT TIME ZONE ends up doing the opposite of what it did when given a TIMESTAMP WITH TIME ZONE (as in the initial literal '2020-09-01 00:00:01 -0').

WITH q1 AS
 (SELECT timestamptz '2020-09-01 00:00:01-00' AT TIME ZONE 'US/Pacific' AS t1,
         timestamp '2020-09-01 00:00:01' AT TIME ZONE 'US/Pacific' AS t2)
 
SELECT 
  t1,
  t1::varchar AS t1_char,
  t1 AT TIME ZONE 'UTC' AS t3,
  t2 AT TIME ZONE 'UTC' AS t4,
  typeof(t1),
  typeof(t2)
FROM q1
1 records
t1                   t1_char              t3                   t4                   typeof(t1)  typeof(t2)
2020-08-31 17:00:01  2020-08-31 17:00:01  2020-08-31 17:00:01  2020-09-01 07:00:01  TIMESTAMP   TIMESTAMP WITH TIME ZONE

It seems that TIMESTAMP WITHOUT TIME ZONE values should be converted to a time zone as quickly as possible to avoid confusion. Care is also needed with AT TIME ZONE, given that it does very different (essentially opposite) things depending on the supplied data type.
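
A compact way to see the two directions together (a sketch): converting a TIMESTAMP WITH TIME ZONE to Pacific wall-clock time and then interpreting that wall-clock time as Pacific should return the original instant, at least when the local time is not made ambiguous by a daylight saving transition.

dbGetQuery(db, "
  SELECT ((timestamptz '2020-09-01 00:00:01+00' AT TIME ZONE 'US/Pacific')
            AT TIME ZONE 'US/Pacific')
         = timestamptz '2020-09-01 00:00:01+00' AS round_trip")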

WITH q1 AS
 (SELECT '2020-09-01 00:00:01-00'::timestamptz AS t1)
 
SELECT t1,
  t1::varchar AS t2,
  typeof(t1)
FROM q1
1 records
t1                   t2                      typeof(t1)
2020-09-01 00:00:01  2020-08-31 20:00:01-04  TIMESTAMP WITH TIME ZONE

Strange behaviour can also result when timestamp-like values are not stored with any time zone information. Below we see that t1, which remains an untyped string in this query, prints the same regardless of the tz argument supplied to print(), while the behaviour of t2 (cast to timestamptz) seems easier to understand.

sql <-
  "(SELECT     
    '2020-12-01 00:00:01-00' AS t1,
    '2020-12-01 00:00:01-00'::timestamptz AS t2)"

two_times_notz <- tbl(db, sql(sql))
two_times_notz |> kable()
t1                      t2
2020-12-01 00:00:01-00  2020-12-01 00:00:01
two_times_notz_r <-
  collect(two_times_notz)
  
print(two_times_notz_r$t1)
[1] "2020-12-01 00:00:01-00"
Sys.timezone()
[1] "America/New_York"
print(two_times_notz_r$t1, tz = Sys.timezone())
[1] "2020-12-01 00:00:01-00"
print(two_times_notz_r$t2)
[1] "2020-12-01 00:00:01 UTC"
Sys.timezone()
[1] "America/New_York"
print(two_times_notz_r$t2, tz = Sys.timezone())
[1] "2020-11-30 19:00:01 EST"

As pointed out by Tanimura (2021), one drawback of storing information as UTC is that local-time information may be lost. Still, it seems more prudent to store information as TIMESTAMP WITH TIME ZONE and keep the local time zone as a separate column to avoid confusion. For example, if timestamps in the orders table are stored as TIMESTAMP WITHOUT TIME ZONE based on the local time of the customer, which might be Australia/Melbourne, while the shipping table uses TIMESTAMP WITH TIME ZONE, then an analyst of time-to-ship data would be confused by orders apparently being shipped before they are made. If the shipping table instead used TIMESTAMP WITHOUT TIME ZONE based on the local time of the East Bay warehouse (so US/Pacific), things would be even worse.
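
A minimal sketch of this approach (the table and column names here are hypothetical): keep the instant as TIMESTAMP WITH TIME ZONE, store the relevant local time zone alongside it, and recover local wall-clock time only when needed.

dbGetQuery(db, "
  WITH orders AS (
    SELECT 1 AS order_id,
           '2020-12-02 10:00:01 Australia/Melbourne'::timestamptz AS ordered_at,
           'Australia/Melbourne' AS customer_tz)
  SELECT order_id,
         ordered_at,                                   -- the instant (effectively UTC)
         ordered_at AT TIME ZONE customer_tz AS ordered_local
  FROM orders")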

I think that fully fleshing out the issues here would require a separate chapter. In fact, nothing in the core part of Chapter 3 of Tanimura (2021) (which focuses on the retail_sales table) really uses timestamp information, so we can put these issues aside for now.

Date and Timestamp Format Conversions

As discussed in Tanimura (2021), PostgreSQL has a rich array of functions for converting dates and times and extracting such information as months and days of the week.

SELECT date_trunc('month','2020-10-04 12:33:35 -00'::timestamptz);
1 records
date_trunc('month', CAST('2020-10-04 12:33:35 -00' AS TIMESTAMP WITH TIME ZONE))
2020-10-01 04:00:00
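
Beyond date_trunc(), extraction functions such as date_part() (equivalently, EXTRACT) pull out components such as the month or the day of the week; here is a small sketch.

dbGetQuery(db, "
  SELECT date_part('month', DATE '2020-10-04') AS month,
         date_part('dow',   DATE '2020-10-04') AS day_of_week")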

We can also use date_trunc() from dbplyr, which passes the function through to DuckDB unchanged, as show_query() confirms below.

a_time_df <- tbl(db, sql("(SELECT '2020-10-04 12:33:35'::timestamp AS a_time)"))

a_time_df |> 
  mutate(a_trunced_time = date_trunc('month', a_time))
# Source:   SQL [1 x 2]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
  a_time              a_trunced_time
  <dttm>              <date>        
1 2020-10-04 12:33:35 2020-10-01    
a_time_df |> 
  mutate(a_trunced_time = date_trunc('month', a_time)) |>
  show_query()
<SQL>
SELECT *, date_trunc('month', a_time) AS a_trunced_time
FROM (SELECT '2020-10-04 12:33:35'::timestamp AS a_time)
a_time_df |>
  collect()
# A tibble: 1 × 1
  a_time             
  <dttm>             
1 2020-10-04 12:33:35

Date Math

Time Math

a_time_df <- tbl(db, sql("(SELECT '2020-10-04 12:33:35 US/Pacific'::timestamptz AS a_time)"))

a_time_df |> 
  mutate(a_trunced_time = date_trunc('month', a_time)) 
# Source:   SQL [1 x 2]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
  a_time              a_trunced_time     
  <dttm>              <dttm>             
1 2020-10-04 19:33:35 2020-10-01 04:00:00
a_time_df |> 
  mutate(a_trunced_time = date_trunc('month', a_time)) |>
  show_query()
<SQL>
SELECT *, date_trunc('month', a_time) AS a_trunced_time
FROM (SELECT '2020-10-04 12:33:35 US/Pacific'::timestamptz AS a_time)
a_time_df |>
  collect()
# A tibble: 1 × 1
  a_time             
  <dttm>             
1 2020-10-04 19:33:35
a_time_df |>
  mutate(new_time = a_time + sql("interval '3 hours'")) |>
  collect()
# A tibble: 1 × 2
  a_time              new_time           
  <dttm>              <dttm>             
1 2020-10-04 19:33:35 2020-10-04 22:33:35

The Retail Sales Data Set

As discussed in Tanimura (2021), the data set used in this chapter comes from the website of the US Census Bureau. The data set is a little messy, but not too large, so we can easily grab it directly from the website and clean it up in much the same way that Cathy has done for us.

library(tidyverse)
library(readxl)

# Use tmpdir = "." or another known directory if you have trouble with 
# this part.
mrtssales <- tempfile(fileext = ".xlsx")
url <- paste0("https://www.census.gov/retail/mrts/www/",
              "mrtssales92-present.xlsx")
download.file(url, mrtssales)

read_tab <- function(year) {
  
  # Initially we read all columns as text, as we want to process
  # the columns more carefully than read_excel() would do unsupervised.
  temp <- read_excel(mrtssales,
                     range = "A4:N71", 
                     sheet = as.character(year),
                     col_types = "text",
                     col_names = paste0("v", 1:14))
  
  # The second row of temp has the dates for columns 3:14
  names(temp) <- c("naics_code", "kind_of_business",
                   as.character(temp[2, 3:14]))
  
  # The actual data are found after row 3
  temp <- temp[-1:-3, ]
  
  # Now pivot the data and convert sales to numeric values.
  # Also convert sales_month to dates (start of respective month).
  df <-
    temp |>
    pivot_longer(names_to = "sales_month",
                 values_to = "sales",
                 cols = -1:-2) |>
    mutate(sales_month = paste("01", str_remove(sales_month, "\\.")),
           sales_month = as.Date(sales_month, "%d %b %Y")) |>
    mutate(reason_for_null = case_when(sales == "(NA)" ~ "Not Available",
                                       sales == "(S)" ~ "Supressed",
                                       TRUE ~ NA),
           sales = case_when(sales == "(NA)" ~ NA,
                             sales == "(S)" ~ NA,
                             TRUE ~ sales)) |>
    mutate(sales = as.double(sales)) |>
    select(sales_month, naics_code, kind_of_business,
           reason_for_null, sales)
  df
}

retail_sales_local <- bind_rows(lapply(1992:2020, read_tab)) 

Note that there are differences between the retail_sales_local above and the data set we used in , as US Census economic data is continually being revised even after being released.

Earthquake Data

# Construct weekly date ranges covering January 2010 through early
# December 2020; the USGS API is queried one week at a time below.
weeks <- 
  tibble(start_date = seq(as.Date("2010-01-01"), 
                          as.Date("2020-12-01"),
                          by = 7)) |>
  mutate(end_date = start_date + days(7))
  
get_quake_data <- function(start, end) {
  # Fetch one week of earthquake records from the USGS API as CSV and
  # append them to the earthquakes table in the local DuckDB database.
  db <- dbConnect(duckdb::duckdb(), "earthquakes_new.duckdb")
  url <- paste0("https://earthquake.usgs.gov/fdsnws/event/1/",
                "query?format=csv&starttime=", as.character(start), 
                "&endtime=", as.character(end))
  df <- read_csv(url, show_col_types = FALSE)
  dbWriteTable(db, "earthquakes", df,
               append = TRUE, row.names = FALSE)
  dbDisconnect(db, shutdown = TRUE)
  return(TRUE)
}

# Start from a clean slate: drop any existing earthquakes table before
# repopulating it week by week.
db <- dbConnect(duckdb::duckdb(), "earthquakes_new.duckdb")
dbExecute(db, "DROP TABLE IF EXISTS earthquakes")
dbDisconnect(db, shutdown = TRUE)

res <- map2(weeks$start_date,
            weeks$end_date, 
            get_quake_data,
            .progress = TRUE)
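
As a quick sanity check (a sketch, assuming the loop above completed), we can count the rows that ended up in the earthquakes table.

db <- dbConnect(duckdb::duckdb(), "earthquakes_new.duckdb")
dbGetQuery(db, "SELECT count(*) AS n FROM earthquakes")
dbDisconnect(db, shutdown = TRUE)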