2 Using dplyr with databases

Chapter 2 of Tanimura (2021) provides a good foundation discussion of issues related to preparing data for analysis. While the discussion is couched in terms of SQL, in reality the issues are not specific to SQL or databases. For this reason, I recommend that you read the chapter.

In this chapter, I instead show how dplyr, a core Tidyverse package, can be used with SQL databases.

2.1 Introduction to dplyr

Our focus on using R and dplyr to access SQL databases means that we only need to learn a subset of the functionality of R.

While R offers a number of data structures—including vectors, lists, and matrices—we will focus primarily on data frames, which are R’s equivalents of tables in SQL databases. R also offers rich functionality for statistical testing and modelling, but we will make little use of this here.

library(DBI)
library(tidyverse)

2.1.1 Introduction to SQL

db <- dbConnect(duckdb::duckdb())

retail_sales <-
  tbl(db, "read_csv_auto('data/us_retail_sales.csv')") |>
  compute(name = "retail_sales")

SELECT sales_month, kind_of_business, sales
FROM retail_sales

retail_sales |>
  select(sales_month, kind_of_business, sales)

# Source:   SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
   sales_month kind_of_business                                        sales
   <date>      <chr>                                                   <dbl>
 1 1992-01-01  Motor vehicle and parts dealers                         29811
 2 1992-01-01  Automobile dealers                                      25800
 3 1992-01-01  Automobile and other motor vehicle dealers              26788
 4 1992-01-01  New car dealers                                         24056
 5 1992-01-01  Used car dealers                                         1744
 6 1992-01-01  Automotive parts, acc., and tire stores                  3023
 7 1992-01-01  Furniture and home furnishings stores                    3846
 8 1992-01-01  Furniture, home furn, electronics, and appliance stores  7503
 9 1992-01-01  Furniture stores                                         2392
10 1992-01-01  Home furnishings stores                                  1454
# ℹ more rows

SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE sales_month >= '2002-01-01'

retail_sales |>
  select(sales_month, kind_of_business, sales) |>
  filter(sales_month >= '2002-01-01')

# Source:   SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
   sales_month kind_of_business                                        sales
   <date>      <chr>                                                   <dbl>
 1 2002-01-01  Motor vehicle and parts dealers                         60565
 2 2002-01-01  Automobile dealers                                      52948
 3 2002-01-01  Automobile and other motor vehicle dealers              55799
 4 2002-01-01  New car dealers                                         48146
 5 2002-01-01  Used car dealers                                         4802
 6 2002-01-01  Automotive parts, acc., and tire stores                  4766
 7 2002-01-01  Furniture and home furnishings stores                    7149
 8 2002-01-01  Furniture, home furn, electronics, and appliance stores 14342
 9 2002-01-01  Furniture stores                                         4097
10 2002-01-01  Home furnishings stores                                  3052
# ℹ more rows

SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE

Use == instead of =.
Can use " or ' for quotes.

retail_sales |>
  select(sales_month, kind_of_business, sales) |>
  filter(kind_of_business == 'Used car dealers')

# Source:   SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
   sales_month kind_of_business sales
   <date>      <chr>            <dbl>
 1 1992-01-01  Used car dealers  1744
 2 1992-02-01  Used car dealers  1990
 3 1992-03-01  Used car dealers  2177
 4 1992-04-01  Used car dealers  2601
 5 1992-05-01  Used car dealers  2171
 6 1992-06-01  Used car dealers  2207
 7 1992-07-01  Used car dealers  2251
 8 1992-08-01  Used car dealers  2087
 9 1992-09-01  Used car dealers  2016
10 1992-10-01  Used car dealers  2149
# ℹ more rows

SELECT kind_of_business, sum(sales) AS total_sales
FROM retail_sales
GROUP BY kind_of_business
ORDER BY total_sales DESC;

Displaying records 1 - 10
kind_of_business	total_sales
Retail and food services sales, total	118053993
Retail sales and food services excl gasoline stations	107701613
Retail sales, total	105580364
Retail sales and food services excl motor vehicle and parts	93509935
Retail sales and food services excl motor vehicle and parts and gasoline stations	83157555
Retail sales, total (excl. motor vehicle and parts dealers)	81036306
GAFO(1)	29041144
Motor vehicle and parts dealers	24544058
Automobile and other motor vehicle dealers	22462364
Automobile dealers	20963805

use na.rm = TRUE
Put new variable name to the left of aggregate function
Use arrange() instead of ORDER BY
Use desc() instead of DESC
No need to separately select GROUP BY variables.
No exact equivalent to reference by order

retail_sales |>
  group_by(kind_of_business) |>
  summarize(total_sales = sum(sales, na.rm = TRUE)) |>
  arrange(desc(total_sales))

# Source:     SQL [?? x 2]
# Database:   DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
   kind_of_business                                                  total_sales
   <chr>                                                                   <dbl>
 1 Retail and food services sales, total                               118053993
 2 Retail sales and food services excl gasoline stations               107701613
 3 Retail sales, total                                                 105580364
 4 Retail sales and food services excl motor vehicle and parts          93509935
 5 Retail sales and food services excl motor vehicle and parts and …    83157555
 6 Retail sales, total (excl. motor vehicle and parts dealers)          81036306
 7 GAFO(1)                                                              29041144
 8 Motor vehicle and parts dealers                                      24544058
 9 Automobile and other motor vehicle dealers                           22462364
10 Automobile dealers                                                   20963805
# ℹ more rows

retail_sales |>
  group_by(kind_of_business) |>
  summarize(total_sales = sum(sales, na.rm = TRUE)) |>
  arrange(desc(total_sales))

# Source:     SQL [?? x 2]
# Database:   DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
   kind_of_business                                                  total_sales
   <chr>                                                                   <dbl>
 1 Retail and food services sales, total                               118053993
 2 Retail sales and food services excl gasoline stations               107701613
 3 Retail sales, total                                                 105580364
 4 Retail sales and food services excl motor vehicle and parts          93509935
 5 Retail sales and food services excl motor vehicle and parts and …    83157555
 6 Retail sales, total (excl. motor vehicle and parts dealers)          81036306
 7 GAFO(1)                                                              29041144
 8 Motor vehicle and parts dealers                                      24544058
 9 Automobile and other motor vehicle dealers                           22462364
10 Automobile dealers                                                   20963805
# ℹ more rows

SELECT year(sales_month) AS year,
  sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
ORDER BY 2 DESC;

Displaying records 1 - 10
year	total_sales
2020	124040
2019	119067
2018	114390
2017	112174
2016	105905
2015	98330
2014	90424
2013	83916
2012	80228
2007	79696

retail_sales |>
  mutate(year = year(sales_month)) |>
  filter(kind_of_business == 'Used car dealers') |>
  group_by(year) |>
  summarize(total_sales = sum(sales, na.rm = TRUE)) |>
  arrange(desc(total_sales))

# Source:     SQL [?? x 2]
# Database:   DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
    year total_sales
   <dbl>       <dbl>
 1  2020      124040
 2  2019      119067
 3  2018      114390
 4  2017      112174
 5  2016      105905
 6  2015       98330
 7  2014       90424
 8  2013       83916
 9  2012       80228
10  2007       79696
# ℹ more rows

SELECT year(sales_month) AS year,
  sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
HAVING total_sales >= 100000
ORDER BY 2 DESC;

5 records
year	total_sales
2020	124040
2019	119067
2018	114390
2017	112174
2016	105905

retail_sales |>
  mutate(year = year(sales_month)) |>
  filter(kind_of_business == 'Used car dealers') |>
  group_by(year) |>
  summarize(total_sales = sum(sales, na.rm = TRUE)) |>
  filter(total_sales >= 100000) |>
  arrange(desc(total_sales))

# Source:     SQL [5 x 2]
# Database:   DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
   year total_sales
  <dbl>       <dbl>
1  2020      124040
2  2019      119067
3  2018      114390
4  2017      112174
5  2016      105905

SELECT kind_of_business, reason_for_null,
  count(*) AS num_null
FROM retail_sales
WHERE sales IS NULL
GROUP BY 1, 2
ORDER BY 3 DESC

Displaying records 1 - 10
kind_of_business	reason_for_null	num_null
All other home furnishings stores	Not Available	108
Paint and wallpaper stores	Not Available	108
Supermarkets and other grocery (except convenience) stores	Not Available	108
Other clothing stores	Not Available	108
Electronics stores	Not Available	60
Floor covering stores	Supressed	46
Drinking places	Supressed	34
Full service restaurants	Supressed	11
Jewelry stores	Supressed	3
Home furnishings stores	Supressed	3

is.na() instead of IS NULL

retail_sales |>
  filter(is.na(sales)) |>
  group_by(kind_of_business, reason_for_null) |>
  summarize(num_null = n(), .groups = "drop") %>%
  arrange(desc(num_null))

# Source:     SQL [?? x 3]
# Database:   DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(num_null)
   kind_of_business                                     reason_for_null num_null
   <chr>                                                <chr>              <dbl>
 1 All other home furnishings stores                    Not Available        108
 2 Paint and wallpaper stores                           Not Available        108
 3 Supermarkets and other grocery (except convenience)… Not Available        108
 4 Other clothing stores                                Not Available        108
 5 Electronics stores                                   Not Available         60
 6 Floor covering stores                                Supressed             46
 7 Drinking places                                      Supressed             34
 8 Full service restaurants                             Supressed             11
 9 Jewelry stores                                       Supressed              3
10 Home furnishings stores                              Supressed              3
# ℹ more rows

2.1.2 Missing data

The SQL in the book generally uses the form x::date rather than the more standard SQL CAST(x AS DATE). In dbplyr, we would use as.Date(x) and dbplyr would translate as CAST(x AS DATE). The following code and output demonstrates how dbplyr translated from dplyr to SQL.

The table stored in dates_processed below is equivalent to that created and stored in the database as date_dim in the code supplied with book. This date_dim table is only used in #sec-time-series of the book and we will not even use it there (for reasons to be explained).

library(tidyverse)
library(DBI)
library(knitr)

db <- dbConnect(duckdb::duckdb(), bigint = "integer")

2.2 Preparing: Shaping Data

2.2.1 For Which Output: BI, Visualization, Statistics, ML

2.2.2 Pivoting with CASE Statements

2.2.3 Unpivoting with UNION Statements

A user of dplyr has access to the functions pivot_wider and pivot_longer, which make it much easier to “pivot” and “unpivot” tables than using CASE statements, which could become long and tedious.

To illustrate the dplyr way of doing things, I will create ctry_pops to match the data discussed in Chapter 2. First, I create the data set using the tribble() function from dplyr.

ctry_pops <-
  tribble(
  ~country, ~year_1980,  ~year_1990, ~year_2000, ~year_2010,
  "Canada", 24593, 27791, 31100, 34207,
  "Mexico", 68347, 84634, 99775, 114061,
  "United States", 227225, 249623, 282162, 309326
)

Second, I pivot the local data frame using pivot_longer.

ctry_pops_long <-
  ctry_pops |>
  pivot_longer(cols = -country, 
               names_to = "year",
               names_prefix = "year_",
               values_ptypes = integer(),
               values_to = "population") 
ctry_pops_long |>
  kable()

country	year	population
Canada	1980	24593
Canada	1990	27791
Canada	2000	31100
Canada	2010	34207
Mexico	1980	68347
Mexico	1990	84634
Mexico	2000	99775
Mexico	2010	114061
United States	1980	227225
United States	1990	249623
United States	2000	282162
United States	2010	309326

Next, I copy the data to PostgreSQL, so that it’s a (temporary) table inside the database.¹

ctry_pops_db <- copy_to(db, ctry_pops)
ctry_pops_db

# Source:   table<ctry_pops> [3 x 5]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
  country       year_1980 year_1990 year_2000 year_2010
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 Canada            24593     27791     31100     34207
2 Mexico            68347     84634     99775    114061
3 United States    227225    249623    282162    309326

ctry_pops_db_long <-
  ctry_pops_db |>
  pivot_longer(cols = -country, 
               names_to = "year",
               names_prefix = "year_",
               values_to = "population")

From the output below, we can see that dbplyr has taken care of the tedious business of constructing several statements for us.

ctry_pops_db_long |>
  show_query()

<SQL>
(
  (
    (
      SELECT country, '1980' AS "year", year_1980 AS population
      FROM ctry_pops
    )
    UNION ALL
    (
      SELECT country, '1990' AS "year", year_1990 AS population
      FROM ctry_pops
    )
  )
  UNION ALL
  (
    SELECT country, '2000' AS "year", year_2000 AS population
    FROM ctry_pops
  )
)
UNION ALL
(
  SELECT country, '2010' AS "year", year_2010 AS population
  FROM ctry_pops
)

And from the following, we can see that the result is the same as it was when using dplyr on a local data frame.

ctry_pops_db_long |>
  kable()

country	year	population
Canada	1980	24593
Mexico	1980	68347
United States	1980	227225
Canada	1990	27791
Mexico	1990	84634
United States	1990	249623
Canada	2000	31100
Mexico	2000	99775
United States	2000	282162
Canada	2010	34207
Mexico	2010	114061
United States	2010	309326

And we can reverse the pivot_longer() using pivot_wider().

ctry_pops_db_long |>
  compute() |>
  pivot_wider(names_from = year, 
              values_from = population, 
              names_prefix = "year_") |>
  kable()

country	year_1980	year_1990	year_2000	year_2010
Canada	24593	27791	31100	34207
Mexico	68347	84634	99775	114061
United States	227225	249623	282162	309326

2.2.4 pivot and unpivot Functions

“Temporary” here means that it will disappear once we close our connection to the database.↩︎