2  Using dplyr with databases

Chapter 2 of Tanimura (2021) provides a good foundation discussion of issues related to preparing data for analysis. While the discussion is couched in terms of SQL, in reality the issues are not specific to SQL or databases. For this reason, I recommend that you read the chapter.

In this chapter, I instead show how dplyr, a core Tidyverse package, can be used with SQL databases.

2.1 Introduction to dplyr

Our focus on using R and dplyr to access SQL databases means that we only need to learn a subset of the functionality of R.

While R offers a number of data structures—including vectors, lists, and matrices—we will focus primarily on data frames, which are R’s equivalents of tables in SQL databases. R also offers rich functionality for statistical testing and modelling, but we will make little use of this here.

library(DBI)
library(tidyverse)

2.1.1 Introduction to SQL

db <- dbConnect(duckdb::duckdb())

retail_sales <-
  tbl(db, "read_csv_auto('data/us_retail_sales.csv')") |>
  compute(name = "retail_sales")
SELECT sales_month, kind_of_business, sales
FROM retail_sales
retail_sales |>
  select(sales_month, kind_of_business, sales)
# Source:   SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
   sales_month kind_of_business                                        sales
   <date>      <chr>                                                   <dbl>
 1 1992-01-01  Motor vehicle and parts dealers                         29811
 2 1992-01-01  Automobile dealers                                      25800
 3 1992-01-01  Automobile and other motor vehicle dealers              26788
 4 1992-01-01  New car dealers                                         24056
 5 1992-01-01  Used car dealers                                         1744
 6 1992-01-01  Automotive parts, acc., and tire stores                  3023
 7 1992-01-01  Furniture and home furnishings stores                    3846
 8 1992-01-01  Furniture, home furn, electronics, and appliance stores  7503
 9 1992-01-01  Furniture stores                                         2392
10 1992-01-01  Home furnishings stores                                  1454
# ℹ more rows
SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE sales_month >= '2002-01-01'
retail_sales |>
  select(sales_month, kind_of_business, sales) |>
  filter(sales_month >= '2002-01-01')
# Source:   SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
   sales_month kind_of_business                                        sales
   <date>      <chr>                                                   <dbl>
 1 2002-01-01  Motor vehicle and parts dealers                         60565
 2 2002-01-01  Automobile dealers                                      52948
 3 2002-01-01  Automobile and other motor vehicle dealers              55799
 4 2002-01-01  New car dealers                                         48146
 5 2002-01-01  Used car dealers                                         4802
 6 2002-01-01  Automotive parts, acc., and tire stores                  4766
 7 2002-01-01  Furniture and home furnishings stores                    7149
 8 2002-01-01  Furniture, home furn, electronics, and appliance stores 14342
 9 2002-01-01  Furniture stores                                         4097
10 2002-01-01  Home furnishings stores                                  3052
# ℹ more rows
SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE 
  • Use == instead of =.
  • Can use " or ' for quotes.
retail_sales |>
  select(sales_month, kind_of_business, sales) |>
  filter(kind_of_business == 'Used car dealers')
# Source:   SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
   sales_month kind_of_business sales
   <date>      <chr>            <dbl>
 1 1992-01-01  Used car dealers  1744
 2 1992-02-01  Used car dealers  1990
 3 1992-03-01  Used car dealers  2177
 4 1992-04-01  Used car dealers  2601
 5 1992-05-01  Used car dealers  2171
 6 1992-06-01  Used car dealers  2207
 7 1992-07-01  Used car dealers  2251
 8 1992-08-01  Used car dealers  2087
 9 1992-09-01  Used car dealers  2016
10 1992-10-01  Used car dealers  2149
# ℹ more rows
SELECT kind_of_business, sum(sales) AS total_sales
FROM retail_sales
GROUP BY kind_of_business
ORDER BY total_sales DESC;
Displaying records 1 - 10
kind_of_business total_sales
Retail and food services sales, total 118053993
Retail sales and food services excl gasoline stations 107701613
Retail sales, total 105580364
Retail sales and food services excl motor vehicle and parts 93509935
Retail sales and food services excl motor vehicle and parts and gasoline stations 83157555
Retail sales, total (excl. motor vehicle and parts dealers) 81036306
GAFO(1) 29041144
Motor vehicle and parts dealers 24544058
Automobile and other motor vehicle dealers 22462364
Automobile dealers 20963805
  • use na.rm = TRUE
  • Put new variable name to the left of aggregate function
  • Use arrange() instead of ORDER BY
  • Use desc() instead of DESC
  • No need to separately select GROUP BY variables.
  • No exact equivalent to reference by order
retail_sales |>
  group_by(kind_of_business) |>
  summarize(total_sales = sum(sales, na.rm = TRUE)) |>
  arrange(desc(total_sales))
# Source:     SQL [?? x 2]
# Database:   DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
   kind_of_business                                                  total_sales
   <chr>                                                                   <dbl>
 1 Retail and food services sales, total                               118053993
 2 Retail sales and food services excl gasoline stations               107701613
 3 Retail sales, total                                                 105580364
 4 Retail sales and food services excl motor vehicle and parts          93509935
 5 Retail sales and food services excl motor vehicle and parts and …    83157555
 6 Retail sales, total (excl. motor vehicle and parts dealers)          81036306
 7 GAFO(1)                                                              29041144
 8 Motor vehicle and parts dealers                                      24544058
 9 Automobile and other motor vehicle dealers                           22462364
10 Automobile dealers                                                   20963805
# ℹ more rows
retail_sales |>
  group_by(kind_of_business) |>
  summarize(total_sales = sum(sales, na.rm = TRUE)) |>
  arrange(desc(total_sales))
# Source:     SQL [?? x 2]
# Database:   DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
   kind_of_business                                                  total_sales
   <chr>                                                                   <dbl>
 1 Retail and food services sales, total                               118053993
 2 Retail sales and food services excl gasoline stations               107701613
 3 Retail sales, total                                                 105580364
 4 Retail sales and food services excl motor vehicle and parts          93509935
 5 Retail sales and food services excl motor vehicle and parts and …    83157555
 6 Retail sales, total (excl. motor vehicle and parts dealers)          81036306
 7 GAFO(1)                                                              29041144
 8 Motor vehicle and parts dealers                                      24544058
 9 Automobile and other motor vehicle dealers                           22462364
10 Automobile dealers                                                   20963805
# ℹ more rows
SELECT year(sales_month) AS year,
  sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
ORDER BY 2 DESC;
Displaying records 1 - 10
year total_sales
2020 124040
2019 119067
2018 114390
2017 112174
2016 105905
2015 98330
2014 90424
2013 83916
2012 80228
2007 79696
retail_sales |>
  mutate(year = year(sales_month)) |>
  filter(kind_of_business == 'Used car dealers') |>
  group_by(year) |>
  summarize(total_sales = sum(sales, na.rm = TRUE)) |>
  arrange(desc(total_sales))
# Source:     SQL [?? x 2]
# Database:   DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
    year total_sales
   <dbl>       <dbl>
 1  2020      124040
 2  2019      119067
 3  2018      114390
 4  2017      112174
 5  2016      105905
 6  2015       98330
 7  2014       90424
 8  2013       83916
 9  2012       80228
10  2007       79696
# ℹ more rows
SELECT year(sales_month) AS year,
  sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
HAVING total_sales >= 100000
ORDER BY 2 DESC;
5 records
year total_sales
2020 124040
2019 119067
2018 114390
2017 112174
2016 105905
retail_sales |>
  mutate(year = year(sales_month)) |>
  filter(kind_of_business == 'Used car dealers') |>
  group_by(year) |>
  summarize(total_sales = sum(sales, na.rm = TRUE)) |>
  filter(total_sales >= 100000) |>
  arrange(desc(total_sales)) 
# Source:     SQL [5 x 2]
# Database:   DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
   year total_sales
  <dbl>       <dbl>
1  2020      124040
2  2019      119067
3  2018      114390
4  2017      112174
5  2016      105905
SELECT kind_of_business, reason_for_null,
  count(*) AS num_null
FROM retail_sales
WHERE sales IS NULL
GROUP BY 1, 2
ORDER BY 3 DESC
Displaying records 1 - 10
kind_of_business reason_for_null num_null
All other home furnishings stores Not Available 108
Paint and wallpaper stores Not Available 108
Supermarkets and other grocery (except convenience) stores Not Available 108
Other clothing stores Not Available 108
Electronics stores Not Available 60
Floor covering stores Supressed 46
Drinking places Supressed 34
Full service restaurants Supressed 11
Jewelry stores Supressed 3
Home furnishings stores Supressed 3
  • is.na() instead of IS NULL
retail_sales |>
  filter(is.na(sales)) |>
  group_by(kind_of_business, reason_for_null) |>
  summarize(num_null = n(), .groups = "drop") %>%
  arrange(desc(num_null))
# Source:     SQL [?? x 3]
# Database:   DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(num_null)
   kind_of_business                                     reason_for_null num_null
   <chr>                                                <chr>              <dbl>
 1 All other home furnishings stores                    Not Available        108
 2 Paint and wallpaper stores                           Not Available        108
 3 Supermarkets and other grocery (except convenience)… Not Available        108
 4 Other clothing stores                                Not Available        108
 5 Electronics stores                                   Not Available         60
 6 Floor covering stores                                Supressed             46
 7 Drinking places                                      Supressed             34
 8 Full service restaurants                             Supressed             11
 9 Jewelry stores                                       Supressed              3
10 Home furnishings stores                              Supressed              3
# ℹ more rows

2.1.2 Missing data

The SQL in the book generally uses the form x::date rather than the more standard SQL CAST(x AS DATE). In dbplyr, we would use as.Date(x) and dbplyr would translate as CAST(x AS DATE). The following code and output demonstrates how dbplyr translated from dplyr to SQL.

The table stored in dates_processed below is equivalent to that created and stored in the database as date_dim in the code supplied with book. This date_dim table is only used in #sec-time-series of the book and we will not even use it there (for reasons to be explained).

library(tidyverse)
library(DBI)
library(knitr)
db <- dbConnect(duckdb::duckdb(), bigint = "integer")

2.2 Preparing: Shaping Data

2.2.1 For Which Output: BI, Visualization, Statistics, ML

2.2.2 Pivoting with CASE Statements

2.2.3 Unpivoting with UNION Statements

A user of dplyr has access to the functions pivot_wider and pivot_longer, which make it much easier to “pivot” and “unpivot” tables than using CASE statements, which could become long and tedious.

To illustrate the dplyr way of doing things, I will create ctry_pops to match the data discussed in Chapter 2. First, I create the data set using the tribble() function from dplyr.

ctry_pops <-
  tribble(
  ~country, ~year_1980,  ~year_1990, ~year_2000, ~year_2010,
  "Canada", 24593, 27791, 31100, 34207,
  "Mexico", 68347, 84634, 99775, 114061,
  "United States", 227225, 249623, 282162, 309326
)

Second, I pivot the local data frame using pivot_longer.

ctry_pops_long <-
  ctry_pops |>
  pivot_longer(cols = -country, 
               names_to = "year",
               names_prefix = "year_",
               values_ptypes = integer(),
               values_to = "population") 
ctry_pops_long |>
  kable()
country year population
Canada 1980 24593
Canada 1990 27791
Canada 2000 31100
Canada 2010 34207
Mexico 1980 68347
Mexico 1990 84634
Mexico 2000 99775
Mexico 2010 114061
United States 1980 227225
United States 1990 249623
United States 2000 282162
United States 2010 309326

Next, I copy the data to PostgreSQL, so that it’s a (temporary) table inside the database.1

ctry_pops_db <- copy_to(db, ctry_pops)
ctry_pops_db
# Source:   table<ctry_pops> [3 x 5]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
  country       year_1980 year_1990 year_2000 year_2010
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 Canada            24593     27791     31100     34207
2 Mexico            68347     84634     99775    114061
3 United States    227225    249623    282162    309326
ctry_pops_db_long <-
  ctry_pops_db |>
  pivot_longer(cols = -country, 
               names_to = "year",
               names_prefix = "year_",
               values_to = "population") 

From the output below, we can see that dbplyr has taken care of the tedious business of constructing several statements for us.

ctry_pops_db_long |>
  show_query()
<SQL>
(
  (
    (
      SELECT country, '1980' AS "year", year_1980 AS population
      FROM ctry_pops
    )
    UNION ALL
    (
      SELECT country, '1990' AS "year", year_1990 AS population
      FROM ctry_pops
    )
  )
  UNION ALL
  (
    SELECT country, '2000' AS "year", year_2000 AS population
    FROM ctry_pops
  )
)
UNION ALL
(
  SELECT country, '2010' AS "year", year_2010 AS population
  FROM ctry_pops
)

And from the following, we can see that the result is the same as it was when using dplyr on a local data frame.

ctry_pops_db_long |>
  kable()
country year population
Canada 1980 24593
Mexico 1980 68347
United States 1980 227225
Canada 1990 27791
Mexico 1990 84634
United States 1990 249623
Canada 2000 31100
Mexico 2000 99775
United States 2000 282162
Canada 2010 34207
Mexico 2010 114061
United States 2010 309326

And we can reverse the pivot_longer() using pivot_wider().

ctry_pops_db_long |>
  compute() |>
  pivot_wider(names_from = year, 
              values_from = population, 
              names_prefix = "year_") |>
  kable()
country year_1980 year_1990 year_2000 year_2010
Canada 24593 27791 31100 34207
Mexico 68347 84634 99775 114061
United States 227225 249623 282162 309326

2.2.4 pivot and unpivot Functions


  1. “Temporary” here means that it will disappear once we close our connection to the database.↩︎