library(DBI)
library(tidyverse)
2 Using dplyr with databases
Chapter 2 of Tanimura (2021) provides a good foundation discussion of issues related to preparing data for analysis. While the discussion is couched in terms of SQL, in reality the issues are not specific to SQL or databases. For this reason, I recommend that you read the chapter.
In this chapter, I instead show how dplyr
, a core Tidyverse package, can be used with SQL databases.
2.1 Introduction to dplyr
Our focus on using R and dplyr
to access SQL databases means that we only need to learn a subset of the functionality of R.
While R offers a number of data structures—including vectors, lists, and matrices—we will focus primarily on data frames, which are R’s equivalents of tables in SQL databases. R also offers rich functionality for statistical testing and modelling, but we will make little use of this here.
2.1.1 Introduction to SQL
<- dbConnect(duckdb::duckdb())
db
<-
retail_sales tbl(db, "read_csv_auto('data/us_retail_sales.csv')") |>
compute(name = "retail_sales")
SELECT sales_month, kind_of_business, sales
FROM retail_sales
|>
retail_sales select(sales_month, kind_of_business, sales)
# Source: SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
sales_month kind_of_business sales
<date> <chr> <dbl>
1 1992-01-01 Motor vehicle and parts dealers 29811
2 1992-01-01 Automobile dealers 25800
3 1992-01-01 Automobile and other motor vehicle dealers 26788
4 1992-01-01 New car dealers 24056
5 1992-01-01 Used car dealers 1744
6 1992-01-01 Automotive parts, acc., and tire stores 3023
7 1992-01-01 Furniture and home furnishings stores 3846
8 1992-01-01 Furniture, home furn, electronics, and appliance stores 7503
9 1992-01-01 Furniture stores 2392
10 1992-01-01 Home furnishings stores 1454
# ℹ more rows
SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE sales_month >= '2002-01-01'
|>
retail_sales select(sales_month, kind_of_business, sales) |>
filter(sales_month >= '2002-01-01')
# Source: SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
sales_month kind_of_business sales
<date> <chr> <dbl>
1 2002-01-01 Motor vehicle and parts dealers 60565
2 2002-01-01 Automobile dealers 52948
3 2002-01-01 Automobile and other motor vehicle dealers 55799
4 2002-01-01 New car dealers 48146
5 2002-01-01 Used car dealers 4802
6 2002-01-01 Automotive parts, acc., and tire stores 4766
7 2002-01-01 Furniture and home furnishings stores 7149
8 2002-01-01 Furniture, home furn, electronics, and appliance stores 14342
9 2002-01-01 Furniture stores 4097
10 2002-01-01 Home furnishings stores 3052
# ℹ more rows
SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE
- Use
==
instead of=
. - Can use
"
or'
for quotes.
|>
retail_sales select(sales_month, kind_of_business, sales) |>
filter(kind_of_business == 'Used car dealers')
# Source: SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
sales_month kind_of_business sales
<date> <chr> <dbl>
1 1992-01-01 Used car dealers 1744
2 1992-02-01 Used car dealers 1990
3 1992-03-01 Used car dealers 2177
4 1992-04-01 Used car dealers 2601
5 1992-05-01 Used car dealers 2171
6 1992-06-01 Used car dealers 2207
7 1992-07-01 Used car dealers 2251
8 1992-08-01 Used car dealers 2087
9 1992-09-01 Used car dealers 2016
10 1992-10-01 Used car dealers 2149
# ℹ more rows
SELECT kind_of_business, sum(sales) AS total_sales
FROM retail_sales
GROUP BY kind_of_business
ORDER BY total_sales DESC;
kind_of_business | total_sales |
---|---|
Retail and food services sales, total | 118053993 |
Retail sales and food services excl gasoline stations | 107701613 |
Retail sales, total | 105580364 |
Retail sales and food services excl motor vehicle and parts | 93509935 |
Retail sales and food services excl motor vehicle and parts and gasoline stations | 83157555 |
Retail sales, total (excl. motor vehicle and parts dealers) | 81036306 |
GAFO(1) | 29041144 |
Motor vehicle and parts dealers | 24544058 |
Automobile and other motor vehicle dealers | 22462364 |
Automobile dealers | 20963805 |
- use
na.rm = TRUE
- Put new variable name to the left of aggregate function
- Use
arrange()
instead ofORDER BY
- Use
desc()
instead ofDESC
- No need to separately select
GROUP BY
variables. - No exact equivalent to reference by order
|>
retail_sales group_by(kind_of_business) |>
summarize(total_sales = sum(sales, na.rm = TRUE)) |>
arrange(desc(total_sales))
# Source: SQL [?? x 2]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
kind_of_business total_sales
<chr> <dbl>
1 Retail and food services sales, total 118053993
2 Retail sales and food services excl gasoline stations 107701613
3 Retail sales, total 105580364
4 Retail sales and food services excl motor vehicle and parts 93509935
5 Retail sales and food services excl motor vehicle and parts and … 83157555
6 Retail sales, total (excl. motor vehicle and parts dealers) 81036306
7 GAFO(1) 29041144
8 Motor vehicle and parts dealers 24544058
9 Automobile and other motor vehicle dealers 22462364
10 Automobile dealers 20963805
# ℹ more rows
|>
retail_sales group_by(kind_of_business) |>
summarize(total_sales = sum(sales, na.rm = TRUE)) |>
arrange(desc(total_sales))
# Source: SQL [?? x 2]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
kind_of_business total_sales
<chr> <dbl>
1 Retail and food services sales, total 118053993
2 Retail sales and food services excl gasoline stations 107701613
3 Retail sales, total 105580364
4 Retail sales and food services excl motor vehicle and parts 93509935
5 Retail sales and food services excl motor vehicle and parts and … 83157555
6 Retail sales, total (excl. motor vehicle and parts dealers) 81036306
7 GAFO(1) 29041144
8 Motor vehicle and parts dealers 24544058
9 Automobile and other motor vehicle dealers 22462364
10 Automobile dealers 20963805
# ℹ more rows
SELECT year(sales_month) AS year,
sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
ORDER BY 2 DESC;
year | total_sales |
---|---|
2020 | 124040 |
2019 | 119067 |
2018 | 114390 |
2017 | 112174 |
2016 | 105905 |
2015 | 98330 |
2014 | 90424 |
2013 | 83916 |
2012 | 80228 |
2007 | 79696 |
|>
retail_sales mutate(year = year(sales_month)) |>
filter(kind_of_business == 'Used car dealers') |>
group_by(year) |>
summarize(total_sales = sum(sales, na.rm = TRUE)) |>
arrange(desc(total_sales))
# Source: SQL [?? x 2]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
year total_sales
<dbl> <dbl>
1 2020 124040
2 2019 119067
3 2018 114390
4 2017 112174
5 2016 105905
6 2015 98330
7 2014 90424
8 2013 83916
9 2012 80228
10 2007 79696
# ℹ more rows
SELECT year(sales_month) AS year,
sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
HAVING total_sales >= 100000
ORDER BY 2 DESC;
year | total_sales |
---|---|
2020 | 124040 |
2019 | 119067 |
2018 | 114390 |
2017 | 112174 |
2016 | 105905 |
|>
retail_sales mutate(year = year(sales_month)) |>
filter(kind_of_business == 'Used car dealers') |>
group_by(year) |>
summarize(total_sales = sum(sales, na.rm = TRUE)) |>
filter(total_sales >= 100000) |>
arrange(desc(total_sales))
# Source: SQL [5 x 2]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
year total_sales
<dbl> <dbl>
1 2020 124040
2 2019 119067
3 2018 114390
4 2017 112174
5 2016 105905
SELECT kind_of_business, reason_for_null,
count(*) AS num_null
FROM retail_sales
WHERE sales IS NULL
GROUP BY 1, 2
ORDER BY 3 DESC
kind_of_business | reason_for_null | num_null |
---|---|---|
All other home furnishings stores | Not Available | 108 |
Paint and wallpaper stores | Not Available | 108 |
Supermarkets and other grocery (except convenience) stores | Not Available | 108 |
Other clothing stores | Not Available | 108 |
Electronics stores | Not Available | 60 |
Floor covering stores | Supressed | 46 |
Drinking places | Supressed | 34 |
Full service restaurants | Supressed | 11 |
Jewelry stores | Supressed | 3 |
Home furnishings stores | Supressed | 3 |
is.na()
instead ofIS NULL
|>
retail_sales filter(is.na(sales)) |>
group_by(kind_of_business, reason_for_null) |>
summarize(num_null = n(), .groups = "drop") %>%
arrange(desc(num_null))
# Source: SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(num_null)
kind_of_business reason_for_null num_null
<chr> <chr> <dbl>
1 All other home furnishings stores Not Available 108
2 Paint and wallpaper stores Not Available 108
3 Supermarkets and other grocery (except convenience)… Not Available 108
4 Other clothing stores Not Available 108
5 Electronics stores Not Available 60
6 Floor covering stores Supressed 46
7 Drinking places Supressed 34
8 Full service restaurants Supressed 11
9 Jewelry stores Supressed 3
10 Home furnishings stores Supressed 3
# ℹ more rows
2.1.2 Missing data
The SQL in the book generally uses the form x::date
rather than the more standard SQL CAST(x AS DATE)
. In dbplyr
, we would use as.Date(x)
and dbplyr
would translate as CAST(x AS DATE)
. The following code and output demonstrates how dbplyr
translated from dplyr
to SQL.
The table stored in dates_processed
below is equivalent to that created and stored in the database as date_dim
in the code supplied with book. This date_dim
table is only used in #sec-time-series of the book and we will not even use it there (for reasons to be explained).
library(tidyverse)
library(DBI)
library(knitr)
<- dbConnect(duckdb::duckdb(), bigint = "integer") db
2.2 Preparing: Shaping Data
2.2.1 For Which Output: BI, Visualization, Statistics, ML
2.2.2 Pivoting with CASE Statements
2.2.3 Unpivoting with UNION Statements
A user of dplyr
has access to the functions pivot_wider
and pivot_longer
, which make it much easier to “pivot” and “unpivot” tables than using CASE
statements, which could become long and tedious.
To illustrate the dplyr
way of doing things, I will create ctry_pops
to match the data discussed in Chapter 2. First, I create the data set using the tribble()
function from dplyr
.
<-
ctry_pops tribble(
~country, ~year_1980, ~year_1990, ~year_2000, ~year_2010,
"Canada", 24593, 27791, 31100, 34207,
"Mexico", 68347, 84634, 99775, 114061,
"United States", 227225, 249623, 282162, 309326
)
Second, I pivot the local data frame using pivot_longer
.
<-
ctry_pops_long |>
ctry_pops pivot_longer(cols = -country,
names_to = "year",
names_prefix = "year_",
values_ptypes = integer(),
values_to = "population")
|>
ctry_pops_long kable()
country | year | population |
---|---|---|
Canada | 1980 | 24593 |
Canada | 1990 | 27791 |
Canada | 2000 | 31100 |
Canada | 2010 | 34207 |
Mexico | 1980 | 68347 |
Mexico | 1990 | 84634 |
Mexico | 2000 | 99775 |
Mexico | 2010 | 114061 |
United States | 1980 | 227225 |
United States | 1990 | 249623 |
United States | 2000 | 282162 |
United States | 2010 | 309326 |
Next, I copy the data to PostgreSQL, so that it’s a (temporary) table inside the database.1
<- copy_to(db, ctry_pops)
ctry_pops_db ctry_pops_db
# Source: table<ctry_pops> [3 x 5]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
country year_1980 year_1990 year_2000 year_2010
<chr> <dbl> <dbl> <dbl> <dbl>
1 Canada 24593 27791 31100 34207
2 Mexico 68347 84634 99775 114061
3 United States 227225 249623 282162 309326
<-
ctry_pops_db_long |>
ctry_pops_db pivot_longer(cols = -country,
names_to = "year",
names_prefix = "year_",
values_to = "population")
From the output below, we can see that dbplyr
has taken care of the tedious business of constructing several statements for us.
|>
ctry_pops_db_long show_query()
<SQL>
(
(
(
SELECT country, '1980' AS "year", year_1980 AS population
FROM ctry_pops
)
UNION ALL
(
SELECT country, '1990' AS "year", year_1990 AS population
FROM ctry_pops
)
)
UNION ALL
(
SELECT country, '2000' AS "year", year_2000 AS population
FROM ctry_pops
)
)
UNION ALL
(
SELECT country, '2010' AS "year", year_2010 AS population
FROM ctry_pops
)
And from the following, we can see that the result is the same as it was when using dplyr
on a local data frame.
|>
ctry_pops_db_long kable()
country | year | population |
---|---|---|
Canada | 1980 | 24593 |
Mexico | 1980 | 68347 |
United States | 1980 | 227225 |
Canada | 1990 | 27791 |
Mexico | 1990 | 84634 |
United States | 1990 | 249623 |
Canada | 2000 | 31100 |
Mexico | 2000 | 99775 |
United States | 2000 | 282162 |
Canada | 2010 | 34207 |
Mexico | 2010 | 114061 |
United States | 2010 | 309326 |
And we can reverse the pivot_longer()
using pivot_wider()
.
|>
ctry_pops_db_long compute() |>
pivot_wider(names_from = year,
values_from = population,
names_prefix = "year_") |>
kable()
country | year_1980 | year_1990 | year_2000 | year_2010 |
---|---|---|---|---|
Canada | 24593 | 27791 | 31100 | 34207 |
Mexico | 68347 | 84634 | 99775 | 114061 |
United States | 227225 | 249623 | 282162 | 309326 |
2.2.4 pivot and unpivot Functions
“Temporary” here means that it will disappear once we close our connection to the database.↩︎