library(DBI)
library(tidyverse)2 Using dplyr with databases
Chapter 2 of Tanimura (2021) provides a good foundation discussion of issues related to preparing data for analysis. While the discussion is couched in terms of SQL, in reality the issues are not specific to SQL or databases. For this reason, I recommend that you read the chapter.
In this chapter, I instead show how dplyr, a core Tidyverse package, can be used with SQL databases.
2.1 Introduction to dplyr
Our focus on using R and dplyr to access SQL databases means that we only need to learn a subset of the functionality of R.
While R offers a number of data structures—including vectors, lists, and matrices—we will focus primarily on data frames, which are R’s equivalents of tables in SQL databases. R also offers rich functionality for statistical testing and modelling, but we will make little use of this here.
2.1.1 Introduction to SQL
db <- dbConnect(duckdb::duckdb())
retail_sales <-
tbl(db, "read_csv_auto('data/us_retail_sales.csv')") |>
compute(name = "retail_sales")SELECT sales_month, kind_of_business, sales
FROM retail_salesretail_sales |>
select(sales_month, kind_of_business, sales)# Source: SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
sales_month kind_of_business sales
<date> <chr> <dbl>
1 1992-01-01 Motor vehicle and parts dealers 29811
2 1992-01-01 Automobile dealers 25800
3 1992-01-01 Automobile and other motor vehicle dealers 26788
4 1992-01-01 New car dealers 24056
5 1992-01-01 Used car dealers 1744
6 1992-01-01 Automotive parts, acc., and tire stores 3023
7 1992-01-01 Furniture and home furnishings stores 3846
8 1992-01-01 Furniture, home furn, electronics, and appliance stores 7503
9 1992-01-01 Furniture stores 2392
10 1992-01-01 Home furnishings stores 1454
# ℹ more rows
SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE sales_month >= '2002-01-01'retail_sales |>
select(sales_month, kind_of_business, sales) |>
filter(sales_month >= '2002-01-01')# Source: SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
sales_month kind_of_business sales
<date> <chr> <dbl>
1 2002-01-01 Motor vehicle and parts dealers 60565
2 2002-01-01 Automobile dealers 52948
3 2002-01-01 Automobile and other motor vehicle dealers 55799
4 2002-01-01 New car dealers 48146
5 2002-01-01 Used car dealers 4802
6 2002-01-01 Automotive parts, acc., and tire stores 4766
7 2002-01-01 Furniture and home furnishings stores 7149
8 2002-01-01 Furniture, home furn, electronics, and appliance stores 14342
9 2002-01-01 Furniture stores 4097
10 2002-01-01 Home furnishings stores 3052
# ℹ more rows
SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE - Use
==instead of=. - Can use
"or'for quotes.
retail_sales |>
select(sales_month, kind_of_business, sales) |>
filter(kind_of_business == 'Used car dealers')# Source: SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
sales_month kind_of_business sales
<date> <chr> <dbl>
1 1992-01-01 Used car dealers 1744
2 1992-02-01 Used car dealers 1990
3 1992-03-01 Used car dealers 2177
4 1992-04-01 Used car dealers 2601
5 1992-05-01 Used car dealers 2171
6 1992-06-01 Used car dealers 2207
7 1992-07-01 Used car dealers 2251
8 1992-08-01 Used car dealers 2087
9 1992-09-01 Used car dealers 2016
10 1992-10-01 Used car dealers 2149
# ℹ more rows
SELECT kind_of_business, sum(sales) AS total_sales
FROM retail_sales
GROUP BY kind_of_business
ORDER BY total_sales DESC;| kind_of_business | total_sales |
|---|---|
| Retail and food services sales, total | 118053993 |
| Retail sales and food services excl gasoline stations | 107701613 |
| Retail sales, total | 105580364 |
| Retail sales and food services excl motor vehicle and parts | 93509935 |
| Retail sales and food services excl motor vehicle and parts and gasoline stations | 83157555 |
| Retail sales, total (excl. motor vehicle and parts dealers) | 81036306 |
| GAFO(1) | 29041144 |
| Motor vehicle and parts dealers | 24544058 |
| Automobile and other motor vehicle dealers | 22462364 |
| Automobile dealers | 20963805 |
- use
na.rm = TRUE - Put new variable name to the left of aggregate function
- Use
arrange()instead ofORDER BY - Use
desc()instead ofDESC - No need to separately select
GROUP BYvariables. - No exact equivalent to reference by order
retail_sales |>
group_by(kind_of_business) |>
summarize(total_sales = sum(sales, na.rm = TRUE)) |>
arrange(desc(total_sales))# Source: SQL [?? x 2]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
kind_of_business total_sales
<chr> <dbl>
1 Retail and food services sales, total 118053993
2 Retail sales and food services excl gasoline stations 107701613
3 Retail sales, total 105580364
4 Retail sales and food services excl motor vehicle and parts 93509935
5 Retail sales and food services excl motor vehicle and parts and … 83157555
6 Retail sales, total (excl. motor vehicle and parts dealers) 81036306
7 GAFO(1) 29041144
8 Motor vehicle and parts dealers 24544058
9 Automobile and other motor vehicle dealers 22462364
10 Automobile dealers 20963805
# ℹ more rows
retail_sales |>
group_by(kind_of_business) |>
summarize(total_sales = sum(sales, na.rm = TRUE)) |>
arrange(desc(total_sales))# Source: SQL [?? x 2]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
kind_of_business total_sales
<chr> <dbl>
1 Retail and food services sales, total 118053993
2 Retail sales and food services excl gasoline stations 107701613
3 Retail sales, total 105580364
4 Retail sales and food services excl motor vehicle and parts 93509935
5 Retail sales and food services excl motor vehicle and parts and … 83157555
6 Retail sales, total (excl. motor vehicle and parts dealers) 81036306
7 GAFO(1) 29041144
8 Motor vehicle and parts dealers 24544058
9 Automobile and other motor vehicle dealers 22462364
10 Automobile dealers 20963805
# ℹ more rows
SELECT year(sales_month) AS year,
sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
ORDER BY 2 DESC;| year | total_sales |
|---|---|
| 2020 | 124040 |
| 2019 | 119067 |
| 2018 | 114390 |
| 2017 | 112174 |
| 2016 | 105905 |
| 2015 | 98330 |
| 2014 | 90424 |
| 2013 | 83916 |
| 2012 | 80228 |
| 2007 | 79696 |
retail_sales |>
mutate(year = year(sales_month)) |>
filter(kind_of_business == 'Used car dealers') |>
group_by(year) |>
summarize(total_sales = sum(sales, na.rm = TRUE)) |>
arrange(desc(total_sales))# Source: SQL [?? x 2]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
year total_sales
<dbl> <dbl>
1 2020 124040
2 2019 119067
3 2018 114390
4 2017 112174
5 2016 105905
6 2015 98330
7 2014 90424
8 2013 83916
9 2012 80228
10 2007 79696
# ℹ more rows
SELECT year(sales_month) AS year,
sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
HAVING total_sales >= 100000
ORDER BY 2 DESC;| year | total_sales |
|---|---|
| 2020 | 124040 |
| 2019 | 119067 |
| 2018 | 114390 |
| 2017 | 112174 |
| 2016 | 105905 |
retail_sales |>
mutate(year = year(sales_month)) |>
filter(kind_of_business == 'Used car dealers') |>
group_by(year) |>
summarize(total_sales = sum(sales, na.rm = TRUE)) |>
filter(total_sales >= 100000) |>
arrange(desc(total_sales)) # Source: SQL [5 x 2]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(total_sales)
year total_sales
<dbl> <dbl>
1 2020 124040
2 2019 119067
3 2018 114390
4 2017 112174
5 2016 105905
SELECT kind_of_business, reason_for_null,
count(*) AS num_null
FROM retail_sales
WHERE sales IS NULL
GROUP BY 1, 2
ORDER BY 3 DESC| kind_of_business | reason_for_null | num_null |
|---|---|---|
| All other home furnishings stores | Not Available | 108 |
| Paint and wallpaper stores | Not Available | 108 |
| Supermarkets and other grocery (except convenience) stores | Not Available | 108 |
| Other clothing stores | Not Available | 108 |
| Electronics stores | Not Available | 60 |
| Floor covering stores | Supressed | 46 |
| Drinking places | Supressed | 34 |
| Full service restaurants | Supressed | 11 |
| Jewelry stores | Supressed | 3 |
| Home furnishings stores | Supressed | 3 |
is.na()instead ofIS NULL
retail_sales |>
filter(is.na(sales)) |>
group_by(kind_of_business, reason_for_null) |>
summarize(num_null = n(), .groups = "drop") %>%
arrange(desc(num_null))# Source: SQL [?? x 3]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
# Ordered by: desc(num_null)
kind_of_business reason_for_null num_null
<chr> <chr> <dbl>
1 All other home furnishings stores Not Available 108
2 Paint and wallpaper stores Not Available 108
3 Supermarkets and other grocery (except convenience)… Not Available 108
4 Other clothing stores Not Available 108
5 Electronics stores Not Available 60
6 Floor covering stores Supressed 46
7 Drinking places Supressed 34
8 Full service restaurants Supressed 11
9 Jewelry stores Supressed 3
10 Home furnishings stores Supressed 3
# ℹ more rows
2.1.2 Missing data
The SQL in the book generally uses the form x::date rather than the more standard SQL CAST(x AS DATE). In dbplyr, we would use as.Date(x) and dbplyr would translate as CAST(x AS DATE). The following code and output demonstrates how dbplyr translated from dplyr to SQL.
The table stored in dates_processed below is equivalent to that created and stored in the database as date_dim in the code supplied with book. This date_dim table is only used in #sec-time-series of the book and we will not even use it there (for reasons to be explained).
library(tidyverse)
library(DBI)
library(knitr)db <- dbConnect(duckdb::duckdb(), bigint = "integer")2.2 Preparing: Shaping Data
2.2.1 For Which Output: BI, Visualization, Statistics, ML
2.2.2 Pivoting with CASE Statements
2.2.3 Unpivoting with UNION Statements
A user of dplyr has access to the functions pivot_wider and pivot_longer, which make it much easier to “pivot” and “unpivot” tables than using CASE statements, which could become long and tedious.
To illustrate the dplyr way of doing things, I will create ctry_pops to match the data discussed in Chapter 2. First, I create the data set using the tribble() function from dplyr.
ctry_pops <-
tribble(
~country, ~year_1980, ~year_1990, ~year_2000, ~year_2010,
"Canada", 24593, 27791, 31100, 34207,
"Mexico", 68347, 84634, 99775, 114061,
"United States", 227225, 249623, 282162, 309326
)Second, I pivot the local data frame using pivot_longer.
ctry_pops_long <-
ctry_pops |>
pivot_longer(cols = -country,
names_to = "year",
names_prefix = "year_",
values_ptypes = integer(),
values_to = "population")
ctry_pops_long |>
kable()| country | year | population |
|---|---|---|
| Canada | 1980 | 24593 |
| Canada | 1990 | 27791 |
| Canada | 2000 | 31100 |
| Canada | 2010 | 34207 |
| Mexico | 1980 | 68347 |
| Mexico | 1990 | 84634 |
| Mexico | 2000 | 99775 |
| Mexico | 2010 | 114061 |
| United States | 1980 | 227225 |
| United States | 1990 | 249623 |
| United States | 2000 | 282162 |
| United States | 2010 | 309326 |
Next, I copy the data to PostgreSQL, so that it’s a (temporary) table inside the database.1
ctry_pops_db <- copy_to(db, ctry_pops)
ctry_pops_db# Source: table<ctry_pops> [3 x 5]
# Database: DuckDB 0.7.1 [igow@Darwin 22.5.0:R 4.2.3/:memory:]
country year_1980 year_1990 year_2000 year_2010
<chr> <dbl> <dbl> <dbl> <dbl>
1 Canada 24593 27791 31100 34207
2 Mexico 68347 84634 99775 114061
3 United States 227225 249623 282162 309326
ctry_pops_db_long <-
ctry_pops_db |>
pivot_longer(cols = -country,
names_to = "year",
names_prefix = "year_",
values_to = "population") From the output below, we can see that dbplyr has taken care of the tedious business of constructing several statements for us.
ctry_pops_db_long |>
show_query()<SQL>
(
(
(
SELECT country, '1980' AS "year", year_1980 AS population
FROM ctry_pops
)
UNION ALL
(
SELECT country, '1990' AS "year", year_1990 AS population
FROM ctry_pops
)
)
UNION ALL
(
SELECT country, '2000' AS "year", year_2000 AS population
FROM ctry_pops
)
)
UNION ALL
(
SELECT country, '2010' AS "year", year_2010 AS population
FROM ctry_pops
)
And from the following, we can see that the result is the same as it was when using dplyr on a local data frame.
ctry_pops_db_long |>
kable()| country | year | population |
|---|---|---|
| Canada | 1980 | 24593 |
| Mexico | 1980 | 68347 |
| United States | 1980 | 227225 |
| Canada | 1990 | 27791 |
| Mexico | 1990 | 84634 |
| United States | 1990 | 249623 |
| Canada | 2000 | 31100 |
| Mexico | 2000 | 99775 |
| United States | 2000 | 282162 |
| Canada | 2010 | 34207 |
| Mexico | 2010 | 114061 |
| United States | 2010 | 309326 |
And we can reverse the pivot_longer() using pivot_wider().
ctry_pops_db_long |>
compute() |>
pivot_wider(names_from = year,
values_from = population,
names_prefix = "year_") |>
kable()| country | year_1980 | year_1990 | year_2000 | year_2010 |
|---|---|---|---|---|
| Canada | 24593 | 27791 | 31100 | 34207 |
| Mexico | 68347 | 84634 | 99775 | 114061 |
| United States | 227225 | 249623 | 282162 | 309326 |
2.2.4 pivot and unpivot Functions
“Temporary” here means that it will disappear once we close our connection to the database.↩︎