Appendix B — SQL primer
This brief appendix aims to serve two groups of users. The first group comprises those who have followed the material in the main part of the book and would like a quick introduction to SQL. The second group comprises those who know SQL (say, SAS’s PROC SQL
) would like a quick introduction to the dplyr
-based approach to R that we use in this book.
B.1 What are SQL and dplyr
?
SQL is a specialized language for manipulating and retrieving tabular data used by almost all modern database systems.
The R package dplyr
is a core part of the Tidyverse and perhaps the package we use the most in this book. From the Tidyverse website:
dplyr
is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate()
adds new variables that are functions of existing variablesselect()
picks variables based on their namesfilter()
picks cases based on their valuessummarize()
reduces multiple values down to a single summaryarrange()
changes the ordering of the rows
dplyr
verbs
dplyr verb (R) |
SQL equivalent |
---|---|
|> |
FROM |
select() |
SELECT |
filter() |
WHERE |
group_by() |
GROUP BY |
arrange() |
ORDER BY |
mutate() |
used-defined columns |
summarize() |
used-defined aggregate columns |
Prior to the advent of the dplyr
in 2014, most users of R would have used base R functions and operators, such as subset()
, $
, and [
. However, dplyr
provides a much more consistent framework for manipulating data that is easier to learn, especially for users familiar with SQL. This is unsurprising given that SQL provides something like a “grammar of data manipulation” of its own. In fact, each dplyr
verb has an SQL equivalent keyword or concept, as seen in Table B.1, which provides a translation table of sorts between dplyr
and SQL.
In this appendix, we use the following packages:
As in earlier chapters, we set up a database connection that we can use within R. We also create remote data frames for each of the two tables from WRDS that we use in this appendix: crsp.dsf
and crsp.dsi
.
In this appendix, we use the following packages:
As in earlier chapters, we set up a database connection that we can use within R. We also create remote data frames for each of the two tables from WRDS that we use in this appendix: crsp.dsf
and crsp.dsi
. A wrinkle with our parquet-based approach to DuckDB is that to access the tables from SQL, we need to create named tables that refer to the underlying parquet files. The create_view()
function does this. Note that we use CREATE VIEW
rather than CREATE TABLE
so that the data are not read into memory (this is costly with crsp.dsf
).
create_view <- function(conn, table, schema = "",
data_dir = Sys.getenv("DATA_DIR")) {
dbExecute(conn, paste0("CREATE SCHEMA IF NOT EXISTS ", schema))
file_path <- file.path(data_dir, schema, paste0(table, ".parquet"))
df_sql <- paste0("CREATE VIEW ", schema, ".", table, " AS ",
"SELECT * FROM read_parquet('", file_path,
"')")
DBI::dbExecute(conn, dplyr::sql(df_sql))
}
db <- dbConnect(duckdb::duckdb())
crsp.dsi <- create_view(db, table = "dsi", schema = "crsp")
crsp.dsf <- create_view(db, table = "dsf", schema = "crsp")
B.2 SQL terms SELECT
and FROM
Let’s begin with a basic SQL query. (Note that, throughout this appendix, we only display the first five records when there are more than five records returned by a query. Note that a record is SQL terminology for what we might call a row in the context of an R data frame.)
SELECT date, vwretd, ewretd
FROM crsp.dsi
date | vwretd | ewretd |
---|---|---|
1925-12-31 | ||
1926-01-02 | 0.005689 | 0.009516 |
1926-01-04 | 0.000706 | 0.005780 |
1926-01-05 | -0.004821 | -0.001927 |
1926-01-06 | -0.000423 | 0.001182 |
The query above
- extracts the data in three columns (
date
,vwretd
,ewretd
) (the first line) - from the table named
crsp.dsi
(the second line)
While the syntax differs, SQL’s SELECT
operates very much like select()
from dplyr
.
Translating this into dplyr
code using the pipe operator (|>
), it’s easy to see that the order of presentation is one of the big differences between SQL and dplyr
.
To use dplyr
we first need to set up the table on which to operate; once we’ve done so we can see that SQL’s FROM
is implicit in the |>
operator.
crsp.dsi |>
select(date, vwretd, ewretd)
date | vwretd | ewretd |
---|---|---|
1925-12-31 | ||
1926-01-02 | 0.005689 | 0.009516 |
1926-01-04 | 0.000706 | 0.005780 |
1926-01-05 | -0.004821 | -0.001927 |
1926-01-06 | -0.000423 | 0.001182 |
B.3 SQL WHERE
The filter()
verb from dplyr
corresponds to WHERE
in SQL. Note that WHERE
goes after the FROM
clause in SQL, though in practice the query optimizer will execute the filter implied by the WHERE
clause before executing other elements of a query (it would be wasteful to perform calculations on data that are going to be filtered out later on).
SELECT date, vwretd, ewretd
FROM crsp.dsi
WHERE date = '2015-01-02'
date | vwretd | ewretd |
---|---|---|
2015-01-02 | -0.000242 | 0.001470 |
In dplyr
, this query could be written as follows:
date | vwretd | ewretd |
---|---|---|
2015-01-02 | -0.000242 | 0.001470 |
But it could also be written with the filter coming first.
date | vwretd | ewretd |
---|---|---|
2015-01-02 | -0.000242 | 0.001470 |
Note that each of these dplyr
queries is implemented in the same way by the PostgreSQL query optimizer, as can be seen by examining the output from explain()
.
<SQL>
SELECT "date", "vwretd", "ewretd"
FROM "crsp"."dsi"
WHERE ("date" = '2015-01-02')
<PLAN>
QUERY PLAN
1 Index Scan using dsi_date_idx on dsi (cost=0.29..8.30 rows=1 width=20)
2 Index Cond: (date = '2015-01-02'::date)
<SQL>
SELECT "date", "vwretd", "ewretd"
FROM "crsp"."dsi"
WHERE ("date" = '2015-01-02')
<PLAN>
QUERY PLAN
1 Index Scan using dsi_date_idx on dsi (cost=0.29..8.30 rows=1 width=20)
2 Index Cond: (date = '2015-01-02'::date)
By adding EXPLAIN
to the SQL above, we can see also that the dplyr
queries are viewed by the query optimizer as equivalent to that SQL.
EXPLAIN
SELECT date, vwretd, ewretd
FROM crsp.dsi
WHERE date = '2015-01-02'
QUERY PLAN
1 Index Scan using dsi_date_idx on dsi (cost=0.29..8.30 rows=1 width=20)
2 Index Cond: (date = '2015-01-02'::date)
This illustrates a nice feature of SQL engines, namely that you can leave some of the details of the query to the software and focus on the higher-level requirements of your query (i.e., which observations to keep, etc.).
B.4 SQL ORDER BY
An R data frame can be considered a list of vectors, where each vector has a well-defined order. In contrast, SQL tables should be not considered to have any particular order when there is no ORDER BY
clause.1
An ORDER BY
clause is placed at the end of an SQL query (reflecting in this case the order of operations) followed by the columns that the query is to be sorted on.
SELECT date, vwretd, ewretd
FROM crsp.dsi
ORDER BY vwretd
date | vwretd | ewretd |
---|---|---|
1987-10-19 | -0.171346 | -0.103897 |
1929-10-29 | -0.119539 | -0.134856 |
2020-03-16 | -0.118168 | -0.107631 |
1929-10-28 | -0.113022 | -0.095202 |
1929-11-06 | -0.097020 | -0.083819 |
Thus, ORDER BY
works very much like arrange()
.
date | vwretd | ewretd |
---|---|---|
1987-10-19 | -0.171346 | -0.103897 |
1929-10-29 | -0.119539 | -0.134856 |
2020-03-16 | -0.118168 | -0.107631 |
1929-10-28 | -0.113022 | -0.095202 |
1929-11-06 | -0.097020 | -0.083819 |
To reverse the order, use the DESC
keyword after the relevant variable.
SELECT date, vwretd, ewretd
FROM crsp.dsi
ORDER BY vwretd DESC, date
date | vwretd | ewretd |
---|---|---|
1925-12-31 | ||
1933-03-15 | 0.156838 | 0.203020 |
1929-10-30 | 0.122155 | 0.125714 |
2008-10-13 | 0.114918 | 0.107422 |
1931-10-06 | 0.111278 | 0.088742 |
Thus, DESC
is very much equivalent to desc()
in dplyr
apart from the minor syntactical difference that desc()
is written as a function of the relevant variable.
date | vwretd | ewretd |
---|---|---|
1925-12-31 | ||
1933-03-15 | 0.156838 | 0.203020 |
1929-10-30 | 0.122155 | 0.125714 |
2008-10-13 | 0.114918 | 0.107422 |
1931-10-06 | 0.111278 | 0.088742 |
It is important to note that some differences in the implementation details between data frames in R and tables in SQL are seen with arrange()
. While a data frame in R is fundamentally a list of ordered vectors, SQL tables are best thought of as sets of rows without a natural order. As such, one needs to pay attention to when arrange()
is implemented in a query that depends on it.
To illustrate this issue, suppose we get stock returns for Apple and order them from largest to smallest returns.2
apple <-
crsp.dsf |>
filter(permno == 14593L) |>
select(permno, date, ret)
apple |>
arrange(desc(ret))
permno | date | ret |
---|---|---|
14593 | 1980-12-12 | |
14593 | 1997-08-06 | 0.332278 |
14593 | 1998-01-02 | 0.238095 |
14593 | 1996-07-18 | 0.237037 |
14593 | 1998-01-06 | 0.192913 |
The following code illustrates that the order created above is lost when we merge with crsp.dsi
. If you are using window queries with SQL data sources, use window_order()
to order data within windows.
apple |>
inner_join(crsp.dsi, by = "date") |>
select(permno, date, ret, vwretd)
permno | date | ret | vwretd |
---|---|---|---|
14593 | 1980-12-12 | 0.014859 | |
14593 | 1980-12-15 | -0.052061 | 0.001605 |
14593 | 1980-12-16 | -0.073227 | 0.007485 |
14593 | 1980-12-17 | 0.024691 | 0.016168 |
14593 | 1980-12-18 | 0.028916 | 0.004134 |
B.5 SQL approach to mutate()
Creating new variables in SQL is quite straightforward. In addition to basic mathematical operators, such as +
and *
, PostgreSQL and DuckDB have an extensive array of functions available for use in queries.3
The names of calculated variables can be specified using AS
, as in the following query:
SELECT date,
- ewretd AS ret_diff,
vwretd ln(1 + vwretd) AS log_vwretd,
'year', date) AS year
date_part(FROM crsp.dsi
date | ret_diff | log_vwretd | year |
---|---|---|---|
1925-12-31 | 1925 | ||
1926-01-02 | -0.003827 | 0.005673 | 1926 |
1926-01-04 | -0.005074 | 0.000706 | 1926 |
1926-01-05 | -0.002894 | -0.004833 | 1926 |
1926-01-06 | -0.001605 | -0.000423 | 1926 |
The translation of this query into dplyr
requires the use of the mutate()
verb, but this translation is straightforward, as can be seen below:
crsp.dsi |>
mutate(ret_diff = vwretd - ewretd,
log_vwretd = log(1 + vwretd),
year = year(date)) |>
select(date, ret_diff, log_vwretd, year)
date | ret_diff | log_vwretd | year |
---|---|---|---|
1925-12-31 | 1925 | ||
1926-01-02 | -0.003827 | 0.005673 | 1926 |
1926-01-04 | -0.005074 | 0.000706 | 1926 |
1926-01-05 | -0.002894 | -0.004833 | 1926 |
1926-01-06 | -0.001605 | -0.000423 | 1926 |
We see that the dplyr
code is slightly more verbose than the SQL because the SQL equivalent of mutate()
is an implicit part of the SELECT
portion of the statement.
Note that we use the ln()
in SQL, but use log()
in the dplyr
code. In SQL, log()
returns the base-10 logarithm, but dplyr
will translate common functions such as log()
to their SQL equivalents.4 An alternative approach would be to use ln()
with dplyr
. Because ln()
is not a recognized R function, ln()
is passed to SQL unmodified. Similarly, year()
is translated by dbplyr
to its nearest equivalent in SQL (for year()
something using SQL’s EXTRACT
that is equivalent to date_part()
). We can use show_query()
to see the generated SQL.
crsp.dsi |>
mutate(ret_diff = vwretd - ewretd,
log_vwretd = log(1 + vwretd),
year = year(date)) |>
select(date, ret_diff, log_vwretd, year) |>
show_query()
<SQL>
SELECT
"date",
"vwretd" - "ewretd" AS "ret_diff",
LN(1.0 + "vwretd") AS "log_vwretd",
EXTRACT(year FROM "date") AS "year"
FROM "crsp"."dsi"
B.6 SQL GROUP BY
and aggregates
One use for group_by()
in R is to create summary statistics and the like for each of the groups implied by the group_by()
variables. In SQL, GROUP BY
is used with aggregate functions, which are like the functions used with summarize()
.
Suppose we want to get the maximum and minimum returns, as well as a measure of the volatility of the returns, for each year. In SQL, we could do the following:
SELECT date_part('year', date) AS year,
max(vwretd) AS max_ret,
min(vwretd) AS min_ret,
stddev(ln(1 + vwretd)) AS sd_ret
FROM crsp.dsi
GROUP BY year
ORDER BY sd_ret DESC
year | max_ret | min_ret | sd_ret |
---|---|---|---|
1925 | |||
1932 | 0.109846 | -0.073118 | 0.027422 |
1933 | 0.156838 | -0.093272 | 0.025666 |
2008 | 0.114918 | -0.089818 | 0.025397 |
2020 | 0.091556 | -0.118168 | 0.021380 |
And the dplyr
equivalent would look like this:
crsp.dsi |>
mutate(year = year(date)) |>
group_by(year) |>
summarize(max_ret = max(vwretd, na.rm = TRUE),
min_ret = min(vwretd, na.rm = TRUE),
sd_ret = sd(log(1 + vwretd), na.rm = TRUE)) |>
arrange(desc(sd_ret))
year | max_ret | min_ret | sd_ret |
---|---|---|---|
1925 | |||
1932 | 0.109846 | -0.073118 | 0.027422 |
1933 | 0.156838 | -0.093272 | 0.025666 |
2008 | 0.114918 | -0.089818 | 0.025397 |
2020 | 0.091556 | -0.118168 | 0.021380 |
Note that na.rm = TRUE
could be omitted, as SQL always omits these values. The inclusion of this argument serves only as a reminder to the user of the code that NA
values are omitted.5
See https://stackoverflow.com/questions/20050341 on this point.↩︎
Note that the
NA
value from Apple’s first day of trading is viewed as being the largest value due to the way that SQL collatesNULL
values.↩︎See https://www.postgresql.org/docs/current/functions.html.↩︎
See https://www.postgresql.org/docs/current/functions-math.html.↩︎
Setting
na.rm = TRUE
also suppresses warnings fromdplyr
regarding the handling of missing values in SQL.↩︎