Appendix D — SQL primer
This brief appendix aims to serve two groups of users. The first group comprises those who have followed the material above and would like a quick introduction to SQL. The second group comprises those who know SQL (say, SAS’s PROC SQL
) would like a quick introduction to the dplyr
-based approach to R that we use in this book.
D.1 What is dplyr
?
Throughout this course, we make use of the R package dplyr
, which is a core part of the Tidyverse. From the Tidyverse website:
dplyr
is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate()
adds new variables that are functions of existing variablesselect()
picks variables based on their namesfilter()
picks cases based on their valuessummarize()
reduces multiple values down to a single summaryarrange()
changes the ordering of the rows
Prior to the advent of the dplyr
in 2014, most users of R would have used base R functions and operators, such as subset
, $
, and [
. However, dplyr
provides a much more consistent framework for manipulating data that is easier to learn, and especially so for users familiar with SQL. This last point makes sense, as SQL provides something like a “grammar of data manipulation” of its own. In fact, each dplyr
verb has an SQL equivalent keyword or concept, as we see below.
D.2 A translation table
dplyr verb (R) |
SQL equivalent |
---|---|
|> |
FROM |
select |
SELECT |
filter |
WHERE |
group_by |
GROUP BY |
arrange |
ORDER BY |
mutate |
used-defined columns |
summarize |
used-defined aggregate columns |
As in earlier chapters, let’s set up a database connection that we can use within R.
D.3 SQL terms SELECT
and FROM
Let’s begin with a basic SQL query.
SELECT date, vwretd, ewretd
FROM crsp.dsi
date | vwretd | ewretd |
---|---|---|
1925-12-31 | NA | NA |
1926-01-02 | 0.005689 | 0.009516 |
1926-01-04 | 0.000706 | 0.005780 |
1926-01-05 | -0.004821 | -0.001927 |
1926-01-06 | -0.000423 | 0.001182 |
1926-01-07 | 0.004988 | 0.008453 |
1926-01-08 | -0.003238 | -0.001689 |
1926-01-09 | 0.002209 | 0.003312 |
1926-01-11 | -0.008540 | -0.009943 |
1926-01-12 | -0.000929 | -0.003623 |
This query
- extracts the data in three columns (
date
,vwretd
,ewretd
) (the first line) - from the table named
crsp.dsi
(the second line)
While the syntax is a bit different, SQL’s SELECT
operates very much like select
from dplyr
.
Translating this into dplyr
code using the pipe operator (|>
), it’s easy to see that the order of presentattion is one of the big differences with the dplyr
approach.
While we first need to set up the table on which to operate, once we’ve done so we can see that the FROM
is implicit in the |>
operator.
# Source: SQL [?? x 3]
# Database: postgres [igow@/tmp:5432/igow]
date vwretd ewretd
<date> <dbl> <dbl>
1 1925-12-31 NA NA
2 1926-01-02 0.00569 0.00952
3 1926-01-04 0.000706 0.00578
4 1926-01-05 -0.00482 -0.00193
5 1926-01-06 -0.000423 0.00118
6 1926-01-07 0.00499 0.00845
7 1926-01-08 -0.00324 -0.00169
8 1926-01-09 0.00221 0.00331
9 1926-01-11 -0.00854 -0.00994
10 1926-01-12 -0.000929 -0.00362
# ℹ more rows
D.4 SQL WHERE
The filter
verb from dplyr
corresponds to WHERE
in SQL. Note that WHERE
goes after the FROM
clause in SQL, though in practice the query optimizer will option execute the filter implied by the WHERE
clause before executing other elements of a query (it would be wasteful to perform calculations on data that are going to be filtered out later on).
SELECT date, vwretd, ewretd
FROM crsp.dsi
WHERE date = '2015-01-02'
date | vwretd | ewretd |
---|---|---|
2015-01-02 | -0.000245 | 0.00147 |
In dplyr
, this query could be written as follows:
# Source: SQL [1 x 3]
# Database: postgres [igow@/tmp:5432/igow]
date vwretd ewretd
<date> <dbl> <dbl>
1 2015-01-02 -0.000245 0.00147
But it could also be written with the filter coming first.
# Source: SQL [1 x 3]
# Database: postgres [igow@/tmp:5432/igow]
date vwretd ewretd
<date> <dbl> <dbl>
1 2015-01-02 -0.000245 0.00147
Note that each of these dplyr
queries is implemented in the same way by the PostgreSQL query optimizer, as can be seen by examining the output from explain()
.
<SQL>
SELECT "date", "vwretd", "ewretd"
FROM "crsp"."dsi"
WHERE ("date" = '2015-01-02')
<PLAN>
QUERY PLAN
1 Index Scan using dsi_date_idx on dsi (cost=0.29..8.30 rows=1 width=20)
2 Index Cond: (date = '2015-01-02'::date)
<SQL>
SELECT "date", "vwretd", "ewretd"
FROM "crsp"."dsi"
WHERE ("date" = '2015-01-02')
<PLAN>
QUERY PLAN
1 Index Scan using dsi_date_idx on dsi (cost=0.29..8.30 rows=1 width=20)
2 Index Cond: (date = '2015-01-02'::date)
We can see that these are in turn viewed by the query optimizer as equivalents of the SQL above.
EXPLAIN
SELECT date, vwretd, ewretd
FROM crsp.dsi
WHERE date = '2015-01-02'
QUERY PLAN |
---|
Index Scan using dsi_date_idx on dsi (cost=0.29..8.30 rows=1 width=20) |
Index Cond: (date = ‘2015-01-02’::date) |
This illustrates a nice feature of SQL engines, namely that you can leave some of the details of the query to the software and focus on the higher-level requirements of your query (i.e., which observations to keep, etc.).
D.5 SQL ORDER BY
A R data frame can be considered as a list of vectors, where each vector has a well-defined order. In contrast, SQL tables should be considered to have any particular order absent an ORDER BY
clause (see here for discussion).
An ORDER BY
clause is placed at the end of an SQL query (reflecting in this case the order of operations) followed by the columns that the query is to be sorted on.
SELECT date, vwretd, ewretd
FROM crsp.dsi
ORDER BY vwretd
date | vwretd | ewretd |
---|---|---|
1987-10-19 | -0.171346 | -0.103897 |
1929-10-29 | -0.119539 | -0.134856 |
2020-03-16 | -0.118168 | -0.107631 |
1929-10-28 | -0.113022 | -0.095202 |
1929-11-06 | -0.097020 | -0.083819 |
2020-03-12 | -0.096841 | -0.100597 |
1933-07-21 | -0.093272 | -0.135468 |
2008-10-15 | -0.089821 | -0.070781 |
2008-12-01 | -0.089411 | -0.078240 |
1933-07-20 | -0.084973 | -0.093141 |
Thus ORDER BY
works very much like arrange
.
# Source: SQL [?? x 3]
# Database: postgres [igow@/tmp:5432/igow]
# Ordered by: vwretd
date vwretd ewretd
<date> <dbl> <dbl>
1 1987-10-19 -0.171 -0.104
2 1929-10-29 -0.120 -0.135
3 2020-03-16 -0.118 -0.108
4 1929-10-28 -0.113 -0.0952
5 1929-11-06 -0.0970 -0.0838
6 2020-03-12 -0.0968 -0.101
7 1933-07-21 -0.0933 -0.135
8 2008-10-15 -0.0898 -0.0708
9 2008-12-01 -0.0894 -0.0782
10 1933-07-20 -0.0850 -0.0931
# ℹ more rows
To reverse the order, use the DESC
keyword after the relevant variable.
SELECT date, vwretd, ewretd
FROM crsp.dsi
ORDER BY vwretd DESC, date
date | vwretd | ewretd |
---|---|---|
1925-12-31 | NA | NA |
1933-03-15 | 0.156838 | 0.203020 |
1929-10-30 | 0.122155 | 0.125714 |
2008-10-13 | 0.114921 | 0.107422 |
1931-10-06 | 0.111278 | 0.088742 |
1932-09-21 | 0.109846 | 0.113008 |
2008-10-28 | 0.095347 | 0.050386 |
2020-03-24 | 0.091556 | 0.082175 |
1939-09-05 | 0.089162 | 0.147448 |
1987-10-21 | 0.086614 | 0.069292 |
Thus DESC
is very much equivalent to desc
in dplyr
apart from the minor syntactical difference that desc()
is written as a function of the relevant variable.
# Source: SQL [?? x 3]
# Database: postgres [igow@/tmp:5432/igow]
# Ordered by: desc(vwretd), date
date vwretd ewretd
<date> <dbl> <dbl>
1 1925-12-31 NA NA
2 1933-03-15 0.157 0.203
3 1929-10-30 0.122 0.126
4 2008-10-13 0.115 0.107
5 1931-10-06 0.111 0.0887
6 1932-09-21 0.110 0.113
7 2008-10-28 0.0953 0.0504
8 2020-03-24 0.0916 0.0822
9 1939-09-05 0.0892 0.147
10 1987-10-21 0.0866 0.0693
# ℹ more rows
It is important to note that some differences in the implementation details between data frames in R and tables in SQL are seen with arrange
. While a data frame in R is fundamentally a list of ordered vectors, SQL tables are best thought of a sets of rows without a natural order. As such, one needs to pay attention to when arrange()
is implemented in a query that depends on it.
To illustrate this issue, suppose we get get stock returns for Apple and order them from largest to smallest returns.1
crsp.dsf <- tbl(pg, Id(schema = "crsp", table = "dsf"))
apple <-
crsp.dsf |>
filter(permno == 14593L) |>
select(permno, date, ret)
apple |>
arrange(desc(ret))
# Source: SQL [?? x 3]
# Database: postgres [igow@/tmp:5432/igow]
# Ordered by: desc(ret)
permno date ret
<int> <date> <dbl>
1 14593 1980-12-12 NA
2 14593 1997-08-06 0.332
3 14593 1998-01-02 0.238
4 14593 1996-07-18 0.237
5 14593 1998-01-06 0.193
6 14593 1993-10-15 0.189
7 14593 1987-10-21 0.181
8 14593 1987-10-29 0.179
9 14593 1997-07-11 0.146
10 14593 1999-10-14 0.143
# ℹ more rows
The following code illustrates that the order created above is lost when we merged with crsp.dsi
. Fortunately, dplyr
provides a warning when we use arrange()
in such situations.
apple |>
inner_join(crsp.dsi, by = "date") |>
select(permno, date, ret, vwretd)
# Source: SQL [?? x 4]
# Database: postgres [igow@/tmp:5432/igow]
permno date ret vwretd
<int> <date> <dbl> <dbl>
1 14593 1980-12-12 NA 0.0149
2 14593 1980-12-15 -0.0521 0.00160
3 14593 1980-12-16 -0.0732 0.00748
4 14593 1980-12-17 0.0247 0.0162
5 14593 1980-12-18 0.0289 0.00413
6 14593 1980-12-19 0.0609 0.00712
7 14593 1980-12-22 0.0486 0.0138
8 14593 1980-12-23 0.0421 -0.00291
9 14593 1980-12-24 0.0525 0.00405
10 14593 1980-12-26 0.0921 0.00489
# ℹ more rows
D.6 SQL approach to mutate()
Creating new variables in SQL is quite straightforward. In addition to basic mathematical operators, such as +
and *
, PostgreSQL has an extensive array of functions available for use in queries.
The names of calculated variables can be specified using AS
, as can be seen in the following query.
SELECT date,
- ewretd AS ret_diff,
vwretd ln(1 + vwretd) AS log_vwretd,
'year', date) AS year
date_part(FROM crsp.dsi
date | ret_diff | log_vwretd | year |
---|---|---|---|
1925-12-31 | NA | NA | 1925 |
1926-01-02 | -0.003827 | 0.0056729 | 1926 |
1926-01-04 | -0.005074 | 0.0007058 | 1926 |
1926-01-05 | -0.002894 | -0.0048327 | 1926 |
1926-01-06 | -0.001605 | -0.0004231 | 1926 |
1926-01-07 | -0.003465 | 0.0049756 | 1926 |
1926-01-08 | -0.001549 | -0.0032433 | 1926 |
1926-01-09 | -0.001103 | 0.0022066 | 1926 |
1926-01-11 | 0.001403 | -0.0085767 | 1926 |
1926-01-12 | 0.002694 | -0.0009294 | 1926 |
The translation of this query into dplyr
requires the use of the mutate
verb, but this translation is straightforward, as can be seen below.
crsp.dsi |>
mutate(ret_diff = vwretd - ewretd,
log_vwretd = ln(1 + vwretd),
year = date_part('year', date)) |>
select(date, ret_diff, log_vwretd, year)
# Source: SQL [?? x 4]
# Database: postgres [igow@/tmp:5432/igow]
date ret_diff log_vwretd year
<date> <dbl> <dbl> <dbl>
1 1925-12-31 NA NA 1925
2 1926-01-02 -0.00383 0.00567 1926
3 1926-01-04 -0.00507 0.000706 1926
4 1926-01-05 -0.00289 -0.00483 1926
5 1926-01-06 -0.00160 -0.000423 1926
6 1926-01-07 -0.00346 0.00498 1926
7 1926-01-08 -0.00155 -0.00324 1926
8 1926-01-09 -0.00110 0.00221 1926
9 1926-01-11 0.00140 -0.00858 1926
10 1926-01-12 0.00269 -0.000929 1926
# ℹ more rows
Note that in the code above, we are using the function ln()
to get the natural logarithm, rather than the R function log()
. We will discuss this further below, but for now we just note that log()
in PostgreSQL returns the base-10 logarithm and—because the calculations are being done by PostgreSQL, not R—we need to use the appropriate PostgreSQL function when using dplyr
in this way.2 Similarly, we are using the date_part()
function from PostgreSQL, but in this case we could have used year
, which is translated by dbplyr
to provide the nearest equivalent of the year
function found in the lubridate
package.
Note that the dplyr
code is slightly more verbose than the SQL because the SQL equivalent of mutate
is implicit and part of the SELECT
portion of the statement.
D.7 SQL GROUP BY
and aggregates
One use for group_by
in R is to create summary statistics and the like for each of the groups implied by the group_by
variables. In SQL, GROUP BY
is used with aggregate functions, which are equivalent to functions used with summarize
.
Suppose we want to get the maximum and minimum returns, as well as a measure of the volatility of the returns, for each year. In SQL, we could do the following:
SELECT date_part('year', date) AS year,
max(vwretd) AS max_ret,
min(vwretd) AS min_ret,
stddev(ln(1 + vwretd)) AS sd_ret
FROM crsp.dsi
GROUP BY year
ORDER BY sd_ret DESC
year | max_ret | min_ret | sd_ret |
---|---|---|---|
1925 | NA | NA | NA |
1932 | 0.109846 | -0.073118 | 0.0274218 |
1933 | 0.156838 | -0.093272 | 0.0256663 |
2008 | 0.114921 | -0.089821 | 0.0253980 |
2020 | 0.091556 | -0.118168 | 0.0213797 |
1929 | 0.122155 | -0.119539 | 0.0207290 |
1931 | 0.111278 | -0.068048 | 0.0205423 |
1987 | 0.086614 | -0.171346 | 0.0182097 |
2009 | 0.069114 | -0.055329 | 0.0176703 |
1938 | 0.057302 | -0.051672 | 0.0164642 |
And the dplyr
equivalent would look like this:
crsp.dsi |>
mutate(year = date_part('year', date)) |>
group_by(year) |>
summarize(max_ret = max(vwretd, na.rm = TRUE),
min_ret = min(vwretd, na.rm = TRUE),
sd_ret = stddev(ln(1 + vwretd))) |>
arrange(desc(sd_ret))
# Source: SQL [?? x 4]
# Database: postgres [igow@/tmp:5432/igow]
# Ordered by: desc(sd_ret)
year max_ret min_ret sd_ret
<dbl> <dbl> <dbl> <dbl>
1 1925 NA NA NA
2 1932 0.110 -0.0731 0.0274
3 1933 0.157 -0.0933 0.0257
4 2008 0.115 -0.0898 0.0254
5 2020 0.0916 -0.118 0.0214
6 1929 0.122 -0.120 0.0207
7 1931 0.111 -0.0680 0.0205
8 1987 0.0866 -0.171 0.0182
9 2009 0.0691 -0.0553 0.0177
10 1938 0.0573 -0.0517 0.0165
# ℹ more rows
Note that the na.rm = TRUE
arguments could be omitted, as SQL always omits these values. The inclusion of this argument serves only as a reminder to the user of the code that these values are omitted.
Note that the
NA
value from Apple’s first day of trading is viewed as being the largest value due to the way that SQL collatesNULL
values.↩︎See the PostgreSQL documentation for mathematical functions.↩︎