Appendix C — SQL primer

This brief appendix aims to serve two groups of users. The first group comprises those who have followed the material in the main part of the book and would like a quick introduction to SQL. The second group comprises those who know SQL (say, SAS’s PROC SQL) would like a quick introduction to Polars.

C.1 Comparing SQL and Polars

SQL is a specialized language for manipulating and retrieving tabular data used by almost all modern database systems.

Polars is a data frame library that provide a set of methods that help you solve the most common data manipulation challenges:¹

.with_columns() adds new variables that are functions of existing variables
.select() picks variables based on their names
.filter() picks cases based on their values
.agg() reduces multiple values down to a single summary
.sort() changes the ordering of the rows

As we will see below, there is overlap between .with_columns() and .select() in both Polars and the SQL equivalents.

Table C.1: SQL translations of key Polars methods

Polars method (R)	SQL equivalent
`.select()`	`SELECT`
`.filter()`	`WHERE`
`.group_by()`	`GROUP BY`
`.sort()`	`ORDER BY`
`.with_columns()`	used-defined columns
`.agg()`	used-defined aggregate columns

Tip

Portions of the code in this appendix use packages in addition to those installed when you set up your computer to use the code found in this book using the guide on the support page. Instructions will be provided in the sections that need additional packages.

To understand equivalent operations in Polars and SQL, we will focus on the small table crsp.dsi.

import polars as pl
from era_pl import load_parquet

dsi = load_parquet("dsi", "crsp")

Let’s begin with a basic query.

(
    dsi
    .select("date", "vwretd", "ewretd")
    .collect()
)

shape: (26_051, 3)

date	vwretd	ewretd
date	f64	f64
1925-12-31	null	null
1926-01-02	0.005689	0.009516
1926-01-04	0.000706	0.00578
…	…	…
2024-12-27	-0.010692	-0.004451
2024-12-30	-0.009878	-0.000715
2024-12-31	-0.003392	-0.000125

We can rewrite this query using SQL and it turns out that Polars offers the function pl.sql() that allows us to run SQL against Polars data frames. Because dsf is a lazy frame, the result of pl.sql(my_query) will also be lazy, so we append .collect() to better display the results of the query.

my_query = """
  SELECT date, vwretd, ewretd
  FROM dsi
"""

pl.sql(my_query).collect()

shape: (26_051, 3)

date	vwretd	ewretd
date	f64	f64
1925-12-31	null	null
1926-01-02	0.005689	0.009516
1926-01-04	0.000706	0.00578
…	…	…
2024-12-27	-0.010692	-0.004451
2024-12-30	-0.009878	-0.000715
2024-12-31	-0.003392	-0.000125

The query above

extracts the data in three columns (date, vwretd, ewretd) (the first line)
from the table named dsi (the second line)

While the syntax differs, SQL’s SELECT operates very much like .select() from Polars.

Note that the FROM dsi comes at the end of the SQL query above, but dsi is how we would start a query in Polars. So we already see that the order of presentation in code is one of the big differences between SQL and Polars.

C.2 SQL `WHERE`

The .filter() method from Polars corresponds to WHERE in SQL. Note that WHERE goes after the FROM clause in SQL, though in practice the query optimizer will execute the filter implied by the WHERE clause before executing other elements of a query (it would be wasteful to perform calculations on data that are going to be filtered out later on).

my_query = """
  SELECT date, vwretd, ewretd
  FROM dsi
  WHERE date = '2015-01-02'
"""

pl.sql(my_query).collect()

shape: (1, 3)

date	vwretd	ewretd
date	f64	f64
2015-01-02	-0.000242	0.00147

C.3 SQL `ORDER BY`

SQL tables should be not considered to have any particular order when there is no ORDER BY clause.² An ORDER BY clause is placed at the end of an SQL query (reflecting in this case the order of operations) followed by the columns that the query is to be sorted on.

my_query = """
  SELECT date, vwretd, ewretd
  FROM dsi
  ORDER BY vwretd
"""

pl.sql(my_query).collect()

shape: (26_051, 3)

date	vwretd	ewretd
date	f64	f64
1987-10-19	-0.171346	-0.103897
1929-10-29	-0.119539	-0.134856
2020-03-16	-0.118167	-0.107631
…	…	…
1929-10-30	0.122155	0.125714
1933-03-15	0.156838	0.20302
1925-12-31	null	null

To reverse the order, use the DESC keyword after the relevant variable.

my_query = """
    SELECT date, vwretd, ewretd
    FROM dsi
    ORDER BY vwretd DESC, date
"""

pl.sql(my_query).collect()

shape: (26_051, 3)

date	vwretd	ewretd
date	f64	f64
1925-12-31	null	null
1933-03-15	0.156838	0.20302
1929-10-30	0.122155	0.125714
…	…	…
2020-03-16	-0.118167	-0.107631
1929-10-29	-0.119539	-0.134856
1987-10-19	-0.171346	-0.103897

Thus, DESC is very much equivalent to using descending=True with Polars’s .sort() method.

It is important to note that, while a Polars data frame has a well-defined row order, SQL tables are best thought of as sets of rows without a natural order. As such, one needs to pay attention to when an ORDER BY clause is implemented in a query that depends on it.

C.4 SQL `SELECT`

Creating new variables in SQL is quite straightforward. The names of calculated variables can be specified in a SELECT clause using AS, as in the following query:

my_query = """
    SELECT
      date,
      1 + vwretd AS gross_vwretd,
      DATE_PART('year', date) AS year
    FROM dsi
"""

pl.sql(my_query).collect()

shape: (26_051, 3)

date	gross_vwretd	year
date	f64	i32
1925-12-31	null	1925
1926-01-02	1.005689	1926
1926-01-04	1.000706	1926
…	…	…
2024-12-27	0.989308	2024
2024-12-30	0.990122	2024
2024-12-31	0.996608	2024

The closest equivalent Polars expression code would use .select():

(
    dsi
    .select(
        "date",
        (1 + pl.col("vwretd")).alias("gross_vwretd"),
        pl.col("date").dt.year().alias("year"),
    )
    .collect()
)

shape: (26_051, 3)

date	gross_vwretd	year
date	f64	i32
1925-12-31	null	1925
1926-01-02	1.005689	1926
1926-01-04	1.000706	1926
…	…	…
2024-12-27	0.989308	2024
2024-12-30	0.990122	2024
2024-12-31	0.996608	2024

Here the SQL date function DATE_PART() corresponds to Polars datetime method .dt.year().

Note that, instead of .select(), we could have used .with_columns(). For users of SQL, .with_columns(...) could be considered as equivalent to SELECT *, .... For Polars users, .with_columns(...) is equivalent to .select(pl.all(), ...).

(
    dsi
    .with_columns(
        (1 + pl.col("vwretd")).alias("gross_vwretd"),
        pl.col("date").dt.year().alias("year"),
    )
    .select("date", "gross_vwretd", "year")
    .collect()
)

shape: (26_051, 3)

date	gross_vwretd	year
date	f64	i32
1925-12-31	null	1925
1926-01-02	1.005689	1926
1926-01-04	1.000706	1926
…	…	…
2024-12-27	0.989308	2024
2024-12-30	0.990122	2024
2024-12-31	0.996608	2024

C.5 SQL `GROUP BY` and aggregates

One use for Polars’s .group_by() is to create summary statistics and the like for each of the groups implied by the grouping variables. In SQL, GROUP BY is used with aggregate functions, which play the same role as expressions inside .agg().

Suppose we want to get the maximum and minimum returns, as well as a measure of the volatility of the returns, for each year. In SQL, we could do the following:

my_query = """
SELECT DATE_PART('year', date) AS year,
  max(vwretd) AS max_ret,
  min(vwretd) AS min_ret,
  stddev(ln(1 + vwretd)) AS sd_ret
FROM dsi
GROUP BY year
ORDER BY sd_ret DESC
"""

pl.sql(my_query).collect()

shape: (100, 4)

year	max_ret	min_ret	sd_ret
i32	f64	f64	f64
1925	null	null	null
1932	0.109846	-0.073118	0.027422
1933	0.156838	-0.093272	0.025666
…	…	…	…
1952	0.014297	-0.014855	0.004312
1965	0.022139	-0.018758	0.004162
1964	0.007781	-0.011021	0.003097

The equivalent Polars expression code would look like this:

(
    dsi
    .with_columns(year=pl.col("date").dt.year())
    .group_by("year")
    .agg(
        max_ret=pl.col("vwretd").max(),
        min_ret=pl.col("vwretd").min(),
        sd_ret=(1 + pl.col("vwretd")).log().std(),
    )
    .sort("sd_ret", descending=True)
    .collect()
)

shape: (100, 4)

year	max_ret	min_ret	sd_ret
i32	f64	f64	f64
1925	null	null	null
1932	0.109846	-0.073118	0.027422
1933	0.156838	-0.093272	0.025666
…	…	…	…
1952	0.014297	-0.014855	0.004312
1965	0.022139	-0.018758	0.004162
1964	0.007781	-0.011021	0.003097

Note that missing values are omitted by SQL aggregate functions. Polars aggregate expressions also skip null values by default, so the handling of missing data is similar in the two approaches.

In addition to basic mathematical operators, such as + and *, PostgreSQL and DuckDB have an extensive array of functions available for use in queries.³

C.6 Using WRDS PostgreSQL

Polars actually comes with the ability to read directly from SQL sources.

Tip

To read from a PostgreSQL source, we can use the ADBC drivers, which we can install using uv.

In this see, run this command in the terminal from the root directory of the folder that you are using for this book.⁴

uv add adbc-driver-postgresql pyarrow

The following assumes that use have set WRDS_ID in the .env file in the folder that you are using for this book, as you would have done if you followed the instructions on the support page for this book. If not, you can set wrds_id to your WRDS ID directly in Python using wrds_id = "your_wrds_id".

from dotenv import dotenv_values

config = dotenv_values(".env")
wrds_id = config["WRDS_ID"]
uri = (f"postgresql://"
       "{wrds_id}@wrds-pgdata.wharton.upenn.edu:9737/wrds")

We can use the database URI to read the table crsp.dsi into Polars.

dsi = pl.read_database_uri("SELECT * FROM crsp.dsi", 
                            uri, engine='adbc')
dsi

/var/folders/cl/47p53r5n3v1ggzf1_2t0bjyh0000gn/T/ipykernel_79888/4249341716.py:1: UserWarning: Extension type 'arrow.opaque' is not registered; loading as its storage type.

To avoid this warning, register the extension type or set environment variable 'POLARS_UNKNOWN_EXTENSION_TYPE_BEHAVIOR' to 'load_as_storage' or 'load_as_extension'.

In Polars 2.0, the default behavior will change to 'load_as_extension'.
  dsi = pl.read_database_uri("SELECT * FROM crsp.dsi",

shape: (26_051, 11)

date	vwretd	vwretx	ewretd	ewretx	sprtrn	spindx	totval	totcnt	usdval	usdcnt
date	str	str	str	str	str	str	str	i32	str	i32
1925-12-31	null	null	null	null	null	null	"27487487.20"	503	null	null
1926-01-02	"0.005689"	"0.005689"	"0.009516"	"0.009516"	null	null	"27600296.60"	497	"27366755.70"	494
1926-01-04	"0.000706"	"0.000706"	"0.005780"	"0.005780"	null	null	"27577892.10"	502	"27479721.60"	495
…	…	…	…	…	…	…	…	…	…	…
2024-12-27	"-0.010692"	"-0.010775"	"-0.004451"	"-0.004754"	"-0.011056"	"5970.84"	"74890129000.00"	8866	"75702330000.00"	8861
2024-12-30	"-0.009878"	"-0.009900"	"-0.000715"	"-0.001173"	"-0.010702"	"5906.94"	"74153714600.00"	8867	"74890031400.00"	8864
2024-12-31	"-0.003392"	"-0.003541"	"-0.000125"	"-0.000491"	"-0.004285"	"5881.63"	"74032373700.00"	8866	"74153696300.00"	8865

The problem with pl.read_database_uri() is that it returns a pl.DataFrame. So, while it is fine to use with small tables such as crsp.dsi, it will be painful to use pl.read_database_uri() with large tables, such as comp.fundq or crsp.dsf. In effect, using pl.read_database_uri() means losing all the benefits of using lazy frames.

C.6.1 Using Ibis

While Polars does not have a way of treating database tables as lazy data frames, the Python package Ibis aims to provide exactly that functionality.

Tip

To use Ibis, you will need to install it using uv. Run this command in the terminal from the root directory of the folder that you are using for this book.⁵

uv add 'ibis-framework[postgres, duckdb]'

import ibis
from ibis import _

To use Ibis, we connect to a database using a URI. Here we use the URI for the WRDS PostgreSQL database that we stored in uri above.

con = ibis.connect(uri)
dsi = con.table("dsi", database="crsp")

The following code shows how Ibis displays lazy data frames.

dsi

DatabaseTable: crsp.dsi
  date   date
  vwretd decimal(9, 6)
  vwretx decimal(9, 6)
  ewretd decimal(9, 6)
  ewretx decimal(9, 6)
  sprtrn decimal(9, 6)
  spindx decimal(7, 2)
  totval decimal(13, 2)
  totcnt int32
  usdval decimal(13, 2)
  usdcnt int32

We can put Ibis into “interactive” mode using the following setting:

ibis.options.interactive = True

Now Ibis shows a preview of the data when displaying lazy data frames.⁶

dsi

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ date       ┃ vwretd        ┃ vwretx        ┃ ewretd        ┃ ewretx        ┃ sprtrn        ┃ spindx        ┃ totval         ┃ totcnt ┃ usdval         ┃ usdcnt ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ date       │ decimal(9, 6) │ decimal(9, 6) │ decimal(9, 6) │ decimal(9, 6) │ decimal(9, 6) │ decimal(7, 2) │ decimal(13, 2) │ int32  │ decimal(13, 2) │ int32  │
├────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼────────────────┼────────┼────────────────┼────────┤
│ 1925-12-31 │          NULL │          NULL │          NULL │          NULL │          NULL │          NULL │    27487487.20 │    503 │           NULL │   NULL │
│ 1926-01-02 │      0.005689 │      0.005689 │      0.009516 │      0.009516 │          NULL │          NULL │    27600296.60 │    497 │    27366755.70 │    494 │
│ 1926-01-04 │      0.000706 │      0.000706 │      0.005780 │      0.005780 │          NULL │          NULL │    27577892.10 │    502 │    27479721.60 │    495 │
│ 1926-01-05 │     -0.004821 │     -0.004867 │     -0.001927 │     -0.002030 │          NULL │          NULL │    27530212.50 │    501 │    27561757.10 │    499 │
│ 1926-01-06 │     -0.000423 │     -0.000427 │      0.001182 │      0.001155 │          NULL │          NULL │    27618772.60 │    505 │    27526837.50 │    500 │
│ 1926-01-07 │      0.004988 │      0.004953 │      0.008453 │      0.008384 │          NULL │          NULL │    27690393.80 │    504 │    27553072.60 │    504 │
│ 1926-01-08 │     -0.003238 │     -0.003549 │     -0.001689 │     -0.001938 │          NULL │          NULL │    27547713.60 │    502 │    27645825.00 │    502 │
│ 1926-01-09 │      0.002209 │      0.002209 │      0.003312 │      0.003312 │          NULL │          NULL │    27424192.80 │    498 │    27341031.40 │    497 │
│ 1926-01-11 │     -0.008540 │     -0.008612 │     -0.009943 │     -0.010013 │          NULL │          NULL │    27415594.60 │    504 │    27424192.80 │    498 │
│ 1926-01-12 │     -0.000929 │     -0.000929 │     -0.003623 │     -0.003623 │          NULL │          NULL │    27400801.40 │    502 │    27407261.60 │    501 │
│ …          │             … │             … │             … │             … │             … │             … │              … │      … │              … │      … │
└────────────┴───────────────┴───────────────┴───────────────┴───────────────┴───────────────┴───────────────┴────────────────┴────────┴────────────────┴────────┘

We can use Ibis methods to manipulate data before bringing them into Polars with .to_polars().

(
    dsi
    .select(
        "date",
        gross_vwretd=1 + dsi.vwretd,
        year=dsi.date.year(),
    )
    .to_polars()
)

shape: (26_051, 3)

date	gross_vwretd	year
date	decimal[9,6]	i32
1925-12-31	null	1925
1926-01-02	1.005689	1926
1926-01-04	1.000706	1926
…	…	…
2024-12-27	0.989308	2024
2024-12-30	0.990122	2024
2024-12-31	0.996608	2024

C.6.2 Comparing Ibis and Polars

In Chapter 12, we used the following Polars code to create earn_annc_dates, a (lazy) data frame containing data on earnings annnouncement dates.

first_date = pl.date(2010, 1, 1)
last_date = pl.date(2019, 12, 31)

fundq = load_parquet(table="fundq", schema="comp")

earn_annc_dates = (
    fundq
    .filter(
        pl.col("indfmt") == "INDL",
        pl.col("datafmt") == "STD",
        pl.col("consol") == "C",
        pl.col("popsrc") == "D",
        pl.col("rdq").is_not_null(),
        pl.col("fqtr") == 4,
    )
    .select("gvkey", "datadate", "rdq")
    .filter(pl.col("datadate").is_between(first_date, last_date))
)

Rather than creating earn_annc_dates from a local Parquet file, we could use Ibis to get the data directly from the WRDS PostgreSQL database. We use con.table() to create the lazy frame–equivalent fundq, and then use Ibis verbs to produce the data we want. Again we can append .to_polars() to convert the result to a pl.DataFrame.

From the code below, you can see that, while the Ibis code looks very similar to the Polars code, there are differences in the details. Things that stayed the same include .filter() and .select(). Things that changed include:

pl.date(2010, 1, 1). became ibis.date("2010-01-01")
pl.col("indfmt") became _.indfmt
.is_not_null() became .notnull()
.is_between() because .between()

fundq = con.table("fundq", database="comp")

first_date = ibis.date("2010-01-01")
last_date = ibis.date("2019-12-31")

earn_annc_dates = (
    fundq
    .filter(
        _.indfmt == "INDL",
        _.datafmt == "STD",
        _.consol == "C",
        _.popsrc == "D",
        _.rdq.notnull(),
        _.fqtr == 4,
        _.datadate.between(first_date, last_date)
    )
    .select("gvkey", "datadate", "rdq")
    .filter(_.datadate.between(first_date, last_date))
    .to_polars()
)

We can even use Ibis to access data in Parquet files organized in the way we have them for this book. Below we import an alternative version of load_parquet() supplied by the era_py interface of the package containing era_pl. To avoid confusing this version of load_parquet() with the one from era_pl, we rename the former as load_parquet_db(). This version of the function accepts an additional argument representing a connection to a local (DuckDB) database. In this case, we change almost nothing from our Ibis code above, even though the code is now accessing precisely the same Parquet data as we used above with our Polars code.

from era_py import load_parquet as load_parquet_db

con = ibis.duckdb.connect()

fundq = load_parquet_db(con, table="fundq", schema="comp")

earn_annc_dates = (
    fundq
    .filter(
        _.indfmt == "INDL",
        _.datafmt == "STD",
        _.consol == "C",
        _.popsrc == "D",
        _.rdq.notnull(),
        _.fqtr == 4,
        _.datadate.between(first_date, last_date),
    )
    .select("gvkey", "datadate", "rdq")
    .to_polars()
)

Subsequent code in Chapter 12 assumed that earn_annc_dates was a pl.LazyFrame. By appending .lazy() to the code above, we could create a pl.LazyFrame and the rest of the Polars code would function without any changes.

In this case, there is no particular advantage to using Ibis. The PostgreSQL version actually takes much longer to run than the original Polars code, though we didn’t need to download the full comp.fundq to run the code. While the DuckDB variant is not much slower than the Polars variant, it offers no clear advantage in this case.

However, DuckDB may offer functionality that Polars lacks and this example shows how easy it is to work with a DuckDB database even when most of the code is using Polars.

This paragraph adapts ideas from Chapter 3 of Wickham:2023aa in which the methods here are described as verbs. Grammatical sticklers will note that .with_columns() is more like an adverbial modifier than a verb. While .mutate() might have been a better choice, such finer points of grammar were not a driver in designing the Polars interface.↩︎
See https://stackoverflow.com/questions/20050341 on this point.↩︎
See https://www.postgresql.org/docs/current/functions.html.↩︎
You could use !uv add adbc-driver-postgresql pyarrow to run this command from a Python notebook.↩︎
You could use !uv add 'ibis-framework[postgres, duckdb]' to run this command from inside a Python notebook.↩︎
This is probably analogous to using df.show() with a pl.LazyFrame stored in df.↩︎

C.1 Comparing SQL and Polars

C.2 SQL WHERE

C.3 SQL ORDER BY

C.4 SQL SELECT

C.5 SQL GROUP BY and aggregates

C.6 Using WRDS PostgreSQL

C.6.1 Using Ibis

C.6.2 Comparing Ibis and Polars

C.2 SQL `WHERE`

C.3 SQL `ORDER BY`

C.4 SQL `SELECT`

C.5 SQL `GROUP BY` and aggregates