import polars as pl
from era_pl import load_parquet
dsi = load_parquet("dsi", "crsp")Appendix C — SQL primer
This brief appendix aims to serve two groups of users. The first group comprises those who have followed the material in the main part of the book and would like a quick introduction to SQL. The second group comprises those who know SQL (say, SAS’s PROC SQL) would like a quick introduction to Polars.
C.1 Comparing SQL and Polars
SQL is a specialized language for manipulating and retrieving tabular data used by almost all modern database systems.
Polars is a data frame library that provide a set of methods that help you solve the most common data manipulation challenges:1
.with_columns()adds new variables that are functions of existing variables.select()picks variables based on their names.filter()picks cases based on their values.agg()reduces multiple values down to a single summary.sort()changes the ordering of the rows
As we will see below, there is overlap between .with_columns() and .select() in both Polars and the SQL equivalents.
| Polars method (R) | SQL equivalent |
|---|---|
.select() |
SELECT |
.filter() |
WHERE |
.group_by() |
GROUP BY |
.sort() |
ORDER BY |
.with_columns() |
used-defined columns |
.agg() |
used-defined aggregate columns |
Portions of the code in this appendix use packages in addition to those installed when you set up your computer to use the code found in this book using the guide on the support page. Instructions will be provided in the sections that need additional packages.
To understand equivalent operations in Polars and SQL, we will focus on the small table crsp.dsi.
Let’s begin with a basic query.
(
dsi
.select("date", "vwretd", "ewretd")
.collect()
)| date | vwretd | ewretd |
|---|---|---|
| date | f64 | f64 |
| 1925-12-31 | null | null |
| 1926-01-02 | 0.005689 | 0.009516 |
| 1926-01-04 | 0.000706 | 0.00578 |
| … | … | … |
| 2024-12-27 | -0.010692 | -0.004451 |
| 2024-12-30 | -0.009878 | -0.000715 |
| 2024-12-31 | -0.003392 | -0.000125 |
We can rewrite this query using SQL and it turns out that Polars offers the function pl.sql() that allows us to run SQL against Polars data frames. Because dsf is a lazy frame, the result of pl.sql(my_query) will also be lazy, so we append .collect() to better display the results of the query.
my_query = """
SELECT date, vwretd, ewretd
FROM dsi
"""
pl.sql(my_query).collect()| date | vwretd | ewretd |
|---|---|---|
| date | f64 | f64 |
| 1925-12-31 | null | null |
| 1926-01-02 | 0.005689 | 0.009516 |
| 1926-01-04 | 0.000706 | 0.00578 |
| … | … | … |
| 2024-12-27 | -0.010692 | -0.004451 |
| 2024-12-30 | -0.009878 | -0.000715 |
| 2024-12-31 | -0.003392 | -0.000125 |
The query above
- extracts the data in three columns (
date,vwretd,ewretd) (the first line) - from the table named
dsi(the second line)
While the syntax differs, SQL’s SELECT operates very much like .select() from Polars.
Note that the FROM dsi comes at the end of the SQL query above, but dsi is how we would start a query in Polars. So we already see that the order of presentation in code is one of the big differences between SQL and Polars.
C.2 SQL WHERE
The .filter() method from Polars corresponds to WHERE in SQL. Note that WHERE goes after the FROM clause in SQL, though in practice the query optimizer will execute the filter implied by the WHERE clause before executing other elements of a query (it would be wasteful to perform calculations on data that are going to be filtered out later on).
my_query = """
SELECT date, vwretd, ewretd
FROM dsi
WHERE date = '2015-01-02'
"""
pl.sql(my_query).collect()| date | vwretd | ewretd |
|---|---|---|
| date | f64 | f64 |
| 2015-01-02 | -0.000242 | 0.00147 |
C.3 SQL ORDER BY
SQL tables should be not considered to have any particular order when there is no ORDER BY clause.2 An ORDER BY clause is placed at the end of an SQL query (reflecting in this case the order of operations) followed by the columns that the query is to be sorted on.
my_query = """
SELECT date, vwretd, ewretd
FROM dsi
ORDER BY vwretd
"""
pl.sql(my_query).collect()| date | vwretd | ewretd |
|---|---|---|
| date | f64 | f64 |
| 1987-10-19 | -0.171346 | -0.103897 |
| 1929-10-29 | -0.119539 | -0.134856 |
| 2020-03-16 | -0.118167 | -0.107631 |
| … | … | … |
| 1929-10-30 | 0.122155 | 0.125714 |
| 1933-03-15 | 0.156838 | 0.20302 |
| 1925-12-31 | null | null |
To reverse the order, use the DESC keyword after the relevant variable.
my_query = """
SELECT date, vwretd, ewretd
FROM dsi
ORDER BY vwretd DESC, date
"""
pl.sql(my_query).collect()| date | vwretd | ewretd |
|---|---|---|
| date | f64 | f64 |
| 1925-12-31 | null | null |
| 1933-03-15 | 0.156838 | 0.20302 |
| 1929-10-30 | 0.122155 | 0.125714 |
| … | … | … |
| 2020-03-16 | -0.118167 | -0.107631 |
| 1929-10-29 | -0.119539 | -0.134856 |
| 1987-10-19 | -0.171346 | -0.103897 |
Thus, DESC is very much equivalent to using descending=True with Polars’s .sort() method.
It is important to note that, while a Polars data frame has a well-defined row order, SQL tables are best thought of as sets of rows without a natural order. As such, one needs to pay attention to when an ORDER BY clause is implemented in a query that depends on it.
C.4 SQL SELECT
Creating new variables in SQL is quite straightforward. The names of calculated variables can be specified in a SELECT clause using AS, as in the following query:
my_query = """
SELECT
date,
1 + vwretd AS gross_vwretd,
DATE_PART('year', date) AS year
FROM dsi
"""
pl.sql(my_query).collect()| date | gross_vwretd | year |
|---|---|---|
| date | f64 | i32 |
| 1925-12-31 | null | 1925 |
| 1926-01-02 | 1.005689 | 1926 |
| 1926-01-04 | 1.000706 | 1926 |
| … | … | … |
| 2024-12-27 | 0.989308 | 2024 |
| 2024-12-30 | 0.990122 | 2024 |
| 2024-12-31 | 0.996608 | 2024 |
The closest equivalent Polars expression code would use .select():
(
dsi
.select(
"date",
(1 + pl.col("vwretd")).alias("gross_vwretd"),
pl.col("date").dt.year().alias("year"),
)
.collect()
)| date | gross_vwretd | year |
|---|---|---|
| date | f64 | i32 |
| 1925-12-31 | null | 1925 |
| 1926-01-02 | 1.005689 | 1926 |
| 1926-01-04 | 1.000706 | 1926 |
| … | … | … |
| 2024-12-27 | 0.989308 | 2024 |
| 2024-12-30 | 0.990122 | 2024 |
| 2024-12-31 | 0.996608 | 2024 |
Here the SQL date function DATE_PART() corresponds to Polars datetime method .dt.year().
Note that, instead of .select(), we could have used .with_columns(). For users of SQL, .with_columns(...) could be considered as equivalent to SELECT *, .... For Polars users, .with_columns(...) is equivalent to .select(pl.all(), ...).
(
dsi
.with_columns(
(1 + pl.col("vwretd")).alias("gross_vwretd"),
pl.col("date").dt.year().alias("year"),
)
.select("date", "gross_vwretd", "year")
.collect()
)| date | gross_vwretd | year |
|---|---|---|
| date | f64 | i32 |
| 1925-12-31 | null | 1925 |
| 1926-01-02 | 1.005689 | 1926 |
| 1926-01-04 | 1.000706 | 1926 |
| … | … | … |
| 2024-12-27 | 0.989308 | 2024 |
| 2024-12-30 | 0.990122 | 2024 |
| 2024-12-31 | 0.996608 | 2024 |
C.5 SQL GROUP BY and aggregates
One use for Polars’s .group_by() is to create summary statistics and the like for each of the groups implied by the grouping variables. In SQL, GROUP BY is used with aggregate functions, which play the same role as expressions inside .agg().
Suppose we want to get the maximum and minimum returns, as well as a measure of the volatility of the returns, for each year. In SQL, we could do the following:
my_query = """
SELECT DATE_PART('year', date) AS year,
max(vwretd) AS max_ret,
min(vwretd) AS min_ret,
stddev(ln(1 + vwretd)) AS sd_ret
FROM dsi
GROUP BY year
ORDER BY sd_ret DESC
"""
pl.sql(my_query).collect()| year | max_ret | min_ret | sd_ret |
|---|---|---|---|
| i32 | f64 | f64 | f64 |
| 1925 | null | null | null |
| 1932 | 0.109846 | -0.073118 | 0.027422 |
| 1933 | 0.156838 | -0.093272 | 0.025666 |
| … | … | … | … |
| 1952 | 0.014297 | -0.014855 | 0.004312 |
| 1965 | 0.022139 | -0.018758 | 0.004162 |
| 1964 | 0.007781 | -0.011021 | 0.003097 |
The equivalent Polars expression code would look like this:
(
dsi
.with_columns(year=pl.col("date").dt.year())
.group_by("year")
.agg(
max_ret=pl.col("vwretd").max(),
min_ret=pl.col("vwretd").min(),
sd_ret=(1 + pl.col("vwretd")).log().std(),
)
.sort("sd_ret", descending=True)
.collect()
)| year | max_ret | min_ret | sd_ret |
|---|---|---|---|
| i32 | f64 | f64 | f64 |
| 1925 | null | null | null |
| 1932 | 0.109846 | -0.073118 | 0.027422 |
| 1933 | 0.156838 | -0.093272 | 0.025666 |
| … | … | … | … |
| 1952 | 0.014297 | -0.014855 | 0.004312 |
| 1965 | 0.022139 | -0.018758 | 0.004162 |
| 1964 | 0.007781 | -0.011021 | 0.003097 |
Note that missing values are omitted by SQL aggregate functions. Polars aggregate expressions also skip null values by default, so the handling of missing data is similar in the two approaches.
In addition to basic mathematical operators, such as + and *, PostgreSQL and DuckDB have an extensive array of functions available for use in queries.3
C.6 Using WRDS PostgreSQL
Polars actually comes with the ability to read directly from SQL sources.
To read from a PostgreSQL source, we can use the ADBC drivers, which we can install using uv.
In this see, run this command in the terminal from the root directory of the folder that you are using for this book.4
uv add adbc-driver-postgresql pyarrowThe following assumes that use have set WRDS_ID in the .env file in the folder that you are using for this book, as you would have done if you followed the instructions on the support page for this book. If not, you can set wrds_id to your WRDS ID directly in Python using wrds_id = "your_wrds_id".
from dotenv import dotenv_values
config = dotenv_values(".env")
wrds_id = config["WRDS_ID"]
uri = (f"postgresql://"
"{wrds_id}@wrds-pgdata.wharton.upenn.edu:9737/wrds")We can use the database URI to read the table crsp.dsi into Polars.
dsi = pl.read_database_uri("SELECT * FROM crsp.dsi",
uri, engine='adbc')
dsi/var/folders/cl/47p53r5n3v1ggzf1_2t0bjyh0000gn/T/ipykernel_11769/4249341716.py:1: UserWarning: Extension type 'arrow.opaque' is not registered; loading as its storage type.
To avoid this warning, register the extension type or set environment variable 'POLARS_UNKNOWN_EXTENSION_TYPE_BEHAVIOR' to 'load_as_storage' or 'load_as_extension'.
In Polars 2.0, the default behavior will change to 'load_as_extension'.
dsi = pl.read_database_uri("SELECT * FROM crsp.dsi",
| date | vwretd | vwretx | ewretd | ewretx | sprtrn | spindx | totval | totcnt | usdval | usdcnt |
|---|---|---|---|---|---|---|---|---|---|---|
| date | str | str | str | str | str | str | str | i32 | str | i32 |
| 1925-12-31 | null | null | null | null | null | null | "27487487.20" | 503 | null | null |
| 1926-01-02 | "0.005689" | "0.005689" | "0.009516" | "0.009516" | null | null | "27600296.60" | 497 | "27366755.70" | 494 |
| 1926-01-04 | "0.000706" | "0.000706" | "0.005780" | "0.005780" | null | null | "27577892.10" | 502 | "27479721.60" | 495 |
| … | … | … | … | … | … | … | … | … | … | … |
| 2024-12-27 | "-0.010692" | "-0.010775" | "-0.004451" | "-0.004754" | "-0.011056" | "5970.84" | "74890129000.00" | 8866 | "75702330000.00" | 8861 |
| 2024-12-30 | "-0.009878" | "-0.009900" | "-0.000715" | "-0.001173" | "-0.010702" | "5906.94" | "74153714600.00" | 8867 | "74890031400.00" | 8864 |
| 2024-12-31 | "-0.003392" | "-0.003541" | "-0.000125" | "-0.000491" | "-0.004285" | "5881.63" | "74032373700.00" | 8866 | "74153696300.00" | 8865 |
The problem with pl.read_database_uri() is that it returns a pl.DataFrame. So, while it is fine to use with small tables such as crsp.dsi, it will be painful to use pl.read_database_uri() with large tables, such as comp.fundq or crsp.dsf. In effect, using pl.read_database_uri() means losing all the benefits of using lazy frames.
C.6.1 Using Ibis
While Polars does not have a way of treating database tables as lazy data frames, the Python package Ibis aims to provide exactly that functionality.
To use Ibis, you will need to install it using uv. Run this command in the terminal from the root directory of the folder that you are using for this book.5
uv add 'ibis-framework[postgres, duckdb]'import ibis
from ibis import _To use Ibis, we connect to a database using a URI. Here we use the URI for the WRDS PostgreSQL database that we stored in uri above.
con = ibis.connect(uri)
dsi = con.table("dsi", database="crsp")The following code shows how Ibis displays lazy data frames.
dsiDatabaseTable: crsp.dsi date date vwretd decimal(9, 6) vwretx decimal(9, 6) ewretd decimal(9, 6) ewretx decimal(9, 6) sprtrn decimal(9, 6) spindx decimal(7, 2) totval decimal(13, 2) totcnt int32 usdval decimal(13, 2) usdcnt int32
We can put Ibis into “interactive” mode using the following setting:
ibis.options.interactive = TrueNow Ibis shows a preview of the data when displaying lazy data frames.6
dsi┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┃ date ┃ vwretd ┃ vwretx ┃ ewretd ┃ ewretx ┃ sprtrn ┃ spindx ┃ totval ┃ totcnt ┃ usdval ┃ usdcnt ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━┩ │ date │ decimal(9, 6) │ decimal(9, 6) │ decimal(9, 6) │ decimal(9, 6) │ decimal(9, 6) │ decimal(7, 2) │ decimal(13, 2) │ int32 │ decimal(13, 2) │ int32 │ ├────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼────────────────┼────────┼────────────────┼────────┤ │ 1925-12-31 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ 27487487.20 │ 503 │ NULL │ NULL │ │ 1926-01-02 │ 0.005689 │ 0.005689 │ 0.009516 │ 0.009516 │ NULL │ NULL │ 27600296.60 │ 497 │ 27366755.70 │ 494 │ │ 1926-01-04 │ 0.000706 │ 0.000706 │ 0.005780 │ 0.005780 │ NULL │ NULL │ 27577892.10 │ 502 │ 27479721.60 │ 495 │ │ 1926-01-05 │ -0.004821 │ -0.004867 │ -0.001927 │ -0.002030 │ NULL │ NULL │ 27530212.50 │ 501 │ 27561757.10 │ 499 │ │ 1926-01-06 │ -0.000423 │ -0.000427 │ 0.001182 │ 0.001155 │ NULL │ NULL │ 27618772.60 │ 505 │ 27526837.50 │ 500 │ │ 1926-01-07 │ 0.004988 │ 0.004953 │ 0.008453 │ 0.008384 │ NULL │ NULL │ 27690393.80 │ 504 │ 27553072.60 │ 504 │ │ 1926-01-08 │ -0.003238 │ -0.003549 │ -0.001689 │ -0.001938 │ NULL │ NULL │ 27547713.60 │ 502 │ 27645825.00 │ 502 │ │ 1926-01-09 │ 0.002209 │ 0.002209 │ 0.003312 │ 0.003312 │ NULL │ NULL │ 27424192.80 │ 498 │ 27341031.40 │ 497 │ │ 1926-01-11 │ -0.008540 │ -0.008612 │ -0.009943 │ -0.010013 │ NULL │ NULL │ 27415594.60 │ 504 │ 27424192.80 │ 498 │ │ 1926-01-12 │ -0.000929 │ -0.000929 │ -0.003623 │ -0.003623 │ NULL │ NULL │ 27400801.40 │ 502 │ 27407261.60 │ 501 │ │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ └────────────┴───────────────┴───────────────┴───────────────┴───────────────┴───────────────┴───────────────┴────────────────┴────────┴────────────────┴────────┘
We can use Ibis methods to manipulate data before bringing them into Polars with .to_polars().
(
dsi
.select(
"date",
gross_vwretd=1 + dsi.vwretd,
year=dsi.date.year(),
)
.to_polars()
)| date | gross_vwretd | year |
|---|---|---|
| date | decimal[9,6] | i32 |
| 1925-12-31 | null | 1925 |
| 1926-01-02 | 1.005689 | 1926 |
| 1926-01-04 | 1.000706 | 1926 |
| … | … | … |
| 2024-12-27 | 0.989308 | 2024 |
| 2024-12-30 | 0.990122 | 2024 |
| 2024-12-31 | 0.996608 | 2024 |
C.6.2 Comparing Ibis and Polars
In Chapter 12, we used the following Polars code to create earn_annc_dates, a (lazy) data frame containing data on earnings annnouncement dates.
first_date = pl.date(2010, 1, 1)
last_date = pl.date(2019, 12, 31)
fundq = load_parquet(table="fundq", schema="comp")
earn_annc_dates = (
fundq
.filter(
pl.col("indfmt") == "INDL",
pl.col("datafmt") == "STD",
pl.col("consol") == "C",
pl.col("popsrc") == "D",
pl.col("rdq").is_not_null(),
pl.col("fqtr") == 4,
)
.select("gvkey", "datadate", "rdq")
.filter(pl.col("datadate").is_between(first_date, last_date))
)Rather than creating earn_annc_dates from a local Parquet file, we could use Ibis to get the data directly from the WRDS PostgreSQL database. We use con.table() to create the lazy frame–equivalent fundq, and then use Ibis verbs to produce the data we want. Again we can append .to_polars() to convert the result to a pl.DataFrame.
From the code below, you can see that, while the Ibis code looks very similar to the Polars code, there are differences in the details. Things that stayed the same include .filter() and .select(). Things that changed include:
pl.date(2010, 1, 1). becameibis.date("2010-01-01")pl.col("indfmt")became_.indfmt.is_not_null()became.notnull().is_between()because.between()
fundq = con.table("fundq", database="comp")
first_date = ibis.date("2010-01-01")
last_date = ibis.date("2019-12-31")
earn_annc_dates = (
fundq
.filter(
_.indfmt == "INDL",
_.datafmt == "STD",
_.consol == "C",
_.popsrc == "D",
_.rdq.notnull(),
_.fqtr == 4,
_.datadate.between(first_date, last_date)
)
.select("gvkey", "datadate", "rdq")
.filter(_.datadate.between(first_date, last_date))
.to_polars()
)We can even use Ibis to access data in Parquet files organized in the way we have them for this book. Below we import an alternative version of load_parquet() supplied by the era_py interface of the package containing era_pl. To avoid confusing this version of load_parquet() with the one from era_pl, we rename the former as load_parquet_db(). This version of the function accepts an additional argument representing a connection to a local (DuckDB) database. In this case, we change almost nothing from our Ibis code above, even though the code is now accessing precisely the same Parquet data as we used above with our Polars code.
from era_py import load_parquet as load_parquet_db
con = ibis.duckdb.connect()
fundq = load_parquet_db(con, table="fundq", schema="comp")
earn_annc_dates = (
fundq
.filter(
_.indfmt == "INDL",
_.datafmt == "STD",
_.consol == "C",
_.popsrc == "D",
_.rdq.notnull(),
_.fqtr == 4,
_.datadate.between(first_date, last_date),
)
.select("gvkey", "datadate", "rdq")
.to_polars()
)Subsequent code in Chapter 12 assumed that earn_annc_dates was a pl.LazyFrame. By appending .lazy() to the code above, we could create a pl.LazyFrame and the rest of the Polars code would function without any changes.
In this case, there is no particular advantage to using Ibis. The PostgreSQL version actually takes much longer to run than the original Polars code, though we didn’t need to download the full comp.fundq to run the code. While the DuckDB variant is not much slower than the Polars variant, it offers no clear advantage in this case. However, DuckDB may offer functionality that Polars lacks and this example shows how easy it is to work with a DuckDB database even when most of the code is using Polars.
This paragraph adapts ideas from Chapter 3 of Wickham:2023aa in which the methods here are described as verbs. Grammatical sticklers will note that
.with_columns()is more like an adverbial modifier than a verb. While.mutate()might have been a better choice, such finer points of grammar were not a driver in designing the Polars interface.↩︎See https://stackoverflow.com/questions/20050341 on this point.↩︎
See https://www.postgresql.org/docs/current/functions.html.↩︎
You could use
!uv add adbc-driver-postgresql pyarrowto run this command from a Python notebook.↩︎You could use
!uv add 'ibis-framework[postgres, duckdb]'to run this command from inside a Python notebook.↩︎This is probably analogous to using
df.show()with apl.LazyFramestored indf.↩︎