Using the Parquet Repository with Polars
Once you have built a local Parquet repository with db2pq, the next step is usually analysis rather than more data movement. This page shows a practical Polars-based workflow for working with those files.
The examples here are adapted from the Python Polars Ball and Brown (1968) replication note: “Ball and Brown (1968): Replication using Python Polars”.
Repository layout
By default, db2pq writes files in this layout:
<DATA_DIR>/<schema>/<table>.parquet
That means a CRSP monthly index file might live at:
$DATA_DIR/crsp/msi.parquet
Scanning files lazily
Polars works especially well with a db2pq repository because you can scan the Parquet files lazily instead of reading them into memory all at once.
from pathlib import Path
import os
import polars as pl
data_dir = Path(os.environ["DATA_DIR"])
msi = pl.scan_parquet(data_dir / "crsp" / "msi.parquet")
msf = pl.scan_parquet(data_dir / "crsp" / "msf.parquet")
stocknames = pl.scan_parquet(data_dir / "crsp" / "stocknames.parquet")
ccmxpf_lnkhist = pl.scan_parquet(data_dir / "crsp" / "ccmxpf_lnkhist.parquet")
funda = pl.scan_parquet(data_dir / "comp" / "funda.parquet")
fundq = pl.scan_parquet(data_dir / "comp" / "fundq.parquet")The important point is that these objects are pl.LazyFrame instances. Polars can push filters down into the Parquet scan, read only needed columns, and defer work until you call .collect().
Loading only the tables you need
For the Ball and Brown replication, you do not need a full WRDS mirror. A small set of tables is enough:
from db2pq import wrds_update_pq
# CRSP
wrds_update_pq("msi", "crsp")
wrds_update_pq("msf", "crsp")
wrds_update_pq("stocknames", "crsp")
wrds_update_pq(
"ccmxpf_lnkhist",
"crsp",
col_types={"lpermno": "int32", "lpermco": "int32"},
)
# Compustat
wrds_update_pq("funda", "comp")
wrds_update_pq("fundq", "comp")This is often a better pattern than bulk-downloading many tables you may never use in the analysis.
Filtering and selecting lazily
Once you have lazy scans, stay lazy for as long as possible.
annc_events = (
fundq
.filter(
pl.col("indfmt") == "INDL",
pl.col("datafmt") == "STD",
pl.col("consol") == "C",
pl.col("popsrc") == "D",
pl.col("fqtr") == 4,
pl.col("fyr") == 12,
pl.col("rdq").is_not_null(),
)
.select("gvkey", "datadate", "rdq")
.with_columns(annc_month=pl.col("rdq").dt.truncate("1mo"))
)This style lets Polars prune both rows and columns before materialization.
Building helper tables from smaller inputs
Another useful pattern from the note is to derive small helper tables from compact source files before joining them to larger tables.
crsp_dates = (
msi
.select("date")
.sort("date")
.with_row_index("td", offset=1)
.with_columns(
td=pl.col("td").cast(pl.Int32),
month=pl.col("date").dt.truncate("1mo"),
)
)Using a small table like msi to define the monthly calendar is often cleaner and cheaper than deriving the same structure from a much larger table such as msf.
Join only after narrowing the data
When working with large WRDS-derived tables, a good default strategy is:
- start with lazy scans
- filter early
- select only needed columns
- build compact helper tables
- join later
That approach is visible throughout the Ball and Brown replication and tends to work well for Parquet-centered workflows generally.
When to collect
Stay in LazyFrame form until you actually need concrete results for:
- plotting
- export
- summary tables
- debugging intermediate output
At that point, collect only the object you need:
preview = annc_events.limit(10).collect()Original source
For a full research example built around these ideas, see “Ball and Brown (1968): Replication using Python Polars”.