Converting lazy data frames into Parquet files (Python version)

Python
Parquet
db2pq
WRDS
Tidy Finance
Author

Ian D. Gow

Published

23 March 2026

Abstract

In a note I published yesterday (Converting lazy data frames into Parquet files), I showed how my db2pq R package can be used to turn “lazy data frames” produced by the R package dbplyr into Parquet files on disk without needing to load the data into memory. Today, I wondered if I could do the same with my db2pq Python package. In this note, I turn “lazy data frames” produced by Ibis into Parquet files on disk.

Keywords

Python, Polars, Arrow, db2pq, WRDS

Note

The source for this note is available in a folder on GitHub that contains two files:

To run this note locally with uv, put the two files above in a directory on your computer. I will refer to this directory as the project directory in this note.

For the convenience of Tidy Finance readers, I follow the approach laid out here closely. The Tidy Finance instructions start with installation of “uv, a modern Python package and project manager.” You should install uv using the instructions provided by Tidy Finance; these are taken from the homepage for uv.

From the project directory, run the following command:

uv sync
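
The uv sync command reads the project’s pyproject.toml (and uv.lock, if present) and installs the pinned dependencies into a local virtual environment. I don’t reproduce the real project file here, but a hypothetical dependencies section for a note like this might look something like:

```toml
# Hypothetical sketch only -- the actual pyproject.toml ships with the
# GitHub folder mentioned above.
[project]
name = "ibis-to-pq"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "db2pq",
    "ibis-framework[postgres,duckdb]",
    "polars",
    "pyarrow",
    "python-dotenv",
    "sqlalchemy",
    "psycopg",
    "jupyterlab",
]
```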

Next, you should follow the instructions provided by Tidy Finance for setting things up to connect to WRDS:

Create a .env file in your project directory. For the purpose of this book, we create and save the following variables (where user and password are our private login credentials):

WRDS_USER=user
WRDS_PASSWORD=password

Once you have done the above, you should be able to run the code in this note. For example, you could run uv run jupyter lab and copy and paste the commands shown here (in order). Alternatively, if you are comfortable with Quarto, you could open ibis_to_pq.qmd and treat it like a notebook.

1 Motivating example

According to Tidy Finance with Python:

The daily CRSP data file is substantially larger than monthly data and can exceed 20 GB. This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to save the entire dataset to your local folder), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session.

There is a solution to this challenge. As with many big data problems, you can split up the big task into several smaller tasks that are easier to handle. That is, instead of downloading data about all stocks at once, download the data in small batches of stocks consecutively.

Below I show how one can simplify the code dramatically (no batches!) and download the data faster (for me, it takes under five minutes), with no significant burden on either the CPU or RAM.

An issue with translating the lazy_tbl_to_pq() function from the R version of my db2pq package is that there is no equivalent of dbplyr for Python. While Python Polars has its LazyFrame, the Tidy Finance task involves extracting data from a PostgreSQL database and Polars has no way of creating lazy data frames from database connections. The closest analogue is surely Ibis. So I added a function ibis_to_pq() to my Python package db2pq and in this note I take it for a spin.

I start by creating a connection to the WRDS PostgreSQL database. Here I follow the Tidy Finance approach closely. But so long as wrds is a psycopg-backed SQLAlchemy engine for the WRDS PostgreSQL database, you should be able to use whatever approach you are used to.

from sqlalchemy import create_engine
import os
from dotenv import load_dotenv

load_dotenv()

connection_string = (
    "postgresql+psycopg://"
    f"{os.getenv('WRDS_USER')}:{os.getenv('WRDS_PASSWORD')}"
    "@wrds-pgdata.wharton.upenn.edu:9737/wrds"
)

wrds = create_engine(connection_string, pool_pre_ping=True)

Having established wrds, I load in the packages I will be using.

import polars as pl
import ibis
from ibis import _
from db2pq import ibis_to_pq

Then I turn wrds into an Ibis connection with the WRDS PostgreSQL database as its backend. You can learn more about Ibis here.

db = ibis.postgres.from_connection(
    wrds.connect().connection.driver_connection
)

I then set up Ibis lazy tables for the three tables I will use here.1

dsf = db.table("dsf_v2", database="crsp")
stksecurityinfohist = db.table("stksecurityinfohist", database="crsp")
factors_daily = db.table("factors_daily", database="ff")

Ibis offers an “interactive” option that makes it easier to work with lazy tables.

ibis.options.interactive = True

Let’s look at factors_daily:

factors_daily
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ date       ┃ mktrf         ┃ smb           ┃ hml           ┃ rf            ┃ umd           ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ date       │ decimal(8, 6) │ decimal(8, 6) │ decimal(8, 6) │ decimal(7, 5) │ decimal(8, 6) │
├────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ 1926-07-01 │      0.000900 │     -0.002500 │     -0.002700 │       0.00010 │          NULL │
│ 1926-07-02 │      0.004500 │     -0.003300 │     -0.000600 │       0.00010 │          NULL │
│ 1926-07-06 │      0.001700 │      0.003000 │     -0.003900 │       0.00010 │          NULL │
│ 1926-07-07 │      0.000900 │     -0.005800 │      0.000200 │       0.00010 │          NULL │
│ 1926-07-08 │      0.002200 │     -0.003800 │      0.001900 │       0.00010 │          NULL │
│ 1926-07-09 │     -0.007100 │      0.004300 │      0.005700 │       0.00010 │          NULL │
│ 1926-07-10 │      0.006100 │     -0.005300 │     -0.001000 │       0.00010 │          NULL │
│ 1926-07-12 │      0.000400 │     -0.000300 │      0.006400 │       0.00010 │          NULL │
│ 1926-07-13 │      0.004800 │     -0.002800 │     -0.002000 │       0.00010 │          NULL │
│ 1926-07-14 │      0.000400 │      0.000700 │     -0.004300 │       0.00010 │          NULL │
│ …          │             … │             … │             … │             … │             … │
└────────────┴───────────────┴───────────────┴───────────────┴───────────────┴───────────────┘

And then dsf:

dsf
[Very wide output omitted: dsf_v2 has 50 columns, including identifiers (permno, permco, hdrcusip, cusip, ticker), security descriptors (sharetype, securitytype, securitysubtype, issuertype, primaryexch, conditionaltype, tradingstatusflg, …), and daily price, return, and volume fields (dlycaldt, dlyprc, dlycap, dlyret, dlyretx, dlyreti, dlyvol, dlyclose, dlylow, dlyhigh, dlybid, dlyask, dlyopen, shrout, …). The preview shows daily rows for permno 10000 beginning 1986-01-07.]

I define the start and end dates for the sample:

start_date = "1960-01-01"
end_date = "2024-12-31"

For clarity, in translating the original Tidy Finance query, I break out the security sub-query from the main query.

security = (
    stksecurityinfohist
    .filter(
        _.sharetype == "NS",
        _.securitytype == "EQTY",
        _.securitysubtype == "COM",
        _.usincflg == "Y",
        _.issuertype.isin(["ACOR", "CORP"]),
        _.primaryexch.isin(["N", "A", "Q"]),
        _.conditionaltype.isin(["RW", "NW"]),
        _.tradingstatusflg == "A",
    ).select(
        "permno",
        "secinfostartdt",
        "secinfoenddt",
    )
)

I next construct the main crsp_daily query.

crsp_daily = (
    dsf
    .filter(_.dlycaldt.between(start_date, end_date))
    .inner_join(security,
                dsf.permno == security.permno)
    .filter(_.dlycaldt.between(_.secinfostartdt, _.secinfoenddt))
    .select(
        permno=_.permno,
        date=_.dlycaldt,
        ret=_.dlyret,
    )
    .drop_null(["permno", "date", "ret"])
    .left_join(
        factors_daily
        .select("date", "rf")
        .rename(risk_free="rf"),
        _.date == factors_daily.date,
    )
    .mutate(
        ret_excess=ibis.greatest(_.ret - _.risk_free, -1)
    )
    .select(
        "permno", "date", 
        ret=_.ret.cast("float64"),
        ret_excess=_.ret_excess.cast("float64"),
    )
)

Let’s take a peek at crsp_daily, which is still a lazy table:

crsp_daily
┏━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ permno ┃ date       ┃ ret       ┃ ret_excess ┃
┡━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ int32  │ date       │ float64   │ float64    │
├────────┼────────────┼───────────┼────────────┤
│  10000 │ 1986-01-08 │ -0.024390 │  -0.024690 │
│  10000 │ 1986-01-09 │  0.000000 │  -0.000300 │
│  10000 │ 1986-01-10 │  0.000000 │  -0.000300 │
│  10000 │ 1986-01-13 │  0.050000 │   0.049700 │
│  10000 │ 1986-01-14 │  0.047619 │   0.047319 │
│  10000 │ 1986-01-15 │  0.045455 │   0.045155 │
│  10000 │ 1986-01-16 │  0.043478 │   0.043178 │
│  10000 │ 1986-01-17 │  0.000000 │  -0.000300 │
│  10000 │ 1986-01-20 │  0.000000 │  -0.000300 │
│  10000 │ 1986-01-21 │  0.000000 │  -0.000300 │
│ …      │ …          │ …         │ …          │
└────────┴────────────┴───────────┴────────────┘

The next step is to create a Parquet file from crsp_daily using ibis_to_pq().

%%time
ibis_to_pq(crsp_daily, "crsp_daily.parquet")
CPU times: user 2min 33s, sys: 12.7 s, total: 2min 46s
Wall time: 4min 28s
'crsp_daily.parquet'

For some reason, ibis_to_pq() does not perform quite as well as its sibling lazy_tbl_to_pq() in the R version of db2pq. Converting the much smaller factors_daily table is quick:

%%time
ibis_to_pq(factors_daily, "factors_ff3_daily.parquet")
CPU times: user 350 ms, sys: 25.3 ms, total: 376 ms
Wall time: 860 ms
'factors_ff3_daily.parquet'

1.1 Using the data

I figured that it would be a little dull to do no more than simply create the Parquet files. But looking through Tidy Finance with Python, the only use of crsp_daily seemed to be in the section Estimating Beta Using Daily Returns.

There Tidy Finance say:

We then create a connection to the daily CRSP data, but we don’t load the whole table into our memory. We only extract all distinct permno because we loop the beta estimation over batches of stocks with size 500. To estimate the CAPM over a consistent lookback window while accommodating different return frequencies, we adjust the minimum required number of observations accordingly. Specifically, we require at least 1,000 daily returns over a five‑year period for a valid estimation. This threshold is consistent with the monthly requirement of 48 observations out of 60 months, given that there are roughly 252 trading days in a year.

The Tidy Finance code uses pandas, but I will use Python Polars, which does much better with larger data sets. I start by creating LazyFrame instances of the two tables that are used in calculating betas.

crsp_daily = pl.scan_parquet("crsp_daily.parquet")
factors_ff3_daily = pl.scan_parquet("factors_ff3_daily.parquet")

The next step is to implement the approach described above. Python Polars can handle the details of creating the windows and so on, but we don’t really want to have to hand over all the data to Statsmodels to do the regressions.2 But if you’re reading Tidy Finance with Python, it seems reasonable to assume that you know how to calculate a univariate regression coefficient, so let’s just use Polars expressions and do it by hand.

Unfortunately, it turns out that we will be calculating about 2.5 million betas over windows of up to about 1,250 days each and that will be a lot of data even for Polars!3

So I ended up using batches of 500, just as Tidy Finance do. The trick is to keep using rolling windows, but to define them over month labels rather than over raw daily dates. Each daily observation is assigned to its month, and the rolling window then spans the current month and the prior 59 months. This gives a 60-month lookback while still ensuring that a stock-month appears in the output only if the stock actually traded in that month.

Tidy Finance do not provide any indication of how long their code takes to run, but let’s see how long my version takes:

%%time
min_obs = 1000
batch_size = 500

def get_betas_batch(permno_batch: pl.DataFrame) -> pl.LazyFrame:
    daily = (
        crsp_daily
        .join(permno_batch.lazy(), on="permno", how="semi")
        .select("permno", "date", "ret_excess")
        .join(
            factors_ff3_daily.select(
                "date",
                pl.col("mktrf").cast(pl.Float64),
                pl.col("rf").cast(pl.Float64),
            ),
            on="date",
            how="inner",
        )
        .select(
            "permno",
            "date",
            "ret_excess",
            mkt_excess=pl.col("mktrf") - pl.col("rf"),
        )
        .with_columns(month=pl.col("date").dt.truncate("1mo"))
    )

    firm_months = (
        daily
        .select("permno", "month")
        .unique()
    )

    return (
        daily
        .sort("permno", "month", "date")
        .group_by_dynamic(
            "month",
            every="1mo",
            period="60mo",
            offset="-59mo",
            group_by="permno",
            closed="left",
            label="right",
        )
        .agg(
            beta=pl.cov("mkt_excess", "ret_excess") / pl.col("mkt_excess").var(),
            mean_ret=pl.col("ret_excess").mean(),
            mean_mkt=pl.col("mkt_excess").mean(),
            n=pl.len(),
        )
        .with_columns(month=pl.col("month").dt.offset_by("-1mo"))
        .join(firm_months, on=["permno", "month"], how="semi")
        .filter(
            pl.col("n") >= min_obs,
        )
        .with_columns(
            date=pl.col("month"),
            alpha=pl.col("mean_ret") - pl.col("beta") * pl.col("mean_mkt")
        )
        .select("permno", "date", "beta", "n", "alpha")
        .sort("permno", "date")
    )

permnos = (
    crsp_daily
    .select("permno")
    .unique()
    .collect()
)

betas = pl.concat(
    (
        get_betas_batch(permnos.slice(i, batch_size)).collect()
        for i in range(0, permnos.height, batch_size)
    ),
    rechunk=True,
).sort("permno", "date")
CPU times: user 33.6 s, sys: 19.1 s, total: 52.7 s
Wall time: 22 s

So a bit over 20 seconds for the whole lot.

And let’s take a peek at the data:

betas
shape: (2_354_575, 5)
┌────────┬────────────┬──────────┬──────┬──────────┐
│ permno ┆ date       ┆ beta     ┆ n    ┆ alpha    │
│ ---    ┆ ---        ┆ ---      ┆ ---  ┆ ---      │
│ i32    ┆ date       ┆ f64      ┆ u32  ┆ f64      │
╞════════╪════════════╪══════════╪══════╪══════════╡
│ 10001  ┆ 1989-12-01 ┆ 0.091415 ┆ 1005 ┆ 0.001083 │
│ 10001  ┆ 1990-01-01 ┆ 0.09536  ┆ 1027 ┆ 0.001044 │
│ 10001  ┆ 1990-02-01 ┆ 0.098356 ┆ 1046 ┆ 0.001014 │
│ 10001  ┆ 1990-03-01 ┆ 0.09657  ┆ 1068 ┆ 0.001004 │
│ 10001  ┆ 1990-04-01 ┆ 0.094286 ┆ 1088 ┆ 0.000986 │
│ …      ┆ …          ┆ …        ┆ …    ┆ …        │
│ 93436  ┆ 2024-08-01 ┆ 1.592899 ┆ 1258 ┆ 0.002095 │
│ 93436  ┆ 2024-09-01 ┆ 1.595383 ┆ 1258 ┆ 0.002201 │
│ 93436  ┆ 2024-10-01 ┆ 1.605146 ┆ 1258 ┆ 0.001993 │
│ 93436  ┆ 2024-11-01 ┆ 1.62253  ┆ 1258 ┆ 0.002191 │
│ 93436  ┆ 2024-12-01 ┆ 1.631224 ┆ 1258 ┆ 0.00221  │
└────────┴────────────┴──────────┴──────┴──────────┘

1.2 The same analysis with Ibis and DuckDB

An obvious alternative is to let DuckDB do the rolling-window work directly. Ibis makes this straightforward: we can connect to its default DuckDB backend, query the Parquet files in place, and then get the result as a Polars data frame.

Here I use the same idea as above. I label each daily row with its month, define rolling windows over those month labels, and then keep only the first daily row for each permno/month. The window spans the current month and the prior 59 months, which seems closest to the pandas implementation.

duck = ibis.duckdb.connect()

%%time
min_obs = 1000

crsp_daily = duck.read_parquet(
    "crsp_daily.parquet",
    table_name="crsp_daily",
)

factors_ff3_daily = duck.read_parquet(
    "factors_ff3_daily.parquet",
    table_name="factors_ff3_daily",
)

daily = (
    crsp_daily
    .select("permno", "date", "ret_excess")
    .join(
        factors_ff3_daily.select(
            "date",
            mkt_excess=(
                _["mktrf"].cast("float64") - _["rf"].cast("float64")
            ),
        ),
        "date",
    )
    .mutate(month=_.date.truncate("M"))
)

w_beta = ibis.trailing_range_window(
    ibis.interval(months=59),
    group_by=_.permno,
    order_by=_.month,
)

w_month = ibis.window(
    group_by=[_.permno, _.month]
)

betas_duckdb = (
    daily
    .mutate(
        beta=(
            _.mkt_excess.cov(_.ret_excess, how="sample").over(w_beta)
            / _.mkt_excess.var(how="sample").over(w_beta)
        ),
        mean_ret=_.ret_excess.mean().over(w_beta),
        mean_mkt=_.mkt_excess.mean().over(w_beta),
        n=_.permno.count().over(w_beta),
        first_month_date=_.date.min().over(w_month),
    )
    .filter(_.date == _.first_month_date, _.n >= min_obs)
    .select(
        "permno",
        "month",
        "beta",
        "n",
        alpha=_.mean_ret - _.beta * _.mean_mkt,
    )
    .rename(date="month")
    .order_by("permno", "date")
    .to_polars()
)
CPU times: user 2min 37s, sys: 6.27 s, total: 2min 43s
Wall time: 15 s

And here are the resulting betas:

betas_duckdb
shape: (2_354_575, 5)
┌────────┬────────────┬──────────┬──────┬──────────┐
│ permno ┆ date       ┆ beta     ┆ n    ┆ alpha    │
│ ---    ┆ ---        ┆ ---      ┆ ---  ┆ ---      │
│ i32    ┆ date       ┆ f64      ┆ i64  ┆ f64      │
╞════════╪════════════╪══════════╪══════╪══════════╡
│ 10001  ┆ 1989-12-01 ┆ 0.091415 ┆ 1005 ┆ 0.001083 │
│ 10001  ┆ 1990-01-01 ┆ 0.09536  ┆ 1027 ┆ 0.001044 │
│ 10001  ┆ 1990-02-01 ┆ 0.098356 ┆ 1046 ┆ 0.001014 │
│ 10001  ┆ 1990-03-01 ┆ 0.09657  ┆ 1068 ┆ 0.001004 │
│ 10001  ┆ 1990-04-01 ┆ 0.094286 ┆ 1088 ┆ 0.000986 │
│ …      ┆ …          ┆ …        ┆ …    ┆ …        │
│ 93436  ┆ 2024-08-01 ┆ 1.592899 ┆ 1258 ┆ 0.002095 │
│ 93436  ┆ 2024-09-01 ┆ 1.595383 ┆ 1258 ┆ 0.002201 │
│ 93436  ┆ 2024-10-01 ┆ 1.605146 ┆ 1258 ┆ 0.001993 │
│ 93436  ┆ 2024-11-01 ┆ 1.62253  ┆ 1258 ┆ 0.002191 │
│ 93436  ┆ 2024-12-01 ┆ 1.631224 ┆ 1258 ┆ 0.00221  │
└────────┴────────────┴──────────┴──────┴──────────┘

Footnotes

  1. The R version of the note did not use ff.factors_daily from WRDS, but I do so here because I don’t have the same copy_inline() functionality I had when using R in yesterday’s note.↩︎

  2. The original pandas code uses smf.ols() from Statsmodels.↩︎

  3. There will be a lot of overlap in these windows. I had thought that Polars would simply process each window as it passed through the data for each permno, but it seems it literally stacks copies of the data for each window, and I quickly ran out of RAM when I tried this.↩︎