Creating a Parquet data repository

library(db2pq)

This article outlines some ideas related to the task that db2pq is designed to support: building and maintaining a local Parquet repository of data from WRDS and other sources.

Some concepts in data management

Scope

Many datasets are project-specific datasets, meaning that only have use within a single project (e.g., paper). Examples of project-specific would include experimental data generated in a particular study.

Other datasets are general-purpose datasets, meaning that they contain data that might be relevant to many studies. Classic examples in a business-school context would be the US stock price files offered by the Center for Research in Security Prices, LLC (CRSP) or financial statement data provided by Compustat, or economic time-series data provided by various statistical offices around the world.

Other datasets are project-level datasets, meaning that the particular data sets are somehow frozen for a particular project, even though the nature of the data otherwise puts them in the category of general-purpose datasets. For example, I might want to fix on a particular version of comp.g_funda, Compustat’s global dataset for annual financial statements for my project, even though this dataset has relevance beyond a specific project.¹

One reason for having project-level arises in the context of reproducibility. If I have published a paper, then the replication package for that paper should ideally contain the data used to produce the exact results in the paper. For this purpose, if the paper used comp.g_funda data, then the ideal replication package would include the precise project-level version of that data set used to produce the paper. Of course, in reality, one cannot simply post the project-level version of comp.g_funda as part of a public replication package. Nonetheless, the authors themselves should have a project-level version of the dataset that they retain. This much aligns with the views of Welch (2019), who suggests that “the author should keep a private copy of the full data set with which the results were obtained.”

As far as WRDS data are concerned, the db2pq package aims to facilitate managing WRDS data either as general-purpose data or as project-level data.² The db2pq package can leverage the environment variable DATA_DIR to identify the root directory for the data repository in question. By default, new WRDS data will go in the matching schema (i.e., subdirectory) of the directory indicated by DATA_DIR.

So, depending on how you want to manage data for a given project, you may set DATA_DIR to a project-level directory or to a user-level directory.

Storage formats

While there are many storage formats available for data, I think a strong case can be made for Parquet being the default storage format for many users. If you use R or Python, I think the case is easy to make. Many software packages can read Parquet data and some of them (e.g., the arrow package or DuckDB) will absolutely fly with Parquet data.

I believe that recent editions of Stata can read Parquet files, though the way Stata operates means that Stata users are unlikely to see the performance benefits Parquet offers.³ SAS users might find the case for Parquet less compelling, though there are probably benefits in moving to a storage medium that is more compact, less proprietary, and more likely to be supported in a few years’ time.

Of course, an alternative to using Parquet files would be using a database, such as PostgreSQL. I think such systems have a lot of merit (and I have used PostgreSQL to store WRDS data since 2011), but I think they are more complicated for most users’ needs and their benefits (e.g., shared access to data and rock-solid assurance) are less meaningful for most.

Another alternative is the CSV file, perhaps compressed. I think if one were sending data on the next Voyager mission, then CSV might be the chosen format.⁴ Or if you really, really wanted data novices to inspect your data in Excel or Word, then CSV might be the go-to option. Or perhaps you want to put your data in a written form in a book for users to type in. For any other purpose with serious data needs, I think Parquet dominates.

One issue with CSV is that one is always dealing with type inference (string, integer, timestamp) and I think that type inference is one of those problems you want to solve once for any given dataset. For the WRDS data that is the focus of this note, I think CSV is to be avoided.

In any case, db2pq is focused on creating and managing a Parquet data repository.

Setting `DATA_DIR`

A key idea of the approach used by db2pq is that the Parquet data repository has a root directory under which the data are organized into schemas. It is perhaps easiest to refer to this root directory as DATA_DIR. When the environment variable DATA_DIR is set, then db2pq function will organize data under that directory. Similarly, functions such as load_parquet() from the farr package will look for data in schemas under that directory. In this way, the same code can work with a repository in different folders on different machines.

First-Time Setup

If you run a function like wrds_update_pq() without setting DATA_DIR, you will be asked to select a directory. You will then be given three choices for persisting the directory you have chosen:

a project-level DATA_DIR (stored in .Renviron)
a user-level DATA_DIR (using ~/.Renviron)
no persistence

db2pq writes Parquet files into a local repository.

For example, this call will, if the file does not exist need to be updated write a Parquet file using data from the WRDS database.

library(db2pq)

wrds_update_pq("dsi", "crsp")

The data are written to the Parquet file in:

<DATA_DIR>/crsp/dsi.parquet

The standard layout is:

<DATA_DIR>/<schema>/<table>.parquet

When updates use archive = TRUE, replaced files are moved under:

<DATA_DIR>/<schema>/archive/<table>_<timestamp>.parquet

Explicit Overrides

An explicit data_dir argument will override the default for a call:

wrds_update_pq("dsi", "crsp", data_dir = "data")

WRDS to Parquet shows the refresh workflow.
Authentication covers WRDS_ID, passwords, and .pgpass.

Welch, Ivo. 2019. “Editorial: An Opinionated FAQ.” Critical Finance Review 8 (1-2): 19–24. https://doi.org/10.1561/104.00000077.