Appendix D — Making a parquet repository

Throughout the book, we have relied on the WRDS PostgreSQL server as our primary source of larger data sets, primarily from CRSP and Compustat. However, there are downsides to relying on the WRDS PostgreSQL server to this degree. One (minor) downside is that one needs to be online to get data. Another downside is that we quickly run into limits when we have large amounts of non-WRDS data that need to be combined with WRDS data for analysis. Finally, many researchers would prefer to have greater clarity on the underlying data sets than is provided by an approach that relies on the ever-changing data sets on WRDS.

One option would be to use a user-created PostgreSQL server with WRDS data. While this has the merit of using the same underlying approach used by WRDS and the rock-solid backbone provided by PostgreSQL, it does require the user to invest in understanding a technology at a deeper level than many researchers would be comfortable with.

In Chapter 23, we introduced an approach that used parquet data files as a foundation for analysis. This approach is likely to be more transparent to many researchers accustomed to storing data in individual data files. This may lead many researchers to ask if such an approach could be extended to handle data from CRSP and Compustat more generally.

The answer is “yes” and this appendix outlines some options for researchers interested in pursuing such an approach.

The code in this appendix uses the following packages. For instructions on how to set up your computer to use this code, go to the support page for this book.

import time
import polars as pl
from plotnine_polars import aes
from era_pl import load_parquet

D.1 Data management approaches

Once upon a time, research computing was largely done on shared computers managed by others. Data of the kind provided by CRSP and Compustat would have resided on these shared computers and often researchers would have had limited allocations for data storage. For example, many academics would have relied on the servers provided by WRDS to conduct research. If WRDS updated Compustat or CRSP, then the results from running a researcher’s code might change.

When computing power and storage became greater and cheaper, many researchers came to conduct research on their personal computers. Today, it is quite feasible to maintain local copies of major CRSP and Compustat data tables on the hard drive of one’s laptop. In fact, one could keep multiple copies, say one per project.

D.2 Organizing data

For brevity, we assume you have elected to maintain a core repository of data files shared across projects (e.g., Compustat, CRSP) as well as project-specific data files. The shared data might also include data from non-WRDS sources, such as conference call files from StreetEvents or FactSet or data obtained from SEC EDGAR. The project-specific data files might be hand-collected or scraped data relevant to a specific project, as well as the results of processing other data for your project.

This approach to data organization is easily implemented using schemas in a relational database, such as PostgreSQL or MariaDB. An obvious choice for schema names for WRDS-derived data is the name of the corresponding schema in the WRDS PostgreSQL database (e.g., crsp for CRSP data and comp for Compustat data).

But if we want to maintain a file system–based approach rather than setting up a relational database, we might choose to organize our data files into directories. With such an approach, we could mimic the schemas discussed above, we could put tables from CRSP in crsp, tables from Compustat in comp, and so on. If we wanted a project-specific version of crsp.dsf for a project stored in mega_paper, we could just put that in (say) a crsp directory in the mega_paper directory.

Having elected to choose a data-file approach to storing data, a natural question arises about the format to use. Many Stata users would use Stata data files (.dta). R users might use RDS files (.rds). SAS users typically keep SAS data files (.sas7bdat). However, .dta and .rds files are a bad choice for large files.¹ Loading these files into Stata or R means reading the whole file off disk into RAM. This takes time and uses up memory. Storing crsp.dsf as dsf.dta or dsf.rds is likely to result in a file that is several gigabytes on disk and even more in memory.

The approach in this chapter uses parquet files, which we first saw in Chapter 23. Parquet files have a rich type system and are created in a way that makes it easier to get just the data we need. Additionally, parquet files are column-oriented which typically leads to superior performance for data analysis than formats that are row-oriented, such as CSV files or database tables. Finally, parquet files are typically much smaller than the equivalent CSV files.

Relative to text files or most package-specific storage formats, parquet files offer most of the benefits of relational databases:

Fast random access to on-disk data.
Data accessible by any software package.²
Data can be properly typed.
Textual data are handled with ease.

A couple of potential advantages of a PostgreSQL server are lost with this approach. First, it’s slightly more difficult to centralize computing in the way that can often be achieved with a database server. Second, data are not “accessible from anywhere” in quite the same way. However, parquet files offer benefits without requiring a database server to be set up, a complex step for many users.

D.3 Canonical WRDS data

There are two underlying sources of WRDS data:

The SAS files on the WRDS cloud server
The WRDS PostgreSQL database

For many years (or even decades), the canonical WRDS data were those in SAS files on the WRDS servers. These SAS files formed the basis of the web queries familiar to many researchers.³

In recent years, WRDS has offered a PostgreSQL server that provides a more platform-neutral offering for researchers using more modern languages, such as Python or R. While the PostgreSQL server was initially constructed from the SAS data files, it seems possible that some data are ingested by the PostgreSQL server directly without an intermediate step using SAS data formats.

WRDS highlights support for four languages: R, Python, Stata, and SAS. For R, Python, and Stata, it is assumed that users will access WRDS data via the PostgreSQL server. Only for SAS is it assumed that the SAS data files will be used directly.

Initially, the SAS data files appeared to be the canonical data sources. Some data sets were not covered by the PostgreSQL database (e.g., TAQ data). Other data sets were in the PostgreSQL server, but with some data loss. For example, non-ASCII text in SAS data files was lost in early versions of the PostgreSQL database; this appears no longer to be the case.

At this stage, it seems that both the WRDS SAS data files and the WRDS PostgreSQL database can be viewed as canonical sources for WRDS data. Nonetheless there are minor differences between these data sources. First, the underlying schema names can differ in some cases.

Second, there may be differences in the underlying data types. SAS has only two underlying data types and is therefore much less strict about what data goes into each column. The more specific data types in SAS are actually implemented as presentations of the underlying data. Nonetheless some data may be presented as one type in SAS and another type in PostgreSQL. For example, permno is presented as a whole number in SAS, but as a floating-point value in PostgreSQL. Practically, this makes little difference.⁴

One data source that is perhaps incorrectly viewed as a canonical data source is the web query interface offered by WRDS. The web query interface leaves much to be desired from a reproducibility perspective as manual steps are needed to extract data and often it is necessary to eliminate variables or periods to manage the size of the resulting files. Additionally, the relation between the data returned by some queries and the underlying tables in SAS or PostgreSQL is sometimes unclear. Some web queries appear to join data from multiple underlying tables in ways that are not always transparent.

D.4 Working with larger parquet files

The power of parquet files becomes even more apparent when working with larger data sets, such as crsp.dsf. Here we assume that you have a copy of crsp.dsf in your repository, perhaps as a result of running code like the following Python code. The resulting data file will be over 3 GB (this is much compressed relative to the original SAS or PostgreSQL data).

from db2pq import wrds_update_pq
wrds_update_pq("dsf", "crsp")

The following code does a number of analytical tasks. First, we count the number of rows in the table. Second, we count the number of rows by year. Third, we extract the data for a particular date. Next, we identify the PERMNO for Apple, then use that to plot cumulative returns for Apple over time.

Completing all these tasks takes less than a second!

now = time.time()

dsf = load_parquet("dsf", "crsp")
stocknames = load_parquet("stocknames", "crsp")

int(dsf.select(pl.len()).collect().item())

(
    dsf
    .with_columns(year=pl.col("date").dt.year())
    .group_by("year")
    .agg(n=pl.len())
    .sort("year", descending=True)
    .collect()
)

dsf_subset = (
    dsf
    .filter(pl.col("date") == pl.date(1986, 1, 7))
    .collect()
)

apple_permno = (
    stocknames
    .filter(pl.col("comnam").str.contains("^APPLE COM"))
    .select("permno")
    .collect()
    .item()
)

(
    dsf
    .filter(pl.col("permno") == apple_permno)
    .sort("date")
    .with_columns(cumret=(1 + pl.col("ret").fill_null(0)).cum_prod())
    .select("date", "cumret")
    .collect()
    .ggplot(aes(x="date", y="cumret"))
    .geom_line()
)

time.time() - now

0.6779248714447021

SAS works in a way that makes SAS data files a reasonable choice.↩︎
This is almost true for parquet files, but not quite. Stata cannot really work with on-disk data and has no native support for parquet data.↩︎
It seems plausible that these web queries still draw on SAS data files to this day.↩︎
SAS stores the data as floating-point values in any case.↩︎