Appendix E — Making a parquet repository
Throughout the book, we have relied on the WRDS PostgreSQL server as our primary source of larger data sets, primarily from CRSP and Compustat. However, there are downsides to relying on the WRDS PostgreSQL server to this degree. One (minor) downside is that one needs to be online to get data. Another downside is that we quickly run into limits when we have large amounts of non-WRDS data that need to be combined with WRDS data for analysis. Finally, many researchers would prefer to have greater clarity on the underlying data sets than is provided by an approach that relies on the ever-changing data sets on WRDS.
In Appendix D, we considered an approach that uses a user-created PostgreSQL server. While this has the merit of using the same underlying approach used by WRDS and the rock-solid backbone provided by PostgreSQL, it does require the user to invest in understanding a technology at a deeper level than many researchers would be comfortable with.
In Chapter 23, we introduced an approach that used parquet data files as a foundation for analysis. This approach is likely to be more transparent to many researchers accustomed to storing data in individual data files. This may lead many researchers to ask if such an approach could be extended to handle data from CRSP and Compustat more generally.
The answer is “yes” and this appendix outlines some options for researchers interested in pursuing such an approach.
The code in this appendix uses the following packages. For instructions on how to set up your computer to use the code found in this book, see Section 1.2.
E.1 Data management approaches
Once upon a time, research computing was largely done on shared computers managed by others. Data of the kind provided by CRSP and Compustat would have resided on these shared computers and often researchers would have had limited allocations for data storage. For example, many academics would have relied on the servers provided by WRDS to conduct research. If WRDS updated Compustat or CRSP, then the results from running a researcher’s code might change.
When computing power and storage became greater and cheaper, many researchers came to conduct research on their personal computers. Today, it is quite feasible to maintain local copies of major CRSP and Compustat data tables on the hard drive of one’s laptop. In fact, one could keep multiple copies, say one per project.
E.2 Organizing data
For brevity, we assume you have elected to maintain a core repository of data files shared across projects (e.g., Compustat, CRSP) as well as project-specific data files. The shared data might also include data from non-WRDS sources, such as conference call files from StreetEvents or FactSet or data obtained from SEC EDGAR. The project-specific data files might be hand-collected or scraped data relevant to a specific project, as well as the results of processing other data for your project.
This approach to data organization is easily implemented using schemas in a relational database, such as PostgreSQL or MariaDB. An obvious choice for schema names for WRDS-derived data is the name of the corresponding schema in the WRDS PostgreSQL database (e.g., crsp
for CRSP data and comp
for Compustat data).
But if we want to maintain a file system–based approach rather than setting up a relational database, we might choose to organize our data files into directories. With such an approach, we could mimic the schemas discussed above, we could put tables from CRSP in crsp
, tables from Compustat in comp
, and so on. If we wanted a project-specific version of crsp.dsf
for a project stored in mega_paper
, we could just put that in (say) a crsp
directory in the mega_paper
directory.
Having elected to choose a data-file approach to storing data, a natural question arises about the format to use. Many Stata users would use Stata data files (.dta
). R users might use RDS files (.rds
). SAS users typically keep SAS data files (.sas7bdat
). However, .dta
and .rds
files are a bad choice for large files.1 Loading these files into Stata or R means reading the whole file off disk into RAM. This takes time and uses up memory. Storing crsp.dsf
as dsf.dta
or dsf.rds
is likely to result in a file that is several gigabytes on disk and even more in memory.
The approach in this chapter uses parquet files, which we first saw in Chapter 23. Parquet files have a rich type system and are created in a way that makes it easier to get just the data we need. Additionally, parquet files are column-oriented which typically leads to superior performance for data analysis than formats that are row-oriented, such as CSV files or database tables. Finally, parquet files are typically much smaller than the equivalent CSV files.
Relative to text files or most package-specific storage formats, parquet files offer most of the benefits of relational databases:
- Fast random access to on-disk data.
- Data accessible by any software package.2
- Data can be properly typed.
- Textual data are handled with ease.
A couple of potential advantages of a PostgreSQL server are lost with this approach. First, it’s slightly more difficult to centralize computing in the way that can often be achieved with a database server. Second, data are not “accessible from anywhere” in quite the same way. However, parquet files offer benefits without requiring a database server to be set up, a complex step for many users.
E.3 Canonical WRDS data
There are two underlying sources of WRDS data:
- The SAS files on the WRDS cloud server
- The WRDS PostgreSQL database
For many years (or even decades), the canonical WRDS data were those in SAS files on the WRDS servers. These SAS files formed the basis of the web queries familiar to many researchers.3
In recent years, WRDS has offered a PostgreSQL server that provides a more platform-neutral offering for researchers using more modern languages, such as Python or R. While the PostgreSQL server was initially constructed from the SAS data files, it seems possible that some data are ingested by the PostgreSQL server directly without an intermediate step using SAS data formats.
WRDS highlights support for four languages: R, Python, Stata, and SAS. For R, Python, and Stata, it is assumed that users will access WRDS data via the PostgreSQL server. Only for SAS is it assumed that the SAS data files will be used directly.
Initially, the SAS data files appeared to be the canonical data sources. Some data sets were not covered by the PostgreSQL database (e.g., TAQ data). Other data sets were in the PostgreSQL server, but with some data loss. For example, non-ASCII text in SAS data files was lost in early versions of the PostgreSQL database; this appears no longer to be the case.
At this stage, it seems that both the WRDS SAS data files and the WRDS PostgreSQL database can be viewed as canonical sources for WRDS data. Nonetheless there are minor differences between these data sources. First, the underlying schema names can differ in some cases.
Second, there may be differences in the underlying data types. As discussed in Appendix C, SAS has only two underlying data types and is therefore much less strict about what data goes into each column. The more specific data types in SAS are actually implemented as presentations of the underlying data. Nonetheless some data may be presented as one type in SAS and another type in PostgreSQL. For example, permno
is presented as a whole number in SAS, but as a floating-point value in PostgreSQL. Practically, this makes little difference.4
One data source that is perhaps incorrectly viewed as a canonical data source is the web query interface offered by WRDS. The web query interface leaves much to be desired from a reproducibility perspective as manual steps are needed to extract data and often it is necessary to eliminate variables or periods to manage the size of the resulting files. Additionally, the relation between the data returned by some queries and the underlying tables in SAS or PostgreSQL is sometimes unclear. Some web queries appear to join data from multiple underlying tables in ways that are not always transparent.
E.4 Converting WRDS data to parquet
In this section, we discuss three approaches that readers could use to create their own parquet-based repositories of WRDS data. To keep the repository we create organized, we put it under a single directory, which we flag using the DATA_DIR
environment variable. As we need to interact with WRDS, we need to communicate our WRDS ID to the code, which we can also do using an environment variable.
Sys.setenv(DATA_DIR = "~/Dropbox/pq_data/",
WRDS_ID = "iangow")
E.4.1 Approach 1: Get WRDS PostgreSQL data using DuckDB
One approach to getting WRDS data is to connect to the PostgreSQL server much as we have done throughout the book. Having made this connection, we can easily copy the data to a local DuckDB database, and from there, it is easy to write the data to parquet files on our local storage.
For the reasons discussed in Section 6.1, we use environment variables to give R the information it needs to connect to the WRDS PostgreSQL server.
Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
PGDATABASE = "wrds",
PGUSER = "iangow",
PGPORT = 9737)
We can now connect to the WRDS PostgreSQL server (pg
here) and to a DuckDB instance (db
). If we want to copy the crsp.msi
table from WRDS to our local machine, we simply create msi
, a remote data table, then copy this to db
using the copy_to()
function provided by dplyr
. We name the table (name = "msi"
) so that we can refer to it in subsequent queries. Then, we save the data to a file named msi.parquet
(DuckDB will infer that we want a parquet file from the .parquet
file extension). Finally, we disconnect from both databases (pg
and db
).
pg <- dbConnect(RPostgres::Postgres())
db <- dbConnect(duckdb::duckdb())
msi <- tbl(pg, Id(schema = "crsp", table = "msi"))
copy_to(db, df = msi, name = "msi", overwrite = TRUE)
if (!dir.exists(file.path(Sys.getenv("DATA_DIR"), "crsp"))) {
dir.create(file.path(Sys.getenv("DATA_DIR"), "crsp"))
}
file_name <- file.path(Sys.getenv("DATA_DIR"), "crsp/msi.parquet")
dbExecute(db, str_c("COPY msi TO '", file_name, "'"))
[1] 1177
dbDisconnect(pg)
dbDisconnect(db)
This basic approach is taken by the pg_to_parquet()
function provided as part of the farr
package. As we have the PostgreSQL connection details set up for WRDS and also have the DATA_DIR
environment variable set up, we can issue a single line to get a local copy of crsp.msi
.
pg_to_parquet("msi", "crsp")
The main downside of this approach is that it is very memory-intensive with large tables.
E.4.2 Approach 2: Get WRDS PostgreSQL data using Python
In Appendix D, we saw how we can use the wrds2pg
Python library to import WRDS SAS data into a local PostgreSQL server with the wrds_update()
function. The db2pq
Python library offers wrds_update_pq()
, a function modelled on wrds_update()
that creates parquet files using data stored in the WRDS PostgreSQL database. To install the required version of db2pq
, use the following command from the command line:
pip install db2pq --upgrade
Assuming we have the necessary environment variables set up (e.g., WRDS_ID
and DATA_DIR
), creating a parquet file is straightforward. (Note that the following lines are Python code.)
from db2pq import wrds_update_pq
"msi", "crsp") wrds_update_pq(
One benefit of this approach is that we can specify variable types. Additionally, it places much less of a burden than the approach using pg_to_parquet()
above. The main downside is the need to use Python. A script to obtain all tables needed to run the code in this book using this approach is provided in Section E.6.1.
E.4.3 Approach 3: Get WRDS SAS data using Python
The wrds2pg
Python library also offers wrds_update_pq()
, a function modelled after wrds_update()
that uses SAS code to generate data that are saved as a parquet file. To install the required version of wrds2pg
, use the following command from the command line.5
pip install wrds2pg --upgrade
Assuming we have the necessary environment variables set up (e.g., WRDS_ID
and DATA_DIR
), creating a parquet file from a WRDS SAS file is straightforward. (Note that the following lines are Python code.)
from wrds2pg import wrds_update_pq
"msi", "crsp") wrds_update_pq(
If you don’t have the DATA_DIR
environment variable set, you can supply it to the wrds_update_pq()
function directly.
"msi", "crsp", data_dir = "~/Dropbox/pq_data/") wrds_update_pq(
One benefit of this approach is that we can leverage the features of wrds2pg
, such as the ability to specify variable types, rename and drop variables, and so on. Additionally, importing from WRDS SAS data is fast and often results in better typed data than that provided by the WRDS PostgreSQL database. Finally, while wrds_update_pq()
does require a significant amount of RAM, it places less of a burden than the R approach using pg_to_parquet()
above. The main downside is the need to use Python. A script to obtain all tables needed to run the code in this book using this approach is provided in Section E.6.2.
E.5 Working with parquet files
To illustrate how we can work with these parquet files, we retrace some of the steps we took in Chapter 21. We will assume that you have parquet versions of the three data tables in your repository, perhaps as a result of running one of the three approaches to getting the data listed here.
Downloading comp.funda
will take a few minutes using any of the three approaches below.
E.5.1 Getting the data
E.5.1.1 Approach 1: Get WRDS PostgreSQL data using DuckDB
While this approach is simpler than the other two, it will take longer and it requires more RAM. It is not recommended for large files.
pg_to_parquet("seg_customer", "compseg")
pg_to_parquet("names_seg", "compseg")
pg_to_parquet("funda", "comp")
E.5.1.2 Approach 2: Get WRDS PostgreSQL data using Python
This is the most robust and fastest approach. Also, it uses less RAM than either of the other two approaches.
from db2pq import wrds_update_pq
"seg_customer", "compseg")
wrds_update_pq("names_seg", "compseg")
wrds_update_pq("funda", "comp") wrds_update_pq(
E.5.1.3 Approach 3: Get WRDS SAS data using Python
As the comp.funda
SAS data file includes special missing value codes that are exported by SAS as text values, we need to use fix_missing = True
to convert these to missing values.
from wrds2pg import wrds_update_pq
"seg_customer", "compseg")
wrds_update_pq("names_seg", "compseg")
wrds_update_pq("funda", "comp", fix_missing = True) wrds_update_pq(
E.5.2 Replicating analysis using parquet files
Assuming you now have the three data tables, we can proceed with some analysis from Chapter 21. We start by creating an in-memory DuckDB database.
We next use load_parquet()
from the farr
package to load the tables we created above into our DuckDB database, db
. By default, load_parquet()
will use the environment variable DATA_DIR
to locate the repository of parquet files. If DATA_DIR
has not been set, we set it now.
Sys.setenv(DATA_DIR = "~/Dropbox/pq_data/")
seg_customer <- load_parquet(db, "seg_customer", "compseg")
names_seg <- load_parquet(db, "names_seg", "compseg")
funda <- load_parquet(db, "funda", "comp")
As an alternative to setting the DATA_DIR
environment variable, you can supply the location of the repository of parquet files to the load_parquet()
function directly as the data_dir
argument.
load_parquet(db, "funda", "comp", data_dir = "~/Dropbox/pq_data/")
The next two tables come with the farr
package, but need to be copied to db
.
The next block of code is essentially unchanged from the code in Chapter 21. The two changes are:
- We do not use
collect()
, as we want the data to remain in our database as long as possible. In the original code, we needed tocollect()
data to merge withundisclosed_names
andllz_2018
. These are now all in a single database. - We add
na.rm = TRUE
in a couple of places to suppress warnings. SQL always usesna.rm = TRUE
.
customers <-
seg_customer |>
filter(ctype == "COMPANY")
disclosure_raw <-
customers |>
filter(between(datadate, "1994-01-01", "2010-12-31")) |>
left_join(undisclosed_names, by = "cnms") |>
mutate(disclosed = coalesce(disclosed, TRUE)) |>
select(gvkey, datadate, cnms, salecs, disclosed)
sales <-
funda |>
filter(indfmt == "INDL", datafmt == "STD",
consol == "C", popsrc == "D") |>
select(gvkey, datadate, sale)
prin_cust_df <-
disclosure_raw |>
inner_join(sales, by = c("gvkey", "datadate")) |>
group_by(gvkey, datadate) |>
filter(!is.na(salecs), sale > 0) |>
summarize(prin_cust = max(salecs / sale, na.rm = TRUE),
.groups = 'drop') |>
mutate(has_prin_cust = prin_cust >= 0.1)
disclosure <-
disclosure_raw |>
inner_join(sales, by = c("gvkey", "datadate")) |>
semi_join(llz_2018, by = "gvkey") |>
group_by(gvkey, datadate) |>
summarize(ratio = mean(as.double(!disclosed), na.rm = TRUE),
ratio_sale = sum(as.double(!disclosed) * salecs, na.rm = TRUE) /
sum(salecs, na.rm = TRUE),
.groups = "drop") |>
mutate(year = year(datadate))
# A tibble: 1 × 2
ratio ratio_sale
<dbl> <dbl>
1 0.459 0.464
dbDisconnect(db)
E.5.3 Working with larger parquet files
The power of parquet files becomes even more apparent when working with larger data sets, such as crsp.dsf
. Here we assume that you have a copy of crsp.dsf
in your repository, perhaps as a result of running code like the following Python code. The resulting data file will be over 3 GB (this is much compressed relative to the original SAS or PostgreSQL data).
Downloading crsp.dsf
can take 30 minutes or more depending on the speed of your connection to WRDS and the approach used.
from db2pq import wrds_update_pq
"dsf", "crsp") wrds_update_pq(
The following code does a number of analytical tasks. First, we count the number of rows in the table. Second, we count the number of rows by year. Third, we extract the data for a particular date. Next, we identify the PERMNO for Apple, then use that to plot cumulative returns for Apple over time.
Completing all these tasks takes about 3 seconds!
now <- Sys.time()
db <- dbConnect(duckdb::duckdb())
dsf <- load_parquet(db, "dsf", "crsp")
stocknames <- load_parquet(db, "stocknames", "crsp")
dsf |>
count() |>
collect()
# A tibble: 1 × 1
n
<dbl>
1 105258380
# A tibble: 99 × 2
year n
<dbl> <dbl>
1 2023 2353668
2 2022 2389874
3 2021 2187044
4 2020 1948489
5 2019 1911581
6 2018 1869102
7 2017 1827464
8 2016 1828303
9 2015 1820106
10 2014 1770410
# ℹ 89 more rows
dsf_subset <-
dsf |>
filter(date == "1986-01-07") |>
collect()
apple_permno <-
stocknames |>
filter(str_detect(comnam, "^APPLE COM")) |>
pull(permno)
dsf |>
filter(permno == apple_permno) |>
group_by(permno) |>
window_order(date) |>
mutate(cumret = exp(cumsum(log(1 + coalesce(ret, 0))))) |>
ggplot(aes(x = date, y = cumret)) +
geom_line()
Sys.time() - now
Time difference of 3.642983 secs
dbDisconnect(db)
E.6 Creating a parquet library
In this section, we describe two approaches to creating a local repository of the data used in this book in parquet format.6 Either approach requires installation of Python, but the first (db2pq
) uses the WRDS PostgreSQL database server as the data source while the second (wrds2pg
) gets the data from the SAS files provided by WRDS.
Note that the db2pq
approach does inspect the WRDS SAS data files to determine whether the current WRDS data is more recent than that in the local repository. So a benefit of the wrds2pg
approach is that it relies only on one data source (SAS data files). However, the db2pq
approach will generally be faster and also avoids the significant demands on RAM imposed by wrds2pg
in converting data to parquet format and therefore we recommend the db2pq
approach.7
E.6.1 Using db2pq
Successfully executing the following steps will result in a local repository of parquet files comprising all WRDS data used in this book.
Install Python. You may already have Python installed. If not, Miniconda is one option.
Install the
db2pq
package.
pip install db2pq --upgrade
- Set up environment variables for
db2pq
. See here for details.
export WRDS_ID="iangow"
export DATA_DIR="~/Dropbox/pq_data"
Run Python.
Within Python, use the
db2pq
module to get data. Running the following script creates a repository comprising all the data sets used in this book. (If you have already downloaded some tables, you might want to comment those out so that you don’t download them again.)
#!/usr/bin/env python3
from db2pq import wrds_update_pq
# CRSP
'ccmxpf_lnkhist', 'crsp',
wrds_update_pq(={'lpermno': 'int32',
col_types'lpermco': 'int32'})
'dsf', 'crsp',
wrds_update_pq(={'permno': 'int32',
col_types'permco': 'int32'})
'dsi', 'crsp')
wrds_update_pq('erdport1', 'crsp')
wrds_update_pq('comphist', 'crsp')
wrds_update_pq('dsedelist', 'crsp',
wrds_update_pq(={'permno': 'int32',
col_types'permco': 'int32'})
'dseexchdates', 'crsp', col_types={'permno': 'int32',
wrds_update_pq('permco': 'int32'})
'msf', 'crsp', col_types={'permno': 'int32',
wrds_update_pq('permco': 'int32'})
'msi', 'crsp')
wrds_update_pq('mse', 'crsp',
wrds_update_pq(={'permno': 'int32',
col_types'permco': 'int32'})
'stocknames', 'crsp',
wrds_update_pq(={'permno': 'int32',
col_types'permco': 'int32'})
'dsedist', 'crsp',
wrds_update_pq(={'permno': 'int32',
col_types'permco': 'int32'})
# Fama-French library
'factors_daily', 'ff')
wrds_update_pq(
# Compustat
'company', 'comp')
wrds_update_pq('funda', 'comp')
wrds_update_pq('funda_fncd', 'comp')
wrds_update_pq('fundq', 'comp')
wrds_update_pq('r_auditors', 'comp')
wrds_update_pq('idx_daily', 'comp')
wrds_update_pq('aco_pnfnda', 'comp')
wrds_update_pq(
# The segment data is in comp_segments_hist_daily in PostgreSQL,
# but in compsegd in SAS, so we need to use sas_schema to find the SAS data.
'seg_customer', 'compseg')
wrds_update_pq('names_seg', 'compseg') wrds_update_pq(
E.6.2 Using wrds2pg
Successfully executing the following steps will result in a local repository of parquet files comprising all WRDS data used in this book.
Install Python. You may already have Python installed. If not, Miniconda is one option.
Install the
wrds2pg
package.
pip install wrds2pg --upgrade
- Set up environment variables for
wrds2pg
. See here for details.
export WRDS_ID="iangow"
export DATA_DIR="~/Dropbox/pq_data"
Run Python.
Within Python, use the
wrds2pg
module to get data. Running the following script creates a repository comprising all the data sets used in this book. (If you have already downloaded some tables, you might want to comment those out so that you don’t download them again.)
#!/usr/bin/env python3
from wrds2pg import wrds_update_pq
# CRSP
'ccmxpf_lnkhist', 'crsp', fix_missing=True,
wrds_update_pq(={'lpermno': 'integer',
col_types'lpermco': 'integer'})
'dsf', 'crsp', fix_missing=True,
wrds_update_pq(={'permno': 'integer',
col_types'permco': 'integer'})
'dsi', 'crsp')
wrds_update_pq('erdport1', 'crsp', fix_missing=True)
wrds_update_pq('comphist', 'crsp', fix_missing=True)
wrds_update_pq('dsedelist', 'crsp', fix_missing=True,
wrds_update_pq(={'permno': 'integer',
col_types'permco': 'integer'})
'dseexchdates', 'crsp',
wrds_update_pq(={'permno': 'integer',
col_types'permco': 'integer'})
'msf', 'crsp', fix_missing=True,
wrds_update_pq(={'permno': 'integer',
col_types'permco': 'integer'})
'msi', 'crsp')
wrds_update_pq('mse', 'crsp', fix_missing=True,
wrds_update_pq(={'permno': 'integer',
col_types'permco': 'integer'})
'stocknames', 'crsp',
wrds_update_pq(={'permno': 'integer',
col_types'permco': 'integer'})
'dsedist', 'crsp', fix_missing=True,
wrds_update_pq(={'permno': 'integer',
col_types'permco': 'integer'})
# Fama-French library
'factors_daily', 'ff')
wrds_update_pq(
# Compustat
'company', 'comp',)
wrds_update_pq('funda', 'comp', fix_missing=True)
wrds_update_pq('funda_fncd', 'comp')
wrds_update_pq('fundq', 'comp', fix_missing=True)
wrds_update_pq('r_auditors', 'comp')
wrds_update_pq('idx_daily', 'comp')
wrds_update_pq('aco_pnfnda', 'comp')
wrds_update_pq(
# compseg
'seg_customer', 'compseg')
wrds_update_pq('names_seg', 'compseg') wrds_update_pq(
SAS works in a way that makes SAS data files a reasonable choice.↩︎
This is almost true for parquet files, but not quite. Stata cannot really work with on-disk data and has no native support for parquet data.↩︎
It seems plausible that these web queries still draw on SAS data files to this day.↩︎
SAS stores the data as floating-point values in any case.↩︎
If you installed
wrds2pg
before December 2023, then you will need to update to the current version.↩︎Above we discussed three approaches, but “Approach 1” will be problematic with larger data files. The two approaches here are “Approach 2” and “Approach 3”.↩︎
This concern with RAM does not apply to the
wrds_update()
function used in Appendix D.↩︎