Appendix E — Making a parquet repository
Throughout the book, we have relied on the WRDS PostgreSQL server as our primary source of larger data sets, especially those from CRSP and Compustat. However, there are downsides to relying on the WRDS PostgreSQL server to this degree. One (minor) downside is that one needs to be online to get data. Another is that we quickly run into limits when we have large amounts of non-WRDS data that need to be combined with WRDS data for analysis. Finally, many researchers would prefer greater clarity about the underlying data sets than is provided by an approach that relies on the ever-changing data sets on WRDS.
In Appendix D, we considered an approach that uses a user-created PostgreSQL server. While this has the merit of using the same underlying approach used by WRDS and the rock-solid backbone provided by PostgreSQL, it does require the user to invest in understanding a technology at a deeper level than many researchers would be comfortable with.
In Chapter 23, we introduced an approach that used parquet data files as a foundation for analysis. This approach is likely to be more transparent to many researchers accustomed to storing data in individual data files. This may lead many researchers to ask if such an approach could be extended to handle data from CRSP and Compustat more generally.
The answer is “yes” and this appendix outlines some options for researchers interested in pursuing such an approach.
The code in this appendix uses the following packages. For instructions on how to set up your computer to use the code found in this book, see Section 1.2.
E.1 Data management approaches
Once upon a time, research computing was largely done on shared computers managed by others. Data of the kind provided by CRSP and Compustat would have resided on these shared computers and often researchers would have had limited allocations for data storage. For example, many academics would have relied on the servers provided by WRDS to conduct research. If WRDS updated Compustat or CRSP, then the results from running a researcher’s code might change.
When computing power and storage became greater and cheaper, many researchers came to conduct research on their personal computers. Today, it is quite feasible to maintain local copies of major CRSP and Compustat data tables on the hard drive of one’s laptop. In fact, one could keep multiple copies, say one per project.
E.2 Organizing data
For brevity, we assume you have elected to maintain a core repository of data files shared across projects (e.g., Compustat, CRSP) as well as project-specific data files. The shared data might also include data from non-WRDS sources, such as conference call files from StreetEvents or FactSet or data obtained from SEC EDGAR. The project-specific data files might be hand-collected or scraped data relevant to a specific project, as well as the results of processing other data for your project.
This approach to data organization is easily implemented using schemas in a relational database, such as PostgreSQL or MariaDB. An obvious choice for schema names for WRDS-derived data is the name of the corresponding schema in the WRDS PostgreSQL database (e.g., crsp for CRSP data and comp for Compustat data).
But if we want to maintain a file system–based approach rather than setting up a relational database, we might choose to organize our data files into directories. With such an approach, we could mimic the schemas discussed above: tables from CRSP go in a crsp directory, tables from Compustat in comp, and so on. If we wanted a project-specific version of crsp.dsf for a project stored in mega_paper, we could put it in (say) a crsp directory within the mega_paper directory.
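This layout can be sketched in a few lines of Python (the schema and file names here are purely illustrative, and a temporary directory stands in for DATA_DIR):

```python
from pathlib import Path
import tempfile

# A temporary directory stands in for DATA_DIR in this sketch.
# Shared schema directories (crsp, comp, ...) sit at the top level, while a
# project directory (mega_paper) can hold its own project-specific versions.
data_dir = Path(tempfile.mkdtemp())

layout = [
    "crsp/dsf.parquet",             # shared copy of crsp.dsf
    "comp/funda.parquet",           # shared copy of comp.funda
    "mega_paper/crsp/dsf.parquet",  # project-specific version of crsp.dsf
]

for rel in layout:
    path = data_dir / rel
    path.parent.mkdir(parents=True, exist_ok=True)  # create schema directories
    path.touch()                                    # placeholder parquet file

created = sorted(p.relative_to(data_dir).as_posix()
                 for p in data_dir.rglob("*.parquet"))
print(created)
```

The point is only the shape of the tree: the same schema/table path convention works for shared and project-specific data alike.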
Having elected to choose a data-file approach to storing data, a natural question arises about the format to use. Many Stata users would use Stata data files (.dta). R users might use RDS files (.rds). SAS users typically keep SAS data files (.sas7bdat). However, .dta and .rds files are a bad choice for large files.1 Loading these files into Stata or R means reading the whole file off disk into RAM. This takes time and uses up memory. Storing crsp.dsf as dsf.dta or dsf.rds is likely to result in a file that is several gigabytes on disk and even more in memory.
The approach in this chapter uses parquet files, which we first saw in Chapter 23. Parquet files have a rich type system and are created in a way that makes it easier to get just the data we need. Additionally, parquet files are column-oriented, which typically yields better performance for data analysis than row-oriented formats, such as CSV files or database tables. Finally, parquet files are typically much smaller than the equivalent CSV files.
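The column-orientation point can be illustrated in miniature with plain Python (the data here are made up):

```python
# A row store (like CSV) interleaves columns; a column store keeps each
# column's values together. Averaging one column in the columnar layout
# touches only that column, not every field of every row.
rows = [("AAPL", 0.01), ("MSFT", 0.02), ("NVDA", 0.03)]    # row-oriented
columns = {"ticker": ["AAPL", "MSFT", "NVDA"],
           "ret": [0.01, 0.02, 0.03]}                      # column-oriented

avg_row = sum(ret for _, ret in rows) / len(rows)    # must walk every row
avg_col = sum(columns["ret"]) / len(columns["ret"])  # reads one list only
print(avg_row, avg_col)
```

Parquet takes this idea further with per-column compression and statistics, which is why queries that touch a few columns of a wide table can skip most of the file.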
Relative to text files or most package-specific storage formats, parquet files offer most of the benefits of relational databases:
- Fast random access to on-disk data.
- Data accessible by any software package.2
- Data can be properly typed.
- Textual data are handled with ease.
A couple of potential advantages of a PostgreSQL server are lost with this approach. First, it’s slightly more difficult to centralize computing in the way that can often be achieved with a database server. Second, data are not “accessible from anywhere” in quite the same way. However, parquet files offer benefits without requiring a database server to be set up, a complex step for many users.
E.3 Canonical WRDS data
There are two underlying sources of WRDS data:
- The SAS files on the WRDS cloud server
- The WRDS PostgreSQL database
For many years (or even decades), the canonical WRDS data were those in SAS files on the WRDS servers. These SAS files formed the basis of the web queries familiar to many researchers.3
In recent years, WRDS has offered a PostgreSQL server that provides a more platform-neutral offering for researchers using more modern languages, such as Python or R. While the PostgreSQL server was initially constructed from the SAS data files, it seems possible that some data are ingested by the PostgreSQL server directly without an intermediate step using SAS data formats.
WRDS highlights support for four languages: R, Python, Stata, and SAS. For R, Python, and Stata, it is assumed that users will access WRDS data via the PostgreSQL server. Only for SAS is it assumed that the SAS data files will be used directly.
Initially, the SAS data files appeared to be the canonical data sources. Some data sets were not covered by the PostgreSQL database (e.g., TAQ data). Other data sets were in the PostgreSQL server, but with some data loss. For example, non-ASCII text in SAS data files was lost in early versions of the PostgreSQL database; this appears no longer to be the case.
At this stage, it seems that both the WRDS SAS data files and the WRDS PostgreSQL database can be viewed as canonical sources for WRDS data. Nonetheless there are minor differences between these data sources. First, the underlying schema names can differ in some cases.
Second, there may be differences in the underlying data types. As discussed in Appendix C, SAS has only two underlying data types and is therefore much less strict about what data goes into each column. The more specific data types in SAS are actually implemented as presentations of the underlying data. Nonetheless some data may be presented as one type in SAS and another type in PostgreSQL. For example, permno is presented as a whole number in SAS, but as a floating-point value in PostgreSQL. Practically, this makes little difference.4
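Why this makes little practical difference can be seen in a quick Python check (the PERMNO value below is purely illustrative):

```python
# permno is presented as a whole number by SAS and as a floating-point
# value by PostgreSQL. For identifiers of this size (far below 2**53),
# the two representations compare equal and round-trip exactly.
permno_int = 14593       # an illustrative PERMNO
permno_float = 14593.0   # the same value stored as a double

assert permno_int == permno_float
assert int(permno_float) == permno_int

# Doubles represent every integer up to 2**53 exactly ...
assert float(2**53) == 2**53
# ... and only lose exactness beyond that point, far past any PERMNO.
assert float(2**53 + 1) == float(2**53)
```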
One data source that is perhaps incorrectly viewed as a canonical data source is the web query interface offered by WRDS. The web query interface leaves much to be desired from a reproducibility perspective as manual steps are needed to extract data and often it is necessary to eliminate variables or periods to manage the size of the resulting files. Additionally, the relation between the data returned by some queries and the underlying tables in SAS or PostgreSQL is sometimes unclear. Some web queries appear to join data from multiple underlying tables in ways that are not always transparent.
E.4 Converting WRDS data to parquet
In this section, we discuss three approaches that readers could use to create their own parquet-based repositories of WRDS data. To keep the repository we create organized, we put it under a single directory, which we flag using the DATA_DIR environment variable. As we need to interact with WRDS, we need to communicate our WRDS ID to the code, which we can also do using an environment variable.
Sys.setenv(DATA_DIR = "~/Dropbox/pq_data/",
           WRDS_ID = "iangow")

E.4.1 Approach 1: Get WRDS PostgreSQL data using DuckDB
One approach to getting WRDS data is to connect to the PostgreSQL server much as we have done throughout the book. Having made this connection, we can easily copy the data to a local DuckDB database, and from there, it is easy to write the data to parquet files on our local storage.
For the reasons discussed in Section 6.1, we use environment variables to give R the information it needs to connect to the WRDS PostgreSQL server.
Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
           PGDATABASE = "wrds",
           PGUSER = "iangow",
           PGPORT = 9737)

We can now connect to the WRDS PostgreSQL server (pg here) and to a DuckDB instance (db). If we want to copy the crsp.msi table from WRDS to our local machine, we simply create msi, a remote data table, then copy this to db using the copy_to() function provided by dplyr. We name the table (name = "msi") so that we can refer to it in subsequent queries. Then, we save the data to a file named msi.parquet (DuckDB will infer that we want a parquet file from the .parquet file extension). Finally, we disconnect from both databases (pg and db).
pg <- dbConnect(RPostgres::Postgres())
db <- dbConnect(duckdb::duckdb())
msi <- tbl(pg, Id(schema = "crsp", table = "msi"))
copy_to(db, df = msi, name = "msi", overwrite = TRUE)
if (!dir.exists(file.path(Sys.getenv("DATA_DIR"), "crsp"))) {
  dir.create(file.path(Sys.getenv("DATA_DIR"), "crsp"))
}
file_name <- file.path(Sys.getenv("DATA_DIR"), "crsp/msi.parquet")
dbExecute(db, str_c("COPY msi TO '", file_name, "'"))

[1] 1189
dbDisconnect(pg)
dbDisconnect(db)

This basic approach is taken by the pg_to_parquet() function provided as part of the farr package. As we have the PostgreSQL connection details for WRDS and the DATA_DIR environment variable set up, we can get a local copy of crsp.msi with a single line.
pg_to_parquet("msi", "crsp")

The main downside of this approach is that it is very memory-intensive with large tables.
E.4.2 Approach 2: Get WRDS PostgreSQL data using R
In Appendix D, we saw how we can use the wrds2pg Python library to import WRDS SAS data into a local PostgreSQL server with the wrds_update() function. The db2pq package offers wrds_update_pq(), a function modelled on wrds_update() that creates parquet files using data stored in the WRDS PostgreSQL database. To install the R package db2pq, which is not yet on CRAN, use pak:5
pak::pak("iangow/db2pqr")

The db2pq package uses the wrds package to get a connection to the WRDS PostgreSQL database. So if you have not done so, you will need to set up your credentials using that package.
Assuming we have set the necessary environment variable (i.e., DATA_DIR), creating a parquet file is straightforward. If not, we can set it easily:
Sys.setenv(DATA_DIR = "~/Dropbox/pq_data")

Using db2pq allows us to specify variable types and places much less of a burden on RAM than pg_to_parquet() does.
library(db2pq)
wrds_update_pq("msi", "crsp")

A script to obtain all tables needed to run the code in this book using this approach is provided in Section E.6.
E.4.3 Approach 3: Get WRDS SAS data using Python
The wrds2pg Python library also offers wrds_update_pq(), a function modelled after wrds_update() that uses SAS code to generate data that are saved as a parquet file. To install the required version of wrds2pg, use the following command from the command line.6
pip install wrds2pg --upgrade

Assuming we have the necessary environment variables set up (e.g., WRDS_ID and DATA_DIR), creating a parquet file from a WRDS SAS file is straightforward. (Note that the following lines are Python code.)
from wrds2pg import wrds_update_pq
wrds_update_pq("msi", "crsp")

If you don’t have the DATA_DIR environment variable set, you can supply it to the wrds_update_pq() function directly.
wrds_update_pq("msi", "crsp", data_dir = "~/Dropbox/pq_data/")The main downsides of this approach are the need to use Python and the fact that extracting data from SAS is slower than doing so from PostgreSQL.
E.5 Working with parquet files
To illustrate how we can work with these parquet files, we retrace some of the steps we took in Chapter 21. We will assume that you have parquet versions of the three data tables in your repository, perhaps as a result of running one of the three approaches to getting the data listed here.
Downloading comp.funda will take a few minutes using any of the three approaches below.
E.5.1 Getting the data
E.5.1.1 Approach 1: Get WRDS PostgreSQL data using DuckDB
While this approach is simpler than the other two, it will take longer and it requires more RAM. It is not recommended for large files.
pg_to_parquet("seg_customer", "compseg")
pg_to_parquet("names_seg", "compseg")
pg_to_parquet("funda", "comp")

E.5.1.2 Approach 2: Get WRDS PostgreSQL data using R
This is the fastest and most robust approach. It also uses less RAM than either of the other two approaches.
It requires the R package db2pq, which is not yet on CRAN, so you need to install it from GitHub using pak:7
pak::pak("iangow/db2pqr")

library(db2pq)
wrds_update_pq("seg_customer", "compseg")
wrds_update_pq("names_seg", "compseg")
wrds_update_pq("funda", "comp")

E.5.1.3 Approach 3: Get WRDS SAS data using Python
As the comp.funda SAS data file includes special missing value codes that are exported by SAS as text values, we need to use fix_missing = True to convert these to missing values.
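Conceptually, the conversion that fix_missing performs can be sketched in a few lines of Python (this is a hypothetical illustration of the idea, not the wrds2pg implementation; the function name and regex are assumptions):

```python
import re

# SAS special missing values (.A through .Z and ._) arrive as short text
# codes when exported; we map them to a proper missing value (None)
# before writing parquet, leaving everything else untouched.
SPECIAL_MISSING = re.compile(r"^\.[A-Z_]$")

def fix_missing(value):
    """Return None for a SAS special missing code; otherwise pass through."""
    if isinstance(value, str) and SPECIAL_MISSING.match(value):
        return None
    return value

raw = ["12.5", ".A", ".Z", "0.3", "._"]
cleaned = [fix_missing(v) for v in raw]
print(cleaned)
```

Without such a step, a numeric column containing codes like ".A" would be forced to text, which is exactly the problem fix_missing = True avoids.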
from wrds2pg import wrds_update_pq
wrds_update_pq("seg_customer", "compseg")
wrds_update_pq("names_seg", "compseg")
wrds_update_pq("funda", "comp", fix_missing = True)

E.5.2 Replicating analysis using parquet files
Assuming you now have the three data tables, we can proceed with some analysis from Chapter 21. We start by creating an in-memory DuckDB database.
We next use load_parquet() from the farr package to load the tables we created above into our DuckDB database, db. By default, load_parquet() will use the environment variable DATA_DIR to locate the repository of parquet files. If DATA_DIR has not been set, we set it now.
Sys.setenv(DATA_DIR = "~/Dropbox/pq_data/")
seg_customer <- load_parquet(db, "seg_customer", "compseg")
names_seg <- load_parquet(db, "names_seg", "compseg")
funda <- load_parquet(db, "funda", "comp")

As an alternative to setting the DATA_DIR environment variable, you can supply the location of the repository of parquet files to the load_parquet() function directly as the data_dir argument.
load_parquet(db, "funda", "comp", data_dir = "~/Dropbox/pq_data/")

The next two tables (undisclosed_names and llz_2018) come with the farr package, but need to be copied to db.
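The path resolution that load_parquet() performs can be sketched as follows (parquet_path() is a hypothetical helper written for illustration, not part of farr):

```python
import os
from pathlib import Path

# The repository root comes from the DATA_DIR environment variable unless
# a data_dir argument is supplied; schema and table map to directory and
# file name, mirroring the layout described earlier.
def parquet_path(table, schema, data_dir=None):
    root = Path(data_dir or os.environ["DATA_DIR"]).expanduser()
    return root / schema / f"{table}.parquet"

os.environ["DATA_DIR"] = "/tmp/pq_data"
print(parquet_path("funda", "comp").as_posix())
# /tmp/pq_data/comp/funda.parquet
print(parquet_path("funda", "comp", data_dir="/other/repo").as_posix())
# /other/repo/comp/funda.parquet
```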
The next block of code is essentially unchanged from the code in Chapter 21. The two changes are:
- We do not use collect(), as we want the data to remain in our database as long as possible. In the original code, we needed to collect() data to merge with undisclosed_names and llz_2018. These are now all in a single database.
- We add na.rm = TRUE in a couple of places to suppress warnings. SQL always uses na.rm = TRUE.
customers <-
  seg_customer |>
  filter(ctype == "COMPANY")

disclosure_raw <-
  customers |>
  filter(between(datadate, "1994-01-01", "2010-12-31")) |>
  left_join(undisclosed_names, by = "cnms") |>
  mutate(disclosed = coalesce(disclosed, TRUE)) |>
  select(gvkey, datadate, cnms, salecs, disclosed)

sales <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") |>
  select(gvkey, datadate, sale)

prin_cust_df <-
  disclosure_raw |>
  inner_join(sales, by = c("gvkey", "datadate")) |>
  group_by(gvkey, datadate) |>
  filter(!is.na(salecs), sale > 0) |>
  summarize(prin_cust = max(salecs / sale, na.rm = TRUE),
            .groups = "drop") |>
  mutate(has_prin_cust = prin_cust >= 0.1)

disclosure <-
  disclosure_raw |>
  inner_join(sales, by = c("gvkey", "datadate")) |>
  semi_join(llz_2018, by = "gvkey") |>
  group_by(gvkey, datadate) |>
  summarize(ratio = mean(as.double(!disclosed), na.rm = TRUE),
            ratio_sale = sum(as.double(!disclosed) * salecs, na.rm = TRUE) /
              sum(salecs, na.rm = TRUE),
            .groups = "drop") |>
  mutate(year = year(datadate))

# A tibble: 1 × 2
ratio ratio_sale
<dbl> <dbl>
1 0.459 0.464
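To make the NULL handling concrete, here is the ratio calculation in miniature, in plain Python with made-up records (None plays the role of SQL NULL):

```python
# Made-up customer records for one firm-year: (disclosed, salecs).
# SQL aggregates skip NULLs, which is what na.rm = TRUE requests in R;
# here we drop None explicitly before aggregating.
records = [(False, 100.0), (True, 250.0), (False, None), (True, 50.0)]

# ratio: share of customers that are undisclosed (no NULLs in disclosed)
undisclosed = [0.0 if disclosed else 1.0 for disclosed, _ in records]
ratio = sum(undisclosed) / len(undisclosed)

# ratio_sale: sales-weighted share, skipping missing salecs (na.rm = TRUE)
pairs = [(u, s) for (d, s), u in zip(records, undisclosed) if s is not None]
ratio_sale = sum(u * s for u, s in pairs) / sum(s for _, s in pairs)

print(ratio, ratio_sale)
```

Half the customers are undisclosed (ratio of 0.5), but because the undisclosed customer with missing sales drops out, only 100 of the 400 in measurable sales is undisclosed (ratio_sale of 0.25).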
dbDisconnect(db)

E.5.3 Working with larger parquet files
The power of parquet files becomes even more apparent when working with larger data sets, such as crsp.dsf. Here we assume that you have a copy of crsp.dsf in your repository, perhaps as a result of running code like the following Python code. The resulting data file will be over 3 GB (much smaller than the original SAS or PostgreSQL data).
Downloading crsp.dsf can take 30 minutes or more depending on the speed of your connection to WRDS and the approach used.
from db2pq import wrds_update_pq
wrds_update_pq("dsf", "crsp")

The following code does a number of analytical tasks. First, we count the number of rows in the table. Second, we count the number of rows by year. Third, we extract the data for a particular date. Next, we identify the PERMNO for Apple, then use that to plot cumulative returns for Apple over time.
Completing all these tasks takes about 3 seconds!
now <- Sys.time()
db <- dbConnect(duckdb::duckdb())
dsf <- load_parquet(db, "dsf", "crsp")
stocknames <- load_parquet(db, "stocknames", "crsp")
dsf |>
  count() |>
  collect()

# A tibble: 1 × 1
n
<dbl>
1 107663470
# A tibble: 100 × 2
year n
<dbl> <dbl>
1 2024 2400962
2 2023 2353845
3 2022 2390163
4 2021 2187548
5 2020 1948995
6 2019 1912085
7 2018 1869604
8 2017 1827893
9 2016 1828555
10 2015 1820358
# ℹ 90 more rows
dsf_subset <-
  dsf |>
  filter(date == "1986-01-07") |>
  collect()

apple_permno <-
  stocknames |>
  filter(str_detect(comnam, "^APPLE COM")) |>
  pull(permno)

dsf |>
  filter(permno == apple_permno) |>
  group_by(permno) |>
  window_order(date) |>
  mutate(cumret = exp(cumsum(log(1 + coalesce(ret, 0))))) |>
  ggplot(aes(x = date, y = cumret)) +
  geom_line()

Sys.time() - now

Time difference of 0.8599169 secs
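The cumulative-return transformation in the pipeline above, exp(cumsum(log(1 + coalesce(ret, 0)))), can be sketched in plain Python with made-up daily returns:

```python
import math

# Compounding returns is a running product of (1 + r); writing it as
# exp(cumsum(log(1 + r))) is a numerically convenient equivalent.
# Missing returns are treated as zero, as coalesce(ret, 0) does.
rets = [0.05, None, -0.02, 0.10]   # made-up returns; None = missing

cumrets = []
log_sum = 0.0
for r in rets:
    log_sum += math.log(1 + (r if r is not None else 0.0))  # coalesce(ret, 0)
    cumrets.append(math.exp(log_sum))

print([round(c, 4) for c in cumrets])
```

The final value is the gross return over the whole period: here 1.05 × 1.00 × 0.98 × 1.10.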
dbDisconnect(db)

E.6 Creating a parquet library
In this section, we describe how to create a local repository of the data used in this book in Parquet format.8
- Install the db2pq package.

The db2pq package is not yet on CRAN, so you need to install it from GitHub using pak:9

pak::pak("iangow/db2pqr")

- Set up your WRDS credentials.

The db2pq package uses the wrds package to get a connection to the WRDS PostgreSQL database. So if you have not done so, you will need to set up your credentials using that package.

- Set the DATA_DIR environment variable.

This tells db2pq where you want to store the parquet files.

Sys.setenv(DATA_DIR = "~/Dropbox/pq_data")

- Within R, use the db2pq package to get data. Running the following script creates a repository comprising all the data sets used in this book.
library(db2pq)
# CRSP
wrds_update_pq('ccmxpf_lnkhist', 'crsp',
               col_types = list(lpermno = 'int32',
                                lpermco = 'int32'))
wrds_update_pq('dsf', 'crsp')
wrds_update_pq('dsi', 'crsp')
wrds_update_pq('erdport1', 'crsp')
wrds_update_pq('comphist', 'crsp')
wrds_update_pq('dsedelist', 'crsp')
wrds_update_pq('dseexchdates', 'crsp')
wrds_update_pq('msf', 'crsp')
wrds_update_pq('msi', 'crsp')
wrds_update_pq('mse', 'crsp')
wrds_update_pq('stocknames', 'crsp')
wrds_update_pq('dsedist', 'crsp')
# Fama-French library
wrds_update_pq('factors_daily', 'ff')
# Compustat
wrds_update_pq('company', 'comp')
wrds_update_pq('funda', 'comp')
wrds_update_pq('funda_fncd', 'comp')
wrds_update_pq('fundq', 'comp')
wrds_update_pq('r_auditors', 'comp')
wrds_update_pq('idx_daily', 'comp')
wrds_update_pq('aco_pnfnda', 'comp')
# compseg
wrds_update_pq('seg_customer', 'compseg')
wrds_update_pq('names_seg', 'compseg')

1. SAS works in a way that makes SAS data files a reasonable choice.↩︎
2. This is almost true for parquet files, but not quite. Stata cannot really work with on-disk data and has no native support for parquet data.↩︎
3. It seems plausible that these web queries still draw on SAS data files to this day.↩︎
4. SAS stores the data as floating-point values in any case.↩︎
5. Use install.packages("pak") to get pak if you don’t have it installed.↩︎
6. If you installed wrds2pg before December 2023, then you will need to update to the current version.↩︎
7. Use install.packages("pak") to get pak if you don’t have it installed.↩︎
8. Above we discussed three approaches, but “Approach 1” will be problematic with larger data files. Here we use “Approach 2”.↩︎
9. Use install.packages("pak") to get pak if you don’t have it installed.↩︎