Appendix C — PostgreSQL
Throughout the book, we have used the WRDS PostgreSQL server. While the WRDS PostgreSQL server offers access to a large menagerie of data with very little set-up, there are limitations to using the WRDS server alone for research. The purpose of this appendix is to provide some ideas for how you might run your own PostgreSQL server.
C.1 Benefits of using relational databases
- Fast random access to on-disk data. Relational databases make it easy to index data, which makes it “alive” rather than “dead” in the sense we use here. We will see that accessing specific observations from large data sets is fast and easy with a relational database. For example, accessing stock-return data on crsp.dsf in the WRDS PostgreSQL database for Microsoft for a few dates is quite fast.1 (A sketch of such a query appears after this list.)
- Data accessible by any software package. Data stored in a PostgreSQL database are accessible from R, Stata, Python, Perl, … pretty much any software. This has subtle but important advantages for data collaboration and multilingual programming. For example, you may be a Stata devotee who needs help preparing a data set from web data. Finding people with the skills to do this is going to be a lot easier if you don’t need them to have Stata skills, but can instead work with a Python guru. With a relational database, a Stata user is unlikely to care much whether the data coming out of PostgreSQL were put there using Python. As another example, you may have written some complicated Perl code to generate some data sets and occasionally need to tweak that code to accommodate changes in the source data. As long as the data sets are stored in a PostgreSQL database, you will have no need to translate the entire code base into a language you still know how to code in. This benefit is probably also the primary driver of WRDS’s decision to add a PostgreSQL database to its offerings: with more users wanting to use Python or R for analysis, sticking to a SAS-format-only approach was surely not viable.
- Data accessible from anywhere. Data on the WRDS PostgreSQL server can be accessed from anywhere with an internet connection (in the 2020s, this essentially means anywhere) by anyone with a WRDS account. For example, data stored on a PostgreSQL server in Boston can be easily accessed by coauthors in California, Australia, or Singapore.2
- Centralised processing power. In this book, we see a number of cases where the heavy part of data processing for some tasks is accomplished on the WRDS PostgreSQL server even when executing R code locally. While SAS has similar functionality with RSUBMIT, running queries on a remote PostgreSQL server from RStudio is much more seamless.
- Data can be (and should be) properly typed. PostgreSQL in particular has a much richer set of data types than a typical statistical programming package. For example, everything in SAS is fundamentally of either a floating-point numerical type or a fixed-width character type. Things such as dates and times are essentially formatting overlays over these types, and such formatting is easy to break, sometimes causing serious issues in practice. (Stata is fairly similar.) In contrast, PostgreSQL offers standard types such as text, floating-point, integer, and big-integer types, along with timestamps with time zones, JSON, XML, and others. Most of the commonly used types are recognized fairly seamlessly and converted to appropriate types when data are brought into R (e.g., timestamps with time zones).
  We say “can be” in the heading of this item because whether the data actually have appropriate types depends on how much care was used in getting the data into the database in the first place. For example, at the time of writing, WRDS is often fairly careless about data types. Variables that are naturally integers, such as firm identifiers like CIKs or PERMNOs, are double precision rather than integer types.3 Text variables are generally character varying, even though PostgreSQL’s text type is simpler and equally efficient. And timestamps are invariably of type timestamp without time zone when they arguably should always be timestamp with time zone, as a timestamp has no clear meaning without a time zone.
- Textual data are handled with ease. Stata says that “Unicode is the modern way that computers encode characters such as the letters in the words you are now reading.” Yet prior to Stata 14, Stata did not support it. SAS files provided by WRDS have traditionally been encoded in some kind of Latin encoding, and both SAS and Stata have limitations with regard to the size of text fields. In contrast, PostgreSQL uses Unicode by default, as do Python and R, and there is effectively no constraint on the size of text fields in PostgreSQL.
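To give a concrete sense of the first benefit above, the following R sketch pulls a few days of Microsoft returns from crsp.dsf over a remote connection. The connection settings are those published by WRDS; the PERMNO (10107, Microsoft) and the date range are merely illustrative, and the sketch assumes that your WRDS credentials are available (for example, via the WRDS_ID environment variable and a password file, as discussed below).

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Remote connection to the WRDS PostgreSQL server
wrds <- dbConnect(RPostgres::Postgres(),
                  host = "wrds-pgdata.wharton.upenn.edu",
                  port = 9737, dbname = "wrds",
                  sslmode = "require",
                  user = Sys.getenv("WRDS_ID"))

dsf <- tbl(wrds, in_schema("crsp", "dsf"))

# Indexed lookup: a handful of rows from a roughly 20 GB table
dsf |>
  filter(permno == 10107L,
         date >= "2023-01-03", date <= "2023-01-10") |>
  select(permno, date, prc, ret) |>
  collect()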
C.2 Setting up a personal server
Setting up your own PostgreSQL server overcomes some of the disadvantages of using the WRDS server.
One disadvantage is that, because WRDS only allows read-only connections, one cannot store computed results on the WRDS PostgreSQL server. There are two types of computed tables that one can create. Temporary tables, which are created using the compute() function from dplyr, can dramatically simplify queries and increase performance, but disappear once you disconnect from the database. Permanent tables, which can be created using the compute() function with the temporary = FALSE argument, can be useful for breaking the research process into several discrete steps.
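To illustrate, here is a minimal sketch of compute() with temporary = FALSE. It assumes that db is a connection to a PostgreSQL server you can write to (such as the personal server set up below); the table name msft_rets and the PERMNO filter are illustrative.

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Connection settings come from PGHOST, PGDATABASE, PGUSER, etc. (see below)
db <- dbConnect(RPostgres::Postgres())

dsf <- tbl(db, in_schema("crsp", "dsf"))

msft_rets <-
  dsf |>
  filter(permno == 10107L) |>        # Microsoft
  select(permno, date, ret) |>
  compute(name = "msft_rets",        # materialise the query as a table
          temporary = FALSE)         # permanent: survives disconnection

# Omit temporary = FALSE (the default is TRUE) to get a temporary table instead.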
Another disadvantage of the WRDS server is that it limits our ability to use our own data on the WRDS server. While we can use the copy_inline() function from the dbplyr package to create something similar to a temporary table on the server using local data, this will not be the best approach if our local data sets are large.4 And while we could download crsp.dsf and do the analysis on our local computer, this is likely to be painful even if we have enough RAM to store crsp.dsf.
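Here is a minimal sketch of copy_inline(). The small events table, its PERMNOs (10107 for Microsoft and 14593 for Apple), and the event dates are illustrative assumptions; the point is that the local rows are spliced into the SQL sent to the server, so nothing needs to be written to the (read-only) WRDS database.

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

wrds <- dbConnect(RPostgres::Postgres(),
                  host = "wrds-pgdata.wharton.upenn.edu",
                  port = 9737, dbname = "wrds",
                  sslmode = "require",
                  user = Sys.getenv("WRDS_ID"))

# A small local data set (hypothetical event dates)
events <- tibble(permno = c(10107L, 14593L),
                 event_date = as.Date(c("2023-01-05", "2023-01-06")))

dsf <- tbl(wrds, in_schema("crsp", "dsf"))

# copy_inline() embeds the local rows in the query sent to the server
event_rets <-
  dsf |>
  inner_join(copy_inline(wrds, events),
             by = c("permno", "date" = "event_date")) |>
  select(permno, date, ret) |>
  collect()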
An alternative approach that builds on the framework used in this book is to create your own PostgreSQL database and use that. While there are some set-up and maintenance costs associated with this, it’s actually fairly manageable using the steps we describe below.
C.2.1 Installation steps
- Install Python. You may already have Python installed. If not, Miniconda is a good option.
- Install the wrds2pg package:
pip install wrds2pg --upgrade
- Install PostgreSQL and initialize the server.
Installation of PostgreSQL varies by platform (and on some platforms there are multiple approaches).5 A good place to start is the PostgreSQL webpage.
- Create a database.
You may have already done this in the previous step. If not, and you have the server running, you could do this using SQL:
CREATE DATABASE wrds;
- Set up environment variables and so on for wrds2pg. See here for details.
export PGHOST="localhost"
export PGDATABASE="wrds"
export WRDS_ID="iangow"
export PGUSER="igow"
You might also use the environment variable PGPASSWORD:
export PGPASSWORD="password"
But it is probably better to use a password file (see the sketch at the end of this list).
- Run Python.
- Within Python, use the wrds2pg module to get data:
from wrds2pg import wrds_update
wrds_update("dsi", "crsp", dbname="wrds", host="localhost")
C.2.2 Getting the tables used here
Following the steps above, the script below can be used to get the tables used in this book. Note that a few tables require special handling.
Audit Analytics tables are merged (seemingly by WRDS) to create very large tables in which most of the variables relate to financial statement items. Most researchers do not use these data, so the code below uses a SAS wildcard to drop such variables (note that wrds2pg actually runs SAS code on the WRDS server behind the scenes). While modifying these tables in this way breaks the idea of the local tables closely reflecting their WRDS counterparts, the benefit of having a cleaner, leaner table seems worthwhile in this case.
Some tables on CRSP and Compustat have special missing values that SAS’s PROC EXPORT procedure turns into simple character values, which PostgreSQL cannot accept in fields that are not of text types. The fix_missing = True argument to wrds_update converts such special missing values into regular missing values.
The script also fixes variable types that are not appropriately specified in the original SAS files (e.g., permno should be integer).
Finally, the script creates indexes, which dramatically increase the performance of table joins and filter() operations.
Note that the script below likely takes a few hours to run, primarily because of crsp.dsf, which is about 20 GB of data. However, subsequent runs of the script will only download data if the SAS data file on WRDS has been updated and so will usually run much faster.
#!/usr/bin/env python3
from wrds2pg import wrds_update, make_engine, process_sql

engine = make_engine()

# Audit Analytics
updated = wrds_update("auditnonreli", "audit",
                      drop="prior: match: closest: disc_text:",
                      col_types={"res_accounting": "boolean",
                                 "res_fraud": "boolean",
                                 "res_cler_err": "boolean",
                                 "res_adverse": "boolean",
                                 "res_improves": "boolean",
                                 "res_other": "boolean",
                                 "res_sec_invest": "boolean",
                                 "res_begin_aud_fkey": "integer",
                                 "res_notif_key": "integer",
                                 "current_aud_fkey": "integer",
                                 "res_end_aud_fkey": "integer",
                                 "file_date_aud_fkey": "integer"})
if updated:
    process_sql("CREATE INDEX ON audit.auditnonreli (res_notif_key)", engine)

# CRSP
ccmxpf_lnkhist = wrds_update("ccmxpf_lnkhist", "crsp", fix_missing=True,
                             col_types={"lpermno": "integer",
                                        "lpermco": "integer"})
if ccmxpf_lnkhist:
    process_sql("CREATE INDEX ON crsp.ccmxpf_lnkhist (gvkey)", engine)
    process_sql("CREATE INDEX ON crsp.ccmxpf_lnkhist (lpermno)", engine)
    process_sql("CREATE INDEX ON crsp.ccmxpf_lnkhist (lpermco)", engine)

dsf = wrds_update("dsf", "crsp", fix_missing=True,
                  col_types={"permno": "integer",
                             "permco": "integer"})
if dsf:
    process_sql("CREATE INDEX ON crsp.dsf (permno, date)", engine)
    process_sql("CREATE INDEX ON crsp.dsf (permco)", engine)

dsi = wrds_update("dsi", "crsp")
if dsi:
    process_sql("CREATE INDEX ON crsp.dsi (date)", engine)

wrds_update("comphist", "crsp", fix_missing=True)

dsedelist = wrds_update("dsedelist", "crsp", fix_missing=True,
                        col_types={"permno": "integer",
                                   "permco": "integer"})
if dsedelist:
    process_sql("CREATE INDEX ON crsp.dsedelist (permno)", engine)

dseexchdates = wrds_update("dseexchdates", "crsp",
                           col_types={"permno": "integer",
                                      "permco": "integer"})
if dseexchdates:
    process_sql("CREATE INDEX ON crsp.dseexchdates (permno)", engine)

erdport1 = wrds_update("erdport1", "crsp", fix_missing=True,
                       col_types={"permno": "integer",
                                  "capn": "integer"})
if erdport1:
    process_sql("CREATE INDEX ON crsp.erdport1 (permno, date)", engine)
= wrds_update("ccmxpf_lnkhist", "crsp", fix_missing=True,
updated = {"lpermno": "integer",
col_types "lpermco": "integer"})
if updated:
"CREATE INDEX ON crsp.ccmxpf_lnkhist (gvkey)", engine)
process_sql(
= wrds_update("msf", "crsp", fix_missing=True,
updated = {"permno": "integer",
col_types "permco": "integer"})
if updated:
"CREATE INDEX ON crsp.msf (permno, date)", engine)
process_sql(
= wrds_update("msi", "crsp")
updated if updated:
"CREATE INDEX ON crsp.msi (date)", engine)
process_sql(
= wrds_update("mse", "crsp", fix_missing=True,
mse = {"permno": "integer",
col_types "permco": "integer"})
= wrds_update("stocknames", "crsp",
stocknames = {"permno": "integer",
col_types "permco": "integer"})
= wrds_update("dsedist", "crsp", fix_missing=True,
dsedist = {'permno':'integer',
col_types 'permco':'integer'})
if dsedist:
"CREATE INDEX ON crsp.dsedist (permno)", engine)
process_sql(
= wrds_update("factors_daily", "ff")
updated
# Compustat
= wrds_update("company", "comp")
updated if updated:
"CREATE INDEX ON comp.company (gvkey)", engine)
process_sql(
= wrds_update("funda", "comp", fix_missing = True)
updated if updated:
"CREATE INDEX ON comp.funda (gvkey)", engine)
process_sql(
= wrds_update("fundq", "comp", fix_missing = True)
updated if updated:
"CREATE INDEX ON comp.fundq (gvkey, datadate)", engine)
process_sql(
= wrds_update("r_auditors", "comp")
updated
= wrds_update("idx_daily", "comp")
updated if updated:
"CREATE INDEX ON comp.idx_daily (gvkeyx)", engine)
process_sql(
= wrds_update("aco_pnfnda", "comp")
updated if updated:
"CREATE INDEX ON comp.aco_pnfnda (gvkey, datadate)", engine)
process_sql(
# The segment data is in comp_segments_hist_daily in PostgreSQL,
# but in compsegd in SAS, so we need to specify sas_schema.
"seg_customer", "comp_segments_hist_daily",
wrds_update(="compsegd")
sas_schema"names_seg", "comp_segments_hist_daily",
wrds_update(="compsegd") sas_schema
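Once the script has finished, the R code used throughout the book can be pointed at your own server simply by relying on the environment variables set earlier. A minimal sanity check might look like the following sketch (the summary query is illustrative):

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# With PGHOST, PGDATABASE, and PGUSER set as above, no arguments are needed
db <- dbConnect(RPostgres::Postgres())

dsf <- tbl(db, in_schema("crsp", "dsf"))

dsf |>
  summarise(num_rows = n(),
            first_date = min(date, na.rm = TRUE),
            last_date = max(date, na.rm = TRUE)) |>
  collect()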
To illustrate, pulling returns and prices for Microsoft using a remote query to the WRDS PostgreSQL server takes 27 milliseconds. Most of this is due to remote access; a local query takes less than one millisecond.↩︎
This book only requires PostgreSQL data stored in the WRDS database, but we include a brief overview of the steps required to set up one’s own research server in Appendix C.↩︎
This can have significant implications for performance.↩︎
As can be seen here, this function was added to dbplyr after a request related to the creation of this course.↩︎
On Ubuntu, we can install PostgreSQL using, say, sudo apt install postgresql-14; on macOS, Postgres.app provides a very easy-to-manage PostgreSQL implementation.↩︎