Appendix D — Running PostgreSQL
Throughout the book, we have used the WRDS PostgreSQL server as the primary source of data for our analysis. While the WRDS PostgreSQL server offers access to a large menagerie of data with very little set-up, there are limitations to using the WRDS server alone for research. For example, if you are using crsp.dsf intensively on a near-daily basis, then you may be downloading a lot of data over time. As such, it might be more efficient to keep a local copy of crsp.dsf.
The purpose of this appendix is to provide guidance on one possible solution, namely running your own PostgreSQL server. An alternative approach—creating a repository of parquet files—is discussed in Appendix E.
D.1 Setting up a personal server
Setting up your own PostgreSQL server overcomes some of the disadvantages of using the WRDS server.
One disadvantage is that, because WRDS allows only read-only connections, one cannot store computed results on the WRDS PostgreSQL server. There are two types of computed tables that one can create. Temporary tables, which are created using the compute() function from dplyr, can dramatically simplify queries and increase performance, but disappear once you disconnect from the database. Permanent tables, which can be created using the compute() function with the temporary = FALSE argument, can be useful for breaking the research process into discrete steps.
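To illustrate, here is a minimal sketch of both kinds of compute() call. It assumes db is a connection to a PostgreSQL server on which we have write access (so not the read-only WRDS server), and the table name mkt_caps is purely illustrative.
library(DBI)
library(dplyr)
library(dbplyr)

# Connection details come from PGHOST, PGDATABASE, etc.
db <- dbConnect(RPostgres::Postgres())

dsf <- tbl(db, in_schema("crsp", "dsf"))

# Temporary table: speeds up subsequent queries, but vanishes on disconnect
mkt_caps_tmp <-
  dsf |>
  mutate(mktcap = abs(prc) * shrout) |>
  compute()

# Permanent table: persists across sessions
mkt_caps <-
  dsf |>
  mutate(mktcap = abs(prc) * shrout) |>
  compute(name = "mkt_caps", temporary = FALSE)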
Another disadvantage of the WRDS server is the limits it imposes on using our own data. While we can use the copy_inline() function from the dbplyr package to create something similar to a temporary table on the server using local data, this will not be the best approach if our local data sets are large.1 And, while we could download crsp.dsf as an R data file and do the analysis on our own computer, this is likely to be painful even if we have enough RAM to load crsp.dsf.
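For modest amounts of local data, though, copy_inline() works well. A minimal sketch, assuming db is a connection to the WRDS server and that the permno values and dates are purely illustrative:
library(dplyr)
library(dbplyr)

# A small local table of firm-event dates
events <- tibble(permno = c(14593L, 10107L),
                 event_date = as.Date(c("2020-01-15", "2020-03-02")))

# Embed the local data directly in the SQL sent to the server;
# events_db can then be joined against server-side tables
events_db <- copy_inline(db, events)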
An alternative approach that builds on the framework used in this book is to create your own PostgreSQL database and use that. While there are some set-up and maintenance costs associated with this, it’s actually fairly manageable using the steps we describe below.
D.1.1 Installation steps
- Install Python. You may already have Python installed. If not, Miniconda is one option.
- Install the wrds2pg package.
pip install wrds2pg --upgrade
- Install PostgreSQL and initialize the server.
Installation of PostgreSQL varies by platform (and on some platforms there are multiple approaches).2 A good place to start is the PostgreSQL webpage.
- Create a database.
You may have already done this in the previous step. If not, and you have the server running, you could do this using SQL:
CREATE DATABASE wrds;
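Alternatively, the createdb utility that ships with PostgreSQL does the same thing from the shell:
createdb wrds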
- Set up environment variables for wrds2pg. See the wrds2pg documentation for details.
export PGHOST="localhost"
export PGDATABASE="wrds"
export WRDS_ID="iangow"
export PGUSER="igow"
export PGPORT=5432
You might also use the environment variable PGPASSWORD:
export PGPASSWORD="password"
But it is better to use a password file.
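On Linux and macOS, the password file is ~/.pgpass, which must be readable only by you (chmod 600 ~/.pgpass) and contains one line per connection in the format hostname:port:database:username:password. Using the values above:
localhost:5432:wrds:igow:password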
- Run Python.
- Within Python, use the wrds2pg module to get data.
from wrds2pg import wrds_update
wrds_update("dsi", "crsp", dbname="wrds", host="localhost")
D.1.2 Getting the tables used in this book
Following the steps above, the script below can be used to get the tables used in this book. Note that a few tables require special handling.
Some SAS data files on CRSP and Compustat have special missing values that SAS's PROC EXPORT procedure turns into simple character values, which PostgreSQL cannot accept in fields that are not of a text type. The fix_missing = True argument to wrds_update() converts such values into regular missing values.
The script also fixes variable types that are not well formatted in the original SAS files (e.g., permno should be integer).
Finally, the script creates indexes, which dramatically increase the performance of table joins and filter() operations.
Note that the script below likely takes a few hours to run, primarily because of crsp.dsf, which is about 20 GB of data. However, subsequent runs of the script will only download data if the SAS data file on WRDS has been updated, and so will usually run much faster.
#!/usr/bin/env python3
from wrds2pg import wrds_update, make_engine, process_sql

engine = make_engine()

# wrds_update() returns True only if the WRDS data have changed,
# so indexes are recreated only when a table is re-downloaded.

# CRSP
wrds_update('ccmxpf_lnkhist', 'crsp', fix_missing=True,
            col_types={'lpermno': 'integer', 'lpermco': 'integer'})

dsf = wrds_update('dsf', 'crsp', fix_missing=True,
                  col_types={'permno': 'integer',
                             'permco': 'integer'})
if dsf:
    process_sql('CREATE INDEX ON crsp.dsf (permno, date)', engine)
    process_sql('CREATE INDEX ON crsp.dsf (permco)', engine)

erdport1 = wrds_update('erdport1', 'crsp',
                       keep='permno date decret')
if erdport1:
    process_sql('CREATE INDEX ON crsp.erdport1 (permno, date)', engine)

dsi = wrds_update('dsi', 'crsp')
if dsi:
    process_sql('CREATE INDEX ON crsp.dsi (date)', engine)

wrds_update('comphist', 'crsp', fix_missing=True)

dsedelist = wrds_update('dsedelist', 'crsp', fix_missing=True,
                        col_types={'permno': 'integer',
                                   'permco': 'integer'})
if dsedelist:
    process_sql('CREATE INDEX ON crsp.dsedelist (permno)', engine)

dseexchdates = wrds_update('dseexchdates', 'crsp',
                           col_types={'permno': 'integer',
                                      'permco': 'integer'})
if dseexchdates:
    process_sql('CREATE INDEX ON crsp.dseexchdates (permno)', engine)

msf = wrds_update('msf', 'crsp', fix_missing=True,
                  col_types={'permno': 'integer',
                             'permco': 'integer'})
if msf:
    process_sql('CREATE INDEX ON crsp.msf (permno, date)', engine)

msi = wrds_update('msi', 'crsp')
if msi:
    process_sql('CREATE INDEX ON crsp.msi (date)', engine)

wrds_update('mse', 'crsp', fix_missing=True,
            col_types={'permno': 'integer',
                       'permco': 'integer'})

wrds_update('stocknames', 'crsp',
            col_types={'permno': 'integer',
                       'permco': 'integer'})

dsedist = wrds_update('dsedist', 'crsp', fix_missing=True,
                      col_types={'permno': 'integer',
                                 'permco': 'integer'})
if dsedist:
    process_sql('CREATE INDEX ON crsp.dsedist (permno)', engine)

# Fama-French data
updated = wrds_update('factors_daily', 'ff')

# Compustat
company = wrds_update('company', 'comp')
if company:
    process_sql('CREATE INDEX ON comp.company (gvkey)', engine)

funda = wrds_update('funda', 'comp', fix_missing=True)
if funda:
    process_sql('CREATE INDEX ON comp.funda (gvkey)', engine)

wrds_update('funda_fncd', 'comp')

fundq = wrds_update('fundq', 'comp', fix_missing=True)
if fundq:
    process_sql('CREATE INDEX ON comp.fundq (gvkey, datadate)', engine)

updated = wrds_update('r_auditors', 'comp')

idx_daily = wrds_update('idx_daily', 'comp')
if idx_daily:
    process_sql('CREATE INDEX ON comp.idx_daily (gvkeyx)', engine)

aco_pnfnda = wrds_update('aco_pnfnda', 'comp')
if aco_pnfnda:
    process_sql('CREATE INDEX ON comp.aco_pnfnda (gvkey, datadate)', engine)

# Compustat segment data
wrds_update('seg_customer', 'compseg')
wrds_update('names_seg', 'compseg')
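Once the script has run, the data can be used from R much as elsewhere in the book. Because RPostgres reads the PGHOST, PGDATABASE, and related environment variables set above, the connection needs no extra arguments. A minimal sketch (the date filter is purely illustrative):
library(DBI)
library(dplyr)
library(dbplyr)

# Connects to the local server using PGHOST, PGDATABASE, PGUSER, etc.
db <- dbConnect(RPostgres::Postgres())

dsf <- tbl(db, in_schema("crsp", "dsf"))

# Count daily observations since the start of 2020
dsf |>
  filter(date >= "2020-01-01") |>
  count()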
1. In fact, copy_inline() was added to dbplyr after a request related to this book: https://github.com/tidyverse/dbplyr/issues/628.
2. On Ubuntu, we can install PostgreSQL using, say, sudo apt install postgresql-16; on macOS, Postgres.app provides a very easy-to-manage PostgreSQL implementation: https://postgresapp.com.