Appendix C — PostgreSQL
Throughout the book, we have used the WRDS PostgreSQL server. While the WRDS PostgreSQL server offers access to a large menagerie of data with very little set-up, there are limitations to using the WRDS server alone for research. The purpose of this appendix is to provide some ideas for how you might run your own PostgreSQL server.
C.1 Benefits of using relational databases
- Fast random access to on-disk data. Relational databases make it easy to index data, which makes it “alive” rather than “dead” in the sense we use here. We will see that accessing specific observations from large data sets is fast and easy with a relational database. For example, accessing stock-return data on crsp.dsf in the WRDS PostgreSQL database for Microsoft for a few dates is quite fast.1 (A sketch of such a query appears after this list.)
- Data accessible by any software package. Data stored in a PostgreSQL database are accessible from R, Stata, Python, Perl, … pretty much any software. This has subtle but important advantages for data collaboration and multilingual programming. For example, you may be a Stata devotee who needs help preparing a data set from web data. Finding people with the skills to do this is going to be a lot easier if you don’t need them to have Stata skills, but can instead work with a Python guru. With a relational database, a Stata user is unlikely to care much whether the data coming out of PostgreSQL were put there using Python. As another example, you may have written some complicated Perl code to generate some data sets and occasionally need to tweak that code to accommodate changes in the source data. As long as the data sets are stored in a PostgreSQL database, you will have no need to translate the entire code base into a language you still know how to code in. This benefit is probably also the primary driver of WRDS’s decision to add a PostgreSQL database to its offerings: with more users wanting to use Python or R for analysis, sticking to a SAS-format-only approach was surely not viable.
- Data accessible from anywhere. Data on the WRDS PostgreSQL server can be accessed from anywhere with an internet connection (in the 2020s, this essentially means anywhere) by anyone with a WRDS account. For example, data stored on a PostgreSQL server in Boston can be easily accessed by coauthors in California, Australia, or Singapore.2
- Centralised processing power. In this book, we see a number of cases where the heavy part of data processing for some tasks is accomplished on the WRDS PostgreSQL server even when executing R code locally. While SAS has similar functionality with RSUBMIT, running queries on a remote PostgreSQL server from RStudio is much more seamless.
- Data can be (and should be) properly typed. PostgreSQL in particular has a much richer set of data types than a typical statistical programming package. For example, everything in SAS is fundamentally of either a floating-point numerical type or a fixed-width character type. Things such as dates and times are essentially formatting overlays over these types, and such formatting is easy to break, sometimes causing serious issues in practice. (Stata is fairly similar.) In contrast, PostgreSQL offers standard types such as text, floating-point, integer, and big-integer types, along with timestamps with time zones, JSON, XML, and others. Most of the commonly used types are recognized fairly seamlessly and converted to appropriate types when data are brought into R (e.g., timestamps with time zones).
  We say “can be” in the heading of this item because whether the data actually have appropriate types depends on how much care was used in getting the data into the database in the first place. For example, at the time of writing, WRDS is often fairly careless about data types. Variables that are naturally integers, such as firm identifiers like CIKs or PERMNOs, are double precision rather than integer types.3 Text variables are generally character varying, even though PostgreSQL’s text type is simpler and equally efficient. And timestamps are invariably of type timestamp without time zone when they arguably should always be timestamp with time zone, as a timestamp has no clear meaning without a time zone.
- Textual data are handled with ease. Stata says that “Unicode is the modern way that computers encode characters such as the letters in the words you are now reading.” Yet prior to Stata 14, Stata did not support it. SAS files provided by WRDS have traditionally been encoded in some kind of Latin encoding, and both SAS and Stata have limitations with regard to the size of text fields. In contrast, PostgreSQL uses Unicode by default, as do Python and R, and there is effectively no constraint on the size of text fields in PostgreSQL.
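To give a concrete sense of the first benefit above, the following R sketch pulls a few days of Microsoft returns from crsp.dsf over a remote connection. The connection settings are those published by WRDS; the PERMNO (10107, Microsoft) and the date range are merely illustrative, and the sketch assumes that your WRDS credentials are available (for example, via the WRDS_ID environment variable and a password file, as discussed below).

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Remote connection to the WRDS PostgreSQL server
wrds <- dbConnect(RPostgres::Postgres(),
                  host = "wrds-pgdata.wharton.upenn.edu",
                  port = 9737, dbname = "wrds",
                  sslmode = "require",
                  user = Sys.getenv("WRDS_ID"))

dsf <- tbl(wrds, in_schema("crsp", "dsf"))

# Indexed lookup: a handful of rows from a roughly 20 GB table
dsf |>
  filter(permno == 10107L,
         date >= "2023-01-03", date <= "2023-01-10") |>
  select(permno, date, prc, ret) |>
  collect()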
C.2 Setting up a personal server
Setting up your own PostgreSQL server overcomes some of the disadvantages of using the WRDS server.
One disadvantage is that, because WRDS only allows read-only connections, one cannot store computed results on the WRDS PostgreSQL server. There are two types of computed tables that one can create. Temporary tables, which are created using the compute() function from dplyr, can dramatically simplify queries and increase performance, but disappear once you disconnect from the database. Permanent tables, which can be created using the compute() function with the temporary = FALSE argument, can be useful for breaking the research process into several discrete steps.
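To illustrate, here is a minimal sketch of compute() with temporary = FALSE. It assumes that db is a connection to a PostgreSQL server you can write to (such as the personal server set up below); the table name msft_rets and the PERMNO filter are illustrative.

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Connection settings come from PGHOST, PGDATABASE, PGUSER, etc. (see below)
db <- dbConnect(RPostgres::Postgres())

dsf <- tbl(db, in_schema("crsp", "dsf"))

msft_rets <-
  dsf |>
  filter(permno == 10107L) |>        # Microsoft
  select(permno, date, ret) |>
  compute(name = "msft_rets",        # materialise the query as a table
          temporary = FALSE)         # permanent: survives disconnection

# Omit temporary = FALSE (the default is TRUE) to get a temporary table instead.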
Another disadvantage of the WRDS server is that it limits our ability to use our own data on the WRDS server. While we can use the copy_inline() function from the dbplyr package to create something similar to a temporary table on the server using local data, this will not be the best approach if our local data sets are large.4 And while we could download crsp.dsf and do the analysis on our local computer, this is likely to be painful even if we have enough RAM to store crsp.dsf.
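Here is a minimal sketch of copy_inline(). The small events table, its PERMNOs (10107 for Microsoft and 14593 for Apple), and the event dates are illustrative assumptions; the point is that the local rows are spliced into the SQL sent to the server, so nothing needs to be written to the (read-only) WRDS database.

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

wrds <- dbConnect(RPostgres::Postgres(),
                  host = "wrds-pgdata.wharton.upenn.edu",
                  port = 9737, dbname = "wrds",
                  sslmode = "require",
                  user = Sys.getenv("WRDS_ID"))

# A small local data set (hypothetical event dates)
events <- tibble(permno = c(10107L, 14593L),
                 event_date = as.Date(c("2023-01-05", "2023-01-06")))

dsf <- tbl(wrds, in_schema("crsp", "dsf"))

# copy_inline() embeds the local rows in the query sent to the server
event_rets <-
  dsf |>
  inner_join(copy_inline(wrds, events),
             by = c("permno", "date" = "event_date")) |>
  select(permno, date, ret) |>
  collect()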
An alternative approach that builds on the framework used in this book is to create your own PostgreSQL database and use that. While there are some set-up and maintenance costs associated with this, it’s actually fairly manageable using the steps we describe below.
C.2.1 Installation steps
- Install Python. You may already have Python installed. If not, Miniconda is a good option.
- Install the wrds2pg package:
pip install wrds2pg --upgrade
- Install PostgreSQL and initialize the server.
Installation of PostgreSQL varies by platform (and on some platforms there are multiple approaches).5 A good place to start is the PostgreSQL webpage.
- Create a database.
You may have already done this in the previous step. If not, and you have the server running, you could do this using SQL:
CREATE DATABASE wrds;
- Set up environment variables and so on for wrds2pg. See here for details.
export PGHOST="localhost"
export PGDATABASE="wrds"
export WRDS_ID="iangow"
export PGUSER="igow"
You might also use the environment variable PGPASSWORD:
export PGPASSWORD="password"
But it is probably better to use a password file (see the sketch at the end of this list).
- Run Python.
- Within Python, use the wrds2pg module to get data:
from wrds2pg import wrds_update
wrds_update("dsi", "crsp", dbname="wrds", host="localhost")
C.2.2 Getting the tables used here
Following the steps above, the script below can be used to get the tables used in this book. Note that a few tables require special handling.
Audit Analytics tables are merged (seemingly by WRDS) to create very large tables in which most of the variables relate to financial statement items. Most researchers do not use these data, so the code below uses a SAS wildcard to drop such variables (note that wrds2pg actually runs SAS code on the WRDS server behind the scenes). While modifying these tables in this way breaks the idea of the local tables closely reflecting their WRDS counterparts, the benefit of having a cleaner, leaner table seems worthwhile in this case.
Some tables on CRSP and Compustat have special missing values that SAS’s PROC EXPORT procedure turns into simple character values, which PostgreSQL cannot accept in fields that are not of text types. The fix_missing = True argument to wrds_update converts such special missing values into regular missing values.
The script also fixes variable types that are not appropriately specified in the original SAS files (e.g., permno should be integer).
Finally, the script creates indexes, which dramatically increase the performance of table joins and filter() operations.
Note that the script below likely takes a few hours to run, primarily because of crsp.dsf, which is about 20 GB of data. However, subsequent runs of the script will only download data if the SAS data file on WRDS has been updated and so will usually run much faster.
#!/usr/bin/env python3
from wrds2pg import wrds_update, make_engine, process_sql

engine = make_engine()

# Audit Analytics
updated = wrds_update("auditnonreli", "audit",
                      drop="prior: match: closest: disc_text:",
                      col_types={"res_accounting": "boolean",
                                 "res_fraud": "boolean",
                                 "res_cler_err": "boolean",
                                 "res_adverse": "boolean",
                                 "res_improves": "boolean",
                                 "res_other": "boolean",
                                 "res_sec_invest": "boolean",
                                 "res_begin_aud_fkey": "integer",
                                 "res_notif_key": "integer",
                                 "current_aud_fkey": "integer",
                                 "res_end_aud_fkey": "integer",
                                 "file_date_aud_fkey": "integer"})
if updated:
    process_sql("CREATE INDEX ON audit.auditnonreli (res_notif_key)", engine)

# CRSP
ccmxpf_lnkhist = wrds_update("ccmxpf_lnkhist", "crsp", fix_missing=True,
                             col_types={"lpermno": "integer",
                                        "lpermco": "integer"})
if ccmxpf_lnkhist:
    process_sql("CREATE INDEX ON crsp.ccmxpf_lnkhist (gvkey)", engine)
    process_sql("CREATE INDEX ON crsp.ccmxpf_lnkhist (lpermno)", engine)
    process_sql("CREATE INDEX ON crsp.ccmxpf_lnkhist (lpermco)", engine)

dsf = wrds_update("dsf", "crsp", fix_missing=True,
                  col_types={"permno": "integer",
                             "permco": "integer"})
if dsf:
    process_sql("CREATE INDEX ON crsp.dsf (permno, date)", engine)
    process_sql("CREATE INDEX ON crsp.dsf (permco)", engine)

dsi = wrds_update("dsi", "crsp")
if dsi:
    process_sql("CREATE INDEX ON crsp.dsi (date)", engine)

wrds_update("comphist", "crsp", fix_missing=True)

dsedelist = wrds_update("dsedelist", "crsp", fix_missing=True,
                        col_types={"permno": "integer",
                                   "permco": "integer"})
if dsedelist:
    process_sql("CREATE INDEX ON crsp.dsedelist (permno)", engine)

dseexchdates = wrds_update("dseexchdates", "crsp",
                           col_types={"permno": "integer",
                                      "permco": "integer"})
if dseexchdates:
    process_sql("CREATE INDEX ON crsp.dseexchdates (permno)", engine)

erdport1 = wrds_update("erdport1", "crsp", fix_missing=True,
                       col_types={"permno": "integer",
                                  "capn": "integer"})
if erdport1:
    process_sql("CREATE INDEX ON crsp.erdport1 (permno, date)", engine)
= wrds_update("ccmxpf_lnkhist", "crsp", fix_missing=True,
updated = {"lpermno": "integer",
col_types "lpermco": "integer"})
if updated:
"CREATE INDEX ON crsp.ccmxpf_lnkhist (gvkey)", engine)
process_sql(
= wrds_update("msf", "crsp", fix_missing=True,
updated = {"permno": "integer",
col_types "permco": "integer"})
if updated:
"CREATE INDEX ON crsp.msf (permno, date)", engine)
process_sql(
= wrds_update("msi", "crsp")
updated if updated:
"CREATE INDEX ON crsp.msi (date)", engine)
process_sql(
= wrds_update("mse", "crsp", fix_missing=True,
mse = {"permno": "integer",
col_types "permco": "integer"})
= wrds_update("stocknames", "crsp",
stocknames = {"permno": "integer",
col_types "permco": "integer"})
= wrds_update("dsedist", "crsp", fix_missing=True,
dsedist = {'permno':'integer',
col_types 'permco':'integer'})
if dsedist:
"CREATE INDEX ON crsp.dsedist (permno)", engine)
process_sql(
= wrds_update("factors_daily", "ff")
updated
# Compustat
= wrds_update("company", "comp")
updated if updated:
"CREATE INDEX ON comp.company (gvkey)", engine)
process_sql(
= wrds_update("funda", "comp", fix_missing = True)
updated if updated:
"CREATE INDEX ON comp.funda (gvkey)", engine)
process_sql(
= wrds_update("fundq", "comp", fix_missing = True)
updated if updated:
"CREATE INDEX ON comp.fundq (gvkey, datadate)", engine)
process_sql(
= wrds_update("r_auditors", "comp")
updated
= wrds_update("idx_daily", "comp")
updated if updated:
"CREATE INDEX ON comp.idx_daily (gvkeyx)", engine)
process_sql(
= wrds_update("aco_pnfnda", "comp")
updated if updated:
"CREATE INDEX ON comp.aco_pnfnda (gvkey, datadate)", engine)
process_sql(
# The segment data is in comp_segments_hist_daily in PostgreSQL,
# but in compsegd in SAS, so we need to specify sas_schema.
"seg_customer", "comp_segments_hist_daily",
wrds_update(="compsegd")
sas_schema"names_seg", "comp_segments_hist_daily",
wrds_update(="compsegd") sas_schema
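Once the script has finished, the R code used throughout the book can be pointed at your own server simply by relying on the environment variables set earlier. A minimal sanity check might look like the following sketch (the summary query is illustrative):

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# With PGHOST, PGDATABASE, and PGUSER set as above, no arguments are needed
db <- dbConnect(RPostgres::Postgres())

dsf <- tbl(db, in_schema("crsp", "dsf"))

dsf |>
  summarise(num_rows = n(),
            first_date = min(date, na.rm = TRUE),
            last_date = max(date, na.rm = TRUE)) |>
  collect()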
To illustrate, pulling returns and prices for Microsoft using a remote query to the WRDS PostgreSQL server takes 27 milliseconds. Most of this is due to remote access; a local query takes less than one millisecond.↩︎
This book only requires PostgreSQL data stored in the WRDS database, but we include a brief overview of the steps required to set up one’s own research server in Appendix C.↩︎
This can have significant implications for performance.↩︎
As can be seen here, this function was added to dbplyr after a request related to the creation of this course.↩︎
On Ubuntu, we can install PostgreSQL using, say, sudo apt install postgresql-14; on macOS, Postgres.app provides a very easy-to-manage PostgreSQL implementation.↩︎