Chapter 1 Introduction

This books describes a computing platform for research computing. The major characters in this book include a relational database (I focus on PostgreSQL), version control (using Git and hosting by GitHub or Bitbucket), and open source programming languages (I focus on R and Python). Other characters with minor roles include Google Sheets, Mechanical Turk, and Fiverr.

In addition to these characters, this book will emphasize a number of habits and processes that these all together. In some ways, the value is not in the tools listed above, but in how they are brought to bear on research problems.

1.1 A taxonomy of data

The emphasis in this book is on empirical research, which is largely about data. One challenge with understanding alternatives approaches to data is the fact that data comes in many forms and has a number of types. While many textbook treatments cover critical aspects of working with data, some of the complexity of data is often omitted from such discussions. To understand some of the issues, I will discuss a number of dichotomies that data can be understood to fall into.

1.1.1 Project-specific versus general data

For many researchers, the world might be divided into projects and each project might be understood as comprising a primary final output in the form of a paper. In many discussions, it is recommended that researchers keep their data separate from code by having, say, a subdirectory of the main project directory labelled data and another labelled code (or something similar).

My experience is that, while there is often data that is specific to a particular project, much of the data I work with has applications across multiple projects. Clearly data from the Center for Research in Security Prices (CRSP), whose “US Stock Databases contain daily and monthly market and corporate action data for securities with primary listings on the NYSE, NYSE MKT, NASDAQ, and Arca exchanges and include CRSP broad market indexes”, have applications across multiple projects.1

Some researchers handle general data by maintaining project-specific copies of the data. There is the download of part of the CRSP daily stock file that I made in 2015 for the paper on earnings management around IPOs. Then there is the download of the same data set that I made a few months later for the paper on the new lease accounting standard.

But other data sets might start life as project-specific data

1.2 Why a relational database?

A key point of this book regards the benefits of using a relational database as the primary data repository.2 To discuss the benefits of a relational database, it is helpful to understand the common alternatives.

1.2.1 Alternative 1: Data files in statistical package of choice

One common approach is to primarily use one statistical programming package (e.g., Stata) and to keep data in the native format of that package.

1.3 Why PostgreSQL?

Much of this book involves the use of PostgreSQL as the backbone of a setup for empirical research in accounting, finance, economics, and fields with similar kinds of data.3 You may wonder why I emphasize PostgreSQL and not, say, MySQL or SQLite or even something hosted in the cloud, such as Google’s Redshift. I do so for a number of reasons:

  • To provide concrete examples. Any expert in SQL or relational databases knows that, while there are standards such as SQL 92 and SQL-2003, no system closely follows these standards.4 As such, to provide concrete working examples, I would often need to provide multiple versions for the different implementations.5 Instead, by focusing on PostgreSQL, I can provide working code without boring readers with the details of different implementations.
  • Other systems are similar. My experience is that if you understand one SQL-based system, it is pretty easy to apply that knowledge to another system. This is particularly true when using packages such as dplyr (for R) or SQLAlchemy (for Python), as these packages attempt to abstract from platform-specific details. My hope is that if you use, say, MySQL, then much of what I say will carry over to that platform.6
  • PostgreSQL is the system I use. A prosaic justification for featuring PostgreSQL is that it’s what I use. Trying to explain something I don’t use would be problematic, and setting up MySQL just to write this does not seem to be a good use of limited time.
  • PostgreSQL is very robust. When I decieded to migrate my data into a relational database in 2010 and 2011, I initially tried SQLite. This is incredibly easy to set up and is available everywhere, but after the second time a 20GB data table was corrupted, I decided it wasn’t quite what I was looking for. Since switching to PostgreSQL in 2011, I haven’t experience data corruption or database crashes.7
  • PostgreSQL has some great features. I also dabbled with MySQL alongside PostgreSQL for a while, but in the end was captivated by some features of PostgreSQL, some of which are compelling for researchers. These include the very rich set of data types supported by PostgreSQL, the availability of procedural languages for in-database programming, and the support for modern SQL features such as common-table expressions and window functions. I will discuss what these are and how they are useful for researchers in Chapter 3. While PostgreSQL has many compelling features for a wide range of situations, in the system I describe, PostgreSQL is largely functioning as a data repository. As such, many of the benefits could be obtained using another platform, such as MySQL.
  • PostgreSQL is good value for money. All this for nothing. PostgreSQL is supported by an enthusiastic and active group of users and developers.
  • A PostgreSQL server is now available from WRDS. Wharton Research Data Services (WRDS) is a major sourc of data for researchers in finance, accounting and other fields. WRDS recently began offering a server on wrds-pgdata.wharton.upenn.edu running on port 9737. This is some kind of a vote of confidence in PostgreSQL and also creates additional reasons to use PostgreSQL as doing so creates multiple ways to access WRDS data from within PostgreSQL.

  1. For many researchers in finance and accounting, it might be hard to imagine a study that did not use CRSP data.

  2. In ways I will discuss, it is effectively my exclusive data store.

  3. While PostgreSQL is the backbone, the research examples I will provide will generally use R or Python as the main coding language.

  4. My understanding is that PostgreSQL is one of the more standards-compliant implementations available.

  5. This is the approach taken by the SQL Cookbook.

  6. Another possibility is that you might like what you see in PostgreSQL as make the switch.

  7. Sometimes I have crashed the machine on which the database has been running, but that has never been PostgreSQL’s fault.