2 Research computing overview
In this chapter, we begin by providing a brief overview of the research computing landscape with a focus on the various statistical programming languages most commonly used in empirical accounting research. We then discuss some fundamental ideas associated with data management that help to explain the approach we emphasize in this book. We put this chapter so early in the book not because this is a book primarily about research computing, but because computing is so tightly woven into the remaining chapters that we need to cover some basics first, including a fast-paced tutorial on R in Chapter 3.
A downside of developing a course that involves hands-on data analysis is that material in at least one statistical programming language is needed. In this course, we focus on R and, somewhat indirectly, PostgreSQL as the tools. We believe that this is much less restrictive than it may seem. However, before explaining this choice, it makes sense to discuss the primary alternatives.
According to Wikipedia, “SAS is a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics.” SAS has long been a staple of research in accounting and finance, in part because it is used by Wharton Research Data Services (WRDS, pronounced “words”). WRDS provides its data sets in SAS data format and historically its research computing servers offered SAS as the primary programming language.
SAS offers a number of advantages to researchers.
- Rich set of statistical tools
- Ability to use SQL to prepare data sets (via
- Ability to process large data sets due to an on-disk orientation to data processing
- Availability to most researchers via WRDS servers
However, SAS has some disadvantages:
- Specialized programming language is relatively foreign to users of other languages
- Proprietary software limited to Windows and Linux (i.e., not MacOS)
- Limited set of data types
- Limitations in handling textual data
- Some important functionality (e.g., graphs) seems fairly inaccessible to most users
According to Wikipedia, “Stata is a general-purpose statistical software package created in 1985 by StataCorp. Most of its users work in research, especially in the fields of economics, sociology, political science, biomedicine, and epidemiology.”
Stata offers a number of advantages to researchers.
- Rich set of statistical tools, especially for economists and those in downstream disciplines, such as finance and accounting
- Availability on all three end-user operating systems (Windows, MacOS, and Linux)
- Friendly interface coupled with command-based orientation suits many workflows
However, Stata has some disadvantages:
- Proprietary system (i.e., it costs money)
- In-memory orientation makes it unsuitable for large data sets
- Historical limitation to one in-memory data set made data preparation challenging
- Apparent tendency for Stata users to produce code that is opaque
According to Wikipedia, “Python is an interpreted, high-level and general-purpose programming language.” While not designed specifically for statistical or econometric analyses, Python has become a mainstay of the data science community, and there exist many packages that make it possible to perform research computing tasks for accounting research in Python.
Python offers a number of advantages to researchers.
- As a full-featured computer programming language, pretty much anything can be done using Python
- Rich set of statistical tools, especially in areas such as statistical learning
- Availability on all three end-user operating systems (Windows, MacOS, and Linux)
However, Python has some disadvantages:
- Due to its origins as a general-purpose computing language, some design choices are not optimized for data analysis
- A little more difficult to set up for a relatively non-technical user
PostgreSQL is in some ways out of place in this list, as it is not a programming language or statistical software package. But given that much of this course uses PostgreSQL, we discuss it briefly here. According to its website, “PostgreSQL is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.” For our purposes, the relevant fact is that PostgreSQL is a powerful system for storing data in a way that it can be processed, combined, and used in analysis.
There are a number of similar database systems, including SQLite, MySQL, and Oracle. We use PostgreSQL in part because we make extensive use of WRDS data, which is made available in a PostgreSQL database, but also because PostgreSQL is a very robust database system offering perhaps the strongest feature set among the main open-source alternatives.
Starting from Chapter 7, we make extensive use of data sources from the PostgreSQL supplied by WRDS. WRDS essentially provides data in two forms: as SAS data files (accessible via SFTP) and as tables in a PostgreSQL database.
In Chapter 3 we provide an extended tutorial on R with an emphasis on the so-called Tidyverse.
dbplyr, tables in the WRDS database can be represented as a kind of data frame that allow us “to use remote database tables as if they are in-memory data frames by automatically converting
dplyr code into SQL” (we explain what a data frame is in Chapter 3).
In this way, the
dbplyr package allows us to deploy the skills we learn in Chapter 3 to the manipulation and analysis of data stored in relational databases almost seamlessly.
The best way to understand these ideas is through directions for hands-on use, which we provide in subsequent chapters.4
According to Wikipedia, “R is a programming language and free software environment for statistical computing and graphics”.
Relative to SAS, R has the benefit of being open-source and free. Like SAS, there is a seemingly endless list of packages to support almost any statistical procedure or data management task that one can think of. A disadvantage of R is that, like Stata, its default approach uses in-memory data analysis. However, as we will see, this disadvantage can be overcome with packages that facilitate on-disk data manipulation and analysis.
2.1.6 Why choose R?
This book contains a lot of R code and a reader might ask why we chose R for this book and we provide an answer in this section.
A more basic question might be: Why choose any software package? We believe that there are significant pedagogical benefits to covering not only research papers and their methods, but in studying in detail the computing steps required to apply those methods to produce the results in those papers. For example, having actual code that a reader can tweak makes it easy for the reader to appreciate how the results might have been different if different approaches had been used. Producing the numerous replications found in this book required a significant investment of time that a reader is unlikely to be able to make for everything covered in this book, especially for a reader who is new to research computing.
Having decided to include code in the book, why did we choose R in particular? An alternative approach would have been to include code in multiple languages. For example, “The Effect: An Introduction to Research Design and Causality” (Huntington-Klein, 2021) provides code in Python, R, and Stata. And “Causal Inference: The Mixtape” (Cunningham, 2021) provides code in R and Stata.
There are two critical differences between these two books and ours. First, we felt it was important to include code to reproduce every analysis in the book. In contrast, “The Effect” and “The Mixtape” only include code to produce some of the analyses they contain. Including code for every analysis in those books would likely have resulted in significant greater length and complexity, and doing so in multiple languages even more so.
Second, “The Effect” and “The Mixtape” largely skip complicated data steps and use small data sets.
In contrast, we generally go through all the data steps required to produce our analyses.5
And also we use (relatively!) large data sets, such as CRSP and Compustat that require paying attention to the “logistics” of getting the data into whatever statistical analysis package in being used.
In this regard, the approach we use in this book provides a particularly low-cost way to access these data sets, as we discuss in Section 7.1.
Achieving similar results with either Python or Stata would require in most cases a significant “data step” (to use SAS terminology) using SQL followed by data analysis of the resulting data frames using native tools.
We believe that the
dplyr-based approach used here results in code that is easier to understand.6
Finally, this book has been written with the
bookdown package for R and this package works most seamlessly with R code.
Producing this book using Stata, SAS, or even Python would have been much more challenging.7
Even though this book uses R, we believe it should be of interest to dedicated users of SAS, Stata, and Python.
First, one can probably get a lot of out this book with a minimal investment in learning R.
We have endeavoured to make the code understandable even if viewed as a kind of pseudo-code (in this regard, we think R—especially with
dplyr—is superior to, say, SAS or Stata).
While one could read the book without running code, we strongly encourage you to run code in some places, even if just to be able to do some of the exercises.
After some initial set-up (see Section 1.2.1), every analysis in the book can be produced by copying and pasting the code into R and the book has been written so that the code in any chapter can be run independently of that in other chapters (though code later in a chapter may depend on that earlier in the chapter).
Second, we believe that some investment in R is worthwhile even if you will remain a dedicated user of, say, Stata. Even a dedicated Stata user is likely to benefit from the ability to read SAS, as co-authors may use it, and authors of papers may supply SAS code. As R’s popularity increases, the ability to read R is likely to become more important. Already R is a kind of lingua franca in much of statistics, including (along with Python) machine learning.
Finally, the researcher who is a “dedicated user of X” is probably an endangered species. Most emerging researchers will likely need to use multiple statistical languages over their careers and R is a good candidate to be one of those.8
2.2 Data management
In this course, the larger data sets we will use are stored in a PostgreSQL database. But researchers use data stored in a variety of forms.
2.2.1 Data storage
We can think of three basic approaches to data storage:
- Text files
- Package-specific storage
- Relational databases
One way to store data is in text files, such as comma-separated values (CSV) files.
These text files, which may be compressed to reduce storage space requirements and increase read speeds, can then be read into pretty much any statistical programming package.
So long as one addresses issues such as encoding (e.g., UTF-8 versus Windows-1252) and quote characters (e.g.,
') embedded in values, text files are a robust approach to data storage and arguably the best way to share certain kinds of data with random strangers (e.g., posting
.csv files on a webpage).
However, text files suffer from being an essentially “dead” form of data when doing analysis, meaning that one needs to bring the data to life (typically by reading them into a computer’s memory) in order to actually use them.
This disadvantage matters when one is interested in using only small slivers of data at a time, which is very common in accounting research.
For example, suppose we are interested in getting the stock returns for Microsoft Corporation from 3 January 2018 to 6 January 2018.
If the data source is a text file version of CRSP’s daily stock file (
crsp.dsf), then we would typically need to read in the entire text file, which would take several seconds, even with a very fast storage medium (such as solid-state drives), and require several gigabytes of memory (
crsp.dsf is about 16 GB in SAS format).
Additionally, it is generally necessary to provide additional information about a data set so that statistical packages read them correctly.
For example, should a field with values like
256432 be read as text, as floating-point numerical values, or as integers?
Or if a field contains date-time values, what is the date format? And what is the time zone?
Package-specific storage means storing data as, say, SAS or Stata data files.
This approach can address some weaknesses of text files.
First, most statistical software packages have a richer set of data types and these are generally embedded into the native data storage formats for each package.
Second, some statistical software packages have the ability to make on-disk data “alive” by indexing it.
For example, SAS allows the user to index data so that small portions of the data can be loaded quickly without having to process the entire data file (e.g., the SAS data file for
crsp.dsf on WRDS is indexed by
date; so accessing the stock-return data for Microsoft for those few dates is quite fast).
A relational database provides a more rigid, structured way of storing data. Almost all database packages use Structured Query Language (SQL) as the primary language for the user to interact with the data. We include a brief primer on SQL in Appendix C. This course will focus on data stored in a PostgreSQL database, specifically the PostgreSQL database offered by WRDS. Having data in a relational database offers a number of advantages, which we will elaborate in the next section.
2.2.2 Benefits of using relational databases
Fast random access to on-disk data. Relational databases make it easy to index data, which makes it “alive” rather than “dead” in the sense we use here. We will see that accessing specific observations from large data sets is fast and easy with a relational database. Accessing stock-return data on
crsp.dsfin the WRDS PostgreSQL database for Microsoft for a few dates is quite fast.9
Data accessible by any software package. Data stored in a PostgreSQL database is accessible from R, Stata, Python, Perl, … pretty much any software. This has subtle but important advantages for data collaboration and multilingual programming. For example, you may be a Stata devotee who needs help preparing a data set from Web data. Finding people with the skills to do this is going to be a lot easier if you don’t need them to have Stata skills, but can work with a Python guru. With a relational database, a Stata user is unlikely to care much if the data coming out of PostgreSQL was put there using Python. In another example, you may have written some complicated Perl code to generate some data sets, and occasionally need to tweak the code to accommodate changes in the source data. As long as the data sets are stored in a PostgreSQL database, you will have no need to translate the entire code into something you still know how to code in. This benefit is probably also the primary driver of WRDS’s decision to add a PostgreSQL database to its offerings. With more users wanting to use Python or R for analysis, sticking to a SAS-format-only approach was surely not viable.
Data accessible from anywhere. Data on the WRDS PostgreSQL server can be accessed from anywhere with an internet connection (in the 2020s, this essentially means anywhere) by anyone with a WRDS account. For example, data stored on a PostgreSQL server in Boston can be easily accessed by coauthors in California, Australia, or Singapore.10
Centralised processing power. In this course, we will see a number of cases where the heavy part of data processing for some tasks is accomplished on the WRDS PostgreSQL server even when executing R code locally. While SAS has similar functionality with
RSUBMIT, running queries on a remote PostgreSQL server from RStudio is much more seamless.
Data can be (and should be) properly typed. PostgreSQL in particular has a much richer set of data types than a typical statistical programming package. For example, everything in SAS is fundamentally of either a floating-point numerical type or a fixed-width character type. Things such as dates and times are essentially formatting overlays over these types, and such formatting is easy to break, sometimes causing serious issues in practice. (Stata is fairly similar.) For example, PostgreSQL offers standard types such as strings, floating-point, integers, large integers, along with timestamps with time zones, JSON, XML, and other types. Most of the commonly used types are recognized fairly seamlessly and converted to appropriate types when data are brought into R (e.g., timestamps with time zones).
We say “can be” in the heading to this section because whether the data actually have appropriate types is a function of how much care was used in getting the data into the database in the first place. For example, at the time of writing, WRDS is often fairly careless about data types. So variables that are naturally integers, such as firm identifiers like CIKs or PERMNOs, are
double precisionrather than
integertypes.11 Text variables are generally
character varying, even though PostgreSQL’s
texttype is simpler and equally efficient. And timestamps are invariably of type
timestamp without time zonewhen they arguably should always be
timestamp with time zone, as a timestamp has no clear meaning without a time zone.
Textual data are handled with ease. Stata says that “Unicode is the modern way that computers encode characters such as the letters in the words you are now reading.” Yet prior to Stata 14, Stata did not support it. SAS files provided by WRDS have traditionally been encoded in some kind of Latin encoding. And SAS and Stata have limitations with regard to the size of text fields.
In contrast, PostgreSQL uses Unicode by default, as do Python and R, and there is effectively no constraint on the size of text fields in PostgreSQL.