Appendix B — Research computing overview
In this appendix, we begin by providing a brief overview of the research computing landscape, with a focus on the statistical programming languages most commonly used in empirical accounting research. We then discuss some fundamental ideas associated with data management that help to explain the approach we emphasize in this book. We include this material not because this is a book primarily about research computing, but because computing is so tightly woven into the remaining chapters that we need to cover some basics first, including a fast-paced tutorial on R in Chapter 2.
B.1 Languages
A downside of developing a course that involves hands-on data analysis is that it requires material in at least one statistical programming language. In this course, we focus on R and, somewhat indirectly, PostgreSQL as the tools. We believe that this choice is much less restrictive than it may seem. However, before explaining it, it makes sense to discuss the primary alternatives.
B.1.1 SAS
According to Wikipedia, “SAS is a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics.” SAS has long been a staple of research in accounting and finance, in part because it is used by Wharton Research Data Services (WRDS, pronounced “words”). WRDS provides its data sets in SAS data format and historically its research computing servers offered SAS as the primary programming language.
SAS offers a number of advantages to researchers.
- Rich set of statistical tools
- Ability to use SQL to prepare data sets (via `PROC SQL`)
- Ability to process large data sets due to an on-disk orientation to data processing
- Availability to most researchers via WRDS servers
However, SAS has some disadvantages:
- Specialized programming language is relatively foreign to users of other languages
- Proprietary software limited to Windows and Linux (i.e., not MacOS)
- Limited set of data types
- Limitations in handling textual data
- Some important functionality (e.g., graphs) seems fairly inaccessible to most users
B.1.2 Stata
According to Wikipedia, “Stata is a general-purpose statistical software package created in 1985 by StataCorp. Most of its users work in research, especially in the fields of economics, sociology, political science, biomedicine, and epidemiology.”
Stata offers a number of advantages to researchers.
- Rich set of statistical tools, especially for economists and those in downstream disciplines, such as finance and accounting
- Availability on all three end-user operating systems (Windows, MacOS, and Linux)
- Friendly interface coupled with command-based orientation suits many workflows
However, Stata has some disadvantages:
- Proprietary system (i.e., it costs money)
- In-memory orientation makes it unsuitable for large data sets
- Historical limitation to one in-memory data set made data preparation challenging
- Apparent tendency for Stata users to produce code that is opaque
B.1.3 Python
According to Wikipedia, “Python is an interpreted, high-level and general-purpose programming language.” While not designed specifically for statistical or econometric analyses, Python has become a mainstay of the data science community, and there exist many packages that make it possible to perform research computing tasks for accounting research in Python.
Python offers a number of advantages to researchers.
- As a full-featured computer programming language, Python can be used to do pretty much anything
- Rich set of statistical tools, especially in areas such as statistical learning
- Availability on all three end-user operating systems (Windows, MacOS, and Linux)
However, Python has some disadvantages:
- Due to its origins as a general-purpose computing language, some design choices are not optimized for data analysis
- A little more difficult to set up for a relatively non-technical user
B.1.4 PostgreSQL
PostgreSQL is in some ways out of place in this list, as it is not a programming language or statistical software package. But given that much of this course uses PostgreSQL, we discuss it briefly here. According to its website, “PostgreSQL is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.” For our purposes, the relevant fact is that PostgreSQL is a powerful system for storing data in a form that can be processed, combined, and used in analysis.
There are a number of similar database systems, including SQLite, MySQL, and Oracle. We use PostgreSQL in part because we make extensive use of WRDS data, which is made available in a PostgreSQL database, but also because PostgreSQL is a very robust database system offering perhaps the strongest feature set among the main open-source alternatives.
Starting from Chapter 6, we make extensive use of data sources from the PostgreSQL database supplied by WRDS. WRDS essentially provides data in two forms: as SAS data files (accessible via SFTP) and as tables in a PostgreSQL database.
In Chapter 2 we provide an extended tutorial on R with an emphasis on the so-called Tidyverse. Using `dbplyr`, tables in the WRDS database can be represented as a kind of data frame that allows us “to use remote database tables as if they are in-memory data frames by automatically converting `dplyr` code into SQL” (we explain what a data frame is in Chapter 2). In this way, the `dbplyr` package allows us to deploy the skills we learn in Chapter 2 to the manipulation and analysis of data stored in relational databases almost seamlessly. The best way to understand these ideas is through directions for hands-on use, which we provide in subsequent chapters.[^1]
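To give a flavour of how this works, the following is a minimal sketch, assuming a WRDS account (connection details are covered in Chapter 6) and the `RPostgres`, `dplyr`, and `dbplyr` packages; the hard-coded PERMNO for Microsoft (10107) is included purely for illustration.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Connect to the WRDS PostgreSQL database (assumes WRDS credentials
# are supplied via the usual PostgreSQL mechanisms; see Chapter 6).
db <- dbConnect(RPostgres::Postgres(),
                host = "wrds-pgdata.wharton.upenn.edu",
                port = 9737,
                dbname = "wrds",
                sslmode = "require")

# A remote table: no data are downloaded until we ask for results.
dsf <- tbl(db, in_schema("crsp", "dsf"))

msft_rets <-
  dsf |>
  filter(permno == 10107,
         between(date, "2018-01-03", "2018-01-06")) |>
  select(permno, date, ret)

msft_rets |> show_query()  # Inspect the SQL that dbplyr generates
msft_rets |> collect()     # Run the query and bring the rows into R
```

The `filter()` and `select()` calls are translated into a `WHERE` clause and a column list, so only the handful of matching rows ever leave the WRDS server.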
B.1.5 R
According to Wikipedia, “R is a programming language and free software environment for statistical computing and graphics”.
Relative to SAS, R has the benefit of being open-source and free. Like SAS, R offers a seemingly endless list of packages to support almost any statistical procedure or data management task that one can think of. A disadvantage of R is that, like Stata, its default approach uses in-memory data analysis. However, as we will see, this disadvantage can be overcome with packages that facilitate on-disk data manipulation and analysis.
B.1.6 Why choose R?
This book contains a lot of R code, and a reader might ask why we chose R for this book. We provide an answer in this section.
A more basic question might be: why choose any software package? We believe that there are significant pedagogical benefits to covering not only research papers and their methods, but also the computing steps required to apply those methods and produce the results in those papers. For example, having actual code that a reader can tweak makes it easy to appreciate how the results might have been different had different approaches been used. Producing the numerous replications found in this book required a significant investment of time that a reader is unlikely to be able to make for everything covered here, especially a reader who is new to research computing.
Having decided to include code in the book, why did we choose R in particular? An alternative approach would have been to include code in multiple languages. For example, “The Effect: An Introduction to Research Design and Causality” (Huntington-Klein, 2021) provides code in Python, R, and Stata. And “Causal Inference: The Mixtape” (Cunningham, 2021) provides code in R and Stata.
There are two critical differences between these two books and ours. First, we felt it was important to include code to reproduce every analysis in the book. In contrast, “The Effect” and “The Mixtape” include code to produce only some of the analyses they contain. Including code for every analysis in those books would likely have resulted in significantly greater length and complexity, and doing so in multiple languages even more so.
Second, “The Effect” and “The Mixtape” largely skip complicated data steps and use small data sets. In contrast, we generally go through all the data steps required to produce our analyses.[^2] We also use (relatively!) large data sets, such as CRSP and Compustat, that require paying attention to the “logistics” of getting the data into whatever statistical analysis package is being used. In this regard, the approach we use in this book provides a particularly low-cost way to access these data sets, as we discuss in Section 6.1. Achieving similar results with either Python or Stata would in most cases require a significant “data step” (to use SAS terminology) using SQL, followed by data analysis of the resulting data frames using native tools. We believe that the `dplyr`-based approach used here results in code that is easier to understand.[^3]
Finally, this book has been written with the `bookdown` package for R, which works most seamlessly with R code. Producing this book using Stata, SAS, or even Python would have been much more challenging.[^4]
Even though this book uses R, we believe it should be of interest to dedicated users of SAS, Stata, and Python. First, one can probably get a lot out of this book with a minimal investment in learning R. We have endeavoured to make the code understandable even if viewed as a kind of pseudo-code (in this regard, we think R—especially with `dplyr`—is superior to, say, SAS or Stata). While one could read the book without running code, we strongly encourage you to run code in some places, even if just to be able to do some of the exercises. After some initial set-up (see Section 1.2.1), every analysis in the book can be produced by copying and pasting the code into R, and the book has been written so that the code in any chapter can be run independently of that in other chapters (though code later in a chapter may depend on code earlier in that chapter).
Second, we believe that some investment in R is worthwhile even if you will remain a dedicated user of, say, Stata. Even a dedicated Stata user is likely to benefit from the ability to read SAS, as co-authors may use it, and authors of papers may supply SAS code. As R’s popularity increases, the ability to read R is likely to become more important. Already R is a kind of lingua franca in much of statistics, including (along with Python) machine learning.
Finally, the researcher who is a “dedicated user of X” is probably an endangered species. Most emerging researchers will likely need to use multiple statistical languages over their careers, and R is a good candidate to be one of those.[^5]
B.2 Data management
In this course, the larger data sets we will use are stored in a PostgreSQL database. But researchers use data stored in a variety of forms.
B.2.1 Data storage
We can think of three basic approaches to data storage:
- Text files
- Package-specific storage
- Relational databases
One way to store data is in text files, such as comma-separated values (CSV) files. These text files, which may be compressed to reduce storage space requirements and increase read speeds, can then be read into pretty much any statistical programming package. So long as one addresses issues such as encoding (e.g., UTF-8 versus Windows-1252) and quote characters (e.g., `"` or `'`) embedded in values, text files are a robust approach to data storage and arguably the best way to share certain kinds of data with random strangers (e.g., posting `.csv` files on a webpage).
However, text files suffer from being an essentially “dead” form of data when doing analysis, meaning that one needs to bring the data to life (typically by reading them into a computer’s memory) in order to actually use them. This disadvantage matters when one is interested in using only small slivers of data at a time, which is very common in accounting research. For example, suppose we are interested in getting the stock returns for Microsoft Corporation from 3 January 2018 to 6 January 2018. If the data source is a text-file version of CRSP’s daily stock file (`crsp.dsf`), then we would typically need to read in the entire text file, which would take several seconds, even with a very fast storage medium (such as a solid-state drive), and require several gigabytes of memory (`crsp.dsf` is about 16 GB in SAS format). Additionally, it is generally necessary to provide additional information about a data set so that statistical packages read it correctly. For example, should a field with values like `001244`, `123456`, and `256432` be read as text, as floating-point numerical values, or as integers? Or if a field contains date-time values, what is the date format? And what is the time zone?
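As an illustration, here is a hedged sketch using the `readr` package; the file name and column names are hypothetical, but the issues are the ones just described.

```r
library(readr)

# Hypothetical file and columns. Without explicit column types,
# read_csv() guesses: "001244" would typically be parsed as the
# number 1244, silently dropping its leading zeros.
firms <- read_csv("firms.csv",
                  col_types = cols(
                    firm_id = col_character(),  # keep "001244" as text
                    traded_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
                    ret = col_double()),
                  locale = locale(encoding = "UTF-8", tz = "UTC"))
```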
Package-specific storage means storing data as, say, SAS or Stata data files. This approach can address some weaknesses of text files. First, most statistical software packages have a richer set of data types, and these are generally embedded into the native data storage formats for each package. Second, some statistical software packages have the ability to make on-disk data “alive” by indexing it. For example, SAS allows the user to index data so that small portions of the data can be loaded quickly without having to process the entire data file (e.g., the SAS data file for `crsp.dsf` on WRDS is indexed by `permno`, `permco`, and `date`, so accessing the stock-return data for Microsoft for those few dates is quite fast).
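Package-specific files can also be read from R. Here is a minimal sketch using the `haven` package (the file path is hypothetical); note that `read_sas()` cannot exploit SAS indexes, though `col_select` and `n_max` can at least limit how much is parsed.

```r
library(haven)

# Read a SAS data file into R. Unlike SAS itself, read_sas() does not
# use the file's indexes: it scans the file sequentially, so these
# arguments limit what is kept rather than jumping to rows of interest.
dsf_sample <- read_sas("dsf.sas7bdat",
                       col_select = c(permno, date, ret),
                       n_max = 1000)
```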
A relational database provides a more rigid, structured way of storing data. Almost all database packages use Structured Query Language (SQL) as the primary language for the user to interact with the data. We include a brief primer on SQL in Appendix D. This course will focus on data stored in a PostgreSQL database, specifically the PostgreSQL database offered by WRDS. Having data in a relational database offers a number of advantages, which we elaborate on in the next section.
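To make the contrast with text files concrete, here is a minimal sketch, again assuming a WRDS account, that retrieves the Microsoft returns from the example above directly from the WRDS database. The filtering happens inside the database, so only a few rows travel over the network rather than the whole of `crsp.dsf`.

```r
library(DBI)

db <- dbConnect(RPostgres::Postgres(),
                host = "wrds-pgdata.wharton.upenn.edu",
                port = 9737,
                dbname = "wrds",
                sslmode = "require")

# The WHERE clause is evaluated by PostgreSQL; only the matching
# rows are returned to R.
dbGetQuery(db, "
  SELECT permno, date, ret
  FROM crsp.dsf
  WHERE permno = 10107
    AND date BETWEEN '2018-01-03' AND '2018-01-06'")

dbDisconnect(db)
```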
[^1]: The few exceptions are those where we use data from `farr`, the companion package for this book; in general, the data steps are made available in the source code for that package.

[^2]: Though in many cases there will be an implicit SQL-based data step, just with `dplyr` handling most of the SQL-related details.

[^3]: SAS is an alternative with excellent native handling of WRDS data, but our understanding is that nowadays few accounting researchers exclusively use SAS for their research, suggesting it is not sufficient.

[^4]: The challenge with Python would be the additional complexity of accessing data from the WRDS PostgreSQL database.

[^5]: For example, the Accounting Coding Camp provided by three accounting professors covers SAS, Stata, and Python.