Appendix C — Research computing overview
In this appendix, we provide a brief overview of the research computing landscape with a focus on the statistical programming languages most commonly used in empirical accounting research. We also discuss some fundamental ideas associated with data management that help to explain the approach we emphasize in this book.
C.1 Languages
A downside of developing a course that involves hands-on data analysis is that material in at least one statistical programming language is needed. In this course, we focus on R and, somewhat indirectly, on PostgreSQL as tools. We believe that this choice is much less restrictive than it may seem. However, before explaining this choice, it makes sense to discuss the primary alternatives.
C.1.1 SAS
According to Wikipedia, “SAS is a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics.” SAS has long been a staple of research in accounting and finance, in part because it is used by Wharton Research Data Services (WRDS, pronounced “words”). WRDS provides its data sets in SAS data format and historically WRDS offered SAS as the primary programming language on its research computing servers.
SAS offers a number of advantages to researchers.
- Rich set of statistical tools
- Ability to use SQL to prepare data sets (via PROC SQL)
- Ability to process large data sets due to an on-disk orientation to data processing
- Availability to most researchers via WRDS servers
However, SAS has some disadvantages.
- Specialized programming language is relatively foreign to users of other languages
- Proprietary software limited to Windows and Linux (i.e., not MacOS)
- Limited set of data types
- Limitations in handling textual data
- Some important functionality (e.g., graphs) seems fairly inaccessible to most users
C.1.2 Stata
According to Wikipedia, “Stata is a general-purpose statistical software package created in 1985 by StataCorp. Most of its users work in research, especially in the fields of economics, sociology, political science, biomedicine, and epidemiology.”
Stata offers a number of advantages to researchers.
- Rich set of statistical tools, especially for economists and those in downstream disciplines, such as finance and accounting
- Availability on all three end-user operating systems (Windows, MacOS, and Linux)
- Friendly interface coupled with command-based orientation suits many workflows
However, Stata has some disadvantages.
- Proprietary system (i.e., it costs money)
- In-memory orientation makes it unsuitable for large data sets
- Historical limitation to one in-memory data set makes data preparation challenging
- Tendency for Stata users to produce code that is opaque
C.1.3 Python
According to Wikipedia, “Python is an interpreted, high-level and general-purpose programming language.” While not designed specifically for statistical or econometric analyses, Python has become a mainstay of the data science community, and there exist many packages that make it possible to perform research computing tasks for accounting research in Python.
Python offers a number of advantages to researchers.
- As a full-featured computer programming language, pretty much anything can be done using Python
- Rich set of statistical tools, especially in areas such as statistical learning
- Availability on all three end-user operating systems (Windows, MacOS, and Linux)
However, Python has some disadvantages.
- Due to its origins as a general-purpose computing language, some design choices are not optimized for data analysis
- A little more difficult to set up for a relatively non-technical user
C.1.4 PostgreSQL
PostgreSQL is in some ways out of place in this list, as it is not a programming language or statistical software package. But given that much of this course uses PostgreSQL, we discuss it briefly here. According to its website, “PostgreSQL is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.” For our purposes, the relevant fact is that PostgreSQL is a powerful system for storing data in a way that it can be processed, combined, and used in analysis.
There are a number of similar database systems, including SQLite, MySQL, and Oracle. We use PostgreSQL in part because we make extensive use of WRDS data, which is made available in a PostgreSQL database, and also because PostgreSQL is a very robust database system offering perhaps the strongest feature set among the main open-source alternatives.
Starting from Chapter 6, we make extensive use of data sources from the PostgreSQL database supplied by WRDS. WRDS essentially provides data in two forms: as SAS data files (accessible via SFTP) and as tables in a PostgreSQL database.
Chapter 2 provides an extended tutorial on R with an emphasis on the so-called Tidyverse. Using dbplyr, tables in a (possibly remote) database can be represented as a kind of data frame that allows us “to use remote database tables as if they are in-memory data frames by automatically converting dplyr code into SQL” (we explain what a data frame is in Chapter 2). In this way, the dbplyr package allows us to deploy the skills we learn in Chapter 2 to the manipulation and analysis of data stored in relational databases almost seamlessly. The best way to understand these ideas is through hands-on use, for which we provide guidance in the chapters beginning with Chapter 6.
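As a taste of what this looks like, here is a minimal sketch of the pattern (Chapter 6 covers the details). It assumes a working connection to the WRDS PostgreSQL database, with connection details (host, port, database, user) supplied via environment variables such as PGHOST and PGUSER; the PERMNO used here (10107) is Microsoft's.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Connect to the WRDS PostgreSQL database; connection details are assumed
# to be supplied via environment variables (see Chapter 6).
db <- dbConnect(RPostgres::Postgres())

# A remote table represented as a "lazy" data frame; no data are downloaded yet.
dsf <- tbl(db, in_schema("crsp", "dsf"))

# Ordinary dplyr verbs are translated to SQL and evaluated on the WRDS server.
msft_rets <-
  dsf |>
  filter(permno == 10107L,
         date >= "2018-01-03", date <= "2018-01-06") |>
  select(permno, date, ret)

msft_rets |> show_query()   # inspect the SQL that dbplyr generates
msft_rets |> collect()      # bring the (small) result into R as a data frame
```

Only the final `collect()` transfers data to the local machine; everything before it is deferred to the database.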
C.1.5 R
According to Wikipedia, “R is a programming language and free software environment for statistical computing and graphics”.
Compared to SAS and Stata, R has the benefit of being open-source and free. As with SAS, there is a seemingly endless list of packages to support almost any statistical procedure or data management task that one can think of. A disadvantage of R is that, like Stata, its default approach uses in-memory data analysis. However, as we will see, this disadvantage can be overcome with packages that facilitate on-disk data manipulation and analysis.
C.1.6 Why R?
This book contains a lot of R code, and a reader might ask why we chose R in particular. We provide an answer in this section.
A more basic question might be: Why choose any software package at all? We believe that there are significant pedagogical benefits to not only covering research papers and their methods, but also studying in detail the computing steps required to apply those methods and produce the results in those papers. For example, having actual code that a reader can tweak makes it easy to appreciate how the results might have differed had different approaches been used. Producing the numerous replications found in this book required a significant investment of time, one that a reader is unlikely to be able to make for everything covered here, especially a reader who is new to research computing.
Having decided to include code in the book, why did we choose R in particular? An alternative approach would have been to include code in multiple languages. For example, “The Effect: An Introduction to Research Design and Causality” (Huntington-Klein, 2021) provides code in Python, R, and Stata. And “Causal Inference: The Mixtape” (Cunningham, 2021) provides code in R and Stata.
There are two critical differences between these two books and ours. First, we felt it was important to include code to reproduce every analysis in the book. In contrast, “The Effect” and “The Mixtape” only include code to produce some of the analyses they contain. Including code for every analysis in those books would likely have resulted in significantly greater length and complexity, and doing so in multiple languages even more so.
Second, “The Effect” and “The Mixtape” largely skip complicated data steps and use small data sets. In contrast, we generally go through all the data steps required to produce our analyses.1 We also use (relatively!) large data sets, such as CRSP and Compustat, that require paying attention to the “logistics” of getting the data into whatever statistical analysis package is being used. In this regard, the approach we use in this book provides a particularly low-cost way to access these data sets, as we discuss in Section 6.1. Achieving similar results with (say) Stata would in most cases require a significant “data step” (to use SAS terminology) using SQL, followed by analysis of the resulting data frames using native tools. We believe that the dplyr-based approach used here results in code that is easier to understand.2
Finally, this book has been written with Quarto, a software package that works most seamlessly with R code. Producing this book using Stata or SAS would have been much more challenging.3
Even though this book uses R, we believe it should be of interest to dedicated users of SAS, Stata, and Python. First, one can probably get a lot out of this book with a minimal investment in learning R. We have endeavoured to make the code understandable even if viewed as a kind of pseudo-code (in this regard, we think R—especially with dplyr—is superior to, say, SAS or Stata). While one could read the book without running code, we strongly encourage you to run code in some places, even if just to be able to do some of the exercises. After some initial set-up (see Section 1.2), every analysis in the book can be produced by copying and pasting the code into R, and the book has been written so that the code in any chapter can be run independently of that in other chapters (though code later in a chapter may depend on code earlier in the chapter).
Second, we believe that some investment in R is worthwhile even if you will remain a dedicated user of, say, Stata. Even a dedicated Stata user is likely to benefit from the ability to read SAS, as co-authors may use it, and authors of papers may supply SAS code. As R’s popularity increases, the ability to read R is likely to become more important. Already R is a kind of lingua franca in much of statistics, including (along with Python) machine learning.
Finally, the researcher who is a “dedicated user of X” is probably an endangered species. Most emerging researchers will likely need to use multiple statistical languages over their careers and R is a good candidate to be one of those.4
C.2 Data management
Researchers use data stored in a variety of forms. In this section, we outline some desired attributes for a data storage approach. We then describe some common approaches to data storage, before evaluating them against the criteria outlined below.
C.2.1 Desiderata for data storage
In this section, we discuss features that are desired for a data storage approach. Not all features will be relevant to any given use case. Similarly, some features will be more important than others in a given situation. Nonetheless, these are important attributes across a wide portfolio of possible research uses.
Fast random access to on-disk data. Occasionally you will see benchmarks that compare alternative approaches to processing in-memory data. However, the reality is that data mostly “live” on disk and we want the ability to access these data quickly. In many cases, we will want only a small sample of the observations in a data set. For example, we may want stock-return data for Microsoft for a few dates.5 These data are found on crsp.dsf in the WRDS PostgreSQL database and can be retrieved from that system very quickly. In contrast, some data storage systems would require us to load (or at least scan) the entirety of crsp.dsf to get these rows.

Data accessible by any software package. Ideally, our data would be accessible from R, Stata, Python, Perl, … pretty much any software. Being able to do this has subtle but important advantages for data collaboration and multilingual programming. For example, you may be a Stata devotee who needs help preparing a data set from Web data. Finding people with the skills to do this is going to be a lot easier if you don’t need them to have Stata skills; you can instead work with a Python guru. So long as the data can be accessed through Stata, a Stata user is unlikely to care much whether the data were created using Python or Stata. Alternatively, you may have written some complicated Perl code to generate some data sets and occasionally need to tweak that code to accommodate changes in the source data. So long as the data sets are stored in a form that can be accessed by any system, there is no need to translate the entire code base into whichever language you use for the rest of your workflow.
Data accessible from anywhere. Ideally, data should be able to be accessed from anywhere with an internet connection (in the 2020s, this essentially means anywhere). For example, it should be easy for a researcher in Boston to share data with co-authors in California, Australia, or Singapore.
Possibility of centralized processing. While less important than it once was, the ability to move data processing and analysis to a computer more powerful than one’s laptop can still be useful today. In the text, there are a number of cases where the heavy part of data processing is accomplished on the WRDS PostgreSQL server even when executing R code locally. SAS offers similar functionality with RSUBMIT.

Support for a wide range of data types. Ideally, available data types will include strings, floating-point numbers, integers, large integers, dates, timestamps with time zones, JSON, XML, and other types. At one end, text files are simply text, with data types either being inferred when reading the data or specified in a separate file. At the other end are systems such as PostgreSQL, which offer all the data types listed above and more.
Ability to handle textual data with ease. Textual data have emerged as an important data source in research in recent years. In practice, good text-handling capability often means supporting Unicode and encoding as UTF-8. According to Stata, “Unicode is the modern way that computers encode characters such as the letters in the words you are now reading.” Additionally, text fields can be “wide” and the data storage approach should not impose restrictions on these that limit its usefulness.
C.2.2 Data storage approaches
We can think of four basic approaches to data storage:
- Text files
- Package-specific binary data formats
- General-purpose binary data formats
- Relational databases
One way to store data is in text files, such as comma-separated values (CSV) files. These text files, which may be compressed to reduce storage space requirements and increase read speeds, can then be read into pretty much any statistical programming package. So long as one addresses issues such as encoding (e.g., UTF-8 versus Windows-1252) and quote characters (e.g., " or ') embedded in values, text files are a robust approach to data storage and arguably the best way to share certain kinds of data with random strangers (e.g., posting .csv files on a webpage).
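As an illustration of handling those issues in R, the sketch below reads a hypothetical gzip-compressed CSV file whose text is encoded as Windows-1252 rather than UTF-8; the file name is just a placeholder.

```r
library(readr)

# Hypothetical file: a gzip-compressed CSV exported from a Windows system,
# so its text is encoded as Windows-1252 rather than UTF-8.
raw_file <- "some_data.csv.gz"

# readr decompresses .gz files automatically; we state the encoding and the
# quoting convention explicitly rather than relying on defaults.
df <- read_csv(raw_file,
               locale = locale(encoding = "windows-1252"),
               quote = "\"",
               show_col_types = FALSE)
```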
However, text files are not ideal for fast random access to on-disk data. This disadvantage matters when one is interested in using only small slivers of data at a time, which is very common in accounting research. For example, suppose we are interested in getting the stock returns for Microsoft Corporation from 3 January 2018 to 6 January 2018. If the data source is a text-file version of CRSP’s daily stock file (crsp.dsf), then we would typically need to read in the entire text file, which would take several seconds, even with a very fast storage medium (such as solid-state drives), and require several gigabytes of memory (crsp.dsf is about 16 GB in SAS format).
Additionally, text files do not easily incorporate information about data types. This means it is generally necessary to provide additional information about a data set for statistical packages to read it correctly. For example, should a field with values like 991201, 120305, and 231130 be read as text, as floating-point numerical values, as integers, or even as dates? Or if a field contains date-time values, what is the format? And what is the time zone?
Package-specific storage means storing data as, say, SAS or Stata data files. This approach can address some weaknesses of text files.
First, most statistical software packages have a richer set of data types and these are generally embedded into the native data storage formats for each package. Nonetheless, some packages have limitations. For example, everything in SAS is fundamentally of either a floating-point numerical type or a fixed-width character type. Things such as dates and times are essentially formatting overlays over these types, and such formatting is easy to break, sometimes causing serious issues in practice. Stata is fairly similar. For example, dates and times in Stata are stored as integers and special formatting is applied to these to display them so that they are recognized as dates or times by a human.
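For comparison, R’s own Date class rests on the same idea: a date is stored as a number (days since 1 January 1970) with a class attribute that controls how it is displayed.

```r
d <- as.Date("2018-01-03")
unclass(d)              # 17534: days since 1970-01-01
class(d)                # "Date": the layer that makes the number print as a date
format(d, "%d %b %Y")   # "03 Jan 2018"
```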
Second, some statistical software packages have the ability to facilitate random on-disk access by indexing data files. For example, SAS allows the user to index data so that small portions of the data can be loaded quickly without having to process the entire data file (e.g., the SAS data file for crsp.dsf on WRDS is indexed by permno, permco, and date, so accessing the stock-return data for Microsoft for those few dates is quite fast).
While package-specific data formats do address some of the deficiencies of text files, they do so at the cost of locking users into the applicable package. While .dta (Stata) data files can be read into R, they are unlikely to be the best approach to storing data unless all analyses are run using Stata (in which case, limitations of Stata are likely to arise). Either of the following approaches is likely to be superior to such an approach.
A third approach uses a binary format that is accessible to multiple software packages or systems. Perhaps the leading example of this for on-disk storage is the parquet format, which is supported by Apache Arrow, “a software development platform for building high performance applications that process and transport large data sets.”6 The parquet format offers much of the versatility of CSV files, while also allowing for a rich system of data types and random access to data. As seen in Chapter 23, parquet files can be combined with DuckDB to provide a lightweight, high-performance alternative to PostgreSQL. While this approach does not facilitate the centralization of data processing as easily as running a database server does, the performance of approaches based on DuckDB and parquet files on even limited hardware is often more than adequate. Appendix E provides guidance on setting up your own parquet-based data repository for research.
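The following sketch shows the basic pattern using a built-in data set; the file path is purely illustrative, and Chapter 23 and Appendix E describe the approach we actually use.

```r
library(DBI)
library(duckdb)
library(dplyr)
library(arrow)

# Write a data frame to a parquet file (the path is purely illustrative).
write_parquet(mtcars, "mtcars.parquet")

# Query the parquet file with DuckDB without loading it all into memory.
con <- dbConnect(duckdb())
cars <- tbl(con, sql("SELECT * FROM read_parquet('mtcars.parquet')"))

cars |>
  filter(cyl == 6) |>
  summarize(mean_mpg = mean(mpg, na.rm = TRUE)) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```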
A relational database provides a more rigid, structured way of storing data. Almost all database packages use Structured Query Language (SQL) as the primary language for the user to interact with the data and a brief primer on SQL is provided in Appendix B. This course focuses on data stored in the PostgreSQL database offered by WRDS.
The WRDS PostgreSQL server offers a number of advantages. First, much of the data processing can be handled by the WRDS server. Second, the WRDS database offers fast random access to on-disk data.
PostgreSQL has excellent support for data types without requiring detailed understanding from users. While PostgreSQL likely uses a similar approach to storing dates as is used by SAS or Stata, a user is never forced to be aware of these details. Most of the commonly used types in PostgreSQL (e.g., timestamps with time zones) are recognized fairly seamlessly and converted to appropriate types when data are brought into R or Python. PostgreSQL also handles text data well.
PostgreSQL is easily accessed from other languages, including Python, R, and Stata. This benefit is perhaps a significant driver of WRDS’s decision to add a PostgreSQL database to its offerings. With more users wanting to use Python or R for analysis, sticking to a SAS-format-only approach was likely not viable. Finally, if you have a WRDS account, the WRDS PostgreSQL database is accessible from anywhere with an internet connection.
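For example, a WRDS account holder can connect directly from an R session on a laptop anywhere in the world. The sketch below uses the connection details that WRDS documents at the time of writing, with a placeholder user name; the password is best supplied via a .pgpass file or the PGPASSWORD environment variable rather than in code.

```r
library(DBI)

wrds <- dbConnect(RPostgres::Postgres(),
                  host = "wrds-pgdata.wharton.upenn.edu",
                  port = 9737,
                  dbname = "wrds",
                  sslmode = "require",
                  user = "your_wrds_id")   # placeholder WRDS user name

# A quick check that the connection works and data are accessible.
dbGetQuery(wrds, "SELECT date, ret FROM crsp.dsf WHERE permno = 10107 LIMIT 3")

dbDisconnect(wrds)
```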
While the WRDS PostgreSQL database provides many of these benefits, it does not provide any mechanism to manage non-WRDS data and most significant research projects will involve such data. Using a relational database as the central data repository for your research offers a number of benefits, and Appendix D describes how to set up your own PostgreSQL server.
1. The few exceptions are those where we use data from farr, the companion package for this book, and in general the data steps are made available in the source code for that package: https://github.com/iangow/farr.
2. While there will often be an implicit SQL-based step, dplyr handles most of the SQL-related details. SAS is an alternative with excellent native handling of WRDS data, but our understanding is that nowadays few accounting researchers exclusively use SAS for their research, suggesting that SAS alone is not sufficient.
3. Recent developments have made it easier to work with remote data sources using Python and it is possible that a future (alternative) edition of this book will use Python.
4. For example, the Accounting Coding Camp provided by three accounting professors covers SAS, Stata, and Python: https://accountingcodingcamp.com.
5. Here “random” means that we may not know which observations we want ahead of time.
6. See the Apache Arrow website at https://arrow.apache.org/overview/.