1  Introduction

1.1 Structure of the book

The book is organized into four parts.

Part I: Foundations covers a variety of topics, including research computing, statistics, causal inference, and some details of data sets commonly used in accounting research. This part of the book covers material often not included in the formal coursework of a PhD in accounting. For example, material related to statistics and causal inference is often assumed to be covered in coursework in statistics and econometrics rather than in the accounting-specific courses. Material on research computing and detailed investigation of data sets is generally not covered in PhD coursework at all, with the typical approach being for these skills and knowledge to be picked up informally.

Assuming very little in terms of prior knowledge, Part I: Foundations covers core concepts and skills in data analysis, statistics, and causal inference.

  • Chapter 1 provides an introduction to the book, including a reading guide and instructions for setting up your computer.

  • Given the centrality of data skills to getting the full value out of this book, we provide a fast-paced tutorial-style introduction to R in Chapter 2.

  • As we assume very little knowledge of statistics and regression analysis, we provide an introduction to the basics of regression analysis in Chapter 3.

  • Chapter 4 builds on Chapter 3 to provide an introduction to elements of causal inference.

  • Chapter 5 provides an introduction to statistical inference, which is a core part of empirical accounting research.

Part I also introduces key data sets frequently used in empirical accounting research.

  • Chapters 6 and 8 provide an introduction to Compustat and accessing data through WRDS.

  • Chapter 7 discusses the linking of data sets from different providers with a focus on linking financial statement data from Compustat with stock return data from CRSP.

  • We wrap up Part I with Chapter 9, which provides additional data skills useful both for later chapters and (we hope) readers’ own research efforts.

Part I provides the foundations for the remaining parts of the book. Depending on the preferences of readers and instructors, one could either continue with Part II: Capital Markets Research or skip ahead to Part III: Causal Inference. While some parts of Part III draw on skills and concepts covered in Part II, we flag such instances in each case.

The material of Part I could be covered in a number of ways. One approach would be to cover this material in a standalone introductory course or “boot camp”. A reader will notice that Chapter 2 actually incorporates by reference significant portions of R for Data Science, which could easily be a course in its own right (and one highly complementary to the material covered here), so there is plenty of material here for a full-fledged course for a program willing to devote class time to these skills.

Another approach might be to assign Part I: Foundations to students on a self-study basis, perhaps with select portions being covered when they are most relevant for later portions of the book. For example, for a course based on Part II: Capital Markets Research of the book, Chapter 7 covers the important topic of correctly linking databases—not often encountered in PhD courses—and could be assigned as background work as and when relevant to material from Part II.

Part II: Capital Markets Research provides the basis for a PhD-level course focused on capital markets research. This part alone easily provides materials for about eight weeks of coursework. For a ten- or twelve-week course, an instructor could draw on materials from other parts of the book, or could easily supplement using other materials. Part II is deliberately focused on more “classical” material and thus could easily complement related material that focuses on more contemporary work in financial accounting research. Part II starts with research from the 1960s—such as Fama et al. (1969), Ball and Brown (1968), and Beaver (1968)—and covers some of the most important studies of subsequent decades, including Bernard and Thomas (1989), Sloan (1996), and key earnings management papers of the 1980s and 1990s.

Part III: Causal Inference provides the basis for a PhD-level course focused on causal inference in empirical accounting research. Part III has a more contemporary orientation and is not focused on capital markets research.

Depending on the needs of students in a given program, Part III could be taught as a standalone course with elements of Part I being drawn upon as needed. Topics in Chapter 19 draw on materials in Part I, with extensive discussion of causal diagrams (Chapter 4), standard errors (Chapter 5), linking databases (Chapter 7), using regular expressions (Chapter 9), and two-step regressions (drawing on materials covered in Chapter 3).

While there are connections between Part II and Part III (e.g., Chapter 19 covers measures of accruals and earnings management that are covered in Chapters 15 and 16), these do not seem to rise to the level of considering Part II a prerequisite for Part III. Chapter 19 focuses on earnings management, which is the topic of an entire chapter in Part II (Chapter 16). While the material of Part III might typically be covered later in the coursework of an accounting PhD program, we have endeavoured to present this material in a way that is fairly self-contained and therefore accessible to students earlier in their PhD studies (perhaps using materials from Part I to fill in gaps). There may even be merit in covering most of Part III before Part II, as it will allow students to read Part II materials (mostly older papers) through a more contemporary lens.

Part IV: Additional Topics provides chapters on topics such as matching, handling extreme values, selection models, and statistical (machine) learning. While these are important topics, we believe the chapters of Part IV are less closely related to one another than are the materials of Parts II and III. Instructors could easily incorporate chapters from Part IV in courses based on Part II or Part III of this book, or as standalone material for courses not based on this book.

1.2 Setting up your computer

Assuming that you have the ability to install software and a WRDS account, setting up your computer so that you can run the code in this book is straightforward and takes just a few minutes. We list the required steps below and also provide a video demonstrating these steps online.

  1. Download and install R. R is available for all major platforms (Windows, Linux, and macOS) here.

  2. Download and install RStudio. An open-source version of RStudio is available here.

  3. Install the required packages from CRAN. CRAN stands for “Comprehensive R Archive Network” and is the official repository for packages (also known as libraries) made available for R. In this course, we will make use of a number of R packages. These can be installed easily by running the following code in RStudio.1

    install.packages(c("DBI", "MASS", "MatchIt", "RPostgres",
        "arrow", "car", "duckdb", "farr", "fixest", "furrr",
        "glmnet", "httr2", "kableExtra", "lmtest", "modelsummary",
        "optmatch", "pdftools", "plm", "rdrobust", "robustbase",
        "rpart", "rpart.plot", "sandwich", "tidyverse"))

    Note that farr is an R package one of us created just for this course (Gow, 2022). (As the package is related to the course Financial Accounting Research at the University of Melbourne, farr stands for “Financial Accounting Research with R”.)

  4. Set up R to connect to the WRDS PostgreSQL database. To actually use much of the code from Chapter 6 on, you will need to tell R how to access WRDS data stored in its PostgreSQL database by running the following code within RStudio.

    Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
               PGPORT = 9737L,
               PGDATABASE = "wrds",
               PGUSER = "your_WRDS_ID", 
               PGPASSWORD = "your_WRDS_password")

    Obviously, you should replace your_WRDS_ID and your_WRDS_password with your actual WRDS ID and WRDS password, respectively. This code will need to be run each time you open RStudio to access WRDS data in the code examples below. But once you have run this code, you do not need to run it again during the same session (i.e., until you close and reopen RStudio).

    If the only PostgreSQL database you access is the WRDS database, you could put the values above in .Renviron, a special file that is read every time you start R (see here for more information on this file).2 The contents of this file would look something like this (note that .Renviron holds plain name-value pairs rather than R code, so the port is written as 9737, not as the R integer literal 9737L):

    PGHOST = "wrds-pgdata.wharton.upenn.edu"
    PGPORT = 9737
    PGDATABASE = "wrds"
    PGUSER = "your_WRDS_ID"
    PGPASSWORD = "your_WRDS_password"

    We discuss alternative approaches to setting up the WRDS database connection in Section 6.1, but we recommend this approach as it keeps the user-specific aspects of the code separate from the parts of the code that should work for everyone. By using environment variables, we ensure that the code in the book works for you if you copy it and paste it in your R console.

    Note that we have striven to make the code in each chapter independent of the code in other chapters. So, if you feel comfortable with using R and have fulfilled the requirements listed above, you could easily jump ahead to a chapter of interest and start running code.
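Before jumping ahead, it may be useful to confirm that the setup steps above have taken effect. The following is a minimal sketch rather than part of the book's own code: the helper functions missing_packages() and missing_env_vars() are hypothetical names introduced here for illustration. The sketch reports any packages from step 3 that are not yet installed and any connection variables from step 4 that are not set in the current session. PGPASSWORD is deliberately left out of the check, on the assumption that some readers will keep their password in a PostgreSQL password file instead.

```r
# Packages listed in step 3 above.
required_pkgs <- c("DBI", "MASS", "MatchIt", "RPostgres",
                   "arrow", "car", "duckdb", "farr", "fixest", "furrr",
                   "glmnet", "httr2", "kableExtra", "lmtest", "modelsummary",
                   "optmatch", "pdftools", "plm", "rdrobust", "robustbase",
                   "rpart", "rpart.plot", "sandwich", "tidyverse")

# Environment variables set in step 4 above (PGPASSWORD omitted on purpose,
# as it may instead be supplied through a password file).
required_vars <- c("PGHOST", "PGPORT", "PGDATABASE", "PGUSER")

# Hypothetical helper: return the elements of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Hypothetical helper: return the elements of `vars` that are unset or empty.
missing_env_vars <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[values == ""]
}

missing_pkgs <- missing_packages(required_pkgs)
missing_vars <- missing_env_vars(required_vars)

if (length(missing_pkgs) > 0) {
  message("Packages still to install: ",
          paste(missing_pkgs, collapse = ", "))
}
if (length(missing_vars) > 0) {
  message("Environment variables not set: ",
          paste(missing_vars, collapse = ", "))
}
if (length(missing_pkgs) == 0 && length(missing_vars) == 0) {
  message("Setup looks complete.")
}
```

If nothing is reported as missing, a call such as DBI::dbConnect(RPostgres::Postgres()) should pick up the PG* environment variables without any user-specific details appearing in the code. Actually connecting still requires a valid WRDS account, which this sketch does not test.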

  1. You can copy and paste the code into the “Console” in RStudio.

  2. We put our passwords in a special password file, as described in the PostgreSQL documentation, so we don’t need to set PGPASSWORD. It’s obviously not a good idea to put your password in code.