1 Introduction

This book provides a course on financial accounting research that begins at an upper-undergraduate (“honours”) or introductory PhD level. One goal of the course, like most PhD courses, is to prepare PhD students to take further research courses and to go on to do their own research. Another goal of the course is to provide students with a set of skills that is useful in other domains, such as consulting or finance. This second goal stems from the origins of parts of this course as a joint honours-PhD course at the University of Melbourne, where the honours students are undergraduates completing an additional year of study with a focus on research. While some honours students progress to PhD studies, most elect to take jobs in industry, such as consulting, auditing, or public service.

1.1 Features of this book

Several features of this book distinguish a course based on it from a more traditional PhD-level course; we discuss these here.

1.1.1 Pedagogically driven selection of papers

Many syllabuses for PhD courses in accounting focus on recent papers with a view to giving students a sense of the current themes and trends in research to help students spot gaps in the literature that they can fill with their own research. We view such courses as complementary to this course, but take a different approach.

Aiming to provide a more fundamental understanding of accounting research, the selection of papers in this course is driven more by pedagogical goals than an attempt to represent the current state of play in accounting research. In some cases, this means covering older papers (e.g., Ball and Brown, 1968), but in other cases we use a recent paper that features a core idea or approach.

1.1.2 Incorporation of data analysis skills

A second feature that distinguishes this course from most PhD courses in accounting is an emphasis on data analysis skills, which are deliberately woven into the course throughout. While it might have been possible to make this a course focused exclusively on such skills, it is our view that these skills are best learned through applying them to real research questions. Conversely, we believe that being able to pull data, run simulations, and get more involved with critical elements of the research process engenders a better understanding of research.

We build data analysis and computing skills into the course at each step in a systematic way. In practice, research computing skills are the bread and butter of empirical research, yet they are generally neglected in PhD programs’ formal curricula. The prevailing ethos seems to be that research computing skills are acquired informally: from other students, through research assistantships and collaboration with faculty members, and so on.

In some doctoral programs, this approach may work to a degree. But in many others, such informal learning fails to prepare students adequately. For example, if students’ interaction with faculty members is limited to informal collaboration on papers where the students do the data analysis with little or no hands-on guidance from the faculty member, then the opportunity for clear and comprehensive instruction is limited.

1.1.3 Greater emphasis on research design and methods

Accounting research is overwhelmingly an empirical discipline seeking to draw causal inferences and, as such, significant research training should be focused on research design issues. Part III of the course examines causal inference in depth, including natural experiments, regression discontinuity designs, and instrumental variables. However, there we will see that, outside the settings in which these techniques can be deployed, the common thinking that they offer hope for warranted causal inference is generally flawed. Throughout the course, we offer a broader set of tools for making inferences about real-world phenomena.

In this course, we strive to equip students with systematic data analysis and computing skills that are needed to conduct the analyses that we cover. We believe that understanding of statistical and econometric techniques by accounting researchers is more likely to be enhanced by hands-on simulation analysis than by analysis of consistency and asymptotic variance of estimators. By building in the data analysis skills needed to perform such simulations, we hope that this course provides a platform for accounting researchers to think more carefully about the properties of their estimators. A core commitment we make in this book is that every analysis we present can be conducted by the reader by copying and pasting the code found herein.
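To give a flavour of the kind of hands-on simulation analysis we have in mind (the data-generating process and parameter values here are our own illustrative choices, not drawn from later chapters), the following sketch draws many samples from a known model, estimates the slope by OLS in each, and inspects the sampling distribution of the estimates:

```r
# Illustrative simulation: the true slope is 0.5, so the mean of the
# OLS estimates across many simulated samples should be close to 0.5,
# and their standard deviation approximates the estimator's standard error.
set.seed(2021)

sim_run <- function(n = 100, beta = 0.5) {
  x <- rnorm(n)
  y <- beta * x + rnorm(n)
  coef(lm(y ~ x))[["x"]]
}

estimates <- replicate(1000, sim_run())
mean(estimates)  # close to the true slope of 0.5
sd(estimates)    # approximates the standard error of the OLS slope
```

Replacing the error term with, say, serially correlated draws lets one study how the estimator behaves when standard assumptions fail, which is the spirit of the simulations in later chapters.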

1.2 Prerequisites

We presume both prior knowledge of some topics and access to certain computing resources. We have endeavoured to keep these requirements to a minimum.

  1. Knowledge of accounting and business. In terms of accounting, we assume a solid understanding of the content of an introductory financial accounting course and enough understanding of business to make sense of accounting.

  2. Prior exposure to statistics and econometrics. In Chapter 5, we presume that you are familiar with the elements of statistical inference and ordinary least-squares (OLS) regression. Even if you need to refer to a textbook to brush up on these topics, you should be able to follow that material and the subsequent chapters that build on it.

  3. Access to a computer and the internet. To accommodate a broader audience and minimize set-up costs, we assume nothing of the reader other than access to a computer and the internet and basic proficiency in using these.

  4. Access to academic journals. The course makes extensive use of papers in academic journals. If you are a faculty member, researcher, or student at an academic institution, then you should be able to access the papers we use through your library. Some universities provide a service (perhaps for a fee) for alumni access to academic journals. Unfortunately, if you cannot access the papers, it will be difficult to make full use of this course book.

  5. Ability to install R and RStudio. While we focus on R, a popular open-source programming language for statistics and data science, we do not assume any prior knowledge of it. Rather than developing our own material on the basic skills of using R, we lean on materials provided elsewhere for this purpose. Specifically, we direct the reader to “R for Data Science”, which is available as a printed book or for free on the internet. Chapter 3 provides a road map to using those materials.

  6. Access to WRDS data. Because it is difficult to go very far in accounting research without WRDS data, this book is targeted at the reader who has a WRDS account. If you do not have a WRDS account, but are eligible for one (e.g., you are a graduate student, researcher, or faculty member at a WRDS-subscribing institution), then you should apply for such an account.1

1.2.1 Setting up your computer

Assuming that you have the ability to install software and a WRDS account, setting up your computer so that you can run the code in this book is straightforward and takes just a few minutes. We list the required steps below and also provide a video demonstrating these steps here.

  1. Download and install R. R is available for all major platforms (Windows, Linux, and MacOS) here.

  2. Download and install RStudio. An open-source version of RStudio is available here.

  3. Install required packages from CRAN. CRAN stands for “Comprehensive R Archive Network” and is the official repository for packages (also known as libraries) made available for R. In this course, we will make use of a number of R packages. These can be installed easily by running the following code in RStudio.2

    install.packages(c("AER", "DBI", "MASS", "MatchIt", "RPostgres",
        "broom", "car", "dbplyr", "dplyr", "farr", "forcats", "furrr",
        "glmnet", "knitr", "lfe", "lubridate", "mgcv", "optmatch",
        "pdftools", "plm", "purrr", "rdrobust", "readr", "robustbase",
        "rpart", "sandwich", "stargazer", "stringr", "tidyr"))
    Note that farr is an R package we created just for this course. (As this book is related to the course Financial Accounting Research at the University of Melbourne, farr stands for “Financial Accounting Research with R”.)

  4. Set up R to connect to the WRDS PostgreSQL database. To actually use much of the code from Chapter 7 on, you will need to tell R how to access WRDS data stored in its PostgreSQL database by running the following code within RStudio.
    Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
               PGPORT = 9737L,
               PGDATABASE = "wrds",
               PGUSER = "your_WRDS_ID", 
               PGPASSWORD = "your_WRDS_password")

    Obviously, you should replace your_WRDS_ID and your_WRDS_password with your actual WRDS ID and WRDS password, respectively. This code will need to be run each time you open RStudio to access WRDS data in the code examples below. But once you have run this code, you do not need to run it again during the same session (i.e., until you close and reopen RStudio).

    If the only PostgreSQL database you access is the WRDS database, you could put the values above in .Renviron, a special file that is opened every time you open R (see here for more information on this file).3 The contents of this file would look something like this:
    PGHOST=wrds-pgdata.wharton.upenn.edu
    PGPORT=9737
    PGDATABASE=wrds
    PGUSER=your_WRDS_ID

    We discuss alternative approaches to setting up the WRDS database connection in section 7.1, but we recommend this approach as it keeps the user-specific aspects of the code separate from the parts of the code that should work for everyone. By using environment variables, we ensure that the code in the book works for you if you copy it and paste it in your R console.

    Note that we have striven to make the code in each chapter independent of code in other chapters. So, if you feel comfortable with using R and have fulfilled the requirements listed above, you could easily jump ahead to a chapter of interest and start running code.
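As an illustration of why the environment-variable approach keeps code portable, here is a minimal connection sketch (assuming the DBI and RPostgres packages installed above and a valid WRDS account); note that no credentials appear in the code itself:

```r
library(DBI)

# With PGHOST, PGPORT, PGDATABASE, PGUSER, and PGPASSWORD set as
# environment variables, dbConnect() needs no further arguments.
db <- dbConnect(RPostgres::Postgres())

# A small sanity check: ask the server for the current date.
dbGetQuery(db, "SELECT current_date")

dbDisconnect(db)
```

Because the connection details live in the environment rather than the script, the same code runs unchanged for every reader.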

1.3 A guide for readers

The book is written so as to be fairly accessible to a novice reading independently (subject to the prerequisites outlined above). We recommend that such readers work through the first few chapters in order, including running the code, completing the exercises, and thinking about the discussion questions. That said, some elements of the exercises and discussion questions are subtle and having an instructor or someone to discuss these with will help you to get the full value from this material.

But we hope that this book will be useful to a variety of readers, learners, and instructors beyond the novices. Below we discuss possible approaches for some hypothetical readers.

  • I am interested in learning more about issues related to research design and causal inference. You might find you can dive into Chapters 3 and 5, then move to Chapter 19 and subsequent chapters.
  • I am interested in learning more about issues related to research design and causal inference, but I don’t really want to learn R. The plan for the hypothetical reader in the previous bullet point likely works. Even if you aren’t interested in learning R, we think that running the code helps solidify understanding, and that what the code does is sufficiently clear that copying and pasting it on your own computer should be enough to get the gist of what is going on.
  • I have heard about R and would like to learn more about it. Chapters 3 and 4 cover some of the basics. But if you’re already proficient in something like SAS or Stata, you may find it easy to skip those chapters (after meeting the prerequisites above), go to a chapter that aligns with your research interests, and see if you can figure out what the code is doing as you work through it. We have deliberately written the chapters so that you can work on each one independently of the others.

1.4 Structure of the book

The book is organized into four parts.

Part I: Foundations covers a variety of topics, including research computing, statistics, causal inference, and some details of data sets commonly used in accounting research. This part of the book covers material often not included in the formal coursework of a PhD in accounting. For example, material related to statistics and causal inference is often assumed to be covered in coursework in statistics and econometrics rather than in the accounting-specific courses. Material on research computing and detailed investigation of data sets is generally not covered in PhD coursework at all, with the typical approach being for these skills and knowledge to be picked up informally.

The material of Part I could be covered in a number of ways. One approach would be to cover this material in a standalone introductory course or “boot camp”. A reader will notice that Chapter 3 actually incorporates by reference significant portions of “R for Data Science”, which could easily be a course in its own right (and one highly complementary to the material covered here), so there is plenty of material here for a full-fledged course for a program willing to devote class time to these skills.

Another approach might be to assign Part I to students on a self-study basis, perhaps with select portions being covered when they are most relevant for later portions of the book.

For example, for a course based on Part II of the book, material from Chapter 9, which covers the important topic of correctly linking databases (a topic not often encountered in PhD courses), could be assigned as background reading as and when it is relevant to material from Part II.

Turning to Part III of the book, topics covered in Chapter 21 draw on materials in Part I, with extensive discussion of causal diagrams (Chapter 5), standard errors (Chapter 6), linking databases (Chapter 9), using regular expressions (Chapter 10), and two-step regressions (drawing on materials covered in Chapter 4). Chapter 21 also focuses on earnings management, which is the topic of an entire chapter in Part II (Chapter 18).

Part II: Capital Markets Research provides the basis for a PhD-level course focused on capital markets research. This part alone easily provides materials for about eight weeks of coursework. For a ten- or twelve-week course, an instructor could draw on materials from other parts of the book, or could easily supplement using other materials. Part II is deliberately focused on more “classical” material and thus could easily complement related material that focuses on more contemporary work in financial accounting research. Part II starts with research from the 1960s—such as Fama et al. (1969), Ball and Brown (1968), and Beaver (1968)—and covers some of the most important studies of subsequent decades, including Bernard and Thomas (1989), Sloan (1996), and key earnings management papers of the 1980s and 1990s.

Part III: Causal Inference provides the basis for a PhD-level course focused on causal inference in empirical accounting research. Part III has a more contemporary orientation and is not focused on capital markets research. Depending on the needs of students in a given program, Part III could be taught as a standalone course with elements of Part I being drawn upon as needed. While there are connections between Part II and Part III (e.g., Chapter 21 covers measures of accruals and earnings management that are covered in Chapters 17 and 18), these do not seem to rise to the level of considering Part II a prerequisite for Part III.

While the material of Part III might typically be covered later in the coursework of an accounting PhD program, we have endeavoured to present this material in a way that is fairly self-contained and therefore accessible to students earlier in their PhD studies (perhaps using materials from Part I to fill in gaps). There may even be merit in covering most of Part III before Part II, as it will allow students to read Part II materials (mostly older papers) through a more contemporary lens.

Part IV: Additional Topics provides chapters on topics such as matching, handling extreme values, selection models, and statistical (machine) learning. While these are important topics, we believe they are less closely related to the book’s core material than those of Parts II and III, and that instructors could easily incorporate these chapters into courses based on Parts II or III of this book, or use them as standalone material for courses not based on this book.

1.5 Acknowledgements

While this book draws on materials we have been using for many years, writing this book began in earnest in early 2021. Since then we have received help from many others, ranging from supplying code and data, to offering suggestions on content and feedback on drafts, to simply encouraging us to persist with the project. We would like to recognize the help of Ulrich Atz, Andrew Baker, Ray Ball, Jeremy Bertomeu, Stu Black, Mark Bradshaw, Philip Brown, Patty Dechow, Jenny Zha Giedt, Amy Hutton, James Kavourakis, David Larcker, Andy Leone, Ying Liang, Christian Leuz, Miguel Minutti-Meza, Matt Pinnuck, Steve O’Byrne, Shiva Rajgopal, Mario Schabus, Stefan Schantl, Richard Sloan, Dan Taylor, Jake Thomas, Jake Thornock, Stephen Walker, Charlie Wang, Yihong Wang, Eddie Watts, and Anastasia Zakolyukina.

We also thank the many students who have suffered through earlier versions of the materials here, including students at Deakin, Harvard, Melbourne, Michigan, and Wharton.

1.6 Some notes on style

We follow British (hence Australian) conventions for the most part. Reflecting the enduring influence of a Pocket Oxford Dictionary one of us received at age seven, we tend to use “-ize” spellings instead of “-ise” spellings (in any case, these are more familiar to American readers). Also we likely use the Oxford comma more often than not. One benefit of our choice is that we do not have to follow the prescription of American English that commas and full stops (periods) always go inside quotes and can instead put them where they naturally belong (i.e., where speakers of languages other than American English put them) even if this produces sentences that may look odd to some American readers. (It’s hard to disagree with Hadley Wickham—the lead author of the Tidyverse—on this point: “That is literally the stupidest rule in American English and I refuse to follow it.”)

For code, we endeavour to follow Hadley Wickham’s style guide for R code, except that we often put the first item after the assignment operator (<-) on a new line.