4  Causal inference

The importance of causal inference in accounting research is clear from the research questions that accounting researchers seek to answer. Most long-standing questions in accounting research are causal:

Gow et al. (2016) provide a taxonomy for empirical research papers that comprises four categories:

Gow et al. (2016) find that most original research papers in the top three accounting research journals use observational data and that about 90% of these papers fall into the last category because they seek to draw causal inferences.

That accounting researchers focus on causal inference is consistent with the view that “the most interesting research in social science is about questions of cause and effect” (Angrist and Pischke, 2008, p. 3). While you may hear people talk about “interesting associations” at times, the reality is that associations (or correlations) are only interesting if there’s an interesting possible causal explanation for the associations. Associations that are mere coincidence, such as those found here are not at all interesting.1

At times, authors appear to disclaim any intention to draw causal inferences. Bertrand and Schoar (2003) is fairly typical: “There is no such thing as a random allocation of top executives to firms. Therefore, we are not hoping in this section to estimate the causal effect of managers on firm practices. Instead, our objective is more modest. We want to assess whether there is any evidence that firm policies systematically change with the identity of the top managers in these firms.”

There are at least two issues with this claim. First, why would anyone be interested in “evidence that firm policies systematically change with the identity of the top managers” if such changes come with no understanding as to why they change? Second, this claim is a bit of a pretence. The title of the paper is, after all, “Managing with style: The effect of managers on firm policies” and the first sentence of the abstract is “This paper investigates whether and how individual managers affect corporate behavior and performance” (emphasis added).

As another example, suppose a researcher argues that a paper that claims that “theory predicts \(X\) is associated \(Y\) and, consistent with that theory, we show \(X\) is associated with \(Y\)” is merely a descriptive paper that does not make causal inferences. However, theories are invariably causal in that they posit how exogenous variation in certain variables leads to changes in other variables. Further, by stating that “consistent with … theory, \(X\) is associated with \(Y\)”, the clear purpose is to argue that the evidence tilts the scale, however slightly, in the direction of believing the theory is a valid description of the real world: in other words, a causal inference is drawn. A paper that argues that \(Z\) is a common cause of \(X\) and \(Y\) and claims to find evidence of this is still making causal inferences (i.e., that \(Z\) causes \(X\) and \(Z\) causes \(Y\)).

Making causal inferences requires strong assumptions about the causal relations among variables. For example, as discussed below, estimating the causal effect of \(X\) on \(Y\) requires that the researcher has controlled for variables that could confound estimates of such effects.

Recently, some social scientists have argued that better research designs and statistical methods can increase the credibility of causal inferences. For example, Angrist and Pischke (2010) suggest that “empirical microeconomics has experienced a credibility revolution, with a consequent increase in policy relevance and scientific impact.” Angrist and Pischke (2010, p. 26) argue that such “improvement has come mostly from better research designs, either by virtue of outright experimentation or through the well-founded and careful implementation of quasi-experimental methods.”

Tip

The code in this chapter uses the following packages. For instructions on how to set up your computer to use the code found in this book, see Section 1.2 (note that Step 4 is not required as we do not use WRDS data in this chapter). Quarto templates for the exercises below are available at https://github.com/iangow/far_templates.

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(modelsummary)

4.1 Econometrics

Empirical financial accounting research can be viewed as fundamentally a highly specialized area of applied microeconomics. Financial accounting researchers typically take classes (at some level) in microeconomics, statistics, and econometrics, either before or in parallel with more specialized classes in accounting research. This model has two significant gaps that budding researchers need to address.

The first gap is the translation of ideas from econometrics and microeconomics into what researchers do from day to day. As we will see, sometimes things get lost in this translation. One goal of this book is to reduce the gap in the domain of empirical methods.

But it’s the second gap that we want to address here and this gap affects not only financial accounting researchers, but applied economic researchers more generally. This gap is between what accounting researchers want to do, which is causal inference, and what econometrics textbooks talk about, such as consistency, unbiasedness, and asymptotic variance.2

4.1.1 What is econometrics?

Angrist and Pischke (2014, p. xi) suggest that “economists’ use of data to answer cause-and-effect questions constitutes the field of applied econometrics.” Given that “applied” here is to in contradistinction to “theoretical” econometrics, which has no direct connection with accounting research, this is perhaps one working definition of econometrics for the purposes of accounting research. Given accounting researchers focus on “cause-and-effect questions” (discussed above), this definition seems consistent with what empirical accounting researchers do.

For example, we might posit that an economic phenomenon \(y\) is best modelled as follows

\[ y = X \beta + \epsilon \] where \(y\) and \(\epsilon\) are \(N\)-element vectors, \(X\) is an \(N \times K\) matrix, and \(\beta\) is a \(K\)-element vector of coefficients. This single-equation model “is still the workhorse in empirical economics” (Wooldridge, 2010, p. 49) and this is no doubt true in empirical accounting research.

An very important point about the equation above is that it should be viewed as a structural model. Wooldridge (2010, p. 49) writes “Goldberger (1972) defines a structural mode as one representing a causal relationship, as opposed to a relationship that simply captures statistical associations. A structural equation can be obtained from an economic model, or it can be obtained through informal reasoning.”

It we were only interested in estimate this equation to generate conditional expectation function linear in the parameters, then we would not be concerned with whether \(\mathbb{E}[X, \epsilon] \neq 0\), whether there were “bias” in the estimates \(\hat{\beta}\), or whether we had sufficiently “controlled for” variables in the right-hand side of the equation. Such notions only apply if we are viewing this equation as a structural model.

Viewing the equation above as a structural model, if we knew the coefficients in \(\beta\), we could understand how changes in \(X\) cause changes in \(y\). For example, if \(x_{ik}\) goes from \(0\) to \(1\), then we expect \(y_i\) to increase by \(\beta_k\). In other words, \(\beta_k\) can be viewed as the (causal) effect of \(x_{k}\) on \(y\). Of course, in reality, we don’t know \(\beta\); we need to estimate it using data. Much of econometrics is devoted to explaining how (and when) we can estimate \(\beta\) accurately and efficiently.

An important point to make is that in viewing the equation above as a structural model, we depart slightly from more recent usage, which has tended to view a structural model as something deriving from an economic model, typically one starting with preferences and technologies for production of economic goods and information and deriving predictions about economic phenomena from that model. This narrower view of structural models does not seem to fit the reality that almost all empirical research proceeds without access to such models, notwithstanding a focus on causal inference.

Sometimes the reasoning behind the use of particular models is poor, and models implied by what researchers do can be poorer than they need to be. But models are always simplifications of reality and thus always in some sense “wrong”. And, even if we accept a model in a given setting, it is often the case that we “know” that our estimators are unlikely to provide unbiased estimates of the model’s true parameter values. Yet that does not change what researchers are trying to do when they conduct empirical analysis using econometric techniques.

If you look at the indexes of some standard textbooks from just a few years ago, you will see no entries for causal, causation, or similar terms.3 But we would argue that econometrics properly conceived is the social science concerned with estimation of parameters of (structural) economic models. We would further argue that any time we believe we can read of an estimate of a causal effect from an econometric analysis, we are estimating a structural model, even if the model does not involve advanced economic analysis. For example, if we randomly assign observations to treatment and control, then most would agree that we can (under certain assumptions) read causal effects off a regression of the outcome of interest on the treatment indicator. So this is a “structural” model, albeit a very simple one.

Recently, some econometric textbooks have been more explicit about causal inference as the goal of almost all empirical research in economics. One example is Mostly Harmless Econometrics (Angrist and Pischke, 2008) and another is Causal Inference: The Mixtape (Cunningham, 2021).

The definition of econometrics that we provide here is not vacuous; not everything that accounting researchers do with econometric techniques could be described as econometric analysis. At times, researchers run regressions without giving any thought as to the model they are trying to estimate. As we will see later in this book, there are settings where accounting researchers have drawn (causal) inferences from estimated coefficients about economic phenomena whose connection to the empirical models used is very far from clear. In other settings, researchers will implicitly assume that the null hypothesis implies a zero coefficient on some variable without doing any modelling of this. One of the goals of this book is to encourage researchers to keep in mind what we are trying to do when we conduct econometric analysis.

4.1.2 Econometrics: The case of conditional conservatism

Conservatism has long been regarded as a hallmark of financial reporting. Many accounting standards treat losses and gains differently, with a greater willingness to recognize losses than gains when such losses and gains. For example, if the present value of expected future cash flows from an asset have declined below the asset’s carrying value, several accounting standards (e.g., IAS 36) require the recognition of a commensurate loss. However, if the present value of expected future cash flows from an asset have increased above the asset’s carrying value, those standards may defer any recognition of a gain until those future cash flows are realized. In this way, conservatism manifests as asymmetric timeliness in the recognition of losses and gains. Because this form of conservatism is conditional on news about the value of assets, it is also known as conditional conservatism.4

Basu (1997) developed a measure of asymmetric timeliness that is derived from estimation of the following regression.

\[ X_{it}/P_{it} = \alpha_0 + \alpha_1 D_{it} + \beta_0 R_{it} + \beta_1 (R_{it} \times D_{it}) + \epsilon_{it},\] where \(X_{it}\) is the earnings per share for firm \(i\) in fiscal year \(t\), \(P_{it}\) is the price per share for firm \(i\) at the beginning of fiscal year \(t\), \(R_{it}\) is the return on firm \(i\) from 9 months before fiscal year-end \(t\) to three months after fiscal year-end \(t\) and \(D_{it}\) is an indicator for \(R_{it} < 0\).

Using examples and intuition, Basu (1997) argues that \(\hat{\beta}_1\) is an measure of asymmetric timeliness. An extensive literature has exploited the Basu (1997) measure to understand the role of conservatism in accounting. For example, Ball et al. (2000, p. 22) “propose that common-law accounting income is more asymmetrically conservative than code-law, due to greater demand for timely disclosure of economic losses” and provide evidence based on differences in the Basu (1997) measure. Jayaraman and Shivakumar (2012, p. 95) “find an increase in [the Basu (1997) measure] after the passage of antitakeover laws for firms with high contracting pressures.”

This literature proceeded notwithstanding the absence of any demonstration that the Basu (1997) measure provides a unbiased and efficient estimator of an underlying conservatism parameter. While Ball et al. (2013) suggests the promise of covering the “econometrics of the Basu asymmetric timeliness coefficient” in its title, it contains little analysis of the kind seen in an econometrics text.

A more conventional econometric analysis might posit conditional conservatism as something that can be parameterized in some way. For example, specifying \(\theta\) as the key parameter—with higher values of \(\theta\) representing higher levels of conservatism—it might then be demonstrated how estimates of this underlying parameter \(\theta\) can be made—or at least how measures that are increasing in \(\theta\) can be constructed.

In an effort to establish some econometric foundations for the Basu (1997) measure, Ball et al. (2013) “model conditional conservatism as recognition of \(y\) in the current period only when it is sufficiently bad news … [and assume current period recognition] if \(y_t < c\) and zero otherwise, where \(c\) is a threshold below which current-period recognition occurs.” Focusing on the case where \(c = 0\), Ball et al. (2013) show that the expected Basu (1997) coefficient will be positive, but when there is no conditional conservatism (e.g., \(c = -\infty\) or \(c = \infty\)), the expected Basu (1997) coefficient will be zero. But Ball et al. (2013) do not even propose an underlying parameterization of conservatism on which conventional econometrics could be brought to bear. The only parameter here is \(c\), which does not have a monotonic relationship with conservatism (Ball et al. (2013) point out that both \(c = -\infty\) and \(c = \infty\) are consistent with the absence of conservatism). In effect, Ball et al. (2013) show that if conservatism is defined as an asymmetric relationship between returns and income around zero returns, then an OLS estimator that captures an asymmetric relationship between returns and income around zero returns (such as the Basu (1997) measure) will detect conservatism.5

Ball et al. (2013, p. 1073) state that “we show that, holding other things constant, the Basu regression identifies conditional conservatism only when it exists.” Apart from the heavy lifting being done by the words “holding other things constant”, this claim belies how the Basu (1997) measure has been used in research. The two papers cited above—Ball et al. (2000) and Jayaraman and Shivakumar (2012)—interpret higher levels of the Basu (1997) measure as capturing greater conservatism. Dietrich et al. (2022, p. 2150) point out that “differences in conservatism” across firms and time “is the explicit focus of nearly all conservatism research.” But nothing in Ball et al. (2013) even explains the concept of “greater conservatism” let alone how the Basu (1997) measure captures it.6

Ball et al. (2013, p. 1071) argue that “much criticism of the Basu regression misconstrues researchers’ objectives.” Specifically, Ball et al. (2013, p. 1091) suggest that criticism is “based on a misconception that the objective of a Basu specification is to estimate structural parameters” of a model” and argue that “when the research objective is to estimate the functional shape of the conditional expectation \(\mathbb{E}(I|R)\), return is the correct independent variable, and conditioning on it does not induce bias.” In other words, if your objective is to estimate the Basu (1997) specification, it makes little sense to describe the Basu (1997) measure as biased. This demonstrates the tight relation between the researcher’s objective (e.g., an consistent, efficient estimator of \(\theta\)) and the appropriateness of the estimators being used.

4.1.3 Econometrics: A brief illustration

While this is a text book on accounting research, let’s consider a stylized example from labour economics. Suppose that we posit the following structural model:

\[ \begin{aligned} y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i \\ x_{i1} &= \alpha_0 + \alpha_2 x_{i2} + \alpha_3 x_{i3} + \eta_i \end{aligned} \] where the \(i\) subscripts denote individual \(i\). In the first equation, \(y_i\) is income at age 30, \(x_{i1}\) is years of education, \(x_{i2}\) is a measure of industriousness, \(x_{i3}\) is a measure of intelligence, and \(\epsilon_i\) can be interpreted as random factors that affect \(y_i\) independent of \(X_i\). In the second equation, we add coefficients, \(\alpha := \left(\alpha_0, \alpha_1, \alpha_2 \right)\), and \(\eta_i\), which can be interpreted as random factors that affect \(x_{i1}\) independent of \(x_{i2}\) and \(x_{i3}\).7

As researchers, we can postulate a model of the form above and we might obtain data on \((y, X)\) that we could use to estimate \(\beta := \left(\beta_0, \beta_1, \beta_2, \beta_3 \right)\). But we don’t know that the model is correct and even if we did, we don’t know the values in \(\beta\).

As we saw in Chapter 3, the OLS regression estimator can be written

\[ \hat{\beta} = \left(X^{\mathsf{T}}X \right)^{-1} (X^{\mathsf{T}}y) \] and it can be shown mathematically that OLS has good properties—such as unbiasedness and efficiency—under certain conditions. But in a world of cheap computing, we don’t need to break out our pencils to do mathematics. Instead, we can “play God” in some sense and fix parameter values, simulate the data, then examine how well a researcher would do in estimating the parameter values that we set.

set.seed(2021)

n <- 100000

df <- tibble(
  industry = rnorm(n),
  intelligence = rnorm(n),
  education = 3 * intelligence + 4 * industry + rnorm(n),
  income = 10 + 5 * education + 6 * intelligence + 7 * industry + rnorm(n))

Now we can estimate three different models. It will turn out to be convenient to store the fitted models in a list (fms), which we initiate by creating an empty list with list().

fms <- list()
fms[[1]] <- lm(income ~ education, data = df)
fms[[2]] <- lm(income ~ education + intelligence + industry, data = df)
fms[[3]] <- lm(income ~ intelligence + industry, data = df)

The results from these models are presented in the table below.

modelsummary(fms,
             estimate = "{estimate}",
             statistic = NULL,
             gof_map = c("nobs", "r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 4.1: Student admissions: Regression results
 (1)   (2)   (3)
(Intercept) 9.997 10.003 10.019
education 6.767 4.999
intelligence 6.003 21.008
industry 7.000 27.007
Num.Obs. 100000 100000 100000
R2 0.996 0.999 0.978
* p < 0.1, ** p < 0.05, *** p < 0.01

4.1.4 Exercises

  1. Looking at the simulation code, what are the true values of \(\beta := \left(\beta_0, \beta_1, \beta_2, \beta_3 \right)\) and \(\alpha := \left(\alpha_0, \alpha_1, \alpha_2 \right)\)?
  2. Does any one of the three equations provide good estimates of \(\beta\)?
  3. Consider model (3), with regard to the first of the two equations, are there any issues with regard to estimating \(\beta\)? What (if any) OLS assumption is violated?
  4. What happens if you substitute the second equation (for \(x_{i1}\)) into the first equation (for \(y_i\))? Does this equation satisfy OLS assumptions in some way?
  5. Using the equations, what happens if we arbitrarily increase the value of industry (\(x_{i3}\)) by one unit? What happens to education (\(x_{i1}\))? What happens to income (\(y_i\))?
  6. Can you read the effect sizes from the previous question off any of the regression results? Which one(s)?

4.2 Basic causal relations

The section provides a brief introduction to causal diagrams. Fuller coverage of this topic can be found in Morgan and Winship (2014) and Huntington-Klein (2021); Pearl (2009a) offers a more advanced treatment.

Figure 4.1, Figure 4.2, and Figure 4.3 illustrate the basic ideas of causal diagrams and how they can be used to facilitate thinking about causal inference. Each figure depicts potential relationships among three observable variables. In each case, we are interested in understanding how the presence of a variable \(Z\) impacts the estimation of the causal effect of \(X\) on \(Y\). The only difference between the three graphs is the direction of the arrows linking either \(X\) and \(Z\), or \(Y\) and \(Z\). The boxes (or nodes) represent random variables and the arrows (or edges) connecting boxes represent hypothesized causal relations, with each arrow pointing from a cause to a variable assumed to be affected by it.

Pearl (2009a) shows that, if we are interested in assessing the causal effect of \(X\) on \(Y\), we may be able to do so by conditioning on a set of variables, \(Z\), that satisfies certain criteria. These criteria imply that very different conditioning strategies are needed for each of the causal diagrams (see Gow et al. (2016) for a more formal discussion).

While conditioning on variables is much like the standard notion of “controlling for” such variables in a regression, there are critical differences. First, conditioning means estimating effects for each distinct level of the set of variables in \(Z\). This concept of nonparametric conditioning on \(Z\) is more demanding than simply including \(Z\) as another regressor in a linear regression model.8 Second, the inclusion of a variable in \(Z\) may not be an appropriate conditioning strategy. Indeed, it can be that the inclusion of \(Z\) results in biased estimates of causal effects.

Figure 4.1: \(Z\) is a confounder

Figure 4.1 is straightforward. In this case, we need to condition on \(Z\) in order to estimate the causal effect of \(X\) on \(Y\). Note the notion of “condition on” again is more general than just including \(Z\) in a parametric (linear) model.9 The need to condition on \(Z\) arises because \(Z\) is what is known as a confounder.

Figure 4.2: \(Z\) is a mediator

Figure 4.2 is a bit different. Here \(Z\) is a mediator of the effect of \(X\) on \(Y\). No conditioning is required in this setting to estimate the total effect of \(X\) on \(Y\). If we condition on \(X\) and \(Z\), then we obtain a different estimate, one that includes the indirect effect of \(X\) on \(Z\).

Figure 4.3: \(Z\) is a collider

Finally in Figure 4.3, we have \(Z\) acting as what is referred to as a “collider” variable (Pearl, 2009b).10 Again, not only do we not need to condition on \(Z\), but that we should not condition on \(Z\) to get an estimate of the total effect of \(X\) on \(Y\). While in epidemiology, the issue of “collider bias can be just as severe as confounding” (p.,186 of Glymour and Greenland (2008)), collider bias appears to receive less attention in accounting research than confounding. Many intuitive examples of collider bias involve selection or stratification. Admission to university could be a function of combined test scores (\(T\)) and interview performance (\(I\)) exceeding a threshold, i.e., \(T + I \geq C\). Even if \(T\) and \(I\) are unrelated unconditionally, a regression of \(T\) on \(I\) conditioned on admission to university is likely to show a negative relation between these two variables.

set.seed(2021)

n <- 100000

admissions <- tibble(
  test = rnorm(n),
  interview = rnorm(n),
  score = test + interview,
  cutoff = quantile(score, .90),
  admitted = score >= cutoff)
admissions |> 
  ggplot(aes(x = test, y = interview, color = admitted)) +
  geom_point() +
  theme_bw()
Figure 4.4: Admission status, interview and test scores

We will fit some model to the data we have generated.

In the following code, we estimate four models and store the estimated models in a list fms (for “fitted models”).11 Models (3) and (2) capture subsets of data where applicant are or are not admitted, respectively. Models (1) and (4) use all observations, but differ in terms of the regression terms included.

fms <- list()
fms[[1]] <- lm(interview ~ test, data = admissions)
fms[[2]] <- lm(interview ~ test, data = admissions, subset = !admitted)
fms[[3]] <- lm(interview ~ test, data = admissions, subset = admitted)
fms[[4]] <- lm(interview ~ test * admitted, data = admissions)

The results from these models are presented in the following table.

modelsummary(fms,
             estimate = "{estimate}",
             statistic = NULL,
             gof_map = c("nobs", "r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 4.2: Regression with collider variables
 (1)   (2)   (3)   (4)
(Intercept) 0.001 −0.161 2.137 −0.161
test −0.008 −0.174 −0.719 −0.174
admittedTRUE 2.298
test × admittedTRUE −0.545
Num.Obs. 100000 90000 10000 100000
R2 0.000 0.031 0.515 0.227
* p < 0.1, ** p < 0.05, *** p < 0.01

4.2.1 Exercises

  1. Imagine that, while you understand the basic idea that both tests and interviews affect admissions, you only have access to the regression results reported above. Which of the four models seems to best “describe the data”? Which of the four models seems to do worst?
  2. How are the coefficients in model (4) related to those in models (2) and (3)? Is this coincidence? Or would we always expect these relations to hold?
  3. As economists, we should be alert to the “endogeneity” of institutions. For example, if universities admitted students based on test and interview performance, they probably have good reasons for doing so. Can you think of reasons why a university might add test and interview performance into a single score? Does your story have implications for the relationship between test and interview performance?
  4. Using mutate, create a fourth variable test_x_admitted that is the product of test and admitted and run a version of model (4) above that uses this variable. Then regress, interview on test, admitted, and test_x_admitted. Do you get the same results are are shown above for model (4)? Why or why not?

4.3 Causal diagrams: Formalities

Here we provide a brief formal treatment of some of the ideas on causal diagrams discussed above. See Pearl (2009a) for more detailed coverage.

4.3.1 Definitions and a result

We first introduce some basic definitions and a key result. While this material may seem forbiddingly formal, the ideas are actually fairly straightforward and we will cover a few examples to demonstrate this. We recommend that in your first pass through the material, you spend just a little time on the two definitions and the theorem and then go back over them after reading the guide we provide just after the theorem.

Definition 4.1 A path \(p\) is said to be \(d\)-separated (or blocked) by a set of nodes \(Z\) if and only if

  1. \(p\) contains a chain \(i \rightarrow m \rightarrow j\) or a fork \(i \leftarrow m \rightarrow j\) such that the middle node \(m\) is in \(Z\), or
  2. \(p\) contains an inverted fork (or ) \(i \rightarrow m \leftarrow j\) such that the middle node \(m\) is not in \(Z\) and such that no descendant of \(m\) is in \(Z\).

Definition 4.2 A set of variables \(Z\) satisfies the back-door criterion relative to a an ordered pair of variables \((X, Y)\) in a directed acyclic graph (DAG) \(G\) if:

  1. no node in \(Z\) is a descendant of \(X\); and
  2. \(Z\) blocks every path between \(X\) and \(Y\) that contains an arrow into \(X\).12

Given this criterion, Pearl (2009a, p. 79) proves the following result.

Theorem 4.1 If a set of variables \(Z\) satisfies the back-door criterion relative to \((X, Y)\), then the causal effect of \(X\) on \(Y\) is identifiable and is given by the formula \[ P(y | x) = \sum_{z} P(y | x, z) P(z), \] where \(P(y|x)\) stands for the probability that \(Y = y\), given that \(X\) is set to level \(X=x\) by external intervention.13

In plainer language, Theorem 4.1 tells us that we can estimate the causal effect of \(X\) on \(Y\) if we have a set of variables \(Z\) that satisfies the back-door criterion. Then Definition 4.2 tells us that \(Z\) needs to block all back-door paths while not containing any descendant of \(X\). Finally, Definition 4.1 tells what it means to block a path.

4.3.2 Application of back-door criterion to basic diagrams

Applying the back-door criterion to Figure 4.1 is straightforward and intuitive. The set of variables \(\{Z\}\) or simply \(Z\) satisfies the criterion, as \(Z\) is not a descendant of \(X\) and \(Z\) blocks the back-door path \(X \leftarrow Z \rightarrow Y\). So by conditioning on \(Z\), we can estimate the causal effect of \(X\) on \(Y\). This situation is a generalization of linear model in which \(Y = X \beta + Z \gamma + \epsilon_Y\) and \(\epsilon_Y\) is independent of \(X\) and \(Z\), but \(X\) and \(Z\) are correlated. In this case, it is well known that omission of \(Z\) would result in a biased estimate of \(\beta\), the causal effect of \(X\) on \(Y\), but by including \(Z\) in the regression, we get an unbiased estimate of \(\beta\). In this situation, \(Z\) is a confounder.

Turning to Figure 4.2, we see that \(Z\), which is a of the effect of \(X\) on \(Y\), does not satisfy the back-door criterion, because \(Z\) is a descendant of \(X\). However, \(\emptyset\) (i.e., the empty set) does satisfy the back-door criterion. Clearly, \(\emptyset\) contains no descendant of \(X\). Furthermore, the only path other than \(X \rightarrow Y\) that exists is \(X \rightarrow Z \rightarrow Y\), which does not have a back-door into \(X\). Note that the back-door criterion not only implies that we need not condition on \(Z\) to obtain an unbiased estimate of the causal effect of \(X\) on \(Y\), but that we should not condition of \(Z\) to get such an estimate.

Finally in Figure 4.3, we have \(Z\) acting as what Pearl (2009b) (p. 17) refers to as a “collider” variable.14 Again, we see that \(Z\) does not satisfy the back-door criterion, because \(Z\) is a descendant of \(X\). However, \(\emptyset\) again satisfies the back-door criterion. First, it contains no descendant of \(X\). Second, the only path other than \(X \rightarrow Y\) that exists is \(X \rightarrow Z \leftarrow Y\), which does not have a back-door into \(X\). Again, the back-door criterion not only implies that we need not condition on \(Z\), but that we should not condition of \(Z\) to get an unbiased estimate of the causal effect of \(X\) on \(Y\).

4.3.3 Exercises

  1. Draw the DAG for the structural model above relating intelligence, industriousness, education, and income. For each \(x\) variable, identify the sets of conditioning variables that satisfy the backdoor criterion with respect to estimating a causal effect of the variable on \(y\).
  2. For any valid set of conditioning variables not considered in Table 4.1, run regressions to confirm that these indeed deliver good estimates of causal effects.

4.4 Discrimination and bias

Let’s examine a real-world example related to possible gender discrimination in labour markets.15 When critics claimed that Google systematically underpaid its female employees, Google responded that when “location, tenure, job role, level and performance” are taken into consideration, women’s pay was basically identical to that of men’s.16 In other words, controlling for characteristics of the job, women received the same pay.

But what if stereotyping means men are given roles that are paid better? In this case, naive comparisons of wages by gender “controlling for” occupation would understate the presence of discrimination. Let’s illustrate this with a DAG based on a simple occupational sorting model with unobserved heterogeneity.

Figure 4.5: Discrimination

Note that there is in fact no effect of being female (\(F\)) on earnings (\(Y\)) except through discrimination (\(D\)). Thus, if we could control for discrimination, we’d get a coefficient of zero on \(F\). In this example, we aren’t interested in estimating the effect of being female on earnings; we are interested in estimating the effect of discrimination itself. Note also that discrimination is not directly observed, but given our DAG, we can use \(F\) as a proxy for \(D\), as we have assumed that there is no relation between \(F\) and \(Y\) (or between \(F\) and \(O\), which denotes occupational assignment) except through \(D\).

In this DAG, there are two paths between \(D\) and \(Y\):

\[ \begin{aligned} D &\rightarrow Y \\ D &\rightarrow O \rightarrow Y \end{aligned} \]

Neither path is a backdoor path between \(D\) and \(Y\) because neither has an arrow pointing into \(D\). Conditioning on \(O\) causes the estimated effect of discrimination on income to be biased because \(O\) is a descendant of \(D\), and a set of conditioning variables \(Z\) satisfies the back-door criterion only if no node in \(Z\) is a descendant of \(D\).

n <- 100000

set.seed(2023)
df <- tibble(
  female = runif(n) >= 0.5,
  discrimination = female,
  occupation = 1 + 0 * female - 2 * discrimination + rnorm(n),
  salary = 1 - 1 * discrimination + 2 * occupation + rnorm(n) 
)

lm_1 <- lm(salary ~ female, data = df)
lm_2 <- lm(salary ~ female + occupation, data = df)
modelsummary(list(lm_1, lm_2),
             estimate = "{estimate}",
             statistic = NULL,
             gof_map = c("nobs", "r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 4.3: Discrimination regressions
 (1)   (2)
(Intercept) 3.013 1.003
femaleTRUE −5.005 −0.996
occupation 1.999
Num.Obs. 100000 100000
R2 0.556 0.911
* p < 0.1, ** p < 0.05, *** p < 0.01

4.4.1 Exercises

  1. Given the equations for occupation and salary in the simulation above, what is the direct effect of discrimination on salary? What is the indirect effect (i.e., the effect via occupation) of discrimination on salary? What is the total effect of discrimination on salary? How do each of these effects show up in the reported regression results? (Note: Because of sampling variation, the relationships will not be exact.)
  2. Consider the possibility of an additional unobserved variable, ability (\(A\)), that affects role assignment (\(O\)) and also affects income (\(Y\)) directly.
Figure 4.6: Discrimination with ability

What would be the correct conditioning strategy in this case? Would it now make sense to condition on \(O\) if the goal is to estimate the total effect of discrimination on \(Y\)? (Hint: In answering this question, it may help to adapt the simulation above to generate an ability variable and to incorporate that in the model as follows.)

set.seed(2023)
ability = rnorm(n)
occupation = 1 + 0 * female - 2 * discrimination + 2 * ability + rnorm(n)
salary = 1 - 1 * discrimination + 2 * occupation + 0.3 * ability + rnorm(n)
  1. Consider the additional possibility of different occupational preferences between males and females, as depicted below.
Figure 4.7: Discrimination with ability and job preferences

Given this DAG, is it possible to identify a set of conditioning variables that allow you to estimate the total effect of discrimination on salary? What roles does \(O\) have in this DAG? (Hint: Replace the 0 coefficient in the occupation equation used in the last question with a value of either -1 or +1. Does the sign of this coefficient affect the sign of bias, if any?) Is it possible to estimate the direct effect of discrimination on salary? If so, how? If not, why?

4.5 Causal diagrams: Applications in accounting

A typical paper in accounting research will include many variables to “control for” the potential confounding of causal effects. While many of these variables should be considered confounders, less attention is given to explaining why it is reasonable to assume that they are not mediators or colliders. Such a discussion is important because the inclusion of “controls” that are mediators or colliders will generally lead to bias.

One paper that does discuss this distinction is Larcker et al. (2007), who use a multiple regression (or logistic) model of the form:17

\[ Y = \alpha + \sum_{r=1}^R \gamma _r Z_r + \sum_{s=1}^S \beta_s X_s + \epsilon \tag{4.1}\]

Larcker et al. (2007) suggest that

“One important feature in the structure of Equation 4.1 is that the governance factors [\(X\)] are assumed to have no impact on the controls (and thus no indirect impact on the dependent variable). As a result, this structure may result in conservative estimates for the impact of governance on the dependent variable. Another approach is to only include governance factors as independent variables, or:

\[ Y = \alpha + \sum_{s =1}^S \beta_s X_s + \epsilon \tag{4.2}\]

The structure in Equation 4.2 would be appropriate if governance impacts the control variables and both the governance and control variables impact the dependent variable (i.e., the estimated regression coefficients for the governance variables will capture the total effect or the sum of the direct effect and the indirect effect through the controls).”

But there are some subtle issues here. If some elements of \(Z_r\) are mediators and others are confounders, then both equations will be subject to bias. The estimator based on Equation 4.2 will be biased due to omission of confounders, while that based on Equation 4.1 will be biased due to inclusion of mediating variables. Additionally, the claim that the estimates are “conservative” is only correct if the indirect effect via mediators is of the same sign as the direct (i.e., unmediated) effect. If this is not the case, then the relation between the magnitude (and even the sign) of the direct effect and the indirect effect is unclear.

Additionally, this discussion does not allow for the possibility of colliders. For example, governance plausibly affects leverage choices, while performance is also likely to affect leverage. If so, “controlling for” leverage might induce associations between governance and performance even absent a true relation between these variables.18 While the with-and-without-controls approach used by Larcker et al. (2007) has intuitive appeal, a more robust approach requires careful thinking about the plausible causal relations between the treatment variables, the outcomes of interest, and the candidate control variables.


  1. The author of this site used to have an insightful video on YouTube discussing some of these issues, but the link no longer works. We discuss related issues in Chapter 19.↩︎

  2. We provide some intuitive discussion of these ideas in Chapter 5.↩︎

  3. You might find an entry on Granger causality, but with some caveat that this is a misnomer.↩︎

  4. In contrast, the non-recognition of most internally generated intangible assets is not conditional on news, and it therefore considered unconditional conservatism.↩︎

  5. Ball et al. (2013) was in large part a response to results in Dietrich et al. (2007) that showed “bias” in the Basu (1997) measure. The careful reader might wonder how bias can arise if it is not clear what underlying parameter is being estimated. But Dietrich et al. (2007) choose their words carefully and merely seek to demonstrate bias in the test statistics in favour of a hypothesis of conservatism when none exists rather than bias in an estimate coefficient relative to its true value.↩︎

  6. Dietrich et al. (2022) make some tentative steps in the direction of a model in which conservatism is parameterized by considering two different levels of conservatism in addition to the no-conservatism regime.↩︎

  7. Here \(:=\) means “defined as …”.↩︎

  8. Including variables in a linear regression framework “controls for” only under strict assumptions, such as linearity in the relations between \(X\), \(Y\), and \(Z\).↩︎

  9. Inclusion of \(Z\) blocks the “back-door” path from \(Y\) to \(X\) via \(Z\).↩︎

  10. The two arrows from \(X\) and \(Y\) “collide” in \(Z\).↩︎

  11. To learn more about lists, read the related material in R for Data Science.↩︎

  12. The “arrow into \(X\)” is the portion of the definition that is explains the “back-door” terminology.↩︎

  13. How the quantities \(P(y|x)\) map into estimates of causal effects is not critical to the current discussion, it suffices to note that in a given setting, it can be calculated if the needed variables are observable.↩︎

  14. The two arrows from \(X\) and \(Y\) “collide” in \(Z\).↩︎

  15. This subsection is adapted from Cunningham (2021), which is available at mixtape.scunning.com.↩︎

  16. See gizmodo.com/google-employees-organize-their-own-study-of-gender-pay-1802767010.↩︎

  17. We alter the mathematical notation and equation numbers of Larcker et al. (2007) to conform with notation and numbering we use here.↩︎

  18. Note that Larcker et al. (2007) do not in fact use leverage as a control when performance is a dependent variable.↩︎