26 Matching

In Chapter 5, we discussed three basic causal diagrams and suggested that the concept of nonparametric conditioning on \(Z\) is more demanding than simply including \(Z\) as another regressor in a linear regression model. This chapter develops this idea more fully and examines practical approaches to properly conditioning on confounding variables.

The code in this chapter uses the packages listed in the code below. For instructions on how to set up your computer to use the code found in this book, see Section 1.2.1.
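A minimal set of packages, inferred from the functions used in this chapter rather than taken from a surviving package list, is as follows (farr supplies the winsorize() function used below):

library(dplyr)      # data wrangling; tbl() and sql() for database access
library(DBI)        # dbConnect()
library(ggplot2)    # histograms
library(farr)       # winsorize()
library(lfe)        # felm()
library(MatchIt)    # matchit(), match.data()
library(optmatch)   # pairmatch()
library(stargazer)  # regression tables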

We use the stargazer package to render regression output in this chapter. For the HTML version of this book, we set sg_format <- "html", but sg_format <- "text" is a better option if you are looking at the results interactively.

sg_format <- "html"

26.1 Background on auditor choice

For concreteness, we will explore the basic ideas of this chapter using the setting of auditor choice and its effects. An extensive literature has examined the question of whether “Big N” auditors produce higher audit quality. This question is examined in two of the papers we will study in this chapter (DeFond et al., 2017; Lawrence et al., 2011).

For data on auditors, the table comp.funda has a column labelled au, which lines up with aucd in comp.r_auditors.

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

company <- tbl(pg, sql("SELECT * FROM comp.company"))
funda <- tbl(pg, sql("SELECT * FROM comp.funda"))
r_auditors <- tbl(pg, sql("SELECT * FROM comp.r_auditors"))
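To see how the two columns line up, we can join funda to r_auditors (an illustrative query; note that au is stored as a character column in funda, so we cast it to an integer first):

funda %>%
  filter(!is.na(au)) %>%
  mutate(aucd = as.integer(au)) %>%
  inner_join(r_auditors, by = "aucd") %>%
  select(gvkey, datadate, au, audesc) %>%
  collect(n = 5)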

While Lawrence et al. (2011) refer to the “Big Four”, they are actually referring to the current group that DeFond et al. (2017) refer to as the “Big N”. The latter term alludes to prior incarnations of the group of top audit firms, including the “Big Five” (prior to the demise of Arthur Andersen) and the “Big Eight” (prior to mergers in the late 1980s) (see Gow and Kells, 2018, for more on this history).

r_auditors %>% collect(n = 12)
## # A tibble: 12 × 2
##     aucd audesc                          
##    <int> <chr>                           
##  1     0 Unaudited                       
##  2     1 Arthur Andersen                 
##  3     2 Arthur Young                    
##  4     3 Coopers & Lybrand               
##  5     4 Ernst & Young                   
##  6     5 Deloitte & Touche               
##  7     6 KPMG                            
##  8     7 PricewaterhouseCoopers          
##  9     8 Touche Ross                     
## 10     9 Other                           
## 11    10 Altschuler, Melvoin, and Glasser
## 12    11 BDO

Looking closer at r_auditors, we see that Arthur Andersen has an aucd of 1. Note that Arthur Young is now part of Ernst & Young, Coopers & Lybrand is now part of PricewaterhouseCoopers, and Touche Ross is now part of Deloitte & Touche. So a reasonable approach to “Big N” would appear to be big4 = aucd %in% 1:8L. It’s commonly understood that most large firms choose a Big Four auditor, but rather than simply accepting that, we can look at the data.
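Before doing so, we can verify which auditors this coding classifies as “Big N” (an illustrative check):

r_auditors %>%
  mutate(big4 = aucd %in% 1:8L) %>%
  collect()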

Here we focus on firms with meaningful financial statements (sale > 0, at > 0) and on fiscal 2019, the latest fiscal year for which complete data were available at the time of writing.

size_big4 <-
  funda %>%
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") %>%
  filter(sale > 0, at > 0, fyear == 2019, !is.na(au)) %>%
  mutate(au = as.integer(au)) %>%
  select(gvkey, datadate, fyear, au, prcc_f, csho) %>%
  mutate(big4 = au %in% 1:8L,
         mkt_cap = prcc_f * csho * 1e6) %>%
  collect()

We calculate market share by “bins” where each bin represents an order of magnitude of market capitalization. For example, (9, 10] includes all firms with market capitalization over $1 billion (\(10^9\)) and less than or equal to $10 billion (\(10^{10}\)).

size_big4 %>% 
  filter(mkt_cap > 1e6) %>% 
  mutate(log_mkt_cap = log10(mkt_cap)) %>%
  mutate(bin = cut(log_mkt_cap, breaks = seq(6, 13, 1))) %>%
  group_by(bin) %>%
  summarize(big4_perc = 100 * mean(big4)) %>%
  knitr::kable(digits = 2)
|bin     | big4_perc|
|:-------|---------:|
|(6,7]   |     16.80|
|(7,8]   |     27.56|
|(8,9]   |     57.86|
|(9,10]  |     89.70|
|(10,11] |     96.76|
|(11,12] |     98.91|
|(12,13] |    100.00|

Interestingly, the one case of a company with an apparent market capitalization over $100 billion and a non–Big Four auditor is Vanjia, which apparently misstated the number of shares outstanding as 8,550 million rather than 30 million. (With 30 million shares, its market capitalization would have been about $431 million, well below $100 billion.)

size_big4 %>%
  filter(!big4, mkt_cap > 1e11) %>%
  knitr::kable()
|gvkey  |datadate   | fyear| au| prcc_f| csho|big4  |     mkt_cap|
|:------|:----------|-----:|--:|------:|----:|:-----|-----------:|
|026478 |2019-12-31 |  2019|  9|  14.38| 8550|FALSE | 1.22949e+11|

We can also present the data as a histogram. From either the table or the histogram, it is clear that the Big Four have overwhelming market share among the largest firms and have most of the market even among firms with market capitalization in the $100 million-to-$1 billion range.

size_big4 %>% 
  filter(mkt_cap > 1e6) %>% 
  ggplot(aes(x = log10(mkt_cap), fill = big4)) +
  geom_histogram(breaks = seq(6, 12.25, 0.25))

26.2 Replication of Lawrence et al. (2011)

To facilitate discussion, we conduct an approximate replication of Lawrence et al. (2011) and begin by constructing a subset of Compustat with calculations used in that paper.

comp <-
  funda %>%
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") %>%
  filter(!is.na(sich), !between(sich, 6000, 6999),
         sale > 0, at > 0, (dltt >= 0 | is.na(dltt)) & (dlc >= 0 | is.na(dlc)),
         prcc_f * csho > 0, ceq > 0) %>%
  select(gvkey, datadate, fyear, au, prcc_f, csho, at, ib,
         dltt, dlc, rect, ppent, sale, act, lct, sich,
         oancf) %>%
  mutate(sic2 = floor(sich/100)) %>%
  filter(between(fyear, 1988, 2006)) %>%
  collect() %>%
  group_by(gvkey) %>%
  arrange(fyear) %>%
  mutate(lag_fyear = lag(fyear),
         avg_at = (lag(at) + at)/2,
         log_at = log(at),
         aturn = if_else(lag(at) > 0, sale/lag(at), NA_real_),
         inv_avg_at = 1/avg_at,
         d_sale = sale - lag(sale),
         d_rect = rect - lag(rect),
         d_sale_at = (d_sale - d_rect)/avg_at,
         ppe_at = ppent/avg_at,
         curr = if_else(lct > 0, act/lct, NA_real_),
         lag_curr = lag(curr),
         lev = if_else(avg_at > 0, 
                       (dltt + coalesce(dlc, 0))/avg_at, NA_real_),
         lag_lev = lag(lev),
         roa = if_else(avg_at > 0, ib/avg_at, NA_real_),
         lag_roa = lag(roa),
         accruals = ib - oancf,
         ta = accruals/avg_at,
         mkt_cap = prcc_f * csho,
         log_mkt = if_else(mkt_cap > 0, log(mkt_cap), NA_real_)) %>%
  ungroup() %>%
  filter(lag_fyear == fyear - 1L)

Lawrence et al. (2011) estimate discretionary accrual models by industry and year, subject to a requirement of at least 10 data points for each model. So we compile a list of industry-years that meet this criterion.

industries <- 
  comp %>%
  group_by(fyear, sic2) %>%
  summarize(n_firms = n(), .groups = "drop") %>%
  filter(n_firms >= 10)

We next construct a function to estimate discretionary accruals along the lines described in footnote 11 of Lawrence et al. (2011).

get_klw_data <- function(sic2, fyear) {

  reg_data <-
    comp %>%
    filter(sic2 == !!sic2, fyear == !!fyear) 
           
  fm <- tryCatch(lm(ta ~ inv_avg_at + d_sale_at + ppe_at,
                    data = reg_data, na.action = na.exclude),
                 error = function(e) NULL)
    
  if (is.null(fm)) return(NULL)
  
  results <-
    reg_data %>% 
    select(gvkey, fyear, roa) %>%
    bind_cols(da = resid(fm)) %>%
    filter(!is.na(da))
  
  ada_data <-
    results %>%
    inner_join(results, by = "fyear",
               suffix = c("", "_match"),
               multiple = "all") %>%
    filter(gvkey != gvkey_match) %>%
    mutate(roa_diff = abs(roa - roa_match)) %>%
    group_by(gvkey, fyear) %>%
    filter(roa_diff == min(roa_diff, na.rm = TRUE)) %>%
    mutate(ada = abs(da - da_match)) %>%
    select(gvkey, fyear, da, ada) %>%
    ungroup()
  
  ada_data
}

We can then Map this function onto our industry-years and store the results in klw_results.

klw_results <- bind_rows(Map(get_klw_data, industries$sic2, industries$fyear))
full_sample <-
  comp %>%
  inner_join(klw_results, by = c("gvkey", "fyear")) %>%
  filter(avg_at > 0) %>%
  mutate(big4 = au %in% 1:8L,
         mkt_cap = prcc_f * csho,
         log_mkt = if_else(mkt_cap > 0, log(mkt_cap), NA_real_))
## Warning in inner_join(., klw_results, by = c("gvkey", "fyear")): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 250 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.

This warning arises because, within get_klw_data, a firm can tie with more than one closest-ROA match, so klw_results contains a small number of duplicate gvkey-fyear rows. With the full sample in hand, we estimate our first regression using felm() from the lfe package, with industry (sic2) and year (fyear) fixed effects.
fms <- list()
fms[[1]] <- felm(ada ~ big4 + log_mkt + lag_roa + 
                   lag_lev + lag_curr | sic2 + fyear,
                 data = full_sample)

Lawrence et al. (2011) follow the widely used practice of winsorizing the variables in their analysis. We use the winsorize function from the farr package (discussed in Chapter 25) to do this.

full_sample_win <-
  full_sample %>%
  mutate(across(c(ada, log_mkt, lag_roa, lag_lev, lag_curr, roa,
                  lev, curr, mkt_cap),
                \(x) winsorize(x, prob = 0.01)))
fms[[2]] <- 
  felm(ada ~ big4 + log_mkt + lag_roa + lag_lev + lag_curr | 
         sic2 + fyear | 0 | gvkey + fyear,
       data = full_sample_win)
We next construct propensity-score-matched samples using the matchit() function from the MatchIt package, first for the unwinsorized data and then for the winsorized data.

lmz_match <-
  full_sample %>%
  filter(!is.na(log_at) &
           !is.na(log_mkt) &
           !is.na(aturn) &
           !is.na(curr) &
           !is.na(lev) &
           !is.na(roa)) %>%
  matchit(!big4 ~ log_at + log_mkt + aturn +
            curr + lev + roa, 
          data = .,
          caliper = 0.0300, std.caliper = FALSE,
          m.order = "largest", discard = "both")
full_sample_win_no_miss <-
  full_sample_win %>%
  filter(!is.na(log_at) &
           !is.na(log_mkt) &
           !is.na(aturn) &
           !is.na(curr) &
           !is.na(lev) &
           !is.na(roa))

lmz_match_win <- matchit(!big4 ~ log_at + log_mkt + aturn +
                           curr + lev + roa, 
                         data = full_sample_win_no_miss,
                         caliper = 0.0300, std.caliper = FALSE,
                         m.order = "largest", discard = "both")

Behind the scenes, matchit() estimates propensity scores using logistic regression (its default) and matches treatment and control observations on those scores. Following Lawrence et al. (2011), we discard matches where the difference in p-scores is greater than 3%, which we implement with caliper = 0.0300 and std.caliper = FALSE (so that the caliper is expressed in p-score units rather than standard deviations).
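One simple way to gauge the quality of a match (an aside, not part of the original analysis) is the summary() method for matchit objects, which reports covariate balance before and after matching:

# Covariate balance before and after matching
summary(lmz_match)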

lmz_matched <- match.data(lmz_match)

fms[[3]] <-  
  felm(ada ~ big4 + log_mkt + lag_roa + lag_lev + lag_curr | 
              sic2 + fyear | 0 | gvkey + fyear, 
       data = lmz_matched)
lmz_matched_win <- match.data(lmz_match_win)

fms[[4]] <-  
  felm(ada ~ big4 + log_mkt + lag_roa + lag_lev + lag_curr | 
              sic2 + fyear | 0 | gvkey + fyear,
            data = lmz_matched_win)
stargazer(fms, type = sg_format, header = FALSE)
=========================================================================
                                     Dependent variable:
                     ----------------------------------------------------
                                             ada
                        (1)          (2)          (3)          (4)
-------------------------------------------------------------------------
big4                 -0.016***    -0.014***    -0.007***    -0.006***
                     (0.001)      (0.001)      (0.002)      (0.002)
log_mkt              -0.006***    -0.005***    -0.007***    -0.005***
                     (0.0002)     (0.0003)     (0.001)      (0.001)
lag_roa              -0.108***    -0.124***    -0.107***    -0.136***
                     (0.002)      (0.005)      (0.017)      (0.005)
lag_lev              -0.045***    -0.045***    -0.054***    -0.050***
                     (0.003)      (0.002)      (0.005)      (0.005)
lag_curr             -0.0001**    -0.001***    -0.0002      -0.001**
                     (0.0001)     (0.0002)     (0.0001)     (0.0003)
-------------------------------------------------------------------------
Observations         73,010       73,010       24,212       24,236
R2                   0.127        0.150        0.116        0.134
Adjusted R2          0.127        0.149        0.113        0.132
Residual Std. Error  0.124        0.100        0.147        0.119
                    (df = 72934) (df = 72934) (df = 24137) (df = 24161)
-------------------------------------------------------------------------
Note:                             *p<0.1; **p<0.05; ***p<0.01
=========================================================================

26.3 Simulation analysis

We will use simulation analysis to examine more closely how propensity-score matching works. We take the actual empirical distribution of market capitalization from the data depicted in the histogram above and give each firm a probability of using a Big Four auditor equal to the observed conditional probability for the histogram bin it is in.

prob_big4 <-
  size_big4 %>%
  filter(mkt_cap > 1e6) %>% 
  select(big4, mkt_cap) %>%
  mutate(log_mkt_cap = log10(mkt_cap)) %>%
  mutate(bin = cut(log_mkt_cap, breaks = seq(6, 12.25, 0.25))) %>%
  group_by(bin) %>%
  mutate(p_big4 = mean(big4)) %>%
  ungroup()

n_firms <- 5000

We then draw 5000 firms from this distribution and assign each firm an auditor based on the conditional probabilities calculated above. A critical assumption here is that, conditional on market capitalization, whether a Big Four auditor is chosen is completely random. Finally, we calculate a measure of audit quality that is a non-linear function of market capitalization plus a random noise component.

set.seed(2021)
sim_auditor <-
  prob_big4 %>%
  sample_n(size = n_firms, replace = TRUE) %>% 
  mutate(rand = runif(nrow(.)),
         big4 = rand < p_big4) %>%
  select(-rand, -p_big4, -bin) %>%
  mutate(audit_quality = big4 * 0 + mkt_cap ^(1/3) * 0.003 +
           rnorm(nrow(.))) 

Given this is a simulation, we know that the correct causal diagram is the following:

Figure 26.1: Market capitalization is a confounder

We draw the arrow connecting \(X\) and \(Y\) with a dashed line because the true coefficient on big4 is zero, so in this simulation there is no actual causal relation between big4 and audit_quality.

The good news from this causal diagram is that conditioning on \(Z\) gives us an unbiased estimator of the causal effect of \(X\) on \(Y\). So let’s regress audit_quality on big4 and mkt_cap.

fm <- lm(audit_quality ~ big4 + mkt_cap, 
         data = sim_auditor)         
stargazer(fm, type = sg_format, header = FALSE)
===============================================
                        Dependent variable:    
                    ---------------------------
                          audit_quality        
-----------------------------------------------
big4                        2.209***           
                            (0.063)            
mkt_cap                     0.000***           
                            (0.000)            
Constant                    1.590***           
                            (0.050)            
-----------------------------------------------
Observations                 5,000             
R2                           0.572             
Adjusted R2                  0.572             
Residual Std. Error    2.099 (df = 4997)       
F Statistic         3,342.538*** (df = 2; 4997)
-----------------------------------------------
Note:               *p<0.1; **p<0.05; ***p<0.01
===============================================

So we have a clearly positive coefficient on big4 even “controlling for” mkt_cap, and we might be tempted to conclude that we have identified a causal effect. Of course, we know that this conclusion is invalid. The issue is that we are effectively assuming a linear relation between mkt_cap and audit quality, whereas the true relation is non-linear. As a result, simply including mkt_cap as a regressor does not adequately control for its relation with audit quality. Because the probability of choosing a Big Four auditor is itself a non-linear function of mkt_cap, the regression can use information in big4 to explain variation in audit_quality beyond that explained by a linear function of mkt_cap. This additional explanatory power shows up as a statistically significant coefficient on big4 in the regression above.
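To see that functional form is the problem, we can control for market capitalization more flexibly. In this simulation we know the true relation involves the cube root of mkt_cap, so including that term should drive the estimated coefficient on big4 towards zero (a sketch for this simulation only; with real data we would not know the correct transformation):

# Controlling for the true functional form of the confounder
fm_flex <- lm(audit_quality ~ big4 + I(mkt_cap^(1/3)), 
              data = sim_auditor)

In practice, the correct functional form is unknown, which motivates the next question.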

Would some kind of matching analysis help?

Given its prominence in accounting research, we will focus on propensity-score matching here. The first step is the estimation of propensity scores. The most common approach is to use logistic regression, which we can do using the glm() function (which fits “generalized linear models”) with family = binomial(link = "logit").

sim_fm <- glm(big4 ~ mkt_cap, data = sim_auditor, 
              family = binomial(link = "logit"))

We can estimate propensity scores as the fitted values from the estimated logit model (we specify type = "response" to get fitted values on the original \([0, 1]\) scale of the dependent variable). We then feed the estimated pscore values to the pairmatch function from the optmatch package, which computes the distance from each treatment observation to each control observation and returns a factor indicating which matched pair an observation belongs to; this factor is NA if the observation is not matched.

Following Lawrence et al. (2011), we discard matches where the difference in p-scores is greater than 3% (we do this by setting matched to FALSE for these cases).

sim_matches <-
  sim_auditor %>%
  mutate(pscore = predict(sim_fm, type = "response")) %>%
  mutate(match = pairmatch(big4 ~ pscore, data = .)) %>%
  group_by(match) %>% 
  mutate(matched = max(pscore) - min(pscore) <= 0.03) %>%
  ungroup()

In the first histogram below, we see that there are many treatment observations with p-scores above, say, 0.70, but very few control observations with comparable p-scores to match with them. In this relatively simple setting, this is just another way of saying there are many large Big Four clients and few similarly sized non–Big Four clients to match with them. This suggests that many observations will be unmatched.
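We can confirm this directly by tabulating matched status by auditor type (a quick check, not in the original analysis):

# How many observations of each type end up matched?
sim_matches %>% 
  count(big4, matched)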

To understand the success of the match, we can compare two pairs of histograms. The first pair of histograms shows the propensity scores for all observations. The second pair of histograms shows the propensity scores for the matched observations. We can see that, apart from observations with p-scores around 0.5, the distributions of p-scores for treatment and control observations are very similar, suggesting a fairly good match.

sim_matches %>% 
  ggplot(aes(x = pscore, fill = big4)) + 
  geom_histogram(binwidth = 0.025) + 
  facet_grid(big4 ~ .)
Figure 26.2: Propensity scores distribution: All observations

sim_matches %>% 
  filter(matched) %>%
  ggplot(aes(x = pscore, fill = big4)) + 
  geom_histogram(binwidth = 0.025) + 
  facet_grid(big4 ~ .)
Figure 26.3: Propensity scores distribution: Matched observations

If we are happy with the match, we can finish our analysis by estimating regressions. Like Lawrence et al. (2011), we estimate regressions with and without the variables from the p-score regression (Shipman et al., 2016 refer to these approaches as “MR” and “t-test”, respectively).

fms <- list()
fms[[1]] <- lm(audit_quality ~ big4, data = sim_matches, subset = matched) 
fms[[2]] <- lm(audit_quality ~ big4 + mkt_cap, 
               data = sim_matches, subset = matched)
stargazer(fms, type = sg_format, header = FALSE)
=======================================================================
                                   Dependent variable:
                       ------------------------------------------------
                                      audit_quality
                               (1)                       (2)
-----------------------------------------------------------------------
big4                         0.295***                  0.246***
                             (0.086)                   (0.070)
mkt_cap                                                0.000***
                                                       (0.000)
Constant                     1.713***                  1.548***
                             (0.061)                   (0.050)
-----------------------------------------------------------------------
Observations                  1,828                     1,828
R2                            0.006                     0.333
Adjusted R2                   0.006                     0.332
Residual Std. Error     1.835 (df = 1826)         1.505 (df = 1825)
F Statistic          11.768*** (df = 1; 1826) 454.617*** (df = 2; 1825)
-----------------------------------------------------------------------
Note:                              *p<0.1; **p<0.05; ***p<0.01
=======================================================================

Interestingly, we see that big4 still has a statistically significant relation with audit quality. This likely reflects a residual effect of mkt_cap that is picked up by big4 because the propensity-score match is imperfect.
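To see this residual imbalance directly, we can compare mean market capitalization across treatment and control firms in the matched sample (an illustrative check):

# Residual imbalance in the confounder after matching
sim_matches %>%
  filter(matched) %>%
  group_by(big4) %>%
  summarize(mean_mkt_cap = mean(mkt_cap))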

26.4 DeFond et al. (2017)

DeFond et al. (2017) re-examine the question of Lawrence et al. (2011), but with some differences in approach.

The first set of differences relates to design choices in the application of propensity-score matching. While Lawrence et al. (2011) focus on a single propensity-score model with five primary covariates, DeFond et al. (2017) augment the propensity-score model with twenty additional variables, namely the squares and cubes of these covariates and the ten interactions between the five covariates. DeFond et al. (2017) then select 1,500 random subsets from the twenty variables to develop propensity-score models. For each estimated propensity-score model, the caliper is set at a random value (less than 30%) that yields a “balanced matched sample”.
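The following sketch illustrates how such randomly selected propensity-score specifications might be generated (variable names are hypothetical and the details are ours, not those of DeFond et al. (2017)):

# Five primary covariates (hypothetical names)
covs <- c("log_mkt", "aturn", "curr", "lev", "roa")

# Twenty additional terms: squares, cubes, and pairwise interactions
squares <- paste0("I(", covs, "^2)")
cubes <- paste0("I(", covs, "^3)")
interacts <- combn(covs, 2, FUN = \(x) paste(x, collapse = ":"))
extra <- c(squares, cubes, interacts)  # 5 + 5 + 10 = 20 terms

# One randomly chosen propensity-score specification
subset_terms <- sample(extra, size = sample(length(extra), 1))
form <- reformulate(c(covs, subset_terms), response = "big4")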

A second design-choice variation that DeFond et al. (2017) consider is matching with replacement. While Lawrence et al. (2011) focus on matching one treatment observation with (at most) one control observation without replacement (i.e., once a control observation is matched to a treatment observation, it is not available to match with another treatment observation), DeFond et al. (2017) consider matches with replacement and consider matches where each treatment observation is matched to one, two, or three control firms.
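In MatchIt, matching with replacement and multiple controls per treatment observation can be requested with the replace and ratio arguments (a sketch of the idea using the covariates from our replication above, not the authors' code):

# Nearest-neighbour matching with replacement, three controls
# per treatment observation
lmz_match_wr <- matchit(!big4 ~ log_at + log_mkt + aturn +
                          curr + lev + roa,
                        data = full_sample_win_no_miss,
                        replace = TRUE, ratio = 3)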

The second set of differences between DeFond et al. (2017) and Lawrence et al. (2011) relates to the measures of audit quality considered. Lawrence et al. (2011) consider cost of equity capital and analyst forecast accuracy, while DeFond et al. (2017) omit these two measures on the basis that they have “poor construct validity, and as a result are rarely used in the prior literature” (DeFond et al., 2017, p. 3637). Instead, DeFond et al. (2017) consider three other measures of audit quality: income-increasing discretionary accruals, restatements, and going concern opinions.

26.5 Exercises

Note that the matching algorithm used above takes some time.

  1. What happens in the above analyses if the true effect of big4 on audit quality is 10 instead of 0?
  2. What happens in the above analyses if the caliper is reduced to 0.01 from 0.03 (but the true effect of big4 on audit quality is still 0)?
  3. What happens in the above analyses if the caliper is reduced to 0.01 from 0.03 and the true effect of big4 on audit quality is 10?
  4. Given the above, what are your conclusions about the value of propensity-score matching and OLS in causal inference?
  5. Was endogeneity an issue in the simulation above? If not, why not? If so, how and what could you do to address it?

26.6 Discussion questions

  1. Given the evidence presented in the simulation, how do you interpret the regression results presented above? How do DeFond et al. (2017) appear to interpret the results?

  2. Referring back to the basic causal relations described in Section 2.2, what is the causal diagram implied by equation (1) and Table 2 of Lawrence et al. (2011)? What variables are in PROXY_CONTROLS for Table 2? Why is LOG_ASSETS not found in Table 2?

  3. Do Lawrence et al. (2011) report results from estimating equation (1)? What happens to the “difference in means” between the Full Sample and the Propensity-Score Matched Sample in Table 1? Does this make sense?

  4. Why do you think the difference in means is still statistically significant for two variables in the Propensity-Score Matched Sample column? Do you think this is a concern?

  5. Lawrence et al. (2011) evaluate three outcomes as measures of audit quality: discretionary accruals, cost of equity, and analyst forecast accuracy. Evaluate each of these measures. Which do you think is best? What are the strengths and weaknesses of this best measure?

  6. Can you think of other measures of audit quality that might make sense?

  7. Apply the checklist on pp. 217–218 of Shipman et al. (2016) and the questions in Panels B and C of Table 1 of Shipman et al. (2016) to Lawrence et al. (2011). How does Lawrence et al. (2011) stack up against these?