24  Extreme values and sensitivity analysis

In this chapter, we look at approaches used in handling the kinds of extreme values encountered in real-world data sets. We start the chapter with Leone et al. (2019), which provides guidance on how to address extreme values. We then examine Call et al. (2018), which explores a setting in which extreme values are present. We also use this chapter to explore the idea of sensitivity analysis using the approach of Frank (2000) seen in Call et al. (2018).


The code in this chapter uses the following packages. We load tidyverse because we use several packages from the Tidyverse. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available at https://github.com/iangow/far_templates.

library(dbplyr)       # window_order()
library(furrr)        # future_map()
library(lmtest)       # coeftest()
library(sandwich)     # vcovHC()
library(robustbase)   # lmrob()

24.1 Leone et al. (2019)

Leone et al. (2019) show that many papers in accounting research modify or eliminate observations with extreme values, as these can have an outsized influence on regression results. Winsorization defines extreme values as those above or below certain quantiles and replaces these with values at those quantiles (we saw winsorization in Chapter 25). Another approach is truncation, which is like winsorization, except that extreme observations are eliminated from subsequent analysis.

Leone et al. (2019) also cover two alternative approaches to handling extreme values. The first alternative approach is the use of diagnostic statistics to identify influential observations where “such observations are excluded from second-pass (or later) regressions estimated to draw a researcher’s ‘final’ inferences” (p. 343).

The second alternative approach is robust regression. While Leone et al. (2019) allude to an extensive literature on robust regression, we follow Leone et al. (2019) in focusing on so-called MM-estimators, which are made available via the robustbase package.

Leone et al. (2019) provide three kinds of evidence regarding approaches to extreme values and influential observations. First, they survey the literature to gather data describing what researchers do to address extreme observations. The results of this effort are found in Table 1 of Leone et al. (2019).

Second, Leone et al. (2019) replicate three studies and examine how inferences vary by approach. Below we study one of these replications.

Third, Leone et al. (2019) provide simulation evidence on the effectiveness of the various approaches. We replicate this simulation evidence in Section 24.4.

24.1.1 Replication analysis

To better understand the approaches discussed in Leone et al. (2019), we will follow their replication of Chen et al. (2018), which is also the focus of material later in this chapter. While Chen et al. (2018) focus on fiscal years from 1996 through 2015, we get data for fiscal-year 1995, so that lagged values are available for 1996. Note that after collecting the data, we convert fyear into a factor, which facilitates estimation of fixed effects and interaction terms.1

db <- dbConnect(RPostgres::Postgres(), bigint = "integer")

company <- tbl(db, Id(schema = "comp", table = "company"))
funda <- tbl(db, Id(schema = "comp", table = "funda"))
db <- dbConnect(duckdb::duckdb())

company <- load_parquet(db, schema = "comp", table = "company")
funda <-  load_parquet(db, schema = "comp", table = "funda")
comp <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") |>
  filter(!between(sich, 6000, 6999) | is.na(sich)) |>
  filter(between(fyear, 1995, 2015)) |>
  group_by(gvkey) |>
  window_order(fyear) |>
  mutate(big_n = as.integer(au) %in% 1:8L,
         lag_at = lag(at),
         inv_at = if_else(lag_at > 0, 1 / lag_at, NA),
         ta = if_else(lag_at > 0, (ib - oancf) / lag_at, NA),
         roa = if_else(lag_at > 0, ib / lag_at, NA),
         cfo = if_else(lag_at > 0, oancf / lag_at, NA),
         mkt_cap = prcc_f * csho,
         lag_mkt_cap = lag(mkt_cap),
         size = if_else(lag_mkt_cap > 0, log(lag_mkt_cap), NA),
         debt = coalesce(dltt, 0) + coalesce(dlc, 0),
         lev = if_else(lag_at > 0, debt / lag_at, NA),
         mtb = if_else(lag(ceq) > 0, lag_mkt_cap / lag(ceq), NA),
         assets = lag_at,
         d_sale = if_else(lag_at > 0, (revt - lag(revt)) / lag_at, NA),
         d_ar =  if_else(lag_at > 0, (rect - lag(rect)) / lag_at, NA),
         ppe = if_else(lag_at > 0, ppent / lag_at, NA)) |>
  ungroup() |>
  select(gvkey, datadate, fyear, big_n, ta, big_n, roa, cfo, size, lev, 
         mtb, inv_at, d_sale, d_ar, ppe) |>
  collect() |>
  mutate(fyear = as.factor(fyear)) 

We create two functions that are quite similar in structure. The winsorize() function offers more than we use here, as we stick to the default argument of prob = 0.01, which is then mapped into p_low = 0.01 and p_high = 0.99, which implements “the 1 percent and 99 percent winsorization rule common to accounting studies” (and also used in Chen et al. (2018)). The first line of the function uses quantile() to identify the 1st and 99th percentiles of the distribution of x and stores these two values in the vector cuts. The type argument specifies how ties are resolved and how averages at breakpoints are calculated. The type argument should be “an integer between 1 and 9 selecting one of the nine quantile algorithms … to be used.”2. By selecting type = 2, we get the SAS default, which means we get the same values as we would using the SAS macro commonly used by other researchers. The third line of the winsorize() function sets all values below cuts[1], which is the p_low quantile, equal to cuts[1]. The fourth line sets all values above cuts[2] (the p_high quantile) equal to cuts[2]. The last line returns the modified value of x.

winsorize <- function(x, prob = 0.01, p_low = prob, p_high = 1 - prob) {
  cuts <- quantile(x, probs = c(p_low, p_high), type = 2, na.rm = TRUE)
  x[x < cuts[1]] <- cuts[1]
  x[x > cuts[2]] <- cuts[2]

The truncate() function is almost identical, except that values outside the bounds in cuts are set to NA, which effectively removes them from later analysis.

truncate <- function(x, prob = 0.01, p_low = prob, p_high = 1 - prob) {
  cuts <- quantile(x, probs = c(p_low, p_high), type = 2, na.rm = TRUE)
  x[x < cuts[1]] <- NA
  x[x > cuts[2]] <- NA

To apply these functions we use the mutate() function across() all variables between ta and ppe inclusive (i.e., we do not apply them to gvkey, datadate, fyear, or big_n) and store the results in comp_win and comp_trunc.

comp_win <-
  comp |>
  mutate(across(ta:ppe, winsorize))

comp_trunc <-
  comp |>
  mutate(across(ta:ppe, truncate))

We next run regressions. The first model we consider is essentially that found in Chen et al. (2018). As Leone et al. (2019) point out, roa and cfo can only be included in a model with ta as the dependent variable when winsorization breaks the identity ta = roa - cfo.3 In the second model, we use raw data, but drop roa to make the model meaningful.

fms <- list(
  "Wins, w/ROA" = feols(ta ~ big_n + roa + cfo + size + lev + mtb + 
                          fyear  * (inv_at + I(d_sale - d_ar) + ppe), 
                        ~ gvkey + fyear,
                        data = comp_win, na.action = na.exclude),
  "Raw" = feols(ta ~ big_n + cfo + size + lev + mtb + 
                  fyear  * (inv_at + I(d_sale - d_ar) + ppe), 
                ~ gvkey + fyear,
                data = comp, na.action = na.exclude))

The third and fourth models tweak the second model ("Raw") by using winsorized and truncated data respectively. We use the update() function to make the relationships between the models clearer and to avoid errors caused by unintended differences between models.

fms[["Wins"]] <- update(fms[["Raw"]], data = comp_win)
fms[["Trunc"]] <- update(fms[["Raw"]], data = comp_trunc)

Given that we don’t use comp_trunc other than to estimate fms[["Trunc"]], we could have used the following alternative to avoid creating comp_trunc as a separate variable.

fms[["Trunc"]] <- 
  comp |>
  mutate(across(ta:ppe, truncate)) |>
  update(fms[["Raw"]], data = _)

The fifth model we estimate uses Cook’s D to identify extreme observations for exclusion. As the fixest objects returned by feols() do not support the cooks.distance() function, we estimate an OLS model and use that to identify extreme observations for exclusion.

fm <- lm(ta ~ big_n + cfo + size + lev + mtb + 
                   fyear  * (inv_at + I(d_sale - d_ar) + ppe),
         data = comp, na.action = na.exclude)

fms[["Cook's D"]] <-  
  comp |>
  mutate(cooksd = cooks.distance(fm),
         extreme = cooksd > 4 / nobs(fm),
         ta = if_else(extreme, NA, ta)) |>
  update(fms[["Raw"]], data = _)

The sixth and final model uses robust regression. Leone et al. (2019) use the robreg module made available for Stata, which defaults to an asymptotic efficiency of 85% for the bisquare estimator used in the second step of the MM estimation procedure and we mimic this choice here. This efficiency level is obtained by specifying the tuning.psi parameter as 3.4437.4

An alternative would be to follow Koller and Stahel (2017) by selecting control = lmrob.control("KS2014"). Koller and Stahel (2017) suggest that this approach “allows for … estimation in designs with factors and interactions between factors and continuous regressors”, which describes our setting here due to the interaction of fyear (a factor) with inv_at + I(d_sale - d_ar) + ppe (continuous variables).

Following the recommendation in Leone et al. (2019), we recover the weights from the fitted lmrob object and feed those to an update of the fixest model in fms[["Raw"]], as doing so allows us to use cluster-robust standard errors. Note that many values returned by weights() will be NA, so we use coalesce(0) to convert those to zeros.

fmrob <- lmrob(formula(fm),
               data = comp, na.action = na.exclude,
               method = "MM",
               control = lmrob.control(tuning.psi = 3.4437))

fms[["Robust"]] <- 
  fmrob |>
  weights(type = "robustness") |>
  coalesce(0) |>
  update(fms[["Raw"]], data = comp, weights = _)

Results of estimating each of these models are shown in Table 24.1.

             estimate = "{estimate}{stars}",
             statistic = "({statistic})",
             coef_omit = "(fyear|ppe|inv_at|d_sale)",
             gof_map = c("nobs", "r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 24.1: Results from various approaches to extreme values
Wins, w/ROA Raw Wins Trunc Cook's D Robust
(Intercept) −0.031*** 0.262 −0.085*** −0.119*** −0.107*** −0.020***
(−8.918) (1.207) (−7.305) (−14.800) (−10.322) (−17.870)
big_nTRUE 0.000 −0.073 0.006 0.009** 0.020*** −0.010***
(0.254) (−1.052) (1.292) (2.614) (3.950) (−18.968)
roa 0.815***
cfo −0.786*** 0.799* 0.257*** 0.124*** 0.194*** −0.386***
(−37.418) (1.763) (8.918) (5.573) (5.211) (−159.147)
size 0.003*** −0.066* 0.004** 0.007*** 0.001 0.003***
(5.223) (−1.857) (2.660) (5.680) (0.686) (18.439)
lev −0.035** −0.072 −0.111*** −0.031*** −0.074*** −0.019***
(−2.639) (−0.344) (−5.101) (−4.478) (−4.174) (−22.238)
mtb −0.001*** −0.003 −0.007*** −0.006*** −0.001*** 0.000***
(−4.683) (−0.931) (−9.127) (−11.030) (−3.797) (−19.849)
Num.Obs. 113991 113991 113991 106581 113563 83529
R2 0.911 0.561 0.278 0.095 0.129 0.928

24.1.2 Discussion questions

  1. Table 5 of Leone et al. (2019) presents results from a simulation analysis. Which panel of that table likely best reflects the kind of data researchers encounter in practice?
  2. Consider the left half of Table 5 of Leone et al. (2019) (i.e., \(\beta_1 = 0\)). Using the panel you identified in question 1 above, interpret the implications of the simulation results for researchers in practice. Note that the mean estimates of \(\beta_1\) using Cook’s D or robust regression are very similar. Does Table 5 of Leone et al. (2019) provide strong support for rejecting Cook’s D in favour of robust regression?
  3. In the search for new research questions, accounting researchers increasingly need to study variables whose most plausible effect sizes are consistent with the right half of Table 5 of Leone et al. (2019) (i.e., \(\beta_1 = 0\)). Using the panel you identified in question 1 above, interpret the implication of the simulation results for researchers in practice.
  4. Table 2 of Leone et al. (2019) reports results similar to those given in columns (4), (5), and (6) of Table 24.1. They say, “[when we] perform estimation using RR, the coefficient on \(BIG\_N\) is \(-0.011\) (significant at the 1 percent level), opposite of what Chen et al. (2018) find, and instead consistent with the negative coefficient documented in prior studies using the two-step procedure.” In light of the simulation evidence, how persuasive do you find the evidence above in support of a negative coefficient on big_n? How probative is the evidence of prior studies if it is based on approaches (e.g., winsorization and two-step procedures) that Leone et al. (2019) and Chen et al. (2018) demonstrate lead to unreliable results?

24.2 Call et al. (2018)

Call et al. (2018, p. 123) examines “whether there is a link between whistleblower involvement and the outcomes of enforcement action” by US regulators. Specifically, they show that “whistleblower involvement is associated with larger monetary penalties for the targeted firms and longer prison sentences for targeted employees” (2018, p. 126).

Kuvvet (2019) points out that a relatively small number of observations account for the bulk of the monetary penalties imposed on firms, employees and others. The cmsw_2018 data frame provided with the farr package includes data used in Call et al. (2018), which allows us to reproduce Table 2 of Kuvvet (2019) as Table 24.2.5

cmsw_2018 |> 
  filter(tousesox == 1) |>
  pivot_longer(cols = matches("(penalty|mos)"), 
               values_to = "amount", names_to = "target") |>
  group_by(target) |> 
  arrange(desc(amount)) |>
  mutate(rank = row_number()) |>
  summarize(total = sum(amount),
            total_top = sum(if_else(rank <= 6, amount, 0)),
            .groups = "drop") |>
  mutate(top_perc = round(total_top / total * 100, 1),
         across(total:total_top, \(x) round(x, 0)))
Table 24.2: Top 1% observations as percentage of total penalties
target total total_top top_perc
emppenalty 21,176 15,877 75.0
empprisonmos 17,181 3,658 21.3
firmpenalty 13,578 7,396 54.5
otherpenalty 11,107 9,684 87.2

24.2.1 Econometric claims in accounting research

Call et al. (2018, p. 134) concede the existence of “two challenges” for empirical analysis in their setting: “the large number of zero-valued observations (i.e., enforcement actions without any resultant penalties or criminal prison sentences) and the severe positive skewness in the dependent variable (i.e., some extremely large penalties).”

Call et al. (2018, p. 134) assert that “whereas other regression techniques … suffer from potentially severe bias … prior research shows that the Poisson pseudo-maximum likelihood (PPML) estimator … is a particularly effective modeling technique for data distributions characterized by a disproportionate number of zeros and severe skewness” and cite seven sources for this claim.

This approach to econometrics is fairly standard in accounting research. While it is rare to see econometric analysis of estimators in accounting research papers, it is common for claims to be made and for readers to be directed to research in economics, finance, or econometrics for support for those claims. Given the prevalence of this approach, we believe it is useful to demonstrate how a reader can evaluate such claims.

We consider each the seven cited sources as a basis for the claims made in Call et al. (2018). The first of these (Irarrazabal et al., 2013) can be eliminated, as it does not use this estimator, let alone provide econometric support for these claims.6 Similarly, Cameron and Trivedi (2009) and Wooldridge (2010) are standard econometrics references that say precisely nothing about the ability of Poisson regression to handle “a disproportionate number of zeros and severe skewness” in any setting, let alone in settings that do not use the count dependent variables that motivate the use of Poisson regression.

The two papers by Santos Silva and Tenreyo seem more relevant.7 Santos Silva and Tenreyro (2006, p. 642) note that “the pioneering work of Tinbergen (1962) initiated a vast theoretical and empirical literature on the gravity equation for trade. … In its simplest form, the gravity equation for trade states that the trade flow from country \(i\) to country \(j\), denoted by \(T_{ij}\), is proportional to the product of the two countries’ GDPs, denoted by \(Y_i\) and \(Y_j\), and inversely proportional to their distance, \(D_{ij}\)”.

This implies the following equation (in stochastic form):

\[ T_{ij} = \alpha_0Y_i^{\alpha_1} Y_j^{\alpha_2} D_{ij}^{\alpha_3} \eta_{ij}\] where \(\alpha_0\), \(\alpha_1\), \(\alpha_2\), and \(\alpha_3\) are unknown parameters and \(\eta_{ij}\) is an error factor assumed the independent of regressors. Santos Silva and Tenreyro (2006) note that “there is a long tradition in the trade literature of log-linearizing” the equation above and estimating using OLS:

\[ \ln T_{ij} = \alpha_0 + \alpha_1 \ln Y_i + \alpha_2 \ln Y_j + \alpha_3 \ln D_{ij} + \ln \eta_{ij} \]

Notwithstanding the popularity of this approach, Santos Silva and Tenreyro (2006) find that “the present of heteroskedasticity can generate strikingly difference estimates when the gravity equation is log-linearized, rather than estimated in levels [and] … that inferences drawn on log-linearized regressions can produce misleading conclusions.”

Fortunately, Santos Silva and Tenreyro (2011) summarize Santos Silva and Tenreyro (2006) as providing “simulation evidence that the PPML is well behaved in a wide range of situations and is resilient to the presence of a specific type of measurement error of the dependent variable.” However, Santos Silva and Tenreyro (2006) provides little evidence on the performance of PPML when the dependent variable is frequently zero and Santos Silva and Tenreyro (2011) seeks to address this by simulating a simple structural model in which the number and size of exporters is stochastic in a model that incorporate ideas from the gravity model, but yields zero trade levels for most of their simulated samples. Santos Silva and Tenreyro (2011) find that PPML performs very well in this simulated setting.

Before discussing the implications of these results, it is important to note that that PPML as used by Santos Silva and Tenreyro (2006) and Santos Silva and Tenreyro (2011) is nothing more or less than Poisson regression. An important property of the Poisson model is that the mean and variance of a Poisson-distributed random variable are equal. While statisticians have developed generalizations of the basic Poisson model to allow for (say) variance greater than the mean (this is called overdispersion and is common in real-world data), Santos Silva and Tenreyro (2006, p. 645) argue that such models “might give excessive weight to the observations that are more prone to measurement errors” in the context of trade data. As such, “the Poisson regression emerges as a reasonable compromise, giving less weight to the observations with larger variance than the standard NLS estimator, without giving too much weight to observations more prone to contamination by measurement error” (Santos Silva and Tenreyro, 2006, p. 646).

Now that we better understand the results of prior research (Santos Silva and Tenreyro, 2011, 2006), we can return to the claims of Call et al. (2018) provided above. Unfortunately, Call et al. (2018) provides no reason to believe that the results of Santos Silva and Tenreyro (2011) carry over to whatever model seems applicable to the setting of Call et al. (2018). There is no “vast theoretical and empirical literature” to suggest any model of the factors driving firm penalties, let alone that either the gravity model or the Santos Silva and Tenreyro (2011) model is applicable to this new setting. As such, the results of prior research are of little value in the Call et al. (2018) setting.

The reasoning of Call et al. (2018) seems analogous to saying that “prior research shows that when \(y\) is a function of \(X\) the best estimator is OLS with OLS standard errors” (in the sense used in Chapter 5) without noting that such results are applicable to the classical linear model with spherical errors and without justifying that model in the current setting.

One concern highlighted by the above discussion is the presence of fundamental asymmetries in the making and checking of claims like those in Call et al. (2018). Making such as claim is close to costless; a simple internet search is likely to yield papers supporting the use of a particular estimator in a context that shares features with the authors’ setting.

Additionally, the econometric knowledge needed to make such claims is close to zero. This is illustrated by footnote 13 of Call et al. (2018, p. 134), where it is claimed that “the primary difference between PPML and conventional Poisson regression is that PPML does not impose the assumption of equality in the first and second moments of the distribution.” Yet if the authors looked at their own Stata code, they would see that, as in Santos Silva and Tenreyro (2006), PPML is Poisson regression.8

In contrast, checking such claims is far from costless. One needs to carefully understand what is going on in those papers and how they relate to the current setting and this requires some knowledge of econometrics. Adding to the asymmetry is the matter of incentives, authors have strong incentives to claim the appropriateness and robustness of those methods that yield results, whereas reviewers have very weak incentives of any kind.

Loose econometric analysis in accounting research is not new. Many researchers claimed that FM-NW and Z2 estimators of standard errors had properties that they did not have. These claims were either unsubstantiated or justified by citing papers that did not address them in any way. Once these claims were evaluated rigorously (Gow et al., 2010), they were found to have no merit.

In light of the discussion above, the response of Call et al. (2019) to the concern of Kuvvet (2019) about influential observations that Call et al. (2018) “addresses the role of extreme observations in enforcement actions with an estimator designed specifically to handle skewed data (Poisson pseudo-maximum likelihood)” seems unconvincing.

24.2.2 Replicating Call et al. (2018)

To better understand the estimator used by Call et al. (2018), we replicate their Table 4 below. First, we prepare the data. Following Call et al. (2018), we winsorize a subset of the control variables at the 1st and 99th percentiles. Second, we follow Call et al. (2018) in calculating a log-transformation of the dependent variables considered (we will use these transformed variables below). We use set_names() so that the result of applying map() to yvars retains the values of yvars in the names of the elements. We also convert ff12 to a factor variable and any logical variables to integers.9

yvars <- set_names(c("firmpenalty", "emppenalty", "empprisonmos"))

cmsw <-
  cmsw_2018 |>
  mutate(across(c(blckownpct, initabret, pctinddir, mkt2bk, lev), 
         across(any_of(yvars), \(x) log(1 + x), .names = "ln_{.col}"),
         ff12 = as.factor(ff12),
         across(where(is.logical), as.integer)) |>
  filter(tousesox == 1)

Next, we create a function get_poisson_fit() to estimate Poisson regressions using glm(). Because we refer to the same set of controls in other contexts below, we specify controls outside get_poisson_fit().

x <- "wbflag"
controls <- c("selfdealflag", "blckownpct", "initabret", "lnvioperiod", 
              "bribeflag", "mobflag", "deter", "lnempcleveln", 
              "lnuscodecnt", "viofraudflag", "misledflag", "audit8flag", 
              "exectermflag", "coopflag", "impedeflag", "pctinddir", 
              "recidivist", "lnmktcap", "mkt2bk", "lev", "lndistance", 

get_poisson_fit <- function(y, df = cmsw) {
  form <- as.formula(str_c(y, " ~ ", 
                           str_c(c(x, controls), collapse = " + ")))
  fm <- glm(form, family = "poisson", data = df,
            control = glm.control(maxit = 100))

We then use get_poisson_fit() to get regression results.

fms <- map(yvars, get_poisson_fit)

We use coeftest() to return “robust” standard errors that match those used in Call et al. (2018). In addition to type = "HC1", vcovHC() offers "HC2" and "HC3". The differences between these options relate to degrees-of-freedom corrections that have the greatest impact with small samples. Chapter 8 of Angrist and Pischke (2008) provides more discussion and explanation of HC1, HC2, and HC3.

get_coefs <- function(fm, type = "HC1") {
  coeftest(fm, vcov = vcovHC(fm, type = type))

Finally, we tabulate the results in Table 24.3. To save space, we omit estimated coefficients for the control variables.10

modelsummary(map(fms, get_coefs),
             estimate = "{estimate}{stars}",
             statistic = "{statistic}",
             coef_map = "wbflag",
             gof_map = c("nobs", "r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 24.3: Enforcement outcomes (Table 4 of CMSW)
firmpenalty emppenalty empprisonmos
wbflag 1.250* 1.111*** 0.621**
1.949 2.637 2.372
Num.Obs. 658 658 658

Call et al. (2019) flag concerns raised by Kuvvet (2019) about the “tipster” variable as an example of “claims offered by Kuvvet [that] reflect a misunderstanding of both the enforcement action setting and the whistleblower designations in CMSW”. Call et al. (2019) argue that “CMSW’s ‘tipster’ and ‘non-tipster’ whistleblower designations are defined specifically in relation to the enforcement process and are not an attempt to identify the individual who first uncovered the misconduct.” However, Call et al. (2018) say “some whistleblowers sound the first alarm about potential violations, bringing to the attention of regulators possible violations that could be investigated. We refer to these whistleblowers as tipsters.”

24.2.3 Discussion questions

  1. Suppose you were a regulator interested in understanding the effects of whistleblowers on enforcement outcomes. How might you design an experiment to examine these effects? What challenges would you expect to face in implementing your experiment? How would the experiment different from the setting of Call et al. (2018)?

  2. As an additional example of “misunderstanding” by Kuvvet (2019), Call et al. (2019) claim that “Kuvvet argues that CMSW’s findings speak to correlation rather than causation. The published version of CMSW makes this point clearly throughout the paper.” What claim in Kuvvet (2019) are Call et al. (2019) addressing here? Do you agree that “CMSW makes this point clearly throughout the paper”?

  3. “The published version of CMSW empirically addresses the role of extreme observations in enforcement actions with an estimator designed specifically to handle skewed data (Poisson pseudo-maximum likelihood) and with additional robustness tests, including one focused on the incidence rather than the magnitude of penalties.” As we see in Table 24.3, the estimator used in Call et al. (2018) is standard Poisson regression, which is called in R using the glm() function with family = "poisson". How might we use data sets covered in Section 24.1 or Section 24.4 to evaluate the claim that Poisson regression estimator is “designed specifically to handle skewed data”?

  4. Could we use approaches covered in Section 24.1 or Section 24.4 to address extreme observations in the setting of Call et al. (2018)?

  5. Call et al. (2019) argue that “unlike many other settings in accounting, finance, and economics where the focus is often on the average firm, the enforcement action setting is inherently extreme” and claim that this is another example of misunderstanding” in Kuvvet (2019). Do you agree that “the enforcement action setting is inherently extreme”? Does this inherent extremeness undermine the arguments of Kuvvet (2019)?

  6. What claim by Kuvvet (2019) are Call et al. (2019) trying to refute regarding the “tipster” variable? Do you find this response convincing?

24.3 Sensitivity analysis

In general, sensitivity analysis is used to assess how outputs differ as inputs or assumptions are varied. If we could assume that assignment of whistleblower status to an enforcement action is (conditional on control variables) unconfounded in the setting of Call et al. (2018), then we can interpret regression coefficients as estimates of causal effects.11 For this reason, we focus in this section on the analysis of the sensitivity of inferences to posited violations of the unconfoundedness of treatment assignment.

In practice, whistleblowers are not just randomly assigned to enforcement actions. Rather there is the possibility that the factors driving assignment of whistleblowers also affect the enforcement outcomes, making them confounders in the sense studied in Section 4.2. If these factors are not observed by the researcher, then they cannot be controlled for in empirical analysis.

Even in this narrow sense of the term we use here, there are various approaches to sensitivity analysis that can be employed. In this section, we apply the approach proposed by Frank (2000), which involves the calculation of the impact threshold for a confounding variable or ITCV.

To proceed with the calculation of ITCV, we begin by estimating OLS regressions, as the Frank (2000) approach is predicated on OLS regressions.

Following Call et al. (2018), we use the log-transformed versions of the confounding variable and robust standard errors.

get_ols_fit <- function(y, df = cmsw) {
  form <- as.formula(paste0("ln_", y, " ~ ", 
                            str_c(c(x, controls), collapse = " + ")))
  lm(form, data = df)

fms <- map(yvars, get_ols_fit)

Regression results analogous to those in columns (7) through (9) of Panel A of Table 8 of Call et al. (2018) are reported in Table 24.4.

modelsummary(map(fms, get_coefs),
             estimate = "{estimate}{stars}",
             statistic = "{statistic}",
             coef_map = "wbflag",
             gof_map = c("nobs", "r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 24.4: Enforcement outcomes with OLS (Table 8 of CMSW)
firmpenalty emppenalty empprisonmos
wbflag 0.484*** 0.072 0.181
3.316 0.627 1.118
Num.Obs. 658 658 658

The impact of a confounding variable in this context is the effect it would have on the estimated coefficient of the variable of interest if it were included in the regression analysis. In the context of OLS regression in which we are regressing \(y\) on \(x\) and controls \(Z\), we know from Section 3.3 that the regression coefficient \(\beta\) on \(x\) can be calculated from a regression of the residuals of a regression of \(y\) on \(Z\) against the residuals of a regression of \(x\) on \(Z\). If we consider a confounding variable \(cv\), then its impact comes from (i) its effect on the residuals of a regression of \(y\) including both \(Z\) and \(cv\) and (ii) its effect on the residuals of a regression of \(x\) including both \(Z\) and \(cv\).

It turns out that these two quantities relate to partial correlations. A partial correlation between \(x\) and \(y\) measures the degree of association between \(x\) and \(y\), with the effect of a set of control variables (say \(Z\)) removed. The partial correlations of interest will be that between \(y\) and the hypothetical confounding variable \(cv\) (\(r_{y \cdot cv}\)) and that between \(x\) and the hypothetical confounding variable \(cv\) (\(r_{x \cdot cv}\)).

The ITCV can be expressed in terms of the product \(r_{y \cdot cv} \times r_{x \cdot cv}\). In effect, the ITCV represents the minimum value of this product that would turn a positive and statistically significant coefficient into a positive and just statistically significant coefficient, with impact values greater that the ITCV leading to statistically insignificant coefficients.

In Table 24.4, we have a statistically significant coefficient in just one regression (the one with firm penalties as the dependent variable), so like Call et al. (2018) we focus on this regression in calculating ITCV.12

The get_itcvs() function below calculates the ITCV for \(\alpha \in \{0.01, 0.05, 0.10\}\). While we refer the reader to Frank (2000) for details on the mathematics, it seems noteworthy that we extract just two quantities from the fitted model: the \(t\)-statistic for the coefficient on the variable of interest (var) and the degrees of freedom for the model. Lower \(\alpha\) values imply higher critical values for the \(t\)-statistic, hence lower threshold values.

get_itcvs <- function(fm, var) {
  coefs <- get_coefs(fm)
  tstat <- coefs[var, "t value"]
  df <- df.residual(fm)
  numer <- tstat^2 / df
  r_yx <- sqrt(numer / (1 + numer))

  tibble(alpha = c(0.01, 0.05, 0.1)) |>
    mutate(crit_val = qt(1 - alpha / 2, df),
           rhash = crit_val / sqrt(df + crit_val^2),
           itcv = (r_yx - rhash) / (1 - abs(rhash))) |>

We can now apply the get_itcvs() function to the first model and the variable of interest (wbflag). Results are reported in Table 24.5.13

get_itcvs(fms[[1]], "wbflag")
Table 24.5: Impact threshold for a confounding variable (ITCV)
alpha crit_val itcv
0.01 2.584 0.032
0.05 1.964 0.058
0.10 1.647 0.070

Evaluating ITCVs requires some sense of plausible magnitudes for \(r_{y \cdot cv}\) and \(r_{x \cdot cv}\). Absent some prior beliefs about confounding variables, a set of natural benchmarks comes from the control variables included in the regression (\(Z\)). Inclusion of these control variables is (or should be) based on these variables being confounding variables themselves.

As such we can calculate \(r_{y \cdot z}\) and \(r_{x \cdot z}\) for each \(z \in Z\). To do, we first create a function pcor() to calculate partial correlations. This function is adapted from code for the the pcor() function in the ppcor package: it is trimmed down to focus on our limited needs here, which are limited to the vector of partial correlations between the first variable in x and the remaining variables.

pcor <- function (x) {
  cvx <- cov(as.matrix(x))
  if (det(cvx) < .Machine$double.eps) {
    icvx <- MASS::ginv(cvx)
  else icvx <- solve(cvx)
  pcor <- -cov2cor(icvx)
  diag(pcor) <- 1
  pcor[-1, 1]

The pcor() function requires data in matrix form, which in turn requires that each factor variable be encoded as a set of indicator variables. The get_factors() flags the factor variables in a data frame.

get_factors <- function(df) {
  names(which(unlist(map(df, is.factor))))

The make_dummies() function converts var in the data frame df to a set of dummies.

make_dummies <- function(df, var) {
  vals <- as.vector(unique(df[[var]]))
  for (val in vals) {
    df[[str_c(var, "_", val)]] <- as.integer(df[[var]] == val)
  df |> select(-any_of(var)) |> as_tibble()

Finally convert_factors() calls get_factors() to identify factor variables then make_dummies() to convert each of these.

convert_factors <- function(df) {
  vars <- get_factors(df)
  for (var in vars) {
    df <- make_dummies(df, var)

Finally, we create the function get_impacts() to calculate the elements of impact for the control variables.

get_impacts <- function(df, y, x, controls) {
  corr_data <-
    df |>
    select(any_of(c(y, x, controls))) |>
    convert_factors() |>

  r_yz <-
    corr_data |>
    select(-all_of(x)) |>

  r_xz <-
    corr_data |>
    select(-all_of(y)) |>

  var <- names(corr_data |> select(-any_of(c(x, y))))

  tibble(var, r_yz, r_xz) |>
    mutate(impact = r_yz * r_xz) |>

In Table 24.6 the impact of the 10 most impactful control variables are shown. Only one of these control variables (lnmktcap) has an impact exceeding the ITCV for \(\alpha = 0.05\), suggesting that the conjectured control variable would have to rank second on this list of controls to have enough impact to drive the \(p\)-value above \(0.05\). As such, the OLS results reported in Table 24.4 for firm penalties might be interpreted as somewhat robust.

            y = "ln_firmpenalty",
            x = "wbflag",
            controls = controls) |>
  top_n(n = 10, wt = impact)
Table 24.6: Impacts for control variables
var r_yz r_xz impact
lnmktcap 0.332 0.274 0.091
bribeflag 0.289 0.035 0.010
lnvioperiod 0.140 0.053 0.007
lev 0.070 0.100 0.007
blckownpct -0.092 -0.075 0.007
lnuscodecnt 0.158 0.036 0.006
mkt2bk -0.060 -0.079 0.005
misledflag -0.051 -0.079 0.004
initabret 0.098 0.040 0.004
recidivist 0.034 0.080 0.003

However, a concern with this analysis is that it may be unrealistic to imagine that a single confounding variable has been omitted from the analysis. This concern seems accentuated in a setting where our understanding of the factors driving assignment to treatment is limited. It seems plausible that the factors driving the presence of whistleblowers are complex and poorly captured by the set of controls used in Call et al. (2018).

Another concern is that there is some arbitrariness to the measurement of controls. For example, \(z_1\), \(z_2\), and \(z_3\) might all be noisy proxies for an underlying confounder \(z\). Including all three variables might make sense to control for \(z\), but the impact of each variable is likely to be low because of multicollinearity, thus understating the impact of \(z\).

One approach to addressing this is to consider the controls as a single variable. The fitted value \(\hat{z}\) from a regression of \(x\) on \(Z = \{z_i \}\) can be used a single control in a regression where \(y\) is the dependent variable without affecting the coefficient on \(x\).14 As such, we can examine the impact of \(\hat{z}\) as a composite control.

We embed this logic in the function get_combined_impact().

get_combined_impact <- function(df, y, x, controls) {

  form <- as.formula(str_c(x, " ~ ", 
                           str_flatten(controls, collapse = " + ")))
  fm <- lm(form, data = df)

  df |>
    mutate(z = fitted(fm)) |>
    get_impacts(y = y,
                x = x,
                controls = "z") 

Results from get_combined_impact() are provided in Table 24.7. There it can be seen that the impact of the included controls as a set (0.273) is much higher than the ITCV for \(\alpha = 0.05\) (0.0578), suggesting that the impact of omitted confounders only has to be a fraction of the impact of included variables for the results to be statistically insignificant.

                    y = "ln_firmpenalty",
                    x = "wbflag",
                    controls = controls)
Table 24.7: Combined impacts of control variables
var r_yz r_xz impact
z 0.553 0.494 0.273

24.4 Appendix: Simulation study from Leone et al. (2019)

Here we present code to run simulations like those underlying Table 5 of Leone et al. (2019). One reason we present this here is to help the reader better understand the results in Table 5.

An additional reason for presenting this particular simulation is to illustrate a modular approach to simulations that is fostered by use of R functions. This approach potentially offers a template for readers looking to develop simulations.

A big-picture view of the simulation is that the run_sim() function uses get_data() to generate the data, then feeds these data to fit_models(), then sends the returned fitted models to compile_stats(). The get_results() function feeds parameter values from params to the run_sim() function.

The first step is to produce the data. If you wanted to change the data-generating process, this could be achieved simply by editing this function. The purpose of this function is to replicate the data-generating process described in Leone et al. (2019).15 Rather than create separate code for each of the three panels of Table 5, we create one function and use argument values to select the data-generating process used to create the data. For example, the default values (b_true = 0.8, clean = FALSE, and random = TRUE) produce data matching the left half of Panel C of Table 5.

get_data <- function(b_true = 0.8, clean = FALSE,
                     random = TRUE, n = 10000) {
  prop_cont <- if_else(random, 0.02, 0.25)
  df <- tibble(x1 = rnorm(n = n),
               x2 = rnorm(n = n),
               x3 = rnorm(n = n),
               v = rnorm(n = n),
               y = b_true * x1 + 0.4 * x2 + 0.2 * x3 + v,
               z = rnorm(n = n, mean = 3, sd = 1)) |>
    mutate(id = row_number(),
           contaminated = 
             case_when(!clean & !random ~ id < prop_cont * n & x1 < -1.5,
                       !clean & random  ~ id < prop_cont * n,
                       .default = FALSE),
           cy = if_else(contaminated, y + z, y),
           cyw = winsorize(cy), x1w = winsorize(x1),
           x2w = winsorize(x2), x3w = winsorize(x3),
           cyt = truncate(cy), x1t = truncate(x1),
           x2t = truncate(x2), x3t = truncate(x3))

Another benefit of putting get_data() in a separate function that returns a data frame is that we can examine the pieces one at a time.

# A tibble: 10,000 × 17
       x1     x2     x3       v      y     z    id contaminated    cy   cyw
    <dbl>  <dbl>  <dbl>   <dbl>  <dbl> <dbl> <int> <lgl>        <dbl> <dbl>
 1 -0.890  1.19   0.873  0.285   0.225  2.77     1 TRUE          2.99  2.99
 2 -1.08  -1.52   1.08   0.303  -0.950  3.32     2 TRUE          2.37  2.37
 3  0.113 -1.39   1.76  -0.201  -0.314  2.28     3 TRUE          1.96  1.96
 4 -1.37   0.333 -1.38   1.38    0.150  2.64     4 TRUE          2.79  2.79
 5 -0.560 -0.203  1.20  -0.105  -0.394  3.48     5 TRUE          3.09  3.09
 6  0.481  0.943  0.325  0.427   1.25   2.16     6 TRUE          3.42  3.42
 7  0.458  1.55  -1.26   0.923   1.66   2.23     7 TRUE          3.89  3.63
 8 -1.56   0.207  0.180  0.0936 -1.03   4.03     8 TRUE          3.00  3.00
 9  0.738 -0.673  1.26  -0.843  -0.269  1.32     9 TRUE          1.05  1.05
10  0.106 -0.928  1.96   0.906   1.01   3.20    10 TRUE          4.21  3.63
# ℹ 9,990 more rows
# ℹ 7 more variables: x1w <dbl>, x2w <dbl>, x3w <dbl>, cyt <dbl>,
#   x1t <dbl>, x2t <dbl>, x3t <dbl>

The next function fits the models of interest. Following Leone et al. (2019), we fit seven models and store each model as an element of a list, as this makes manipulation of the models (e.g., extracting statistics) easier.

fit_models <- function(df) {
  fms <- list(lm(cy ~ x1, data = df))
  fms[[2]] <- update(fms[[1]], cyw ~ x1w + x2w + x3w)
  fms[[3]] <- update(fms[[1]], cy ~ x1w + x2w + x3w)
  fms[[4]] <- update(fms[[1]], cyt ~ x1t + x2t + x3t)
  fms[[5]] <- update(fms[[1]], cy ~ x1t + x2t + x3t)
  fms[[6]] <-
    df |>
    mutate(cooksd = cooks.distance(fms[[1]]),
           extreme = cooksd > 4 / nobs(fms[[1]])) |>
    filter(!extreme) |>
    update(fms[[1]], data = _)

  fmrob <- lmrob(formula(fms[[1]]),
                 data = df,
                 method = "MM",
                 control = lmrob.control(tuning.psi = 3.4437))

  fms[[7]] <- fmrob

For the simulation output, we want to retain estimated coefficients and results of tests of the null hypothesis at sizes of 5% and 1%. We create the extract_stats() function to do this, as this makes for leaner, easier-to-debug code.

extract_stats <- function(fm, b_null) {
  rdf <- fm$df.residual
  coefs <- summary(fm)$coefficients

  est <- coefs[2, 1]
  se <- coefs[2, 2]

  tval <- (est - b_null)/se
  pval <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)

  p_5 <- pval < 0.05
  p_1 <- pval < 0.01

  tibble(est, p_5, p_1)

We next create compile_stats(), which applies extract_stats() to a list of models (such as those returned by fit_models()) and puts the results in a data frame with human-friendly labels for each of the models.

compile_stats <- function(fms, b_null) {
    map(fms, extract_stats, b_null = b_null) |>
    list_rbind() |>
    mutate(model_num = row_number(),
           model = c("Do nothing", 
                     "Winsorize all variables", "Winsorize only X",
                     "Truncate all variables", "Truncate only X",
                     "Cook's Distance", "Robust regression"))

Now we create run_sim(), which runs one iteration of the simulation and returns a data frame with parameter values and statistics from the simulation.

run_sim <- function(sim_num = 1, b_true = 0.8, b_null = b_true,
                    clean = FALSE, random = FALSE) {
  get_data(b_true = b_true, clean = clean, random = random) |>
  fit_models() |>
  compile_stats(b_null = b_null) |>
  mutate(sim_num = sim_num, b_true = b_true, b_null = b_null,
         clean = clean, random = random)

We now compile a data frame with the parameters we will run in the simulation. We will consider different two different values of each of b_true, clean, and random. Note that when clean is TRUE, the value of random has no significance, so we delete the case where clean is TRUE and random is also TRUE to avoid running redundant parameter values.

params <- expand_grid(b_true = c(0, 0.8),
                      clean = c(TRUE, FALSE),
                      random = c(FALSE, TRUE)) |>
  filter(!(clean & random))

Now we finally have almost all the pieces to run the simulation. We will do this using get_results(), which uses future_map() from the furrr package to run in multiple threads for substantially shorter run times.

num_sims <- 1000L

get_results <- function(b_true = 0, clean = FALSE, random = FALSE, ...) {
  future_map(1:num_sims, run_sim, b_true = b_true,
             clean = clean, random = random,
             .options = furrr_options(seed = 2021)) |>

The following step, which actually runs get_results(), takes more than a minute to run and produces data for all panels of Table 5 of Leone et al. (2019).


all_results <-
  pmap(params, get_results) |>
  list_rbind() |>
   user  system elapsed 
  5.548   0.205  96.247 

Table 24.8 provides a version of the left half of Panel B of Table 5 of Leone et al. (2019).

all_results |>
  filter(b_true == 0.8, random == FALSE, clean == FALSE) |>
  group_by(b_true, clean, random, model_num, model) |>
  summarize(across(est:p_1, mean), .groups = "drop") |>
  select(model, est, p_5, p_1)
Table 24.8: Partial replication of Panel B of Table 5 of Leone et al. (2019)
model est p_5 p_1
Do nothing 0.703 1.000 1.000
Winsorize all variables 0.705 1.000 1.000
Winsorize only X 0.715 1.000 1.000
Truncate all variables 0.682 1.000 1.000
Truncate only X 0.710 1.000 1.000
Cook’s Distance 0.760 0.929 0.828
Robust regression 0.774 0.523 0.281

  1. While much of the following code resembles that used in replicating Lawrence et al. (2011) in Chapter 25, there are minor differences.↩︎

  2. See help by typing ? quantile at the R console; R Core Team (2021)↩︎

  3. Actually this is the model estimated in column (1) of Table 2 of Leone et al. (2019), as a model estimated with “raw data” would not work. Additionally, Andy Leone confirms that the coefficients on roa and cfo have been switched in Table 2 of Leone et al. (2019).↩︎

  4. See p.31 of Maronna et al. (2019). The Stata routine robreg uses the function mm_biweight_k from the moremata routine to convert efficiency to the equivalent of the tuning.psi parameter.↩︎

  5. There is a slight discrepancy apparent here for employee penalties, for which Kuvvet (2019) appears to have summed the top 7 values instead of the top 6.↩︎

  6. Footnote 13 of Irarrazabal et al. (2013, p. 80) says “the Poisson estimator … delivers qualitatively similar results (available on request)”.↩︎

  7. Tenreyro (2007) involves an extension of Santos Silva and Tenreyro (2006) involving instrumental variables and directs readers to Santos Silva and Tenreyro (2006) regarding its efficiency and robustness. The seventh paper (Karolyi and Taboada, 2015) has a more careful linking of prior research to its setting, but does not provide independent evidence of the effectiveness of PPML.↩︎

  8. Even if the authors were using the ppml Stata routine provided by the authors of Santos Silva and Tenreyro (2006), they would see that it’s just Poisson regression by default.↩︎

  9. The latter transformation does not affect the substance of the regressions, but does simplify their presentation.↩︎

  10. The coefficients on the control variables are reported in Table 4 of Call et al. (2018).↩︎

  11. Of course, even in such a case we need to be mindful of functional-form issues of the kind explored in Chapter 25.↩︎

  12. The ITCV is not meaningfully defined when the coefficient of interest is not statistically significant even without the inclusion of a hypothetical confounding variable.↩︎

  13. He we see a slight difference from the 0.059 reported in Call et al. (2018), perhaps due to their using 1.96 as the critical value in place of the more exact value returned by qt() in our code here.↩︎

  14. Consistency might suggest \(\hat{x}\) as the better notation, but \(\hat{z}\) seems to better capture what we are aiming for here.↩︎

  15. We thank Miguel Minutti-Meza for sharing his Stata code with us, but note that the description of the simulation in the Leone et al. (2019) is clear and complete and sufficient to allow a reader to reproduce it without this code.↩︎