23 Handling extreme values
This material may be moved as the book develops and the best pedagogical approach becomes clear.
In this chapter, we look at approaches used in handling extreme values encountered in real-world data sets.
Leone et al. (2019) show that many papers in accounting research modify or eliminate observations with extreme values, as these can have an out-sized influence on regression results. Winsorization defines extreme values as those above or below certain quantiles and replaces these with values at those quantiles (we saw winsorization in Chapter 24). Another approach is truncation, which defines extreme values much as in winsorization, but instead eliminates extreme observations from subsequent analysis.
Leone et al. (2019) also discuss two alternative approaches to handling extreme values. The first approach is the use of diagnostic statistics to identify influential observations where “such observations are excluded from second-pass (or later) regressions estimated to draw a researcher’s ‘final’ inferences” (p. 343).
The second approach is the use of robust regression. While Leone et al. (2019) allude to an extensive literature on robust regression, we follow them in focusing on so-called MM-estimators, which are made available via the robustbase package.
Leone et al. (2019) provide three kinds of evidence regarding approaches to extreme values and influential observations. First, they survey the literature to gather data describing what researchers do to address extreme observations. The results of this effort are found in Table 1 of Leone et al. (2019).
Second, Leone et al. (2019) replicate three studies and examine how inferences vary by approach. Below we study one of these replications.
Third, Leone et al. (2019) provide simulation evidence on the effectiveness of the various approaches.
23.1 Replication analysis
To better understand the approaches discussed in Leone et al. (2019), we will follow their replication of Chen et al. (2018), which is also the focus of material later in this chapter.
Much of the following code resembles that used in replicating Lawrence et al. (2011) in Chapter 24, but there are differences. Note that while Chen et al. (2018) focus on fiscal years from 1996 through 2015, we get data for fiscal-year 1995 as well, so that lagged values are available for 1996. Note that after collecting the data, we convert fyear into a factor, which facilitates estimation of fixed effects and interaction terms.
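The chapter’s setup code is not shown above, so the following sketch lists the packages implied by the code that follows (an inferred list, not the book’s actual setup block; comments note which functions come from each package):

library(DBI)           # dbConnect()
library(dplyr)         # data-manipulation verbs; re-exports tibble()
library(dbplyr)        # window_order() and dplyr-to-SQL translation
library(tidyr)         # expand_grid()
library(fixest)        # feols()
library(robustbase)    # lmrob(), lmrob.control()
library(modelsummary)  # modelsummary()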
pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
company <- tbl(pg, sql("SELECT * FROM comp.company"))
funda <- tbl(pg, sql("SELECT * FROM comp.funda"))
comp <-
funda |>
filter(indfmt == "INDL", datafmt == "STD",
consol == "C", popsrc == "D") |>
filter(!between(sich, 6000, 6999) | is.na(sich)) |>
filter(between(fyear, 1995, 2015)) |>
group_by(gvkey) |>
window_order(fyear) |>
mutate(big_n = as.integer(au) %in% 1:8L,
lag_at = lag(at),
inv_at = if_else(lag_at > 0, 1/lag_at, NA_real_),
ta = if_else(lag_at > 0, (ib - oancf)/lag_at, NA_real_),
roa = if_else(lag_at > 0, ib/lag_at, NA_real_),
cfo = if_else(lag_at > 0, oancf/lag_at, NA_real_),
mkt_cap = prcc_f * csho,
lag_mkt_cap = lag(mkt_cap),
size = if_else(lag_mkt_cap > 0, log(lag_mkt_cap), NA_real_),
debt = coalesce(dltt, 0) + coalesce(dlc, 0),
lev = if_else(lag_at > 0, debt/lag_at, NA_real_),
mtb = if_else(lag(ceq) > 0, lag_mkt_cap/lag(ceq), NA_real_),
assets = lag_at,
d_sale = if_else(lag_at > 0, (revt - lag(revt))/lag_at, NA_real_),
d_ar = if_else(lag_at > 0, (rect - lag(rect))/lag_at, NA_real_),
ppe = if_else(lag_at > 0, ppent/lag_at, NA_real_)) |>
ungroup() |>
select(gvkey, datadate, fyear, big_n, ta, roa, cfo, size, lev,
       mtb, inv_at, d_sale, d_ar, ppe) |>
collect() |>
mutate(fyear = as.factor(fyear))
We create two functions that are quite similar in structure. We saw the winsorize function in Chapter 24. This function has more general applicability than we will use here, as we will stick to the default argument of prob = 0.01, which is then mapped into p_low = 0.01 and p_high = 0.99, thereby implementing “the 1 percent and 99 percent winsorization rule common to accounting studies” (and also used in Chen et al. (2018)). The first line of the function uses the quantile function to identify the 1st and 99th percentiles of the distribution of x and stores these two values in the vector cuts. The type argument to quantile should be “an integer between 1 and 9 selecting one of the nine quantile algorithms … to be used” (see help by typing ?quantile at the R console; R Core Team (2021)); it specifies how ties are resolved and how averages at breakpoints are calculated. By selecting type = 2, we get the SAS default, which means we get the same values as we would using the SAS macro commonly used by other researchers. The third line of the winsorize function sets all values below cuts[1], which is the p_low quantile, equal to cuts[1]. The fourth line sets all values above cuts[2] (the p_high quantile) equal to cuts[2]. The last line is equivalent to return(x) (i.e., the modified value of x is returned).
The truncate function is almost identical, except that values outside the bounds in cuts are set to NA, which effectively removes them from later analysis.
winsorize <- function(x, prob = 0.01, p_low = prob, p_high = 1 - prob) {
cuts <- quantile(x, probs = c(p_low, p_high), type = 2, na.rm = TRUE)
x[x < cuts[1]] <- cuts[1]
x[x > cuts[2]] <- cuts[2]
x
}
truncate <- function(x, prob = 0.01, p_low = prob, p_high = 1 - prob) {
cuts <- quantile(x, probs = c(p_low, p_high), type = 2, na.rm = TRUE)
x[x < cuts[1]] <- NA
x[x > cuts[2]] <- NA
x
}
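Although not part of Leone et al. (2019), a toy example may help illustrate the type argument: with the ten values below, the two algorithms agree at the median but differ in the upper tail, where type = 2 jumps to an observed value while R’s default (type = 7) interpolates.

x <- 1:10
quantile(x, probs = c(0.5, 0.99), type = 2)  # 5.5 and 10.00
quantile(x, probs = c(0.5, 0.99), type = 7)  # 5.5 and 9.91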
To apply these functions, we use mutate with across to transform all variables between ta and ppe inclusive (i.e., we do not apply the functions to gvkey, datadate, fyear, or big_n) and store the results in comp_win and comp_trunc.
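A sketch of this step follows (the ta:ppe range is implied by the select() call above):

comp_win <-
  comp |>
  mutate(across(ta:ppe, winsorize))

comp_trunc <-
  comp |>
  mutate(across(ta:ppe, truncate))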
We next run regressions. The first model we consider is essentially that found in Chen et al. (2018). As Leone et al. (2019) point out, roa and cfo can only be included in a model with ta as the dependent variable when winsorization breaks the identity ta = roa - cfo.¹ In the second model, we use raw data, but drop roa to make the model meaningful. The third and fourth models tweak the second model by using winsorized and truncated data, respectively. Note that we use the update function to make the relationships between the models clearer and to avoid errors caused by unintended differences between models.
fms <- list()
fms[[1]] <- feols(ta ~ big_n + roa + cfo + size + lev + mtb +
fyear * (inv_at + I(d_sale - d_ar) + ppe),
~ gvkey + fyear,
data = comp_win, na.action = na.exclude)
fms[[2]] <- feols(ta ~ big_n + cfo + size + lev + mtb +
fyear * (inv_at + I(d_sale - d_ar) + ppe),
~ gvkey + fyear,
data = comp, na.action = na.exclude)
fms[[3]] <- update(fms[[2]], data = comp_win)
fms[[4]] <- update(fms[[2]], data = comp_trunc)
Given that we don’t use comp_trunc other than to estimate fms[[4]], we could have used an alternative syntax such as the following to avoid creating comp_trunc as a separate variable.
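One possibility (a sketch):

fms[[4]] <- update(fms[[2]],
                   data = comp |> mutate(across(ta:ppe, truncate)))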
The fifth model we estimate uses Cook’s D to identify extreme observations for exclusion. As fixest objects do not support the cooks.distance function, we estimate an OLS model and use that to identify the extreme observations to exclude.
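A sketch of this procedure follows. With na.action = na.exclude, cooks.distance() returns a vector padded with NAs that lines up with the rows of comp; we then drop observations whose Cook’s D exceeds the conventional 4/n cut-off (the same rule used in fit_models() in the appendix).

fm_ols <- lm(ta ~ big_n + cfo + size + lev + mtb +
               fyear * (inv_at + I(d_sale - d_ar) + ppe),
             data = comp, na.action = na.exclude)

comp_cooksd <-
  comp |>
  mutate(cooksd = cooks.distance(fm_ols),
         extreme = coalesce(cooksd > 4 / nobs(fm_ols), FALSE)) |>
  filter(!extreme)

fms[[5]] <- update(fms[[2]], data = comp_cooksd)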
The sixth and final model uses robust regression. Leone et al. (2019) use the robreg module made available for Stata, which defaults to an asymptotic efficiency of 85% for the bisquare estimator used in the second step of the MM estimation procedure, and we mimic this choice here. This efficiency level is obtained by specifying the tuning.psi parameter as 3.4437.²

An alternative would be to follow Koller and Stahel (2017) by selecting control = lmrob.control("KS2014"). Koller and Stahel (2017) suggest that this approach “allows for … estimation in designs with factors and interactions between factors and continuous regressors”, which describes our setting here due to the interaction of fyear (a factor) with inv_at + I(d_sale - d_ar) + ppe (continuous variables).
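A sketch of the first step, refitting the specification of fms[[2]] with lmrob(); specifying na.action = na.exclude is our assumption, made so that the recovered weights can be aligned with the rows of comp below:

fm_rob <- lmrob(ta ~ big_n + cfo + size + lev + mtb +
                  fyear * (inv_at + I(d_sale - d_ar) + ppe),
                data = comp,
                method = "MM",
                control = lmrob.control(tuning.psi = 3.4437),
                na.action = na.exclude)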
Following the recommendation in Leone et al. (2019), we recover the weights from the fitted lmrob object and feed those to an update of the feols model in fms[[2]], as doing so allows us to use cluster-robust standard errors.
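A sketch of that second step. We assume here that weights() returns an NA-padded vector under na.exclude; if it does not, the weights would need to be matched back to the estimation sample by row before re-estimating.

comp_rob <-
  comp |>
  mutate(rob_wt = weights(fm_rob, type = "robustness"))

fms[[6]] <- update(fms[[2]], data = comp_rob, weights = ~ rob_wt)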
modelsummary(fms,
estimate = "{estimate}{stars}",
statistic = "({statistic})",
coef_omit = "(fyear|ppe|inv_at|d_sale)",
gof_map = c("nobs", "r.squared", "adj.r.squared"),
stars = c('*' = .1, '**' = 0.05, '***' = .01))
|             | (1)       | (2)      | (3)       | (4)       | (5)       | (6)        |
|-------------|-----------|----------|-----------|-----------|-----------|------------|
| (Intercept) | −0.031*** | 0.262    | −0.085*** | −0.119*** | −0.107*** | −0.024***  |
|             | (−8.907)  | (1.205)  | (−7.326)  | (−14.801) | (−10.392) | (−26.129)  |
| big_nTRUE   | 0.000     | −0.073   | 0.006     | 0.009**   | 0.020***  | −0.008***  |
|             | (0.265)   | (−1.087) | (1.301)   | (2.623)   | (3.927)   | (−16.153)  |
| roa         | 0.815***  |          |           |           |           |            |
|             | (63.797)  |          |           |           |           |            |
| cfo         | −0.786*** | 0.799*   | 0.257***  | 0.123***  | 0.194***  | −0.406***  |
|             | (−37.418) | (1.763)  | (8.915)   | (5.570)   | (5.212)   | (−150.798) |
| size        | 0.003***  | −0.066*  | 0.004**   | 0.007***  | 0.001     | 0.003***   |
|             | (5.108)   | (−1.864) | (2.580)   | (5.681)   | (0.682)   | (24.969)   |
| lev         | −0.035**  | −0.072   | −0.111*** | −0.031*** | −0.074*** | −0.018***  |
|             | (−2.642)  | (−0.344) | (−5.090)  | (−4.479)  | (−4.183)  | (−23.287)  |
| mtb         | −0.001*** | −0.003   | −0.007*** | −0.006*** | −0.001*** | 0.000      |
|             | (−4.382)  | (−0.943) | (−8.656)  | (−11.104) | (−3.782)  | (0.010)    |
| Num.Obs.    | 113980    | 113980   | 113980    | 106571    | 113552    | 83495      |
| R2          | 0.912     | 0.573    | 0.278     | 0.095     | 0.129     | 0.780      |
| R2 Adj.     | 0.911     | 0.572    | 0.278     | 0.094     | 0.128     | 0.780      |
23.2 Discussion questions
- Leone et al. (2019) present results from a simulation analysis in Table 5 of that paper. Which panel of that table likely best reflects data researchers encounter in practice?
- Consider the left half of Table 5 (i.e., \(\beta_1 = 0.8\)). Using the panel you identified in question 1 above, interpret the implication of the simulation results for researchers in practice. Note that the mean estimates of \(\beta_1\) using Cook’s D or robust regression are very similar. Does Table 5 provide strong support for rejecting Cook’s D in favour of robust regression?
- In the search for new research questions, accounting researchers increasingly need to study variables whose most plausible effect sizes are consistent with the right half of Table 5 (i.e., \(\beta_1 = 0\)). Using the panel you identified in question 1 above, interpret the implication of the simulation results for researchers in practice.
- Leone et al. (2019) report results in Table 2 similar to those given in columns (4), (5), and (6) above. They say, “[when we] perform estimation using RR, the coefficient on \(BIG\_N\) is \(-0.011\) (significant at the 1 percent level), opposite of what Chen et al. (2018) find, and instead consistent with the negative coefficient documented in prior studies using the two-step procedure.” In light of the simulation evidence, how persuasive do you find the evidence above in support of a negative coefficient on big_n? How probative is the evidence of prior studies if it is based on approaches (e.g., winsorization and two-step procedures) that Leone et al. (2019) and Chen et al. (2018) demonstrate lead to unreliable results?
23.3 Appendix: Simulation study from Leone et al. (2019)
Here we present code to run simulations like those underlying Table 5 of Leone et al. (2019). One reason we present this here is to help the reader better understand the results in Table 5.
An additional reason for presenting this particular simulation is to illustrate a modular approach to simulations that is fostered by use of R functions. This approach potentially offers a template for readers looking to develop simulations.
A big-picture view of the simulation is that the run_sim function uses get_data() to generate the data, feeds the resulting data frame to fit_models(), and then passes the returned list of fitted models to compile_stats(). The get_results function feeds parameter values from params to the run_sim function.
The first step is to produce the data. If you wanted to change the data-generating process, this could be achieved simply by editing this function. The purpose of this function is to replicate the data-generating process described in Leone et al. (2019).³ Rather than create separate code for each of the three panels of Table 5, we create one function and use argument values to select the data-generating process used to create the data. For example, the default values (b_true = 0.8, clean = FALSE, and random = TRUE) produce data matching the left half of Panel C of Table 5.
get_data <- function(b_true = 0.8, clean = FALSE,
random = TRUE, n = 10000) {
prop_cont <- if_else(random, 0.02, 0.25)
df <- tibble(x1 = rnorm(n = n),
x2 = rnorm(n = n),
x3 = rnorm(n = n),
v = rnorm(n = n),
y = b_true * x1 + 0.4 * x2 + 0.2 * x3 + v,
z = rnorm(n = n, mean = 3, sd = 1)) |>
mutate(id = row_number(),
contaminated = case_when(!clean & !random ~ id < prop_cont * n & x1 < -1.5,
!clean & random ~ id < prop_cont * n,
.default = FALSE),
cy = if_else(contaminated, y + z, y),
cyw = winsorize(cy),
x1w = winsorize(x1),
x2w = winsorize(x2),
x3w = winsorize(x3),
cyt = truncate(cy),
x1t = truncate(x1),
x2t = truncate(x2),
x3t = truncate(x3))
df
}
Another benefit of putting get_data in a separate function that returns a data frame is that we can examine the pieces one at a time.
get_data()
# A tibble: 10,000 × 17
x1 x2 x3 v y z id contaminated cy cyw
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <lgl> <dbl> <dbl>
1 -0.510 -0.725 0.583 -0.901 -1.48 4.04 1 TRUE 2.55 2.55
2 -0.242 0.517 0.983 -1.22 -1.01 4.41 2 TRUE 3.40 3.40
3 0.774 -1.15 0.687 0.342 0.640 2.66 3 TRUE 3.30 3.30
4 0.193 -0.291 0.0200 0.566 0.608 4.58 4 TRUE 5.18 3.76
5 0.146 -0.622 0.849 -1.51 -1.47 2.07 5 TRUE 0.601 0.601
6 -0.550 0.283 0.191 1.07 0.784 3.12 6 TRUE 3.91 3.76
7 0.0762 -0.0543 -0.698 1.06 0.963 3.75 7 TRUE 4.72 3.76
8 -0.205 -0.921 -0.0902 1.59 1.04 3.52 8 TRUE 4.55 3.76
9 0.460 0.124 -1.09 -0.598 -0.399 4.69 9 TRUE 4.29 3.76
10 0.372 1.24 -0.860 -0.325 0.295 3.37 10 TRUE 3.66 3.66
# ℹ 9,990 more rows
# ℹ 7 more variables: x1w <dbl>, x2w <dbl>, x3w <dbl>, cyt <dbl>, x1t <dbl>,
# x2t <dbl>, x3t <dbl>
The next function fits the models of interest. Following Leone et al. (2019), we fit seven models and store each model as an element of a list, as this makes manipulation of the models (e.g., extracting statistics) easier.
fit_models <- function(df) {
fms <- list()
fms[[1]] <- lm(cy ~ x1 + x2 + x3, data = df)
fms[[2]] <- update(fms[[1]], cyw ~ x1w + x2w + x3w)
fms[[3]] <- update(fms[[1]], cy ~ x1w + x2w + x3w)
fms[[4]] <- update(fms[[1]], cyt ~ x1t + x2t + x3t)
fms[[5]] <- update(fms[[1]], cy ~ x1t + x2t + x3t)
fms[[6]] <-
df |>
mutate(cooksd = cooks.distance(fms[[1]]),
extreme = cooksd > 4 / nobs(fms[[1]])) |>
filter(!extreme) |>
update(fms[[1]], data = _)
fmrob <- lmrob(formula(fms[[1]]),
data = df,
method = "MM",
control = lmrob.control(tuning.psi = 3.4437))
fms[[7]] <- fmrob
fms
}
For the simulation output, we want to retain estimated coefficients and results of tests of the null hypothesis at sizes of 5% and 1%. We create a function to do this, as this makes for lean, easy-to-debug code.
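The extract_stats function itself is not shown above, so here is one possible implementation (a sketch; the use of normal critical values and the column names est, p_5, and p_1 are our assumptions, chosen to fit compile_stats and the summarize call below). It pulls the coefficient on the x1 variant, which is always the second coefficient after the intercept, and records whether the null that the coefficient equals b_null is rejected at the 5% and 1% levels.

extract_stats <- function(fm, b_null = 0) {
  # The x1 variant (x1, x1w, or x1t) is always the second coefficient,
  # just after the intercept.
  coefs <- coef(summary(fm))
  est <- coefs[2, "Estimate"]
  se <- coefs[2, "Std. Error"]
  t_stat <- (est - b_null) / se
  tibble(est = est,
         p_5 = abs(t_stat) > qnorm(0.975),  # reject H0: beta = b_null at 5%
         p_1 = abs(t_stat) > qnorm(0.995))  # reject H0: beta = b_null at 1%
}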
We next create a function that applies extract_stats to a list of models (such as those returned by fit_models) and puts the results in a data frame with human-friendly labels for each of the models.
compile_stats <- function(fms, b_null) {
bind_rows(lapply(fms, extract_stats, b_null = b_null)) |>
mutate(model_num = row_number(),
model = c("Do nothing",
"Winsorize all variables", "Winsorize only X",
"Truncate all variables", "Truncate only X",
"Cook's Distance", "Robust regression"))
}
Now we create run_sim, which runs one iteration of the simulation and returns a data frame with parameter values and statistics from the simulation.
run_sim <- function(sim_num = 1, b_true = 0.8, b_null = b_true,
clean = FALSE, random = FALSE) {
get_data(b_true = b_true, clean = clean, random = random) |>
fit_models() |>
compile_stats(b_null = b_null) |>
mutate(sim_num = sim_num, b_true = b_true, b_null = b_null,
clean = clean, random = random)
}
We now compile a data frame with the parameters we will run in the simulation. We will consider two different values of each of b_true, clean, and random. Note that when clean is TRUE, the value of random has no significance, so we delete the case where clean is TRUE and random is also TRUE to avoid running redundant parameter values.
params <- expand_grid(b_true = c(0, 0.8),
clean = c(TRUE, FALSE),
random = c(FALSE, TRUE)) |>
filter(!(clean & random))
Now we finally have almost all the pieces to run the simulation. We will do this using a get_results function.

One version of the get_results function would use lapply and run in a single process.
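One such implementation is sketched below; the choice of 1,000 iterations per parameter combination (n_sims) is ours for illustration.

n_sims <- 1000

get_results <- function(b_true, clean, random) {
  # Run n_sims iterations for one combination of parameter values and
  # stack the per-iteration statistics into a single data frame.
  bind_rows(lapply(seq_len(n_sims), run_sim,
                   b_true = b_true, clean = clean, random = random))
}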
An alternative approach to get_results would use mclapply from the parallel package and run in multiple processes, for substantially shorter run times. The appropriate value for mc.cores will depend on your computer hardware.
library(parallel)
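A parallel version might look like the following (a sketch; mc.cores = 8 is an assumption to adjust for your machine, and note that mclapply relies on forking, so it falls back to serial execution on Windows):

get_results <- function(b_true, clean, random) {
  bind_rows(mclapply(seq_len(n_sims), run_sim,
                     b_true = b_true, clean = clean, random = random,
                     mc.cores = 8))
}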
The following step, which actually runs get_results, takes a few minutes to run and produces data for all panels of Table 5 of Leone et al. (2019).⁴
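A sketch of that step, applying get_results to each row of params and stacking the results (Map() pairs the columns of params with the arguments of get_results):

all_results <-
  bind_rows(Map(get_results, params$b_true, params$clean, params$random))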
Here we present the left half of Panel B of Table 5 of Leone et al. (2019).
all_results |>
filter(b_true == 0.8, random == FALSE, clean == FALSE) |>
group_by(b_true, clean, random, model_num, model) |>
summarize(across(est:p_1, mean), .groups = "drop") |>
select(model, est, p_5, p_1)
| model                   | est   | p_5   | p_1   |
|-------------------------|-------|-------|-------|
| Do nothing              | 0.703 | 1.000 | 1.000 |
| Winsorize all variables | 0.704 | 1.000 | 1.000 |
| Winsorize only X        | 0.715 | 1.000 | 1.000 |
| Truncate all variables  | 0.681 | 1.000 | 1.000 |
| Truncate only X         | 0.710 | 1.000 | 1.000 |
| Cook's Distance         | 0.760 | 0.908 | 0.808 |
| Robust regression       | 0.774 | 0.494 | 0.283 |
¹ Actually this is the model estimated in column (1) of Table 2 of Leone et al. (2019), as a model estimated with “raw data” would not work. Additionally, Andy Leone confirms that the coefficients on roa and cfo have been switched in Table 2 of Leone et al. (2019).↩︎

² See p. 31 of Maronna et al. (2019). The Stata module robreg uses the function mm_biweight_k from the moremata module to convert efficiency to the equivalent of the tuning.psi parameter.↩︎

³ We thank Miguel Minutti-Meza for sharing his Stata code with us, but note that the description of the simulation in Leone et al. (2019) is clear and complete, and sufficient to allow a reader to reproduce it without this code.↩︎
⁴ On a 2020 Mac mini, which has 4 “performance cores” and 4 “efficiency cores”, producing all_results takes 191 seconds with mc.cores = 4 in get_results and 153 seconds with mc.cores = 8.↩︎