25 Handling extreme values
This material may be moved as the book develops and the best pedagogical approach becomes clear.
In this chapter, we look at approaches used in handling extreme values encountered in realworld data sets.
Leone et al. (2019) show that many papers in accounting research modify or eliminate observations with extreme values, as these can have an outsized influence on regression results. Winsorization defines extreme values as those above or below certain quantiles and replaces these with values at those quantiles (we saw winsorization in Chapter 26). Another approach is truncation, which defines extreme values much as in winsorization, but instead eliminates extreme observations from subsequent analysis.
Leone et al. (2019) also two alternative approaches to handling extreme values. The first approach is the use of diagnostic statistics to identify influential observations where “such observations are excluded from secondpass (or later) regressions estimated to draw a researcher’s ‘final’ inferences” (p. 343).
The second approach is the use of robust regression.
While Leone et al. (2019) allude to an extensive literature on robust regression, like Leone et al. (2019), we focus on socalled MMestimators, which are made available via the robustbase
package.
library(dplyr, warn.conflicts = FALSE)
library(dbplyr)
library(DBI)
library(ggplot2)
library(stargazer)
library(sandwich)
library(robustbase)
library(lfe)
library(purrr)
library(tidyr)
library(farr)
Leone et al. (2019) provide three kinds of evidence regarding approaches to extreme values and influential observations. First, they survey the literature to gather data describing what researchers do to address extreme observations. The results of this effort are found in Table 1 of Leone et al. (2019).
Second, Leone et al. (2019) replicate three studies and examine how inferences vary by approach. Below we study one of these replications.
Third, Leone et al. (2019) provide simulation evidence on the effectiveness of the various approaches.
25.0.1 Replication analysis
To better understand the approaches discussed in Leone et al. (2019), we will follow their replication of Chen et al. (2018), which is also the focus of material later in this chapter.
Much of the following code resembles that used in replicating Lawrence et al. (2011) in Chapter 26, but there are differences.
Note that while Chen et al. (2018) focus on fiscal years from 1996 through 2015, we get data for fiscalyear 1995, so that lagged values are available for 1996.
Note that aftering collecting the data, we convert fyear
into a factor, which facilitates estimation of fixed effects and interaction terms.
pg < dbConnect(RPostgres::Postgres(), bigint = "integer")
company < tbl(pg, sql("SELECT * FROM comp.company"))
funda < tbl(pg, sql("SELECT * FROM comp.funda"))
comp <
funda %>%
filter(indfmt == "INDL", datafmt == "STD",
consol == "C", popsrc == "D") %>%
filter(!between(sich, 6000, 6999)  is.na(sich)) %>%
filter(between(fyear, 1995, 2015)) %>%
group_by(gvkey) %>%
window_order(fyear) %>%
mutate(big_n = as.integer(au) %in% 1:8L,
lag_at = lag(at),
inv_at = if_else(lag_at > 0, 1/lag_at, NA_real_),
ta = if_else(lag_at > 0, (ib  oancf)/lag_at, NA_real_),
roa = if_else(lag_at > 0, ib/lag_at, NA_real_),
cfo = if_else(lag_at > 0, oancf/lag_at, NA_real_),
mkt_cap = prcc_f * csho,
lag_mkt_cap = lag(mkt_cap),
size = if_else(lag_mkt_cap > 0, log(lag_mkt_cap), NA_real_),
debt = coalesce(dltt, 0) + coalesce(dlc, 0),
lev = if_else(lag_at > 0, debt/lag_at, NA_real_),
mtb = if_else(lag(ceq) > 0, lag_mkt_cap/lag(ceq), NA_real_),
assets = lag_at,
d_sale = if_else(lag_at > 0, (revt  lag(revt))/lag_at, NA_real_),
d_ar = if_else(lag_at > 0, (rect  lag(rect))/lag_at, NA_real_),
ppe = if_else(lag_at > 0, ppent/lag_at, NA_real_)) %>%
ungroup() %>%
select(gvkey, datadate, fyear, big_n, ta, big_n, roa, cfo, size, lev,
mtb, inv_at, d_sale, d_ar, ppe) %>%
collect() %>%
mutate(fyear = as.factor(fyear))
We create two functions that are quite similar in structure.
We saw the winsorize
function in Chapter 26.
This function has more general applicability than we will use here, as we will stick to the default argument of prob = 0.01
, which is then mapped into p_low = 0.01
and p_high = 0.99
, which implements “the 1 percent and 99 percent winsorization rule common to accounting studies” (and also used in Chen et al. (2018)).
The first line of the function uses the quantile
function to identify the 1st and 99th percentiles of the distribution of x
and stores these two values in the vector cuts
.
The type
argument should be “an integer between 1 and 9 selecting one of the nine quantile algorithms … to be used.” [see help by typing ? quantile
at the R console; R Core Team (2021)].
The type
argument specifies how ties are resolved and how averages at breakpoints are calculated.
By selecting type = 2
, we get the SAS default, which means we get the same values as we would using the SAS macro commonly used by other researchers.
The third line of the winsorize
function sets all values below cuts[1]
, which is the p_low
quantile, equal to cuts[1]
.
The fourth line sets all values above cuts[2]
(p_high
) equal to cuts[2]
.
The last line is equivalent to return(x)
(i.e., the modified value of x
is returned).
The truncate
function is almost identical, except that values outside the bounds in cuts
are set to NA
, which effectively removes them from later analysis.
winsorize < function(x, prob = 0.01, p_low = prob, p_high = 1  prob) {
cuts < quantile(x, probs = c(p_low, p_high), type = 2, na.rm = TRUE)
x[x < cuts[1]] < cuts[1]
x[x > cuts[2]] < cuts[2]
x
}
truncate < function(x, prob = 0.01, p_low = prob, p_high = 1  prob) {
cuts < quantile(x, probs = c(p_low, p_high), type = 2, na.rm = TRUE)
x[x < cuts[1]] < NA
x[x > cuts[2]] < NA
x
}
To apply these functions we use the mutate
function across
all variables between ta
and ppe
inclusive (i.e., we do not apply them to gvkey
, datadate
, fyear
, or big_n
) and store the results in comp_win
and comp_trunc
.
comp_win <
comp %>%
mutate(across(ta:ppe, winsorize))
comp_trunc <
comp %>%
mutate(across(ta:ppe, truncate))
We next run regressions.
The first model we consider is essentially that found in Chen et al. (2018).
As Leone et al. (2019) point out, roa
and cfo
can only be included in a model with ta
as the dependent variable when winsorization breaks the identity ta = roa  cfo
.^{188}
In the second model, we use raw data, but drop roa
to make the model meaningful.
The third and fourth models tweak the second model by using winsorized and truncated data respectively.
Note that we use the update
function to make the relationships between the models clearer and to avoid errors caused by unintended differences between models.
fms < list()
fms[[1]] < felm(ta ~ big_n + roa + cfo + size + lev + mtb +
fyear * (inv_at + I(d_sale  d_ar) + ppe)  0  0 
gvkey + fyear,
data = comp_win, na.action = na.exclude)
fms[[2]] < update(fms[[1]], ~ .  roa, data = comp)
fms[[3]] < update(fms[[2]], data = comp_win)
fms[[4]] < update(fms[[2]], data = comp_trunc)
Given that we don’t use comp_win
other than to estimate fms[[3]]
, we could have used an alternative syntax such as the following to avoid creating comp_win
as a separate variable.
The fifth model we estimate uses Cook’s D to identify extreme observations for exclusion.
As felm
objects do not support the cooks.distance
function, we estimate an OLS model and use that to identify extreme observations for exclusion.
fm < lm(ta ~ big_n + cfo + size + lev + mtb +
fyear * (inv_at + I(d_sale  d_ar) + ppe),
data = comp, na.action = na.exclude)
fms[[5]] <
comp %>%
mutate(cooksd = cooks.distance(fm),
extreme = cooksd > 4 / nobs(fm),
ta = if_else(extreme, NA_real_, ta)) %>%
update(fms[[2]], data = .)
The sixth and final model uses robust regression.
Leone:2019uc use the robreg
module made available for Stata, which defaults to an asymptotic efficiency of 85% for the bisquare estimator used in the second step of the MM estimation procedure and we mimic this choice here.This efficiency level is obtained by specifying the tuning.psi
parameter as 3.4437
.^{189}
An alternative would be to follow Koller and Stahel (2017) by selecting control = lmrob.control("KS2014")
.
Koller and Stahel (2017) suggest that this approach “allows for … estimation in designs with factors and interactions between factors and continuous regressors”, which describes our setting here due to the interaction of fyear
(a factor) with inv_at + I(d_sale  d_ar) + ppe
(continuous variables).
Following the recommendation in Leone et al. (2019), we recover the weights from the fitted lmrob
object and feed those to an update of the felm
model in fms[[2]]
, as doing so allows us to use clusterrobust standard errors.
fmrob < lmrob(formula(fm),
data = comp, na.action = na.exclude,
method = "MM",
control = lmrob.control(tuning.psi = 3.4437))
fms[[6]] <
fmrob %>%
weights(type = "robustness") %>%
coalesce(0) %>%
update(fms[[2]], data = comp, weights = .)
sg_format < "html"
Finally, to produce a table more like Table 2 of Leone et al. (2019), we coax stargazer
into putting \(t\)statistics in parentheses by feeding \(t\)statistics where it expects standard errors and feeding \(p\)values separately so as to get the right “stars” on the coefficients.
ts < map(fms, ~ summary(.)$coefficients[, 3])
pvals < map(fms, ~ summary(.)$coefficients[, 4])
stargazer(fms, type = sg_format,
column.labels = c("Winsorized", "Raw data", "Winsorized",
"Truncated", "Cook's D", "Robust reg"),
omit = "(fyearppeinv_atd_sale)", report=('vc*s'),
omit.stat = c("f", "ser", "rsq"), header = FALSE,
se = ts, p = pvals,
dep.var.caption = "",
dep.var.labels.include = FALSE)
Winsorized  Raw data  Winsorized  Truncated  Cook’s D  Robust reg  
(1)  (2)  (3)  (4)  (5)  (6)  
big_n  0.0004  0.073  0.006  0.009^{***}  0.020^{***}  0.011^{***} 
(0.263)  (1.085)  (1.287)  (2.632)  (3.913)  (12.671)  
roa  0.815^{***}  
(63.771)  
cfo  0.786^{***}  0.799^{*}  0.257^{***}  0.123^{***}  0.194^{***}  0.145^{***} 
(37.380)  (1.762)  (8.924)  (5.571)  (5.238)  (42.791)  
size  0.003^{***}  0.066^{*}  0.004^{***}  0.007^{***}  0.001  0.003^{***} 
(5.136)  (1.861)  (2.586)  (5.745)  (0.700)  (9.349)  
lev  0.035^{***}  0.072  0.111^{***}  0.031^{***}  0.074^{***}  0.004^{***} 
(2.649)  (0.344)  (5.046)  (4.435)  (4.161)  (4.797)  
mtb  0.001^{***}  0.003  0.007^{***}  0.006^{***}  0.001^{***}  0.00000 
(4.482)  (0.935)  (8.852)  (10.870)  (3.975)  (1.447)  
Constant  0.013^{***}  0.331  0.029^{***}  0.054^{***}  0.059^{***}  0.025^{***} 
(5.519)  (1.443)  (3.277)  (8.764)  (6.766)  (17.363)  
Observations  113,979  113,979  113,979  106,570  113,551  113,979 
Adjusted R^{2}  0.911  0.574  0.278  0.094  0.128  0.943 
Note:  ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 
25.0.2 Discussion questions
 Leone et al. (2019) present results from a simulation analysis in Table 5 of that paper. Which panel of that table likely best reflects data researchers encounter in practice?
 Consider the left half of Table 5 (i.e., \(\beta_1 = 0\)). Using the panel you identified in question 1 above, interpret the implication of the simulation results for researchers in practice. Note that the mean estimates of \(\beta_1\) using Cook’s D or robust regression are very similar. Does Table 5 provide strong support for rejecting Cook’s D in favour of robust regression?
 In the search for new research questions, accounting researchers increasingly need to study variables whose most plausible effect sizes are consistent with the right half of Table 5 (i.e., \(\beta_1 = 0\)). Using the panel you identified in question 1 above, interpret the implication of the simulation results for researchers in practice.

Leone et al. (2019) report results in Table 2 similar to those given in columns (4), (5), and (6) above. They say, “[when we] perform estimation using RR, the coefficient on \(BIG\_N\) is \(0.011\) (significant at the 1 percent level), opposite of what Chen et al. (2018) find, and instead consistent with the negative coefficient documented in prior studies using the twostep procedure.”
In light of the simulation evidence, how persuasive do you find the evidence above in support of a negative coefficient on
big_n
? How probative is the evidence of prior studies if it is based on approaches (e.g., winsorization and twostep procedures) that Leone et al. (2019) and Chen et al. (2018) demonstrate lead to unreliable results?
25.1 Appendix: Simulation study from Leone et al. (2019)
Here we present code to run simulations like those underlying Table 5 of Leone et al. (2019). One reason we present this here is to help the reader better understand the results in Table 5.
An additional reason for presenting this particular simulation is to illustrate a modular approach to simulations that is fostered by use of R functions. This approach potentially offers a template for readers looking to develop simulations.
A bigpicture view of the simulation is that the run_sim
function uses get_data()
to generate the data, then feeds the resulting data frame to fit_models()
, then fits the returned fitted models to compile_stats
.
The get_results
function feeds parameter values from params
to the run_sim
function.
The first step is to produce the data.
If you wanted to change the datagenerating process, this could be achieved simply by editing this function.
The purpose of this function is to replicate the datagenerating process described in Leone et al. (2019).^{190}
Rather than create separate code for each of the three panels of Table 5, we create one function and use argument values to select the datagenerating process used to create the data.
For example, the default values (b_true = 0.8
, clean = FALSE
, and random = TRUE
) produce data matching the left half of Panel C of Table 5.
get_data < function(b_true = 0.8, clean = FALSE,
random = TRUE, n = 10000) {
prop_cont < if_else(random, 0.02, 0.25)
df < tibble(x1 = rnorm(n = n),
x2 = rnorm(n = n),
x3 = rnorm(n = n),
v = rnorm(n = n),
y = b_true * x1 + 0.4 * x2 + 0.2 * x3 + v,
z = rnorm(n = n, mean = 3, sd = 1)) %>%
mutate(id = row_number(),
contaminated = case_when(!clean & !random ~ id < prop_cont * n & x1 < 1.5,
!clean & random ~ id < prop_cont * n,
TRUE ~ FALSE),
cy = if_else(contaminated, y + z, y),
cyw = winsorize(cy),
x1w = winsorize(x1),
x2w = winsorize(x2),
x3w = winsorize(x3),
cyt = truncate(cy),
x1t = truncate(x1),
x2t = truncate(x2),
x3t = truncate(x3))
df
}
Another benefit of putting get_data
in a separate function that returns a data frame is that we can examine the pieces one at a time.
get_data()
## # A tibble: 10,000 × 17
## x1 x2 x3 v y z id contaminated cy cyw
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <lgl> <dbl> <dbl>
## 1 1.22 0.590 0.256 0.202 1.36 4.62 1 TRUE 3.26 3.26
## 2 1.78 0.426 1.93 2.67 4.31 2.62 2 TRUE 6.93 3.62
## 3 0.0264 0.0968 0.904 0.563 0.761 4.62 3 TRUE 5.38 3.62
## 4 0.171 0.320 0.734 1.48 1.33 3.14 4 TRUE 4.47 3.62
## 5 0.511 2.38 1.72 0.787 1.67 2.92 5 TRUE 4.59 3.62
## 6 0.715 0.153 0.713 0.513 0.0221 3.33 6 TRUE 3.31 3.31
## 7 0.227 1.18 0.0928 0.207 0.479 3.67 7 TRUE 3.19 3.19
## 8 0.468 0.815 0.739 0.184 0.0118 3.06 8 TRUE 3.07 3.07
## 9 1.31 0.438 0.237 0.708 1.98 2.00 9 TRUE 0.0249 0.0249
## 10 1.67 0.273 3.24 1.18 3.28 2.76 10 TRUE 6.03 3.62
## # … with 9,990 more rows, and 7 more variables: x1w <dbl>, x2w <dbl>,
## # x3w <dbl>, cyt <dbl>, x1t <dbl>, x2t <dbl>, x3t <dbl>
The next function fits the models of interest. Following Leone et al. (2019), we fit seven models and store each model as an element of a list, as this makes manipulation of the models (e.g., extracting statistics) easier.
fit_models < function(df) {
fms < list()
fms[[1]] < lm(cy ~ x1, data = df)
fms[[2]] < update(fms[[1]], cyw ~ x1w + x2w + x3w)
fms[[3]] < update(fms[[1]], cy ~ x1w + x2w + x3w)
fms[[4]] < update(fms[[1]], cyt ~ x1t + x2t + x3t)
fms[[5]] < update(fms[[1]], cy ~ x1t + x2t + x3t)
fms[[6]] <
df %>%
mutate(cooksd = cooks.distance(fms[[1]]),
extreme = cooksd > 4 / nobs(fms[[1]])) %>%
filter(!extreme) %>%
update(fms[[1]], data = .)
fmrob < lmrob(formula(fms[[1]]),
data = df,
method = "MM",
control = lmrob.control(tuning.psi = 3.4437))
fms[[7]] < fmrob
fms
}
For the simulation output, we want to retain estimated coefficients and results of tests of the null hypothesis at sizes of 5% and 1%. We create a function to do this, as this makes for lean, easytodebug code.
extract_stats < function(fm, b_null) {
rdf < fm$df.residual
coefs < summary(fm)$coefficients
est < coefs[2, 1]
se < coefs[2, 2]
tval < (est  b_null)/se
pval < 2 * pt(abs(tval), rdf, lower.tail = FALSE)
p_5 < pval < 0.05
p_1 < pval < 0.01
tibble(est, p_5, p_1)
}
We next create a function that applies extract_stats
to a list of models (such as those returned by fit_models
) and puts the results in a data frame with humanfriendly labels for each of the models.
compile_stats < function(fms, b_null) {
bind_rows(lapply(fms, extract_stats, b_null = b_null)) %>%
mutate(model_num = row_number(),
model = c("Do nothing",
"Winsorize all variables", "Winsorize only X",
"Truncate all variables", "Truncate only X",
"Cook's Distance", "Robust regression"))
}
Now we create run_sim
, which runs one iteration of the simulation and returns a data frame with parameter values and statistics from the simulation.
run_sim < function(sim_num = 1, b_true = 0.8, b_null = b_true,
clean = FALSE, random = FALSE) {
get_data(b_true = b_true, clean = clean, random = random) %>%
fit_models() %>%
compile_stats(b_null = b_null) %>%
mutate(sim_num = sim_num, b_true = b_true, b_null = b_null,
clean = clean, random = random)
}
We now compile a data frame with the parameters we will run in the simulation.
We will consider different two different values of each of b_true
, clean
, and random
.
Note that when clean
is TRUE
, the value of random
has no significance, so we delete the case where clean
is TRUE
and random
is also TRUE
to avoid running redundant parameter values.
params < expand_grid(b_true = c(0, 0.8),
clean = c(TRUE, FALSE),
random = c(FALSE, TRUE)) %>%
filter(!(clean & random))
Now we finally have almost all the pieces to run the simulation.
We will do this using a get_results
function.
One version of the get_results
function would use lapply
and run in a single thread.
num_sims < 1000L
get_results < function(b_true = 0, clean = FALSE, random = FALSE) {
set.seed(2021)
bind_rows(lapply(1:num_sims, run_sim, b_true = b_true,
clean = clean, random = random))
}
An alternative approach to get_results
would use mclapply
from the parallel
package and run in multiple threads, for substantially shorter run times.
The appropriate value for mc.cores = 8
will depend on your computer hardware.
library(parallel)
num_sims < 1000L
get_results < function(b_true = 0, clean = FALSE, random = FALSE) {
set.seed(2021)
bind_rows(mclapply(1:num_sims, run_sim, b_true = b_true,
clean = clean, random = random,
mc.cores = 4))
}
The following step, which actually runs get_results
, takes a few minutes to run and produces data for all panels of Table 5 of Leone et al. (2019).^{191}
Here we present the left half of Panel B of Table 5 of Leone et al. (2019).
all_results %>%
filter(b_true == 0.8, random == FALSE, clean == FALSE) %>%
group_by(b_true, clean, random, model_num, model) %>%
summarize(across(est:p_1, mean), .groups = "drop") %>%
select(model, est, p_5, p_1) %>%
knitr::kable(digits = 3)
model  est  p_5  p_1 

Do nothing  0.702  1.000  1.000 
Winsorize all variables  0.704  1.000  1.000 
Winsorize only X  0.714  1.000  1.000 
Truncate all variables  0.681  1.000  1.000 
Truncate only X  0.709  1.000  1.000 
Cook’s Distance  0.759  0.926  0.835 
Robust regression  0.774  0.536  0.291 