# 26  Prediction

This chapter focuses on prediction using approaches that fall under the broad heading of statistical learning, also known as machine learning. We begin by discussing what prediction means and how it can be used in research. We then explore methods and concepts related to prediction using the setting of accounting fraud examined in Bao et al. (2020).

Tip

The code in this chapter uses the packages listed below. We load tidyverse because we use several packages from the Tidyverse. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available on GitHub.

library(tidyverse)
library(DBI)
library(farr)
library(furrr)
library(rpart)
library(glmnet)
Important

Some of the code in this chapter—especially in Section 26.6—takes a long time to run, even with multiple cores. To make it easier for readers to run just the code they are interested in, here we provide some guidance regarding how code from one section depends on code in other sections.

Running the code in Section 26.2 is a prerequisite for running the code in Section 26.3, Section 26.5, and Section 26.6, each of which can be run independently of the others. While some results from Section 26.6 are used in Section 26.7, we have embedded these results in the code for Section 26.7, so that from a reader’s perspective the latter section depends only on results of code in Section 26.2. (There is no code in Section 26.1 or Section 26.4.)

## 26.1 Prediction in research

Prediction can be taken to mean the activity of using known information to estimate or predict unknown events or values. In ordinary usage, prediction suggests that the unknown events or values are in the future, but often prediction is of pre-existing facts which may only become known at a later date (e.g., we may use prediction models to diagnose a disease or to detect misstatements in observed financial statements).

In this chapter we will cover a number of statistical approaches to prediction. Prediction is perhaps the most active area of research in statistics and the methods we cover in this chapter fall under the umbrella of statistical learning.

It seems natural to ask why social science researchers would be interested in prediction. As discussed in Chapter 4, research papers can often be distinguished by their research objectives, which might include prediction, measurement, description, and causal inference.

When prediction is part of the research question, the methods covered in this chapter are clearly relevant. A researcher might be interested in prediction not for its own sake, but to understand how information might be used in practice. For example, Bernard and Thomas (1989) were not interested in predicting post–earnings announcement returns so much as providing evidence of the existence of a predictable element of this as evidence of market efficiency.

Prediction methods are also often used in measurement. For example, papers have developed models for predicting events such as bankruptcy and earnings restatement and these models have often been used in subsequent research as measures of, say, bankruptcy risk.

Finally, an emerging area of research examines the usefulness of prediction for causal inference. While prediction is at the heart of causal inference in that a causal model has predictive value, it has been less clear whether predictive models developed using the methods provided in this chapter provide insights into the existence of causal relations.

Beyond research, prediction is also fundamental to financial accounting in many settings. For example, the provision for loan losses is a prediction of future write-offs. Researchers might seek to understand accounting in such settings as prediction problems and might even simulate such processes using statistical learning approaches like those studied in this chapter.

And beyond financial accounting, prediction models have innumerable uses in practice, from identifying anomalies in accounting records to making movie recommendations to Netflix customers.

## 26.2 Predicting accounting fraud

To help the reader understand the issues and approaches used in statistical learning, we focus in this chapter on the setting of Bao et al. (2020), which focuses on predicting accounting fraud in publicly traded US firms during the period 2003–2008.

### 26.2.1 Features

Any prediction problem involves an outcome that we want to predict based on a set of features. Features (also known as predictors) refer to the variables or data that we will use to make predictions.

In this chapter, we follow Bao et al. (2020) in using “28 raw financial data items” drawn from Compustat as our features. While not described in the paper, the PDF version of the SAS code provided by the authors shows that four variables (ivao, pstk, ivst, txp) are replaced with zeros if missing. The SAS code also includes the equivalent of filter(!is.na(at), !is.na(prcc_f)), but given that the equivalent of na.omit() is applied later (as we do at the end of our code here), we have omitted this filter.

db <- dbConnect(RPostgres::Postgres(), bigint = "integer")

funda <- tbl(db, Id(schema = "comp", table = "funda"))

X_vars <- c('act', 'ap', 'at', 'ceq', 'che', 'cogs', 'csho', 'dlc',
'dltis', 'dltt', 'dp', 'ib', 'invt', 'ivao', 'ivst',
'lct', 'lt', 'ni', 'ppegt', 'pstk', 're', 'rect', 'sale',
'sstk', 'txp', 'txt', 'xint', 'prcc_f')

y_var <- "misstate"

features_all <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") |>
  filter(fyear >= 1991, fyear <= 2008) |>
  mutate(across(c(ivao, pstk, ivst, txp),
                \(x) coalesce(x, 0))) |>
  select(gvkey, datadate, fyear, all_of(X_vars))

features <-
  features_all |>
  collect() |>
  na.omit()

The code above uses the PostgreSQL connection established earlier. Alternatively, one can connect to a local parquet-based version of the data using duckdb; the remaining code (from X_vars through features) is unchanged.

db <- dbConnect(duckdb::duckdb())

funda <- load_parquet(db, schema = "comp", table = "funda")

While the features that are used in building a prediction model and even how they are encoded can “have a significant impact on model performance”, we only consider the features used by Bao et al. (2020) in our analysis here and discuss issues related to features later in the chapter.

### 26.2.2 Outcome

An outcome might be measured as a continuous variable (e.g., life expectancy) or as a categorical variable (e.g., “no restatement”, “minor restatement”, or “major restatement”). A categorical variable will often naturally be a binary variable taking one of just two values, but in principle any variable can be coded as a binary variable (e.g., “hot dog” or “not hot dog”).1

In this chapter, we follow Bao et al. (2020) and focus on a single binary categorical variable taking values of “no fraud” and “fraud”, which might be represented using $$\{0, 1\}$$ (i.e., an indicator variable), though in statistical learning one often sees such variables represented using $$\{-1, 1\}$$.2 Bao et al. (2020) identify frauds based on the existence of Accounting and Auditing Enforcement Releases (AAERs) issued by the SEC affecting the firm.3 While the focus on binary variables might seem limiting, it facilitates our presentation of some core ideas of statistical learning within the confines of a single chapter.

Data on the outcome are found in the data set aaer_firm_year supplied with the farr package.

aaer_firm_year
# A tibble: 415 × 4
p_aaer gvkey  min_year max_year
<chr>  <chr>     <int>    <int>
1 1033   021110     1992     1992
2 1037   008496     1992     1993
3 1044   028140     1993     1993
4 1047   012455     1994     1994
5 1053   025927     1993     1993
6 1067   023126     1993     1993
7 1071   024712     1992     1993
8 1104   028450     1994     1995
9 1106   002711     1992     1993
10 1110   010185     1994     1995
# ℹ 405 more rows

As can be seen above, for each AAER (identified by p_aaer), aaer_firm_year contains information on the GVKEY of the affected firm and the first and last fiscal years affected by accounting fraud (min_year and max_year, respectively).

aaer_long <-
  aaer_firm_year |>
  rowwise() |>
  mutate(fyear = list(seq(min_year, max_year, by = 1))) |>
  unnest(fyear) |>
  select(gvkey, p_aaer, fyear, min_year, max_year)
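The rowwise()/unnest() pattern above, which expands each (min_year, max_year) range into one row per affected fiscal year, can be checked on a small invented example:

```r
library(dplyr)
library(tidyr)

# Invented data for illustration only; real data come from aaer_firm_year
toy <- tibble(gvkey = c("001234", "005678"),
              min_year = c(1992, 1993),
              max_year = c(1993, 1993))

toy_long <-
  toy |>
  rowwise() |>
  mutate(fyear = list(seq(min_year, max_year, by = 1))) |>
  unnest(fyear) |>
  ungroup()

toy_long
# Three rows: fiscal years 1992 and 1993 for the first firm; 1993 for the second
```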

In practice, careful thinking about outcome measures is required. The outcome selected and the approach to measuring it are likely to depend on the purpose for which prediction is being made. While Bao et al. (2020) aim to predict “accounting fraud”, it is important to note that the variable p_aaer indicates an Accounting and Auditing Enforcement Release (AAER) by the SEC. But accounting fraud might not result in AAERs, either because it is never detected, or because it is detected but does not rise to the level that leads to an AAER, or even because the fraud is so profound that an AAER is somewhat irrelevant.

With regard to the last category, consider Enron, the public company at the heart of one of the most notorious cases of accounting fraud this century. While the CEO (Jeffrey Skilling) and CFO (Andrew Fastow) of Enron ended up serving time in prison, there is no AAER related to either Skilling or Fastow (many AAERs relate to individuals). There is not even an AAER directed at Enron itself, perhaps because it entered bankruptcy shortly after the fraud was detected. The one AAER in the Bao et al. (2020) sample related to Enron has p_aaer of "1821".4 However, this AAER relates to an order for Citigroup to pay an amount in a settlement arising because “Citigroup assisted [Enron and Dynegy] in enhancing artificially their financial presentations through a series of complex structured transactions … to allow those companies to report proceeds of financings as cash from operating activities”.

Even if we accept that AAERs are the measure of interest, there are several wrinkles related to timing. To see the issues here, consider the AAER with p_aaer equal to 1500, which affected fiscal years 1998, 1999, and 2000.

aaer_firm_year |>
  filter(p_aaer == "1500")
# A tibble: 1 × 4
p_aaer gvkey  min_year max_year
<chr>  <chr>     <int>    <int>
1 1500   028958     1998     2000

Suppose we were an analyst with data on features up to the end of fiscal 1999 and were looking to train a model that could be used to predict future AAERs related to fiscal 1999. This would place us somewhere in 2000 (fiscal 1999 includes firms with fiscal years ending as late as 2000-05-31), and the largest meaningful model-building data set could include fiscal years up to 1998. Sitting in 2018 or 2022, we might be tempted to code p_aaer as 1 for the firm-year with gvkey of 028958 and fyear of 1998. But AAERs are only disclosed after the end of the last fiscal year that they affect, which means that the p_aaer == '1500' AAER would not have been observed by our hypothetical analyst in 2000, who would instead see misstate as 0 for that observation. To simulate the approach taken by our hypothetical analyst, we need to code misstate for this case as 0.

At some point it would be reasonable to assume that the analyst can see the AAER just as we can and use it in training a model, but the question is “when?”. Bao et al. (2020, p. 209) “require a gap of 24 months between the financial results announcement of the last training year and the results announcement of a test year … because Dyck et al. (2010) find that it takes approximately 24 months, on average, for the initial disclosure of the fraud.” For the p_aaer == '1500' AAER, Bao et al. (2020)’s approach would mean that the hypothetical analyst could observe the AAER by the end of fiscal 2002 (two years after the last affected period of 2000).

While Bao et al. (2020) relied on the broad observation of Dyck et al. (2010), data on AAER dates are easily obtained from the SEC’s website for AAERs since late 1999; these data are included in the data set aaer_dates made available with the farr package.5

head(aaer_dates |> select(aaer_num, aaer_date))
# A tibble: 6 × 2
aaer_num  aaer_date
<chr>     <date>
1 AAER-1212 1999-12-14
2 AAER-1211 1999-12-03
3 AAER-1210 1999-11-19
4 AAER-1209 1999-11-16
5 AAER-1208 1999-11-15
6 AAER-1206 1999-11-10

Using these data we can examine the actual distribution of the time between the end of the latest affected fiscal year and the AAER being released by the SEC.

aaer_days <-
  aaer_dates |>
  mutate(p_aaer = str_replace(aaer_num, "^AAER-", "")) |>
  inner_join(aaer_firm_year, by = "p_aaer") |>
  mutate(fyear = max_year) |>
  inner_join(features, by = c("gvkey", "fyear")) |>
  mutate(days_to_aaer = as.integer(aaer_date - datadate)) |>
  filter(!is.na(days_to_aaer))

The plot below reveals a lot of variation that is ignored by a blanket “24 months” assumption (represented by the red dashed vertical line). Most AAERs are released more than two years after the end of the last affected fiscal year; the median is 1,350 days, or about 3.7 years.

aaer_days |>
  ggplot(aes(x = days_to_aaer)) +
  geom_histogram(binwidth = 30) +
  geom_vline(xintercept = 365 * 2, linetype = "dashed", color = "red")

How might we use these data? It would seem appropriate to imagine our analyst looking to apply a model to a fiscal year where that model has been trained on features drawn from prior fiscal years and AAERs revealed to date. For example, if our analyst were now looking to apply a model to data from fiscal 2001, she would be situated somewhere in late 2002 (fiscal 2001 includes firms with fiscal years ending as late as 2002-05-31) and thus would only see AAERs released prior to that time. Assuming the conventional four months for financial data to be released, we might set a “test date” of 2002-09-30 for fiscal 2001, so AAERs with an aaer_date prior to 2002-09-30 would be observable to the analyst.6
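A minimal sketch of this observability rule (observable_aaers() is a hypothetical helper, not from farr; the first two rows below are from aaer_dates above, while the third is invented):

```r
library(dplyr)

# First two rows taken from aaer_dates above; the third row is invented
aaers <- tibble(p_aaer = c("1212", "1206", "9999"),
                aaer_date = as.Date(c("1999-12-14", "1999-11-10",
                                      "2003-06-30")))

# Hypothetical helper: keep only AAERs released on or before the test date
observable_aaers <- function(aaers, test_date) {
  aaers |>
    filter(aaer_date <= test_date)
}

# With a test date of 2002-09-30, the invented 2003 AAER is not yet visible
observable_aaers(aaers, as.Date("2002-09-30"))
```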

For concreteness, let’s assume a “test date” of 2002-09-30 and construct a data set of AAERs. In this data set, we measure misstate, which is assumed to be observable to the analyst on test_date, and misstate_latent, which includes accounting fraud observable to the researcher, but not to the analyst.

get_aaers <- function(test_date, include_latent = FALSE) {
      response = within_sample$misstate)

         response
predicted FALSE  TRUE
    FALSE 66961   201
    TRUE      4     8

Here we see 8 true positive cases, 66961 true negative cases, 201 false negative cases, and 4 false positive cases.

### 26.3.3 Exercises

1. One claimed benefit of classification trees is their ease of interpretation. Can you provide an intuitive explanation for the fitted tree above? If so, outline your explanation. If not, what challenges prevent you from doing so?
2. Use 5-fold cross-validation to assess the performance of the classification tree approach using the parameters above.
3. What do the parameters cp and minbucket represent?
4. How does the classification tree change if you increase the parameter cp?
5. Use 5-fold cross-validation to assess the performance of the classification tree approach with three different values for minbucket (the one used above and two others).

## 26.4 Performance metrics

It is intuitively clear that the performance of the models above is not particularly impressive, but in general we need ways to measure model performance more formally and precisely. A number of evaluation metrics are available to measure the performance of prediction models. In this section, we discuss some of these metrics with a focus on two-class classification problems.22

### 26.4.1 Accuracy

One standard classification performance metric is accuracy, defined as $$\frac{TP+TN}{TP+FN+FP+TN}$$, where $$TP$$ (true positive) is (in our current context) the number of fraudulent firm-years that are correctly classified as fraud; $$FN$$ (false negative) is the number of fraudulent firm-years that are misclassified as non-fraud; $$TN$$ (true negative) is the number of non-fraudulent firm-years that are correctly classified as non-fraud; and $$FP$$ (false positive) is the number of non-fraudulent firm-years that are misclassified as fraud.

Unfortunately, accuracy is not a very useful metric in this setting because of the severe class imbalance.
With fewer than 1% of firms in a typical year being affected by accounting fraud, simply classifying all firm-years as non-fraud would have accuracy of more than 99%.

Two other statistics that are often considered are sensitivity (also known as the true positive rate, or TPR) and specificity (also known as the true negative rate, or TNR):

$\textit{Sensitivity} = \textit{TPR} = \frac{TP}{TP + FN}$

$\textit{Specificity} = \textit{TNR} = \frac{TN}{TN + FP}$

Often we focus on the complement of specificity, which is known as the false positive rate (FPR) and calculated as follows:

$1 - \textit{Specificity} = \textit{FPR} = 1 - \frac{TN}{TN + FP} = \frac{FP}{TN + FP}$

### 26.4.2 AUC

There is an inherent trade-off between sensitivity and specificity. A classification model will typically return something like an estimated probability that an observation is a positive case. We can choose a cut-off probability and classify all observations with an estimated probability at or above the cut-off as positive cases, and all observations with estimated probabilities below the cut-off as negative cases. Starting at a cut-off of zero, every observation would be classified as positive and the TPR would be 1, but the FPR would also be 1. As the cut-off increases, both the TPR and the FPR will decrease. If we collect the pairs of values of TPR and FPR as we increase the cut-off, we can plot the TPR against the FPR to form what is called the receiver operating characteristic curve (or ROC curve).

In terms of the ROC curve, the worst-case model just randomly assigns probabilities to each observation. So with a cut-off of 0.25, we’d classify 75% of observations as positive and 25% as negative. In expectation, we’d have 75% of the positive cases classified as positive, and 75% of the negative cases classified as positive, so the TPR and FPR values would be equal in expectation. This would yield an ROC curve close to the 45-degree line.
The best-case model would be a perfect classifier that yielded a zero FPR and 100% TPR. This would yield an ROC curve that rose vertically along the $$y$$-axis and then ran horizontally at $$y = 100\%$$. This suggests the area under the ROC curve (or AUC) as a measure of the quality of a model, with values of $$0.5$$ for the random-guess model and $$1$$ for the perfect classifier. Below, we follow Bao et al. (2020) (and many others) in using AUC as the performance measure to evaluate models.

### 26.4.3 NDCG@k

Bao et al. (2020) also consider an alternative performance evaluation metric: the normalized discounted cumulative gain at position $$k$$ (NDCG@k). The NDCG@k metric can be motivated using the idea of budget constraints. Suppose that a regulator (or other party interested in unearthing frauds) has a limited budget and can only investigate $$k$$ cases in a given year. Denoting the number of actual frauds in a year as $$j$$: if $$j \geq k$$, then the best that the regulator can do is to investigate $$k$$ cases, all of which are frauds; if $$j < k$$, then the regulator can do no better than investigating $$k$$ cases with all $$j$$ frauds being among those investigated. The assumption is that the regulator investigates the $$k$$ cases with the highest estimated probability of being frauds (scores). This suggests a performance measure that scores a model higher the greater the number of true frauds ranked in the top $$k$$ scores. NDCG@k is such a measure. NDCG@k also awards more “points” the higher in the list (based on scores) each true fraud is placed. This is implemented by multiplying each correct guess by $$1/\log_2 (i + 1)$$. Formally, the discounted cumulative gain at position $$k$$ is defined as follows:

$\textrm{DCG@k} = \sum_{i=1}^k \textit{rel}_i/\log_2 (i + 1)$

The measure NDCG@k is DCG@k normalized by the DCG@k value obtained when all the true instances of fraud are ranked at the top of the ranking list.
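The DCG@k formula and its normalization can be implemented in a few lines (the function names and toy scores below are ours, not from Bao et al. (2020) or the farr package):

```r
# DCG@k: rel is a 0/1 relevance vector ordered by descending score
dcg_at_k <- function(rel, k) {
  rel_k <- rel[seq_len(k)]
  sum(rel_k / log2(seq_len(k) + 1))
}

# NDCG@k: normalize by the DCG@k of an ideal ranking (all frauds on top)
ndcg_at_k <- function(score, response, k) {
  rel <- response[order(score, decreasing = TRUE)]
  ideal <- sort(response, decreasing = TRUE)
  dcg_at_k(rel, k) / dcg_at_k(ideal, k)
}

# Toy example: five cases, two true frauds, k = 2
ndcg_at_k(score = c(0.9, 0.8, 0.3, 0.2, 0.1),
          response = c(1, 0, 1, 0, 0), k = 2)
# About 0.61: the top-ranked case is a fraud, but the second fraud
# sits at position 3, outside the top k
```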
Hence, the values of NDCG@k are bounded between $$0$$ and $$1$$, and a higher value represents better performance. Bao et al. (2020) select $$k$$ so that the number of firm-years in the ranking list represents 1% of all the firm-years in a test year: they “select a cut-off of 1% because the average frequency of accounting fraud punished by the SEC’s AAERs is typically less than 1% of all firms in a year.”

Bao et al. (2020) also report modified versions of sensitivity and precision:

$\text{Sensitivity@k} = \frac{TP@k}{TP@k + FN@k}$

$\text{Precision@k} = \frac{TP@k}{TP@k + FP@k}$

where $$TP@k$$, $$FN@k$$, and $$FP@k$$ are analogous to $$TP$$, $$FN$$, and $$FP$$, but based on classifying as frauds only the $$k$$ observations with the highest scores.

## 26.5 Penalized models

Having a number of ways to measure model performance alone does not address the basic problem of poor model performance using the two fundamental approaches above. In this section and the next, we describe two strategies that can be used to improve model performance. The first approach uses penalized models.

Perhaps the easiest way to understand penalized models is to recognize that many common modelling approaches seek to minimize the sum of squared errors:

$\textrm{SSE} = \sum_{i = 1}^{n} (y_i - \hat{y}_i )^2$

With penalized models, the objective is modified by adding a penalty term.
For example, in ridge regression the following objective function is minimized:23

$\textrm{SSE}_{L_2} = \sum_{i = 1}^{n} (y_i - \hat{y}_i )^2 + \lambda \sum_{j=1}^{P} \beta_j^2$

An alternative approach is the least absolute shrinkage and selection operator (or lasso) model, which uses the following objective function:

$\textrm{SSE}_{L_1} = \sum_{i = 1}^{n} (y_i - \hat{y}_i )^2 + \lambda \sum_{j=1}^{P} \left|\beta_j \right|$

While both ridge regression and the lasso shrink coefficients towards zero (hence the alternative term shrinkage methods), the lasso will set some coefficients exactly to zero for sufficiently large values of $$\lambda$$. In effect, the lasso performs both shrinkage and feature selection. A third shrinkage method worth mentioning is the elastic net, which includes the penalties from both ridge regression and the lasso:

$\textrm{SSE}_{\textrm{ENet}} = \sum_{i = 1}^{n} (y_i - \hat{y}_i )^2 + \lambda_1 \sum_{j=1}^{P} \beta_j^2 + \lambda_2 \sum_{j=1}^{P} \left|\beta_j \right|$

One aspect of these penalized objective functions that we have not discussed is $$\lambda$$. Implicitly, $$\lambda > 0$$ (if $$\lambda = 0$$, the penalty term vanishes and we are back to the unpenalized objective), but we have said nothing about how this parameter is selected.24 In fact, $$\lambda$$ is a meta-parameter of the kind alluded to above, as it is not estimated directly as part of the estimation process. A common approach to setting meta-parameters like $$\lambda$$ uses cross-validation. With cross-validation, a number of candidate values of $$\lambda$$ are considered. For each candidate value and each cross-validation fold, the model is estimated omitting the data in the fold. The performance of the model is evaluated using predicted values for the cross-validation folds (e.g., the average mean-squared error, or MSE, across all folds). Finally, the value of $$\lambda$$ that maximizes performance (e.g., minimizes MSE) is chosen as the value to be used when applying the model to new data (e.g., test data when applicable).
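To see the difference between the two penalties, consider the special case of a single standardized predictor (so that $$x'x = 1$$). Minimizing the penalized SSE objectives above then gives closed-form solutions: ridge divides the unpenalized estimate by $$1 + \lambda$$, while the lasso soft-thresholds it at $$\lambda/2$$. The sketch below uses an invented unpenalized estimate of 0.8:

```r
b_ols <- 0.8  # unpenalized (OLS) estimate; invented for illustration

# Closed-form solutions for the one-predictor, x'x = 1 case
ridge_beta <- function(b, lambda) b / (1 + lambda)
lasso_beta <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)

lambda <- c(0, 0.5, 1, 2)
data.frame(lambda,
           ridge = ridge_beta(b_ols, lambda),
           lasso = lasso_beta(b_ols, lambda))
# Ridge shrinks the estimate towards zero but never reaches it;
# the lasso sets it exactly to zero once lambda >= 2 * abs(b_ols)
```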
To illustrate, we use the lasso analogue of the logit regression we estimated above.25 For this purpose, we use the glmnet library and specify alpha = 1 to select the lasso model. The glmnet library contains built-in functionality for cross-validation. We simply need to specify the performance metric to use in evaluating models (here we set type.measure = "auc" to use AUC) and—because we constructed our folds to put all observations on a given firm in the same fold—the foldid for each observation.

dft <-
  data_train |>
  inner_join(sample_splits, by = "gvkey")

fm_lasso_cv <- cv.glmnet(x = as.matrix(dft[X_vars]),
                         y = dft[[y_var]],
                         family = "binomial",
                         alpha = 1,
                         type.measure = "auc",
                         foldid = dft[["fold"]],
                         keep = TRUE)

The glmnet function will automatically consider a range of $$\lambda$$ values. Note that the scale of the features matters with penalized regression, as the penalties increase with the magnitude of the coefficients, which in turn depends on the scale of the variables. For example, a logit coefficient on at expressed in millions of dollars of $$0.123$$ is equivalent to a coefficient of $$123$$ on at expressed in thousands of dollars. Thus glmnet will by default standardize the features (see ? glmnet for additional details).

tibble(lambda = fm_lasso_cv$lambda,
       auc = fm_lasso_cv$cvm) |>
  ggplot(aes(x = lambda, y = auc)) +
  geom_line()

Here we see that AUC is maximized with $$\lambda \approx 0.00035$$. Predicted values for each value of $$\lambda$$ considered are stored in fit.preval of fm_lasso_cv as log-odds ratios. We can convert these to probabilities using the function $$f(x) = \frac{\exp(x)}{1+\exp(x)}$$. Here we select the value of $$\lambda$$ that maximizes AUC (this is stored in fm_lasso_cv$lambda.min).26

idmin <- match(fm_lasso_cv$lambda.min, fm_lasso_cv$lambda)

fit_lasso_cv <-
  dft |>
  select(misstate) |>
  mutate(logodds = fm_lasso_cv$fit.preval[, idmin],
         prob = exp(logodds) / (1 + exp(logodds)),
         predicted = logodds > 0)

From this we can calculate the confusion matrix based on the models fitted using cross-validation.

table(predicted = fit_lasso_cv$predicted,
      response = fit_lasso_cv$misstate)

         response
predicted FALSE  TRUE
    FALSE 66960   209
    TRUE      5     0

We can produce the fitted model based on our chosen $$\lambda$$ by estimating a single model for the entire training data set. We store this model in fm_lasso.

fm_lasso <- glmnet(x = as.matrix(data_train[X_vars]),
                   y = data_train[[y_var]],
                   family = "binomial",
                   lambda = fm_lasso_cv$lambda.min,
                   alpha = 1)

1. What features are selected by the lasso model in this case? (Hint: You may find it useful to inspect fm_lasso$beta[, 1].)
2. Modify the code above to estimate a ridge regression model. (Hint: Set alpha = 0.)
3. Describe how you would estimate an elastic net model. (Hint: Discussion here might be useful. One approach would be to put some of the steps above in a function that accepts alpha as an argument and then estimate over different values of $$\alpha$$.)
4. Calculate AUC and produce the confusion matrix for fm_lasso applied to data_train (i.e., in sample). Interpret these results. (Hint: You might use this code: predict(fm_lasso, newx = as.matrix(data_train[X_vars])) and the auc function supplied with the farr package.)
5. Calculate the AUC for the data stored in within_sample above (this is for the tree stored in fm2).

## 26.6 Ensemble methods

A second approach to addressing the weaknesses of building models directly from the building blocks uses ensemble methods, which combine the predictions from many models to generate a single prediction for each observation. The ensemble methods we cover here are all based on trees.

One of the earliest ensemble methods is bagging, which is short for bootstrap aggregation. With bagging, a value $$M$$ is selected and, for each $$i \in (1, \dots, M)$$, a bootstrap sample of the original data is taken and a model is fit. The prediction for any observation is an average of the predictions of the $$M$$ models.27 One benefit of bagging is that, for any given observation, several of the $$M$$ models will not have used that observation in the fitting process, so the predictions from these models are, in a sense, out of sample for that observation. Such observations are termed out-of-bag and provide a natural measure of predictive performance, much like cross-validation measures. One additional aspect of bagging is that there are no dependencies between the fitted models.
So one could in principle divide the estimation of the $$M$$ models between $$M$$ computers (or cores) and (assuming sufficient computing resources) generate a prediction almost as quickly as one could from a single model.

A refinement of bagging is random forests, which not only use bootstrap samples of the observations, but also randomly draw a subset of $$k < P$$ features in fitting each model. By randomly selecting features in addition to randomly selecting observations, random forests generate trees with lower correlation than bagging does. This can reduce the risk that the model overfits the sample using certain features and also produces more robust models when features are highly correlated.

The final set of ensemble methods we discuss are known as boosting methods. The basic idea of boosting is that each observation has a weight that varies from one iteration to the next: observations that are incorrectly classified in iteration $$k$$ receive more weight in iteration $$k + 1$$. In this way, the observations that are difficult to classify can receive increasingly larger weights until the algorithm is able to identify a model that classifies them correctly. Below we describe two boosting algorithms in more detail: AdaBoost—which was “the first practical implementation of boosting theory”—and RUSBoost—which was introduced to accounting research by Bao et al. (2020).28 Because the outcome of iteration $$k$$ affects the model fit in iteration $$k + 1$$, boosting is not as easily implemented in parallel as bagging or random forests.

### 26.6.1 AdaBoost

Boosting is perhaps best understood in the context of an actual algorithm implementing it. Hastie et al. (2013) outline the AdaBoost algorithm as follows:

1. Initialize the observation weights $$w_i = \frac{1}{N}$$, $$i = 1, 2, \dots, N$$.
2. For $$m = 1 \to M$$:
    a. Fit a classifier $$G_m(x)$$ to the training data using weights $$w_i$$.
    b. Compute $\mathrm{err}_m = \frac{\sum_{i=1}^N w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$
    c. Compute $$\alpha_m = \log((1 - \mathrm{err}_m)/\mathrm{err}_m)$$.
    d. Set $$w_i \leftarrow w_i \exp\left[\alpha_m I(y_i \neq G_m(x_i)) \right]$$.
3. Output $$G(x) = \mathrm{sign} \left[\sum_{m=1}^M \alpha_m G_m (x) \right]$$.

A couple of observations on this algorithm are in order. First, step 3 implicitly assumes that the classes are coded as either $$+1$$ or $$-1$$, so that the sign indicates the class. Second, we have not described what classifier is represented by $$G_m(x)$$.

We provide an implementation of AdaBoost in the farr package. Because the RUSBoost algorithm we discuss below is a variation on AdaBoost, we use one function (rusboost) to cover both AdaBoost (rus = FALSE) and RUSBoost (rus = TRUE). We use the rpart function discussed above as the basic building block for rusboost (i.e., $$G_m(x)$$). This means any arguments that can be passed to rpart via the control argument can be used with rusboost. In addition, a user of rusboost can specify the number of iterations ($$M$$) using the size argument and a learning rate ($$L \in (0, 1]$$) using the learn_rate argument. The learning rate modifies step 2(c) of the AdaBoost algorithm above to:

c. Compute $$\alpha_m = L \log((1 - \mathrm{err}_m)/\mathrm{err}_m)$$.

(Hence the default learn_rate = 1 implements the algorithm above.)

We make a convenience function fit_rus_model that allows us to set maxdepth and minbucket values passed to rpart directly.
fit_rus_model <- function(df, size = 30, rus = TRUE, learn_rate = 1,
                          maxdepth = NULL, minbucket = NULL, ir = 1) {
  if (!is.null(maxdepth)) control <- rpart.control(maxdepth = maxdepth)
  if (!is.null(minbucket)) control <- rpart.control(minbucket = minbucket)
  fm <- rusboost(formula, df, size = size, ir = ir,
                 learn_rate = learn_rate, rus = rus,
                 control = control)
  return(fm)
}

We next create a convenience function for calculating fitted probabilities and predicted classes for data in a fold.

rus_predict <- function(fold, ...) {
  dft <-
    data_train |>
    inner_join(sample_splits, by = "gvkey") |>
    mutate(misstate = as.integer(misstate))

  fm <-
    dft |>
    filter(fold != !!fold) |>
    fit_rus_model(...)

  res <-
    dft |>
    filter(fold == !!fold) |>
    mutate(prob = predict(fm, pick(everything()), type = "prob"),
           predicted = predict(fm, pick(everything()), type = "class")) |>
    select(gvkey, fyear, prob, predicted)

  res
}

We next create a function that fits models by fold and returns the resulting cross-validated AUC. Note that we use future_map() from the furrr library here.

get_auc <- function(...) {
  set.seed(2021)
  rus_fit <-
    folds |>
    future_map(rus_predict,
               .options = furrr_options(seed = 2021),
               ...) |>
    list_rbind() |>
    inner_join(data_train, by = c("gvkey", "fyear")) |>
    select(gvkey, fyear, predicted, prob, misstate)

  auc(score = rus_fit$prob, response = as.integer(rus_fit$misstate))
}

The meta-parameters that we evaluate are size ($$M$$, the number of iterations), learn_rate ($$L$$, the learning rate), and maxdepth (the maximum depth of trees).

params <- expand_grid(rus = FALSE,
                      size = c(30, 50),
                      learn_rate = c(0.1, 0.5, 1),
                      maxdepth = c(3, 5, 7))

We can now use our get_auc() function to evaluate the model performance for each of the possible parameter values in params.

plan(multisession)

adaboost_results <-
  params |>
  mutate(auc = pmap_dbl(params, get_auc)) |>
  select(-rus)

Here are the first few rows of the results.
```r
adaboost_results |>
  arrange(desc(auc)) |>
  head()
```

We store the row (i.e., the set of parameter values) that maximizes AUC in optim_ada.

```r
optim_ada <-
  adaboost_results |>
  filter(auc == max(auc))
```

### 26.6.2 RUSBoost

RUSBoost is a variant of AdaBoost that makes use of random undersampling (RUS) to address the problem of class imbalance. When there are very few positive cases (i.e., fraud firm-years) relative to negative cases, RUSBoost allows a model to train using samples with substantially more positive cases than are found in the broader population. Given the relatively small number of fraud firm-years, the setting of Bao et al. (2020) seems to be a good one for RUSBoost.

Seiffert et al. (2008) lay out the RUSBoost algorithm as follows:29

1. Initialize the observation weights $w_i = \frac{1}{N}$, $i = 1, 2, \dots, N$.
2. For $m = 1 \to M$:
    1. Create temporary training dataset $S^{'}_t$ with distribution $D^{'}_t$ using random under-sampling.
    2. Fit a classifier $G_m(x)$ to the temporary training data using weights $w_i$.
    3. Compute
    $$\mathrm{err}_m = \frac{\sum_{i=1}^N w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$$
    4. Compute $\alpha_m = L \log\left((1 - \mathrm{err}_m)/\mathrm{err}_m\right)$.
    5. Set $w_i \leftarrow w_i \exp\left[\alpha_m I(y_i \neq G_m(x_i)) \right]$.
3. Output $G(x) = \mathrm{sign} \left[\sum_{m=1}^M \alpha_m G_m (x) \right]$.

In each iteration, the RUS algorithm samples from the training data differently for fraudulent and non-fraudulent firm-years. RUSBoost requires setting the ratio between the number of undersampled majority-class observations (i.e., non-fraud) and the number of minority-class observations (i.e., fraud). The rusboost() function allows the user to specify this ratio using the ir argument. Bao et al. (2020) estimate RUSBoost models using a 1:1 ratio (i.e., ir = 1), which means that each sample they used in estimation contained the same number of fraudulent and non-fraudulent observations.
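The effect of the ir ratio can be seen in a minimal undersampling sketch. (Python is used here purely for illustration; rus_sample() is a hypothetical helper, and the actual sampling happens inside rusboost().)

```python
import random

def rus_sample(fraud_idx, nonfraud_idx, ir=1, seed=2021):
    """Draw one undersampled training set: all minority (fraud) cases,
    sampled with replacement, plus ir * n_fraud majority (non-fraud)
    cases sampled without replacement."""
    rng = random.Random(seed)
    fraud = [rng.choice(fraud_idx) for _ in fraud_idx]         # with replacement
    nonfraud = rng.sample(nonfraud_idx, ir * len(fraud_idx))   # without replacement
    return fraud + nonfraud

# 5 fraud versus 100 non-fraud firm-years: ir = 1 yields a 1:1 sample
# of 10 observations; ir = 2 yields a 1:2 sample of 15.
fraud_idx = list(range(5))
nonfraud_idx = list(range(5, 105))
sample_1 = rus_sample(fraud_idx, nonfraud_idx, ir=1)
sample_2 = rus_sample(fraud_idx, nonfraud_idx, ir=2)
```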
Fraudulent firm-years—which are much rarer—are randomly sampled with replacement up to their number in the training data (e.g., if there are 25 fraudulent observations in the training data, 25 such observations will be in the sample used for estimation). Non-fraudulent firm-years—which are far more numerous—are randomly sampled without replacement up to a number that is typically expressed as a multiple of the number of fraudulent observations (e.g., if there are 25 fraudulent observations in the training data and ir = 1, 25 non-fraudulent observations will be in the sample used for estimation). In addition to ir = 1, we consider ir = 2 (the latter selecting two non-fraud observations for each fraud observation).

We use a different set of meta-parameter values for size, learn_rate, and maxdepth for RUSBoost than we did for AdaBoost, as the relation between meta-parameters and performance differs between the two approaches.30

```r
params_rus <- expand_grid(rus = TRUE,
                          size = c(30, 100, 200, 300),
                          learn_rate = c(0.01, 0.1, 0.5, 1),
                          maxdepth = c(2, 3, 5, 10, 15),
                          ir = c(1, 2))
```

Again, we can use the get_auc() function to evaluate model performance for each set of parameter values (now in params_rus).

```r
plan(multisession)

rusboost_results <-
  params_rus |>
  mutate(auc = pmap_dbl(params_rus, get_auc))
```

Here are the first few rows of the results.

```r
rusboost_results |>
  arrange(desc(auc)) |>
  head()
```

We store the row that maximizes AUC in optim_rus.

```r
optim_rus <-
  rusboost_results |>
  filter(auc == max(auc))
```

Compared with AdaBoost, RUSBoost performance is maximized with larger models (size of 300 rather than 30) and deeper trees (maxdepth of 10 rather than 7). Both approaches perform best with a learn_rate less than 1. Before testing our models, we note that the cross-validated AUCs for both AdaBoost and RUSBoost are appreciably higher than those for the logit model, the single tree, and the penalized logit models from above.

### 26.6.3 Discussion questions

1.
Provide reasons why the tree-based models might outperform the logit-based models in this setting. Do you think this might be related to the set of features considered? How might you alter the features if you wanted to improve the performance of the logit-based models?

## 26.7 Testing models

We now consider the out-of-sample performance of our models. Given the apparent superiority of the boosted ensemble approaches (AdaBoost and RUSBoost) over the other approaches, we focus on those two approaches here.

We follow Bao et al. (2020) in a number of respects. First, we use fiscal years 2003 through 2008 as our test years. Second, we fit a separate model for each test year using data on AAERs that had been released by the posited test date for that test year. Third, we exclude the year prior to the test year from the model-fitting data set (we do this by using filter(fyear <= test_year - gap) where gap = 2). Note, however, that the rationale provided by Bao et al. (2020) for omitting this year differs from ours: Bao et al. (2020) wanted some assurance that the AAERs were observable when building the test model; in contrast, we simply want to reduce the measurement error in the misstate variable that arises when using too recent a sample.31 Fourth, rather than using gap to get assurance that the AAERs were observable when building the test model, we use data on AAER dates to ensure that the AAERs we use had been revealed by our test dates.

For each test_year, we assume that we are fitting a model on 30 September in the year after test_year, so as to allow at least four months for financial statement data to be disclosed (fiscal years for, say, 2001 can end as late as 31 May 2002, so 30 September 2002 gives four months for these).
```r
test_years <- 2003:2008

test_dates <-
  tibble(test_year = test_years) |>
  mutate(test_date = as.Date(paste0(test_year + 1, "-09-30")))
```

We now create a function that allows us to pass a set of features, specify the test_year and the parameters to be used in fitting the model (e.g., learn_rate and ir), and which returns a fitted model.

```r
train_model <- function(features, test_year, gap = 2, size,
                        rus = TRUE, learn_rate, ir, control) {
  test_date <-
    test_dates |>
    filter(test_year == !!test_year) |>
    select(test_date) |>
    pull()

  aaers <- get_aaers(test_date = test_date)

  data_train <-
    aaers |>
    right_join(features, by = c("gvkey", "fyear")) |>
    filter(fyear <= test_year - gap) |>
    mutate(misstate = as.integer(coalesce(misstate, FALSE)))

  fm <- rusboost(formula, data_train, size = size, learn_rate = learn_rate,
                 rus = rus, ir = ir, control = control)
  return(fm)
}
```

Note that in training our final model, we can use all available data in the training data set. For a test year of 2003, our training data set is the same as that used for selecting the meta-parameters above, but in the cross-validation process above we generally used only a subset of those data in fitting any given model.

We next create a function test_model() that allows us to pass a set of features (features) and a fitted model (fm), to specify the test year (test_year), and which returns the predicted values and estimated probabilities for the test observations.
```r
test_model <- function(features, fm, test_year) {
  target <- as.character(stats::as.formula(fm[["formula"]])[[2]])

  df <-
    features |>
    left_join(aaer_long, by = c("gvkey", "fyear")) |>
    select(-max_year, -min_year) |>
    mutate(misstate = as.integer(!is.na(p_aaer)))

  data_test <-
    df |>
    filter(fyear == test_year)

  return(list(score = predict(fm, data_test, type = "prob"),
              predicted = as.integer(predict(fm, data_test,
                                             type = "class") == 1),
              response = as.integer(data_test[[target]] == 1)))
}
```

We next make a convenience function that calls train_model(), then test_model() in turn.32

```r
fit_test <- function(x, features, gap = 2, size, rus = TRUE,
                     learn_rate, ir = 1, control) {
  set.seed(2021)
  fm <- train_model(features = features, test_year = x, size = size,
                    rus = rus, ir = ir, learn_rate = learn_rate,
                    control = control, gap = gap)
  test_model(features = features, fm = fm, test_year = x)
}
```

We first test AdaBoost using the meta-parameter values stored in optim_ada and store the results in results_ada. We have included the values generated by the code in Section 26.6 here so that you do not need to run the code from that section.

```r
optim_ada <- tibble(size = 30, learn_rate = 0.1, maxdepth = 7)
```

Again, we use future_map() from the furrr library here.

```r
plan(multisession)

results_ada <- future_map(test_years, fit_test,
                          features = features,
                          size = optim_ada$size,
                          rus = FALSE,
                          learn_rate = optim_ada$learn_rate,
                          control = rpart.control(maxdepth = optim_ada$maxdepth),
                          .options = furrr_options(seed = 2021))
```

We then test RUSBoost using the meta-parameter values stored in optim_rus and store the results in results_rus. Again, we have included the values generated by the code in Section 26.6 so that you do not need to run the code from that section.

```r
optim_rus <- tibble(size = 300, learn_rate = 0.01, maxdepth = 10, ir = 2)
```

```r
plan(multisession)

results_rus <- future_map(test_years, fit_test,
                          features = features,
                          size = optim_rus$size, rus = TRUE,
                          ir = optim_rus$ir,
                          learn_rate = optim_rus$learn_rate,
                          control = rpart.control(maxdepth = optim_rus$maxdepth),
                          .options = furrr_options(seed = 2021))
```

To make it easier to tabulate results, we create a small function that collects performance statistics for a set of results.

```r
get_stats <- function(results) {
  response <- results$response
  score <- results$score

  ndcg_k <- function(k) {
    c(ndcg(score, response, k = k))
  }

  res <- list(auc(score, response),
              unlist(map(seq(0.01, 0.03, 0.01), ndcg_k)))
  unlist(res)
}
```
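To fix ideas about NDCG@k, here is a minimal version with binary relevance. (A Python sketch of the standard formula; note that ndcg() from farr is called above with k values of 0.01–0.03, which appear to represent proportions of the sample, whereas this sketch uses an integer cutoff.)

```python
import math

def ndcg_at_k(scores, response, k):
    """NDCG@k with binary relevance: discounted gain of the realized
    ranking (by descending score) relative to an ideal ranking."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rel = [response[i] for i in order]
    # Realized DCG over the top-k positions
    dcg = sum(r / math.log2(pos + 2) for pos, r in enumerate(rel[:k]))
    # Ideal DCG: all positives (up to k) ranked first
    ideal_hits = min(k, sum(response))
    idcg = sum(1 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Three frauds among six firm-years; only one of the top-2 scores is a
# fraud, so NDCG@2 falls short of the perfect-ranking value of 1.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
response = [1, 0, 1, 0, 0, 1]
ndcg_2 = ndcg_at_k(scores, response, 2)
```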

The analyses in Bao et al. (2020) appear to evaluate test performance using the means across test years, so we make a small function (add_means()) to add a row of means to the bottom of a data frame and use this function in table_results(), which extracts performance statistics from a set of results, adds means, then formats the resulting data frame for display.

```r
add_means <- function(df) {
  bind_rows(df, summarise(df, across(everything(), mean)))
}
```
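The behaviour of add_means() can be verified with a quick language-neutral check (Python here, with a list of dicts standing in for a data frame):

```python
def add_means(rows):
    """Append a row of column means to a list of per-year stat dicts,
    mirroring the add_means() helper defined above in R."""
    keys = rows[0].keys()
    mean_row = {k: sum(r[k] for r in rows) / len(rows) for k in keys}
    return rows + [mean_row]

# Two test years of (hypothetical) statistics; the appended row holds
# the column means.
stats = [{"auc": 0.70, "ndcg": 0.02},
         {"auc": 0.74, "ndcg": 0.04}]
with_means = add_means(stats)
```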

```r
table_results <- function(results, ...) {
  results |>
    map(get_stats) |>
    bind_rows() |>
    # add_means() appends the row labelled "Mean" below
    add_means() |>
    mutate(test_year = c(test_years, "Mean")) |>
    select(test_year, everything())
}
```

We can now simply call table_results() for results_ada and results_rus in turn.

```r
table_results(results_ada)
```

```r
table_results(results_rus)
```

From the results above, it does seem that RUSBoost is superior to AdaBoost in the current setting using the features drawn from Bao et al. (2020). Note that Bao et al. (2020) do not separately consider AdaBoost as an alternative to RUSBoost. Bao et al. (2020) argue that RUSBoost is superior to the all-features logit model, which seems consistent with our evidence above, though we did not consider the logit model beyond the cross-validation analysis given its poor performance (but see exercises below).

### 26.7.1 Discussion questions and exercises

1. In asserting that RUSBoost is superior to AdaBoost, we did not evaluate the statistical significance of the difference between the AUCs for the two approaches. Describe how you might evaluate the statistical significance of this difference.

2. Compare the performance statistics from cross-validation and out-of-sample testing. What do these numbers tell you? Do you believe that these provide evidence that using (say) 2001 data to train a model that is evaluated on (say) 1993 frauds is problematic?

3. Bao et al. (2022) report an AUC in the test sample for RUSBoost of 0.7228 (see Panel B of Table 1). What differences are there between our approach here and those in Bao et al. (2020) and Bao et al. (2022) that might account for the differences in AUC?

4. Bao et al. (2022) report an AUC in the test sample for an all-features logit model of 0.6842 (see Panel B of Table 1). Calculate an equivalent value for AUC from applying logistic regression to data from above. You may find it helpful to adapt the code above for RUSBoost in the following steps:

1. Create a function train_logit() analogous to train_model() above, but using glm() in place of rusboost().
2. Create a function test_logit() analogous to test_model() above, but using calls to predict() like those used above for logistic regression.
3. Create a function fit_test_logit() analogous to fit_test() above, but calling the functions you created above.
4. Use map() or future_map() with test_years and fit_test_logit() and store the results in results_logit.
5. Tabulate results using table_results(results_logit). (Table 26.5 below contains our results from following these steps.)
5. Do the AUC results obtained from logit (from the answer to the previous question or Table 26.5 below) surprise you given the cross-validation and in-sample AUCs calculated (in exercises) above? How does the average AUC compare with the 0.6842 of Bao et al. (2022)? What might account for the difference?

6. Provide an intuitive explanation of the out-of-sample (test) NDCG@k results for the RUSBoost model. (Imagine you are trying to explain these to a user of prediction models like those discussed earlier in this chapter.) Do you believe that these results are strong, weak, or somewhere in between?

7. We followed Bao et al. (2020) in using gap = 2 (i.e., training data for fiscal years up to two years before the test year), but discussed the trade-offs in using a shorter gap period (more data, but more measurement error). How would you evaluate alternative gap periods?

8. Which results are emphasized in Bao et al. (2020)? What are the equivalent values in Bao et al. (2022)? Which results are emphasized in Bao et al. (2022)?

9. Which audience (e.g., practitioners, researchers) would be most interested in the results of Bao et al. (2020) and Bao et al. (2022)? What limitations of Bao et al. (2020) and Bao et al. (2022) might affect the interest of this audience?

There are many excellent books for learning more about statistical learning and prediction. Kuhn and Johnson (2013) provides excellent coverage and intuitive explanations. Hastie et al. (2013) is widely regarded as the standard text. James et al. (2022) provides a more accessible introduction.

1. Obviously, some information is lost in recoding a variable in this way and this may not be helpful in many decision contexts.↩︎

2. In many respects, the specific representation does not matter. However a number of statistical learning algorithms will be described presuming one representation or another, in which case care is needed to represent the variable in the manner presumed.↩︎

3. See the SEC website for more information: https://www.sec.gov/divisions/enforce/friactions.htm.↩︎

4. Use Enron’s GVKEY 006127 to find this: aaer_firm_year |> filter(gvkey == "006127").↩︎

5. The code used to create this table can be found in source code for the farr package: https://github.com/iangow/farr/blob/main/data-raw/get_aaer_dates.R.↩︎

6. Ironically, given the attention we are paying to the distribution of AAER dates, we are ignoring the variation in the time between fiscal period-end and the filing of financial statements. Additionally, we are implicitly assuming that the analyst is constrained to wait until all firms have reported in a fiscal year before applying a prediction model. However, it would be feasible to apply a model to predict accounting fraud as soon as a firm files financial statements, which might be in March 2002 for a firm with a 2001-12-31 year-end. This might be a better approach for some prediction problems because of the need for more timely action than that implied by the “fiscal year completist” approach assumed in the text.↩︎

7. Actually, it may be that some restatements affecting fiscal 2001 have already been revealed and these could be used to fit a model. But we assume there would be few such cases and thus focus on fiscal 2000 as the latest plausible year for training. As we will see, measurement error seems sufficiently high for fiscal 2000 that using data from fiscal 2001 for training is unlikely to be helpful.↩︎

8. It might be suggested that we shouldn’t wait for an AAER to be released before encoding an AAER as 1. For example, we don’t need to wait until 2003-07-28 to code the affected Enron firm-years as misstated. But we argue that this would embed one prediction problem inside another (that is, we’d be predicting AAERs using general knowledge to use them in predicting AAERs using a prediction model). Additionally, it was far from guaranteed that there would be an AAER for Enron and in less extreme cases, it would be even less clear that there would be an AAER. So some predicted AAERs that we’d code would not eventuate. We argue that the only reasonable approach is to assume that the prediction models are trained only with confirmed AAERs at the time they are trained.↩︎

9. We put truth in “scare quotes” because even misstate_latent has issues as a measure of AAERs, because some AAERs may emerge after the sample is constructed by the researcher, which in turn may be some years after the test period.↩︎

10. It is important to note that our rationale for doing this differs from that of Bao et al. (2020). Bao et al. (2020) exclude AAERs affecting fiscal years after 1999 on the basis that these would not be known when hypothetically testing their models; for this purpose we use data on AAER dates that Bao et al. (2020) did not collect. Our rationale for excluding AAERs affecting fiscal year 2000 and released prior to the test date is one about measurement error.↩︎

11. There are problems here in terms of having a well-formed prediction problem if the prediction model itself changes the set of future AAERs due to changes in SEC behaviour. But we ignore these here.↩︎

12. Of course, the analysis we do below is limited to publicly available information, so perhaps not a good model of the prediction problem that is actually faced by the SEC, which presumably has access to private information related to future AAERs.↩︎

13. As a German company not listed in the United States, Wirecard would likely not be the subject of an AAER, but presumably there is some rough equivalent in the German context.↩︎

14. See https://gist.github.com/iangow/092d54d3b19d5d0d6bd4ddc406d8cee4 for details.↩︎

15. The Wirecard fraud—including Earl’s role in trying to uncover it—is the subject of a Netflix documentary: https://www.netflix.com/watch/81404807.↩︎

16. Note that the word sample is frequently used in the statistical learning literature to refer to a single observation, a term that may seem confusing to researchers used to thinking of a sample as a set of observations.↩︎

17. This idea is also related to p-hacking as discussed in Chapter 19, whereby a hypothesis that appears to apply in an estimation sample does not have predictive power in a new sample.↩︎

18. Note that we are relying on the selection of the meta-parameter to provide some degree of protection against this kind of overfitting, but it may be that the models will gravitate to the features that allow them to overfit “serial frauds” no matter the meta-parameter values. In such a case, we could still rely on cross-validation performance as a reliable indicator of the resultant weakness of the models we train.↩︎

19. Chapter 19 of Wickham (2019) has more details on this “unquote” operator !!.↩︎

20. The metaphor is not perfect in that leaves are generally found at the bottom of trees.↩︎

21. The meaning of these parameters is deferred to the discussion questions below.↩︎

22. See Chapters 5 and 11 of Kuhn and Johnson (2013) for more details on performance metrics for regression problems and classification problems, respectively.↩︎

23. See Kuhn and Johnson (2013, p. 123) for more discussion of penalized models.↩︎

24. If $$\lambda < 0$$, the model would not be penalizing models, but “rewarding” them.↩︎

25. The knowledgeable reader might note that logit is estimated by maximum likelihood estimation, but the basic idea of penalization is easily extended to such approaches, as discussed by Kuhn and Johnson (2013, p. 303).↩︎

26. A popular alternative is to use the largest value of $$\lambda$$ that yields AUC within one standard error of the maximum AUC. This value is stored in fm_lasso_cv\$lambda.1se.↩︎

27. What this averaging means is straightforward when the models are regression models. With classification models, a common approach is to take the classification that has the most “votes”; if more models predict “hot dog” than predict “not hot dog” for an observation, then “hot dog” would be the prediction for the ensemble.↩︎

28. For more on boosting, see pp. 389–392 of Kuhn and Johnson (2013).↩︎

29. We modify the algorithm slightly to include a learning-rate parameter, $$L$$ in step 2(d).↩︎

30. In writing this chapter, we actually considered a much wider range of meta-parameter values, but we limit the grid of values here because of computational cost.↩︎

31. As discussed above, many of the apparent non-fraud observations will later be revealed to be frauds.↩︎

32. It also sets the seed of the random-number generator, but this likely has little impact on the results apart from replicability.↩︎