# 28 Prediction

This chapter focuses on prediction using approaches that fall under the broad heading of statistical learning, also known as machine learning. We begin by discussing what prediction means and how it can be used in research. We then explore methods and concepts related to prediction using the setting of accounting fraud prediction examined in Bao et al. (2020).

In this chapter, we will use a number of packages, which can be loaded into R using the following code. For instructions on how to set up your computer to use the code found in this book, see Chapter 1.2.1.

library(readr)
library(rpart)
library(farr)
library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(tidyr)
library(ggplot2)
library(glmnet)
library(parallel)

Some of the code in this chapter—especially in Section 28.6—takes a long time to run, even with multiple cores. To make it easier for readers to run just the code they are interested in, here we provide some guidance regarding how code from one section depends on code in other sections.

Running the code in Section 28.2 is a prerequisite for running code in each of Sections 28.3. 28.5, and 28.6, each of which can be run independently of the others. While some results from Section 28.6 are used in Section 28.7, we have embedded the relevant results from Section 28.6 in the code for Section 28.7, so that from a reader’s perspective the latter section only depends on results of code in Section 28.2. (There is no code in Sections 28.1 and 28.4.)

## 28.1 Prediction in research

Prediction can be taken to mean the activity of using known information to estimate or predict unknown events or values. In ordinary usage, prediction suggests that the unknown events or values are in the future, but often prediction is of pre-existing facts which may only become known at a later date (e.g., we may use prediction models to diagnose a disease or to detect misstatements in observed financial statements).

In this chapter we will cover a number of statistical approaches to prediction. Prediction is perhaps the most active area of research in statistics and the methods we cover in this chapter fall under the umbrella of statistical learning.

It seems natural to ask why social science researchers would be interested in prediction. As discussed in Chapter 5, research papers can often be distinguished by their research objectives, which might include prediction, measurement, description, and causal inference.

When prediction is part of the research question, the methods covered in this chapter are clearly relevant. A researcher might be interested in prediction not for its own sake, but to understand how information might be used in practice. For example, Bernard and Thomas (1989) were not interested in predicting post–earnings announcement returns so much as providing evidence of the existence of a predictable element of this as evidence of market efficiency.

Prediction methods are also often used in measurement. For example, papers have developed models for predicting events such as bankruptcy and earnings restatement and these models have often been used in subsequent research as measures of, say, bankruptcy risk.

Finally, an emerging area of research examines the usefulness of prediction for causal inference. While prediction is at the heart of causal inference in that a causal model has predictive value, it has been less clear whether predictive models developed using the methods provided in this chapter provide insights into the existence of causal relations.

Beyond research, prediction is also fundamental to financial accounting in many settings. For example, the provision for loan losses is a prediction of future write-offs. Researchers might seek to understand accounting in such settings as prediction problems and might even simulate such processes using statistical learning approaches like those studied in this chapter.

And beyond financial accounting, prediction models have innumerable uses in practice, from identifying anomalies in accounting records to making movie recommendations to Netflix customers.

## 28.2 Predicting accounting fraud

To help the reader understand the issues and approaches used in statistical learning, we focus in this chapter on the setting of Bao et al. (2020), which focuses on predicting accounting fraud in publicly traded US firms during the period 2003–2008.

### 28.2.1 Features

Any prediction problem involves an outcome that we want to predict based on a set of features. Features (also known as predictors) refer to the variables or data that we will use to make predictions.

In this chapter, we follow Bao et al. (2020) in using “28 raw financial data items” drawn from Compustat are our features. While not described in the paper, from the PDF version of the SAS code provided by the authors, four variables (ivao, pstk, ivst, txp) are replaced with zeros if missing. The SAS code also includes the equivalent of filter(!is.na(at), !is.na(prcc_f)), but given that the equivalent of the na.omit() is applied later (as we do at the end of our code here), we have omitted this filter.

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

funda <- tbl(pg, sql("SELECT * FROM comp.funda"))

X_vars <- c('act', 'ap', 'at', 'ceq', 'che', 'cogs', 'csho', 'dlc',
'dltis', 'dltt', 'dp', 'ib', 'invt', 'ivao', 'ivst',
'lct', 'lt', 'ni', 'ppegt', 'pstk', 're', 'rect', 'sale',
'sstk', 'txp', 'txt', 'xint', 'prcc_f')

y_var <- "misstate"

features_all <-
funda %>%
filter(indfmt == "INDL", datafmt == "STD",
consol == "C", popsrc == "D") %>%
filter(fyear >= 1991, fyear <= 2008) %>%
mutate(across(c(ivao, pstk, ivst, txp), ~ coalesce(., 0))) %>%

features <-
features_all %>%
collect() %>%
na.omit()

While the features that are used in building a prediction model and even how they are encoded can “have a significant impact on model performance” , we only consider the features used by Bao et al. (2020) in our analysis here and discuss issues related to features later in the chapter.

### 28.2.2 Outcome

An outcome might be measured as a continuous variable (e.g., life expectancy) or it might be a categorical variable (e.g., “no restatement”, “minor restatement”, or “major restatement”). A categorical variable will often naturally be a binary variable taking one of just two values, but in principle any variable code can be coded as a binary variable (e.g., “hot dog” or “not hot dog”).196

In this chapter, we follow Bao et al. (2020) and focus on a single binary categorical variable taking values of “no fraud” and “fraud”, which might be represented using $$\{0, 1\}$$ (i.e., an indicator variable), but in statistical learning one often sees such variables represented using $$\{-1, 1\}$$.197 Bao et al. (2020) identify frauds based on the existence of an Accounting and Auditing Enforcement Releases (AAERs) released by the SEC affecting the firm.198 While the focus on binary variables might seem limiting, it facilitates our presentation of some core ideas of statistical learning within the confines of a single chapter.

Data on the outcome are found in the data set aaer_firm_year supplied with the farr package.

aaer_firm_year
## # A tibble: 415 × 4
##    p_aaer gvkey  min_year max_year
##    <chr>  <chr>     <int>    <int>
##  1 1033   021110     1992     1992
##  2 1037   008496     1992     1993
##  3 1044   028140     1993     1993
##  4 1047   012455     1994     1994
##  5 1053   025927     1993     1993
##  6 1067   023126     1993     1993
##  7 1071   024712     1992     1993
##  8 1104   028450     1994     1995
##  9 1106   002711     1992     1993
## 10 1110   010185     1994     1995
## # … with 405 more rows

As can be seen above, for each AAER (identified by p_aaer), aaer_firm_year contains information on the GVKEY of the affected firm and the first and last fiscal years affected by accounting fraud (min_year and max_year, respectively).

aaer_long <-
aaer_firm_year %>%
rowwise() %>%
mutate(fyear = list(seq(min_year, max_year, by = 1))) %>%
unnest(fyear) %>%
select(gvkey, p_aaer, fyear, min_year, max_year) 

In practice, careful thinking about outcome measures is required. The outcome selected and the approach to measuring it are likely to depend on the purpose for which prediction is being made. While Bao et al. (2020) aim to predict “accounting fraud”, it is important to note that the variable p_aaer indicates an Accounting and Auditing Enforcement Release (AAER) by the SEC. But accounting fraud might not result in AAERs, either because it is never detected, or because it is detected but does not rise to the level that leads to an AAER, or even because the fraud is so profound that an AAER is somewhat irrelevant.

With regard to the last category, consider Enron, the public company at the heart of one of most notorious cases of accounting fraud this century. While the CEO (Jeffrey Skilling) and CFO (Andrew Fastow) of Enron ended up serving time in prison, there is no AAER related to either Skilling or Fastow (many AAERs relate to individuals). There is not even an AAER directed at Enron, perhaps because it entered bankruptcy shortly after fraud was detected. The one AAER in the Bao et al. (2020) sample related to Enron has p_aaer of "1821".199 However, this AAER related to the order for Citigroup to pay a amount in a settlement arising because “Citigroup assisted [Enron and Dynegy] in enhancing artificially their financial presentations through a series of complex structured transactions … to allow those companies to report proceeds of financings as cash from operating activities”.

Even if we accept that AAERs are the measure of interest, there are several wrinkles related to timing. To see the issues here, consider the AAER with p_aaer equal to 1500, which affected fiscal years 1998, 1999, and 2000.

aaer_firm_year %>%
filter(p_aaer == "1500")
## # A tibble: 1 × 4
##   p_aaer gvkey  min_year max_year
##   <chr>  <chr>     <int>    <int>
## 1 1500   028958     1998     2000

Suppose we were an analyst and had data on features up to the end of fiscal 1999 and were looking to train a model that could be used to predict future AAERs related to fiscal 1999. This would place us somewhere in 2000 (fiscal 1999 includes firms with fiscal years ending as late as 2000-05-31) and the largest meaningful model-building data set could include fiscal years up to 1998. Sitting in 2018 or 2022, we might be tempted to code p_aaer as 1 for the firm-year with gvkey of 028958 and fyear of 1998, but it turns out that AAERs are only disclosed after the end of the last fiscal year that they affect and this would mean that the p_aaer == '1500' AAER would not be observed by our hypothetical analyst in 2000, who would instead see misstate as 0 for that observation. To simulate the approach taken by our hypothetical analyst, we need to code misstate for this case as 0.

At some point it would be reasonable to assume that the analyst can see the AAER just as we can and use it in training a model, but the question is “when?”. Bao et al. (2020, p. 209) “require a gap of 24 months between the financial results announcement of the last training year and the results announcement of a test year … because Dyck et al. (2010) find that it takes approximately 24 months, on average, for the initial disclosure of the fraud.” For the p_aaer == '1500'AAER, Bao et al. (2020)’s approach would mean that the hypothetical analyst could observe the AAER by the end of fiscal 2002 (two years after the last affected period of 2000).

While Bao et al. (2020) relied on the broad observation of Dyck et al. (2010), data on AAER dates are easily obtained from the SEC’s website for AAERs since late 1999 and included in the data set aaer_dates made available with the farr package.200

head(aaer_dates %>% select(aaer_num, aaer_date))
## # A tibble: 6 × 2
##   aaer_num  aaer_date
##   <chr>     <date>
## 1 AAER-1212 1999-12-14
## 2 AAER-1211 1999-12-03
## 3 AAER-1210 1999-11-19
## 4 AAER-1209 1999-11-16
## 5 AAER-1208 1999-11-15
## 6 AAER-1206 1999-11-10

Using these data we can examine the actual distribution of the time between the end of the latest affected fiscal year and the AAER being released by the SEC.

aaer_days <-
aaer_dates %>%
mutate(p_aaer = gsub("^AAER-", "", aaer_num)) %>%
inner_join(aaer_firm_year, by = "p_aaer") %>%
mutate(fyear = max_year) %>%
inner_join(features, by = c("gvkey", "fyear")) %>%
mutate(days_to_aaer = as.integer(aaer_date - datadate)) %>%
filter(!is.na(days_to_aaer)) 

The plot below reveals a lot of variation that is ignored by a blanket “24 months” assumption As can be seen in the plot, there is a great deal of variation from that assumption (represented by the red dashed vertical line) and most AAERs are released more than two years after the end of the last fiscal year (median is 1350, which is about 3.7 years).

aaer_days %>%
ggplot(aes(x = days_to_aaer)) +
geom_histogram(binwidth = 30) +
geom_vline(xintercept = 365 * 2, linetype = "dashed", color = "red")

How might we use these data? It would seem appropriate to imagine our analyst looking to apply a model to a fiscal year where that model has been trained on features drawn from prior fiscal years and AAERs revealed to date. For example, if our analyst were now looking to apply a model to data from fiscal 2001, she would be situated somewhere in late 2002 (fiscal 2001 includes firms with fiscal years ending as late as 2002-05-31) and thus would only see AAERs released prior to that time. Assuming the conventional four months for financial data to be released, we might set a “test date” of 2002-09-30 for fiscal 2001, so AAERs with an aaer_date prior to 2002-09-30 would be observable to the analyst.201

For concreteness, let’s assume a “test date” of 2002-09-30 and construct a data set of AAERs. In this data set, we measure misstate, which is assumed to be observable to the analyst on test_date, and misstate_latent, which includes accounting fraud observable to the researcher, but not to the analyst.

get_aaers <- function(test_date, include_latent = FALSE) {
response = within_sample$misstate) ## response ## predicted FALSE TRUE ## FALSE 66963 201 ## TRUE 4 8 Here we see 8 true positive cases, 66963 true negative cases, 201 false negative cases, and 4 false positive cases. ### 28.3.3 Exercises 1. One claimed benefit of classification trees is their ease of interpretation. Can you provide an intuitive explanation for the fitted tree above? If so, outline your explanation. If not, what challenges prevent you from doing so? 2. Use 5-fold cross validation to assess the performance of the classification tree approach using the parameters above. 3. What do the parameters cp and minbucket represent? 4. How does the classification tree change if you increase the parameter cp? 5. Use 5-fold cross validation to assess the performance of the classification tree approach with three different parameters for minbucket (the one used above and two others). ## 28.4 Performance metrics It is intuitively clear that the performance of the models is not particularly impressive, but in general we need ways to measure model performance more formally and precisely. A number of evaluation metrics are available to measure the performance of prediction models. In this section, we discuss some of these metrics with a focus on two-class classification problems.216 ### 28.4.1 Accuracy One standard classification performance metric is accuracy, defined as $$\frac{TP+TN}{TP+FN+FP+TN}$$, where $$TP$$ (true positive) is (in our current context) the number of fraudulent firm-years that are correctly classified as fraud; $$FN$$ (false negative) is the number of fraudulent firm-years that are misclassified as non-fraud; $$TN$$ (true negative) is the number of non-fraudulent firm-years that are correctly classified as non-fraud; and $$FP$$ (false positive) is the number of non-fraudulent firm-years that are misclassified as fraud. Unfortunately, accuracy is a not a very useful metric in this setting because of the severe class imbalance. With fewer than 1% of firms in a typical year being affected by accounting fraud, simply classifying all firm-years as non-fraud would have accuracy of more than 99%. Two other statistics that are often considered are sensitivity (also known as the true positive rate, or TPR) and specificity (also known as the **true negative rate* or TNR) $\textit{Sensitivity} = \textit{TPR} = \frac{TP}{TP + FN}$ $\textit{Specificity} = \textit{TNR} = \frac{TN}{TN + FP}$ Often we focus on the complement to specificity, which is know as the false positive rate (FPR) and calculated as follows: $1 - \textit{Specificity} = \textit{FPR} = 1 - \frac{TN}{TN + FP}= \frac{FP}{TN + FP}$ ### 28.4.2 AUC There is an inherent trade-off between sensitivity and specificity. A classification model will typically return something like an estimated probability that an observation is a positive case. We can choose a cut-off probability and classify all observations with an estimated probability at or above the cut-off as a positive case, and classify all observations with estimated probabilities below the cut-off as negative cases. Starting at a cut-off of zero, every observation would be classified as a positive and the TPR would be 1, but the FPR would also be 1. As the cut-off increases, both the TPR and the FPR will decrease. If we collect the pairs of values of TPR and FPR as we increase the cut-off, we can plot the TPR against the FPR to form what is called the response operator curve (or ROC). In terms of the ROC, the worst-case model just randomly assigns probabilities to each observation. So with a cut-off of 0.25, we’d classify 75% of observations as positive and 25% as negative. In expectation, we’d have 75% of the positive cases classified as positive, and 75% of the negative cases classified as positive, so the TPR and FPR values would be equal in expectation. This would yield an ROC close to the 45-degree line. The best-case model would be a perfect classifier that yielded zero FNR and 100% TPR. This would yield an ROC that rose vertically along the $$y$$-axis and horizontally at $$y = 100\%$$. This suggests the area under the ROC curve (or AUC) as a measure of the quality of a model, with values of $$0.5$$ for the random-guess model and $$1$$ for the perfect classifier. Below, we follow Bao et al. (2020) (and many others) in using AUC as the performance measure we use to evaluate models. ### 28.4.3 NDCG@k Bao et al. (2020) also consider an alternative performance evaluation metric called the normalized discounted cumulative gain at position $$k$$ (NDCG@k). NDCG@k metric can be motivated using the idea of budget constraints. Suppose that a regulator (or other party interesting in unearthing frauds) has a limited budget and can only investigate $$k$$ frauds in a given year. Denoting the number of actual frauds in a year as $$j$$, if $$j \geq k$$, then the best that the regulator can do is to investigate $$k$$ cases, all of which are frauds. If $$j < k$$, then the regulator can do no better than investigating $$k$$ cases with all $$j$$ frauds being among those investigated. The assumption is that the regulator investigates the $$k$$ cases with the highest estimated probability of being frauds (scores). This suggests a performance measure that scores a model higher the greater the number of true frauds ranked in the top $$k$$ scores. The measure NDCG@k is such a measure. NDCG@k also awards more “points” the higher in the list (based on scores) each true fraud is placed. This is implemented by multiplying each correct guess by $$1/\log_2 (i + 1)$$. Formally, the discounted cumulative gain at position $$k$$ is defined as follows: $\textrm{DCG@k} = \sum_{i=1}^k \textit{rel}_i/\log_2 (i + 1)$ The measure NDCG@k is DCG@k normalized by the the DCG@k value obtained when all the true instances of fraud are ranked at the top of the ranking list. Hence, the values of NDCG@k are bounded between $$0$$ and $$1$$, and a higher value represents better performance. Bao et al. (2020) select $$k$$ so that the number of firm-years in the ranking list represents 1% of all the firm-years in a test year. Bao et al. (2020) “select a cut-off of 1% because the average frequency of accounting fraud punished by the SEC’s AAERs is typically less than 1% of all firms in a year.” Bao et al. (2020) report modified versions of sensitivity and specificity: $$\text{Sensitivity@k} = \frac{TP@k}{TP@k + FN@k}$$ and $$\text{Precision@k} = \frac{TP@k}{TP@k + FP@k}$$ where TP@k, FN@k and FP@k are analogous to TP, FN and FP, but based on classifying as frauds only the $$k\%$$ of observations with the highest scores. ## 28.5 Penalized models Having a number of ways to measure model performance alone does not address the basic problem of poor model performance using the two fundamental approaches above. In this section and the next, we describe two strategies that can be used to improve model performance. The first approach uses penalized models. Perhaps the easiest way to understand penalized models is to recognize that many common modelling approaches seek to minimize the sum of squared errors $\textrm{SSE} = \sum_{i = 1}^{n} (y_i - \hat{y}_i )^2.$ With penalized models, the objective is modified by adding a penalty term. For example, in ridge regression the following objective function is minimized.217: $\textrm{SSE}_{L_2} = \sum_{i = 1}^{n} (y_i - \hat{y}_i )^2 + \lambda \sum_{j=1}^{P} \beta_j^2.$ An alternative approach is least absolute shrinkage and selection operator (or lasso) model, which uses the following objective function: $\textrm{SSE}_{L_1} = \sum_{i = 1}^{n} (y_i - \hat{y}_i )^2 + \lambda \sum_{j=1}^{P} \left|\beta_j \right|.$ While both ridge regression and lasso will shrink coefficients towards zero (hence the alternative term shrinkage methods), some coefficients will be set to zero by lasso for some value of $$\lambda$$. In effect, lasso performs both shrinkage and feature selection. A third shrinkage method worth mentioning is elastic net , which includes the penalties from both ridge regression and lasso $\textrm{SSE}_{\textrm{ENet}} = \sum_{i = 1}^{n} (y_i - \hat{y}_i )^2 + \lambda_1 \sum_{j=1}^{P} \beta_j^2 + \lambda_2 \sum_{j=1}^{P} \left|\beta_j \right|.$ One aspect of these penalized objective functions that we have not discussed is $$\lambda$$. Implicitly, $$\lambda > 0$$ (if $$\lambda = 0$$, then the penalty term is zero), but we have said nothing about how this parameter is selected.218 Instead, $$\lambda$$ is a meta-parameter of the kind alluded to above, as it is not estimated directly as part of the estimation process. A common approach to setting meta-parameters like $$\lambda$$ is cross-validation. With cross-validation, a number of candidate values of $$\lambda$$ are considered. For each candidate value and each cross-validation fold, the model is estimated omitting the data in the fold. The performance of the model is evaluated using predicted values for the cross-validation folds (e.g., the average mean-squared error, or MSE, across all folds). Finally, the value $$\lambda$$ that maximizes performance (e.g., minimizes MSE) is chosen as the value to be used when applying the model to new data (e.g., test data when applicable). To illustrate, we use the lasso analogue to the logit regression we estimated above.219 For this purpose, we use the glmnet library and specify alpha = 1 to focus on the lasso model. The glmnet library contains built-in functionality for cross-validation. We simply need to specify the performance metric to use in evaluating models (here we set type.measure = "auc" to use AUC) and—because we constructed our folds to put all observations on a given firm in the same fold—the foldid for each observation dft <- data_train %>% inner_join(sample_splits, by = "gvkey") fm_lasso_cv <- cv.glmnet(x = as.matrix(dft[X_vars]), y = dft[[y_var]], family = "binomial", alpha = 1, type.measure = "auc", foldid = dft[["fold"]], keep = TRUE) The glmnet function will automatically consider a range of $$\lambda$$ values. Note that the scale of the features matters with penalized regression, as the penalties increase with the magnitude of the coefficients, which in turn depends on the scale of the variables. For example, a logit coefficient on at expressed in millions of dollars of $$0.123$$ is equivalent to a coefficient of $$123$$ on at expressed in thousands of dollars. Thus glmnet will by default scale the features (see ? glmnet for additional details). tibble(lambda = fm_lasso_cv$lambda, auc = fm_lasso_cv$cvm) %>% ggplot(aes(x = lambda, y = auc)) + geom_line() Here we see that AUC is maximized with $$\lambda \approx 1.2e-05$$. Predicted values for each value of $$\lambda$$ considered are stored in fit.preval of fm_lasso_cv as log-odds ratios. We can convert these to probabilities using the function $$f(x) = \frac{\exp(x)}{1+\exp(x)}$$. Here we select the value of $$\lambda$$ that maximizes AUC (this is stored in fm_lasso_cv$lambda.min).220

idmin <- match(fm_lasso_cv$lambda.min, fm_lasso_cv$lambda)

fit_lasso_cv <-
dft %>%
select(misstate) %>%
mutate(logodds = fm_lasso_cv$fit.preval[, idmin], prob = exp(logodds)/(1 + exp(logodds)), predicted = logodds > 0) From this we can calculate the confusion matrix based on the models fitted using cross-validation. table(predicted = fit_lasso_cv$predicted,
response = fit_lasso_cv$misstate) ## response ## predicted FALSE TRUE ## FALSE 66956 209 ## TRUE 11 0 We can produce the fitted model based on our chosen $$\lambda$$ by estimating a single model for the entire training data set. We store this model in fm_lasso. fm_lasso <- glmnet(x = as.matrix(data_train[X_vars]), y = data_train[[y_var]], family = "binomial", lambda = fm_lasso_cv$lambda.min,
alpha = 1)

1. What features are selected by the lasso model in this case? (Hint: You may find it useful to inspect fm_lasso$beta[, 1].) 2. Modify the code above to estimate a ridge regression model. (Hint: Set alpha = 0.) 3. Describe how would you estimate an elastic net model. (Hint: Discussion here might be useful. One approach would be to put some of the steps above in a function that accepts alpha as an argument and then estimate over different values of $$\alpha$$.) 4. Calculate AUC and produce the confusion matrix for fm_lasso applied to data_train (i.e., in sample). Interpret these results. (Hint: You might use this code: predict(fm_lasso, newx = as.matrix(data_train[X_vars])) and the auc function supplied with the farr package.) 5. Calculate the AUC for the data stored in within_sample above (this is for the tree stored in fm2). ## 28.6 Ensemble methods A second approach to addressing the weaknesses of building models directly from the building blocks uses ensemble methods, which are methods that combine the predictions from many models to generate a single prediction for each observation. The ensemble methods we cover here are all based on trees. One of the earliest ensemble methods is bagging, which is short for bootstrap aggregation. With bagging a value $$M$$ is selected and for each $$i \in (1, \dots, M)$$ a bootstrap sample of the original data is taken and a model is fit. The prediction for any observation is an average of the predictions of the $$M$$ models.221 One benefit of bagging is that for any given observation, several of the $$M$$ models will not have used that observation in the fitting process, so the predictions from these models are, in a sense, out of sample for that observation. Such observations are termed out-of-bag and provide a natural measure of the predictive performance much like cross-validation measures. One additional aspect of bagging is that there are no dependencies between the fitted models. So one could in principle divide the estimation of the $$M$$ models between $$M$$ computers (or cores) and (assuming sufficient computing resources) generate a prediction almost as quickly as one could from a single model. A refinement of bagging is random forests, which not only uses bootstrap samples of the observations, but also randomly draws a subset of $$k < P$$ features in fitting each model. By randomly selecting features in addition to randomly selecting observations, random forests generate trees with lower correlation than bagging does. This can reduce the risk that the model overfits the sample using certain features and also produces more robust models when features are highly correlated. The final set of ensemble methods we discuss are known as boosting methods. The basic idea of boosting is that each observation has a weight that varies from one iteration to the next: observations that are incorrectly classified in iteration $$k$$ receive more weight in iteration $$k + 1$$. In this way, the observations that are difficult to classify can receive increasingly larger weight until the algorithm is able to identify a model that classifies them correctly. Below we describe two boosting algorithms in more detail: AdaBoost—which was “the first practical implementation of boosting theory” —and RUSBoost—which was introduced to accounting research by Bao et al. (2020).222 Because the outcome of iteration $$k$$ affects the model fit in iteration $$k + 1$$, boosting is not as easily implemented in parallel as bagging or random forests. ### 28.6.1 AdaBoost Boosting is perhaps best understood in the context of an actual algorithm implementing it. Hastie et al. (2013) outline the AdaBoost algorithm as follows: 1. Initialize the observation weights $$w_i = \frac{1}{N}$$, $$i = 1,2,\dots,N$$. 2. For $$m = 1 \to M$$: 1. Fit a classifier $$G_m(x)$$ to the training data using weights $$w_i$$. 2. Compute $\mathrm{err}_m = \frac{\sum_{i=1}^N w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$ 3. Compute $$\alpha_m = \log((1 - \mathrm{err}_m)/\mathrm{err})$$. 4. Set $$w_i \leftarrow w_i \exp\left[\alpha_m I(y_i \neq G_m(x_i)) \right]$$. 3. Output $$G(x) = \mathrm{sign} \left[\sum_{m=1}^M \alpha_m G_m (x) \right]$$ A couple of observations on this algorithm are in order. First, step 3 implicitly assumes that the classes are coded as either $$+1$$ or $$-1$$ so that the sign indicates the class. Second, we have not described what classifier is represented by $$G_m(x)$$. We provide an implementation of AdaBoost in the farr package. Because the RUSBoost algorithm we discuss below is a variation on AdaBoost, we use one function (rusboost) to cover both AdaBoost (rus = FALSE) and RUSBoost (rus = TRUE). We use the rpart function discussed above as the basic building block for rusboost (i.e., $$G_m(x)$$). This means any arguments that can be passed to rpart via the control argument can be used with rusboost. In addition, a user of rusboost can specify the number of iterations ($$M$$) using the size argument and a learning rate ($$L \in (0, 1]$$) using the learn_rate argument. The learning rate modifies step 2(c) of the AdaBoost algorithm above to: 1. Compute $$\alpha_m = L \log((1 - \mathrm{err}_m)/\mathrm{err})$$. (Hence the default learn_rate = 1 implements the algorithm above.) We make a convenience function fit_rus_model that allows us to set maxdepth and minbucket values passed to rpart directly. fit_rus_model <- function(df, size = 30, rus = TRUE, learn_rate = 1, maxdepth = NULL, minbucket = NULL, ir = 1) { if (!is.null(maxdepth)) control <- rpart.control(maxdepth = maxdepth) if (!is.null(minbucket)) control <- rpart.control(minbucket = minbucket) fm <- rusboost(formula, df, size = size, ir = ir, learn_rate = learn_rate, rus = rus, control = control) return(fm) } We next create a convenience function for calculating fitted probabilities and predicted classes for data in a fold. rus_predict <- function(fold, ...) { dft <- data_train %>% inner_join(sample_splits, by = "gvkey") %>% mutate(misstate = as.integer(misstate)) fm <- dft %>% filter(fold != !!fold) %>% fit_rus_model(...) res <- dft %>% filter(fold == !!fold) %>% mutate(prob = predict(fm, ., type = "prob"), predicted = predict(fm, ., type = "class")) %>% select(gvkey, fyear, prob, predicted) res } We next create a function that fits models by fold and returns the resulting cross-validated AUC. Note that we use mclapply from the parallel library here; this may not work on Windows (if so, replace mclapply with lapply and delete mc.cores = 8). get_auc <- function(...) { set.seed(2021) rus_fit <- bind_rows(mclapply(folds, rus_predict, mc.cores = 8, ...)) %>% inner_join(data_train, by = c("gvkey", "fyear")) %>% select(gvkey, fyear, predicted, prob, misstate) auc(score = rus_fit$prob, response = as.integer(rus_fit$misstate)) } The meta-parameters that we evaluated as size ($$M$$, the number of iterations), learn_rate ($$L$$, the learning rate), and maxdepth (the maximum depth of trees). params <- expand_grid(size = c(30, 50), learn_rate = c(0.1, 0.5, 1), maxdepth = c(3, 5, 7)) We can now use our get_auc function to evaluate the model performance for each of the possible parameter values in params. adaboost_results <- params %>% mutate(auc = unlist(Map(get_auc, rus = FALSE, size = size, learn_rate = learn_rate, maxdepth = maxdepth))) ## Warning: There were 18 warnings in mutate(). ## The first warning was: ## ℹ In argument: auc = unlist(...). ## Caused by warning in inner_join(): ## ! Each row in x is expected to match at most 1 row in y. ## ℹ Row 58 of x matches multiple rows. ## ℹ If multiple matches are expected, set multiple = "all" to silence this ## warning. ## ℹ Run dplyr::last_dplyr_warnings() to see the 17 remaining warnings. Here are the first few rows of the results. adaboost_results %>% arrange(desc(auc)) %>% head() %>% knitr::kable() size learn_rate maxdepth auc 30 0.1 7 0.6662838 50 0.1 7 0.6503775 50 0.1 5 0.6447974 30 0.1 5 0.6424609 50 0.5 7 0.6415961 50 0.1 3 0.6267669 We store the row (set of parameter values) that maximizes AUC in optim_ada. optim_ada <- adaboost_results %>% filter(auc == max(auc)) ### 28.6.2 RUSBoost RUSBoost is a variant of AdaBoost that makes use of random undersampling (RUS) to address the problem of class imbalance learning . When there are very few positive cases (i.e., fraud firm-years) relative to negative cases, RUSBoost allows a model to train using samples with substantially more positive cases than found in the broader population. Given the relatively small number of fraud firm-years, the setting of Bao et al. (2020) seems to be a good one for RUSBoost. Seiffert et al. (2008) lay out the RUSBoost algorithm as follows:223 1. Initialize the observation weights $$w_i = \frac{1}{N}$$, $$i = 1,2,\dots, N$$. 2. For $$m = 1 \to M$$: 1. Create temporary training dataset $$S^{'}_t$$ with distribution $$D^{'}_t$$ using random under-sampling 2. Fit a classifier $$G_m(x)$$ to the training data using weights $$w_i$$. 3. Compute $\mathrm{err}_m = \frac{\sum_{i=1}^N w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$ 4. Compute $$\alpha_m = L \log((1 - \mathrm{err}_m)/\mathrm{err})$$. 5. Set $$w_i \leftarrow w_i \exp\left[\alpha_m I(y_i \neq G_m(x_i)) \right]$$. 3. Output $$G(x) = \mathrm{sign} \left[\sum_{m=1}^M \alpha_m G_m (x) \right]$$ In each iteration, the RUS algorithm samples from the training data differently for fraudulent and non-fraudulent firm-years. RUSBoost requires setting the ratio between the number of undersampled majority class observations (i.e., non-fraud) and the number of minority class observations (i.e., fraud). The rusboost function allows the user to specify this ratio using the ir argument. Bao et al. (2020) estimate RUSBoost models using a 1:1 ratio (i.e., ir = 1), which means each sample they used in estimation contained the same number of fraud and non-fraudulent observations. Fraudulent firm-years—which are much rarer—are randomly sampled with replacement up to their number in the training sample (e.g., if there are 25 fraudulent observations in the training data, 25 such observations will be in the training sample used). Non-fraudulent firm-years—which are more numerous—are randomly sampled without replacement up to a number that is typically expressed as a multiple of the number of fraudulent observations in the training sample (e.g., if there are 25 fraudulent observations in the training data and ir = 1, 25 such observations will be in the training sample used). In addition to ir = 1, we consider ir = 2 (the latter selecting two non-fraud observations for each fraud observation). We use a different set of meta-parameter values for size, learn_rate, and maxdepth for RUSBoost than we did for AdaBoost, as the relation between meta-parameters and performance differs.224 params_rus <- expand_grid(size = c(30, 100, 200, 300), learn_rate = c(0.01, 0.1, 0.5, 1), maxdepth = c(2, 3, 5, 10, 15), ir = c(1, 2)) Again, we can now use the get_auc function to evaluate the model performance for each of the possible parameter values (now in params_rus). rusboost_results <- params_rus %>% mutate(auc = unlist(Map(get_auc, rus = TRUE, size = size, learn_rate = learn_rate, maxdepth = maxdepth, ir = ir))) ## Warning: There were 160 warnings in mutate(). ## The first warning was: ## ℹ In argument: auc = unlist(...). ## Caused by warning in inner_join(): ## ! Each row in x is expected to match at most 1 row in y. ## ℹ Row 58 of x matches multiple rows. ## ℹ If multiple matches are expected, set multiple = "all" to silence this ## warning. ## ℹ Run dplyr::last_dplyr_warnings() to see the 159 remaining warnings. Here are the first few rows of the results. rusboost_results %>% arrange(desc(auc)) %>% head() %>% knitr::kable() size learn_rate maxdepth ir auc 30 0.01 10 2 0.6943248 300 0.01 15 2 0.6894045 200 0.01 10 2 0.6879628 300 0.01 10 2 0.6867877 200 0.01 15 2 0.6865816 200 0.10 15 1 0.6853092 We store the row that maximizes AUC in optim_rus optim_rus <- rusboost_results %>% filter(auc == max(auc)) Compared with AdaBoost, we see that RUSBoost performance is maximized with larger models (size is 30 rather than 30), deeper trees (maxdepth is 10 rather than 7). We see that both models use a learn_rate less than 1. Before testing our models, it is noteworthy that the AUCs for both AdaBoost and RUSBoost are appreciably higher than those for the logit model, the single tree, and the penalized logit models from above. ### 28.6.3 Discussion questions 1. Provide reasons why the tree-based models outperform the logit-based models in this setting? Do you think this might be related to the set of features considered? How might you alter the features if you wanted to improve the performance of the logit-based models? ## 28.7 Testing models We now consider the out-of-sample performance of our models. Given the apparent superiority of the boosted ensemble approaches (AdaBoost and RUSBoost) over the other approaches, we focus on those two approaches here. We follow Bao et al. (2020) in a number of respects. First, we use fiscal years 2003 through 2008 as our test years. Second, we fit a separate model for each test year using data on AAERs that had been released on the posited test date for each test year. Third, we exclude the year prior to the test year from the model-fitting data set (we do this by using filter(fyear <= test_year - gap) where gap = 2). But note that the rationale provided by Bao et al. (2020) for omitting this year differs from ours: Bao et al. (2020) wanted some assurance that the AAERs were observable when building the test model; in contrast, we simply want to reduce measurement error in the misstate variable when using too-recent a sample.225 Fourth, rather that using gap to get assurance that the AAERs were observable when building the test model, we use data on AAER dates to ensure that the AAERs we use had been revealed by our test dates. For each test_year, we assume that we are fitting a model on 30 September in the year after test_year, so as to allow at least four months for financial statement data to be disclosed (fiscal years for, say, 2001 can end as late as 31 May 2002, so 30 September gives four months for these). test_years <- 2003:2008 test_dates <- tibble(test_year = test_years) %>% mutate(test_date = as.Date(paste0(test_year + 1, "-09-30"))) We now create a function that allows us to pass a set of features, specify the test_year and the parameters to be used in fitting the model (e.g., learn_rate and ir) and which returns a fitted model. train_model <- function(features, test_year, gap = 2, size, rus = TRUE, learn_rate, ir, control) { test_date <- test_dates %>% filter(test_year == !!test_year) %>% select(test_date) %>% pull() aaers <- get_aaers(test_date = test_date) data_train <- aaers %>% right_join(features, by = c("gvkey", "fyear")) %>% filter(fyear <= test_year - gap) %>% mutate(misstate = as.integer(coalesce(misstate, FALSE))) fm <- rusboost(formula, data_train, size = size, learn_rate = learn_rate, rus = rus, ir = ir, control = control) return(fm) } Note that in training our final model, we can use all available data in the training data set. For a test year of 2003, our training data set is the same as that used for selecting the meta-parameters above, but there we generally only used a subset of the data (i.e., in cross-validation) in fitting models in that process. We next create a function that allows us to pass a set of features (features), a fitted model (fm) and to specify the test year (test_year) and which returns the predicted values and estimated probabilities for the test observations. test_model <- function(features, fm, test_year) { target <- as.character(stats::as.formula(fm[["formula"]])[[2]]) df <- features %>% left_join(aaer_long, by = c("gvkey", "fyear")) %>% select(-max_year, -min_year) %>% mutate(misstate = as.integer(!is.na(p_aaer))) data_test <- df %>% filter(fyear == test_year) return(list(score = predict(fm, data_test, type = "prob"), predicted = as.integer(predict(fm, data_test, type = "class") == 1), response = as.integer(data_test[[target]] == 1))) } We next make a convenience function that calls train_model then test_model in turn.226 fit_test <- function(x, features, gap = 2, size, rus = TRUE, learn_rate, ir = 1, control) { set.seed(2021) fm <- train_model(features = features, test_year = x, size = size, rus = rus, ir = ir, learn_rate = learn_rate, control = control, gap = gap) test_model(features = features, fm = fm, test_year = x) } We first test AdaBoost using the meta-parameter values stored in optim_ada and store the results in results_ada. We have included the values generated from code in Section 28.6 above here so that you do not need to run the code from that section. optim_ada <- tibble(size = 30, learn_rate = 0.1, maxdepth = 7) Again, we use mclapply from the parallel library here; this may not work on Windows (if so, replace mclapply with lapply and delete mc.cores = 8). results_ada <- mclapply(test_years, fit_test, features = features, size = optim_ada$size,
rus = FALSE,
learn_rate = optim_ada$learn_rate, control = rpart.control( maxdepth = optim_ada$maxdepth),
mc.cores = 8)

We then test RUSBoost using the meta-parameter values stored in optim_rus and store the results in results_rus. We have included the values generated from code in Section 28.6 above here so that you do not need to run the code from that section.

optim_rus <- tibble(size = 30, learn_rate = 0.01, maxdepth = 10, ir = 2)
results_rus <- mclapply(test_years, fit_test,
features = features,
size = optim_rus$size, rus = TRUE, ir = optim_rus$ir,
learn_rate = optim_rus$learn_rate, control = rpart.control( maxdepth = optim_rus$maxdepth),
mc.cores = 8)

To make it easier to tabulate results, we create a small function that collects performance statistics for a set of results that can be used to tabulate the results.

get_stats <- function(results) {
response <- results$response score <- results$score

ndcg_k <- function(k) {
c(ndcg(score, response, k = k))
}

res <- list(auc(score, response),
unlist(lapply(seq(0.01, 0.03, 0.01), ndcg_k)))
unlist(res)
}

The analyses in Bao et al. (2020) appear to evaluate test performance using the means across test years, so we make a small function (add_means) to add a row of means to the bottom of a data frame and use this function in table_results, which is a function that extracts performance statistics from a set of results, adds means, then formats the resulting data frame for display.

add_means <- function(df) {
bind_rows(df, summarize_all(df, mean))
}

table_results <- function(results, ...) {
results %>%
lapply(get_stats) %>%
bind_rows() %>%
mutate(test_year = c(test_years, "Mean")) %>%
select(test_year, everything()) %>%
knitr::kable(digits = 4, ...)
}

We can now simply call table_results for results_ada and results_rus in turn.

table_results(results_ada)
test_year auc ndcg_0.01 ndcg_0.02 ndcg_0.03
2003 0.7010 0.0451 0.0504 0.0760
2004 0.6292 0.0257 0.0361 0.0458
2005 0.6821 0.0182 0.0182 0.0413
2006 0.6851 0.0000 0.0308 0.0308
2007 0.5674 0.0000 0.0000 0.0148
2008 0.6061 0.0000 0.0180 0.0180
Mean 0.6452 0.0148 0.0256 0.0378
table_results(results_rus)
test_year auc ndcg_0.01 ndcg_0.02 ndcg_0.03
2003 0.7134 0.1116 0.1300 0.1385
2004 0.7048 0.0321 0.0321 0.0321
2005 0.7019 0.0359 0.0359 0.0478
2006 0.7237 0.0181 0.0344 0.0492
2007 0.5636 0.0000 0.0000 0.0000
2008 0.6140 0.0234 0.0234 0.0234
Mean 0.6702 0.0368 0.0426 0.0485

From the results above, it does seem that RUSBoost is superior to AdaBoost in this current setting using the features drawn from Bao et al. (2020). Note that Bao et al. (2020) do not separately consider AdaBoost as an alternative to RUSBoost. Bao et al. (2020) argue that RUSBoost is superior to the all-features logit model, which seems consistent with our evidence above, though we did not consider the logit model beyond the cross-validation analysis given its poor performance (but see exercises below).

### 28.7.1 Discussion questions and exercises

1. In asserting that RUSBoost is superior to AdaBoost we did not evaluated the statistical significance of the difference between the AUCs for the two approaches. Describe how you might evaluate the statistical significance of this difference.
2. Compare the performance statistics from cross-validation and out-of-sample testing. What do these numbers tell you? Do you believe that these provide evidence that using (say) 2001 data to train a model that is evaluated on (say) 1993 frauds is problematic?
3. Bao et al. (2022) report an AUC in the test sample for an all-features logit model of 0.7228 (see Panel B of Table 1). What differences are there between our approach here and those in Bao et al. (2020) and Bao et al. (2022) that might account for the differences in AUC?
4. Bao et al. (2022) report an AUC in the test sample for an all-features logit model of 0.6842 (see Panel B of Table 1). Calculate an equivalent value for AUC from applying logistic regression to data from above. You may find it helpful to adapt the code above for RUSBoost in the following steps:
1. Create a function train_logit analogous to train_model above, but using glm in place of rusboost.
2. Create a function test_logit analogous to test_model above, but using calls to predict() like those used above for logistic regression.
3. Create a function fit_test_logit analogous to fit_test above, but calling the functions you created above.
4. Use mclapply or lapply with test_years and fit_test_logit and store the results in results_logit.
5. Tabulate results using table_results(results_logit). (Table 28.1 below contains our results from following these steps.)
5. Do the AUC results obtained from logit (from answer to previous question or Table 28.1 below) surprise you given the cross-validation and in-sample AUCs calculated (in exercises) above? How does the average AUC compare with the 0.6842 of Bao et al. (2022)? What might account for the difference?
6. Provide an intuitive explanation of the out-of-sample (test) NDCG@k results for the RUSBoost model. (Imagine you are trying to explain these to a user of prediction models like those discussed earlier in this chapter.) Do you believe that these results are strong, weak, or somewhere in between?
7. We followed Bao et al. (2020) in using gap = 2 (i.e., training data for fiscal years up to two years before the test year), but discussed the trade-offs in using a shorter gap period (more data, but more measurement error). How would you evaluated alternative gap periods?
8. Which results are emphasized in Bao et al. (2020)? What are the equivalent values in Bao et al. (2020)? Which results are emphasized in Bao et al. (2022)?
9. Which audience (e.g., practitioners, researchers) would be most interested in the results of Bao et al. (2020) and Bao et al. (2022)? What limitations of Bao et al. (2020) and Bao et al. (2022) might affect the interest of this audience?
Table 28.1: Performance statistics for logit models
test_year auc ndcg_0.01 ndcg_0.02 ndcg_0.03
2003 0.6682 0.0000 0.0289 0.0464
2004 0.6727 0.0000 0.0000 0.0200
2005 0.6538 0.0000 0.0133 0.0245
2006 0.6457 0.0000 0.0157 0.0157
2007 0.5981 0.0000 0.0000 0.0000
2008 0.6341 0.0599 0.0599 0.0774
Mean 0.6454 0.0100 0.0196 0.0307