21 Natural experiments revisited

In this chapter, we return to the topic of natural experiments. We first discuss the notion of registered reports, their purpose and their limitations. We then focus on an experiment (“Reg SHO”) run by the United States Securities and Exchange Commission (SEC) and studies that examined the effects of Reg SHO, with particular focus on one study that exploited this regulation to study effects on earnings management.

This chapter provides opportunities to sharpen our skills and knowledge in a number of areas. First, we will revisit the topic of earnings management and learn about some developments in its measurement since Dechow et al. (1995), which we covered in Chapter 18. Second, we further develop our skills in evaluating claimed natural experiments, using Reg SHO and the much-studied setting of broker-closure shocks. Third, we explore the popular difference-in-differences approach, both when predicated on random assignment and when based on the so-called parallel trends assumption in the absence of random assignment. Fourth, we will have an additional opportunity to apply ideas related to causal diagrams and causal mechanisms (covered in Chapters 5 and 20, respectively). Fifth, we will revisit the topic of statistical inference, using this chapter as an opportunity to consider randomization inference. Sixth, we build on the Frisch-Waugh-Lovell theorem to consider issues associated with the use of two-step regressions, which are common in many areas of accounting research.

This chapter is longer than others in the book, so we have made it easier to run code from one section without having to run all the code preceding it. The following libraries are needed to run the code in this chapter, so you should run this before running other code in this chapter.

library(lfe)
library(stargazer)
library(dplyr, warn.conflicts = FALSE)
library(lubridate)   # For year() and month()
library(tidyr)       # unnest(), expand_grid(), pivot_longer() & separate()
library(broom)       # For tidy()
library(stringr)     # For str_detect() and str_match()
library(DBI)
library(ggplot2)
library(farr)

Beyond that, the code in each of the following Sections 21.1, 21.2, 21.3, and 21.5 is independent of other code in this chapter and can be run independently of those other sections.141 Code or exercises in Sections 21.7 and 21.8 depend on code in Section 21.6, so you will need to run the code in Section 21.6 before running the code in those later two sections.

We use the stargazer package for regression output and set sg_format to "html" here (change sg_format to "text" if viewing the output below on screen).

sg_format <- "html"

21.1 A replication crisis?

A Financial Times article by Robert Wigglesworth covered the alleged “replication crisis” in finance research. Wigglesworth quotes Campbell Harvey, professor of finance at Duke University, who suggests that “at least half of the 400 supposedly market-beating strategies identified in top financial journals over the years are bogus.”

Wigglesworth identified “the heart of the issue” as what researchers call p-hacking, which is the practice whereby researchers search for “significant” and “positive” results. Here “significant” refers to statistical significance and “positive” refers to results that reject so-called “null hypotheses” and thereby (allegedly) push human knowledge forward. Harvey (2017) cites research suggesting that 90% of published studies report such “significant” and “positive” results. Reporting “positive” results is important not only for getting published, but also for attracting citations, which drive behaviour for both researchers and journals.

Simmons et al. (2011, p. 1359) describe what they term researcher degrees of freedom. “In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?” Simmons et al. (2011, p. 1364) identify another well-known researcher degree of freedom, namely that of “reporting only experiments that ‘work’”, which is known as the file-drawer problem (because experiments that don’t “work” are put in a file-drawer).

To illustrate the power of researcher degrees of freedom, Simmons et al. (2011) conducted two experiments with live subjects and describe the results of two hypothetical studies based on those experiments. They argue that these studies “demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis” [p. 1359]. Simmons et al. (2011, p. 1359) conclude that “flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates.”
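The mechanics behind this claim are easy to simulate. The following sketch (our own illustration, not taken from Simmons et al., 2011) generates data with no true treatment effect, but lets the "researcher" try five different outcome measures and report only the smallest p-value:

```r
set.seed(42)

# One simulated "study" with no true effect: the researcher tries five
# different outcome measures and keeps only the smallest p-value.
run_study <- function(n = 50, n_outcomes = 5) {
  treat <- rep(c(0, 1), each = n / 2)
  p_vals <- replicate(n_outcomes,
                      t.test(rnorm(n) ~ treat)$p.value)  # outcomes unrelated to treatment
  min(p_vals)
}

min_p <- replicate(1000, run_study())

# A nominal 5% test now "rejects" far more often than 5%:
# close to 1 - 0.95^5, or about 23%
mean(min_p < 0.05)
```

With five independent tests of a true null, the chance that at least one is "significant" at the 5% level is $1 - 0.95^5 \approx 0.23$, which is what the simulation delivers; this is just one of the degrees of freedom Simmons et al. (2011) catalogue.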

Perhaps in response to concerns similar to those raised by Simmons et al. (2011), the Journal of Accounting Research (JAR) conducted a trial for its annual conference held in May 2017. According to the JAR website, at this conference “authors presented papers developed through a Registration-based Editorial Process (REP). The goal of the conference was to see whether REP could be implemented for accounting research, and to explore how such a process could be best implemented. Papers presented at the conference were subsequently published in May 2018. As summarized by Bloomfield et al. (2018), we learned a lot through the process of developing and publishing these papers, and deemed the experiment a success, but also an ongoing learning process.”

According to Bloomfield et al. (2018, p. 317), “under REP, authors propose a plan to gather and analyze data to test their predictions. Journals send promising proposals to one or more reviewers and recommend revisions. Authors are given the opportunity to review their proposal in response, often multiple times, before the proposal is either rejected or granted in-principle acceptance … regardless of whether [subsequent] results support their predictions.”

Bloomfield et al. (2018, p. 317) contrast REP with the Traditional Editorial Process (“TEP”). Under the TEP, “authors gather their data, analyze it, and write and revise their manuscripts repeatedly before sending them to editors.” Bloomfield et al. (2018, p. 317) suggest that “almost all peer-reviewed articles in social science are published under … the TEP.”

The REP is designed to eliminate some of the questionable research practices identified by Simmons et al. (2011). For example, one form of p-hacking is HARKing (from “Hypothesizing After Results are Known”). In its extreme form, HARKing involves searching for a “significant” correlation and then developing a hypothesis to “predict” it. To illustrate, consider the spurious correlations website provided by Tyler Vigen. This site lists a number of evidently spurious correlations, such as the 99.26% correlation between the divorce rate in Maine and margarine consumption or the 99.79% correlation between US spending on science, space, and technology and suicides by hanging, strangulation and suffocation. The correlations are deemed spurious because normal human beings have strong prior beliefs that there is no underlying causal relation explaining these correlations. Instead, these are regarded as mere coincidence.

However, a creative academic can probably craft a story to “predict” any correlation. Perhaps increasing spending on science raises its perceived importance to society. But drawing attention to science only serves to highlight how the US has inevitably declined in relative stature in many fields, including science. While many Americans can carry on notwithstanding this decline, others are less sanguine about it and may go to extreme lengths as a result … . This is clearly a silly line of reasoning, but if one added some references to published studies and fancy terminology, it would probably read a lot like the hypothesis development sections of some academic papers.
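Coincidental correlations of this kind arise naturally between unrelated trending series. A quick simulation (ours, not from the sources above) shows that two independent random walks are routinely highly correlated in sample:

```r
set.seed(2021)

# Two independent random walks: neither causes the other
n <- 20
x <- cumsum(rnorm(n))
y <- cumsum(rnorm(n))
cor(x, y)

# Across many such draws, large absolute correlations are routine,
# even though the series are independent by construction
cors <- replicate(1000, cor(cumsum(rnorm(n)), cumsum(rnorm(n))))
mean(abs(cors) > 0.5)
```

Because both divorce rates and margarine consumption trend over time, a large sample correlation between them carries essentially no causal information.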

Bloomfield et al. (2018, p. 326) examine “the strength of the papers’ results” from the 2017 JAR conference in their section 4.2 and conclude that “of the 30 predictions made in the … seven proposals, we count 10 as being supported at $$p \leq 0.05$$ by at least one of the 134 statistical tests the authors reported. The remaining 20 predictions are not supported at $$p \leq 0.05$$ by any of the 84 reported tests. Overall, our analysis suggests that the papers support the authors’ predictions far less strongly than is typical among papers published in JAR and its peers.”142

21.1.1 Discussion questions

1. Simmons et al. (2011) provide a more in-depth examination of issues with the TEP discussed in Bloomfield et al. (2018, pp. 318–9). How plausible do you find the two experiments studied in Simmons et al. (2011) as representations of accounting research in practice? What differences are likely to exist in empirical accounting research using archival data?

2. Bloomfield et al. (2018, p. 326) say “we exclude Hail et al. (2018) from our tabulation [of results] because it does not state formal hypotheses.” Given the lack of formal hypotheses, do you think it made sense to include the proposal from Hail et al. (2018) in the 2017 JAR conference? Does the REP have relevance to papers without formal hypotheses? Does the absence of formal hypotheses imply that Hail et al. (2018) were not testing hypotheses? Is your answer to the last question consistent with how Hail et al. (2018, p. 650) discuss results reported in Table 5 of that paper?

3. According to the analysis of Bloomfield et al. (2018), there were 218 tests of 30 hypotheses and different hypotheses had different numbers of tests. In the following analysis, we assume 30 hypotheses, each with 7 tests (for a total of 210 tests).

set.seed(2021)
results <-
  expand_grid(hypothesis = 1:30, test = 1:7) %>%
  mutate(p = runif(nrow(.)),
         reject = p < 0.05)

results %>%
  group_by(hypothesis) %>%
  summarize(reject_one = any(reject), .groups = "drop") %>%
  count(reject_one)
## # A tibble: 2 × 2
##   reject_one     n
##   <lgl>      <int>
## 1 FALSE         19
## 2 TRUE          11
Does this analysis suggest an alternative possible interpretation of the results to the “far less strongly than is typical” conclusion offered by Bloomfield et al. (2018)? Does choosing a different value for set.seed() alter the tenor of the results from the analysis above? How might you make the analysis above more definitive?
4. Bloomfield et al. (2018, p. 326) argue “it is easy to imagine revisions of several conference papers would allow them to report results of strength comparable to those found in most papers published under TEP.” For example, “Li and Sandino (2018) yielded no statistically significant support for their main hypotheses. However, they found significant results in their planned additional analyses that are consistent with informal predictions included in the accepted proposal. … [In light of this evidence] we are not ready to conclude that the studies in the issue actually provide weaker support for their predictions than most studies published under TEP.” Can these results instead be interpreted as saying something about the strength of results of studies published under TEP?

5. Do you believe that it would be feasible for REP to become the dominant research paradigm in accounting research? What challenges would such a development face?

6. A respondent to the survey conducted by Bloomfield et al. (2018, p. 337) wrote:

I do not find the abundance of “null results” surprising. It could have been discovered from one’s own experience. Research is an iterative process and it involves learning. I am not sure if there is anything useful that we discover in the research process by shutting down the learning channel; especially with the research questions that are very novel and we do not know much about.

Comment on this remark. What do you think the respondent has in mind with regard to the “learning channel”? Do you agree that the REP shuts down this channel?

21.2 The Reg SHO experiment

To better understand the issues raised by the discussion above in a real research setting, we will focus on the Reg SHO experiment, which has been the focus of many studies. In July 2004, the SEC adopted Reg SHO, a regulation governing short-selling activities in equity markets. Reg SHO contained a pilot program in which stocks in the Russell 3000 index were ranked by trading volume within each exchange and every third one was designated as a pilot stock. From May 2, 2005 to August 6, 2007, short sales on pilot stocks were exempted from price tests, including the tick test for exchange-listed stocks and the bid test for NASDAQ National Market stocks.
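The assignment rule (rank by trading volume within each exchange, then designate every third stock) can be sketched in a few lines. The data here are made up for illustration, and we gloss over details such as the SEC's exact volume measure and where in the ranking the counting starts:

```r
library(dplyr)

set.seed(2005)

# Hypothetical miniature "Russell 3000": twelve made-up stocks on two exchanges
stocks <- tibble(
  ticker = paste0("STK", 1:12),
  exch   = rep(c("NYSE", "Nasdaq"), each = 6),
  volume = runif(12, 1, 100)
)

# Rank by trading volume within each exchange; every third stock is a pilot
assigned <-
  stocks %>%
  group_by(exch) %>%
  arrange(desc(volume), .by_group = TRUE) %>%
  mutate(rank = row_number(),
         pilot = rank %% 3 == 0) %>%
  ungroup()

assigned %>% count(pilot)   # one third of stocks end up in the pilot
```

Because the ranking is by volume and every third stock is taken, treatment assignment is plausibly unrelated to firm characteristics conditional on volume, which is what makes Reg SHO attractive as a natural experiment.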

In its initial order, the SEC stated that “the Pilot will allow [it] to study trading behavior in the absence of a short sale price test.” The SEC’s plan was to “examine, among other things, the impact of price tests on market quality (including volatility and liquidity), whether any price changes are caused by short selling, costs imposed by a price test, and the use of alternative means to establish short positions.”

21.2.1 The SHO pilot sample

The assignment mechanism in the Reg SHO experiment is unusually transparent, even by the standards of natural experiments. Nonetheless, care is needed to identify the treatment and control firms, and we believe it is instructive to walk through the steps needed to do so, as we do in this section. (Readers who find the code details a little tedious could easily skip ahead to Section 21.2.3 on a first reading. We say “first reading” because there are subtle issues with natural experiments that this section helps to highlight.)

The SEC’s website provides the names and tickers of the Reg SHO pilot firms. These can be parsed and are included as the sho_tickers data set in the farr package.

sho_tickers
## # A tibble: 986 × 2
##    ticker co_name
##    <chr>  <chr>
##  1 A      AGILENT TECHNOLOGIES INC
##  2 AAI    AIRTRAN HOLDINGS INC
##  3 AAON   AAON INC
##  4 ABC    AMERISOURCEBERGEN CORP
##  5 ABCO   ADVISORY BOARD CO
##  6 ABCW   ANCHOR BANCORP INC
##  7 ABGX   ABGENIX INC
##  8 ABK    AMBAC FINANCIAL GRP INC
##  9 ABMD   ABIOMED INC
## 10 ABR    ARBOR REALTY TRUST INC
## # … with 976 more rows

However, these are just the pilot firms, and we need other sources to obtain the details of the control firms. It might seem perverse for the SEC to have published lists of treatment stocks, but no information on control stocks.143 One explanation for this might be that, because special action (i.e., elimination of price tests) was only required for the treatment stocks (for the control stocks, it was business as usual), no lists of controls were needed for the markets to implement the pilot. Additionally, because the SEC had a list of the control stocks that it would use in its own statistical analysis, it had no reason to publish lists for this purpose. Fortunately, while the SEC did not tell us which stocks were the controls, it provides enough information for us to identify them, which we do below.

First, we know that the pilot stocks were selected from the Russell 3000, the component stocks of which are found in the sho_r3000 data set from the farr package.

sho_r3000
## # A tibble: 3,000 × 2
##    russell_ticker russell_name
##    <chr>          <chr>
##  1 A              AGILENT TECHNOLOGIES INC
##  2 AA             ALCOA INC
##  3 AACC           ASSET ACCEPTANCE CAPITAL
##  4 AACE           ACE CASH EXPRESS INC
##  5 AAI            AIRTRAN HOLDINGS INC
##  6 AAON           AAON INC
##  7 AAP            ADVANCE AUTO PARTS INC
##  8 AAPL           APPLE COMPUTER INC
##  9 ABAX           ABAXIS INC
## 10 ABC            AMERISOURCEBERGEN CORP
## # … with 2,990 more rows

While the Russell 3000 contains 3,000 securities, the SEC and Black et al. (2019) tell us that, in constructing the pilot sample, the SEC excluded 32 stocks in the Russell 3000 index that, as of 25 June 2004, were not listed on the Nasdaq National Market, NYSE or AMEX “because short sales in these securities are currently not subject to a price test.” The SEC also excluded 12 stocks that started trading after April 30, 2004 due to IPOs or spin-offs. And, from Black et al. (2019), we know there were two additional stocks that stopped trading after 25 June 2004 but before the SEC constructed its sample on 28 June 2004. We can get the data for each of these criteria from CRSP, but we need to first merge the Russell 3000 data with CRSP to identify the right PERMNO for each security. For this purpose, we will use data from the five CRSP tables below:

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
rs <- dbExecute(pg, "SET search_path TO crsp")

mse <- tbl(pg, "mse")
msf <- tbl(pg, "msf")
stocknames <- tbl(pg, "stocknames")
dseexchdates <- tbl(pg, "dseexchdates")
ccmxpf_lnkhist <- tbl(pg, "ccmxpf_lnkhist")

(Note that because all tables come from one database schema, we can use SET search_path TO crsp and then replace, for example, tbl(pg, sql("SELECT * FROM crsp.mse")) with tbl(pg, "mse").)

One thing we note is that some of the tickers from the Russell 3000 sample append the class of stock to the ticker. We can detect these cases by looking for a dot (.) using regular expressions. Because a dot has special meaning in regular expressions (regex), we need to escape it using a backslash (\). (For more on regular expressions, see Chapter 10 and references cited there.) Because a backslash has a special meaning in strings in R, we need to escape the backslash itself to tell R that we mean a literal backslash. In short, we use the regex \\. to detect dots in strings.

sho_r3000 %>% filter(str_detect(russell_ticker, "\\."))
## # A tibble: 12 × 2
##    russell_ticker russell_name
##    <chr>          <chr>
##  1 AGR.B          AGERE SYSTEMS INC
##  2 BF.B           BROWN FORMAN CORP
##  3 CRD.B          CRAWFORD & CO
##  4 FCE.A          FOREST CITY ENTRPRS
##  5 HUB.B          HUBBELL INC
##  6 JW.A           WILEY JOHN & SONS INC
##  7 KV.A           K V PHARMACEUTICAL CO
##  8 MOG.A          MOOG INC
##  9 NMG.A          NEIMAN MARCUS GROUP INC
## 10 SQA.A          SEQUA CORPORATION
## 11 TRY.B          TRIARC COS INC
## 12 VIA.B          VIACOM INC

In these cases, CRSP takes a different approach. For example, where the Russell 3000 sample has AGR.B, CRSP has ticker equal to AGR and shrcls equal to B.

The other issue is that some tickers from the Russell 3000 data have the letter E appended to what CRSP shows as just a four-letter ticker.

sho_r3000 %>% filter(nchar(russell_ticker) == 5,
substr(russell_ticker, 5, 5) == "E")
## # A tibble: 4 × 2
##   russell_ticker russell_name
##   <chr>          <chr>
## 1 CVNSE          COVANSYS CORP
## 2 SONSE          SONUS NETWORKS INC
## 3 SPSSE          SPSS INC
## 4 VXGNE          VAXGEN INC

A curious reader might wonder how we identified these two issues with tickers, and how we know that they are exhaustive of the issues in the data. We explore these questions in the exercises at the end of this section.

To address these ticker issues, we create two functions: one (clean_ticker) to “clean” each ticker so that it can be matched with CRSP, and one (get_shrcls) to extract the share class (if any) specified in the Russell 3000 data.

The following code uses a regex to match cases where the text ends with either “A” or “B” ([AB]$ in regex) preceded by a dot (\\. in regex, as discussed above). The expression uses capturing parentheses (i.e., ( and )) to capture the text from the beginning of the string to the dot (^(.*)) and to capture the letter “A” or “B” at the end (([AB])$).

The case_when in the clean_ticker function first drops the E from the end of five-letter tickers, then applies the regex to extract the “clean” ticker (the first captured text), and in all other cases returns the original ticker.

The get_shrcls function extracts the second capture group from the regex (the first value returned by str_match is the complete match, so we use [, 3] to get the second capture group).
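To see the structure that str_match() returns, it may help to apply the regex just described to a couple of tickers directly:

```r
library(stringr)

regex <- "^(.*)\\.([AB])$"

# str_match() returns a character matrix: column 1 holds the full match,
# columns 2 and 3 the first and second capture groups (NA if no match)
str_match(c("AGR.B", "JW.A", "AAPL"), regex)
##      [,1]    [,2]  [,3]
## [1,] "AGR.B" "AGR" "B"
## [2,] "JW.A"  "JW"  "A"
## [3,] NA      NA    NA

# So [, 3] extracts the share class (NA when no class is appended)
str_match(c("AGR.B", "AAPL"), regex)[, 3]
## [1] "B" NA
```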

regex <- "^(.*)\\.([AB])$"

clean_ticker <- function(x) {
  case_when(nchar(x) == 5 & substr(x, 5, 5) == "E" ~ substr(x, 1, 4),
            str_detect(x, regex) ~ str_replace(x, regex, "\\1"),
            TRUE ~ x)
}

get_shrcls <- function(x) {
  str_match(x, regex)[, 3]
}

sho_r3000_tickers <-
  sho_r3000 %>%
  select(russell_ticker, russell_name) %>%
  mutate(ticker = clean_ticker(russell_ticker),
         shrcls = get_shrcls(russell_ticker))

sho_r3000_tickers %>%
  filter(russell_ticker != ticker)
## # A tibble: 16 × 4
##    russell_ticker russell_name            ticker shrcls
##    <chr>          <chr>                   <chr>  <chr>
##  1 AGR.B          AGERE SYSTEMS INC       AGR    B
##  2 BF.B           BROWN FORMAN CORP       BF     B
##  3 CRD.B          CRAWFORD & CO           CRD    B
##  4 CVNSE          COVANSYS CORP           CVNS   <NA>
##  5 FCE.A          FOREST CITY ENTRPRS     FCE    A
##  6 HUB.B          HUBBELL INC             HUB    B
##  7 JW.A           WILEY JOHN & SONS INC   JW     A
##  8 KV.A           K V PHARMACEUTICAL CO   KV     A
##  9 MOG.A          MOOG INC                MOG    A
## 10 NMG.A          NEIMAN MARCUS GROUP INC NMG    A
## 11 SONSE          SONUS NETWORKS INC      SONS   <NA>
## 12 SPSSE          SPSS INC                SPSS   <NA>
## 13 SQA.A          SEQUA CORPORATION       SQA    A
## 14 TRY.B          TRIARC COS INC          TRY    B
## 15 VXGNE          VAXGEN INC              VXGN   <NA>
## 16 VIA.B          VIACOM INC              VIA    B

Now that we have “clean” tickers, we can merge with CRSP. The following code proceeds in two steps. First, we create crsp_sample, which contains the permno, ticker, and shrcls values applicable on 2004-06-25, the date on which the Russell 3000 list that the SEC used was created. Second, we merge sho_r3000_tickers with crsp_sample using ticker and then use filter to retain cases where either no share class is specified in the SEC-provided ticker or the specified share class matches the one in CRSP.
crsp_sample <-
  stocknames %>%
  mutate(test_date = as.Date("2004-06-25")) %>%
  filter(test_date >= namedt, test_date <= nameenddt) %>%
  select(permno, permco, ticker, shrcls) %>%
  distinct() %>%
  collect()

sho_r3000_merged <-
  sho_r3000_tickers %>%
  inner_join(crsp_sample, by = "ticker", suffix = c("", "_crsp")) %>%
  filter(shrcls == shrcls_crsp | is.na(shrcls)) %>%
  select(russell_ticker, permco, permno)

Unfortunately, this approach results in some tickers being matched to multiple PERMNO values.

sho_r3000_merged %>%
  group_by(russell_ticker) %>%
  filter(n() > 1) %>%
  ungroup()
## # A tibble: 40 × 3
##    russell_ticker permco permno
##    <chr>           <int>  <int>
##  1 AGM             28392  80168
##  2 AGM             28392  80169
##  3 BDG             20262  77881
##  4 BDG             20262  55781
##  5 BIO               655  61516
##  6 BIO               655  61508
##  7 CW              20546  89223
##  8 CW              20546  18091
##  9 EXP             30381  80415
## 10 EXP             30381  89983
## # … with 30 more rows

In each case, these appear to be cases where there are multiple securities (permno values) for the same company (permco value). To choose the security most likely included in the Russell 3000 index used by the SEC, we keep the one with the greatest dollar trading volume for the month of June 2004. We collect the data on dollar trading volumes in the data frame trading_vol and then make a new version of the table sho_r3000_merged that includes just the permno value with the greatest trading volume for each ticker.

trading_vol <-
  msf %>%
  filter(date == "2004-06-30") %>%
  mutate(dollar_vol = coalesce(abs(prc) * vol, 0)) %>%
  select(permno, dollar_vol) %>%
  collect()

sho_r3000_merged <-
  sho_r3000_tickers %>%
  inner_join(crsp_sample, by = "ticker", suffix = c("", "_crsp")) %>%
  filter(is.na(shrcls) | shrcls == shrcls_crsp) %>%
  inner_join(trading_vol, by = "permno") %>%
  group_by(russell_ticker) %>%
  filter(dollar_vol == max(dollar_vol, na.rm = TRUE)) %>%
  ungroup() %>%
  select(russell_ticker, permno)

Black et al. (2019) identify the 32 stocks not listed on the Nasdaq National Market, NYSE, or AMEX “using historical exchange code (exchcd) and Nasdaq National Market Indicator (nmsind) from the CRSP monthly stock file” (in practice, these 32 stocks are smaller Nasdaq-listed stocks). However, exchcd and nmsind are not included in the crsp.msf file we use. Black et al. (2019) likely use the CRSP monthly stock file obtained from the web interface provided by WRDS, which often merges in data from other tables. Fortunately, we can obtain nmsind from the CRSP monthly events file (crsp.mse). This file includes information about delisting events, distributions (such as dividends), changes in NASDAQ information (such as nmsind), and name changes. We get data on nmsind by pulling the latest observation on crsp.mse on or before 2004-06-28 where the event related to NASDAQ status (event == "NASDIN").

nmsind_data <-
  mse %>%
  filter(date <= "2004-06-28", event == "NASDIN") %>%
  group_by(permno) %>%
  filter(date == max(date, na.rm = TRUE)) %>%
  ungroup() %>%
  select(permno, date, nmsind) %>%
  collect()

We can obtain exchcd from the CRSP stock names file (crsp.stocknames), again pulling the value applicable on 2004-06-28.144

exchcd_data <-
  stocknames %>%
  filter(exchcd > 0) %>%
  mutate(test_date = as.Date("2004-06-28")) %>%
  filter(test_date >= namedt, test_date <= nameenddt) %>%
  select(permno, exchcd) %>%
  distinct() %>%
  collect()

According to its website, the SEC “also excluded issuers whose initial public offerings commenced after April 30, 2004.” Following Black et al. (2019), we use CRSP data to identify these firms. Specifically, the table crsp.dseexchdates includes the variable begexchdate.
ipo_dates <-
  dseexchdates %>%
  select(permno, begexchdate) %>%
  distinct() %>%
  collect()

Finally, it appears that there were stocks listed in the Russell 3000 file likely used by the SEC (created on 2004-06-25) that were delisted prior to 2004-06-28, the date on which the SEC appears to have finalized the sample for its pilot program. We again use crsp.mse to identify these firms.

recent_delistings <-
  mse %>%
  filter(event == "DELIST",
         date >= "2004-06-25", date <= "2004-06-28") %>%
  rename(delist_date = date) %>%
  select(permno, delist_date) %>%
  collect()

Now, we put all these pieces together and create variables nasdaq_small, recent_listing, and delisted corresponding to the three exclusion criteria discussed above.

sho_r3000_permno <-
  sho_r3000_merged %>%
  left_join(nmsind_data, by = "permno") %>%
  left_join(exchcd_data, by = "permno") %>%
  left_join(ipo_dates, by = "permno") %>%
  left_join(recent_delistings, by = "permno") %>%
  mutate(nasdaq_small = coalesce(nmsind == 3 & exchcd == 3, FALSE),
         recent_listing = begexchdate > "2004-04-30",
         delisted = !is.na(delist_date),
         keep = !nasdaq_small & !recent_listing & !delisted)

sho_r3000_permno %>%
  count(keep, nasdaq_small, recent_listing, delisted)
## # A tibble: 3 × 5
##   keep  nasdaq_small recent_listing delisted     n
##   <lgl> <lgl>        <lgl>          <lgl>    <int>
## 1 FALSE FALSE        TRUE           FALSE       12
## 2 FALSE TRUE         FALSE          FALSE        6
## 3 TRUE  FALSE        FALSE          FALSE     2982

As can be seen below, we have a final sample of 2982 stocks that we can merge with sho_tickers to create the pilot indicator.
sho_r3000_sample <-
  sho_r3000_permno %>%
  filter(keep) %>%
  rename(ticker = russell_ticker) %>%
  left_join(sho_tickers %>%
              select(ticker) %>%
              mutate(pilot = TRUE),
            by = "ticker") %>%
  mutate(pilot = coalesce(pilot, FALSE)) %>%
  select(ticker, permno, pilot)

sho_r3000_sample %>%
  count(pilot)
## # A tibble: 2 × 2
##   pilot     n
##   <lgl> <int>
## 1 FALSE  1996
## 2 TRUE    986

As can be seen, the number of treatment and control firms in this sample corresponds exactly with the numbers provided on p. 42 of Black et al. (2019).

Finally, later we will want to link these data with data from Compustat, which means we need to link these observations with GVKEYs. For this, we use ccm_link (as used and discussed in Chapter 9) to produce sho_r3000_gvkeys, the sample we can use in later analysis.

ccm_link <-
  ccmxpf_lnkhist %>%
  filter(linktype %in% c("LC", "LU", "LS"),
         linkprim %in% c("C", "P")) %>%
  rename(permno = lpermno) %>%
  select(gvkey, permno, linkdt, linkenddt)

gvkeys <-
  ccm_link %>%
  mutate(test_date = as.Date("2004-06-28")) %>%
  filter(test_date >= linkdt,
         test_date <= linkenddt | is.na(linkenddt)) %>%
  select(gvkey, permno) %>%
  collect()

sho_r3000_gvkeys <-
  sho_r3000_sample %>%
  inner_join(gvkeys, by = "permno")

sho_r3000_gvkeys
## # A tibble: 2,979 × 4
##    ticker permno pilot gvkey
##    <chr>   <int> <lgl> <chr>
##  1 A       87432 TRUE  126554
##  2 AA      24643 FALSE 001356
##  3 AACC    90020 FALSE 157058
##  4 AACE    78112 FALSE 025961
##  5 AAI     80670 TRUE  030399
##  6 AAON    76868 TRUE  021542
##  7 AAP     89217 FALSE 145977
##  8 AAPL    14593 FALSE 001690
##  9 ABAX    77279 FALSE 024888
## 10 ABC     81540 TRUE  031673
## # … with 2,969 more rows

To better understand the potential issues with constructing the pilot indicator variable, it is useful to compare the approach above with that taken in a paper we study closely later in this chapter. To construct sho_data as Fang et al. (2016) do, we use fhk_pilot from the farr package.145 We compare sho_r3000_sample and sho_r3000_gvkeys with sho_data in the exercises below.
sho_data <- fhk_pilot %>% select(gvkey, pilot) %>% distinct() %>% group_by(gvkey) %>% filter(n() == 1) %>% ungroup() %>% inner_join(fhk_pilot, by = c("gvkey", "pilot"))  ## Warning in inner_join(., fhk_pilot, by = c("gvkey", "pilot")): Each row in x is expected to match at most 1 row in y. ## ℹ Row 81 of x matches multiple rows. ## ℹ If multiple matches are expected, set multiple = "all" to silence this ## warning. 21.2.2 Exercises 1. Before running the following code, can you tell from output above how many rows this query will return? What is this code doing? At what stage would code like this have been used in process of creating the sample above? Why is code like this not included above? sho_r3000 %>% anti_join(crsp_sample, join_by(russell_ticker == ticker)) %>% collect() 1. Focusing on the values of ticker and pilot in fhk_pilot, what differences do you observe between fhk_pilot and sho_r3000_sample? What do you believe is the underlying cause for these discrepancies? 2. What do the following observations represent? Choose a few observations from this output and examine whether these reveal issues in the sho_r3000_sample or in fhk_pilot. sho_r3000_sample %>% inner_join(fhk_pilot, by = "ticker", suffix = c("_ours", "_fhk")) %>% filter(permno_ours != permno_fhk) ## # A tibble: 37 × 6 ## ticker permno_ours pilot_ours gvkey permno_fhk pilot_fhk ## <chr> <int> <lgl> <chr> <int> <lgl> ## 1 AGM 80169 FALSE 015153 80168 FALSE ## 2 AGR.B 89400 TRUE 141845 88917 TRUE ## 3 BDG 55781 TRUE 002008 77881 TRUE ## 4 BF.B 29946 TRUE 002435 29938 TRUE ## 5 BIO 61516 TRUE 002220 61508 TRUE ## 6 CRD.B 27618 TRUE 003581 76274 TRUE ## 7 CW 18091 FALSE 003662 89223 FALSE ## 8 EXP 80415 FALSE 030032 89983 FALSE ## 9 FCE.A 31974 TRUE 004842 65584 TRUE ## 10 GEF 83233 TRUE 005338 83264 TRUE ## # … with 27 more rows 1. In constructing the pilot indicator, FHK omit cases (gvkey values) where there is more than one distinct value for the indicator. A question is: Who are these firms? 
Why is there more than one value for pilot for these firms? And does omission of these make sense? (Hint: It may help to compare fhk_pilot with sho_r3000_gvkeys for these firms.) sho_dupes <- fhk_pilot %>% select(gvkey, pilot) %>% distinct() %>% group_by(gvkey) %>% filter(n() > 1) %>% ungroup() %>% arrange(gvkey) sho_dupes ## # A tibble: 6 × 2 ## gvkey pilot ## <chr> <lgl> ## 1 007017 TRUE ## 2 007017 FALSE ## 3 030146 FALSE ## 4 030146 TRUE ## 5 141400 FALSE ## 6 141400 TRUE 1. What issue is implicit in the output from the code below? How could you fix this issue? Would you expect a fix for this issue to significantly affect the regression results? Why or why not? sho_data %>% count(gvkey, ticker) %>% arrange(desc(n)) ## # A tibble: 2,993 × 3 ## gvkey ticker n ## <chr> <chr> <int> ## 1 001076 RNT 2 ## 2 002008 BDG 2 ## 3 002220 BIO 2 ## 4 002435 BF.B 2 ## 5 002710 STZ 2 ## 6 003581 CRD.B 2 ## 7 003662 CW 2 ## 8 003708 TRY.B 2 ## 9 004842 FCE.A 2 ## 10 005284 GTN 2 ## # … with 2,983 more rows 21.2.3 Early studies of Reg SHO The first study of the effects of Reg SHO was conducted by the SEC’s own Office of Economic Analysis. The SEC study examines the “effect of pilot on short selling, liquidity, volatility, market efficiency, and extreme price changes” [p. 86]. The authors of the 2007 SEC study “find that price restrictions reduce the volume of executed short sales relative to total volume, indicating that price restrictions indeed act as a constraint to short selling. However, in neither market do we find significant differences in short interest across pilot and control stocks. … We find no evidence that short sale price restrictions in equities have an impact on option trading or open interest. … We find that quoted depths are augmented by price restrictions but realized liquidity is unaffected. 
Further, we find some evidence that price restrictions dampen short term within-day return volatility, but when measured on average, they seem to have no effect on daily return volatility.” The SEC researchers conclude “based on the price reaction to the initiation of the pilot, we find limited evidence that the tick test distorts stock prices—on the day the pilot went into effect, Listed Stocks in the pilot sample underperformed Listed Stocks in the control sample by approximately 24 basis points. However, the pilot and control stocks had similar returns over the first six months of the pilot.”

In summary, it seems fair to say that the SEC found that exemption from price tests had a relatively limited effect on the market outcomes of interest, with no apparent impact on several outcomes.

Alexander and Peterson (2008, p. 84) “examine how price tests affect trader behavior and market quality, which are areas of interest given by the [SEC] in evaluating these tests.” Alexander and Peterson (2008, p. 86) find that NYSE pilot stocks have similar spreads, but smaller trade sizes, more short trades, more short volume, and smaller ask depths. With regard to Nasdaq, Alexander and Peterson (2008, p. 86) find that the removed “bid test is relatively inconsequential.” Diether et al. (2009, p. 37) find that “while short-selling activity increases both for NYSE- and Nasdaq-listed Pilot stocks, returns and volatility at the daily level are unaffected.”

21.2.4 Discussion questions and exercises

1. Earlier we identified one feature of a randomized controlled trial (RCT) as being that “proposed analyses are specified in advance”, as in a registered reports process. Why do you think the SEC did not use a registered report for its 2007 paper? Do you think the analyses of the SEC would be more credible if conducted as part of a registered reports process? Why or why not?

2. Do you have concerns that the results of Alexander and Peterson (2008) have been p-hacked?
What factors increase or reduce your concerns in this regard?

3. Evaluate the hypotheses found in the section of Diether et al. (2009, pp. 41–45) entitled Testable Hypotheses with particular sensitivity to concerns about HARKing. What kind of expertise is necessary in evaluating hypotheses in this way?

4. How might the SEC have conducted Reg SHO as part of a registered reports process open to outside research teams, such as Alexander and Peterson (2008) and Diether et al. (2009)? How might such a process have been run? What challenges would such a process face?

21.3 Analysing natural experiments

Both Alexander and Peterson (2008) and Diether et al. (2009) use the difference-in-differences estimator (“DiD”) of the causal effect that we saw in Chapter 4. The typical approach to DiD involves estimating a regression of the following form:

$Y_{it} = \beta_0 + \beta_1 \times \textit{POST}_t + \beta_2 \times \textit{TREAT}_i + \beta_3 \times \textit{POST}_t \times \textit{TREAT}_i + \epsilon_{it}$

In this specification, the estimated treatment effect is given by the fitted coefficient $$\hat{\beta}_3$$. While DiD is clearly popular among researchers in economics and adjacent fields, it is not obviously the best choice in every experimental setting, and credible alternatives exist. Another approach would be to limit the sample to the post-treatment period and estimate the following regression:

$Y_{it} = \beta_0 + \beta_1 \times \textit{TREAT}_i + \epsilon_{it}$

In this specification, the estimated treatment effect is given by the fitted coefficient $$\hat{\beta}_1$$. This approach is common in drug trials, which are typically conducted as RCTs. For example, in the Paxlovid trial “participants were randomised 1:1, with half receiving paxlovid and the other half receiving a placebo orally every 12 hours for five days.
Of those who were treated within three days of symptom onset, 0.8% (3/389) of patients who received paxlovid were admitted to hospital up to day 28 after randomization, with no deaths. In comparison, 7% (27/385) of patients who received placebo were admitted, with seven deaths.” For the hospital admission outcome, it would have been possible to incorporate prior hospitalization rates in a difference-in-differences analysis, but this would only make sense if hospitalization rates in one period had high predictive power for subsequent hospitalization rates.146

Yet another approach would include pre-treatment values of the outcome variable as a control:

$Y_{it} = \beta_0 + \beta_1 \times \textit{TREAT}_i + \beta_2 \times Y_{i,t-1} + \epsilon_{it}$

To evaluate each of these approaches, we can use simulation analysis. The following analysis is somewhat inspired by Frison and Pocock (1992), who make assumptions about their data that are more appropriate to their (medical) setting and who focus on mathematical analysis instead of simulations. Frison and Pocock (1992) assume a degree of correlation in measurements of outcome variables for a given unit (e.g., patient) that is independent of the time between observations. A more plausible model in many business settings would be correlation in outcome measures for a given unit (e.g., firm) that fades as observations become further apart in time. Specifically, we assume that, absent a treatment effect or any period effects, the outcome in question follows an autoregressive process embedded in the get_outcomes function below, which has the key parameter $$\rho$$ (rho).147

get_outcomes <- function(rho = 0, periods = 7) {
  e <- rnorm(periods)
  y <- rep(NA, periods)
  y[1] <- e[1]
  for (i in 2:periods) {
    y[i] <- rho * y[i - 1] + e[i]
  }
  tibble(t = 1:periods, y = y)
}

We can use this get_outcomes function to generate data for outcomes in the absence of treatment.
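As a quick sanity check on this process (a sketch; the function is restated so the snippet runs on its own), we can simulate a long series and verify that its first-order autocorrelation is close to the chosen value of rho:

```r
library(tibble)

# Restated from above: outcomes follow an AR(1) process with persistence rho
get_outcomes <- function(rho = 0, periods = 7) {
  e <- rnorm(periods)
  y <- rep(NA, periods)
  y[1] <- e[1]
  for (i in 2:periods) {
    y[i] <- rho * y[i - 1] + e[i]
  }
  tibble(t = 1:periods, y = y)
}

set.seed(2021)
df <- get_outcomes(rho = 0.9, periods = 10000)

# First-order autocorrelation of the simulated series; should be near 0.9
cor(df$y[-1], df$y[-nrow(df)])
```

For rho = 0, the same calculation should return a value near zero, matching the i.i.d. special case.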
The following get_sample function uses get_outcomes to generate data for n firms for given values of rho, periods (the number of periods observed for each firm), and effect, the underlying size of the effect of treatment on y. Here treatment is randomly assigned to half the firms in the sample and effect is added to y when both treat and post are true. We also add a time-specific effect (t_effect) for each period, which is common to all observations (a common justification for the use of DiD is the existence of such period effects).

get_sample <- function(n = 100, rho = 0, periods = 7, effect = 0) {
  treat <- sample(1:n, size = floor(n / 2), replace = FALSE)
  t_effects <- tibble(t = 1:periods, t_effect = rnorm(periods))
  f <- function(x) tibble(id = x, get_outcomes(rho = rho, periods = periods))

  df <-
    lapply(1:n, f) %>%
    bind_rows() %>%
    inner_join(t_effects, by = "t") %>%
    mutate(treat = id %in% treat,
           post = t > periods / 2,
           y = y + if_else(treat & post, effect, 0) + t_effect) %>%
    select(-t_effect)
}

The following function applies a number of estimators to a given data set and returns the estimated treatment effect for each estimator. The estimators we consider are the following (the labels POST, CHANGE, and ANCOVA come from Frison and Pocock, 1992):

• DiD, the difference-in-differences estimator, estimated by regressing y on the treatment indicator, treat, interacted with the post-treatment indicator, post (with the lm() function automatically including the main effects of treat and post).

• POST, which is based on OLS regression of y on treat, but with the sample restricted to the post-treatment observations.

• CHANGE, which is based on OLS regression of the change in the outcome on treat. The change in outcome (y_change) is calculated as the mean of the post-treatment outcome values (y_post) minus the mean of the pre-treatment outcome values (y_pre) for each unit.

• ANCOVA, which is a regression of y_post on y_pre and treat.
est_effect <- function(df) {
  df_treat <-
    df %>%
    select(id, treat) %>%
    distinct()

  fm_DiD <- lm(y ~ treat * post, data = df)
  est_DiD <- fm_DiD$coefficients[["treatTRUE:postTRUE"]]

  df_POST <-
    df %>%
    filter(post) %>%
    group_by(id, treat) %>%
    summarize(y = mean(y), .groups = "drop")

  fm_POST <- lm(y ~ treat, data = df_POST)
  est_POST <- fm_POST$coefficients[["treatTRUE"]]

  df_CHANGE <-
    df %>%
    group_by(id, treat, post) %>%
    summarize(y = mean(y), .groups = "drop") %>%
    pivot_wider(names_from = "post", values_from = "y") %>%
    rename(y_pre = `FALSE`, y_post = `TRUE`) %>%
    mutate(y_change = y_post - y_pre)

  fm_CHANGE <- lm(I(y_post - y_pre) ~ treat, data = df_CHANGE)
  est_CHANGE <- fm_CHANGE$coefficients[["treatTRUE"]]

  fm_ANCOVA <- lm(y_post ~ y_pre + treat, data = df_CHANGE)
  est_ANCOVA <- fm_ANCOVA$coefficients[["treatTRUE"]]

  tibble(est_DiD, est_POST, est_CHANGE, est_ANCOVA)
}

The following run_sim function calls get_sample with the supplied parameter values to create a data set, and then returns a data frame containing the results of applying est_effect to that data set.

run_sim <- function(i, n = 100, rho = 0, periods = 7, effect = 0) {
  df <- get_sample(n = n, rho = rho, periods = periods, effect = effect)
  tibble(i = i, est_effect(df))
}

To facilitate running the simulation for various values of effect and rho, we create a data frame (params) with effect sizes running from 0 to 1 and $$\rho \in \{ 0, 0.18, 0.36, 0.54, 0.72, 0.9 \}$$.

rhos <- seq(from = 0, to = 0.9, length.out = 6)
effects <- seq(from = 0, to = 1, length.out = 5)
params <- expand.grid(effect = effects, rho = rhos)

The following function runs 1,000 simulations for the supplied values of effect and rho and returns a data frame with the results.

run_sim_n <- function(effect, rho) {
  n_sims <- 1000
  set.seed(2021)
  tibble(effect, rho,
         bind_rows(lapply(1:n_sims, run_sim, rho = rho, effect = effect)))
}

Actually running the simulation (i.e., the following line of code) takes quite some time (about 25 minutes on a single core of an M1 Mac).148 Fortunately, nothing in the subsequent exercises requires that you run this code, so only do so if you have time and want to examine the results directly.

results <- bind_rows(Map(run_sim_n, effect = params$effect, rho = params$rho))

With results in hand, we can do some analysis. The first thing to note is that est_CHANGE is equivalent to est_DiD, as all estimates for these two methods are within rounding error of each other.

results %>%
  filter(abs(est_DiD - est_CHANGE) > 0.00001) %>%
  nrow()

## [1] 0

Thus we just use the label DiD in subsequent analysis. The second thing we check is that the methods provide unbiased estimates of the causal effect.
The following plot suggests that the estimates are very close to the true values of the causal effects for all three methods.

results %>%
  pivot_longer(starts_with("est"), names_to = "method", values_to = "est") %>%
  mutate(method = gsub("^est.(.*)$", "\\1", method)) %>%
  group_by(rho, method) %>%
  summarize(bias = mean(est - effect), .groups = "drop") %>%
  filter(method != "CHANGE") %>%
  ggplot(aes(x = rho, y = bias, colour = method)) +
  geom_line() +
  ylim(-0.1, 0.1)

Having confirmed that there is no apparent bias in any of the estimators in this setting, we next consider the empirical standard errors for each method. Because we get essentially identical plots for each value of the true effect, we focus on effect == 0.5 in the following analysis. Here we rearrange the data so that we have a method column and an est column for the estimated causal effect. We then calculate, for each method and value of rho, the standard deviation of est, which is the empirical standard error we seek. Finally, we plot these standard errors against rho.

results %>%
  filter(effect == 0.5) %>%
  pivot_longer(starts_with("est"),
               names_to = "method", values_to = "est") %>%
         value = as.vector(model$coefficients)) %>%
    filter(grepl("^year.", name)) %>%
    separate(name, into = c("year", "pilot"), sep = ":", fill = "right") %>%
    mutate(year = as.integer(gsub("^year", "", year)),
           pilot = coalesce(pilot == "pilotTRUE", FALSE)) %>%
    ggplot(aes(x = year, y = value, color = pilot)) +
    geom_line() +
    scale_x_continuous(breaks = 2000:2012L) +
    geom_rect(xmin = 2005, xmax = 2007, ymin = -Inf, ymax = Inf,
              color = NA, alpha = 0.01)
}

We then estimate one of the models above by year and then feed the fitted model to the plot_coefficients function. The resulting plot is shown in Figure 21.1.

sho_accruals %>%
  mutate(year = as.factor(year(datadate))) %>%
  felm(da_adj ~ year * pilot - pilot - 1 + log(at) + mtob + roa + leverage |
         0 | 0 | year + gvkey, data = .) %>%
  plot_coefficients()

21.6.5 Exercises

1. In words, how does sho_accruals_alt (defined below) differ from sho_accruals? Does using sho_accruals_alt in place of sho_accruals affect the regression results?

firm_years <- controls_raw %>% select(gvkey, datadate, fyear)

sho_accruals_alt <-
  sho_r3000_gvkeys %>%
  inner_join(firm_years, by = "gvkey") %>%
  left_join(df_controls, by = c("gvkey", "fyear")) %>%
  left_join(pdmas, by = c("gvkey", "fyear")) %>%
  group_by(fyear) %>%
  mutate_at(all_of(win_vars), winsorize, prob = 0.01) %>%
  ungroup()

2. In an online appendix BDLYY say “FHK winsorize covariates for their covariate balance table at 1/99%. We inferred that they also winsorized accruals at this level. Whether they winsorize across sample years or within each year, they do not specify.” The code above winsorized within each year. How would you modify the code to winsorize “across sample years”? Does doing so make a difference?

3. How would you modify the code to winsorize at the 2%/98% level? Does this make a difference to the results? (Hint: With the farr package loaded, type ? winsorize in the R console to get help on this function.)

4. How would you modify the code to not winsorize at all?
Does this make a difference to the results?

5. Some of the studies discussed by BDLYY exclude 2004 data from the sample. How would you modify the above code to do this here? Does excluding 2004 here make a significant difference?

6. What are FHK doing in the creation of controls_filled? (Hint: The key “verb” is fill.) Does this seem appropriate? Does doing this make a difference?

7. What are FHK doing in the creation of df_controls from controls_fyear? Does this seem appropriate? Does doing this make a difference?

8. What is the range of values for year in sho_accruals? Does this suggest any issues with the code post = year %in% c(2008, 2009, 2010) above? If so, does fixing any issue have an impact on the results reported above?

9. Would it make sense, in creating perf above, if we instead calculated ib_at as if_else(at > 0, ib/at, NA_real_)? What is the effect on the regression results if we use this modified calculation of ib_at? What do Kothari et al. (2005) recommend on this point? (Hint: Use pm_lag = FALSE where applicable.)

10. Fang et al. (2019, p. 10) follow Fang et al. (2016), who “exclude observations for which the absolute value of total accruals-to-total assets exceeds one. This is a standard practice in the accounting literature because firms with such high total accruals-to-total assets are often viewed as extreme outliers. Nonetheless, the FHK results are robust to winsorizing the accrual measures at the 1% and 99% levels instead of excluding extreme outliers.” Does this claim hold up in the reproduction above? What happens if the filter on abs(acc_at) <= 1 is removed from the code above? (Hint: Use drop_extreme = FALSE where applicable.)

11. Explain what each line of the function plot_coefficients before the line starting with ggplot is doing. (Hint: It may be helpful to store the model that is fed to the function above in the variable model and then run the function line by line.)
21.7 Statistical inference

One point of difference between FHK and BDLYY concerns clustered standard errors. Fang et al. (2016) generally use “standard errors clustered by year and firm”, while Black et al. (2019) advocate the use of standard errors clustered by firm. Citing Cameron et al. (2008), Black et al. (2019, p. 30) suggest that “clustered standard errors with a small number of clusters can be downward biased.” In the context of FHK, there are thousands of firms, but a relatively small number of years, so clustering by year (or by firm and year) may create a problem of too few clusters.

One approach to determining the appropriate clustering is more empirical. In this regard, it is useful to note that cluster-robust standard errors are a generalization of an idea from White (1980b). White (1980b) provides not only an estimator of standard errors that is robust to heteroskedasticity, but also a test of the null hypothesis of homoskedasticity. Intuitively, if the covariance matrix estimated allowing for heteroskedasticity is sufficiently different from that assuming homoskedasticity, then we may reject the null hypothesis of homoskedasticity. With a little algebra, it would be possible to develop a test analogous to that of White (1980b) of the null hypothesis of no clustering on variable $$g$$.

In practice, many researchers will, lacking a formally derived test, compare standard errors with and without clustering on variable $$g$$ and elect to cluster on variable $$g$$ when the standard errors with clustering seem significantly higher than those without. This heuristic breaks down in the case of Fang et al. (2016), because standard errors are generally lower when clustering on firm and year than when clustering on firm alone. However, if clustering on firm alone is appropriate, standard errors clustered on firm and year will provide noisier estimates than those clustered on firm alone, and thus could be lower or higher in any given data set.
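To see concretely why clustering by firm matters when the regressor of interest is constant within firm, consider the following toy simulation (a sketch with made-up data, not the FHK sample; felm() from the lfe package is used as elsewhere in this chapter):

```r
library(lfe)

set.seed(2021)
n_firms <- 200
n_years <- 10
df <- expand.grid(firm = 1:n_firms, year = 1:n_years)

# A regressor that is constant within firm, like the Reg SHO pilot indicator
x_firm <- rnorm(n_firms)
# Errors with a firm-level component, so they are serially dependent within firm
e_firm <- rnorm(n_firms)
df$x <- x_firm[df$firm]
df$y <- 0.5 * df$x + e_firm[df$firm] + rnorm(nrow(df))

fm_iid <- felm(y ~ x, data = df)
fm_cl <- felm(y ~ x | 0 | 0 | firm, data = df)

# The conventional standard error understates sampling variation here;
# the firm-clustered standard error is markedly larger
c(iid = summary(fm_iid)$coefficients["x", 2],
  by_firm = summary(fm_cl)$coefficients["x", 2])
```

If the firm-level error component were removed, the two standard errors would be similar on average, which is the empirical heuristic described above.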
A more theoretical approach can be used in the setting of FHK because of our deeper understanding of the assignment mechanism. In this regard, it is important to note that cluster-robust standard errors address correlation in both $$X$$ and $$\epsilon$$ across units within clusters. To explore this (slightly) more formally, recall from Chapter 6 that the cluster-robust covariance matrix is estimated using the following expression:

$\hat{V}(\hat{\beta}) = (X'X)^{-1} \hat{B} (X'X)^{-1} \text{, where } \hat{B} = \sum_{g=1}^G X'_g u_g u'_g X_g$

where the observations are grouped into $$G$$ clusters of $$N_g$$ observations for $$g$$ in $$\{1, \dots, G\}$$, $$X_g$$ is the $$N_g \times K$$ matrix of regressors, and $$u_g$$ is the $$N_g$$-vector of residuals for cluster $$g$$. If we have a single regressor, demeaned $$x$$ with no constant term, and two firms ($$i$$ and $$j$$) in a cluster, then the contribution of that cluster to $$\hat{B}$$ will be

\begin{aligned} \begin{bmatrix} x_i & x_j \end{bmatrix} \begin{bmatrix} u_i \\ u_j \end{bmatrix} \begin{bmatrix} u_i & u_j \end{bmatrix} \begin{bmatrix} x_i \\ x_j \end{bmatrix} &= \begin{bmatrix} x_i & x_j \end{bmatrix} \begin{bmatrix} u_i^2 & u_i u_j \\ u_i u_j & u_j^2 \end{bmatrix} \begin{bmatrix} x_i \\ x_j \end{bmatrix} \\ &= \begin{bmatrix} x_i & x_j \end{bmatrix} \begin{bmatrix} x_i u_i^2 + x_j u_i u_j \\ x_i u_i u_j + x_j u_j^2 \end{bmatrix} \\ &= x_i^2 u_i^2 + 2 x_i x_j u_i u_j + x_j^2 u_j^2 \end{aligned}

Now, if $$x_i$$ and $$x_j$$ are uncorrelated then, even if $$\epsilon_i$$ and $$\epsilon_j$$ are correlated, the cross-product term vanishes in expectation and this resolves to

$x_i^2 \sigma_i^2 + x_j^2 \sigma_j^2$

which is the expectation of the analogous component of the heteroskedasticity-robust estimator from White (1980b). In the setting of Fang et al.
(2016), the “$$x$$” of primary interest is the Reg SHO pilot indicator, which is assumed to be randomly assigned, and thus (in expectation) uncorrelated across firms. For this reason, we do not expect cross-sectional dependence to affect standard error estimates on average. On the other hand, the Reg SHO pilot indicator is perfectly correlated over time within firm, so any serial dependence in errors within firm over time will affect standard error estimates. This (somewhat loose) theoretical analysis suggests we should cluster by firm (time-series dependence), but not by year (cross-sectional dependence), as suggested by Black et al. (2019).

However, the assumed random assignment of treatment allows us to adopt an alternative approach to statistical inference that is agnostic about the form of clustering in the data. This approach is known as randomization inference and builds on the Fisher sharp null hypothesis of no effect of any kind. This is a “sharp null” because it is more restrictive than a null hypothesis of zero mean effect: the latter could be true even if half the observations had a treatment effect of $$+1$$ and half had a treatment effect of $$-1$$, in which case the Fisher sharp null would be false even though the null hypothesis of zero mean effect is true.

Under the Fisher sharp null hypothesis and with random assignment to treatment, in principle we can evaluate the distribution of any given test statistic by considering all possible assignments. Focusing on the 2,982 firms in the SEC’s initial sample, if assignment to treatment were purely random, then any other assignment of treatment to 986 firms was as likely as the one chosen.
Given that the Fisher sharp null implies that there was no impact of treatment assignment on outcomes, we know what the distribution of the test statistic would have been if the SEC had chosen any one of those alternative assignments, because the outcomes would have been exactly the same. With smaller samples, we might proceed to calculate the test statistic for every possible assignment and thereby construct the exact distribution of the test statistic under the Fisher sharp null.157 But in our case there is a huge number of ways to choose 986 treatment firms from 2,982 possibilities, so a more feasible approach is to draw a random sample of possible assignments and use the empirical distribution of the test statistic for that random sample as an approximation for the exact distribution.

get_coef_rand <- function(i) {
  treatment <-
    sho_accruals %>%
    select(gvkey, pilot) %>%
    distinct() %>%
    mutate(pilot = sample(pilot, size = length(pilot), replace = FALSE))

  reg_data_alt <-
    sho_accruals %>%
    select(-pilot) %>%
    inner_join(treatment, by = "gvkey")

  reg_data_alt %>%
    reg_year_fe(controls = TRUE, firm_fe = TRUE) %>%
    tidy() %>%
    select(term, estimate) %>%
    pivot_wider(names_from = "term", values_from = "estimate") %>%
    mutate(iteration = i) %>%
    suppressWarnings()
}

The test statistic we are interested in here is the coefficient on $$\mathit{PILOT} \times \mathit{DURING}$$. Below we calculate $$p$$-values for the coefficients on variables involving $$\mathit{PILOT}$$ using the empirical distribution of those coefficients, and the associated standard errors as the standard deviation of those coefficients. (Note that the following calculation takes nearly three minutes on a single core of an M1 Mac.158)

set.seed(2021)
rand_results <- bind_rows(lapply(1:1000, get_coef_rand))

The following (somewhat unattractive) code runs regressions with standard errors based on clustering by firm and year, by firm alone, and using randomization inference.
fms <- list()
fms[[1]] <- reg_year_fe(sho_accruals, cl_2 = TRUE)
fms[[2]] <- reg_year_fe(sho_accruals, cl_2 = FALSE)
fms[[3]] <- fms[[2]]

ses <- list()
ses[[1]] <- summary(fms[[1]])$coefficients[, "Cluster s.e."]
ses[[2]] <- summary(fms[[2]])$coefficients[, "Cluster s.e."]
ses[[3]] <- ses[[2]]
ses[[3]]["pilotTRUE:duringTRUE"] <- sd(rand_results[["pilotTRUE:duringTRUE"]])
ses[[3]]["pilotTRUE:postTRUE"] <- sd(rand_results[["pilotTRUE:postTRUE"]])

pvals <- list()
pvals[[1]] <- summary(fms[[1]])$coefficients[, "Pr(>|t|)"]
pvals[[2]] <- summary(fms[[2]])$coefficients[, "Pr(>|t|)"]
pvals[[3]] <- pvals[[2]]

beta_during <- fms[[2]]$coefficients["pilotTRUE:duringTRUE", 1]
beta_post <- fms[[2]]$coefficients["pilotTRUE:postTRUE", 1]
pvals[[3]]["pilotTRUE:duringTRUE"] <-
  mean(abs(rand_results[["pilotTRUE:duringTRUE"]]) > abs(beta_during))
pvals[[3]]["pilotTRUE:postTRUE"] <-
  mean(abs(rand_results[["pilotTRUE:postTRUE"]]) > abs(beta_post))

We report regression results for regressions with controls and firm fixed effects. The first column uses two-way cluster-robust standard errors, the second column uses standard errors clustered by firm, and the third column uses standard errors obtained from randomization inference for coefficients on variables involving $$\mathit{PILOT}$$.

stargazer(fms, se = ses, p = pvals,
          type = sg_format, header = FALSE,
          title = "Results with randomization inference",
          label = "tab-rand-inf",
          omit = "^(during|post|pilot)TRUE$",
          keep.stat = c("n", "rsq"))
Results with randomization inference
====================================================
                       Dependent variable:
                 -----------------------------------
                              model
                    (1)         (2)         (3)
----------------------------------------------------
log(at)            0.001       0.001       0.001
                  (0.001)     (0.001)     (0.001)
mtob              -0.001      -0.001**    -0.001**
                 (0.0004)    (0.0003)    (0.0003)
roa               -0.009      -0.009      -0.009
                  (0.016)     (0.012)     (0.012)
leverage          -0.016**    -0.016***   -0.016***
                  (0.006)     (0.004)     (0.004)
pilotTRUE:during  -0.010**    -0.010**    -0.010**
                  (0.004)     (0.005)     (0.005)
pilotTRUE:post     0.008**     0.008       0.008
                  (0.004)     (0.005)     (0.006)
----------------------------------------------------
Observations      19,626      19,626      19,626
R2                 0.002       0.002       0.002
====================================================
Note:             *p<0.1; **p<0.05; ***p<0.01
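The logic of the randomization-inference $$p$$-values above can be illustrated with a small self-contained example using simulated data with a known treatment effect (all names and numbers here are illustrative, not from the FHK sample):

```r
set.seed(2021)
n <- 200
treat <- seq_len(n) %in% sample(n, n / 2)  # random assignment to half the units
y <- 0.8 * treat + rnorm(n)                # true treatment effect of 0.8

# Observed test statistic: difference in mean outcomes
obs_stat <- mean(y[treat]) - mean(y[!treat])

# Under the Fisher sharp null, outcomes are unchanged by reassignment, so
# re-randomizing treatment traces out the null distribution of the statistic
perm_stats <- replicate(2000, {
  t_star <- sample(treat)
  mean(y[t_star]) - mean(y[!t_star])
})

# Two-sided p-value: share of reassignments at least as extreme as observed
mean(abs(perm_stats) >= abs(obs_stat))
```

With a true effect as large as 0.8, the resulting $$p$$-value should typically be very small; with the effect set to zero, it would be roughly uniform on $$[0, 1]$$ across simulations.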

21.7.1 Exercises

1. In the function get_coef_rand(), we first created the data set treatment, then merged this with reg_data_alt. Why did we do it this way rather than simply applying the line mutate(pilot = sample(pilot, size = length(pilot), replace = FALSE)) directly to reg_data_alt?

2. Using randomization inference, calculate a $$p$$-value for a one-sided alternative hypothesis that $$H_1: \beta < 0$$ where $$\beta$$ is the coefficient on $$\mathit{PILOT} \times \mathit{DURING}$$. (Hint: You should not need to run the randomization again; modifying the calculation of p_value should suffice.)

3. What is the empirical standard error implied by the distribution of coefficients in rand_results? Is it closer to the standard errors obtained in estimating with cl_2 = TRUE or those with cl_2 = FALSE? Why might it be preferable to calculate $$p$$-values under randomization inference as we have done above instead of calculating $$p$$-values from $$t$$-statistics based on the estimated coefficient and the empirical standard error? Would we get different $$p$$-values using this latter approach?

4. Why did we not use the empirical standard error implied by the distribution of coefficients in rand_results to calculate standard errors for the control variables (e.g., log(at))?

21.8 Causal diagrams

It is important to note that we observe total accruals, not discretionary accruals, so we need to construct measures of discretionary accruals. The Jones (1991) model of discretionary accruals “controls for” sales growth and PP&E, and the Kothari et al. (2005) model additionally “controls for” performance.

Assuming that the causal diagram below is correct, we get unbiased estimates of causal effects whether we “control for” pre-treatment outcome values (e.g., using DiD) or not (e.g., using POST), and it is not clear that we need to control for other factors that drive total accruals. If being a Reg SHO pilot firm leads to a reduction in earnings management, we should observe lower total accruals, even if we posit that the effect operates through discretionary accruals, which we do not observe directly. If we accept this causal diagram, then the decision as to which factors to control for is, like the choice between DiD, POST, and ANCOVA, a question of statistical efficiency rather than bias.

In this context, it is perhaps useful to consider causal diagrams to sharpen our understanding of the issues, which we explore in the discussion questions below, as matters can be more complicated if the causal diagram below is incomplete.

21.8.1 Discussion questions

1. What features of the causal diagram above imply that we do not need to control for performance, sales, and PP&E in estimating the causal effect of Reg SHO on accruals? What is the basis for assuming these features in the causal diagram?

2. Black et al. (2022) report that “over 60 papers in accounting, finance, and economics report that suspension of the price tests had wide-ranging indirect effects on pilot firms, including on earnings management, investments, leverage, acquisitions, management compensation, workplace safety, and more (see Internet Appendix, Table IA-1 for a summary).” In light of the Internet Appendix of Black et al. (2022), is there any evidence that Reg SHO might plausibly have an effect on performance, sales growth, or PP&E? If so, how would the causal diagram above need to be modified to account for these consequences? What would be the implications of these changes on the appropriate tests for estimating the causal effects of Reg SHO on accruals?

3. Produce a regression table and a plot like the ones in the FHK replication above, but using discretionary accruals without performance matching instead of performance-matched discretionary accruals. How do you interpret these results?

4. Produce a regression table and a plot like the ones in the FHK replication above, but using total accruals instead of discretionary accruals and excluding controls (so the coefficients will be simple conditional sample means). How do you interpret these results?

5. Suppose you had been brought in by the SEC to design a study examining the research question examined by FHK in the form of a registered report. What analyses would you conduct to try to understand the best research design? For example, how would you choose between DiD, POST, ANCOVA and other empirical approaches? What controls would you include? How would you decide how to include controls? (For example, one could control for performance by including performance as a regressor in the model of earnings management, by matching on performance, or by including performance in the main regression specification.) How would you calculate standard errors? Discuss how your proposed empirical test differs from that of FHK. Would you have reported similar results to what FHK reported?

6. Suppose that FHK’s empirical analysis had produced a positive effect of Reg SHO on earnings management. Would this imply a lack of support for their hypotheses? Do you believe that publication in the Journal of Finance depended on finding a negative effect?

7. What implications would there have been for publication of FHK in the Journal of Finance if they had failed to find an effect of Reg SHO on earnings management?

21.9 Causal mechanisms

Black et al. (2022, p. 4) suggest a number of possible “causal channels … through which the Reg SHO experiment could have affected the behavior of firms or third parties”, including short interest, returns, price efficiency, and “manager fear”. On the last of these, Black et al. (2022, p. 4) suggest that “even if the Reg SHO experiment did not actually affect short interest or returns, pilot firm managers could have feared being targeted by short sellers and taken pre-emptive actions.”

Black et al. (2022, p. 4) argue that “if firm managers were fearful that relaxing the price tests would affect them, one might expect them to voice concerns in various ways: speaking with business news reporters; writing to the SEC when it sought public comments; seeking meetings with SEC officials to express opposition. … We found no evidence of manager opposition when the rule was proposed in 2003, when it was announced in 2004, or when the SEC proposed to abolish the short-sale rule in 2006.”

Black et al. (2022, p. 5) suggest that “FHK rely on the manager fear channel. They conjecture that in response to a greater threat of short selling, pilot firms’ managers reduced earnings management to preemptively deter short sellers.”

21.9.1 Discussion questions

1. Do you agree with the assertion of Black et al. (2022) that “FHK rely on the manager fear channel”? What causal mechanisms are suggested in Fang et al. (2016)? What evidence do Fang et al. (2016) offer in support of these mechanisms?

2. Evaluate the response of Fang et al. (2019) to Black et al. (2022) as it relates to causal mechanisms.

3. Do you think evidence of causal mechanisms is more or less important when using a natural experiment (i.e., an experiment outside the control of the researcher that is typically analysed after it has been run) than when conducting a randomized experiment? Explain your reasoning given the various issues raised in this chapter.

21.10 Two-step regressions

Chen et al. (2018) examine the question of statistical inference when residuals from one regression are used as a dependent variable in a subsequent regression, which they refer to as “the two-step procedure”. For example, discretionary accruals measured using the Jones (1991) model are residuals from a regression of total accruals on changes in sales and PP&E. As we saw in Dechow et al. (1995), which we covered in Chapter 18, many papers examine how Jones (1991) model discretionary accruals relate to various posited incentives for earnings management.

Chen et al. (2018, p. 755) show that “the two-step procedure is likely to generate biased coefficients and t-statistics in many studies” and, drawing on the Frisch-Waugh-Lovell theorem (see Section 4.3), propose using a single regression in place of the two-step procedure. In the case of the Jones (1991) model, this would entail including the regressors from the first step in the same regression as the second step and using total accruals in place of discretionary accruals as the dependent variable.
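The nature of the bias can be seen in a simulated sketch (variable names and coefficients here are invented for illustration, and the first stage abstracts from the scaling by lagged assets in the actual Jones model): when the variable of interest in the second step is correlated with a first-step regressor, the two-step coefficient is attenuated, while the single regression recovers the true effect.

```r
set.seed(2021)
n <- 10000

# First-step ("Jones model"-style) regressors
sales_growth <- rnorm(n)
ppe <- rnorm(n)

# Second-step variable of interest, correlated with a first-step regressor
incentive <- 0.5 * sales_growth + rnorm(n)

# Accruals generated with a true incentive effect of 0.3
acc <- 1.0 * sales_growth + 0.5 * ppe + 0.3 * incentive + rnorm(n)

# Two-step procedure: residuals ("discretionary accruals"), then second step
da <- resid(lm(acc ~ sales_growth + ppe))
two_step <- coef(lm(da ~ incentive))[["incentive"]]

# Single regression including the first-step regressors
single <- coef(lm(acc ~ incentive + sales_growth + ppe))[["incentive"]]

c(two_step = two_step, single = single)  # two_step attenuated; single near 0.3
```

The attenuation arises because the first stage attributes part of the incentive effect to sales_growth; if incentive were uncorrelated with the first-step regressors, the two estimates would coincide in expectation.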

21.10.1 Discussion questions

1. What challenges would exist in implementing the single-regression recommendation of Chen et al. (2018) for a researcher using Kothari et al. (2005) performance-matched discretionary accruals?

2. Do you believe the issues raised by Chen et al. (2018) with regard to two-step procedures also apply if using randomization inference? Why or why not?