# 23 Panel data

Most data sets used in accounting research are panel data sets, which means repeated observations on multiple units. Typically units are companies and observations are available for several points of time or period. As we saw in Chapter 4, in certain conditions, researchers can exploit panel data sets to gain insights into causal effects.

In this chapter we explore two approaches commonly used with panel data sets. The first approach uses difference in differences to measure causal effects. The second approach uses multi-way fixed effects to account both for what is commonly referred to as “unobserved time-invariant heterogeneity” between units (i.e., unit-specific effects) and period-specific effects. In practice, these approaches are often combined. We discussed difference in differences in Chapter 21 and recommend that you read that chapter before this one.

The code in this chapter uses the packages listed in the code below. For instructions on how to set up your computer to use the code found in this book, see Section 1.2.1 (note that Step 4 is not required as we do not use WRDS data in this chapter).

library(dplyr, warn.conflicts = FALSE)
library(farr)

library(stargazer)

We use the stargazer package to produce neat output from regressions. For the HTML version of this book, we set sg_format to "html", but "text" would be a better option if looking at the results interactively.

sg_format <- "html"

## 23.1 Analysis of simulated data

In Chapter 4, we explored the test_scores data set. There we used a number of approaches, including one where we estimated the effect size using a difference-in-difference estimator with grade and individual fixed effects as 15.736.

We can now reveal that the test_scores data set is simulated using the function get_test_scores(), with the default values for each of its arguments: effect_size = 15, n_students = 1000, and n_grades = 4.

The data-generating process embedded in get_test_scores() produces scores using the following equation

$y_{ig} = \alpha_i + \beta x_{ig} + \gamma_g + \epsilon_{ig}$ where $$i$$ and $$g$$ denote individuals and grades, respectively. Denoting $$G$$ as the grade after application of treatment (i.e., 7 in the test_scores data), $$x_{ig}$$ is the treatment indicator:

$x_{ig} = \begin{cases} 1 \text{ if } i \text{ is treated and } g \geq G \\ 0 \text{ otherwise} \end{cases}$

Thus, as we will see, this is precisely the setting where multi-way fixed effects are appropriate. The individual fixed effect picks up $$\alpha_i$$ and the grade fixed effect picks up $$\gamma_g$$. While the estimated effect size of 15.736 seems close to the true value $$\beta = 15$$, but is it close enough? To better examine this issue, we generate a version of the data where the treatment effect is zero (i.e., $$\beta = 0$$).

set.seed(2021)
test_scores_alt <- get_test_scores(effect_size = 0)
fms <- list()
fms[[1]] <- felm(score ~ I(treat * post) | grade  + id,
data = test_scores_alt)
fms[[2]] <- felm(score ~ I(treat * post) | grade  + id | 0 | grade + id,
data = test_scores_alt)

Because we set the seed to the same value (2021) as used to generate the original data, we get the same random draws and our estimated treatment effect is exactly as before (0.736), but reduced by $$15$$ and the estimated standard error (without clustering) is exactly as we saw earlier (0.319). However, we see evidence that the coefficient is not “close enough” to the true value, as we would reject the true null at the 5% level ($$p$$-value of 0.0211).

We use the stargazer package for regression output and set sg_format to "html" here (use sg_format <- "text" if viewing the output below on screen).

sg_format <- "html"
stargazer(fms, type = sg_format,
omit.stat = c("ser", "f"), header = FALSE)
 Dependent variable: score (1) (2) I(treat * post) 0.736** 0.736 (0.319) (0.990) Observations 4,000 4,000 R2 0.923 0.923 Adjusted R2 0.897 0.897 Note: p<0.1; p<0.05; p<0.01

What has “gone wrong” here? Some possible explanations are:

1. Our standard errors are too low.
2. This is simply bad luck (i.e., we would not expect these results for a different seed value).
3. Our estimator is biased due to small-sample properties of the fixed-effects estimator.
4. Our estimator is biased due to endogenous selection.

Let’s take these explanations one at a time. For the first, the fact that cluster-robust standard errors are significantly higher seems consistent with this explanation. But recall that the standard error is essentially an estimate of the standard deviation of the relevant coefficient over independent random draws of the data. So a more rigorous way to test this explanation is to simulate the data and examine whether the OLS standard errors tend to underestimate the variability of the coefficients.

The following function generates a data set using the get_test_scores() function 1,000 times.

sim_run <- function(i, ...) {
df <- get_test_scores(effect_size = 0, ...)
fm1 <- felm(score ~ I(treat * post) | grade  + id, data = df)
fm2 <- felm(score ~ I(treat * post) | grade  + id | 0 | grade + id, data = df)

df1 <- c(model = "OLS", as_tibble(summary(fm1)$coefficients)) df2 <- c(model = "CL-2", as_tibble(summary(fm2)$coefficients))

names(df1) <- names(df2) <- c("model", "estimate", "se", "t", "p-value")

bind_rows(df1, df2)
}

sim_results <- bind_rows(lapply(1:1000L, sim_run))

Given these data, we want to compare the standard deviation of the estimate coefficient with the mean of its estimated standard error for the two approachs (OLS and CL-2):

sim_results %>%
group_by(model) %>%
summarize(se_obs = sd(estimate),
se_est = mean(se))
## # A tibble: 2 × 3
##   model se_obs se_est
##   <chr>  <dbl>  <dbl>
## 1 CL-2   0.335  0.980
## 2 OLS    0.335  0.315

We can see that the two-way cluster-robust standard errors are actually far too high, while the OLS standard errors are pretty close to the true standard deviation of the coefficients. This seems to rule out the first explanation.

To evaluate the second explanation (“bad luck”), we can use the same data. Let’s consider critical $$p$$-values of 1% and 5% and count how many of the OLS test statistics would lead to rejection of the null hypothesis ($$\beta = 0$$) at each size. (We focus on the OLS $$t$$-statistics, as we have established that these produce the better standard error estimates.)

rejection_stats <-
sim_results %>%
filter(model == "OLS") %>%
summarize(prop_01 = mean(p-value <= 0.01),
prop_05 = mean(p-value <= 0.05))

rejection_stats
## # A tibble: 1 × 2
##   prop_01 prop_05
##     <dbl>   <dbl>
## 1   0.863   0.951

So, we reject a true null hypothesis $$86\%$$ of the time at the $$1\%$$-level and $$95\%$$ of the time at the $$5\%$$-level, even with what appear to be good standard errors. This would be an extraordinary degree of bad luck, suggesting that there is bias in our coefficient estimates.

The properties of fixed-effect estimators are derived asymptotically, that is as the number of observations and estimated fixed effects approaches infinity. However, in a typical panel data set, we have a fairly small number of time periods. In our analysis above, we have just four grades of data. But, because these are simulated data and not real people, we can easily expand the number of grades of data that we consider. Let’s consider n_grades = 12, the maximum handled by the get_test_scores function.168

sim_results_wide <- bind_rows(lapply(1:1000L, sim_run, n_grades = 12))
rejection_stats <-
sim_results_wide %>%
filter(model == "OLS") %>%
summarize(prop_01 = mean(p-value <= 0.01),
prop_05 = mean(p-value <= 0.05))

rejection_stats
## # A tibble: 1 × 2
##   prop_01 prop_05
##     <dbl>   <dbl>
## 1   0.329   0.574

Here we see evidence of a reduction of the bias, but the bias is not eliminated. We still reject a true null hypothesis $$33\%$$ of the time at the $$1\%$$-level and $$57\%$$ of the time at the $$5\%$$-level.

So let’s consider the final explanation of the bias, which is due to non-random selection. It turns out that assignment to treatment in get_test_scores is a function of test scores in the year prior to the camp, but conditional on these scores, is completely random.169 This is termed selection on observables, which is often suggested as a basis on which causal inference can be justified when using approaches such as propensity-score matching.

So what happens if we use completely random assignment to treatment? We can request such assignment by setting the random_assignment to TRUE in the get_test_scores() function.

sim_results_rand <- bind_rows(lapply(1:1000L, sim_run,
random_assignment = TRUE))
rejection_stats <-
sim_results_rand %>%
filter(model == "OLS") %>%
summarize(prop_01 = mean(p-value <= 0.01),
prop_05 = mean(p-value <= 0.05))

rejection_stats
## # A tibble: 1 × 2
##   prop_01 prop_05
##     <dbl>   <dbl>
## 1   0.009   0.046

Here we see evidence of elimination of the bias. We reject a true null hypothesis $$0.9\%$$ of the time at the $$1\%$$-level and $$4.6\%$$ of the time at the $$5\%$$-level.

Thus it appears that random assignment is critical in this setting for achieving valid causal inferences. This is somewhat concerning as, beyond non-random treatment assignment, the basic assumptions underlying causal inference appear to be satisfied, as we have a plausible basis for the parallel trends assumption. First, we have grade effects that are the same for both treated and untreated observations. Second, we have individual effects that, while different between the treated and untreated observations, remain constant over the sample period and therefore do not undermine the parallel-trends assumption.170 What this suggests is that subtle biases can enter difference-in-difference analyses even in settings that are unrealistically simple. It seems reasonable to expect that biases exist—and are plausibly worse—in the more complex settings of actual research.

## 23.2 Voluntary disclosure

One paper that uses a multi-way fixed-effect structure like that we analysed above is Li et al. (2018), who “seek to provide causal evidence on the proprietary cost hypothesis. While “proprietary costs” are commonly assumed to be those caused by use of information by competitors, Verrecchia (1983) uses the term more broadly:

It seems apparent that there would be a proprietary cost associated with releasing information which is unfavorable to a firm (e.g., a bank would be tempted to ask for repayment of its loan). However, the release of a variety of accounting statistics about a firm (e.g., sales, net income, costs of operation, etc.) may be useful to competitors, shareholders, or employees in a way which is harmful to a firm’s prospects even if (or perhaps because) the information is favorable. One recent example of this is the response of the UAW (United Auto Workers) for fewer labor concessions in the face of an announcement by Chrysler Corporation’s chairman that that firm’s fortunes had improved.

While Verrecchia (1983) considers the “reluctance of managers in certain highly competitive industries … to disclose favorable accounting data”, he assumes that the cost of disclosure is constant and independent of the disclosed value.

Verrecchia (1983) can be viewed as providing a theoretical foundation for the proprietary cost hypothesis. The model in Verrecchia (1983) posits a capital market–driven incentive for disclosure of favourable information; absent such an incentive, there would be no disclosure in the Verrecchia (1983) setting. Another feature of the Verrecchia (1983) setting is that firms with favourable news are the ones disclosing, as the capital-market benefit of disclosure is greater for these firms.

Li et al. (2018) exploit the staggered implementation of the inevitable disclosure doctrine, which was adopted by state courts as part of the common law of their respective states at different times (some state courts later rejected the doctrine after adopting it). The inevitable disclosure doctrine (IDD) provides an employer with injunctive relief to prevent a current or former employee from working for another company if doing so will lead to the inevitable disclosure of trade secret information. Li et al. (2018) argue that IDD increases the marginal benefits of non-disclosure, which they interpret as an increase in the cost of disclosure.

Li et al. (2018) focus their analysis of disclosure choice on the disclosure of customer identities in 10-Ks and assume that the state of a firm’s headquarters governs the applicability of IDD to the firm. We will conduct an approximate replication of certain analyses of Li et al. (2018) to understand the empirical approaches of this chapter. This requires combining data on customer disclosures with data on the dates of adoption of IDD by states with data in the states of companies’ headquarters.

### 23.2.1 Customer disclosures

The first data set is comp_segments_hist_daily.seg_customer, which contains data derived from companies’ disclosures regarding significant customers.

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
seg_customer <- tbl(pg, sql("SELECT * FROM comp_segments_hist_daily.seg_customer"))
names_seg <- tbl(pg, sql("SELECT * FROM comp_segments_hist_daily.names_seg"))
funda <- tbl(pg, sql("SELECT * FROM comp.funda"))
seg_customer %>%
count(ctype) %>%
arrange(desc(n)) %>%
collect()
## # A tibble: 7 × 2
##   ctype         n
##   <chr>     <int>
## 1 COMPANY  351140
## 2 GEOREG   137916
## 3 MARKET   100675
## 4 GOVDOM    38788
## 5 GOVFRN     5000
## 6 GOVSTATE   1391
## 7 GOVLOC      568
seg_customer %>%
filter(ctype == "COMPANY") %>%
count(cnms) %>%
arrange(desc(n)) %>%
collect()
## # A tibble: 43,354 × 2
##    cnms             n
##    <chr>        <int>
##  1 Not Reported 63889
##  2 NOT REPORTED 31162
##  3 2 Customers   5663
##  4 3 Customers   4978
##  5 5 Customers   4431
##  6 9 Customers   3969
##  7 4 Customers   3902
##  8 10 Customers  3640
##  9 Not reported  2894
## 10 2 CUSTOMERS   2053
## # … with 43,344 more rows

To better understand this setting, let’s examine one example. One observation chosen somewhat at random from the last year of the sample period is Advanced Micro Devices, Inc. (AMD), a global semiconductor company. In its 10-K for the year ended 31 December 2010, AMD disclosed the following:

In 2010, Hewlett-Packard Company accounted for more than 10% of our consolidated net revenues. Sales to Hewlett-Packard consisted primarily of products from our Computing Solutions segment. Five customers, including Hewlett-Packard, accounted for approximately 55% of the net revenue attributable to our Computing Solutions segment. In addition, five customers accounted for approximately 46% of the net revenue attributable to our Graphics segment. A loss of any of these customers could have a material adverse effect on our business.

Elsewhere in its 10-K, AMD discloses:

In 2010, the Company had one customer that accounted for more than 10% of the Company’s consolidated net revenues. Net sales to this customer were approximately $1.4 billion, or 22% of consolidated net revenues, and were primarily attributable to the Computing Solutions segment. Note that the Graphics and Computing Solutions segments had sales of$1,663 million and $4,817 million, respectively, and therefore made up 99.8% of AMD’s total sales of$6,494 million.

Thus 46% of Graphics represents $765 million, while 55% of Computing Solutions is$2,650 million. After subtracting $1,400 million of these sales to Hewlett-Packard (HP), Compustat appears to ascribe the remaining$1,250 to “4 Customers”. The basis for assigning $1,400 to “Not Reported” is unclear (while the sentence above merely says “one customer”, it is quite clear that this is HP and Compustat elsewhere assumes as much). (In fact, it seems almost certain that there is double-counting in this case, making addition of the numbers in salecs problematic.) seg_customer %>% filter(gvkey == "001161", datadate == "2010-12-31") %>% select(cnms, ctype, salecs, stype) %>% collect() ## # A tibble: 8 × 4 ## cnms ctype salecs stype ## <chr> <chr> <dbl> <chr> ## 1 HEWLETT-PACKARD CO COMPANY 1400 BUSSEG ## 2 4 Customers COMPANY 1250. BUSSEG ## 3 Not Reported COMPANY 1400 BUSSEG ## 4 5 Customers COMPANY 765 BUSSEG ## 5 Not Reported COMPANY NA BUSSEG ## 6 Not Reported COMPANY NA BUSSEG ## 7 Not Reported COMPANY NA BUSSEG ## 8 International GEOREG 5715. BUSSEG Focusing on this case perhaps help us to understand the disclosure decision faced by AMD. Even if AMD had not disclosed the identity of its largest customer, it perhaps would have been easy to infer that “Customer #1” was indeed HP, as HP was the largest personal computer manufacturer at the time.171 The second dataset we will use is undisclosed_names, which is part of the farr package. This dataset was constructed somewhat manually and represents the values of cnms that should be deemed non-disclosures.172 undisclosed_names ## # A tibble: 432 × 2 ## cnms disclosed ## <chr> <lgl> ## 1 Not Reported FALSE ## 2 NOT REPORTED FALSE ## 3 2 CUSTOMERS FALSE ## 4 4 Customers FALSE ## 5 6 Customers FALSE ## 6 3 Customers FALSE ## 7 8 Customers FALSE ## 8 9 CUSTOMERS FALSE ## 9 5 CUSTOMERS FALSE ## 10 3 CUSTOMERS FALSE ## # … with 422 more rows We can then perform a left_join of the customer-related data in seg_customer to identify disclosures (and non-disclosures) of customer identities. Following Li et al. (2018), we focus on the period from 1994 to 2010. Here we show some of the result data items, focused on disclosed customers and larger customers. customers <- seg_customer %>% filter(ctype == "COMPANY") %>% collect() disclosure_raw <- customers %>% filter(between(datadate, as.Date("1994-01-01"), as.Date("2010-12-31"))) %>% left_join(undisclosed_names, by = "cnms") %>% mutate(disclosed = coalesce(disclosed, TRUE)) %>% select(gvkey, datadate, cnms, salecs, disclosed) ## Warning in left_join(., undisclosed_names, by = "cnms"): Detected an unexpected many-to-many relationship between x and y. ## ℹ Row 3937 of x matches multiple rows in y. ## ℹ Row 1 of y matches multiple rows in x. ## ℹ If a many-to-many relationship is expected, set relationship = "many-to-many" to silence this warning. disclosure_raw %>% filter(disclosed) %>% arrange(desc(salecs)) ## # A tibble: 86,609 × 5 ## gvkey datadate cnms salecs disclosed ## <chr> <date> <chr> <dbl> <lgl> ## 1 002751 2010-06-30 CVS Health Corp 23641. TRUE ## 2 002751 2009-06-30 WALGREEN CO 22888. TRUE ## 3 118122 1999-12-31 General Motors Corp 22302 TRUE ## 4 002751 2010-06-30 WALGREEN CO 21671. TRUE ## 5 002751 2009-06-30 CVS Caremark Corp 20898. TRUE ## 6 118122 2000-12-31 General Motors Corp 20665 TRUE ## 7 002751 2008-06-30 CVS Caremark Corp 20040. TRUE ## 8 002751 2007-06-30 CVS Caremark Corp 18239. TRUE ## 9 012136 2005-06-30 GENERAL ELECTRIC CO 17712 TRUE ## 10 002751 2008-06-30 Walgreen Co 17307. TRUE ## # … with 86,599 more rows As a shortcut way of imposing the sample requirements used in Li et al. (2018) (e.g., not financial services), we restrict our analysis to firms (GVKEYs) in the sample of Li et al. (2018). The associated gvkeys are found in the llz_2018 data set from the farr package.173 llz_2018 ## # A tibble: 5,830 × 1 ## gvkey ## <chr> ## 1 001004 ## 2 001013 ## 3 001021 ## 4 001037 ## 5 001048 ## 6 001050 ## 7 001056 ## 8 001073 ## 9 001082 ## 10 001086 ## # … with 5,820 more rows To get a feel for the data, we can look at the proportion of sales attributed to the largest customer. Note that there are some cases where the largest customer appears to account for more than 100% of sales; many of these appear to be data-entry errors. But there are hundreds of firm-years with revenue from just one customer. There is also an interesting jump at 10% of sales. While this is likely driven by the 10% cut-off driving mandatory disclosure, it is interesting to note that many firms disclose significant customer information even when the largest customer is below 10% of sales. sales <- funda %>% filter(indfmt == "INDL", datafmt == "STD", consol == "C", popsrc == "D") %>% select(gvkey, datadate, sale) %>% collect() biggest_customers <- disclosure_raw %>% inner_join(sales, by = c("gvkey", "datadate")) %>% group_by(gvkey, datadate) %>% mutate(max_customer = max(salecs)/sale) biggest_customers %>% filter(between(max_customer, 0, 1)) %>% ggplot(aes(max_customer)) + geom_histogram(binwidth = 0.02) + scale_x_continuous(breaks = seq(0, 1, by = 0.1)) Interestingly, Li et al. (2018) only include a firm-year observation where at least one customer is disclosed (whether identified or not) as having at least 10% of total sales. We construct a data set (prin_cust_df) to identify these firms. prin_cust_df <- disclosure_raw %>% inner_join(sales, by = c("gvkey", "datadate")) %>% group_by(gvkey, datadate) %>% filter(!is.na(salecs), sale > 0) %>% summarize(prin_cust = max(salecs/sale, na.rm = TRUE), .groups = 'drop') %>% mutate(has_prin_cust = prin_cust >= 0.1) We next restrict the data to the llz_2018 sample and calculate the two measures of disclosure choice used in Li et al. (2018), one measuring the proportion of significant customers whose identities are not disclosed and another weighting that proportion by sales. disclosure <- disclosure_raw %>% inner_join(sales, by = c("gvkey", "datadate")) %>% semi_join(llz_2018, by = "gvkey") %>% group_by(gvkey, datadate) %>% summarize(ratio = mean(!disclosed), ratio_sale = sum((!disclosed) * salecs)/sum(salecs), .groups = "drop") %>% mutate(year = year(datadate)) disclosure %>% summarize(across(c(ratio, ratio_sale), mean, na.rm = TRUE)) ## Warning: There was 1 warning in summarize(). ## ℹ In argument: across(c(ratio, ratio_sale), mean, na.rm = TRUE). ## Caused by warning: ## ! The ... argument of across() is deprecated as of dplyr 1.1.0. ## Supply arguments directly to .fns through an anonymous function instead. ## ## # Previously ## across(a:b, mean, na.rm = TRUE) ## ## # Now ## across(a:b, $$x) mean(x, na.rm = TRUE)) ## # A tibble: 1 × 2 ## ratio ratio_sale ## <dbl> <dbl> ## 1 0.459 0.461 ### 23.2.2 Data on adoption of IDD The next data set we use is idd_periods which is returned by the farr function get_idd_periods. This table is based on idd_dates, which is derived from data reported in Klasa et al. (2018) and reproduced in Li et al. (2018). idd_dates ## # A tibble: 24 × 3 ## state idd_date idd_type ## <chr> <date> <chr> ## 1 AR 1997-03-18 Adopt ## 2 CT 1996-02-28 Adopt ## 3 DE 1964-05-05 Adopt ## 4 FL 1960-07-11 Adopt ## 5 FL 2001-05-21 Reject ## 6 GA 1998-06-29 Adopt ## 7 IL 1989-02-09 Adopt ## 8 IN 1995-07-12 Adopt ## 9 IA 1997-03-18 Adopt ## 10 KS 2006-02-02 Adopt ## # … with 14 more rows While idd_dates represents dates of either adoption or rejection of the inevitable disclosure doctrine, the idd_periods table takes a sample period (defined using min_date and max_date) and breaks that sample period into three kinds by state: pre- and post-adoption and post-rejection. idd_periods <- get_idd_periods(min_date = "1994-01-01", max_date = "2010-12-31") idd_periods ## # A tibble: 65 × 4 ## state period_type start_date end_date ## <chr> <chr> <date> <date> ## 1 AK Pre-adoption 1994-01-01 2010-12-31 ## 2 AL Pre-adoption 1994-01-01 2010-12-31 ## 3 AR Pre-adoption 1994-01-01 1997-03-18 ## 4 AR Post-adoption 1997-03-18 2010-12-31 ## 5 AZ Pre-adoption 1994-01-01 2010-12-31 ## 6 CA Pre-adoption 1994-01-01 2010-12-31 ## 7 CO Pre-adoption 1994-01-01 2010-12-31 ## 8 CT Pre-adoption 1994-01-01 1996-02-28 ## 9 CT Post-adoption 1996-02-28 2010-12-31 ## 10 DC Pre-adoption 1994-01-01 2010-12-31 ## # … with 55 more rows ### 23.2.3 Data on state headquarters The final piece of the puzzle in terms of data is state_hq, which is found in the farr package and derived from data provided by Bill McDonald (see here), which are the data on headquarters location used by Li et al. (2018). The state_hq table reports the range of dates for SEC filings for each cik-ba_state combination.174 state_hq ## # A tibble: 53,133 × 4 ## cik ba_state min_date max_date ## <chr> <chr> <date> <date> ## 1 0000066382 MI 1994-01-04 2018-10-10 ## 2 0000070415 NY 1994-01-04 2007-03-14 ## 3 0000084129 PA 1994-01-05 2018-10-04 ## 4 0000832922 OH 1994-01-05 2001-01-09 ## 5 0000909832 CA 1994-01-05 1996-12-20 ## 6 0000004911 PA 1994-01-06 1996-01-05 ## 7 0000038777 CA 1994-01-06 2018-11-09 ## 8 0000726977 IL 1994-01-06 1997-11-14 ## 9 0000756764 NY 1994-01-06 1999-07-08 ## 10 0000764478 MN 1994-01-06 2018-12-07 ## # … with 53,123 more rows To use these data, we need to link CIKs with GVKEYs, which we do using the table comp_segments_hist_daily.names_seg provided alongside Compustat’s segment data.175 ciks <- names_seg %>% filter(!is.na(cik)) %>% select(gvkey, cik) %>% collect() state_hq_linked <- state_hq %>% inner_join(ciks, by = "cik") %>% inner_join(disclosure, by = "gvkey", relationship = "many-to-many") %>% filter(datadate >= min_date, datadate <= max_date) %>% select(gvkey, datadate, ba_state) %>% rename(state = ba_state) ### 23.2.4 Regression analysis Finally, we pull all these pieces together. Like Li et al. (2018), we delete “post-rejection” observations and log-transform the dependent variable. reg_data <- disclosure %>% inner_join(prin_cust_df, by = c("gvkey", "datadate")) %>% filter(has_prin_cust) %>% inner_join(state_hq_linked, by = c("gvkey", "datadate")) %>% inner_join(idd_periods, join_by(state, datadate >= start_date, datadate <= end_date)) %>% filter(period_type != "Post-rejection") %>% mutate(post = period_type == "Post-adoption", ln_ratio = log(1 + ratio), ln_ratio_sale = log(1 + ratio_sale)) %>% select(-start_date, -end_date) reg_data %>% count(period_type) ## # A tibble: 2 × 2 ## period_type n ## <chr> <int> ## 1 Post-adoption 15839 ## 2 Pre-adoption 12943 fms <- list() fms[[1]] <- felm(ln_ratio ~ post | gvkey + year , data = reg_data) fms[[2]] <- felm(ln_ratio_sale ~ post | gvkey + year, data = reg_data) To keep the analysis simple, we do not include controls (apart from firm and year fixed effects). In a typical regression analysis with firm and year fixed effects, omission of controls will often not have a material impact on coefficient estimates. We run the analysis again, but focused on a subset of firms. switchers <- reg_data %>% distinct(gvkey, post) %>% group_by(gvkey) %>% filter(n() > 1) %>% select(gvkey) %>% ungroup() reg_data_switchers <- reg_data %>% semi_join(switchers, by = "gvkey") fms[[3]] <- felm(ln_ratio ~ post | gvkey + year, data = reg_data_switchers) fms[[4]] <- felm(ln_ratio_sale ~ post | gvkey + year, data = reg_data_switchers) stargazer(fms, type = sg_format, omit.stat = c("ser", "f"), header = FALSE, title = "Effect of IDD adoption")  Dependent variable: ln_ratio ln_ratio_sale ln_ratio ln_ratio_sale (1) (2) (3) (4) post 0.016** 0.024*** 0.014 0.019** (0.008) (0.008) (0.009) (0.010) Observations 28,782 24,600 3,617 3,164 R2 0.672 0.712 0.624 0.661 Adjusted R2 0.599 0.637 0.568 0.602 Note: p<0.1; p<0.05; p<0.01 The analysis above is similar to Table 2 of Li et al. (2018). Below we conduct analysis similar to that reported in Table 3 of Li et al. (2018), which examines the effect of IDD adoption on disclosure by year related to adoption. We first make a little function that takes a vector of numbers and returns them as a factor that has nicely formatted strings order by the original numbers.176 This will be useful for presenting “nice” regression results. factor_t <- function(t) { t <- relevel(as.factor(t), ref = "-1") levels(t) <- gsub("^([0-9]+)", "+\\1", levels(t)) levels(t) <- gsub("([0-9]+)", " \\1", levels(t)) levels(t) <- gsub("^", " ", levels(t)) t } We next make a function that turns a vector of year-difference values into a factor—with values outside \((-5, +5)$$ collapsed into $$-5$$ and $$+5$$ as appropriate—and labels the output in a regression output-friendly way. Note that if a firm is never-treated, it’s value of $$t$$ will be missing and we set the value for these years to -Inf (R’s way of saying $$-\infty$$). year_diff <- function(t) { t <- case_when(t < -5 ~ -5, t > 5 ~ 5, is.na(t) ~ -Inf, TRUE ~ t) factor_t(t) } switch_years <- reg_data %>% group_by(gvkey) %>% arrange(datadate) %>% filter(period_type == "Post-adoption" & lag(period_type) == "Pre-adoption") %>% group_by(gvkey) %>% summarize(adoption_year = min(year), .groups = "drop") reg_data_t <- reg_data %>% left_join(switch_years, by = "gvkey") %>% mutate(t = year_diff(year - adoption_year)) fms <- list() fms[[1]] <- felm(ln_ratio ~ t | gvkey + year, data = reg_data_t) fms[[2]] <- felm(ln_ratio_sale ~ t | gvkey + year, data = reg_data_t) fms[[3]] <- felm(ln_ratio ~ t | gvkey + year, data = reg_data_t, subset = !grepl("Inf", t)) fms[[4]] <- felm(ln_ratio_sale ~ t | gvkey + year, data = reg_data_t, subset = !grepl("Inf", t)) stargazer(fms, type = sg_format, omit.stat = c("ser", "f"), header = FALSE, title = "Effect of IDD adoption by year", omit = "t -Inf")  Dependent variable: ln_ratio ln_ratio_sale ln_ratio ln_ratio_sale (1) (2) (3) (4) t - 5 -0.004 0.001 -0.017 -0.009 (0.019) (0.021) (0.027) (0.029) t - 4 -0.017 -0.030 -0.026 -0.038 (0.022) (0.023) (0.025) (0.026) t - 3 -0.030 -0.029 -0.031 -0.031 (0.020) (0.020) (0.021) (0.022) t - 2 -0.008 -0.003 -0.014 -0.010 (0.018) (0.018) (0.019) (0.019) t + 0 -0.001 0.017 0.007 0.020 (0.015) (0.015) (0.016) (0.017) t + 1 0.009 0.025 0.021 0.032 (0.016) (0.017) (0.018) (0.020) t + 2 -0.007 0.010 0.010 0.018 (0.017) (0.018) (0.021) (0.023) t + 3 0.023 0.021 0.039 0.026 (0.019) (0.020) (0.025) (0.027) t + 4 0.029 0.039* 0.042 0.042 (0.019) (0.021) (0.027) (0.030) t + 5 0.044*** 0.048*** 0.047 0.039 (0.015) (0.016) (0.032) (0.036) Observations 28,782 24,600 3,110 2,740 R2 0.673 0.712 0.625 0.668 Adjusted R2 0.600 0.637 0.568 0.611 Note: p<0.1; p<0.05; p<0.01 Next we make a small function to plot the coefficients over time. The function first arranges the coefficients for each period in event-time as a data frame and turns the coefficient labels (e.g., $$t + 1$$) into values (i.e., $$1$$). In making the plot, we add in the reference period ($$t-1$$), which has a coefficient value of $$0$$ by definition, and add 95% confidence intervals for the coefficients. plot_coefs <- function(fm) { coefs <- summary(fm)$coefficients %>%
as_tibble(rownames = "t") %>%
mutate(t = as.integer(gsub("[t ]+", "", t))) %>%
filter(t != -Inf)

ci <- 0.95
mult <- qnorm(1 - (1 - ci)/2)

plot <-
tibble(t = -1, Estimate = 0, Std. Error = 0) %>%
bind_rows(coefs) %>%
mutate(mean = Estimate,
top = Estimate + mult * Std. Error,
bot = Estimate - mult * Std. Error) %>%
ggplot(aes(x = t, y = mean)) +
geom_errorbar(aes(ymin = bot, ymax = top), width = .1) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(from = min(coefs$t), to = max(coefs$t)))

plot
}

And we use this function to plot the coefficients for the first model.

plot_coefs(fms[[1]])

### 23.2.5 Discussion questions

1. The proprietary cost hypothesis tested by Li et al. (2018) posits that increased cost of disclosure will lead to reduced supply of disclosure? What rival theories exist that would make alternative predictions? Are there other elements of disclosure theory that might be tested? Why do you think Li et al. (2018) focused on this specific element of the proprietary cost hypothesis?
2. In the analysis above, we do not include the control variables considered by Li et al. (2018). For example, Li et al. (2018) “include R&D expenditures to sales, advertisement [sic] expenditure to sales, and intangible assets scaled by total assets to control for a firm’s proprietary costs of disclosure.” Using the approach outlined in Chapter 5, in what circumstances would it be necessary to control for “a firm’s proprietary costs of disclosure” in this way? Do these circumstances seem applicable in this setting, where the treatment is a (presumably exogenous) shock to disclosure costs and the outcome is disclosure choices?
3. What differs between the regressions reported in columns (1) and (3), and (2) and (4) of the table titled “Effect of IDD adoption”, respectively? What does this tell you about what drives the regression results in this setting? What happens if you omit the year fixed effects from both sets of regressions? What does this tell you about the role of the “non-switchers” (i.e., firms not in the switchers data frame) in the regression?
4. Would you expect the inclusion of controls (see the question above) to have a significant impact on the regression results? Why or why not?
5. What differs between the regressions reported in columns (1) and (3), and (2) and (4) of the table titled “Effect of IDD adoption by year”, respectively? What happens if you omit the year fixed effects from both sets of regressions? What does this tell you about what drives the regression results in this setting?
6. What patterns do you observe in the coefficients reported in the table titled “Effect of IDD adoption by year”? Do these conform to what you would expect from Li et al. (2018)? (It may be easiest to focus on these in groups, e.g., those in $$t-5$$ to $$t-2$$, those in $$t + 0$$ to $$t+3$$ and those for $$t+4$$ and $$t+5$$.)
7. How do the variables in the table titled “Effect of IDD adoption by year” above differ from those used in Table 3 of Li et al. (2018)? Modify the code above (e.g., the year_diff function) to produce analysis closer to that reported in Table 3 of Li et al. (2018).
8. How helpful do you find the plot of the coefficients by year?
9. The year_diff function collapses years after $$t + 5$$ and before $$t - 5$$ into years $$t + 5$$ and $$t - 5$$, respectively. When would this approach make sense? What would be one alternative approach to handling these years? Does your suggested approach make a difference in this setting?
10. Describe the data set created from the following code. What proportion of the firms in the data set have same_state equal to TRUE? For the purposes of empirical analysis, do the firms with same_state equal to FALSE enhance, or detract from, our ability to draw causal inferences about the effect of adoption of IDD?
1. What issues might be implied by the following data? How might you address these?
## # A tibble: 12 × 2
##    t           n
##    <fct>   <int>
##  1 " - 1"    322
##  2 " -Inf" 25672
##  3 " - 5"    232
##  4 " - 4"    117
##  5 " - 3"    154
##  6 " - 2"    219
##  7 " + 0"    385
##  8 " + 1"    306
##  9 " + 2"    233
## 10 " + 3"    192
## 11 " + 4"    168
## 12 " + 5"    782