21  Panel data

The typical data set used in accounting research comprises panel data, namely repeated observations on multiple units. Typically the units are companies and observations are available for several points in time or periods. As we saw in Chapter 3, under certain conditions, researchers can exploit panel data sets to gain insights into causal effects.

In this chapter we explore two approaches commonly used with panel data sets. The first approach uses difference in differences to measure causal effects. The second approach uses multi-way fixed effects to account both for what is commonly referred to as “unobserved time-invariant heterogeneity” between units (i.e., unit-specific effects) and period-specific effects. In practice, these approaches are often combined. We discussed difference in differences in Chapter 19 and recommend that you read that chapter before this one.

Tip

The code in this chapter uses the following packages. Rather than invoking several Tidyverse packages separately, we load the tidyverse package. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available at https://github.com/iangow/far_templates.
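
A minimal set of library() calls consistent with the code used below is as follows; this list is inferred from the functions called in this chapter and may differ slightly from the book's own setup chunk.

library(tidyverse)    # dplyr, ggplot2, purrr, stringr, tidyr; recent versions also attach lubridate (year())
library(DBI)          # dbConnect(), Id()
library(farr)         # get_test_scores(), undisclosed_names, llz_2018, idd_dates, get_idd_periods(), state_hq, load_parquet()
library(fixest)       # feols(), se(), pvalue()
library(modelsummary) # modelsummary()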

21.1 Analysis of simulated data

In Chapter 3, we explored the test_scores data set. There we used a number of approaches, including one in which we estimated the effect size as 15.736 using a difference-in-differences estimator with grade and individual fixed effects.

We can now reveal that the test_scores and camp_attendance data sets were simulated using the function get_test_scores(), with the default values for each of its arguments: effect_size = 15, n_students = 1000, and n_grades = 4.1

The data-generating process embedded in get_test_scores() produces scores using the following equation

\[ y_{ig} = \alpha_i + \beta x_{ig} + \gamma_g + \epsilon_{ig} \] where \(i\) and \(g\) denote individuals and grades, respectively. Denoting the grade after application of treatment (i.e., 7 in the test_scores data) as \(G\), the treatment indicator \(x_{ig}\) is as follows:

\[ x_{ig} = \begin{cases} 1 \text{ if } i \text{ is treated and } g \geq G \\ 0 \text{ otherwise} \end{cases} \]
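
To make this data-generating process concrete, the following is a stylized sketch of a simulator consistent with the equations above. It is not the implementation of get_test_scores(): the grade range, the scale of the effects, and the purely random assignment to treatment are illustrative assumptions (as discussed below, the actual default assignment is not purely random).

# A stylized sketch of the DGP above; NOT farr::get_test_scores().
sim_scores <- function(effect_size = 15, n_students = 1000) {
  grades <- 5:8                                        # assumed: four grades, treatment from grade 7
  students <- tibble(
    id = seq_len(n_students),
    alpha = rnorm(n_students, mean = 400, sd = 50),    # individual effects (illustrative scale)
    treat = rbinom(n_students, 1, 0.5) == 1)           # random assignment (for simplicity only)
  grade_effects <- tibble(
    grade = grades,
    gamma = (grades - min(grades)) * 10)               # illustrative grade effects
  crossing(students, grade = grades) |>
    inner_join(grade_effects, by = "grade") |>
    mutate(post = grade >= 7,
           x = as.integer(treat & post),
           score = alpha + effect_size * x + gamma + rnorm(n(), sd = 10))
}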

Thus, as we will see, this is precisely the setting where multi-way fixed effects are appropriate. The individual fixed effect picks up \(\alpha_i\) and the grade fixed effect picks up \(\gamma_g\). The estimated effect size of 15.736 seems close to the true value \(\beta = 15\), but is it close enough? To better examine this issue, we generate a version of the data in which the treatment effect is zero (i.e., \(\beta = 0\)).

set.seed(2021)
test_scores_alt <- get_test_scores(effect_size = 0)

We then estimate regressions with grade- and individual-level fixed effects. One regression uses “OLS” standard errors and the other uses standard errors clustered by grade and id (see Section 5.6 for discussion of alternative approaches to calculating standard errors).

fms <- list(feols(score ~ I(treat * post) | grade  + id, 
                  vcov = "iid", data = test_scores_alt),
            feols(score ~ I(treat * post) | grade  + id, 
                  vcov = ~ grade + id, data = test_scores_alt))

Because we set the seed to the same value (2021) used to generate the original data, we get the same random draws, so our estimated treatment effect equals the earlier estimate reduced by \(15\) (i.e., 0.736) and the estimated standard error (without clustering) is exactly as we saw earlier (0.319). However, in Table 21.1 we see evidence that the coefficient is not “close enough” to the true value, as we would reject the true null hypothesis at the 5% level (\(p\)-value of 0.0211).

modelsummary(fms,
             estimate = "{estimate}{stars}",
             gof_map = c("nobs", "r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 21.1: Regression analysis: Simulated data
                     (1)        (2)
I(treat * post)    0.736**    0.736
                  (0.319)    (0.995)
Num.Obs.            4000       4000
R2                  0.923      0.923

What has “gone wrong” in this case? Some possible explanations are:

  1. Our standard errors are too low.
  2. This is simply bad luck (i.e., we would not expect these results for a different seed value).
  3. Our estimator is biased due to small-sample properties of the fixed-effects estimator.
  4. Our estimator is biased due to endogenous selection.

Let’s take these explanations one at a time. For the first, the fact that cluster-robust standard errors are significantly higher than OLS standard errors seems consistent with this explanation. But recall that the standard error is essentially an estimate of the standard deviation of the relevant coefficient over independent random draws of the data. So a more rigorous way to test this explanation is to simulate the data many times and examine whether the OLS standard errors tend to underestimate the variability of the coefficient estimates.

We first create a small function, extract_stats(), that collects the coefficient estimate, standard error, \(t\)-statistic, and \(p\)-value from a fitted model.

extract_stats <- function(fm, model_name) {
  tibble(model = model_name, 
         estimate = coef(fm),
         se = se(fm),
         t = coef(fm) / se(fm), 
         `p-value` = pvalue(fm)) 
}

The following function generates a data set using the get_test_scores() function, runs two regressions—one using “OLS” standard errors and one using two-way cluster-robust standard errors (“CL-2”)—and then returns statistics from the fitted models.

sim_run <- function(i, ...) {
  df <- get_test_scores(effect_size = 0, ...)
  
  fm1 <- feols(score ~ I(treat * post) | grade  + id, 
              vcov = "iid", data = df)
  fm2 <- feols(score ~ I(treat * post) | grade  + id, 
               vcov = ~ grade + id, data = df)
  
  bind_rows(extract_stats(fm1, model_name = "OLS"),
            extract_stats(fm2, model_name = "CL-2"))
}

The following code runs sim_run() 1,000 times and stores the results in sim_results.

num_sims <- 1000L

sim_results <- list_rbind(map(1:num_sims, sim_run))
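
Because the simulation runs are independent, the map() call above can also be parallelized. The following sketch assumes the future and furrr packages are installed; it is an optional alternative, not part of the original analysis.

library(future)
library(furrr)

plan(multisession)

# Parallel equivalent of the map() call above; seed = TRUE gives
# statistically sound (parallel-safe) random numbers.
sim_results <-
  list_rbind(future_map(1:num_sims, sim_run,
                        .options = furrr_options(seed = TRUE)))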

Given these data, we want to compare the standard deviation of the coefficient estimates with the mean of the estimated standard errors for the two approaches ("OLS" and "CL-2"):

sim_results |> 
  group_by(model) |> 
  summarize(se_obs = sd(estimate), 
            se_est = mean(se))
# A tibble: 2 × 3
  model se_obs se_est
  <chr>  <dbl>  <dbl>
1 CL-2   0.335  0.984
2 OLS    0.335  0.315

We can see that the two-way cluster-robust standard errors are actually far too high, while the OLS standard errors are pretty close to the true standard deviation of the coefficients. This seems to rule out the first explanation.

To evaluate the second explanation (“bad luck”), we can use the same data. Let’s consider critical \(p\)-values of 1% and 5% and count how many of the OLS test statistics would lead to rejection of the null hypothesis (\(\beta = 0\)) at each size. (We focus on the OLS \(t\)-statistics, as we have established that these produce the better standard error estimates.)

rejection_stats <-
  sim_results |> 
  filter(model == "OLS") |> 
  summarize(prop_01 = mean(`p-value` <= 0.01), 
            prop_05 = mean(`p-value` <= 0.05))

rejection_stats
# A tibble: 1 × 2
  prop_01 prop_05
    <dbl>   <dbl>
1   0.863   0.951

So, we reject a true null hypothesis \(86\%\) of the time at the \(1\%\)-level and \(95\%\) of the time at the \(5\%\)-level, even with what appear to be good standard errors. This would be an extraordinary degree of bad luck, suggesting that there is bias in our coefficient estimates.
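
One way to quantify how much “bad luck” the second explanation would require is to compare the observed number of rejections with what a correctly sized 5% test implies. This is an illustrative check, not part of the original analysis.

# Under a correctly sized 5% test, the number of rejections over num_sims
# independent simulations is Binomial(num_sims, 0.05).
n_reject <- round(rejection_stats$prop_05 * num_sims)
binom.test(n_reject, n = num_sims, p = 0.05)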

The properties of fixed-effects estimators are derived asymptotically, that is, as the number of observations and estimated fixed effects approaches infinity. However, in a typical panel data set, we have a fairly small number of time periods. In our analysis above, we have just four grades of data. But, because these are simulated data and not real people, we can easily expand the number of grades of data that we consider. Let’s consider n_grades = 12, the maximum handled by get_test_scores().2

sim_results_wide <- list_rbind(map(1:num_sims, sim_run, n_grades = 12))

We collect data on rejection rates for both \(\alpha = 0.01\) and \(\alpha = 0.05\). Based on our analysis above, we focus on the "OLS" model.

rejection_stats <-
  sim_results_wide |> 
  filter(model == "OLS") |> 
  summarize(prop_01 = mean(`p-value` <= 0.01), 
            prop_05 = mean(`p-value` <= 0.05))

rejection_stats
# A tibble: 1 × 2
  prop_01 prop_05
    <dbl>   <dbl>
1   0.329   0.574

Here we see evidence of a reduction of the bias, but the bias is not eliminated. We still reject a true null hypothesis \(33\%\) of the time at the \(1\%\)-level and \(57\%\) of the time at the \(5\%\)-level.

So let’s consider the final explanation: that the bias is due to non-random selection into treatment. It turns out that assignment to treatment in get_test_scores() depends on test scores in the year prior to the camp but, conditional on these scores, is completely random.3 This is termed selection on observables, which is often suggested as a basis on which causal inference can be justified when using approaches such as propensity-score matching.
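
As an illustrative check on this description, we can compare pre-camp scores of students who do and do not end up treated in the simulated data; a difference here is consistent with assignment depending on prior scores.

# Compare pre-treatment mean scores by (eventual) treatment status and grade.
test_scores_alt |>
  filter(!post) |>
  group_by(treat, grade) |>
  summarize(mean_score = mean(score), .groups = "drop")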

So what happens if we use completely random assignment to treatment? We can request such assignment by setting random_assignment = TRUE in get_test_scores().

sim_results_rand <- list_rbind(map(1:num_sims, sim_run, 
                                   random_assignment = TRUE))
rejection_stats <-
  sim_results_rand |> 
  filter(model == "OLS") |> 
  summarize(prop_01 = mean(`p-value` <= 0.01), 
            prop_05 = mean(`p-value` <= 0.05))

rejection_stats
# A tibble: 1 × 2
  prop_01 prop_05
    <dbl>   <dbl>
1   0.009   0.046

Here we see evidence of elimination of the bias. We reject a true null hypothesis \(0.9\%\) of the time at the \(1\%\)-level and \(4.6\%\) of the time at the \(5\%\)-level.

Thus it appears that random assignment is critical in this setting for achieving valid causal inferences. This is somewhat concerning because, non-random treatment assignment aside, the basic assumptions underlying causal inference appear to be satisfied: we have a plausible basis for the parallel trends assumption. First, we have grade effects that are the same for both treated and untreated observations. Second, we have individual effects that, while different between the treated and untreated observations, remain constant over the sample period and therefore do not undermine the parallel-trends assumption.4 What this suggests is that subtle biases can enter difference-in-differences analyses even in settings that are unrealistically simple. It seems reasonable to expect that biases exist—and are plausibly worse—in the more complex settings of actual research.

21.2 Voluntary disclosure

One paper that uses a multi-way fixed-effect structure like the one we analysed above is Li et al. (2018), who “seek to provide causal evidence on the proprietary cost hypothesis” (2018, p. 266). While “proprietary costs” are commonly assumed to be those caused by use of information by competitors, Verrecchia (1983, p. 181) uses the term more broadly.

Many researchers focus on settings where firms want to disclose favourable information to investors. For example, Verrecchia (1983, p. 181) writes of the “proprietary cost associated with releasing information which is unfavorable to a firm (e.g., a bank would be tempted to ask for repayment of its loan).” However, Verrecchia (1983, p. 182) points out that in some situations firms may prefer to disclose bad news: “One recent example of this is the response of the UAW (United Auto Workers) for fewer labor concessions in the face of an announcement by Chrysler Corporation’s chairman that that firm’s fortunes had improved.”

While Verrecchia (1983) considers the “reluctance of managers in certain highly competitive industries … to disclose favorable accounting data”, in his model he assumes that the cost of disclosure is constant and independent of the disclosed value.

Verrecchia (1983) can be viewed as providing a theoretical foundation for the proprietary cost hypothesis. The model in Verrecchia (1983) posits a capital market–driven incentive for disclosure of favourable information; absent such an incentive, there would be no disclosure in the Verrecchia (1983) setting. Another feature of the Verrecchia (1983) setting is that firms with favourable news are the ones disclosing, as the capital-market benefit of disclosure is greater for these firms.

Li et al. (2018) exploit the staggered implementation of the inevitable disclosure doctrine (IDD), which was adopted by state courts as part of the common law of their respective states at different times (some state courts later rejected the doctrine after adopting it). IDD provides an employer with injunctive relief to prevent a current or former employee from working for another company if doing so will lead to the inevitable disclosure of trade secret information. Li et al. (2018) argue that IDD increases the marginal benefits of non-disclosure, which they interpret as an increase in the cost of disclosure.

Li et al. (2018) focus their analysis of disclosure choice on the disclosure of customer identities in 10-Ks and assume that the state of a firm’s headquarters governs the applicability of IDD to the firm. We will conduct an approximate replication of certain analyses of Li et al. (2018) to understand the empirical approaches of this chapter. This requires combining data on customer disclosures with data on the dates of adoption of IDD by states and data on the states of companies’ headquarters.

21.2.1 Customer disclosures

We will use three data sets in the replication analysis. The first data set is compseg.seg_customer, which contains data derived from companies’ disclosures regarding significant customers. We use the second (compseg.names_seg) to link companies on Compustat with CIKs, which we use to link with data on headquarters locations. Finally, we use comp.funda for data on total sales.

The first code block creates remote references to these tables in a PostgreSQL database:

db <- dbConnect(RPostgres::Postgres(), bigint = "integer")

seg_customer <- tbl(db, Id(schema = "compseg", table = "seg_customer"))
names_seg <- tbl(db, Id(schema = "compseg", table = "names_seg"))
funda <- tbl(db, Id(schema = "comp", table = "funda"))

Alternatively, if you have these tables saved as parquet files, the same references can be created using DuckDB and load_parquet() from the farr package:

db <- dbConnect(duckdb::duckdb())

seg_customer <- load_parquet(db, schema = "compseg", table = "seg_customer")
names_seg <- load_parquet(db, schema = "compseg", table = "names_seg")
funda <- load_parquet(db, schema = "comp", table = "funda")

It turns out that compseg.seg_customer contains data on a number of different customer types (ctype), including geographic regions and markets.

seg_customer |>
  count(ctype) |>
  arrange(desc(n)) |>
  collect()
# A tibble: 7 × 2
  ctype         n
  <chr>     <dbl>
1 COMPANY  364802
2 GEOREG   150601
3 MARKET   108389
4 GOVDOM    39524
5 GOVFRN     5101
6 GOVSTATE   1456
7 GOVLOC      574

We are interested in observations with ctype equal to "COMPANY", which may or may not name individual customers.

seg_customer |>
  filter(ctype == "COMPANY") |>
  count(cnms) |>
  arrange(desc(n)) |>
  collect()
# A tibble: 44,045 × 2
   cnms             n
   <chr>        <dbl>
 1 Not Reported 68829
 2 NOT REPORTED 31162
 3 2 Customers   5981
 4 3 Customers   5263
 5 5 Customers   4676
 6 9 Customers   4199
 7 4 Customers   4130
 8 10 Customers  3854
 9 Not reported  2928
10 2 CUSTOMERS   2053
# ℹ 44,035 more rows

To better understand this setting, we examine one example chosen somewhat at random from the last year of the sample period. In its 10-K for the year ended 31 December 2010, Advanced Micro Devices, Inc. (AMD), a global semiconductor company, disclosed the following:

In 2010, Hewlett-Packard Company accounted for more than 10% of our consolidated net revenues. Sales to Hewlett-Packard consisted primarily of products from our Computing Solutions segment. Five customers, including Hewlett-Packard, accounted for approximately 55% of the net revenue attributable to our Computing Solutions segment. In addition, five customers accounted for approximately 46% of the net revenue attributable to our Graphics segment. A loss of any of these customers could have a material adverse effect on our business.

Elsewhere in its 10-K, AMD discloses the following:

In 2010, the Company had one customer that accounted for more than 10% of the Company’s consolidated net revenues. Net sales to this customer were approximately $1.4 billion, or 22% of consolidated net revenues, and were primarily attributable to the Computing Solutions segment.

Note that the Graphics and Computing Solutions segments had sales of $1,663 million and $4,817 million, respectively, and therefore made up 99.8% of AMD’s total sales of $6,494 million.

Thus 46% of Graphics sales represents $765 million, while 55% of Computing Solutions sales is $2,650 million. After subtracting the $1,400 million of sales to Hewlett-Packard (HP), Compustat appears to ascribe the remaining $1,250 million to “4 Customers”. The basis for assigning $1,400 million to “Not Reported” is unclear (while the sentence above merely says “one customer”, it is quite clear that this is HP, and Compustat elsewhere assumes as much). (In fact, it seems almost certain that there is double-counting in this case, making addition of the numbers in salecs problematic.)
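
As a quick check of the arithmetic in the previous paragraph (all amounts in millions of dollars, taken from the AMD disclosures described above):

0.46 * 1663          # ~765: five customers' share of Graphics revenue
0.55 * 4817          # ~2650: five customers' share of Computing Solutions revenue
0.55 * 4817 - 1400   # ~1250: remainder after subtracting sales to HP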

seg_customer |> 
  filter(gvkey == "001161", datadate == "2010-12-31") |>
  select(cnms, ctype, salecs, stype) |>
  collect()
# A tibble: 8 × 4
  cnms               ctype   salecs stype 
  <chr>              <chr>    <dbl> <chr> 
1 HEWLETT-PACKARD CO COMPANY  1400  BUSSEG
2 4 Customers        COMPANY  1250. BUSSEG
3 Not Reported       COMPANY  1400  BUSSEG
4 5 Customers        COMPANY   765  BUSSEG
5 Not Reported       COMPANY    NA  BUSSEG
6 Not Reported       COMPANY    NA  BUSSEG
7 Not Reported       COMPANY    NA  BUSSEG
8 International      GEOREG   5715. BUSSEG

Focusing on this case perhaps helps us to understand the disclosure decision faced by AMD. Even if AMD had not disclosed the identity of its largest customer, it would perhaps have been easy to infer that “Customer #1” was indeed HP, as HP was the largest personal computer manufacturer at the time.5

The second dataset we use is undisclosed_names, which contains the values of cnms considered to be non-disclosures and is part of the farr package.6

undisclosed_names
# A tibble: 460 × 2
   cnms         disclosed
   <chr>        <lgl>    
 1 Not Reported FALSE    
 2 NOT REPORTED FALSE    
 3 2 CUSTOMERS  FALSE    
 4 6 Customers  FALSE    
 5 4 Customers  FALSE    
 6 3 Customers  FALSE    
 7 8 Customers  FALSE    
 8 9 CUSTOMERS  FALSE    
 9 5 CUSTOMERS  FALSE    
10 3 CUSTOMERS  FALSE    
# ℹ 450 more rows

We can then perform a left_join() of the customer-related data from seg_customer with undisclosed_names to identify disclosures (and non-disclosures) of customer identities; customer names not matched in undisclosed_names are treated as disclosed (hence the coalesce(disclosed, TRUE) below).

Following Li et al. (2018), we focus on the period from 1994 to 2010. Below we show some of the resulting data, focusing on larger disclosed customers.

customers <-
  seg_customer |>
  filter(ctype == "COMPANY") |>
  collect()

sample_start <- as.Date("1994-01-01")
sample_end <- as.Date("2010-12-31")

disclosure_raw <-
  customers |>
  filter(between(datadate, sample_start, sample_end)) |>
  left_join(undisclosed_names, by = "cnms") |>
  mutate(disclosed = coalesce(disclosed, TRUE)) |>
  select(gvkey, datadate, cnms, salecs, disclosed)

Table 21.2 lists the largest disclosed customers in the data. Many of these observations relate to Walgreens and CVS, two large US pharmacy chains.

disclosure_raw |>  
  filter(disclosed) |> 
  arrange(desc(salecs)) |>
  select(-disclosed) |>
  head(n = 10)
Table 21.2: Largest disclosed customers
gvkey    datadate    cnms                  salecs
002751   2010-06-30  CVS Health Corp       23,641
002751   2009-06-30  WALGREEN CO           22,888
118122   1999-12-31  General Motors Corp   22,302
002751   2010-06-30  WALGREEN CO           21,671
002751   2009-06-30  CVS Caremark Corp     20,898
118122   2000-12-31  General Motors Corp   20,665
002751   2008-06-30  CVS Caremark Corp     20,040
002751   2007-06-30  CVS Caremark Corp     18,239
012136   2005-06-30  GENERAL ELECTRIC CO   17,712
002751   2008-06-30  Walgreen Co           17,307

As a shortcut for imposing the sample requirements used in Li et al. (2018) (e.g., the exclusion of financial services firms), we restrict our analysis to firms (GVKEYs) in the sample of Li et al. (2018). The llz_2018 data set from the farr package contains these GVKEYs.7

To get a feel for the data, we calculate the proportion of total sales attributable to the largest disclosed customer.

sales <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") |>
  select(gvkey, datadate, sale) |>
  collect()

biggest_customers <- 
  disclosure_raw |> 
  filter(disclosed) |>
  inner_join(sales, by = c("gvkey", "datadate")) |>
  group_by(gvkey, datadate) |> 
  mutate(max_customer = max(salecs) / sale)

Figure 21.1 provides a look at the proportion of sales attributed to the largest customer. Note that there are some cases where the largest customer appears to account for more than 100% of sales; many of these appear to be data-entry errors. But there are hundreds of firm-years in which essentially all revenue comes from a single customer.

There is also an interesting jump at 10% of sales. While this jump is likely due to the 10% cut-off that triggers mandatory disclosure, it is interesting to note that many firms disclose significant-customer information even when the largest customer accounts for less than 10% of sales, as seen in Figure 21.1.

biggest_customers |> 
  filter(between(max_customer, 0, 1)) |>
  ggplot(aes(max_customer)) +
  geom_histogram(binwidth = 0.02) + 
  scale_x_continuous(breaks = seq(0, 1, by = 0.1))
Figure 21.1: Histogram of proportion of sales from largest disclosed customer
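
To put a number on the visual impression from Figure 21.1, we can tabulate how often the largest disclosed customer falls below the 10% threshold (a quick check, not part of the original analysis).

# Count firm-years by whether the largest disclosed customer is below 10% of sales.
biggest_customers |>
  ungroup() |>
  filter(between(max_customer, 0, 1)) |>
  count(below_10pct = max_customer < 0.1)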

Interestingly, Li et al. (2018) only include firm-year observations where at least one customer is disclosed (whether identified or not) as having at least 10% of total sales. We construct a data set (prin_custs) to identify these firms.

prin_custs <-
  disclosure_raw |> 
  inner_join(sales, by = c("gvkey", "datadate")) |>
  group_by(gvkey, datadate) |> 
  filter(!is.na(salecs), sale > 0) |>
  summarize(prin_cust = max(salecs / sale, na.rm = TRUE), 
            .groups = "drop") |>
  mutate(has_prin_cust = prin_cust >= 0.1)

We next restrict the data to the llz_2018 sample and calculate the two measures of disclosure choice used in Li et al. (2018), one measuring the proportion of significant customers whose identities are not disclosed and another weighting that proportion by sales.

disclosure <-
    disclosure_raw |> 
    inner_join(sales, by = c("gvkey", "datadate")) |>
    semi_join(llz_2018, by = "gvkey") |>
    group_by(gvkey, datadate) |>
    summarize(ratio = mean(!disclosed),
              ratio_sale = sum((!disclosed) * salecs) / sum(salecs),
              .groups = "drop") |>
    mutate(year = year(datadate)) 

Table 21.3 provides summary statistics for the two measures of disclosure choice. These are fairly similar to those disclosed in Li et al. (2018) (e.g., Panel C of Table 1 reports a mean of 0.447 for Ratio 1, the measure we call ratio).

disclosure |> 
  pivot_longer(cols = c("ratio", "ratio_sale"), 
               names_to = "measure") |>
  group_by(measure) |>
  summarize(mean = mean(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE))
Table 21.3: Means of key disclosure measures
measure       mean   median      sd
ratio        0.459    0.333   0.449
ratio_sale   0.461    0.316   0.465

21.2.2 Data on adoption of IDD

The next data set we use is idd_periods, which is returned by get_idd_periods() from the farr package. This function relies on idd_dates, a data set in the farr package derived from data reported in Klasa et al. (2018) and reproduced in Li et al. (2018).

idd_dates
# A tibble: 24 × 3
   state idd_date   idd_type
   <chr> <date>     <chr>   
 1 AR    1997-03-18 Adopt   
 2 CT    1996-02-28 Adopt   
 3 DE    1964-05-05 Adopt   
 4 FL    1960-07-11 Adopt   
 5 FL    2001-05-21 Reject  
 6 GA    1998-06-29 Adopt   
 7 IL    1989-02-09 Adopt   
 8 IN    1995-07-12 Adopt   
 9 IA    1997-03-18 Adopt   
10 KS    2006-02-02 Adopt   
# ℹ 14 more rows

While idd_dates records dates of either adoption or rejection of the inevitable disclosure doctrine, the idd_periods table takes a sample period (defined using min_date and max_date) and breaks that period, for each state, into up to three kinds of period: pre-adoption, post-adoption, and post-rejection.

idd_periods <- get_idd_periods(min_date = sample_start, 
                               max_date = sample_end)
idd_periods
# A tibble: 65 × 4
   state period_type   start_date end_date  
   <chr> <chr>         <date>     <date>    
 1 AK    Pre-adoption  1994-01-01 2010-12-31
 2 AL    Pre-adoption  1994-01-01 2010-12-31
 3 AR    Pre-adoption  1994-01-01 1997-03-18
 4 AR    Post-adoption 1997-03-18 2010-12-31
 5 AZ    Pre-adoption  1994-01-01 2010-12-31
 6 CA    Pre-adoption  1994-01-01 2010-12-31
 7 CO    Pre-adoption  1994-01-01 2010-12-31
 8 CT    Pre-adoption  1994-01-01 1996-02-28
 9 CT    Post-adoption 1996-02-28 2010-12-31
10 DC    Pre-adoption  1994-01-01 2010-12-31
# ℹ 55 more rows
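
To see how such a table might be built, here is a hedged sketch that derives periods like these from idd_dates and the sample bounds. farr::get_idd_periods() is the authoritative implementation; the set of states used here (state.abb plus DC) and the handling of edge cases (e.g., adoptions before min_date) are assumptions.

# A sketch only; results may differ from farr::get_idd_periods().
all_states <- tibble(state = c(state.abb, "DC"))

adoptions <-
  idd_dates |>
  filter(idd_type == "Adopt") |>
  select(state, adopt_date = idd_date)

rejections <-
  idd_dates |>
  filter(idd_type == "Reject") |>
  select(state, reject_date = idd_date)

idd_events <-
  all_states |>
  left_join(adoptions, by = "state") |>
  left_join(rejections, by = "state")

idd_periods_sketch <-
  bind_rows(
    idd_events |>
      mutate(period_type = "Pre-adoption",
             start_date = sample_start,
             end_date = coalesce(adopt_date, sample_end)),
    idd_events |>
      filter(!is.na(adopt_date)) |>
      mutate(period_type = "Post-adoption",
             start_date = pmax(adopt_date, sample_start),
             end_date = coalesce(reject_date, sample_end)),
    idd_events |>
      filter(!is.na(reject_date)) |>
      mutate(period_type = "Post-rejection",
             start_date = pmax(reject_date, sample_start),
             end_date = sample_end)) |>
  filter(end_date > start_date) |>
  select(state, period_type, start_date, end_date) |>
  arrange(state, start_date)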

21.2.3 Data on state headquarters

The final piece of the puzzle in terms of data is state_hq, which contains the headquarters location used by Li et al. (2018). This data frame is provided by the farr package and is derived from data provided by Bill McDonald.8 The state_hq table reports the range of dates for SEC filings for which each combination of CIK (cik) and state of headquarters (ba_state) applies.9

state_hq
# A tibble: 53,133 × 4
   cik        ba_state min_date   max_date  
   <chr>      <chr>    <date>     <date>    
 1 0000066382 MI       1994-01-04 2018-10-10
 2 0000070415 NY       1994-01-04 2007-03-14
 3 0000084129 PA       1994-01-05 2018-10-04
 4 0000832922 OH       1994-01-05 2001-01-09
 5 0000909832 CA       1994-01-05 1996-12-20
 6 0000004911 PA       1994-01-06 1996-01-05
 7 0000038777 CA       1994-01-06 2018-11-09
 8 0000726977 IL       1994-01-06 1997-11-14
 9 0000756764 NY       1994-01-06 1999-07-08
10 0000764478 MN       1994-01-06 2018-12-07
# ℹ 53,123 more rows

To use these data, we need to link CIKs with GVKEYs, which we do using the table compseg.names_seg provided alongside Compustat’s segment data.10

ciks <-
  names_seg |>
  filter(!is.na(cik)) |>
  select(gvkey, cik) |>
  collect()

The data frame state_hq_linked provides the state of headquarters applicable to each firm-year (i.e., combination of gvkey and datadate).

state_hq_linked <- 
  state_hq |>
  inner_join(ciks, by = "cik") |>
  inner_join(disclosure, 
             join_by(gvkey, between(y$datadate, x$min_date, x$max_date))) |>
  select(gvkey, datadate, ba_state) |>
  rename(state = ba_state)

21.2.4 Regression analysis

Finally, we pull all these pieces together. Like Li et al. (2018), we delete “post-rejection” observations and log-transform the dependent variable.

reg_data <-
  disclosure |>
  inner_join(prin_custs, by = c("gvkey", "datadate")) |>
  filter(has_prin_cust) |>
  inner_join(state_hq_linked, by = c("gvkey", "datadate")) |>
  inner_join(idd_periods, 
             join_by(state,
                     datadate >= start_date, datadate <= end_date)) |>
  filter(period_type != "Post-rejection") |>
  mutate(post = period_type == "Post-adoption",
         ln_ratio = log(1 + ratio),
         ln_ratio_sale = log(1 + ratio_sale)) |>
  select(-start_date, -end_date)

reg_data |> count(period_type)
# A tibble: 2 × 2
  period_type       n
  <chr>         <int>
1 Post-adoption 15836
2 Pre-adoption  12921

To keep the analysis simple, we do not include controls in the regressions below (apart from firm and year fixed effects). In a typical regression analysis with firm and year fixed effects, omission of controls will often not have a material impact on coefficient estimates.

fms <- list(feols(ln_ratio ~ post | gvkey + year, 
                  vcov = "iid", data = reg_data),
            feols(ln_ratio_sale ~ post | gvkey + year, 
                  vcov = "iid", data = reg_data))

We then run the analysis again, but focus on the subset of firms with observations in both the pre- and post-adoption periods (switchers). We explore this analysis more deeply in the discussion questions.

switchers <- 
  reg_data |> 
  distinct(gvkey, post) |> 
  group_by(gvkey) |> 
  filter(n() > 1) |>
  select(gvkey) |>
  distinct() |>
  ungroup()

reg_data_switchers <-
  reg_data |>
  semi_join(switchers, by = "gvkey") 

fms[[3]] <- feols(ln_ratio ~ post | gvkey + year, 
                  vcov = "iid", data = reg_data_switchers)
fms[[4]] <- feols(ln_ratio_sale ~ post | gvkey + year, 
                  vcov = "iid", data = reg_data_switchers)

The regression results reported in Table 21.4 are similar to those in Table 2 of Li et al. (2018).

modelsummary(fms, 
             estimate = "{estimate}{stars}",
             gof_map = c("nobs", "r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 21.4: Effect of IDD adoption
              (1)        (2)         (3)       (4)
postTRUE    0.016**    0.024***    0.014     0.019**
           (0.008)    (0.008)     (0.009)   (0.010)
Num.Obs.     28757      24581       3617      3164
R2           0.672      0.711       0.624     0.661

We next conduct analysis similar to that reported in Table 3 of Li et al. (2018), which examines the effect of IDD adoption on disclosure by year related to adoption.

We first make factor_t(), a small function that takes a vector of numbers and returns them as a factor that has nicely formatted strings ordered by the original numbers.11 This will be useful for presenting “nice” regression results.

factor_t <- function(t) {
  t <- relevel(as.factor(t), ref = "-1")
  levels(t) <- str_replace(levels(t), "^([0-9]+)", "+\\1")
  levels(t) <- str_replace_all(levels(t), "([0-9]+)", " \\1")
  levels(t) <- str_replace(levels(t), "^", " ")
  t
}

We next make a function that turns a vector of year-difference values into a factor—with values outside \([-5, +5]\) collapsed into \(-5\) and \(+5\) as appropriate—and labels the result in a way that facilitates regression output. Note that if a firm is never treated, its value of \(t\) will be missing and we set the value for these years to -Inf (R’s way of saying \(-\infty\)).

year_diff <- function(t) {
  t <- case_when(t < -5 ~ -5,
                 t > 5 ~ 5,
                 is.na(t) ~ -Inf,
                 .default = t) 
  factor_t(t)
}
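
A quick sanity check of year_diff() applied to a few illustrative values (the NA stands in for a never-treated firm):

# Values below -5 and above +5 are collapsed into -5 and +5, -1 is the
# reference level, and NA maps to -Inf.
as.character(year_diff(c(-7, -2, -1, 0, 3, 8, NA)))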

We construct a data set (switch_years) that contains data on the first year of adoption of IDD for each GVKEY.

switch_years <-
  reg_data |>
  group_by(gvkey) |> 
  arrange(datadate) |> 
  filter(period_type == "Post-adoption",
         lag(period_type) == "Pre-adoption") |>
  group_by(gvkey) |>
  summarize(adoption_year = min(year), .groups = "drop")

We next construct our data set for this set of regressions (reg_data_t) by merging in data on adoption years and then calculating t using the year_diff() function we created above.

reg_data_t <-
  reg_data |>
  left_join(switch_years, by = "gvkey") |>
  mutate(t = year_diff(year - adoption_year))

Again we conduct two regressions for each of the two disclosure measures; we explore the relations between these regressions more deeply in the discussion questions.

fms <- list(feols(ln_ratio ~ t | gvkey + year, vcov = "iid", 
                  data = reg_data_t),
            feols(ln_ratio_sale ~ t | gvkey + year, vcov = "iid",
                  data = reg_data_t),
            feols(ln_ratio ~ t | gvkey + year, vcov = "iid",
                  data = filter(reg_data_t, !str_detect(t, "Inf"))),
            feols(ln_ratio_sale ~ t | gvkey + year, vcov = "iid",
                  data = filter(reg_data_t, !str_detect(t, "Inf"))))

Results from these regressions are provided in Table 21.5.

modelsummary(fms, 
             estimate = "{estimate}{stars}",
             gof_map = c("nobs", "r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 21.5: Effect of IDD adoption by year
             (1)         (2)         (3)         (4)
t - 5      −0.004       0.001      −0.017      −0.009
           (0.019)     (0.021)     (0.027)     (0.029)
t - 4      −0.017      −0.030      −0.026      −0.038
           (0.022)     (0.023)     (0.025)     (0.026)
t - 3      −0.030      −0.029      −0.031      −0.031
           (0.020)     (0.020)     (0.021)     (0.022)
t - 2      −0.008      −0.003      −0.014      −0.010
           (0.018)     (0.018)     (0.019)     (0.019)
t + 0      −0.001       0.016       0.007       0.020
           (0.015)     (0.015)     (0.016)     (0.017)
t + 1       0.009       0.025       0.021       0.031
           (0.016)     (0.017)     (0.018)     (0.020)
t + 2      −0.008       0.010       0.009       0.017
           (0.017)     (0.018)     (0.021)     (0.023)
t + 3       0.023       0.021       0.038       0.026
           (0.019)     (0.020)     (0.025)     (0.027)
t + 4       0.028       0.039*      0.041       0.042
           (0.019)     (0.021)     (0.027)     (0.030)
t + 5       0.044***    0.048***    0.046       0.039
           (0.015)     (0.016)     (0.032)     (0.036)
Num.Obs.    28757       24581       3110        2740
R2          0.673       0.712       0.625       0.668

Next we make a small function to plot the estimated coefficients over time. The function first arranges the coefficients for each period in event-time as a data frame and turns the coefficient labels (e.g., \(t + 1\)) into values (i.e., \(1\)). In making the plot, we add in the reference period (\(t-1\)), which has a coefficient value of \(0\) by definition, and add 95% confidence intervals for the coefficients.

plot_coefs <- function(fm) {
  
  coefs <- 
    tibble(t = names(coef(fm)),
           value = coef(fm),
           se = se(fm)) |>
    mutate(t = as.integer(str_replace_all(t, "[t ]+", ""))) |>
    filter(t != -Inf)

  ci <- 0.95
  mult <- qnorm(1 - (1 - ci) / 2)

  tibble(t = -1, value = 0, se = 0) |>
    bind_rows(coefs) |>
    mutate(mean = value,
           top = value + mult * se,
           bot = value - mult * se) |>
    ggplot(aes(x = t, y = mean)) + 
    geom_errorbar(aes(ymin = bot, ymax = top), width = .1) +
    geom_line() +
    geom_point() +
    scale_x_continuous(breaks = seq(from = min(coefs$t), 
                                    to = max(coefs$t)))
}

And we use this function to plot the coefficients for the first model in Figure 21.2.

plot_coefs(fms[[1]])
Figure 21.2: Plot of coefficients by year

21.2.5 Discussion questions

  1. The proprietary cost hypothesis tested by Li et al. (2018) posits that an increased cost of disclosure will lead to a reduced supply of disclosure. What rival theories exist that would make alternative predictions? Are there other elements of disclosure theory that might be tested? Why do you think Li et al. (2018) focused on this specific element of the proprietary cost hypothesis?
  2. In the analysis above, we do not include the control variables considered by Li et al. (2018). For example, Li et al. (2018) “include R&D expenditures to sales, advertisement [sic] expenditure to sales, and intangible assets scaled by total assets to control for a firm’s proprietary costs of disclosure.” Using the approach outlined in Chapter 4, in what circumstances would it be necessary to control for “a firm’s proprietary costs of disclosure” in this way? Do these circumstances seem applicable in this setting, where the treatment is a (presumably exogenous) shock to disclosure costs and the outcome is disclosure choices?
  3. What differs between the regressions reported in columns (1) and (3), and (2) and (4) of Table 21.4, respectively? What does this tell you about what drives the regression results in this setting? What happens if you omit the year fixed effects from both sets of regressions? What does this tell you about the role of the “non-switchers” (i.e., firms not in the switchers data frame) in the regression?
  4. Would you expect the inclusion of controls (see the question above) to have a significant impact on the regression results? Why or why not?
  5. What differs between the regressions reported in columns (1) and (3), and (2) and (4) of Table 21.5, respectively? What happens if you omit the year fixed effects from both sets of regressions? What does this tell you about what drives the regression results in this setting?
  6. What patterns do you observe in the coefficients reported in the table titled “Effect of IDD adoption by year”? Do these conform to what you would expect from Li et al. (2018)? (It may be easiest to focus on these in groups, e.g., those in \(t-5\) to \(t-2\), those in \(t + 0\) to \(t+3\) and those for \(t+4\) and \(t+5\).)
  7. How do the variables in Table 21.5 differ from those used in Table 3 of Li et al. (2018)? Modify the code above (e.g., year_diff()) to produce analysis closer to that reported in Table 3 of Li et al. (2018).
  8. How helpful do you find Figure 21.2 (plot of the coefficients by year)?
  9. The year_diff() function collapses years after \(t + 5\) and before \(t - 5\) into years \(t + 5\) and \(t - 5\), respectively. When would this approach make sense? What would be one alternative approach to handling these years? Does your suggested approach make a difference in this setting?
  10. Describe the data set created from the following code. What proportion of the firms in the data set have same_state equal to TRUE? For the purposes of empirical analysis, do the firms with same_state equal to FALSE enhance, or detract from, our ability to draw causal inferences about the effect of adoption of IDD?
    switch_years <-
      reg_data_switchers |>
      group_by(gvkey) |> 
      arrange(datadate) |> 
      mutate(same_state = state == lag(state), 
             adoption_year = period_type == "Post-adoption" & 
               lag(period_type) == "Pre-adoption") |> 
      filter(adoption_year) |> 
      ungroup() 
  11. What issues might be implied by the following data? How might you address these?
    reg_data_t |> count(t)
    # A tibble: 12 × 2
       t           n
       <fct>   <int>
     1 " - 1"    322
     2 " -Inf" 25647
     3 " - 5"    232
     4 " - 4"    117
     5 " - 3"    154
     6 " - 2"    219
     7 " + 0"    385
     8 " + 1"    306
     9 " + 2"    233
    10 " + 3"    192
    11 " + 4"    168
    12 " + 5"    782

21.3 Further reading

Chapter 5 of Angrist and Pischke (2008) provides a good introduction to the topic of fixed effects and panel data. Chapters 8 and 9 of Cunningham (2021) go beyond the topics here and provide an excellent pathway to recent developments in the fixed-effects literature. Chapters 16 and 18 of Huntington-Klein (2021) also provide excellent coverage, including discussion of some issues not covered in this book.


  1. Note that the function get_test_scores() returns a version of the data that merges test_scores and camp_attendance and calculates the post and treat variables we added in Chapter 3.↩︎

  2. This maximum exists, not only because it seems inhumane to require even simulated students to attend school for longer, but also because the grade effects are hard-coded and only for 12 years.↩︎

  3. This can be seen from examination of the source code for the function, but there is no need to do so.↩︎

  4. The parallel trends assumption is likely subtly violated because of regression-to-the-mean effects in test scores, but not in a way likely to be detected using the usual “parallel trends plots” used in the literature. A researcher seeing the underlying data-generating process is likely to find it very easy to imagine the “can-opener” of parallel trends we discussed in Chapter 19. To paraphrase Macbeth: “Is this a [can-opener] which I see before me, The handle toward my hand? Come, let me clutch thee. I have thee not, and yet I see thee still. Art thou not, fatal vision, sensible To feeling as to sight? … Thou marshal’st me the way that I was going [i.e., to use DiD], And such an instrument I was to use.”↩︎

  5. See Gartner (2011).↩︎

  6. See the source code for the farr package for details on the somewhat manual construction of this data set. This likely mirrors a process used for Li et al. (2018) itself.↩︎

  7. See the source code for the farr package for the code used in creating this data set.↩︎

  8. See Bill McDonald’s website for the original data.↩︎

  9. Any gaps are filled in by extending the min_date back to the date after the preceding max_date.↩︎

  10. Note that names_seg provides just one CIK for each GVKEY, though firms can change CIKs over time, which means that we will lose observations related to firms that previously used a different CIK from that found on names_seg.↩︎

  11. Examine the output of as.character(factor_t(c(-2, -1, 0, 1, 2))) to see what factor_t() is doing.↩︎