6 Statistical inference

This chapter is under construction.

In this chapter, we study statistical inference in the context of null hypothesis significance testing (NHST), which is the predominant paradigm in empirical accounting research. We examine NHST in fairly procedural terms without spending too much time justifying it. One reason for doing so is that even if we simply accept NHST as a sensible approach to science, in practice the way NHST is used is profoundly different from the assumptions supporting its use.45 So real problems with NHST exist well before reaching its epistemological foundations.

Another issue with NHST is that statistical inference in most settings is based on limit theorems—results from mathematical statistics that require fairly advanced mathematics to prove—such as laws of large numbers and central limit theorems.46 To minimize the mathematical requirements here, we use simulations to illustrate ideas rather than trying to provide mathematically precise results. For readers looking to go deeper on either dimension, we provide a guide to further reading at the end of the chapter.

The code in this chapter uses the following packages. For instructions on how to set up your computer to use the code found in this book, see Section 1.2.1.

library(dplyr, warn.conflicts = FALSE)
library(stargazer)
library(ggplot2)
library(purrr)      # For map()
library(tidyr)      # For nest()
library(sandwich)   # For NeweyWest()
library(plm)        # For pmg() and vcovNW()
library(lfe)        # For felm()
library(farr)       # For get_got_data()

We use the stargazer package to produce neat output from regressions. For the HTML version of this book, we set sg_format to "html", but "text" would be a better option if looking at the results interactively, and you would probably use "latex" if compiling a PDF.47

sg_format <- "html"

6.1 Some observations

Students of econometrics in PhD programs often spend much of their time learning about the asymptotic properties of estimators, which can be understood as how estimators behave as sample sizes tend toward infinity. However, there are two gaps in this standard pedagogical approach. First, it is often not made clear why we care about asymptotic properties of estimators. A practical student might question the relevance of asymptotic properties when sample sizes in practice can at times be quite large, but are rarely infinite.

Second, in practice, researchers soon forget that the estimators they use are justified by their asymptotic properties and rarely exploit what they learned in econometrics classes to study new estimators. For example, in this chapter we will devote some time to the study of standard errors. Almost all estimators of standard errors are motivated by asymptotic properties and thus can perform poorly in actual research settings. Yet Gow et al. (2010) show that estimators can gain currency among researchers without any formal analysis of their properties, asymptotic or otherwise.

One reason for studying the asymptotic properties of estimators is that these are mathematically tractable for a well-trained econometrician. For a practical researcher, showing that an estimator has good asymptotic properties is more or less necessary. An estimator that is not consistent is apt to produce significantly biased estimates in finite samples (one can think of consistency as the asymptotic equivalent of unbiasedness).48 And good asymptotic properties might be practically sufficient for the estimator to behave well in finite samples. An estimator that is asymptotically efficient and asymptotically normal has a good chance of being an efficient estimator in finite samples and having a distribution that is sufficiently close to the normal distribution for practical statistical inference.

In this chapter, we will not derive asymptotic properties of estimators. Instead, we will direct readers to econometrics texts for these details. Additionally, while we will provide an intuitive introduction to laws of large numbers and central limit theorems, we will not even state these formally, let alone derive them. Instead, in keeping with a theme of this book, we will use simulations to give the reader an intuitive feel for where these results from mathematical statistics come from and what they mean.

6.2 Data-generating processes

Like causal inference, thinking about statistical inference generally involves quite abstract reasoning. Before getting into the statistics, it is helpful to imagine that the data we observe are generated by an underlying process (hence a data-generating process or DGP). A DGP will often look like a model, but in principle these are distinct ideas. The DGP is a feature of reality that we generally do not observe directly, while a model is a guess or useful simplification of a conjectured DGP that humans use to understand the DGP.

In some settings we may believe that our model captures all the essential details of the DGP. While the results of a coin toss will depend on the rotational momentum imparted to the coin, the vertical velocity of the coin, imperfections on the coin’s surface, prevailing humidity, temperature, wind velocity, and so on, we may be happy simply to use a model in which coin tosses are independent and equally likely to be one outcome as the other.

Part of the potential for confusion between these two ideas is that inevitably when discussing a DGP, we use a model, as we do here. Suppose that the height of adult individuals is drawn from a normal distribution with mean of \(\mu = 165\) centimetres and standard deviation of \(\sigma = 8\) centimetres.

hgt_mean <- 165
hgt_sd <- 8

This statement could be interpreted as the process for generating the heights of the billions of adult people who happen to inhabit Earth at this time. In a sense, the people who actually exist are a non-deterministic sample of the potential people who could exist.49

Alternatively, we might take the current inhabitants of the world as a given and posit that the empirical distribution of heights is actually a normal distribution with parameters \(\mu\) and \(\sigma\) and focus instead on taking a random sample from that population. Looking ahead, we might use a random sample to make inferences about the empirical distribution of heights of adult humans who currently inhabit the world or to make inferences about the data-generating process for adult heights. In the context of causal inference, it is often the latter kind of inference that we want to make, but for present purposes we focus on the former kind.

There is no doubt that a two-parameter normal distribution is an oversimplification of the actual DGP, hence more of a model than an actual DGP. For example, we make no allowance for heights varying around the world, for the tendency of heights to increase over time (hence the reference to “current” above), or the tendency for men to be taller than women, each of which may undermine the applicability of a simple assumed DGP. But for current purposes, we ignore such quibbles and simply assume that heights come from a normal distribution \((\mu, \sigma)\).

We can simulate the taking of a random sample of size \(n = 100\) by using the rnorm function we saw in Chapter 3.

hgt_sample <- rnorm(100, mean = hgt_mean, sd = hgt_sd)

Given that we know that \(\mu = 165\) and that \(\sigma = 8\), we can calculate expected values for the sample mean and sample standard deviation of the heights in our sample. Here the sample mean is calculated as the sum of the observed heights (\(\{x_i: i \in (1, \dots, n)\}\)) divided by \(n\), where \(n = 100\).

\[ \overline{X}_{n} = \frac{\sum_{i = 1}^{n} x_i}{n} \]

\[ \begin{aligned} \mathbb{E}[\overline{X}_{n}] &= \mathbb{E}\left[\frac{\sum_{i = 1}^{n} x_i}{n}\right] \\ &= \frac{\sum_{i = 1}^{n} \mathbb{E}\left[ x_i\right]}{n} \\ &= \frac{n \mu}{n} \\ &= \mu \end{aligned} \] To calculate the variance of \(\overline{X}_{n}\), we note that the variance of \(ax\) where \(a\) is a constant and \(x\) is a random variable is \(\mathrm{Var}[ax] = a^2 \mathrm{Var}[x]\) and the variance of the sum of independent and identically distributed variables \(x_i\), \(i \in (1, \dots, n)\), each with variance \(\sigma^2\), is simply the sum of their variances:

\[ \begin{aligned} \mathrm{Var} \left[\sum_{i = 1}^{n} x_i \right] &= \sum_{i = 1}^{n} \mathrm{Var} \left[x_i \right] \\ &= n \sigma^2 \end{aligned} \]

\[ \begin{aligned} \mathrm{Var}[\overline{X}_{n}] &= \mathrm{Var} \left[\frac{\sum_{i = 1}^{n} x_i}{n}\right] \\ &= \frac{1}{n^2} \mathrm{Var} \left[\sum_{i = 1}^{n} x_i \right] \\ &= \frac{1}{n} \sigma^2 \end{aligned} \] The standard deviation of the sample mean is the square root of the variance: \(\frac{\sigma}{\sqrt{n}}\).

Because in this section we are in an idealized world in which we know the underlying DGP, we can test our calculations by generating data. We can incorporate the code we used to make a sample above into a function get_hgt_sample that takes a single argument (n) for the sample size.

get_hgt_sample <- function(n) {
  rnorm(n, mean = hgt_mean, sd = hgt_sd)
}

Calling get_hgt_sample with n = 10 we get a sample of 10 random numbers.

get_hgt_sample(10)
##  [1] 161.0218 162.9585 158.2503 188.7250 173.4118 160.1665 168.9922 175.2958
##  [9] 172.4536 150.6659

But we can easily generate ten thousand samples. In many programming languages, we would use a for loop to do this, as in the following code. In R, it is natural to use a list to store results from multiple runs of a function, so we initialize hgt_samples as an empty list and then use the for command to repeat the get_hgt_sample(100) call 10,000 times, storing each sample in a slot in hgt_samples.

n_samples <- 10000
set.seed(2023)

hgt_samples <- list()
for (i in 1:n_samples) {
  hgt_samples[[i]] <- get_hgt_sample(100)
}

We can take a peek at the first ten observations in the 463rd sample like this:

hgt_samples[[463]][1:10]
##  [1] 168.4070 164.0360 163.2143 155.2887 163.0420 170.6284 146.7339 168.8120
##  [9] 172.7608 156.1855

We can calculate the mean of the 463rd sample like this:

mean(hgt_samples[[463]])
## [1] 164.4264

We could use a for loop to calculate the mean for each of the samples in one go:

hgt_sample_means <- list()
for (i in 1:length(hgt_samples)) {
  hgt_sample_means[[i]] <- mean(hgt_samples[[i]])
}

Here hgt_sample_means is a list and we can peek at the first 5 values like this:

hgt_sample_means[1:5]
## [[1]]
## [1] 165.5401
## 
## [[2]]
## [1] 166.6662
## 
## [[3]]
## [1] 163.7082
## 
## [[4]]
## [1] 165.0866
## 
## [[5]]
## [1] 164.1906

For many purposes, it will be easier to work with hgt_sample_means if it is a vector rather than a list and we can use the unlist function to convert it to a vector:

unlist(hgt_sample_means[1:5])
## [1] 165.5401 166.6662 163.7082 165.0866 164.1906

Additionally, we could use the function lapply rather than a for loop. The function lapply applies a function to a list and returns a list of the results. Combining lapply and unlist, we could use the following code instead:

hgt_sample_means <- unlist(lapply(hgt_samples, mean))

We can see we have a nice vector of means from a single line of code.

hgt_sample_means[1:10]
##  [1] 165.5401 166.6662 163.7082 165.0866 164.1906 164.9117 164.3328 164.5542
##  [9] 164.1957 164.8957

In fact, we could use lapply rather than a for loop to create hgt_samples in the first place. Note that to get n_samples samples, we want to lapply the function get_hgt_sample to the vector 1:n_samples. But we have a small problem: lapply will pass each value of that vector to get_hgt_sample as its first (and only) argument, n, which would give us samples of sizes 1 through n_samples rather than samples of size 100. One approach to addressing this is to use an anonymous function, as we do below.

set.seed(2023)
hgt_samples <- lapply(1:n_samples, function(x) get_hgt_sample(100))

We call function(x) get_hgt_sample(100) an anonymous function because we do not give it a name (by storing it in a variable). We could do that, but as can be seen below it means storing a function in a new variable (get_hgt_sample_alt) without producing code that is easier to read.

set.seed(2023)
get_hgt_sample_alt <- function(x) get_hgt_sample(100)
hgt_samples <- lapply(1:n_samples, get_hgt_sample_alt)

Another approach would be to make a version of get_hgt_sample that does expect an argument related to the index of the sample being generated. In fact, in some settings it might be useful to retain that information and in the function below, we have get_hgt_sample return a data frame (tibble) containing the sample index in the first column (i) and the actual sample in the second column (data).

get_hgt_sample_df <- function(i, n) {
  tibble(i, data = list(rnorm(n, mean = hgt_mean, sd = hgt_sd)))
}

To see what get_hgt_sample_df is doing, we can call it once to get a single (small) sample.

one_sample <- get_hgt_sample_df(i = 1, n = 10)
one_sample
## # A tibble: 1 × 2
##       i data      
##   <dbl> <list>    
## 1     1 <dbl [10]>
one_sample$data[[1]]
##  [1] 159.2986 172.3488 170.3377 172.4086 161.2063 171.0710 164.7035 183.3683
##  [9] 152.9049 171.6578
set.seed(2023)
hgt_samples <- lapply(1:n_samples, get_hgt_sample_df, n = 100)

We can access the underlying data, but at this stage it may seem that we are adding complexity for no particular reason, as might be suggested by comparing the code below to the equivalent code in our first attempt above.

unlist(hgt_samples[[463]]$data)[1:10]
##  [1] 168.4070 164.0360 163.2143 155.2887 163.0420 170.6284 146.7339 168.8120
##  [9] 172.7608 156.1855

One reason for this complexity is that currently hgt_samples is a list of tibbles. We can use the function bind_rows to bind them together into a single data frame.

hgt_samples_df <- bind_rows(hgt_samples)

Note that hgt_samples_df is starting to get fairly large:

print(object.size(hgt_samples_df), units = "auto")
## 8.2 Mb

While the size of hgt_samples_df does not present a problem for a modern computer, more complicated analyses might produce objects sufficiently large that we would want to pay attention to their sizes.

We can now use map functions from the purrr library to extract statistics from hgt_samples_df. Because we want the result of each calculation here to be a vector, we use map_vec. To keep our results small, we drop the data column from our table of statistics (hgt_samples_stats). (We could always use the column i to join hgt_samples_stats with hgt_samples_df if we wanted to restore it.)

hgt_samples_stats <- 
  hgt_samples_df %>%
  mutate(n = map_vec(data, length),
         mean = map_vec(data, mean),
         se = map_vec(data, sd)/sqrt(n)) %>%
  select(-data)

We can now assess how well our formulas worked out in our simulations. The mean of the estimated means is indeed pretty close to the \(\mu\) parameter of 165 and the standard deviation of those estimated means is also very close to the value \(\frac{\sigma}{\sqrt{n}}\) (0.8).

hgt_stats_summ <-
  hgt_samples_stats %>%
  summarize(n = mean(n),
            mean_mean = mean(mean),
            sd_mean = sd(mean),
            mean_se = mean(se),
            sd_se = sd(se))

hgt_stats_summ
## # A tibble: 1 × 5
##       n mean_mean sd_mean mean_se  sd_se
##   <dbl>     <dbl>   <dbl>   <dbl>  <dbl>
## 1   100      165.   0.810   0.799 0.0569
hgt_stats_summ %>%
  transmute(mean_error = mean_mean - hgt_mean,
            se_error = sd_mean - hgt_sd/sqrt(n),
            se_est_error = mean_se - sd_mean)
## # A tibble: 1 × 3
##   mean_error se_error se_est_error
##        <dbl>    <dbl>        <dbl>
## 1     0.0110   0.0100      -0.0112

6.3 Hypothesis testing

6.3.1 Normal distribution, known standard deviation

Of course, in reality we generally do not observe parameters like hgt_mean and hgt_sd, and our task as statisticians is to form and test hypotheses about those unobserved parameters. With null hypothesis significance testing, the researcher starts with a null hypothesis and asks whether the data are “sufficiently unusual” to justify rejection of that hypothesis.

As alluded to above, there are many deep issues lurking in this approach that we largely ignore. One such issue that we ignore is the source for null hypotheses. It is often implied that null hypotheses somehow represent the prior or default beliefs of “science”. For example, in the context of pharmaceuticals, the null hypothesis is usually that a potential drug has zero (or negative) effect on the target outcome (say, curing cancer) and this null hypothesis has to be rejected for a drug to receive approval from regulators.50

Suppose we have a null hypothesis that the mean height of adult individuals is \(\mu_{0} = 170\) and that we know that heights are drawn from a normal distribution with standard deviation of \(\sigma = 8\) centimetres.

hgt_mean_null <- 170

Now we can get a sample of data to test our hypothesis:

set.seed(2023)
sample_size <- 100
test_sample <- get_hgt_sample(sample_size)
mean_est <- mean(test_sample)

Here we have a mean of 166. How unusual is a value of 166 under the null hypothesis?

Given we know that the sample mean will be normally distributed with standard deviation equal to \(\sigma/\sqrt{n}\), we can calculate the probability of observing a sample mean this low (or lower) when the null hypothesis is true using the pnorm function:

pnorm(mean_est, mean = hgt_mean_null, sd = hgt_sd/sqrt(sample_size))
## [1] 1.238951e-08

In practice, it is more common to transform the mean estimate into a z-statistic, which under the null hypothesis has a standard normal distribution (denoted \(N(0,1)\)), with mean zero and standard deviation of one.

\[ z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}} \]

z_stat <- (mean_est - hgt_mean_null)/(hgt_sd/sqrt(sample_size))

We can see that using the z-statistic (here \(-5.57\)) yields exactly the same p-value.

pnorm(z_stat)
## [1] 1.238951e-08

It is generally posited that the criterion for rejection of the null hypothesis is established prior to analysing the data. The set of possible outcomes is partitioned into two disjoint sets that together comprise the event space. The critical value represents the threshold between these two sets expressed in terms of the test statistic.

The critical value is generally calculated based on a probability threshold (also known as the significance level or size of the test) below which results are deemed statistically significant. A widely used significance level is 5%. Given that our null hypothesis is \(H_{0}: \mu = 170\), we could reject this null hypothesis because the test statistic is either too low or too high, so we need to account for both possibilities when determining the critical value.

As such, we set the critical value for “surprisingly low” test statistics at the point on the distribution where, given the null hypothesis, only 2.5% (\(5\% \div 2\)) of values are more extreme in the negative direction. And we set the critical value for “surprisingly high” test statistics at the point on the distribution where, given the null hypothesis, only 2.5% (\(5\% \div 2\)) of values are more extreme in the positive direction.

qnorm(0.025)
## [1] -1.959964
qnorm(1 - 0.025)
## [1] 1.959964

We see that these critical values are identical in magnitude and only differ in sign. We can therefore express our critical value \(c\) as \(c = 1.96\) and identify the rejection region as \(\{z: |z| > c \}\). As the absolute value of our observed z-statistic (\(5.57\)) easily exceeds the critical value, we can reject the null hypothesis.

We can also calculate the p-value of our observed result. Again we need to recognize that values as extreme as the one observed could be either positive or negative, so the probability of finding data at least as surprising as the data we have is actually twice the probability calculated above: \(p = 2.4779025\times 10^{-8}\). (That the observed p-value is below 5% is equivalent to \(|z| > c\).)
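
In code, the two-tailed p-value can be obtained directly from the z-statistic calculated above:

2 * pnorm(-abs(z_stat))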

6.3.2 Normal distribution, unknown standard deviation

In reality a researcher does not simply know the standard deviation. Fortunately, we can use our data to estimate it.

\[ \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i = 1}^{n} (x_i - \overline{x})^2 \] We can use our estimate \(\hat{\sigma}\) to calculate a t-statistic, which is a close analogue of the z-statistic calculated above.

\[ t = \frac{\overline{x} - \mu_0}{\hat{\sigma}/\sqrt{n}} \] It turns out that with normally distributed and independent variables, this t-statistic has a known distribution (the t-distribution with \(n - 1\) degrees of freedom) that is close to the normal distribution and gets closer as \(n\) increases.

sd_est <- sd(test_sample)
t_stat <- (mean_est - hgt_mean_null)/(sd_est/sqrt(sample_size))
t_stat
## [1] -5.614045
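
As a quick check, we can confirm that the sd function applies the \(n - 1\) denominator used in the formula above:

all.equal(sd_est, sqrt(sum((test_sample - mean_est)^2) / (sample_size - 1)))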

We can calculate the critical values for a test based on the t-statistics as follows:

qt(0.025, df = sample_size - 1)
## [1] -1.984217
qt(1 - 0.025, df = sample_size - 1)
## [1] 1.984217

The t-distribution is also symmetric, so we can express \(c\) as \(c = 1.98\) and identify the rejection region as \(\{t: |t| > c \}\).

pt(t_stat, df = sample_size - 1) * 2
## [1] 1.808489e-07

In practice, researchers often blur the distinction between t-statistics and z-statistics. With moderately large samples, the t-distribution is so close to the standard normal distribution that the two statistics yield essentially identical inferences, and researchers routinely apply standard normal critical values (such as 1.96) to t-statistics.
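
As a quick illustration of how close the two distributions are, we can compare the 2.5% critical values from the t-distribution and the standard normal distribution for various degrees of freedom:

tibble(df = c(10, 30, 100, 1000),
       t_crit = qt(1 - 0.025, df = df),
       z_crit = qnorm(1 - 0.025))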

Using some relations between means and regression coefficients explored in Chapter 4, we can use the lm function to produce this t-statistic and p-value:

df <- tibble(x = test_sample)
summary(lm((x - hgt_mean_null) ~ 1, data = df))
## 
## Call:
## lm(formula = (x - hgt_mean_null) ~ 1, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.064  -4.853  -1.043   5.036  21.342 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.4599     0.7944  -5.614 1.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.944 on 99 degrees of freedom

6.3.3 Unknown distribution: Asymptotic methods

But what if we don’t know that the underlying distribution for heights is normal? Perhaps we even have good reason to believe that it isn’t normal. In this kind of situation, which is actually quite common, there are two basic approaches.

The first approach invokes a central limit theorem and thereby claims that the distribution of means will be asymptotically normal and thus approximately normal in finite samples.

To make things interesting, we replace our underlying normal distribution of heights with a new bimodal distribution created by drawing heights from one of two normal distributions:

bimodal_dgf <- function(n) {
  
  short <- sample(c(TRUE, FALSE), 
                  size = n, 
                  prob = c(0.3, 0.7), 
                  replace = TRUE)
  
  short_sample <- rnorm(n, mean = 155, sd = 3)
  tall_sample <- rnorm(n, mean = 171, sd = 3)
  if_else(short, short_sample, tall_sample)
}

We can see that a sample drawn from this (unrealistic) distribution of heights has two modes and is clearly not normal.

set.seed(2023)

bimodal_sample <- bimodal_dgf(1000)

tibble(height = bimodal_sample) %>%
  ggplot(aes(x = height)) + 
  geom_histogram(binwidth = 1)

Now we make a function to draw multiple (n_samples) samples (each of size sample_size) from a given distribution (supplied using the dgf argument) and calculate the mean for each, so that we can examine the distribution of these means:

get_means <- function(sample_size, n_samples, dgf) {
  tibble(i = 1:n_samples,
         sample = map(i, function(x) dgf(sample_size)),
         mean = map_vec(sample, mean),
         se = map_vec(sample, function(x) sd(x)/sqrt(length(x) - 1))) %>%
    select(-sample) %>%
    mutate(z = mean/se)
}

Here we draw 10,000 samples of 1000 from bimodal_dgf. (Here you may be pleasantly surprised to learn that we can just pass functions like bimodal_dgf into other functions!)

set.seed(2023)
bimodal_sample_stats <- get_means(sample_size = 1000,
                                  n_samples = 10000,
                                  dgf = bimodal_dgf)

Now we make a small function, make_clt_plot, to plot the distribution of sample means as a histogram with an overlaid normal density that has the same mean and standard deviation as the sample means.

make_clt_plot <- function(df) {
  df %>%
    ggplot(aes(x = mean)) +
    geom_histogram(aes(y = after_stat(density)), fill = "green", 
                   binwidth = sd(df$mean)/10) +
    geom_vline(aes(xintercept = mean(mean), color = "red")) +
    stat_function(fun = dnorm, 
                  args = list(mean = mean(df$mean), 
                              sd = sd(df$mean)),
                  color = "blue") +
    theme(legend.position = "none") 
}
make_clt_plot(bimodal_sample_stats)             

We can combine the get_means and make_clt_plot functions into a single function clt_demo that allows us to make a plot of the distribution of means with an overlaid normal distribution so that we can see how well the latter approximates the former for a given sample size (sample_size) and DGF (dgf).

clt_demo <- function(sample_size, n_samples, dgf) {
  get_means(sample_size, n_samples, dgf) %>%
    make_clt_plot()
}

We can now call this function to reproduce the plot above, but for sample_size = 100.

clt_demo(sample_size = 100, n_samples = 10000, bimodal_dgf)

6.3.4 Unknown distribution: Bootstrapping

Above we considered the possibility that we had a normally distributed variable, but didn’t know the standard deviation. We saw that we can easily use the sample we have to estimate that standard deviation. In fact, we can go further and use the sample to generate information about the distribution of the underlying variable.

The bootstrapping approach in this setting proceeds as follows:

  1. Take a with-replacement sample of size \(n\) from our sample.
  2. Calculate the mean for that sample.
  3. Repeat steps 1 and 2 many times.
  4. Use the distribution above to calculate estimates of relevant parameters (e.g., the standard deviation).

We can use the sample we created above (bimodal_sample) for this purpose. Here we calculate 10,000 samples. Note that get_bs_mean performs steps 1 and 2, and we use map_vec to perform step 3.

get_bs_mean <- function(data) {
  bs_sample <- sample(data, size = length(data), replace = TRUE)
  mean(bs_sample)
}

set.seed(2021)
bs_stats <- 
  tibble(i = 1:n_samples) %>%
  mutate(mean = map_vec(i, function(x) get_bs_mean(data = bimodal_sample)))

We might simply use the results of our bootstrapping procedure to estimate the standard error of our mean estimates.

se_est <- sd(bs_stats$mean)
se_est
## [1] 0.2489343

But we could also use these results to evaluate the appropriateness of the normal approximation.

make_clt_plot(bs_stats)             

Here we have barely scratched the surface of what can be done with bootstrapping analysis. In some settings we can use bootstrapping to read off critical values without needing to rely on the distribution approximating a normal distribution. We will mention bootstrapping approaches below and discuss them in more detail in subsequent chapters.
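
As a small illustration of the first idea, a percentile-based 95% interval for the mean can be read directly from the empirical quantiles of the bootstrapped means, with no appeal to normality:

quantile(bs_stats$mean, probs = c(0.025, 0.975))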

6.4 Differences in means

As we saw in Chapter 4, we are often interested in the differences between two groups of observations. Suppose that the first group has \(n_0\) observations and a mean of \(\overline{x}_0\) with standard deviation \(\hat{\sigma}_0\) and the second group has \(n_1\) observations and a mean of \(\overline{x}_1\) with standard deviation \(\hat{\sigma}_1\).

Given the maintained assumption of independence of the observations here, the two sample means are independently distributed. The difference between the means of the two groups \(\overline{x}_1 - \overline{x}_0\) has variance equal to the sum of the variances of each mean.

\[ \mathrm{var}(\overline{x}_1 - \overline{x}_0) = \frac{\hat{\sigma}^2_1}{n_1} + \frac{\hat{\sigma}^2_0}{n_0} \] Taking the square root of this gives the standard error of the difference in means.

\[ \mathrm{se}(\overline{x}_1 - \overline{x}_0) = \sqrt{\frac{\hat{\sigma}^2_1}{n_1} + \frac{\hat{\sigma}^2_0}{n_0}} \] Again, we can often rely on the difference in means being approximately normally distributed.
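
As a quick sanity check of this formula, we can compare the statistic it produces with the one reported by t.test in base R, which implements the Welch two-sample test. (The samples and names below, such as x0 and x1, are purely illustrative.)

set.seed(2023)
x0 <- rnorm(1000, mean = 165, sd = 8)
x1 <- rnorm(1000, mean = 166, sd = 8)
se_diff <- sqrt(var(x1) / length(x1) + var(x0) / length(x0))
(mean(x1) - mean(x0)) / se_diff
t.test(x1, x0)$statistic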

One important point to note: it may be tempting to think that if the null hypothesis \(\mu_1 \leq 0\) is rejected (in favour of \(H_1: \mu_1 > 0\)) while the null hypothesis \(\mu_0 \leq 0\) is not, then the null hypothesis \(\mu_1 \leq \mu_0\) can thereby be rejected (in favour of \(H_1: \mu_1 > \mu_0\)). This reasoning is erroneous, even when the true state of the world is \(\mu_0 = \mu_1 = 0\), as we will demonstrate.

Imagine that our sampling procedure will draw 1,000 observations from each of two groups and that we will compare the means of these two sub-samples. We will construct 10,000 such samples; we do this by constructing them in turn using the get_means function with dgf = rnorm.

We merge these two samples by the sample identifier (i) and for each sample, we calculate the difference in means and the standard error of the difference using our formula above. From these values, we can calculate the z-statistic for each difference.

set.seed(2023)
sample_0_stats <- get_means(sample_size = 1000,
                            n_samples = 10000,
                            dgf = rnorm)

sample_1_stats <- get_means(sample_size = 1000,
                            n_samples = 10000,
                            dgf = rnorm)

merged_stats <- 
  sample_0_stats %>%
  inner_join(sample_1_stats, by = "i", suffix = c("_0", "_1")) %>%
  mutate(mean_diff = mean_1 - mean_0,
         se_diff = sqrt(se_0^2 + se_1^2),
         z_diff = mean_diff/se_diff)

Let’s assume that the size of our test is \(\alpha = 0.05\). Because we are using two-tailed tests, we set the critical value to the absolute value of \(\Phi^{-1}(\alpha/2)\).

crit_value <- abs(qnorm(0.025))

The correct way to evaluate the difference in means is to consider the difference significant if (1) \(\left|z_{\text{diff}}\right| > c\). However, some researchers will consider the difference significant if either (2) \(|z_1| > c\) while \(|z_0| \leq c\) or (3) \(|z_0| > c\) while \(|z_1| \leq c\). And there is a risk that researchers will take the opportunity to declare the difference “statistically significant” if any of the conditions (1), (2), or (3) is met.

Below we calculate the proportion of samples for which the null hypothesis is rejected based on condition (1) (prop_sig_diff), based on conditions (2) or (3) (prop_sig_diff_alt), or based on conditions (1), (2), or (3) (prop_sig_diff_choice).

merged_stats %>%
  mutate(sig_diff = abs(z_diff) > crit_value,
         sig_1 = abs(z_1) > crit_value,
         sig_0 = abs(z_0) > crit_value) %>%
  summarize(prop_sig_diff = mean(sig_diff),
            prop_sig_diff_alt = mean((sig_1 & !sig_0) | (sig_0 & !sig_1)),
            prop_sig_diff_choice = mean(sig_diff | (sig_1 & !sig_0) |
                                          (sig_0 & !sig_1)))
## # A tibble: 1 × 3
##   prop_sig_diff prop_sig_diff_alt prop_sig_diff_choice
##           <dbl>             <dbl>                <dbl>
## 1        0.0508            0.0967                0.116

6.5 Inference with regression

When the OLS estimator is unbiased, the expected value of \(\hat{\beta}\) will be \(\beta\), but \(\hat{\beta}\) is just an estimate and will typically differ from the true value, as we can see by substituting the value of \(y\) into the equation for \(\hat{\beta}\) and doing some algebra. Note that below we use the facts that \(\mathbb{E} \left[\hat{\beta} - \beta \right]\) is zero (implied by the expression below) and that the variance of a zero-mean vector \(Z\) equals \(\mathbb{E} \left[ZZ' \right]\).

\[ \begin{aligned} \hat{\beta} &= (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y \\ &= (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}(X\beta + \epsilon) \\ &= (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}X\beta + (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}\epsilon \\ & = \beta + (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}\epsilon \\ \hat{\beta} - \beta &= (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}\epsilon \\ \mathrm{var} (\hat{\beta}|X) &= \mathbb{E} \left[(X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}} \epsilon \epsilon' X (X^{\mathsf{T}}X)^{-1} | X\right] \\ &= (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}} \mathbb{E} \left[\epsilon \epsilon' | X\right] X (X^{\mathsf{T}}X)^{-1} \end{aligned} \]

Now if \(\mathbb{E} \left[\epsilon_i \epsilon_j \right] = 0, \forall i \neq j\), then \(\mathbb{E} \left[\epsilon \epsilon' | X\right]\) will be a diagonal matrix with each element on the main diagonal equal to \(\sigma^2_i\), the variance of \(\epsilon\) for observation \(i\) (i.e., \(\mathrm{var} (\epsilon_i)\)). And if \(\mathbb{E} [\sigma^2_i |X ] = \sigma^2\) for all \(i\), then we have:

\[ \begin{aligned} \Sigma &:= \mathrm{var} (\hat{\beta}|X) \\ &= \sigma^2 (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}X (X^{\mathsf{T}}X)^{-1} \\ &= \sigma^2 (X^{\mathsf{T}}X)^{-1} \\ \end{aligned} \]

This is the familiar result under the “standard” OLS assumptions, including the assumption that \(\mathbb{E} [\sigma^2_i |X ] = \sigma^2\) for all \(i\), which is known as homoskedasticity.

One issue is that we do not observe \(\sigma^2\), but we can estimate it using the standard formula

\[ \hat{\sigma}^2 = \frac{1}{N-k} \sum_{i=1}^N \hat{\epsilon}_i^2 \] where \(\hat{\epsilon}_i\) is the regression residual for observation \(i\) and \(N - k\) incorporates a degrees-of-freedom correction. (Note that \(k = 2\) when we have one regressor in addition to a constant term.) Putting the above together, we have

\[ \hat{\Sigma} = \hat{\sigma}^2 (X^{\mathsf{T}}X)^{-1} \] This is also the standard approach to estimating the variance of coefficients in statistical software, including the lm function in R. To confirm this, let’s create a test data set, calculate regression statistics “by hand” and then compare results with those provided by the lm function.

To make this more concrete, let’s consider some actual (though not “real”) data. The following code uses R functions to generate random data. Specifically, we generate 1000 observations with \(\beta = 1\) and \(\sigma = 0.2\).

set.seed(2021)
N <- 1000
x <- rnorm(N)
e <- rnorm(N, sd = 0.2)
y <- x * 1 + e

We next construct \(X\) as a matrix comprising a column of ones (to estimate the constant term) and a column containing \(x\).

X <- matrix(c(rep(1, N), x), ncol = 2)
b <- solve(t(X) %*% X) %*% t(X) %*% y
resid <- y - X %*% b

We now calculate \(\hat{\sigma}^2\) and then \(\hat{\Sigma}\). This allows us to calculate the estimated standard errors of the estimated coefficients as the square root of the diagonal elements of \(\hat{\Sigma}\) and then the \(t\)-statistics for each coefficient by dividing each element of \(\hat{\beta}\) by its estimated standard error.

resvar <- sum(resid^2)/(N-2)
Sigma <- resvar * solve(t(X) %*% X)
se <- sqrt(diag(Sigma))
tibble(se, t = b/se) %>% knitr::kable()
         se            t
  0.0064399     1.170815
  0.0063210   157.829403

Now, let’s compare with what we get from the summary function applied to the results of estimating a model using lm.

fm <- lm(y ~ x)
summary(fm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72772 -0.12642 -0.00101  0.12774  0.65945 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.007540   0.006440   1.171    0.242    
## x           0.997647   0.006321 157.829   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2036 on 998 degrees of freedom
## Multiple R-squared:  0.9615, Adjusted R-squared:  0.9614 
## F-statistic: 2.491e+04 on 1 and 998 DF,  p-value: < 2.2e-16

You may have noticed that these approaches are analogous to using t-statistics to test hypotheses about means, as we examined above. Similarly, we might use asymptotic methods or bootstrapping to test hypotheses about regression coefficients. The assumptions required for coefficient t-statistics to follow the \(t\) distribution exactly (such as normally distributed error terms) are generally implausible in most research settings, so in practice most inferences rely on asymptotic results and the assumption that the finite-sample distributions of estimators are approximately normal. Indeed, all the approaches discussed in the next section rely on asymptotic justifications.

In the remainder of this section, we discuss how bootstrapping can be applied to regression analysis. The bootstrapping approach in this setting proceeds as follows:

  1. Take a with-replacement sample of size \(n\) from our sample of data.
  2. Calculate the regression coefficients for that sample.
  3. Repeat steps 1 and 2 many times.
  4. Use the distribution above to calculate estimates of relevant parameters (e.g., the standard error of regression coefficients).

To demonstrate this, we put our y and x variables above into a data frame and create a function (get_bs_coefs) that performs steps 1 and 2.

df <- tibble(y, x)

get_bs_coefs <- function(data) {
  rows <- 1:nrow(data)
  bs_rows <- sample(rows, size = length(rows), replace = TRUE)
  
  bs_sample <- data[bs_rows, ]
  fm <- lm(y ~ x, data = bs_sample)
  fm$coefficients
}

We then use map_df to repeat these steps 10,000 times.

set.seed(2021)
bs_coefs <- 
  tibble(i = 1:n_samples) %>%
  mutate(map_df(i, function(x) get_bs_coefs(data = df)))

We can then calculate the standard deviation of the estimated regression coefficients as an estimate of their standard errors.

bs_coefs %>%
  select(-i) %>%
  summarize(across(everything(), sd))
## # A tibble: 1 × 2
##   `(Intercept)`       x
##           <dbl>   <dbl>
## 1       0.00645 0.00650

6.6 Dependence

In generating data in Chapter 4, we assumed that the errors and regressors were independent. However, most of the empirical accounting literature uses panel data sets, typically repeated observations on a set of firms over time. In these studies, it is reasonable to assume that regressors (\(X\)) and errors (\(\epsilon\)) are correlated both across firms (i.e., cross-sectional dependence) and over time (i.e., time-series dependence).

While White (1980) standard errors are consistent in the presence of heteroskedasticity, it is well known that both OLS and White produce misspecified test statistics when either form of dependence is present.
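
To illustrate the mechanics, White standard errors for the (dependence-free) simulated model fm from the previous section can be computed with the vcovHC function from the sandwich package and compared with the default OLS standard errors:

sqrt(diag(vcov(fm)))
sqrt(diag(vcovHC(fm, type = "HC1")))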

Gow et al. (2010) demonstrate that accounting researchers use a variety of approaches to address the issues of cross-sectional and time-series dependence widely understood to be present in common research settings. The literature survey in Gow et al. (2010) identified a number of common approaches: Fama-MacBeth, Newey-West, the “\(Z2\)” statistic, and standard errors clustered by firm, industry, or time.

6.6.1 Newey-West

Newey and West (1987) generalized the White (1980) approach to yield a covariance matrix estimator that is robust to both heteroskedasticity and serial correlation. While the Newey-West procedure was developed in the context of a single time series, Gow et al. (2010) show it is frequently applied in panel data settings, and in such settings assumes cross-sectional independence. Gow et al. (2010) assess the performance of Newey-West in the presence of both cross-sectional and time-series dependence, and find that it produces misspecified test statistics with even moderate levels of cross-sectional dependence.

6.6.2 Fama-MacBeth

The Fama-MacBeth approach (Fama and MacBeth, 1973) is designed to address concerns about cross-sectional correlation. The original Fama-MacBeth approach (which Gow et al., 2010 label “FM‑t”) involves estimating T cross-sectional regressions (one for each period) and basing inferences on a t-statistic calculated as

\[ t = \frac{\overline{\beta}}{\mathit{se}(\beta)}\text{, where } \overline{\beta} = \frac{1}{T} \sum_{t=1}^T \hat{\beta}_t\]

and \(\mathit{se}(\beta)\) is the standard error of the coefficients based on their empirical distribution. When there is no cross-regression (time-series) dependence, this approach yields consistent estimates of the standard error of the coefficients as T goes to infinity. However, cross-regression dependence in errors and regressors causes Fama-MacBeth standard errors to be understated (Cochrane, 2009; Schipper and Thompson, 1983).

To illustrate the Fama-MacBeth approach, we will generate a data set using the simulation approach of Gow et al. (2010), which is available via the get_got_data function from the farr package. The parameters Xvol and Evol relate to the cross-sectional dependence of \(X\) and \(\epsilon\), respectively. The parameters rho_X and rho_E relate to the time-series dependence of \(X\) and \(\epsilon\), respectively. We get 10 years of simulated data on 500 firms and store the data in the data frame named test.

set.seed(2021)
test <- get_got_data(N = 500, T = 10, Xvol = 0.75,
                       Evol = 0.75, rho_X = 0.5, rho_E = 0.5)

We first use the nest() function from tidyr to organize the relevant data by year. We then use the map function from the purrr package to estimate the regression model by year and place the fitted models in a column named fm, and then use map again to extract the coefficients from the fm column and store them in the coefs column.

Finally, we combine the year column and the data in the coefs column into a new data frame that we can tabulate. Note that we use the shortcut provided by purrr that allows us to write function(z) lm(y ~ x, data = z) as simply ~ lm(y ~ x, data = .).

results <-
  test %>% 
  group_by(year) %>% 
  nest() %>%
  mutate(fm = map(data, ~ lm(y ~ x, data = .)),
         coefs = map(fm, coefficients))

coefs <- 
  bind_cols(year = results$year, bind_rows(results$coefs)) %>%
  rename(Constant = `(Intercept)`)

coefs %>% knitr::kable()
 year    Constant          x
    1   1.6355567  0.9513677
    2   2.7017022  0.9898722
    3  -1.1093092  0.9078208
    4  -0.3491182  1.0951516
    5  -0.8628827  1.0034783
    6  -0.4451103  0.9104420
    7  -0.2011427  1.0506848
    8   1.5738367  0.8011475
    9  -2.7726904  0.9517446
   10   0.8071852  0.9843101

The second step of the Fama-MacBeth approach involves calculating the mean and standard error of the estimated coefficients. It can be shown that regression of the time-series of coefficients on a constant produces the required means and standard errors.

fms <- list(lm(Constant ~ 1, data = coefs),
            lm(x ~ 1, data = coefs))

Two common variants of the Fama-MacBeth approach appear in the accounting literature. The first variant, which Gow et al. (2010) label “FM‑i”, involves estimating firm- or portfolio-specific time-series regressions with inferences based on the cross-sectional distribution of coefficients, and had been used extensively in the accounting literature prior to Gow et al. (2010). This modification of the Fama-MacBeth approach is appropriate if there is time-series dependence but not cross-sectional dependence. However, it is difficult to identify such circumstances in real research settings and we do not discuss FM-i further.51

The second common variant of the FM‑t approach, which Gow et al. (2010) label FM‑NW, is intended to correct for serial correlation in addition to cross-sectional correlation. FM‑NW modifies FM‑t by applying a Newey-West adjustment in an attempt to correct for serial correlation. While a number of studies claimed that FM‑NW produces a conservative estimate of statistical significance, they did so without any formal evaluation of FM‑NW.

Gow et al. (2010) found that FM-NW simply does not address serial correlation and offered two explanations for this failure. First, FM‑NW generally involves applying Newey-West to a limited number of observations, a setting in which Newey-West is known to perform poorly. Second, FM‑NW applies Newey-West to a time-series of coefficients, whereas the dependence is in the underlying data. For example, without using information in the underlying data, it seems difficult to determine whether a sequence such as {2.1, 1.8, 1.9, 2.2, 2.3} represents a highly autocorrelated draw from a mean-zero distribution or evidence of a population coefficient of around 2.

Nonetheless, implementing FM-NW is straightforward using the sandwich package. We simply apply the NeweyWest function to the fitted FM-t model to recover FM-NW standard errors. The following function does this:52

get_nw_ses <- function(fm, lag = 1) {
  vcov <- NeweyWest(fm, lag = lag, prewhite = FALSE, adjust = TRUE)
  sqrt(diag(vcov))
}

We tabulate each of the Fama-MacBeth coefficients twice: once with the default standard errors (by passing NULL to the se argument of stargazer) and once with Newey-West standard errors (from the get_nw_ses() function defined above).

sg_format <- "html"
stargazer(list(fms, fms),
          type = sg_format,
          se = list(NULL, NULL,
                    get_nw_ses(fms[[1]]),
                    get_nw_ses(fms[[2]])),
          column.labels = c("FM-t", "FM-t",
                            "FM-NW", "FM-NW"),
          dep.var.caption = "Covariate",
          covariate.labels = "Coefficient",
          omit.stat=c("ser", "f", "adj.rsq"),
          report = "vcs",
          model.names = FALSE,
          dep.var.labels.include = TRUE,
          header = FALSE)
                               Covariate
               ----------------------------------------------
               Constant        x       Constant        x
                 FM-t         FM-t       FM-NW       FM-NW
                  (1)          (2)        (3)          (4)
--------------------------------------------------------------
Coefficient      0.098        0.965      0.098        0.965
                (0.506)      (0.026)    (0.457)      (0.020)
--------------------------------------------------------------
Observations       10           10         10           10
R2                0.000        0.000      0.000        0.000
--------------------------------------------------------------
Note:                           *p<0.1; **p<0.05; ***p<0.01

6.6.3 Z2 statistic

The Z2 statistic first appeared in 1994 and was used in a number of subsequent studies in the accounting literature. The Z2‑t (Z2-i) statistic is calculated using t-statistics from separate cross-sectional (time-series) regressions for each time period (cross-sectional unit) and is given by the expression:

\[ \textit{Z2} = \frac{\overline{t}}{\mathit{se}(t)}\text{, where } \overline{t} = \frac{1}{T} \sum_{t=1}^T \hat{t}_t\]

\(\mathit{se}(t)\) is the standard error of the t-statistics based on their empirical distribution, and T is the number of time periods (cross-sectional units) in the sample. Several studies claimed that \(Z2\) adjusts for cross-sectional and serial correlation, but offered no basis for this claim. Gow et al. (2010) found no prior formal analysis of the properties of the \(Z2\) statistic and found that \(Z2\) suffers from cross-regression dependence in the same way as the Fama-MacBeth approach does.
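
Although we do not use \(Z2\) elsewhere in this chapter, a minimal sketch of the Z2-t calculation using the by-year models stored in results above might look as follows (the exact scaling of \(\mathit{se}(t)\) varies across studies; here we use the standard error of the mean t-statistic):

t_stats <- map_vec(results$fm, function(m) summary(m)$coefficients["x", "t value"])
mean(t_stats) / (sd(t_stats) / sqrt(length(t_stats)))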

6.6.4 One-way cluster-robust standard errors

A number of studies in the survey of Gow et al. (2010) use cluster-robust standard errors, with clustering either along a cross-sectional dimension (e.g., analyst, firm, industry, or country) or along a time-series dimension (e.g., year); we refer to the former as CL‑i and the latter as CL‑t. Cluster-robust standard errors (also referred to as Huber-White or Rogers standard errors) were proposed by White (1984) as a generalization of the heteroskedasticity-robust standard errors of White (1980). With observations grouped into \(G\) clusters of \(N_g\) observations, for \(g\) in \(\{1, \dots, G\}\), the covariance matrix is estimated using the following expression,

\[ \hat{V}(\hat{\beta}) = (X'X)^{-1} \hat{B} (X'X)^{-1} \text{, where } \hat{B} = \sum_{g=1}^G X'_g u_g u'_g X_g \]

where \(X_g\) is the \(N_g \times K\) matrix of regressors, and \(u_g\) is the \(N_g\)-vector of residuals for cluster \(g\). If each cluster contains a single observation, the expression above yields the White (1980) heteroskedasticity-consistent estimator.

While one-way cluster-robust standard errors allow for correlation of unknown form within cluster, it is assumed that errors are uncorrelated across clusters. For example, clustering by time (firm) allows observations to be cross-sectionally (serially) correlated, but assumes independence over time (across firms). As we demonstrate, \(t\)-statistics for CL‑t are inflated in the presence of time-series dependence and \(t\)-statistics for CL‑i are inflated in the presence of cross-sectional dependence. Thus, when both forms of dependence are present, both CL‑t and CL‑i produce overstated \(t\)-statistics.

It is also important to note that clustering by, say, industry-year does not produce standard errors robust to within-industry and within-year dependence. Instead, like all cluster-robust methods, this approach assumes independence across clusters. As such, clustering by industry-year assumes that each industry-year cluster is independent; that is, neither time-series nor cross-industry correlation is an issue (i.e., observations for industry \(j\) in year \(t\) are uncorrelated with those in industry \(j\) in year \(t+1\) and those in industry \(k\) in year \(t\)). To see this point, note that clustering by firm-year when observations are firm-years produces White standard errors.

6.6.5 Two-way cluster-robust standard errors

Thompson (2011) and Cameron et al. (2011) developed an extension of cluster-robust standard errors that allows for clustering along more than one dimension.

In contrast to one-way clustering, two-way clustering (CL‑2) allows for both time-series and cross-sectional dependence. For example, two-way clustering by firm and year allows for within-firm (time-series) dependence and within-year (cross-sectional) dependence (e.g., the observation for firm \(j\) in year \(t\) can be correlated with that for firm \(j\) in year \(t+1\) and that for firm \(k\) in year \(t\)). To estimate two-way cluster-robust standard errors, the expression above is evaluated using clusters along each dimension (for example, clustered by industry and clustered by year) to yield \(V_1\) and \(V_2\). Then the same expression is calculated using the “intersection” clusters (in our example, observations within an industry-year) to yield \(V_I\). The two-way cluster-robust estimator \(V\) is calculated as \(V = V_1 + V_2 - V_I\). Packages available for R make it relatively straightforward to implement two-way cluster-robust standard errors and we demonstrate their use below.
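
As an alternative to the felm approach used in the next subsection, recent versions of the sandwich package provide vcovCL, which handles both one-way and multiway clustering. A minimal sketch using the test data from above and a pooled OLS fit (here called fm_pool) is:

fm_pool <- lm(y ~ x, data = test)
sqrt(diag(vcovCL(fm_pool, cluster = ~ firm)))         # One-way clustering (CL-i)
sqrt(diag(vcovCL(fm_pool, cluster = ~ year + firm)))  # Two-way clustering (CL-2)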

6.6.6 Calculating standard errors

In the following code, we use functions from three libraries: sandwich, plm, and lfe. The lfe library provides the felm function as an extension to lm that allows for fixed effects, instrumental variables, and clustered standard errors. We will discuss the use of felm with fixed effects and instrumental variables in later chapters. Here we just use the functionality related to clustered standard errors. The interface to felm uses a four-part formula specification with each part separated by |. The first part consists of ordinary covariates, the second part consists of “factors to be projected out” (think of these as fixed effects), the third part provides instrumental variables, and the fourth part provides the specification for clustered standard errors. We use 0 for the second and third parts below.

fms <- list(OLS = lm(y ~ x, data = test),
            White = lm(y ~ x, data = test),
            NW = plm(y ~ x, test, index = c("firm", "year"),
                     model = "pooling"),
            `FM-t` = pmg(y ~ x, test, index = "year"),
            `CL-i` = felm(y ~ x | 0 | 0 | firm, data = test),
            `CL-t` = felm(y ~ x | 0 | 0 | year, data = test),
            `CL-2` = felm(y ~ x | 0 | 0 | year + firm, data = test))

se <- vector("list", length(fms))
names(se) <- names(fms)
se[["White"]] <- sqrt(diag(vcovHC(fms[["White"]], type="HC1")))
se[["NW"]] <- sqrt(diag(vcovNW(fms[["NW"]])))
stargazer(fms, type = sg_format, digits = 3,
          se = se,
          dep.var.caption = "",
          omit.stat=c("ser", "f", "adj.rsq"),
          column.labels = names(fms),
          report = "vcs",
          model.names = FALSE,
          dep.var.labels.include = FALSE,
          header = FALSE)
                OLS      White      NW      FM-t     CL-i     CL-t     CL-2
                (1)       (2)       (3)      (4)      (5)      (6)      (7)
----------------------------------------------------------------------------
x              1.266     1.266     1.266    0.965    1.266    1.266    1.266
              (0.027)   (0.021)   (0.021)  (0.026)  (0.018)  (0.172)  (0.172)
Constant       0.021     0.021     0.021    0.098    0.021    0.021    0.021
              (0.026)   (0.026)   (0.027)  (0.506)  (0.023)  (0.492)  (0.492)
----------------------------------------------------------------------------
Observations   5,000     5,000     5,000    5,000    5,000    5,000    5,000
R2             0.305     0.305     0.305    0.786    0.305    0.305    0.305
----------------------------------------------------------------------------
Note:                                         *p<0.1; **p<0.05; ***p<0.01

One concern with cluster-robust methods is their finite-sample properties. Cameron et al. (2008) document that cluster-robust methods will over-reject a true null when the number of clusters is small and attribute this issue to the use of limiting distributions (typically the normal distribution) in making inferences in small samples (e.g., using the standard 1.64, 1.96, and 2.58 critical values). However, it is important to note that most methods used in accounting research (including White, N-W, FM‑i, FM‑t, Z2) have asymptotic foundations and thus are affected by this issue (e.g., FM‑t with 10 years of data). Accordingly, researchers should exercise caution when applying any asymptotic methods in small-sample settings (e.g., FM-t, CL-t, or CL-2 when the number of time periods is small). This issue notwithstanding, we continue to find CL‑2 produces unequivocally better inferences than any of these methods even with as few as 10 clusters. Furthermore, as the econometrics literature has identified approaches to address this concern (e.g., Cameron et al., 2008 identify corrections based on bootstrapping methods), having few clusters does not warrant relying on approaches that are not robust to cross-sectional and time-series dependence.
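
To give a concrete sense of what a cluster bootstrap looks like, the following rough sketch implements a simple pairs cluster bootstrap that resamples whole firms from the test data. Note that this is not the wild cluster bootstrap studied by Cameron et al. (2008), and resampling firms addresses only within-firm dependence; it is offered purely as an illustration of the mechanics.

# Pairs cluster bootstrap by firm: resample firms with replacement, refit the
# model, and use the spread of the coefficient estimates as a standard error.
get_cluster_bs_coef <- function(data) {
  firms <- unique(data$firm)
  bs_firms <- sample(firms, size = length(firms), replace = TRUE)
  bs_data <- bind_rows(lapply(bs_firms, function(f) data[data$firm == f, ]))
  coef(lm(y ~ x, data = bs_data))[["x"]]
}

set.seed(2021)
bs_cl_coefs <- map_vec(1:200, function(i) get_cluster_bs_coef(test))
sd(bs_cl_coefs)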

6.6.7 Exercises

  1. Confirm either by direct computation using the data in coefs or using a little algebra that regressing the first-stage coefficients on a constant yields the desired second-stage results for FM-t.
  2. Assume that the null hypothesis is \(H_0: \beta = 1\). Using the reported coefficients and standard errors for each of the methods listed in the table (i.e., OLS, White, NW, FM-t, CL-i, CL-t, CL-2), for which methods is the null hypothesis rejected?
  3. Based on the analysis in Gow et al. (2010), when should you use the FM-NW or Z2 approaches? What factors likely led to the use of these approaches and claims for them (i.e., robustness to cross-sectional and time-series dependence) that were unsubstantiated (and ultimately false)?
  4. If FM-i is, as Gow et al. (2010) show, so inappropriate for the situations in which it was used, why do you think it was used in those situations?
  5. Gow et al. (2010) refer to “the standard 1.64, 1.96, and 2.58 critical values.” Using the pnorm function, determine the \(p\)-value associated with each of these critical values. Are these one-tailed or two-tailed \(p\)-values? The values of 1.64, 1.96, and 2.58 are approximations. Which function in R allows you to recover the precise critical values associated with a chosen \(p\)-value? (Hint: Read the help provided by ? pnorm in R.) Provide the critical values to four decimal places.