6 Statistical inference
This chapter is under construction.
In this chapter, we study statistical inference in the context of null hypothesis significance testing (NHST), which is the predominant paradigm in empirical accounting research. We examine NHST in fairly procedural terms without spending too much time justifying it. One reason for doing so is that even if we simply accept NHST as a sensible approach to science, in practice the way is NHST is used is profoundly different from the assumptions supporting its use.^{45} So real problems with NHST exist well before reaching its epistemological foundations.
Another issue with NHST is that statistical inference in most settings is based on limit theorems—results from mathematical statistics that require fairly advanced mathematics to prove—such as laws of large numbers and central limit theorems.^{46} To minimize the mathematical requirements here, we use simulations to illustrate ideas rather then trying to provide mathematically precise results. For readers looking to go deeper on either dimension, we provide a guide to further reading at the end of the chapter.
The code in this chapter uses the following packages. For instructions on how to set up your computer to use the code found in this book, see Chapter 1.2.1.
library(dplyr, warn.conflicts = FALSE)
library(stargazer)
library(ggplot2)
library(purrr) # For map()
library(tidyr) # For nest()
library(sandwich) # For NeweyWest()
library(plm)
library(lfe)
library(farr)
We use the stargazer
package to produce neat output from regressions.
For the HTML version of this book, we set sg_format
to "html"
, but "text"
would be a better option if looking at the results interactively, and you would probably use "latex"
if compiling a PDF.^{47}
sg_format < "html"
6.1 Some observations
Students of econometrics in PhD programs often spend much of their time learning about the asymptotic properties of estimators, which can be understood as how estimators behave as sample sizes tend toward infinity. However, there are two gaps in this standard pedagogical approach. First, it is often not made clear why we care about asymptotic properties of estimators. A practical student might question the relevance of asymptotic properties when sample sizes in practice can at times be quite large, but are rarely infinite.
Second, in practice, researchers soon forget that the estimators they use are justified by their asymptotic properties and rarely exploit what they learned in econometrics classes to study new estimators. For example, in this chapter we will devote some time to the study of standard errors. Almost all estimators of standard errors are motivated by asymptotic properties and thus can perform poorly in actual research settings. Yet Gow et al. (2010) show that estimators can gain currency among researchers without any formal analysis of their properties, asymptotic or otherwise.
One reason for studying the asymptotic properties of estimators is that these are mathematically tractable for a welltrained econometrician. For a practical researcher, showing that an estimator has good asymptotic properties is more or less necessary. An estimator that is not consistent is apt to produce significantly biased estimated in finite samples (one can think of consistency as the asymptotic equivalent of unbiasedness).^{48} And good asymptotic properties might be practically sufficient for the estimator to behave well in finite samples. An estimator that is asymptotically efficient and asymptotically normal has a good chance of being an efficient estimator in finite samples and having a distribution that is sufficiently close to the normal distribution for practical statistical inference.
In this chapter, we will not derive asymptotic properties of estimators. Instead, we will direct readers to econometrics texts for these details. Additionally, while we will provide an intuitive introduction to laws of large numbers and central limit theorems, we will not even state these formally, let alone derive them. Instead, in keeping with a theme of this book, we will use simulations to give the reader an intuitive feel for where these results from mathematical statistics come from and what they mean.
6.2 Datagenerating processes
Like causal inference, thinking about statistical inference generally involves quite abstract reasoning. Before getting into the statistics, it is helpful to imagine that the data we observe are generated by an underlying process (hence a datagenerating process or DGP). A DGP will often look like a model, but in principle these are distinct ideas. The DGP is a feature of reality that we generally do not observe directly, while a model is a guess or useful simplification of a conjectured DGP that humans use to understand the DGP.
In some settings we may believe that our model captures all the essential details of the DGP. While the results of a coin toss will depend on the rotational momentum imparted to the coin, the vertical velocity of the coin, imperfections on the coin’s surface, prevailing humidity, temperature, and wind velocity and so on, we may be happy in simply using a model in which coin tosses are independent, and equally likely to be one outcome as the other.
Part of the potential for confusion between these two ideas is that inevitably when discussing a DGP, we use a model, as we do here. Suppose that the height of adult individuals is drawn from a normal distribution with mean of \(\mu = 165\) centimetres and standard deviation of \(\sigma = 8\) centimetres.
hgt_mean < 165
hgt_sd < 8
This statement could be interpreted as the process for generating the heights of the billions of adult people who happen to inhabit Earth at this time. In a sense, the people who actually exist are a nondeterministic sample of the potential people who could exist.^{49}
Alternatively, we might take the current inhabitants of the world as a given and posit that the empirical distribution of heights is actually a normal distribution with parameters \(\mu\) and \(\sigma\) and focus instead on taking random sample from that population. Looking ahead, we might use a random sample to make inferences about the empirical distribution of heights of adult humans who currently inhabit the world or to make inferences about the datagenerating process for adult heights. In the context of causal inference, it is often the latter kind of inference that we want to make, but for present purposes we focus on the former kind.
There is no doubt that a twoparameter normal distribution is an oversimplification of the actual DGP, hence more of a model than an actual DGP. For example, we make no allowance for heights varying around the world, for the tendency of heights to increase over time (hence the reference to “current” above), or the tendency for men to be taller than women, each of which may undermine the applicability of a simple assumed DGP. But for current purposes, we ignore such quibbles and simply assume that heights come from a normal distribution \((\mu, \sigma)\).
We can simulate the taking of a random sample of size \(n = 100\) by using the rnorm
function we saw in Chapter 3.
hgt_sample < rnorm(100, mean = hgt_mean, sd = hgt_sd)
Given that we know that \(\mu = 165\) and that \(\sigma = 8\), we can calculate expected values for the sample mean and sample standard deviation of the heights in our sample. Here the sample mean is calculated as the sum of the observed heights (\(\{x_i: i \in (1, \dots, n)\}\)) divided by \(n\), where \(n = 100\).
\[ \overline{X}_{n} = \frac{\sum_{i = 1}^{n} x_i}{n} \]
\[ \begin{aligned} \mathbb{E}[\overline{X}_{n}] &= \mathbb{E}\left[\frac{\sum_{i = 1}^{n} x_i}{n}\right] \\ &= \frac{\sum_{i = 1}^{n}}{n} \mathbb{E}\left[ x_i\right] \\ &= \frac{\sum_{i = 1}^{n} \mathbb{E}\left[ x_i\right]}{n} \\ &= \frac{n \mu}{n} \\ &= \mu \end{aligned} \] To calculate the variance of \(\overline{X}_{n}\), we note that the variance of \(ax\) where \(a\) is a constant and \(x\) is a random variable is \(\mathrm{Var}[ax] = a^2 \mathrm{Var}[x]\) and the variance of the sum of independent and identically distributed variables \(x_i\), \(i \in (1, \dots, n)\), each with variance \(\sigma^2\) is simply the sum of their variances:
\[ \begin{aligned} \mathrm{Var} \left[\sum_{i = 1}^{n} x_i \right] &= \sum_{i = 1}^{n} \mathrm{Var} \left[x_i \right] \\ &=\frac{n}{\sigma^2} \end{aligned} \]
\[ \begin{aligned} \mathrm{Var}[\overline{X}_{n}] &= \mathrm{Var} \left[\frac{\sum_{i = 1}^{n} x_i}{n}\right] \\ &= \frac{1}{n^2} \mathrm{Var} \left[\sum_{i = 1}^{n} x_i \right] \\ &= \frac{1}{n} \sigma^2 \end{aligned} \] The standard deviation of the sample mean is the square root of the variance: \(\frac{\sigma}{\sqrt{n}}\).
Because in this section, we are in an idealized world in which we know the underlying DGP, we can test our calculations by generating data.
We can incorporate the code we used to make a sample above in a function get_hgt_sample
that takes a single argument (n
) for the sample size.
get_hgt_sample < function(n) {
rnorm(n, mean = hgt_mean, sd = hgt_sd)
}
Calling get_hgt_sample
with n = 10
we get a sample of 10 random numbers.
get_hgt_sample(10)
## [1] 161.0218 162.9585 158.2503 188.7250 173.4118 160.1665 168.9922 175.2958
## [9] 172.4536 150.6659
But we can easily generate tenthousand samples.
In many programming languages, we would use a for
loop to do this as in the following code.
In R, it is natural to use a list
to store results from multiple runs of a function, so we intiate hgt_samples
as an empty list and then use the for
command to repeat the get_hgt_sample(100)
call 10,000 times, storing each sample in a slot in hgt_samples
.
n_samples < 10000
set.seed(2023)
hgt_samples < list()
for (i in 1:n_samples) {
hgt_samples[[i]] < get_hgt_sample(100)
}
We can take a peek at the first ten observations in the 463rd sample like this:
hgt_samples[[463]][1:10]
## [1] 168.4070 164.0360 163.2143 155.2887 163.0420 170.6284 146.7339 168.8120
## [9] 172.7608 156.1855
We can calculate the mean of the 463rd sample like this:
mean(hgt_samples[[463]])
## [1] 164.4264
We could use a for
loop to calculate the mean for each of the samples in one go:
hgt_sample_means < list()
for (i in 1:length(hgt_samples)) {
hgt_sample_means[[i]] < mean(hgt_samples[[i]])
}
Here hgt_sample_means
is a list and we can peek at the first 5 values like this:
hgt_sample_means[1:5]
## [[1]]
## [1] 165.5401
##
## [[2]]
## [1] 166.6662
##
## [[3]]
## [1] 163.7082
##
## [[4]]
## [1] 165.0866
##
## [[5]]
## [1] 164.1906
For many purposes, it will be easier to work with hgt_sample_means
if it is a vector rather than a list and we can use the unlist
function to convert it to a vector:
unlist(hgt_sample_means[1:5])
## [1] 165.5401 166.6662 163.7082 165.0866 164.1906
Additionally, we could use the function lapply
rather than a for
loop.
The function lapply
applies a function to a list and returns a list of the results.
Combining lapply
and unlist
, we could use the following code instead:
We can see we have a nice vector of means from a single line of code.
hgt_sample_means[1:10]
## [1] 165.5401 166.6662 163.7082 165.0866 164.1906 164.9117 164.3328 164.5542
## [9] 164.1957 164.8957
In fact, we could use lapply
rather than a for
loop to create hgt_samples
in the first place.
Note that to get n_samples
samples, we want to lapply
the function get_hgt_sample
to the vector 1:n_samples
.
But we have a small problem: lapply
will provide each value of that vector to get_hgt_sample
and get_hgt_sample
simply does not expect that value.
One approach to addressing this is to use an anonymous function as we do below.
We call function(x) get_hgt_sample(100)
an anonymous function because we do not give it a name (by storing it in a variable).
We could do that, but as can be seen below it means storing a function in a new variable (get_hgt_sample_alt
) without producing code that is easier to read.
set.seed(2023)
get_hgt_sample_alt < function(x) get_hgt_sample(100)
hgt_samples < lapply(1:n_samples, get_hgt_sample_alt)
Another approach would be to make a version of get_hgt_sample
that does expect an argument related to the index of the sample being generated.
In fact, in some settings it might be useful to retain that information and in the function below, we have get_hgt_sample
return a data frame (tibble
) containing the sample index in the first column (i
) and the actual sample in the second column (data
).
get_hgt_sample_df < function(i, n) {
tibble(i, data = list(rnorm(n, mean = hgt_mean, sd = hgt_sd)))
}
To see what get_hgt_sample_df
is doing, we can call it once to get a single (small) sample.
one_sample < get_hgt_sample_df(i = 1, n = 10)
one_sample
## # A tibble: 1 × 2
## i data
## <dbl> <list>
## 1 1 <dbl [10]>
one_sample$data[[1]]
## [1] 159.2986 172.3488 170.3377 172.4086 161.2063 171.0710 164.7035 183.3683
## [9] 152.9049 171.6578
We can access the underlying data, but at this stage it may seem that we are adding complexity for no particular reason, as might be suggested by comparing the code below to the equivalent code in our first attempt above.
unlist(hgt_samples[[463]]$data)[1:10]
## [1] 168.4070 164.0360 163.2143 155.2887 163.0420 170.6284 146.7339 168.8120
## [9] 172.7608 156.1855
One reason for this complexity is that currently hgt_samples
is a list of tibbles.
We can use the function bind_rows
to bind them together into a single data frame.
hgt_samples_df < bind_rows(hgt_samples)
Note that hgt_samples_df
is starting to get fairly large:
print(object.size(hgt_samples_df), units = "auto")
## 8.2 Mb
While the size of hgt_samples_df
does not present a problem for a modern computer, more complicated analyses might produce objects sufficiently large that we would want to pay attention to their sizes.
We can now use map
functions from the purrr
library to extract statistics from hgt_samples_df
.
Because we want to the result of each calculation here to a vector, we use map_vec
.
To keep our results small, we drop the data
column from our table of statistics (hgt_samples_stats
).
(We could always use the column i
to join hgt_samples_stats
with hgt_samples_df
if we wanted to restore it.)
hgt_samples_stats <
hgt_samples_df %>%
mutate(n = map_vec(data, length),
mean = map_vec(data, mean),
se = map_vec(data, sd)/sqrt(n)) %>%
select(data)
We can now assess how well our formulas worked out in our simulations. The mean of the estimated means is indeed pretty close to the \(\mu\) parameter of 165 and the standard deviation of those estimated means is also very close to the value \(\frac{\sigma}{\sqrt{n}}\) (0.8).
hgt_stats_summ <
hgt_samples_stats %>%
summarize(n = mean(n),
mean_mean = mean(mean),
sd_mean = sd(mean),
mean_se = mean(se),
sd_se = sd(se))
hgt_stats_summ
## # A tibble: 1 × 5
## n mean_mean sd_mean mean_se sd_se
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100 165. 0.810 0.799 0.0569
hgt_stats_summ %>%
transmute(mean_error = mean_mean  hgt_mean,
se_error = sd_mean  hgt_sd/sqrt(n),
se_est_error = mean_se  sd_mean)
## # A tibble: 1 × 3
## mean_error se_error se_est_error
## <dbl> <dbl> <dbl>
## 1 0.0110 0.0100 0.0112
6.3 Hypothesis testing
6.3.1 Normal distribution, known standard deviation
Of course, in reality we generally do not observe parameters like hgt_mean
and hgt_sd
and our task as statisticians is form and test hypotheses about those unobserved parameters.
With null hypothesis significance testing, the researcher starts with a null hypothesis and asks whether the data are “sufficiently unusual” to justify rejection of that hypothesis.
As alluded to above, there are many deep issues lurking in this approach that we largely ignore. One such issue that we ignore is the source for null hypotheses. It is often implied that null hypotheses somehow represent the prior or default beliefs of “science”. For example, in the context of pharmaceuticals, the null hypothesis is usually that a potential drug has zero (or negative) effect on the target outcome (say, curing cancer) and this null hypothesis has to be rejected for a drug to receive approval from regulators.^{50}
Suppose we have a null hypothesis that the mean height of adult individuals is \(\mu_{0} = 170\) and that we know that heights are drawn from a normal distribution with standard deviation of \(\sigma = 8\) centimetres.
hgt_mean_null < 170
Now we can get a sample of data to test our hypothesis:
set.seed(2023)
sample_size < 100
test_sample < get_hgt_sample(sample_size)
mean_est < mean(test_sample)
Here we have a mean of 166. How unusual is a value of 166 under the null hypothesis?
Given we know that the sample mean will be normally distributed with standard deviation equal to \(\sigma/\sqrt{n}\), we can calculate the probability of observing a mean this far from the null hypothesis assuming the null hypothesis is true using the pnorm
function:
## [1] 1.238951e08
In practice is it more common to transform the mean estimate into a zstatistic which has a standard normal distribution (denoted \(N(0,1)\)), with mean zero and standard deviation of one.
\[ z = \frac{\overline{x}  \mu_0}{\sigma/\sqrt{n}} \]
z_stat < (mean_est  hgt_mean_null)/(hgt_sd/sqrt(sample_size))
We can see that using the zstatistic (here \(5.57\)) yields exactly the same pvalue.
pnorm(z_stat)
## [1] 1.238951e08
It is generally posited that the criterion for rejection of the null hypothesis is established prior to analysing the data. The set of possible outcomes is partitioned into two disjoint sets that together comprise the event space. The critical value represents the threshold between these two sets expressed in terms of the test statistic.
The critical value is generally calculated based on probability threshold (also known as the significance level or size of the test) below which results are deemed statistically significant. A widely used significance level is 5%. Given that our null hypothesis is \(H_{0}: \mu = 170\), we could reject this null hypothesis because the test statistic is either too low or too high, so we need to accout for both possibilities when determining the critical value.
As such we set the critical value for “surprisingly low” test statistics at the point on the distribution where 2.5% (\(5\% \div 2\)) of values are more extremely negative given the null hypothesis. And we set set the critical value for “surprisingly high” test statistics at the point on the distribution where 2.5% (\(5\% \div 2\)) of values are more extremely positive given the null hypothesis.
qnorm(0.025)
## [1] 1.959964
qnorm(1  0.025)
## [1] 1.959964
We set that these critical values are identical in magnitude and only differ in sign. We can therefore express our critical value \(c\) as \(c = 1.96\) and identify the rejection region as \(\{z: z > c \}\). As the absolute value of our observed zstatistic (\(5.57\))) easily exceeds the critical value, we can reject the null hypothesis.
We can also calculate the pvalue of our observed result. Again we need to recognize that values as extreme as those observed could be positive or negative, so the probability of finding data at least as surprising as that we have is actually twice this probability calculated above \(p = 2.4779025\times 10^{8}\). (That the observed pvalue is below 5% is equivalent to \(z > c\).)
6.3.2 Normal distribution, unknown standard deviation
In reality a researcher does not simply know the standard deviation. Fortunately, we can use our data to estimate it.
\[ \hat{\sigma}^2 = \frac{1}{n1}\sum_{i = 1}^{n} (x_i  \overline{x})^2 \] We can use our estimate \(\hat{\sigma}\) to calculate a tstatistic, which is a close analogue of the zstatistic calculated above.
\[ t = \frac{\overline{x}  \mu_0}{\hat{\sigma}/\sqrt{n}} \] It turns out that with normally distributed and independent variables, this tstatistic has a known distribution that is close to the normal distribution and that gets closer as \(n\) increase.
## [1] 5.614045
We can calculate the critical values for a test based on the tstatistics as follows:
qt(0.025, df = sample_size  1)
## [1] 1.984217
qt(1  0.025, df = sample_size  1)
## [1] 1.984217
The tdistribution is also symmetric, so we can express \(c\) as \(c = 1.98\) and identify the rejection region as \(\{t: t > c \}\).
pt(t_stat, df = sample_size  1) * 2
## [1] 1.808489e07
Discuss blurring of tstatistics and zstatistics.
Using some relations between means and regression coefficients explored in Chapter 4, we can use the lm
function to produce this tstatistic and pvalue:
##
## Call:
## lm(formula = (x  hgt_mean_null) ~ 1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## 17.064 4.853 1.043 5.036 21.342
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 4.4599 0.7944 5.614 1.81e07 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.944 on 99 degrees of freedom
6.3.3 Unknown distribution: Asymptotic methods
But what if we don’t know that the underlying distribution for heights is normal? Perhaps we even have good reason to believe that it isn’t normal. In this kind of situation, which is actually quite common, there are two basic approaches.
The first approach invokes a central limit theorem and thereby claim that the distribution of means will be asymptotically normal and thus approximately normal in finite samples.
To make things interesting, we replace our underlying normal distribution of heights with a new bimodal distribution created by drawing heights from one of two normal distribution:
bimodal_dgf < function(n) {
short < sample(c(TRUE, FALSE),
size = n,
prob = c(0.3, 0.7),
replace = TRUE)
short_sample < rnorm(n, mean = 155, sd = 3)
tall_sample < rnorm(n, mean = 171, sd = 3)
if_else(short, short_sample, tall_sample)
}
We can see that a sample drawn from this (unrealistic) distribution of heights has two modes and is clearly not normal.
set.seed(2023)
bimodal_sample < bimodal_dgf(1000)
tibble(height = bimodal_sample) %>%
ggplot(aes(x = height)) +
geom_histogram(binwidth = 1)
Now we make a function to draw multiple (n_samples
) samples (each of sample_size
size) from a given distribution (supplied using the dgf
argument) and calculate the mean for each that we can examine the distribution of these means:
get_means < function(sample_size, n_samples, dgf) {
tibble(i = 1:n_samples,
sample = map(i, function(x) dgf(sample_size)),
mean = map_vec(sample, mean),
se = map_vec(sample, function(x) sd(x)/sqrt(length(x)  1))) %>%
select(sample) %>%
mutate(z = mean/se)
}
Here we draw 10,000 samples of 1000 from bimodal_dgf
.
(Here you may be pleasantly surprised to learn that we can just pass functions like bimodal_dgf
into other functions!)
set.seed(2023)
bimodal_sample_stats < get_means(sample_size = 1000,
n_samples = 10000,
dgf = bimodal_dgf)
Now we make a small function to
make_clt_plot < function(df) {
df %>%
ggplot(aes(x = mean)) +
geom_histogram(aes(y = after_stat(density)), fill = "green",
binwidth = sd(df$mean)/10) +
geom_vline(aes(xintercept = mean(mean), color = "red")) +
stat_function(fun = dnorm,
args = list(mean = mean(df$mean),
sd = sd(df$mean)),
color = "blue") +
theme(legend.position = "none")
}
make_clt_plot(bimodal_sample_stats)
We can combine the get_means
and make_clt_plot
functions into a single function clt_demo
that allows us to make a plot of the distribution of means with an overlaid normal distribution so that we can see how well the latter approximates the former for a given sample size (sample_size
) and DGF (dgf
).
clt_demo < function(sample_size, n_samples, dgf) {
get_means(sample_size, n_samples, dgf) %>%
make_clt_plot()
}
We can now call this function to reproduce the plot above, but for sample_size = 100
.
clt_demo(sample_size = 100, n_samples = 10000, bimodal_dgf)
6.3.4 Unknown distribution: Bootstrapping
Above we considered the possibility that we had a normally distributed variable, but didn’t know the standard deviation. We saw that we can easily use the sample we have to estimate that standard deviation. In fact, we can go further and use the sample to generate information about the distribution of the underlying variable.
The bootstrapping approach in this setting proceeds as follows:
 Take a withreplacement sample of size \(n\) from our sample.
 Calculate the mean for that sample.
 Repeat steps 1 and 2 many times.
 Use the distribution above to calculate estimates of relevant parameters (e.g., the standard deviation).
We can use the sample we created above (bimodal_sample
) for this purpose.
Here we calculate 10,000 samples.
Note that get_bs_mean
performs steps 1 and 2, and we use map_vec
to perform step 3.
get_bs_mean < function(data) {
bs_sample < sample(data, size = length(data), replace = TRUE)
mean(bs_sample)
}
set.seed(2021)
bs_stats <
tibble(i = 1:n_samples) %>%
mutate(mean = map_vec(i, function(x) get_bs_mean(data = bimodal_sample)))
We might simply use the results of our bootstrapping procedure to estimate the standard error of our mean estimates.
se_est < sd(bs_stats$mean)
se_est
## [1] 0.2489343
But we could also use these results to evaluate the appropriateness of the normal approximation.
make_clt_plot(bs_stats)
Here we have barely scratched the surface of what can be done with bootstrapping analysis. In some settings we can use bootstrapping to read off critical values without needing to rely on the distribution approximating a normal distribution. We will mention bootstrapping approaches below and discuss them in more detail in subsequent chapters.
6.4 Differences in means
As we saw in Chapter 4, we are often interested in the differences between two groups of observations. Suppose that the first group has \(n_0\) observations and a mean of \(\overline{x}_0\) with standard deviation \(\hat{\sigma}_0\) and the second group has \(n_1\) observations and a mean of \(\overline{x}_1\) with standard deviation \(\hat{\sigma}_1\).
Given the maintained assumption of independence of the observations here, the two sample means are independently distributed. The difference between the means of the two groups \(\overline{x}_1  \overline{x}_0\) has variance equal to the sum of the variances of each mean.
\[ \mathrm{var}(\overline{x}_1  \overline{x}_0) = \frac{\hat{\sigma}^2_1}{n_1} + \frac{\hat{\sigma}^2_0}{n_0} \] Taking the square root of this gives the standard error of the difference in means.
\[ \mathrm{se}(\overline{x}_1  \overline{x}_0) = \sqrt{\frac{\hat{\sigma}^2_1}{n_1} + \frac{\hat{\sigma}^2_0}{n_0}} \] Again, we can often rely on the difference in means being approximately normally distributed.
One important point to note is that if the null hypothesis is \(\mu_0 = \mu_1 = 0\), then it may be tempting to think that if the null hypothesis \(\mu_1 \leq 0\) is rejected (in favour of \(H_1: \mu_1 > 0\)) while the null hypothesis \(\mu_0 \leq 0\) is not, that we can conclude that the null hypothesis \(\mu_1 \leq \mu_0\) can thereby be rejected (in favour of \(H_1: \mu_1 > \mu_0\)). This reasoning is erroneous, as we will demonstrate.
Imagine that our sampling procedure will draw 1,000 observations from each of two groups and that we will compare the means of these two subsamples.
We will construct 10,000 such samples; we do this by constructing them in turn using the get_means
function with dgf = rnorm
.
We merge these two samples by the sample identifier (i
) and for each sample, we calculate the difference in means and the standard error of the difference using our formula above.
From these values, we can calculate the zstatistic for each difference.
set.seed(2023)
sample_0_stats < get_means(sample_size = 1000,
n_samples = 10000,
dgf = rnorm)
sample_1_stats < get_means(sample_size = 1000,
n_samples = 10000,
dgf = rnorm)
merged_stats <
sample_0_stats %>%
inner_join(sample_1_stats, by = "i", suffix = c("_0", "_1")) %>%
mutate(mean_diff = mean_1  mean_0,
se_diff = sqrt(se_0^2 + se_1^2),
z_diff = mean_diff/se_diff)
Let’s assume that the size of our test is \(\alpha = 0.05\). Because we are using twotailed tests, we set the critical value to the absolute value of \(\Phi^{1}(\alpha/2)\).
The correct way to evaluate difference in means is to consider the difference significant if (1) \(\leftz_{\text{diff}}\right > c\). However, some researchers will consider the difference significant if either (2) \(z_1 > c\) while \(z_0 \leq c\) or (3) \(z_0 > c\) while \(z_1 \leq c\). And there is a risk that researchers will take the opportunity to declare the difference “statistically significant” if any of the conditions (1), (2), or (3) is met.
Below we calculate the proportion of samples for which the null hypothesis is rejected based on condition (1) (prop_sig_diff
), based on conditions (2) or (3)
(prop_sig_diff_alt
), or based on conditions (1), (2), or (3) (prop_sig_diff_choice
).
merged_stats %>%
mutate(sig_diff = abs(z_diff) > crit_value,
sig_1 = abs(z_1) > crit_value,
sig_0 = abs(z_0) > crit_value) %>%
summarize(prop_sig_diff = mean(sig_diff),
prop_sig_diff_alt = mean((sig_1 & !sig_0)  (sig_0 & !sig_1)),
prop_sig_diff_choice = mean(sig_diff  (sig_1 & !sig_0) 
(sig_0 & !sig_1)))
## # A tibble: 1 × 3
## prop_sig_diff prop_sig_diff_alt prop_sig_diff_choice
## <dbl> <dbl> <dbl>
## 1 0.0508 0.0967 0.116
6.5 Inference with regression
When the OLS estimator is unbiased, the expected value of \(\hat{\beta}\) will be \(\beta\), but \(\hat{\beta}\) is just an estimate and will typically differ from the true value, as we can see by substituting the value of \(y\) into the equation for \(\hat{\beta}\) and doing some algebra. Note that below we use the facts that \(\mathbb{E} \left[\hat{\beta}  \beta \right]\) is zero (implied by the expression below) and that the variance of a zeromean vector \(Z\) equals \(\mathbb{E} \left[ZZ' \right]\).
\[ \begin{aligned} \hat{\beta} &= (X^{\mathsf{T}}X)^{1}X^{\mathsf{T}}y \\ &= (X^{\mathsf{T}}X)^{1}X^{\mathsf{T}}(X\beta + \epsilon) \\ &= (X^{\mathsf{T}}X)^{1}X^{\mathsf{T}}X\beta + (X^{\mathsf{T}}X)^{1}\epsilon \\ & = \beta + (X^{\mathsf{T}}X)^{1}X^{\mathsf{T}}\epsilon \\ \hat{\beta}  \beta &= (X^{\mathsf{T}}X)^{1}X^{\mathsf{T}}\epsilon \\ \mathrm{var} (\hat{\beta}X) &= \mathbb{E} \left[(X^{\mathsf{T}}X)^{1}X^{\mathsf{T}} \epsilon \epsilon' X (X^{\mathsf{T}}X)^{1}  X\right] \\ &= (X^{\mathsf{T}}X)^{1}X^{\mathsf{T}} \mathbb{E} \left[\epsilon \epsilon'  X\right] X (X^{\mathsf{T}}X)^{1} \end{aligned} \]
Now if \(\mathbb{E} \left[\epsilon_i \epsilon_j \right] = 0, \forall i \neq j\), then \(\mathbb{E} \left[\epsilon \epsilon'  X\right]\) will be a diagonal matrix with each element on the main diagonal equal to \(\sigma^2_i\), the variance of \(\epsilon\) for observation \(i\) (i.e., \(\mathrm{var} (\epsilon_i)\)). And if \(\mathbb{E} [\sigma^2_i X ] = \sigma^2\) for all \(i\), then we have:
\[ \begin{aligned} \Sigma &:= \mathrm{var} (\hat{\beta}X) \\ &= \sigma^2 (X^{\mathsf{T}}X)^{1}X^{\mathsf{T}}X (X^{\mathsf{T}}X)^{1} \\ &= \sigma^2 (X^{\mathsf{T}}X)^{1} \\ \end{aligned} \]
This is the familiar result under the “standard” OLS assumptions, including the assumption that \(\mathbb{E} [\sigma^2_i X ] = \sigma^2\) for all \(i\), which is known as homoskedasticity.
One issue is that we do not observe \(\sigma^2\), but we can estimate it using the standard formula
\[ \hat{\sigma}^2 = \frac{1}{Nk} \sum_{i=1}^N \hat{\epsilon}_i^2 \] where \(\hat{\epsilon}_i\) is the regression residual for observation \(i\) and \(N  k\) incorporates a degreesoffreedom correction. (Note that \(k = 2\) when we have one regressor in addition to a constant term.) Putting the above together, we have
\[
\hat{\Sigma} = \hat{\sigma}^2 (X^{\mathsf{T}}X)^{1}
\]
This is also the standard approach to estimating the variance of coefficients in statistical software, including the lm
function in R.
To confirm this, let’s create a test data set, calculate regression statistics “by hand” and then compare results with those provided by the lm
function.
To make this more concrete, let’s consider some actual (though not “real”) data. The following code uses R functions to generate random data. Specifically, we generate 1000 observations with \(\beta = 1\) and \(\sigma = 0.2\).
We next construct \(X\) as a matrix comprising a column of ones (to estimate the constant term) and a column containing \(x\).
We now calculate \(\hat{\sigma}^2\) and then \(\hat{\Sigma}\). This allows us to calculate the estimated standard errors of the estimated coefficients as the square root of the diagonal elements of \(\hat{\Sigma}\) and then the \(t\)statistics for each coefficient by dividing each element of \(\hat{\beta}\) by its estimated standard error.
resvar < sum(resid^2)/(N2)
Sigma < resvar * solve(t(X) %*% X)
se < sqrt(diag(Sigma))
tibble(se, t = b/se) %>% knitr::kable()
se  t 

0.0064399  1.170815 
0.0063210  157.829403 
Now, let’s compare with what we get from the summary
function applied to the results of estimating a model using lm
.
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0.72772 0.12642 0.00101 0.12774 0.65945
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.007540 0.006440 1.171 0.242
## x 0.997647 0.006321 157.829 <2e16 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2036 on 998 degrees of freedom
## Multiple Rsquared: 0.9615, Adjusted Rsquared: 0.9614
## Fstatistic: 2.491e+04 on 1 and 998 DF, pvalue: < 2.2e16
You may have noticed that these approaches are analogous to using tstatistics to test hypotheses about means that we examined above. Similarly, we might use asymptotic methods or bootstrapping to test hypotheses about regression coefficients. In fact, the assumptions required to yield regression coefficients that follow the \(t\) distribution—such as normally distributed error terms—are generally implausible in most research setting and in practice most inferences are justified based on asymptotic results and assumptions of approximately normal distributions in finite samples of actual research. In fact, all the approaches discussed in the next section rely on asymptotic justifications.
In the remainder of this section, we discuss how bootstrapping can be applied to regression analysis. The bootstrapping approach in this setting proceeds as follows:
 Take a withreplacement sample of size \(n\) from our sample of data.
 Calculate the regression coefficients for that sample.
 Repeat steps 1 and 2 many times.
 Use the distribution above to calculate estimates of relevant parameters (e.g., the standard error of regression coefficients).
To demonstrate this, we put our y
and x
variables above into a data frame and create a function (get_bs_coefs
) that performs steps 1 and 2.
df < tibble(y, x)
get_bs_coefs < function(data) {
rows < 1:nrow(data)
bs_rows < sample(rows, size = length(rows), replace = TRUE)
bs_sample < data[bs_rows, ]
fm < lm(y ~ x, data = bs_sample)
fm$coefficients
}
We then use map_df
to repeat these steps 10,000 times.
set.seed(2021)
bs_coefs <
tibble(i = 1:n_samples) %>%
mutate(map_df(i, function(x) get_bs_coefs(data = df)))
We can then calculate the standard deviation of the estimated regression coefficients as an estimate of their standard errors.
## # A tibble: 1 × 2
## `(Intercept)` x
## <dbl> <dbl>
## 1 0.00645 0.00650
6.6 Dependence
In generating data in Chapter 4, we assumed that the errors and regressors were independent. However, most of the empirical accounting literature uses panel data sets, typically repeated observations on a set of firms over time. In these studies, it is reasonable to assume that regressors (\(X\)) and errors (\(\epsilon\)) are correlated both across firms (i.e., crosssectional dependence) and over time (i.e., timeseries dependence).
While White (1980) standard errors are consistent in the presence of heteroskedasticity, it is well known that both OLS and White produce misspecified test statistics when either form of dependence is present.
Gow et al. (2010) demonstrate that accounting researchers use a variety of approaches to address the issues of crosssectional and timeseries dependence widely understood to be present in common research settings. The literature survey in Gow et al. (2010) identified a number of common approaches: FamaMacBeth, NeweyWest, the “\(Z2\)” statistic, and standard errors clustered by firm, industry, or time.
6.6.1 NeweyWest
Newey and West (1987) generalized the White (1980a) approach to yield a covariance matrix estimator that is robust to both heteroskedasticity and serial correlation. While the NeweyWest procedure was developed in the context of a single time series, Gow et al. (2010) show it is frequently applied in panel data settings, and in such settings assumes crosssectional independence. Gow et al. (2010) assess the performance of NeweyWest in the presence of both crosssectional and timeseries dependence, and find that it produces misspecified test statistics with even moderate levels of crosssectional dependence.
6.6.2 FamaMacBeth
The FamaMacBeth approach (Fama and MacBeth, 1973) is designed to address concerns about crosssectional correlation. The original FamaMacBeth approach (which Gow et al., 2010 label “FM‑t”) involves estimating T crosssectional regressions (one for each period) and basing inferences on a tstatistic calculated as
\[ t = \frac{\overline{\beta}}{\mathit{se}(\beta)}\text{, where } \overline{\beta} = \frac{1}{T} \sum_{t=1}^T \hat{\beta}_t\]
and \(\mathit{se}(\beta)\) is the standard error of the coefficients based on their empirical distribution. When there is no crossregression (timeseries) dependence, this approach yields consistent estimates of the standard error of the coefficients as T goes to infinity. However, crossregression dependence in errors and regressors causes FamaMacBeth standard errors to be understated (Cochrane, 2009; Schipper and Thompson, 1983).
To illustrate the FamaMacBeth approach, we will generate a data set using the simulation approach of Gow et al. (2010), which is available using the get_got_data
function found in the farr
package.
The parameters Xvol
and Evol
relate to the crosssectional dependence of \(X\) and \(\epsilon\) repsectively.
The parameters rho_X
and rho_E
relate to the timeseries dependence of \(X\) and \(\epsilon\) respectively.
We get 10 years of simulated data on 500 firms and store the data in the data frame named test
.
set.seed(2021)
test < get_got_data(N = 500, T = 10, Xvol = 0.75,
Evol = 0.75, rho_X = 0.5, rho_E = 0.5)
We first estimate use the nest()
function from tidyr
to organize the relevant data by year
.
We then use the map
function from the purrr
package to estimate the regression model by year and place the fitted models in a column named fm
, then again to extract the coefficients from the fm
column and store them in the coefs
column.
Finally, we combine the year
column and the data in the coefs
column into a new data frame that we can tabulate.
Note that we use the shortcut provided by purrr
that allows us to write function(z) lm(y ~ x, data = z)
as simply ~ lm(y ~ x, data = .)
.
results <
test %>%
group_by(year) %>%
nest() %>%
mutate(fm = map(data, ~ lm(y ~ x, data = .)),
coefs = map(fm, coefficients))
coefs <
bind_cols(year = results$year, bind_rows(results$coefs)) %>%
rename(Constant = `(Intercept)`)
coefs %>% knitr::kable()
year  Constant  x 

1  1.6355567  0.9513677 
2  2.7017022  0.9898722 
3  1.1093092  0.9078208 
4  0.3491182  1.0951516 
5  0.8628827  1.0034783 
6  0.4451103  0.9104420 
7  0.2011427  1.0506848 
8  1.5738367  0.8011475 
9  2.7726904  0.9517446 
10  0.8071852  0.9843101 
The second step of the FamaMacBeth approach involves calculating the mean and standard error of the estimated coefficients. It can be shown that regression of the timeseries of coefficients on a constant produces the required means and standard errors.
Two common variants of the FamaMacBeth approach appear in the accounting literature. The first variant, which Gow et al. (2010) label “FM‑i”, involves estimating firm or portfoliospecific timeseries regressions with inferences based on the crosssectional distribution of coefficients, and had been used extensively in the accounting literature prior to Gow et al. (2010). This modification of the FamaMacBeth approach is appropriate if there is timeseries dependence but not crosssectional dependence. However, it is difficult to identify such circumstances in real research settings and we do not discuss FMi further. ^{51}
The second common variant of the FM‑t approach, which Gow et al. (2010) label FM‑NW, is intended to correct for serial correlation in addition to crosssectional correlation. FM‑NW modifies FM‑t by applying a NeweyWest adjustment in an attempt to correct for serial correlation. While a number of studies claimed that FM‑NW produces a conservative estimate of statistical significance, they did so without any formal evaluation of FM‑NW.
Gow et al. (2010) found that FMNW simply does not address serial correlation and offer two explanations for this failure.
First, FM‑NW generally involves applying NeweyWest to a limited number of observations, a setting in which NeweyWest is known to perform poorly.
Second, FM‑NW applies NeweyWest to a timeseries of coefficients, whereas the dependence is in the underlying data.
For example, without using information in the underlying data, it seems difficult to determine whether a sequence such
as {2.1, 1.8, 1.9, 2.2, 2.3}
represents a highly autocorrelated draw from a meanzero distribution or evidence of a population coefficient of around 2.
Nonetheless, implementing FMNW is straightforward using the sandwich
package.
We simply apply the NeweyWest
function to the fitted FMt model to recover FMNW standard errors.
The following function does this:^{52}
get_nw_ses < function(fm, lag = 1) {
vcov < NeweyWest(fm, lag = lag, prewhite = FALSE, adjust = TRUE)
sqrt(diag(vcov))
}
We tabulate each of the FamaMacBeth coefficients twice: once with the default standard errors (by passing NULL
to the se
argument of stargazer
) and once with NeweyWest standard errors from the get_nw_ses()
function defined above).
sg_format < "html"
stargazer(list(fms, fms),
type = sg_format,
se = list(NULL, NULL,
get_nw_ses(fms[[1]]),
get_nw_ses(fms[[2]])),
column.labels = c("FMt", "FMt",
"FMNW", "FMNW"),
dep.var.caption = "Covariate",
covariate.labels = "Coefficient",
omit.stat=c("ser", "f", "adj.rsq"),
report = "vcs",
model.names = FALSE,
dep.var.labels.include = TRUE,
header = FALSE)
Covariate  
Constant  x  Constant  x  
FMt  FMt  FMNW  FMNW  
(1)  (2)  (3)  (4)  
Coefficient  0.098  0.965  0.098  0.965 
(0.506)  (0.026)  (0.457)  (0.020)  
Observations  10  10  10  10 
R^{2}  0.000  0.000  0.000  0.000 
Note:  ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 
6.6.3 Z2 statistic
The Z2 statistic first appeared in 1994 and appeared in a number of subsequent studies in the accounting literature. The Z2‑t (Z2i)statistic is calculated using tstatistics from separate crosssectional (timeseries) regressions for each time period (crosssectional unit) and is given by the expression:
\[ \textit{Z2} = \frac{\overline{t}}{\mathit{se}(t)}\text{, where } \overline{t} = \frac{1}{T} \sum_{t=1}^T \hat{t}_t\]
\(\mathit{se}(t)\) is the standard error of the tstatistics based on their empirical distribution, and T is the number of time periods (crosssectional units) in the sample. Several studies claimed that \(Z2\) adjusts for crosssectional and serial correlation, but offered no basis for this claim. Gow et al. (2010) found no prior formal analysis of the properties of the \(Z2\) statistic and found that \(Z2\) suffers from crossregression dependence in the same way as the FamaMacBeth approach does.
6.6.4 Oneway clusterrobust standard errors
A number of studies in our survey use clusterrobust standard errors, with clustering either along a crosssectional dimension (e.g., analyst, firm, industry, or country) or along a timeseries dimension (e.g., year); we refer to the former as CL‑i and the latter as CL‑t. Clusterrobust standard errors (also referred to as HuberWhite or Rogers standard errors) were proposed by White (1984) as a generalization of the heteroskedasticityrobust standard errors of White (1980). With observations grouped into \(G\) clusters of \(N_g\) observations, for \(g\) in \({1, \dots, G}\), the covariance matrix is estimated using the following expression,
\[ \hat{V}(\hat{\beta}) = (X'X)^{1} \hat{B} (X'X)^{1} \text{, where } \hat{B} = \sum_{g=1}^G X'_g u_g u'_g X_g \]
where \(X_g\) is the \(N_g \times K\) matrix of regressors, and \(u_g\) is the \(N_g\)vector of residuals for cluster \(g\). If each cluster contains a single observation, equation (3) yields the White (1980) heteroskedasticityconsistent estimator.
While oneway clusterrobust standard errors allow for correlation of unknown form within cluster, it is assumed that errors are uncorrelated across clusters. For example, clustering by time (firm) allows observations to be crosssectionally (serially) correlated, but assumes independence over time (across firms). As we demonstrate, \(t\)statistics for CL‑t are inflated in the presence of timeseries dependence and \(t\)statistics for CL‑i are inflated in the presence of crosssectional dependence. Thus, when both forms of dependence are present, both CL‑t and CL‑i produce overstated \(t\)statistics.
It is also important to note that clustering by, say, industryyear does not produce standard errors robust to withinindustry and withinyear dependence. Instead, like all clusterrobust methods, this approach assumes independence across clusters. As such, clustering by industryyear assumes that each industryyear cluster is independent; that is, neither timeseries nor crossindustry correlation is an issue (i.e., observations for industry \(j\) in year \(t\) are uncorrelated with those in industry \(j\) in year \(t+1\) and those in industry \(k\) in year \(t\)). To see this point, note that clustering by firmyear when observations are firmyears produces White standard errors.
6.6.5 Twoway clusterrobust standard errors
Thompson (2011) and Cameron et al. (2011) developed an extension of clusterrobust standard errors that allows for clustering along more than one dimension.
In contrast to oneway clustering, twoway clustering (CL‑2) allows for both timeseries and crosssectional dependence. For example, twoway clustering by firm and year allows for withinfirm (timeseries) dependence and withinyear (crosssectional) dependence (e.g., the observation for firm \(j\) in year \(t\) can be correlated with that for firm \(j\) in year \(t+1\) and that for firm \(k\) in year \(t\)). To estimate twoway clusterrobust standard errors, the expression in (3) is evaluated using clusters along each dimension (for example, clustered by industry and clustered by year) to yield \(V_1\) and \(V_2\). Then the same expression is calculated using the “intersection” clusters (in our example, observations within an industryyear) to yield \(V_I\). The twoway clusterrobust estimator \(V\) is calculated as \(V = V_1 + V_2  V_I\). Packages available for R make it relatively straightforward to implement twoway clusterrobust standard errors and we demonstrate their use below.
6.6.6 Calculating standard errors
In the following code, we use functions from three libraries: sandwich
, plm
, and lfe
.
The lfe
library provides the felm
function as an extension to lm
that allows for fixed effects, instrumental variables, and clustered standard errors.
We will discuss the use of felm
with fixed effects and instrumental variables in later chapters.
Here we just use the functionality related to clustered standard errors.
The interface to felm
uses a fourpart formula specification with each part separated by 
.
The first part consists of ordinary covariates, the second part consists of “factors to be projected out” (think of these as fixed effects), the third part provides instrumental variables, and the fourth part provides the specification for clustered standard errors.
We use 0
for the second and third parts below.
fms < list(OLS = lm(y ~ x, data = test),
White = lm(y ~ x, data = test),
NW = plm(y ~ x, test, index = c("firm", "year"),
model = "pooling"),
`FMt` = pmg(y ~ x, test, index = "year"),
`CLi` = felm(y ~ x  0  0  firm, data = test),
`CLt` = felm(y ~ x  0  0  year, data = test),
`CL2` = felm(y ~ x  0  0  year + firm, data = test))
se < vector("list", length(fms))
names(se) < names(fms)
se[["White"]] < sqrt(diag(vcovHC(fms[["White"]], type="HC1")))
se[["NW"]] < sqrt(diag(vcovNW(fms[["NW"]])))
stargazer(fms, type = sg_format, digits = 3,
se = se,
dep.var.caption = "",
omit.stat=c("ser", "f", "adj.rsq"),
column.labels = names(fms),
report = "vcs",
model.names = FALSE,
dep.var.labels.include = FALSE,
header = FALSE)
OLS  White  NW  FMt  CLi  CLt  CL2  
(1)  (2)  (3)  (4)  (5)  (6)  (7)  
x  1.266  1.266  1.266  0.965  1.266  1.266  1.266 
(0.027)  (0.021)  (0.021)  (0.026)  (0.018)  (0.172)  (0.172)  
Constant  0.021  0.021  0.021  0.098  0.021  0.021  0.021 
(0.026)  (0.026)  (0.027)  (0.506)  (0.023)  (0.492)  (0.492)  
Observations  5,000  5,000  5,000  5,000  5,000  5,000  5,000 
R^{2}  0.305  0.305  0.305  0.786  0.305  0.305  0.305 
Note:  ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 
One concern with clusterrobust methods is their finitesample properties. Cameron et al. (2008) document that clusterrobust methods will overreject a true null when the number of clusters is small and attribute this issue to the use of limiting distributions (typically the normal distribution) in making inferences in small samples (e.g., using the standard 1.64, 1.96, and 2.58 critical values). However, it is important to note that most methods used in accounting research (including White, NW, FM‑i, FM‑t, Z2) have asymptotic foundations and thus are affected by this issue (e.g., FM‑t with 10 years of data). Accordingly, researchers should exercise caution when applying any asymptotic methods in smallsample settings (e.g., FMt, CLt, or CL2 when the number of time periods is small). This issue notwithstanding, we continue to find CL‑2 produces unequivocally better inferences than any of these methods even with as few as 10 clusters. Furthermore, as the econometrics literature has identified approaches to address this concern (e.g., Cameron et al., 2008 identify corrections based on bootstrapping methods), having few clusters does not warrant relying on approaches that are not robust to crosssectional and timeseries dependence.
6.6.7 Exercises
 Confirm either by direct computation using the data in
coefs
or using a little algebra that regressing the firststage coefficients on a constant yields the desired secondstage results for FMt.  Assume that the null hypothesis is \(H_0: \beta = 1\). Using the reported coefficients and standard errors for each of the methods listed in the table (i.e., OLS, White, NW, FMt, CLi, CLt, CL2), for which methods is the null hypothesis rejected?
 Based on the analysis in Gow et al. (2010), when should you use the FMNW or Z2 approaches? What factors likely led to the use of these approaches and claims for them (i.e., robustness to crosssectional and timeseries dependence) that were unsubstantiated (and ultimately false)?
 If FMi is, as Gow et al. (2010) show, so inappropriate for the situations in which it was used, why do you think it was used in those situations?

Gow et al. (2010) refer to “the standard 1.64, 1.96, and 2.58 critical values.”
Using
pnorm
function, determine the \(p\)value associated with each of these critical values. Are these onetailed or twotailed \(p\)values? The values of 1.64, 1.96, and 2.58 are approximations. Which function in R allows you to recover the precise critical values associated with a chosen \(p\)value? (Hint: Read the help provided by? pnorm
in R.) Provide the critical values to four decimal places.