4 Regression fundamentals
In this chapter, we provide a short introduction to the fundamentals of regression analysis with focus on ordinary least-squares (OLS) regression. Our emphasis here is on helping the reader to build intuition for the mechanics of regression. We demonstrate different ways of achieving various regression results to strengthen the reader’s intuition for what regression is doing. In this spirit, we close the chapter with a brief discussion of the Frisch-Waugh-Lovell theorem which provides a way of representing multivariate regression coefficients as the result of a single-variable regression.
While we motivate some of our regressions with a data set that prompts a number of causal questions, we largely sidestep the issue of when OLS regression does or does not produce valid estimates of causal effects. Thus while we loosely hint at possible causal interpretations of results in this chapter, the reader should be cautious about these interpretations. We begin our formal analysis of causal inference with Chapter 5.
Additionally, while we point out early that OLS will provide noisy estimates of estimands, we do not address issues regarding how precise these estimates are or how to assess the statistical significance of results. A few \(p\)-values and \(t\)-statistics will crop up in regression analysis shown in this chapter, but we ignore those details for now. We begin our study of statistical inference in Chapter 6.
The code in this chapter uses the packages listed in the code below. For instructions on how to set up your computer to use the code found in this book, see Section 1.2.1 (note that Step 4 is not required as we do not use WRDS data in this chapter).
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(farr)
library(lfe) # For felm()
library(tidyr) # For pivot_wider()
library(knitr) # For kable()
library(stargazer)
We use the stargazer
package to render regression output from this chapter.
For the HTML version of this book, we set sg_format <- "html"
, but sg_format <- "text"
is a better option if looking at the results interactively.
sg_format <- "html"
4.1 Introduction
Suppose we have data on variables \(y\), \(x_1\) and \(x_2\) for \(n\) units and that we conjecture that there is a linear relationship between these variables of the following form
\[ y_i = \beta_0 + \beta_1 \times x_{i1} + \beta_2 \times x_{i2} + \epsilon_i \] where \(i \in {1, \dots, n}\) denotes the data for a particular unit.
We can write that in matrix form as follows:
\[ \begin{bmatrix} y_1 \\ y_2 \\ \dots \\ y_{n-1} \\ y_{n} \\ \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \dots & \dots & \dots \\ 1 & x_{n-1,1} & x_{n-1, 2} \\ 1 & x_{n, 1} & x_{n, 2} \end{bmatrix} \times \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \dots \\ \epsilon_{n-1} \\ \epsilon_{n} \\ \end{bmatrix} \]
And this can be written even more compactly as
\[ y = X \beta + \epsilon \] where \(X\) is an \(n \times 3\) matrix and \(y\) is an \(n\)-element vector. It is conventional to denote the number of columns in the \(X\) matrix using \(k\), where \(k = 3\) in this case.^{25}
In a regression context, we call \(X\) the regressors, \(y\) the regressand, and \(\epsilon\) the error term. We assume that we observe \(X\) and \(y\), but not \(\epsilon\). If we did observe \(\epsilon\), then we could probably solve for the exact value of the coefficients \(\beta\) with just a few observations. Lacking such information, we can produce an estimate of our estimand, \(\beta\). Our estimate (\(\hat{\beta}\)) is likely to differ from \(\beta\) due to noise arising from the randomness of the unobserved \(\epsilon\) and also possibly bias. There will usually be a number of estimators that we might consider for a particular problem. We will focus on the ordinary least-squares regression (or OLS) estimator as the source for our estimates in this chapter.
OLS is a mainstay of empirical research in the social sciences in general and in financial accoiunting in particular. In matrix notation, the OLS estimator is given by
\[ \hat{\beta} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y \]
Let’s break this down. First, \(X^{\mathsf{T}}\) is the transpose of \(X\), meanting the \(k \times n\) matrix formed by making the rows of \(X\) into columns. Second, \(X^{\mathsf{T}} X\) is the product of the \(k \times n\) matrix \(X^{\mathsf{T}}\) and the \(n \times k\) matrix \(X\), which results in a \(k \times k\) matrix. Third, the \(-1\) exponent indicates the inverse matrix. For a real number \(x\), \(x^{-1}\) denotes the number that when multiplied by \(x\) gives \(1\) (i.e., \(x \times x^{-1} = 1\)). For a square matrix \(Z\) (here “square” means the number of rows equals the number of columns), \(Z^{-1}\) denotes the square matrix that when multiplied by \(Z\) gives the identity matrix, \(\mathbf{I}\) (i.e., \(Z \times Z^{-1} = \mathbf{I}\)).
The \(3 \times 3\) identity matrix looks like this
\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]
Note that, just as there is no meaningful way to calculate the inverse of \(0\), it’s not always possible to take the inverse of a matrix. But, so long as no column of \(X\) is a linear combination of other columns and \(n > k\), then we can calculate \((X^{\mathsf{T}}X)^{-1}\) (and there are standard algorithms for doing so). Now, \(X^{\mathsf{T}}y\) is the product of a \(k \times n\) matrix (\(X^{\mathsf{T}}\)) and a vector with \(n\) elements (this can be thought of as an \(n \times 1\) matrix), so the result will be a \(k \times 1\) matrix. Thus the product of \((X^{\mathsf{T}}X)^{-1}\) and \(X^{\mathsf{T}}y\) will be a vector with \(k\) elements, which we can denote as follows
\[ \begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\beta}_3 \\ \end{bmatrix} \]
So \(X^{\mathsf{T}}X\) is a \(k \times k\) matrix (as is its inverse), and \(X^{\mathsf{T}}y\) is the product of a \(k \times N\) matrix times an \(N \times 1\) matrix (i.e., a vector). So \(\hat{\beta}\) is a \(k\)-element vector. If we have a single regressor \(x\), then \(X\) will typically include the constant term, so \(k = 2\).
For this chapter, we will assume that the model \(y = X \beta + \epsilon\) is a structural (causal) model. What this means is that if we could somehow increase the value of \(x_{i1}\) by 1 unit without changing any other part of the system, we would see an increase in the value of \(y_{i}\) equal to \(\beta_1\). This model is causal in the sense that a unit change in \(x_{i1}\) can be said to cause a \(\beta_1\) change in \(y_i\).
To make this more concrete, let’s consider some actual (though not “real”) data. The following code uses R functions to generate random data. Specifically, we generate 1000 observations with \(\beta = 1\) and \(\sigma = 0.2\).
We next construct \(X\) as a matrix comprising a column of ones (to estimate the constant term) and a column containing \(x\).
## [,1] [,2]
## [1,] 1 -0.1224600
## [2,] 1 0.5524566
## [3,] 1 0.3486495
## [4,] 1 0.3596322
## [5,] 1 0.8980537
## [6,] 1 -1.9225695
Naturally, R has built-in matrix operations.
To get \(X^{\mathsf{T}}\), the transpose of the matrix \(X\), we use t(X)
.
To multiply two matrices, we use the matrix multiplication operator %*%
.
And to invert \((X^{\mathsf{T}}X)\) to get \((X^{\mathsf{T}}X)^{-1}\), we use the solve()
function.
Thus the following calculates the OLS estimator \(\hat{\beta} = (X^{\mathsf{T}}X)^{-1} X^{\mathsf{T}}y\).
x |
---|
0.00754 |
0.99765 |
4.2 Running regressions in R
According to the documentation:
The basic function for fitting ordinary multiple models is
lm()
, and a streamlined version of the call is as follows:
fitted.model <- lm(formula, data = data.frame)
For example,
fm1 <- lm(y ~ x1 + x2, data = production)
would fit a multiple regression model of y
on x1
and x2
using data from the data frame production
.
We use the lm
function—which is part of base R—to estimate regressions in this chapter.
Note that R was developed by statisticians and thus works in a way consistent with history.
For example, if we run the following code, we actually estimate coefficients on the constant term (the intercept),
x1
, x2
, and the product of x1
and x2
.
fm2 <- lm(y ~ x1 * x2, data = production)
In contrast to other statistical packages, there’s no need to calculate the product of x1
and x2
and store it as a separate variable and there’s no need to explicitly specify the “main effect” terms (i.e., x1
and x2
) in the regression equation.
R (and the lm
function) takes care of these details for us.
The third argument to the lm
function is subset
, which allows us to specify as condition that each observation needs to satisfy to be included in the regression.
Here the first argument to lm
(formula
) is “an object of class formula … a symbolic description of the model to be fitted.”
As we are regressing \(y\) on \(x\), we use the formula y ~ x
here; we will soon see more complicated formula expressions.
The value to the data
argument is normally a data frame, so we put x
and y
into a tibble.
We call the lm
function, store the returned value in the variable fm
, then show some of the contents of fm
:
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Coefficients:
## (Intercept) x
## 0.00754 0.99765
From this out put we see that we get the same results using lm
as we do using matrix algebra.
The data set we will focus on next is test_scores
, which is part of the farr
package.
This data frame contains data on test scores for 1000 students over four years (grades 5 through 8).
test_scores
## # A tibble: 4,000 × 6
## id grade post treat score grade_effect
## <int> <dbl> <lgl> <lgl> <dbl> <dbl>
## 1 1 5 FALSE TRUE 498. 80
## 2 1 6 FALSE TRUE 513. 98
## 3 1 7 TRUE TRUE 521. 103
## 4 1 8 TRUE TRUE 552. 119
## 5 2 5 FALSE TRUE 480. 80
## 6 2 6 FALSE TRUE 515. 98
## 7 2 7 TRUE TRUE 535. 103
## 8 2 8 TRUE TRUE 551. 119
## 9 3 5 FALSE FALSE 504. 80
## 10 3 6 FALSE FALSE 522. 98
## # … with 3,990 more rows
The variable treat
indicates whether a student attended a science camp during the summer after sixth grade.
The following code counts how many different values of treat
there are for each student (id
).
In Chapter 3, we used the table
function and here we use the count
function, which performs a similar function, but works better with the Tidyverse work flow.
From the output we see that each student is associated with one value of treat
, so students who went to the science camp have the value TRUE
in treat
, even in the rows of data that relate to years before they went to the camp.
test_scores %>%
select(id, treat) %>%
distinct() %>%
group_by(id) %>%
summarize(n_treat_vals = n()) %>%
ungroup() %>%
count(n_treat_vals)
## # A tibble: 1 × 2
## n_treat_vals n
## <int> <int>
## 1 1 1000
We can also see that exactly half the students in the sample attended the science camp.
## # A tibble: 1 × 1
## prop_treat
## <dbl>
## 1 0.5
From inspection of the data, we can infer that post
is TRUE
when the student is in grades 7 or 8 (i.e., after the summer when the science camp was held), and FALSE
before then.
Note that this variable has the same relationship with grade
whether the student attended the science camp or not.
## # A tibble: 4 × 3
## grade post n
## <dbl> <lgl> <int>
## 1 5 FALSE 1000
## 2 6 FALSE 1000
## 3 7 TRUE 1000
## 4 8 TRUE 1000
The question we might be interested in is whether attending the science camp improves test performance. The natural first thing to do would be to look at a plot of the data:
test_scores %>%
group_by(grade, treat) %>%
summarize(score = mean(score),
.groups = "drop") %>%
ggplot(aes(x = grade, y = score, colour = treat)) +
geom_line()
What we see is that the students who went to the camp had lower scores in grades 5 and 6 (i.e., before the camp), but stronger performance in grades 7 and 8. This provides prima facie evidence of an positive effect of the science camp on scores.
While the ideal approach might be to randomly assign students to the camp and then compare scores after the camp, it appears that in this case students with lower scores went to camp. There is not obvious story that results in what we see in the plot above that is more plausible than (or at least as simple as) one that attributes a positive effect on test scores of going to the science camp.
A question that we might now have is: What is the best way to test for an effect of the science camp on test scores? One reasonable hypothesis is that the summer camp has a biggest effect on seventh-grade scores and that we might compare seventh-grade scores with sixth-grade scores to get the best estimate of the effect of camp on scores.
In the code below, we use filter()
to limit our analysis to data related to sixth and seventh grades.
We then calculate mean scores for each value of (post, treat)
.
We then use pivot_wider
to put score
values for different levels of post
in the same row so that we can calculate change
.
test_summ <-
test_scores %>%
filter(grade %in% 6:7L) %>%
group_by(post, treat) %>%
summarize(score = mean(score), .groups = "drop") %>%
pivot_wider(names_from = post, values_from = score) %>%
rename(post = `TRUE`, pre = `FALSE`) %>%
mutate(change = post - pre)
You may find it helpful in understanding code examples such as this one to look at the intermediate output along the line of pipes.
For example, if you highlight the text from test_scores %>%
through to the end of the summarize
line just before the pipe at the end of the last line, and click “Run Selected Lines” in the “Code” menu of RStudio (or hit CTRL
+ Enter
on Windows or Linux or ⌘
+ Enter
on MacOS) you will see what is coming out at that point of the pipe (it should look like the following):
test_scores %>%
filter(grade %in% 6:7L) %>%
group_by(post, treat) %>%
summarize(score = mean(score), .groups = "drop")
## # A tibble: 4 × 3
## post treat score
## <lgl> <lgl> <dbl>
## 1 FALSE FALSE 516.
## 2 FALSE TRUE 511.
## 3 TRUE FALSE 520.
## 4 TRUE TRUE 532.
We strongly recommend that you use this approach liberally when working through this book (and when debugging your own Tidyverse pipelines).
We stored the results of our analysis in test_summ
and can use the kable()
function from the knitr
package to produce a simple table:
treat | pre | post | change |
---|---|---|---|
FALSE | 515.5390 | 519.6573 | 4.1183 |
TRUE | 510.8101 | 531.9380 | 21.1280 |
Here we see that the scores of the treated students (i.e., those who went to the summer camp) increased by 21.1280, while the scores of the control students increased by 4.1183. One approach is to view the outcome of the control students as the “but-for-treatment” outcome that the treated students would have seen had they not gone to summer camp. With this view, the effect of going to camp is the difference in the difference in scores, or 17.0097. This is the difference-in-differences estimator of the causal effect.
Note that we can also recover this estimator using the lm
function.
##
## Call:
## lm(formula = score ~ treat * post, data = test_scores, subset = grade %in%
## 6:7L)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.4986 -4.7772 0.1442 4.6307 24.1969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 515.5390 0.3135 1644.345 <2e-16 ***
## treatTRUE -4.7290 0.4434 -10.666 <2e-16 ***
## postTRUE 4.1183 0.4434 9.288 <2e-16 ***
## treatTRUE:postTRUE 17.0097 0.6270 27.127 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.011 on 1996 degrees of freedom
## Multiple R-squared: 0.5562, Adjusted R-squared: 0.5556
## F-statistic: 834 on 3 and 1996 DF, p-value: < 2.2e-16
Note that we did not need to specify the inclusion of the main effects of treat
and post
; R automatically added those when we requested their interaction (treat * post
).
Also note that we did not need to convert the logical variables treat
and post
so that TRUE
is 1
and FALSE
is 0
; in effect, R also did this for us.
A natural question might be whether we could do better using all four years of data. There is no simple answer to this question unless we have a stronger view of the underlying causal model. One causal model might have it that students have variation in underlying talent, but that there is also variation in industriousness that affects how students improve over time. From the perspective of evaluating the effect of the camp, variation in industriousness is going to add noise to estimation that increases if we are comparing performance in fifth grade with that in eighth grade.
Another issue is that the effects of the summer camp might fade over time. As such, we might get a larger estimated effect if we focus on seventh-grade scores than if we focus on (or also include) eighth-grade scores. But from a policy perspective, we might care more about sustained performance improvement and actually prefer eighth-grade scores.
However, if we were willing to assume that, in fact, scores are a combination of underlying, time-invariant individual talent, the persistent effects (if any) of summer camp, and random noise, then we’d actually do better to include all observations.^{26}
##
## Call:
## lm(formula = score ~ treat * post, data = test_scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.260 -8.409 0.045 8.665 30.528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 505.8878 0.3504 1443.841 < 2e-16 ***
## treatTRUE -3.5364 0.4955 -7.137 1.13e-12 ***
## postTRUE 21.7646 0.4955 43.924 < 2e-16 ***
## treatTRUE:postTRUE 15.7359 0.7008 22.456 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.08 on 3996 degrees of freedom
## Multiple R-squared: 0.6615, Adjusted R-squared: 0.6613
## F-statistic: 2603 on 3 and 3996 DF, p-value: < 2.2e-16
Another possibility is that scores are a combination of underlying, time-invariant individual talent, the persistent effects (if any) of summer camp, and random noise and the grade in which the test is taken. For example, perhaps the test taken in seventh grade is similar to that taken in sixth grade but, with an extra year of schooling, students might be expected to do better in the higher grade assuming the scores are not scaled in any way within grades. (Recall that we just have this data set without details that would allow us to rule out such ideas, so the safest thing to do is to examine the data.) The easiest way to include grade is as a linear trend, which means that grade is viewed as a number (e.g., 5 or 8):
##
## Call:
## lm(formula = score ~ treat * post + grade, data = test_scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.7556 -4.8276 0.1544 4.7289 25.7705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 412.3357 1.2547 328.64 <2e-16 ***
## treatTRUE -3.5364 0.3174 -11.14 <2e-16 ***
## postTRUE -12.2543 0.5498 -22.29 <2e-16 ***
## grade 17.0095 0.2244 75.79 <2e-16 ***
## treatTRUE:postTRUE 15.7359 0.4489 35.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.097 on 3995 degrees of freedom
## Multiple R-squared: 0.8611, Adjusted R-squared: 0.861
## F-statistic: 6194 on 4 and 3995 DF, p-value: < 2.2e-16
Note that we have exactly the same coefficients on treat
and treat * post
as we had before (-3.5364 and 15.7359, respectively).
The easiest way to understand the estimated coefficients is to plug in some candidate \(X\) values to get the fitted values.
Suppose we have a student who went to summer camp. In grade 6, this student’s predicted score would be
\[ 412.3357 + -3.5364 + 6 \times 17.0095 = 510.8561 \]
In grade 7, this student’s predicted score would be
\[ 412.3357 + -3.5364 + 7 \times 17.0095 + 15.7359 = 543.6015 \]
An alternative approach that allows for grade
to affect scores would be to estimate a separate intercept for each level that grade
takes on.
That is, we’d have a different intercept for grade==6
, a different intercept for grade==7
, and so on.
While we could achieve this outcome by creating variables using an approach such as mutate(grade7 = grade == 7)
, it is easier to use the factor
functionality within R.
As discussed in Chapter 3, factors are a type that is useful for representing categorical variables, which often have no meaningful numerical representation (e.g., “red” or “blue”, or “Australia” or “New Zealand”) or where we want to move away from a simple numerical representation (e.g., grade 7 may not be simply 7/6 times grade 6).^{27}
Before adding a factor
version of grade
to the model above, let’s run a simpler regression.
##
## Call:
## lm(formula = score ~ factor(grade), data = test_scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.8631 -5.3754 -0.0097 5.7598 30.3373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 495.0645 0.2650 1867.91 <2e-16 ***
## factor(grade)6 18.1100 0.3748 48.32 <2e-16 ***
## factor(grade)7 30.7331 0.3748 82.00 <2e-16 ***
## factor(grade)8 46.6421 0.3748 124.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.381 on 3996 degrees of freedom
## Multiple R-squared: 0.8063, Adjusted R-squared: 0.8062
## F-statistic: 5546 on 3 and 3996 DF, p-value: < 2.2e-16
This model estimates fixed effects for each grade
without other covariates.
Let’s look at the mean scores by grade for comparison.
mean_by_grade <-
test_scores %>%
group_by(grade) %>%
summarize(score = mean(score),
.groups = "drop")
mean_by_grade %>% kable(digits = 3)
grade | score |
---|---|
5 | 495.065 |
6 | 513.175 |
7 | 525.798 |
8 | 541.707 |
The idea of fixed effects is that there are time-invariant factors that have a constant effect on the outcome (hence fixed effects). In some settings, we would posit fixed effects at the level of the individual. Here we are positing fixed effects at the grade level. Working through the exercises should provide additional insights into what we are doing here. For more on fixed effects, see Cunningham (2021, pp. 391–392) and also Chapter 23 below.
Now, let’s estimate fixed effects for both grade
and student (id
).
This will yield more fixed effects than we have students (we have 1000 students), so we suppress the coefficients for the fixed effect in the regression output.
fm_id <- lm(score ~ treat * post + factor(grade) + factor(id),
data = test_scores, x = TRUE)
stargazer(fm_id, type = sg_format, omit = "^factor",
header = FALSE, omit.stat=c("ser", "f"))
Dependent variable: | |
score | |
treat | -2.235 |
(3.570) | |
post | 38.774^{***} |
(0.276) | |
treatTRUE:post | 15.736^{***} |
(0.319) | |
Constant | 495.370^{***} |
(2.527) | |
Observations | 4,000 |
R^{2} | 0.947 |
Adjusted R^{2} | 0.930 |
Note: | ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 |
Note that we specified x = TRUE
so that the \(X\) matrix used in estimation was returned.
The size of this matrix is given by the dim
function:
dim(fm_id$x)
## [1] 4000 1006
That is we have 1006 columns, because we have added so many fixed effects. This means that \((X^{\mathsf{T}}X)^{-1}\) is a 1006 \(\times\) 1006 matrix. As we add more years and students, this matrix could quickly become quite large and inverting it would be computationally expensive (even more so for some other operations that would need even larger matrices). To get a hint as to a less computationally taxing approach, let’s see what happens when we “demean” the variables in a particular way.
demean <- function(x) x - mean(x)
test_scores_demean <-
test_scores %>%
group_by(id) %>%
mutate(score = demean(score)) %>%
group_by(grade) %>%
mutate(score = demean(score))
fm_demean <- lm(score ~ treat * post,
data = test_scores_demean, x = TRUE)
stargazer(fm_demean, type = sg_format,
header = FALSE, omit.stat=c("ser", "f"))
Dependent variable: | |
score | |
treat | -7.868^{***} |
(0.195) | |
post | -7.868^{***} |
(0.195) | |
treatTRUE:post | 15.736^{***} |
(0.276) | |
Constant | 3.934^{***} |
(0.138) | |
Observations | 4,000 |
R^{2} | 0.448 |
Adjusted R^{2} | 0.448 |
Note: | ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 |
The size of the \(X\) matrix is now 4000 \(\times\) 4. This means that \((X^{\mathsf{T}}X)^{-1}\) is now a much more manageable 4 \(\times\) 4 matrix. While we had to demean the data, this is a relatively fast operation.
4.2.1 Exercises
In using
pivot_wider
in Chapter 3, we supplied a value to theid_cols
argument, but we omitted that in creatingtest_summ
. If we wanted to be explicit, what value would we need to provide for that argument in the code creatingtest_summ
?What is the relation between the means in the table produced from
mean_by_grade
and the regression coefficients infm_grade
?Why is there no estimated coefficient for
factor(grade)5
infm_grade
?Now let’s return to our earlier regression specification, except this time we include fixed effects for
grade
(see code and output below).
With this approach (output below), we have two fixed effects omitted: factor(grade)5
(not even shown) and factor(grade)8
, which is shown, but with NA
estimates.
Why are we losing two fixed effects, while above we lost just one?
(Hint: Which variables can be expressed as linear combinations of the grade
indicators?)
summary(fm_dd_fe)
##
## Call:
## lm(formula = score ~ treat * post + factor(grade), data = test_scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.2053 -4.8029 0.1635 4.6886 26.3207
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 496.8328 0.2741 1812.68 <2e-16 ***
## treatTRUE -3.5364 0.3165 -11.17 <2e-16 ***
## postTRUE 38.7741 0.3876 100.03 <2e-16 ***
## factor(grade)6 18.1100 0.3165 57.22 <2e-16 ***
## factor(grade)7 -15.9089 0.3165 -50.27 <2e-16 ***
## factor(grade)8 NA NA NA NA
## treatTRUE:postTRUE 15.7359 0.4476 35.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.077 on 3994 degrees of freedom
## Multiple R-squared: 0.862, Adjusted R-squared: 0.8618
## F-statistic: 4989 on 5 and 3994 DF, p-value: < 2.2e-16
In words, what are we doing to create
test_scores_demean
? Intuitively, why might this affect the need to use fixed effects?What’s going on in the regression stored in
fm_demean
above? Can you relate the coefficients to the numbers in the following table? Which of these estimated coefficients is meaningful? All of them? Some of them? None of them?
test_scores_demean %>%
group_by(grade, treat) %>%
summarize(score = mean(score), .groups = "drop") %>%
kable(digits = 3)
grade | treat | score |
---|---|---|
5 | FALSE | 3.338 |
5 | TRUE | -3.338 |
6 | FALSE | 4.530 |
6 | TRUE | -4.530 |
7 | FALSE | -3.975 |
7 | TRUE | 3.975 |
8 | FALSE | -3.893 |
8 | TRUE | 3.893 |
- The
felm
function from thelfe
package offers a succinct syntax for adding fixed effects and uses computationally efficient algorithms (much like our demeaning approach above) to estimating these. What is the same in the results below and the two specifications we estimated above? What is different? Why might these differences exist? What is theI()
function doing here? What happens if we omit it (i.e., just includepost * treat
)?
fefm <- felm(score ~ I(post * treat) | grade + id, data = test_scores)
stargazer(fefm, type = sg_format,
header = FALSE, omit.stat=c("ser", "f"))
Dependent variable: | |
score | |
I(post * treat) | 15.736^{***} |
(0.319) | |
Observations | 4,000 |
R^{2} | 0.947 |
Adjusted R^{2} | 0.930 |
Note: | ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 |
4.3 Frisch-Waugh-Lovell theorem
The Frisch-Waugh-Lovell theorem states that the following two regressions yield identical regression results in terms of both the estimate \(\hat{\beta_2}\) and residuals.
\[ y = X_1 \beta_1 + X_2 \beta_2 + \epsilon \] and
\[ M_{X_1} y = M_{X_1} X_2 \beta_2 + \eta \] where \(M_{X_1}\) is the “residual maker” for \(X_1\) or \(I - P_{X_1} = I - X_1({X_1^{\mathsf{T}}X_1})^{-1}X_1^{\mathsf{T}}\), and \(y\) is a \(n \times 1\) vector, and \(X_1\) and \(X_2\) are \((n \times k_1)\) and \((n \times k_2)\) matrices.
In other words, we have two procedures that we can use to estimate \(\hat{\beta}_2\) in the regression equation above. First, we could simply regress \(y\) on \(X_1\) and \(X_2\) to obtain estimate \(\hat{\beta}_2\). Second, we could take the following more elaborate approach:
- Regress \(X_2\) on \(X_1\) (and a constant term) and store the residuals (\(\epsilon_{X_1}\)).
- Regress \(y\) on \(X_1\) (and a constant term) and store the residuals (\(\epsilon_{y}\)).
- Regress \(\epsilon_{y}\) on \(\epsilon_{X_1}\) (and a constant term) to obtain estimate \(\hat{\beta}_2\).
The Frisch-Waugh-Lovell theorem tells us that the only portion of \(X_2\) that affects the estimate \(\beta_2\) is the portion that is orthogonal to \(X_1\). Note that the partition of \(X\) into \([X_1 X_2]\) is quite arbitrary, which means that we also get the same estimate \(\hat{\beta_1}\) from the first regression equation above and from estimating
\[ M_{X_2} y = M_{X_2} X_1 \beta_1 + \upsilon \]
To verify the Frisch-Waugh-Lovell theorem using some actual data, we draw on the data set comp
from the farr
package and a regression specification we will see in Chapter 25.
As our baseline, we run the following linear regression and store it in fm
.
fm <- lm(ta ~ big_n + cfo + size + lev + mtb +
factor(fyear) * (inv_at + I(d_sale - d_ar) + ppe),
data = comp, na.action = na.exclude)
Here the dependent variable is ta
(total accruals), big_n
is an indicator variable for having a Big \(N\) auditor and the other variables are various controls (use help(comp)
or ? comp
for descriptions of these variables).
We again use the I()
function we saw above and interact factor(fyear)
with three different variables.
We then run two auxiliary regressions: one of ta
on all regressors except size
(we store this in fm_aux_size
) and one of size
on all regressors except size
(we store this in fm_aux_ta
).
We then take the residuals from each of these regressions and put them in a data frame under the names of the original variables (ta
and size
respectively).
Finally, using the data in aux_data
, we regress ta
on size
.
fm_aux_size <- lm(size ~ big_n + cfo + lev + mtb +
factor(fyear) * (inv_at + I(d_sale - d_ar) + ppe),
data = comp, na.action = na.exclude)
fm_aux_ta <- lm(ta ~ big_n + cfo + lev + mtb +
factor(fyear) * (inv_at + I(d_sale - d_ar) + ppe),
data = comp, na.action = na.exclude)
aux_data <- tibble(size = resid(fm_aux_size),
ta = resid(fm_aux_ta))
fm_aux <- lm(ta ~ size, data = aux_data)
The Frisch-Waugh-Lovell theorem tells us that the regression in fm_aux
will produce exactly the same coefficient on size
and the same residuals (and standard errors) as the regression in fm
.
That this is the case can be seen in the following results.
stargazer(list(fm, fm_aux), type = sg_format,
omit = "(fyear|ppe|inv_at|d_sale)", report = ('vc*s'),
header = FALSE, omit.stat = c("f", "ser", "rsq"))
Dependent variable: | ||
ta | ||
(1) | (2) | |
big_n | 0.022^{*} | |
(0.011) | ||
cfo | 0.141^{***} | |
(0.008) | ||
size | -0.0004 | -0.0004 |
(0.002) | (0.002) | |
lev | -0.066^{***} | |
(0.013) | ||
mtb | -0.0001^{***} | |
(0.00002) | ||
Constant | -0.017 | 0.000 |
(0.028) | (0.004) | |
Observations | 8,850 | 8,850 |
Adjusted R^{2} | 0.872 | -0.0001 |
Note: | ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 |
The Frisch-Waugh-Lovell theorem is an important result for applied researchers to understand, as it provides insights into how multivariate regression works.
A side-benefit of the result is that it allows us to reduce the relation between two variables in a multivariate regression to a bivariate regression without altering that relation.
For example, to understand the relation between size
and ta
embedded in the estimated model in fm
, we can plot the data.
Below we produce two plots. The first plot includes all data, along with a line of best fit and a smoothed curve of best fit. However, the first plot reveals extreme observations of the kind that we will study more closely in Chapter 25 (abnormal accruals more than 500 or less than 500 times lagged total assets!).
So we truncate the value of ta
at \(-1\) and \(+1\) and produce a second plot.
In the second plot, there is no visually discernible relation between size
and ta
and the line of best fit is radically different from the curve.
If nothing else, hopefully this kind of plot raises questions about the merits of blindly accepting regression results with the messy data that we often encounter in practice.
aux_data %>%
filter(!is.na(size), !is.na(ta)) %>%
ggplot(aes(x = size, y = ta)) +
geom_point() +
geom_abline(color = "red") +
geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cs"))
aux_data %>%
filter(!is.na(size), !is.na(ta)) %>%
filter(abs(ta) < 1) %>%
ggplot(aes(x = size, y = ta)) +
geom_point() +
geom_abline(color = "red") +
geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cs"))
4.3.1 Exercises
Verify the Frisch-Waugh-Lovell theorem using
big_n
andcfo
in place ofsize
. For each variable, also produce plots like those above. Why are the plots withbig_n
as the independent variable less helpful?In words, what effect does converting
fyear
into a factor and interacting it withinv_at
,I(d_sale - d_ar)
andppe
have? (Hint: It may be helpful to visual inspect the more complete regression output produced withsg_format <- "text"
and withoutomit = "(fyear|ppe|inv_at|d_sale)"
.)