A Linear algebra
Matrices come to us from linear algebra, a branch of mathematics. A full course in linear algebra would provide much more detail than we can cover here. We mostly focus on some basic facts about matrices, including what they are, some of the operations we can perform on them, and some useful results.
The canonical case in econometrics is the one where we have \(n\) observations on \(k + 1\) variables. Each observation might be a person or a firm, or even a firm at a particular point in time. In earlier chapters of the book, we considered the possibility that the variables for observation \(i\) are related in the following way
\[ y_i = x_{1i} \beta_1 + x_{2i} \beta_2 + \dots + x_{ki} \beta_{k} + \epsilon_i \] For example, \(y_i\) might represent the profitability of a firm in a given year and the various \(x\) variables are factors assumed to affect that profitability, such as capital stock and market concentration. The error term (\(\epsilon_i\)) allows the equation to hold when the variables \(x_{1i}\) through \(x_{ki}\) do not determine the exact value of \(y_i\). Given we have \(n\) observations, we actually have \(n\) equations.
\[ \begin{aligned} y_1 &= x_{11} \beta_1 + x_{21} \beta_2 + \dots + x_{k1} \beta_{k} + \epsilon_1 \\ y_2 &= x_{12} \beta_1 + x_{22} \beta_2 + \dots + x_{k2} \beta_{k} + \epsilon_2 \\ \vdots & \qquad \vdots \qquad \qquad \vdots \qquad \qquad \qquad \vdots \\ y_n &= x_{1n} \beta_1 + x_{2n} \beta_2 + \dots + x_{kn} \beta_{k} + \epsilon_n \end{aligned} \] As we shall see, matrices allow us to write this system of equations succinctly and to represent manipulations of it concisely.
A.1 Vectors
For an observation, we might have data on sales, profit, R&D spending, and fixed assets. We can arrange these data as a vector: \(y = ( \textrm{sales}, \textrm{profit}, \textrm{R\&D}, \textrm{fixed assets})\). This \(y\) is an \(n\)-tuple (here \(n = 4\)), which is a finite ordered list of elements. A more generic representation of a vector \(y\) would be \(y = (y_1, y_2, \dots, y_n)\).
A.1.1 Operations on vectors
Suppose we have two vectors \(x = (x_1, x_2, \dots, x_n)\) and \(y = (y_1, y_2, \dots, y_n)\).
Vectors of equal length can be added (\(x + y = (x_1 + y_1, x_2 + y_2, \dots, x_n + y_n)\)) and subtracted (\(x - y = (x_1 - y_1, x_2 - y_2, \dots, x_n - y_n)\)). Vectors can also be multiplied by a real number \(\lambda\): \(\lambda y = (\lambda y_1, \lambda y_2, \dots, \lambda y_n)\).
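As a quick numerical illustration of these element-wise operations (here a sketch in Python with NumPy, an arbitrary choice of tooling):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

x + y      # element-wise sum: array([5., 7., 9.])
x - y      # element-wise difference: array([-3., -3., -3.])
2.5 * y    # each element scaled by 2.5: array([10. , 12.5, 15. ])
```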
Definition A.1 The dot product of two \(n\)-vectors \(x\) and \(y\) is denoted \(x \cdot y\) and is defined as \[ x \cdot y = x_1 y_1 + x_2 y_2 + \dots + x_n y_n = \sum_{i=1}^n x_i y_i. \]
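For example, the dot product of the vectors used above can be computed either directly from the definition or with NumPy's built-in routine (again, a Python/NumPy sketch):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Sum of element-wise products, following the definition
manual = np.sum(x * y)    # 1*4 + 2*5 + 3*6 = 32.0
builtin = np.dot(x, y)    # 32.0
assert manual == builtin
```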
A.2 Matrices
A matrix is a rectangular array of real numbers.
\[ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mk} \end{bmatrix} \]
Matrices are typically denoted with capital letters (e.g., \(A\)) and the generic element of a matrix is denoted as \(a_{ij}\). We can also express a matrix in terms of its generic element and its dimensions as \(\left[ a_{ij} \right]_{m \times k}\).
Two important matrices are the null matrix, \(0\), which contains only zeros, and the identity matrix of size \(n\), \(I\), which has diagonal elements equal to one (\(i_{kk} = 1\)) and all other elements equal to zero (\(i_{jk} = 0\) for all \(j \neq k\)).
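In NumPy, for example, these two matrices can be constructed directly, and a small check confirms their defining properties (an illustrative sketch only):

```python
import numpy as np

null_3 = np.zeros((3, 3))   # 3 x 3 null matrix
eye_3 = np.eye(3)           # 3 x 3 identity matrix

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

# Adding the null matrix or multiplying by the identity leaves A unchanged
assert np.allclose(A + null_3, A)
assert np.allclose(A @ eye_3, A)
assert np.allclose(eye_3 @ A, A)
```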
Each row or column of a matrix can be considered as a vector, so that a matrix can be viewed as \(m\) \(k\)-vectors (the rows) or \(k\) \(m\)-vectors (the columns).
A.2.1 Operations on matrices
Suppose we have two matrices \(A = \left[ a_{ij} \right]_{m \times k}\) and \(B = \left[ b_{ij} \right]_{m \times k}\), then we can add these matrices
\[ A + B = \left[ a_{ij} + b_{ij} \right]_{m \times k} \] We can multiply a matrix by a real number \(\lambda\)
\[ \lambda A = \left[ \lambda a_{ij} \right]_{m \times k} \]
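A short NumPy sketch of matrix addition and scalar multiplication:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

A + B      # element-wise sum: [[ 6.,  8.], [10., 12.]]
3.0 * A    # every element multiplied by 3: [[ 3.,  6.], [ 9., 12.]]
```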
Matrix multiplication is defined for two matrices if the number of columns for the first is equal to the number of rows of the second.
Suppose we have two matrices \(A = \left[ a_{ij} \right]_{m \times l}\) and \(B = \left[ b_{jk} \right]_{l \times n}\), then the matrix \(AB\) will have order \(m \times n\) and typical element \(c_{ik}\) defined as
\[ AB = \left[ {c}_{ik} := \sum_{j=1}^l a_{ij} b_{jk} \right]_{m \times n}\] Alternatively \(c_{ik} = a_i \cdot b_k\), where \(a_i\) is the \(i\)-th row of \(A\) and \(b_k\) is the \(k\)-th column of \(B\).
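To illustrate, the following sketch multiplies a \(2 \times 3\) matrix by a \(3 \times 2\) matrix and checks that a typical element of the product equals the dot product of the corresponding row and column:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])     # 2 x 3
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # 3 x 2

C = A @ B                           # 2 x 2 product

# c_ik equals the dot product of row i of A and column k of B
assert np.isclose(C[0, 1], np.dot(A[0, :], B[:, 1]))
```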
A.2.2 Matrix transposes
Definition A.2 The matrix \(B = \left[ b_{ij} \right]_{n \times m}\) is called the transpose of a matrix \(A = \left[ a_{ij} \right]_{m \times n}\) (and denoted \(A^{\mathsf{T}}\)) if \(b_{ij} = a_{ji}\) for all \(i \in \{1, 2, \dots, n\}\) and all \(j \in \{1, 2, \dots, m\}\).
Two results regarding transposes and inverses will prove useful below. First, for conformable matrices \(A\) and \(B\), we have \((AB)^{\mathsf{T}} = B^{\mathsf{T}} A^{\mathsf{T}}\). Second, for a square, invertible matrix \(A\), we have \(\left(A^{\mathsf{T}}\right)^{-1} = \left(A^{-1}\right)^{\mathsf{T}}\), as the following derivation shows:
\[ \begin{aligned} A A^{-1} &= I \\ \left(A^{-1}\right)^{\mathsf{T}} A^{\mathsf{T}} &= I \\ \left(A^{-1}\right)^{\mathsf{T}} A^{\mathsf{T}} \left(A^{\mathsf{T}}\right)^{-1}&= \left(A^{\mathsf{T}}\right)^{-1} \\ \left(A^{-1}\right)^{\mathsf{T}} &= \left(A^{\mathsf{T}}\right)^{-1} \end{aligned} \]
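Both results are easy to verify numerically; the following sketch uses randomly generated matrices (which are invertible with probability one):

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

# (AB)' = B'A'
assert np.allclose((A @ B).T, B.T @ A.T)

# (A')^{-1} = (A^{-1})'
assert np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T)
```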
A.2.3 Matrix inverses
For a square matrix \(A\), the inverse of \(A\), denoted \(A^{-1}\), is a matrix satisfying \(A A^{-1} = A^{-1} A = I\); when such a matrix exists, \(A\) is said to be invertible. A related notion used below is idempotency.
Definition A.3 A matrix \(A\) is idempotent if it has the property that \(A A = A\).
A.2.4 The projection matrix
Definition A.4 Given a matrix \(X\), the projection matrix for \(X\), denoted \(P_X\), is defined as
\[ P_X = X(X^{\mathsf{T}} X)^{-1}X^{\mathsf{T}} \]
The following shows that \(P_X\) is an idempotent matrix:
\[ P_X P_X = X(X^{\mathsf{T}} X)^{-1}X^{\mathsf{T}} X(X^{\mathsf{T}} X)^{-1}X^{\mathsf{T}} = X(X^{\mathsf{T}} X)^{-1}X^{\mathsf{T}} = P_X \]
Note also that \(P_X\) is symmetric, which means that it and its transpose are equal, as the following demonstrates:
\[ \begin{aligned} P_X^{\mathsf{T}} &= \left(X(X^{\mathsf{T}} X)^{-1}X^{\mathsf{T}}\right)^{\mathsf{T}} \\ &= \left(X^{\mathsf{T}}\right)^{\mathsf{T}} \left((X^{\mathsf{T}} X)^{-1}\right)^{\mathsf{T}} X^{\mathsf{T}} \\ &= X \left((X^{\mathsf{T}} X)^{-1}\right)^{\mathsf{T}} X^{\mathsf{T}} \\ &= X \left((X^{\mathsf{T}} X)^{\mathsf{T}}\right)^{-1} X^{\mathsf{T}} \\ &= X \left(X^{\mathsf{T}} X\right)^{-1} X^{\mathsf{T}} \\ &= P_X \end{aligned} \] The two results from above are used in this derivation. First, for conformable matrices \(A\) and \(B\), \((AB)^{\mathsf{T}} = B^{\mathsf{T}} A^{\mathsf{T}}\) (applied here to a product of three matrices). Second, for a square, invertible matrix \(A\), \(\left(A^{\mathsf{T}}\right)^{-1} = \left(A^{-1}\right)^{\mathsf{T}}\).
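A numerical check of both properties (idempotency and symmetry) for a randomly generated \(X\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))              # 10 observations, 3 columns

P_X = X @ np.linalg.inv(X.T @ X) @ X.T    # projection matrix for X

assert np.allclose(P_X @ P_X, P_X)        # idempotent: P_X P_X = P_X
assert np.allclose(P_X, P_X.T)            # symmetric: P_X' = P_X
```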
A.3 The OLS estimator
The classical linear regression model assumes that the data-generating process is \(y = X \beta + \epsilon\) with \(\epsilon \sim IID(0, \sigma^2 I)\), where \(y\) and \(\epsilon\) are \(n\)-vectors, \(X\) is an \(n \times k\) matrix, \(\beta\) is a \(k\)-vector, and \(I\) is the \(n \times n\) identity matrix.
The ordinary least-squares (OLS) estimator is given by
\[ \hat{\beta} = \left(X^{\mathsf{T}} X\right)^{-1} X^{\mathsf{T}} y \] If we assume that \(\mathbb{E}[\epsilon | X] = 0\), then we can derive the following result.
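The estimator can be computed directly from this formula. As a sanity check, the following sketch (with simulated data and an assumed true \(\beta\)) confirms that the formula matches NumPy's least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 3
X = rng.normal(size=(n, k))
beta = np.array([1.0, -2.0, 0.5])              # assumed true coefficients
y = X @ beta + rng.normal(scale=0.1, size=n)

# OLS via the formula above
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Same answer from a library least-squares solver
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_ls)
```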
\[ \begin{aligned} \mathbb{E}\left[\hat{\beta} \right] &= \mathbb{E}\left[\mathbb{E}\left[\hat{\beta} | X \right] \right] \\ \mathbb{E}\left[\hat{\beta} | X \right] &= \mathbb{E}\left[\left(X^{\mathsf{T}} X\right)^{-1} X^{\mathsf{T}} y | X \right] \\ &= \mathbb{E}\left[\left(X^{\mathsf{T}} X\right)^{-1} X^{\mathsf{T}} (X\beta + \epsilon) | X \right] \\ &= \mathbb{E}\left[\left(X^{\mathsf{T}} X\right)^{-1} X^{\mathsf{T}} X\beta | X \right] + \mathbb{E}\left[\left(X^{\mathsf{T}} X\right)^{-1} X^{\mathsf{T}} \epsilon | X \right] \\ &= \beta + \left(X^{\mathsf{T}} X\right)^{-1} X^{\mathsf{T}} \mathbb{E}\left[ \epsilon | X \right] \\ &= \beta \end{aligned} \] Combining the last line with the first (the law of iterated expectations), we have \(\mathbb{E}[\hat{\beta}] = \mathbb{E}[\beta] = \beta\), which demonstrates that \(\hat{\beta}\) is unbiased given these assumptions. But note that the assumption that \(\mathbb{E}[\epsilon | X] = 0\) can be a strong one in some situations. For example, Davidson and MacKinnon point out that “in the context of time-series data, [this] assumption is a very strong one that we may often not feel comfortable making.” As such, many textbook treatments replace \(\mathbb{E}[\epsilon | X] = 0\) with weaker assumptions and focus on the asymptotic property of consistency instead of unbiasedness.
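Unbiasedness is a statement about the average of \(\hat{\beta}\) across repeated samples, which a small simulation can illustrate (a sketch with made-up parameter values; \(X\) is held fixed and \(\epsilon\) is drawn independently of \(X\), so \(\mathbb{E}[\epsilon | X] = 0\) holds by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, reps = 200, 2, 2000
beta = np.array([0.5, -1.0])                # assumed true coefficients
X = rng.normal(size=(n, k))                 # held fixed across replications

estimates = np.empty((reps, k))
for r in range(reps):
    eps = rng.normal(size=n)                # drawn independently of X
    y = X @ beta + eps
    estimates[r] = np.linalg.inv(X.T @ X) @ X.T @ y

print(estimates.mean(axis=0))               # close to (0.5, -1.0)
```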