8 Financial statements: A second look
In this chapter, we will dive into financial statements a bit more deeply than we did in Chapter 7. The focus of this chapter is an exploration of core attributes of financial accounting data (e.g., that balance sheets balance). After exploring how financial statement data are represented in Compustat, we dive into our first detailed exploration of an accounting research paper, Koh and Reeb (2015), which explores issues regarding how R&D spending is shown in Compustat.
This chapters is optional in the sense that subsequent chapters do not depend on an understanding of the content of this chapter. However, the material of this chapter provides a good opportunity for readers to better understand the messiness of representing seemingly tidy relationships in accounting data in a database. Some of the exercises help readers to understand better where data in Compustat come from. Working through this chapter also serves to dispel any notion that commercial databases are error-free, as we find discrepancies that are apparent even from just the information in Compustat.
8.1 Core attributes of financial statements
Three core attributes of financial statements that should hold are the following:
- Balance-sheet balance. Balance sheets should balance (i.e., for total assets should equal the sum of total liabilities and shareholders’ equity). This is a sine qua non of double-entry bookkeeping.
- Within-statement articulation. The various sub-totals within a financial statement make sense with regard to the items they comprise. For example, total assets should equal the sum of its component assets.
- Across-statement articulation. Financial statements should articulate. For example, the amounts reported for cash on the beginning and ending balance sheets should be explained by the statement of cash flows.
Below we explore how these attributes are reflected in Compustat.
As in Chapter 7, we will focus on two tables from Compustat’s North American database: comp.funda
and comp.company
and we will construct a table funda_mod
that contains the “standard” set of observations for Compustat.
We begin by setting up objects for the remote data frames we used in Chapter 7.
8.2 Balance sheets
An essential feature of statements of financial position is that they balance, hence the more traditional term “balance sheets”.
Let’s check that this holds on Compustat.
Total assets is at
and the sum of liabilities and shareholders’ equity is lse
.
## # Source: SQL [0 x 4]
## # Database: postgres [iangow@/tmp:5432/iangow]
## # … with 4 variables: gvkey <chr>, datadate <date>, at <dbl>, lse <dbl>
So everything balances at this level.
Note that if either at
or lse
is NA
, then the filter will not apply.
For completeness, let’s look into missingness of at
and lse
:
funda_mod %>%
mutate(missing_at = is.na(at), missing_lse = is.na(lse)) %>%
count(missing_at, missing_lse) %>%
ungroup() %>%
collect()
## # A tibble: 2 × 3
## missing_at missing_lse n
## <lgl> <lgl> <int>
## 1 FALSE FALSE 471127
## 2 TRUE TRUE 83492
In short, if at
is missing, so is lse
; if at
is present, so is lse
.
8.3 Within-statement articulation
Having established that balance sheets always balance on Compustat, the next question we study is whether the various balancing models for funda
hold.
Compustat provides balancing models that explain how various items on financial statements in its database relate to each other.
WRDS provides these balancing models as Excel files here (requires WRDS access).
One example of a balancing model relates to the decomposition of total assets (at
) as follows:
\[ \textit{ACT} + \textit{PPENT} + \textit{IVAEQ} + \textit{IVAO} + \textit{INTAN} + \textit{AO} = \textit{AT} \] where the description of each item in the equation is as follows:
Item | Description |
---|---|
ACT | Current Assets - Total |
PPENT | Property Plant and Equipment - Total (Net) |
IVAEQ | Investment and Advances - Equity |
IVAO | Investment and Advances - Other |
INTAN | Intangible Assets - Total |
AO | Assets - Other - Total |
DC | Deferred Charges (component of AO) |
AOX | Assets - Other (Sundry) (component of AO) |
The first issue we need to think about is the presence of NA
values in the components of at
even when at
itself is not NA
.
Let’s look at data for one company where this creates issues.
na_sample_firm_years <-
funda_mod %>%
filter(gvkey == "008902", datadate >= "2000-01-01") %>%
select(gvkey, datadate)
funda_mod %>%
semi_join(na_sample_firm_years, by = c("gvkey", "datadate")) %>%
mutate(at_calc = act + ppent + ivaeq + ivao + intan + ao) %>%
select(gvkey, datadate, at, at_calc, act, ppent, ivaeq, ivao, intan, ao) %>%
collect()
## # A tibble: 23 × 10
## gvkey datadate at at_calc act ppent ivaeq ivao intan ao
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 008902 2000-05-31 2099. NA 785. 366. NA 0 916. 32.1
## 2 008902 2001-05-31 2078. NA 819. 362. NA 0 872. 25.4
## 3 008902 2002-05-31 2036. NA 801. 356. NA 0 857. 22.4
## 4 008902 2003-05-31 2247. NA 928. 371. NA 0 914. 34.1
## 5 008902 2004-05-31 2353. NA 995. 381. NA 0 931. 46.8
## 6 008902 2005-05-31 2656. NA 1271. 390. NA 0 939. 55.8
## 7 008902 2006-05-31 2980. NA 1369. 445. NA 0 1073. 93.7
## 8 008902 2007-05-31 3333. NA 1570. 473. NA 0 1182. 108.
## 9 008902 2008-05-31 3764. NA 1784. 498. NA NA 1293. 189.
## 10 008902 2009-05-31 3410. NA 1553. 470. NA NA 1214. 173.
## # … with 13 more rows
Because ivaeq
is NA
for these years, the sum of the components of at
is also NA
, even though the value of at
is not NA
.
What’s going on?
The answer is that ivaeq
is NA
because this firm doesn’t report amounts for “Investment and Advances - Equity”, because they’re either zero or not material.
Thus it’s reasonable to assume that these should zero.
To convert NA
amounts to zero, we can use the coalesce
function, which takes two arguments and returns the first argument if it is not NA
and returns the second argument otherwise.64
So, coalesce(ivaeq, 0)
equals ivaeq
when ivaeq
is not NA
, and 0
otherwise.
Let’s see if applying the coalesce
function to treat NA
values as zeros helps:
tol <- 1e-3
funda_na_fixed <-
funda_mod %>%
select(gvkey, datadate, act, ppent, ivaeq, ivao, intan, ao, at,
dc, aox) %>%
mutate(at_calc = coalesce(act, 0) + coalesce(ppent, 0) + coalesce(ivaeq, 0) +
coalesce(ivao, 0) + coalesce(intan, 0) + coalesce(ao, 0),
at_diff = at_calc - at,
balance = abs(at_diff) < tol)
funda_na_fixed %>%
semi_join(na_sample_firm_years,
by = c("gvkey", "datadate")) %>%
select(gvkey, datadate, at, at_calc, balance) %>%
collect()
## # A tibble: 23 × 5
## gvkey datadate at at_calc balance
## <chr> <date> <dbl> <dbl> <lgl>
## 1 008902 2000-05-31 2099. 2099. TRUE
## 2 008902 2001-05-31 2078. 2078. TRUE
## 3 008902 2002-05-31 2036. 2036. TRUE
## 4 008902 2003-05-31 2247. 2247. TRUE
## 5 008902 2004-05-31 2353. 2353. TRUE
## 6 008902 2005-05-31 2656. 2656. TRUE
## 7 008902 2006-05-31 2980. 2980. TRUE
## 8 008902 2007-05-31 3333. 3333. TRUE
## 9 008902 2008-05-31 3764. 3764. TRUE
## 10 008902 2009-05-31 3410. 3410. TRUE
## # … with 13 more rows
Yes, we see that, in this case, this approach allows us to tie the value for at
with its components (act
, ppent
, ivaeq
, ivao
, intan
, and ao
).
Note that we specified a “tolerance” (tol
) equal to 0.001 because testing for equality of floating-point calculations is not always exact.
But, does this approach resolve all issues with articulation of at
with its components?
## # A tibble: 3 × 2
## balance n
## <lgl> <int>
## 1 FALSE 77644
## 2 TRUE 393483
## 3 NA 83492
Alas the answer is “no”. We use the exercises to look more closely at a few of the problem cases.
8.3.1 Exercises
- What is the value in
funda_na_fixed
ofat
whenbalance
isNA
? Is this surprising? - Write code to calculate
decade
, the decade in whichdatadate
for each observation falls. (Hint: The functionsfloor
andyear
and the number10
may be helpful.) - Are there any cases in
funda_na_fixed
whereat_calc
is greater than zero andat
isNA
? Which decades are these mostly found in? If you were doing research with these data, how might you handle these cases? - Consider the firm with
gvkey
equal to016476
. Write code to obtain the company name and CIK from thecompany
table defined above. Using that CIK, find the 10-K filing for the year ending2001-12-31
on the SEC website. (Note: The 10-K will be released some time after2001-12-31
.) - Looking at the 10-K you just found, what seems to going on with this observation?
funda_na_fixed %>%
filter(gvkey=="016476", datadate=="2003-11-30") %>%
select(datadate, at, at_diff, act, ppent, intan, dc, aox) %>%
collect()
## # A tibble: 1 × 8
## datadate at at_diff act ppent intan dc aox
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2003-11-30 2984. 54.8 1675. 487. 245. 54.8 577.
- Using the approach above, we can find the relevant 10-K for the following observation here.
What’s going on with this case?
What’s the most significant difference between this case and the one above?
(Hint: The following additional Compustat balancing model may help:
act = ppent + intan + ivao + ao
.)
funda_na_fixed %>%
filter(gvkey=="145003", datadate=="2001-12-31") %>%
select(datadate, at, at_calc, at_diff, act, ppent, intan, ivao, ao) %>%
collect()
## # A tibble: 1 × 9
## datadate at at_calc at_diff act ppent intan ivao ao
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2001-12-31 6.64 7.25 0.616 4.45 1.06 0.761 0 0.978
8.4 Across-statement articulation
Under IAS 1 Presentation of Financial Statements, a complete set of financial statements will include the following four statements:
- A statement of financial position (i.e., a balance sheet)
- A statement of profit or loss and other comprehensive income (i.e., an income statement)
- A statement of changes in equity
- A statement of cash flows
Of these four statements, the balance sheet represents a statement of stocks, or balances at particular points of time, while the other three represent statements of flows, or changes in balances over time. Of the three flow statements, only the last two explicitly reconcile beginning and ending balance sheet items.65 The statement of changes in equity “provides a reconciliation of the opening and closing amounts of each component of equity for the period” and the statement of cash flows performs an analogous function with respect to the opening and closing amounts of cash and cash equivalents.
Compustat’s balancing models only address the balance sheet, income statement, and statement of cash flows. There is no balancing model for the statement of changes in equity on Compustat. So we only really have the statement of cash flows available to test the across-statement articulation within Compustat and therefore focus on that statement in the discussion below.
Before moving on to the cash flow statement, we examine the balancing model for cash on the balance sheet:
\[ \textit{CH} + \textit{IVST} = \textit{CHE} \] where the description of each item in the equation is as follows
Item | Description |
---|---|
CH | Cash |
IVST | Short-Term Investments |
CHE | Cash and Short-Term Investments- Total |
A wrinkle here is that there is small number of cases where we are missing the total (che
), but we have values for one of its components (ivst
or ch
).
funda_mod %>%
select(gvkey, datadate, che, ch, ivst) %>%
count(missing_che = is.na(che),
missing_ch = is.na(ch),
missing_ivst = is.na(ivst)) %>%
collect()
## # A tibble: 7 × 4
## missing_che missing_ch missing_ivst n
## <lgl> <lgl> <lgl> <int>
## 1 FALSE FALSE FALSE 402548
## 2 FALSE FALSE TRUE 1210
## 3 FALSE TRUE FALSE 4
## 4 FALSE TRUE TRUE 62164
## 5 TRUE FALSE TRUE 75
## 6 TRUE TRUE FALSE 59
## 7 TRUE TRUE TRUE 88559
While more digging might be appropriate to work out how best to handle these cases, below we take the simple expedient of using ch
or ivst
as the value for che
when che
is missing, but ch
is not.
The following suggests that this modified balancing model holds in almost every case where at least one of components is not missing (though there is a non-trivial number of cases where che
, ch
and ivst
are all missing).
funda_mod %>%
select(gvkey, datadate, che, ch, ivst) %>%
mutate(missing_che = is.na(che),
missing_che_comps = is.na(ch) & is.na(ivst),
che = case_when(!is.na(che) ~ che,
!is.na(ivst) ~ ivst,
!is.na(ch) ~ ch,
FALSE ~ NA),
che_calc = coalesce(ch, 0) + coalesce(ivst, 0),
che_diff = che_calc - che,
che_balance = abs(che_diff) < tol) %>%
count(che_balance, missing_che, missing_che_comps) %>%
collect() %>%
arrange(desc(n))
## # A tibble: 6 × 4
## che_balance missing_che missing_che_comps n
## <lgl> <lgl> <lgl> <int>
## 1 TRUE FALSE FALSE 403761
## 2 NA TRUE TRUE 88559
## 3 FALSE FALSE TRUE 61907
## 4 TRUE FALSE TRUE 257
## 5 TRUE TRUE FALSE 134
## 6 FALSE FALSE FALSE 1
Having explored the Compustat balancing model for cash on the balance sheet, we now consider the Compustat balancing model for the annual cash flow statement, which is expressed as follows:
\[ \textit{OANCF} + \textit{IVNCF} + \textit{FINCF} + \textit{EXRE} = \textit{CHECH} \] where the description of each item in the equation is as follows
Item | Description |
---|---|
OANCF | Operating Activities - Net Cash Flow |
IVNCF | Investing Activities - Net Cash Flow |
FINCF | Financing Activities - Net Cash Flow |
EXRE | Exchange Rate Effect |
CHECH | Cash and Cash Equivalents - Increase (Decrease) |
Let’s collect some data to check whether the cash flow statement balancing model holds.
The first step is to calculate its left-hand side, chech_calc
, converting missing values for each of the four elements to zero. (Here we use the across
function, which is explained here.)
We then compare chech_calc
and chech
and, if these two value are essentially equal, we set chech_balance
to TRUE
.
funda_mod %>%
select(gvkey, datadate, oancf, ivncf, fincf, exre, chech) %>%
mutate(across(oancf:exre, ~ coalesce(., 0)),
chech_calc = oancf + ivncf + fincf + exre,
chech_diff = chech_calc - chech,
chech_balance = abs(chech_diff) < tol) %>%
filter(!is.na(chech), chech_calc != 0) %>%
count(chech_balance) %>%
collect()
## # A tibble: 2 × 2
## chech_balance n
## <lgl> <int>
## 1 FALSE 484
## 2 TRUE 295671
So there are relatively few cases where the first equation doesn’t hold.
Note that chech
is likely collected from the statement of cash flows itself, so this is really a within-statement relationship.
This leaves the final question: Do the amounts provided in Compustat items related to the balance sheet tie to amounts provided in items related to the statement of cash flows?
Here things get a little more complicated because we need to compare changes in cash implied by two balance sheets with chech
.
In practice, it turns out that, while in many cases ch
(“cash”) equals che
(“cash and cash equivalents”), in cases where ch
does not equal che
, there are firms whose statements of cash flows reconcile with changes in ch
and firms whose statements of cash flows reconcile with changes in ch
.
tol <- 0.1
funda_cf_balance <-
funda_mod %>%
select(gvkey, datadate, oancf, ivncf, fincf, exre, chech, che, ch, ivst) %>%
mutate(across(oancf:exre, ~ coalesce(., 0)),
chech_calc = oancf + ivncf + fincf + exre,
chech_balance = abs(chech_calc - chech) < tol) %>%
filter(chech_calc != 0, chech_balance) %>%
group_by(gvkey) %>%
window_order(datadate) %>%
mutate(lag_datadate = lag(datadate),
d_che = che - lag(che),
d_ch = ch - lag(ch)) %>%
ungroup() %>%
filter(!is.na(d_che) | !is.na(d_ch)) %>%
mutate(artic_desc =
case_when(abs(d_ch - d_che) < tol & abs(chech - d_che) < tol
~ "Articulates (CHE == CH)",
abs(chech - d_che) < tol ~ "Articulates using CHE",
abs(chech - d_ch) < tol ~ "Articulates using CH",
abs(chech - d_che) >= tol ~ "Does not articulate",
abs(chech - d_ch) >= tol ~ "Does not articulate",
TRUE ~ "Other"))
funda_cf_balance %>%
count(artic_desc) %>%
collect()
## # A tibble: 4 × 2
## artic_desc n
## <chr> <int>
## 1 Does not articulate 39284
## 2 Articulates using CHE 15003
## 3 Articulates using CH 78201
## 4 Articulates (CHE == CH) 134687
Here we see that there is a non-trivial number of cases where the cash flow statement appears not to articulate with changes in cash balances on the balance sheet. We explore some cases of this failure to articulate in the exercises.
8.4.1 Exercises
In checking cash flow statement articulation, we used the
across
function (see details here). Rewrite the code to createfunda_na_fixed
to use a similar approach. Check that you get the same results with respect tocount(balance)
as you get using the original code.Consider the case of American Airlines (GVKEY:
001045
) for the year ended 31 December 2020:
funda_cf_balance %>%
filter(datadate == "2020-12-31", gvkey == "001045") %>%
select(gvkey, datadate, chech, d_che, d_ch)
## # Source: SQL [1 x 5]
## # Database: postgres [iangow@/tmp:5432/iangow]
## # Ordered by: datadate
## gvkey datadate chech d_che d_ch
## <chr> <date> <dbl> <dbl> <dbl>
## 1 001045 2020-12-31 109 3489 -35
Look up the relevant American Airlines 10-K on sec.gov
.
(Hint: You can get the CIK from comp.company
.)
What explains the gaps above?
8.5 Missing R & D
In this section, we discuss Koh and Reeb (2015), who “investigate whether missing R&D expenditures in financial statements indicates a lack of innovation activity. Patent records reveal that 10.5% of missing R&D firms file and receive patents, which is 14 times greater than zero R&D firms.”
8.5.1 Discussion questions
- What is the following code doing?
Koh and Reeb (2015) use a similar
filter
. How do they explain their rationale for this?
included_firms <-
company %>%
mutate(sic = as.integer(sic)) %>%
filter(!between(sic, 4900, 4999), !between(sic, 6000, 6999)) %>%
select(gvkey)
rd_data <-
funda_mod %>%
semi_join(included_firms, by = "gvkey") %>%
mutate(missing_rd = is.na(xrd),
zero_rd = xrd == 0,
year = year(datadate)) %>%
select(gvkey, datadate, year, xrd, missing_rd, zero_rd, at) %>%
collect()
Is there a balancing model that includes R&D spending?
Koh and Reeb (2015) find that many “studies in The Accounting Review use R&D in their analysis and code the missing values as zero, implicitly assuming that blank R&D is equal to zero R&D, [while many] articles in the Journal of Finance [code] … the blank values as zero and including a dummy variable to indicate blank R&D firms. … In contrast, 42% of the studies in the Strategic Management Journal use R&D and they take a very different approach, often replacing the missing R&D values with either the industry average R&D, or a historical value from prior years.” Why do scholars from different fields make such different choices? In light of Koh and Reeb (2015), do you think that one approach is more correct than the others?
Focusing on years 1980-2019, what have been the trends in the proportion of firms not reporting R&D and the “average” amount of R&D (scaled by assets)?66 (In calculating the “average” does it make sense to use the
mean
ormedian
?) Speculate as to what might explain these trends.Consider the 10-K filing made by IHS Markit Ltd on 2021-01-22. Based on the information in the filing, do you think that IHS Markit engages in research and development activity? Does IHS Markit generate patents? Are there other forms of intellectual property protection that IHS Markit relies on? Does IHS Markit report an amount for research and development expenditure? Who is the IHS Markit’s external auditor? Do you think that IHS Markit is in violation of GAAP? Or is its reporting choice with respect to R&D a within-GAAP use of reporting discretion?
Koh and Reeb (2015) state that “our first set of tests compare patent activity between non-reporting R&D firms and firms that report zero R&D. … We use both full sample and propensity score matched samples. … While full sample tests allow for greater external validity, the matched sample tests potentially improve the local treatment effect. … Based on the propensity score matched sample, our multivariate tests indicate that, on average, non-reporting R&D firms file about 14 times more patent applications than the matched zero R&D firms.” In speaking of “treatment effects” Koh and Reeb (2015) implicitly view disclosure of R&D as a treatment variable and patent applications as an outcome. Does this make sense to you? Do you think that this is really what Koh and Reeb (2015) want to do? If not, what is the inference they are trying to draw?
What exactly is the take-away from the analysis in section 4.4 which uses “the rapid demise of Arthur Andersen (AA) as a quasi-natural experiment”? For example, what is the treatment? What assumptions are needed to generalize from the specific treatment to a treatment of wider applicability? What additional analyses can you suggest that might provide additional assurance that the results reflect an causal effect of interest?
What are the suggested implications of the simulation analysis of section 4.5? What are the inherent limitations in a simulation analysis like this one?