8  Financial statements: A second look

In this chapter, we will dive a bit more deeply into financial statements than we did in Chapter 6. The focus of this chapter is an exploration of core attributes of financial accounting data (e.g., that balance sheets balance). After exploring how financial statement data are represented in Compustat, we dive into our first detailed exploration of an accounting research paper, Koh and Reeb (2015), which explores issues regarding how R&D spending is reported by firms and coded in Compustat.

This chapters is optional in the sense that subsequent chapters do not depend on an understanding of the content of this chapter. However, the material of this chapter provides a good opportunity for readers to better understand the messiness of representing seemingly tidy relationships in accounting data in a database. Some of the exercises help readers to understand better where data in Compustat come from. Working through this chapter also serves to dispel any notion that commercial databases are error-free, as we find discrepancies that are apparent even from just the information in Compustat.


In this chapter, we will use the R libraries below. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available on GitHub.

8.1 Core attributes of financial statements

Three core attributes of financial statements that should hold are the following:

  1. Balance-sheet balance. Balance sheets should balance (i.e., for total assets should equal the sum of total liabilities and shareholders’ equity). This is a sine qua non of double-entry bookkeeping.
  2. Within-statement articulation. The various sub-totals within a financial statement make sense with regard to the items they comprise. For example, the amount in total assets should equal the sum of the component assets.
  3. Across-statement articulation. Financial statements should articulate. For example, the amounts reported for cash on the beginning and ending balance sheets should be explained by the statement of cash flows.

Below we explore how these attributes are reflected in Compustat. As in Chapter 6, we will focus on two tables from Compustat’s North American database: comp.funda and comp.company and we will construct a table funda_mod that contains the “standard” set of observations for Compustat.

We begin by setting up objects for the remote data frames we used in Chapter 6.

db <- dbConnect(RPostgres::Postgres(), bigint = "integer")

funda <- tbl(db, Id(schema = "comp", table = "funda"))
company <- tbl(db, Id(schema = "comp", table = "company"))
db <- dbConnect(duckdb::duckdb())

funda <- load_parquet(db, schema = "comp", table = "funda")
company <- load_parquet(db, schema = "comp", table = "company")
funda_mod <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D")

8.2 Balance sheets

An essential feature of statements of financial position is that they balance, hence the more traditional term “balance sheets”. Let’s check that this holds on Compustat. Total assets is at and the sum of liabilities and shareholders’ equity is lse.

funda_mod |>
  filter(at != lse) |>
  select(gvkey, datadate, at, lse) |>
  collect() |>
[1] 0

So everything balances at this level. Note that if either at or lse is NA, then the filter will not apply. For completeness, let’s look into missingness of at and lse:

funda_mod |>
  mutate(missing_at = is.na(at), missing_lse = is.na(lse)) |>
  count(missing_at, missing_lse) |>
  ungroup() |>
Table 8.1: Missingness in at and lse.
missing_at missing_lse n
TRUE TRUE 88,439

In short, if at is missing, so is lse; if at is present, so is lse.

8.3 Within-statement articulation

Having established that balance sheets always balance on Compustat, the next question we study is whether the various balancing models for funda hold. Compustat provides balancing models that explain how various items on financial statements in its database relate to each other. WRDS provides these balancing models as Excel files on its website (requires WRDS access).

One example of a balancing model relates to the decomposition of total assets: at = act + ppent + ivaeq + ivao + intan + ao, where the description of each item in the equation is provided in Table 8.2.

Table 8.2: Components of total assets
Item Description
act Current Assets - Total
ppent Property Plant and Equipment - Total (Net)
ivaeq Investment and Advances - Equity
ivao Investment and Advances - Other
intan Intangible Assets - Total
ao Assets - Other - Total
dc Deferred Charges (component of AO)
aox Assets - Other (Sundry) (component of AO)

The first issue we need to think about is the presence of NA values in the components of at even when at itself is not NA. Let’s look at data for one company where this creates issues.

na_sample_firm_years <-
  funda_mod |>
  filter(gvkey == "008902", datadate >= "2000-01-01") |>
  select(gvkey, datadate)

funda_mod |>
  semi_join(na_sample_firm_years, by = c("gvkey", "datadate")) |>
  mutate(at_calc = act + ppent + ivaeq + ivao + intan + ao) |>
  select(gvkey, datadate, at, at_calc, act,
         ppent, ivaeq, ivao, intan, ao) |>
# A tibble: 24 × 10
   gvkey  datadate      at at_calc   act ppent ivaeq  ivao intan    ao
   <chr>  <date>     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 008902 2010-05-31 3004.      NA 1449.  383.    NA  22.2 1071.  79.2
 2 008902 2011-05-31 3515.      NA 1868.  390.    NA  35.8 1144.  76.9
 3 008902 2002-05-31 2036.      NA  801.  356.    NA   0    857.  22.4
 4 008902 2003-05-31 2247.      NA  928.  371.    NA   0    914.  34.1
 5 008902 2017-05-31 5090.      NA 2397.  743.    NA  75   1717. 158. 
 6 008902 2000-05-31 2099.      NA  785.  366.    NA   0    916.  32.1
 7 008902 2013-05-31 4116.      NA 1886.  492.    NA  64   1573.  99.4
 8 008902 2015-05-31 4694.      NA 2100.  590.    NA  85.5 1820.  99.4
 9 008902 2016-05-31 4776.      NA 2138.  629.    NA  72.8 1795. 140. 
10 008902 2018-05-31 5272.      NA 2471.  780.    NA  70.7 1776. 173. 
# ℹ 14 more rows

Because ivaeq is NA for these years, the sum of the components of at is also NA, even though the value of at is not NA. What’s going on? The answer is that ivaeq is NA because this firm doesn’t report amounts for “Investment and Advances - Equity”, because they’re either zero or not material. Thus it seems reasonable to assume that these should be zero. To convert NA amounts to zero, we can use the coalesce() function, which takes two arguments and returns the first argument if it is not NA and returns the second argument otherwise.1 So, coalesce(ivaeq, 0) equals ivaeq when ivaeq is not NA, and 0 otherwise.

tol <- 1e-3

funda_na_fixed <-
  funda_mod |>
  select(gvkey, datadate, act, ppent, ivaeq, ivao, intan, ao, at,
         dc, aox) |>
  mutate(at_calc = coalesce(act, 0) + coalesce(ppent, 0) +
           coalesce(ivaeq, 0) + coalesce(ivao, 0) + 
           coalesce(intan, 0) + coalesce(ao, 0),
         at_diff = at_calc - at,
         balance = abs(at_diff) < tol)

Let’s see if applying coalesce(x, 0) helps. In Table 8.3, we see that, in this case, this approach allows us to tie the value for at with its components (act, ppent, ivaeq, ivao, intan, and ao). Note that we specified a “tolerance” (tol) equal to 0.001 because testing for equality of floating-point calculations is not always exact.

funda_na_fixed |>
            by = c("gvkey", "datadate")) |>
  select(gvkey, datadate, at, at_calc, balance) |>
  collect(n = 10)
Table 8.3: Balancing equation for at with coalesce(x, 0)
gvkey datadate at at_calc balance
008902 2010-05-31 3004.024 3004.024 TRUE
008902 2011-05-31 3515.029 3515.029 TRUE
008902 2002-05-31 2036.403 2036.403 TRUE
008902 2003-05-31 2247.211 2247.211 TRUE
008902 2017-05-31 5090.449 5090.449 TRUE
008902 2000-05-31 2099.203 2099.203 TRUE
008902 2013-05-31 4115.526 4115.526 TRUE
008902 2015-05-31 4694.240 4694.240 TRUE
008902 2016-05-31 4776.041 4776.041 TRUE
008902 2018-05-31 5271.822 5271.822 TRUE

But, does this approach resolve all issues with articulation of at with its components? Alas the answer seen in Table 8.4 is “no”. We use the exercises to look more closely at a few of the problem cases.

funda_na_fixed |>
Table 8.4: Approximate articulation of at
balance n
FALSE 81,047
NA 88,439
TRUE 401,834

8.3.1 Exercises

  1. What is the value in funda_na_fixed of at when balance is NA? Is this surprising?
  2. Write code to calculate decade, the decade in which datadate for each observation falls. (Hint: The functions floor() and year() and the number 10 may be helpful.)
  3. Are there any cases in funda_na_fixed where at_calc is greater than zero and at is NA? Which decades are these mostly found in? If you were doing research with these data, how might you handle these cases?
  4. Consider the firm with gvkey equal to 016476. Write code to obtain the company name and CIK from the company table defined above. Using that CIK, find the 10-K filing for the year ending November 2003 on the SEC website.2 (Note: The 10-K will be released some time after 2003-11-30.)
  5. Looking at the 10-K you just found, what seems to be going on with this observation?
funda_na_fixed |>
  filter(gvkey=="016476", datadate=="2003-11-30") |>
  select(datadate, at, at_diff, act, ppent, intan, dc, aox)
datadate at at_diff act ppent intan dc aox
2003-11-30 2983.762 54.8 1675.117 486.714 244.627 54.8 577.304
  1. Using the approach above, we can find the relevant 10-K for the following observation on the SEC’s EDGAR site.3 What’s going on with this case? What’s the most significant difference between this case and the one above? (Hint: The following additional Compustat balancing model may help: act = ppent + intan + ivao + ao.)
funda_na_fixed |>
  filter(gvkey == "145003", datadate == "2001-12-31") |>
  select(datadate, at, at_calc, at_diff, act, 
         ppent, intan, ivao, ao)
datadate at at_calc at_diff act ppent intan ivao ao
2001-12-31 6.636 7.252 0.616 4.451 1.062 0.761 0 0.978

8.4 Across-statement articulation

Under IAS 1 Presentation of Financial Statements, a complete set of financial statements will include the following four statements:

  • A statement of financial position (i.e., a balance sheet)
  • A statement of profit or loss and other comprehensive income (i.e., an income statement)
  • A statement of changes in equity
  • A statement of cash flows

Of these four statements, the balance sheet represents a statement of stocks, or balances at particular points of time, while the other three represent statements of flows, or changes in balances over time. Of the three flow statements, only the last two explicitly reconcile beginning and ending balance sheet items.4 The statement of changes in equity “provides a reconciliation of the opening and closing amounts of each component of equity for the period” and the statement of cash flows performs an analogous function with respect to the opening and closing amounts of cash and cash equivalents.

Compustat’s balancing models only address the balance sheet, income statement, and statement of cash flows. There is no balancing model for the statement of changes in equity on Compustat. So we only really have the statement of cash flows available to test the across-statement articulation within Compustat and therefore focus on that statement in the discussion below.

Before moving on to the cash flow statement, we examine the balancing model for cash on the balance sheet: ch + ivst = che, where the description of each item in the equation is given in Table 8.5.

Table 8.5: Components of cash balance
Item Description
ch Cash
ivst Short-Term Investments
che Cash and Short-Term Investments- Total

A wrinkle here is that there is small number of cases where we are missing the total (che), but we have values for one of its components (ivst or ch).

funda_mod |>
  select(gvkey, datadate, che, ch, ivst) |>
  count(missing_che = is.na(che),
        missing_ch = is.na(ch), 
        missing_ivst = is.na(ivst)) |>
  arrange(missing_che, missing_ch, missing_ivst)
Table 8.6: Missingness in components of cash balance
missing_che missing_ch missing_ivst n

While more digging might be appropriate to work out how best to handle these cases, below we take the simple expedient of using ch or ivst as the value for che when che is missing, but ch is not. The following suggests that this modified balancing model holds in almost every case where at least one of components is not missing (though there is a non-trivial number of cases where che, ch and ivst are all missing).

funda_mod |>
  select(gvkey, datadate, che, ch, ivst) |>
  mutate(missing_che = is.na(che),
         missing_che_comps = is.na(ch) & is.na(ivst),
         che = case_when(!is.na(che) ~ che,
                         !is.na(ivst) ~ ivst,
                         !is.na(ch) ~ ch),
         che_calc = coalesce(ch, 0) + coalesce(ivst, 0),
         che_diff = che_calc - che,
         che_balance = abs(che_diff) < tol) |>
  count(che_balance, missing_che, missing_che_comps) |>
  collect() |>
Table 8.7: Missingness in che components with coalesce(x, 0)
che_balance missing_che missing_che_comps n

Having explored the Compustat balancing model for cash on the balance sheet, we now consider the Compustat balancing model for the annual cash flow statement, which is expressed as follows: oancf + ivncf + fincf + exre = chech, where the description of each item in the equation is given in Table 8.8.

Table 8.8: Components of cash flow statement
Item Description
oancf Operating Activities - Net Cash Flow
ivncf Investing Activities - Net Cash Flow
fincf Financing Activities - Net Cash Flow
exre Exchange Rate Effect
chech Cash and Cash Equivalents - Increase (Decrease)

Let’s collect some data to check whether the cash flow statement balancing model holds. The first step is to calculate its left-hand side, chech_calc, converting missing values for each of the four elements to zero. Here we use across() to modify multiple columns in one step. (The across() function is covered in some detail in Chapter 26 of R for Data Science.) We then compare chech_calc and chech and, if these two values are essentially equal, we set chech_balance to TRUE.

funda_mod |>
  select(gvkey, datadate, oancf, ivncf, fincf, exre, chech) |>
  mutate(across(oancf:exre, \(x) coalesce(x, 0)),
         chech_calc = oancf + ivncf + fincf + exre,
         chech_diff = chech_calc - chech,
         chech_balance = abs(chech_diff) < tol) |>
  filter(!is.na(chech), chech_calc != 0) |>
  count(chech_balance) |>
# A tibble: 2 × 2
  chech_balance      n
  <lgl>          <dbl>
1 FALSE            521
2 TRUE          305749

So there are relatively few cases where the first equation doesn’t hold.

Note that chech is likely collected from the statement of cash flows itself, so this is really a within-statement relationship. This leaves the final question: Do the amounts provided in Compustat items related to the balance sheet tie to amounts provided in corresponding items related to the statement of cash flows? Here things get a little more complicated because we need to compare changes in cash implied by two balance sheets with chech. In practice, it turns out that, while in many cases ch (“cash”) equals che (“cash and cash equivalents”), in cases where ch does not equal che, there are firms whose statements of cash flows reconcile with changes in ch and firms whose statements of cash flows reconcile with changes in ch. Note that we loosen the tolerance a bit (tol <- 0.1) in this test to allow for rounding errors in the presentation of multiple financial statements.

Note that we use the window function lag() in this query. With a window function, the data are organized into partitions and the function operates on each partition independently and possibly uses data from multiple rows of data within that window. Here we construct partitions based on gvkey using group_by(gvkey) because we only want to consider values for a given firm. The data within each partition are ordered by datadate using window_order(). As its name suggests, the lag() function for each value simply returns the value for the previous element in the window.5 We will see several window functions in this book, including lead() (the opposite of lag()), row_number() (the position of the row in the window), and cumsum() (the cumulative sum of the window’s values). A short discussion of window functions is found in Chapter 21 of R for Data Science.

tol <- 0.1

funda_cf_balance <-
  funda_mod |>
  select(gvkey, datadate, oancf, ivncf, fincf, 
         exre, chech, che, ch, ivst) |>
  mutate(across(oancf:exre, \(x) coalesce(x, 0)),
         chech_calc = oancf + ivncf + fincf + exre,
         chech_balance = abs(chech_calc - chech) < tol) |>
  filter(chech_calc != 0, chech_balance) |>
  group_by(gvkey) |> 
  window_order(datadate) |> 
  mutate(lag_datadate = lag(datadate),
         d_che = che - lag(che), 
         d_ch = ch - lag(ch)) |>
  ungroup() |>
  filter(!is.na(d_che) | !is.na(d_ch)) |>
  mutate(artic_desc = 
           case_when(abs(d_ch - d_che) < tol & abs(chech - d_che) < tol
                        ~ "Articulates (CHE == CH)",
                     abs(chech - d_che) < tol  ~ "Articulates using CHE",
                     abs(chech - d_ch) < tol   ~ "Articulates using CH",
                     abs(chech - d_che) >= tol ~ "Does not articulate",
                     abs(chech - d_ch) >= tol  ~ "Does not articulate",
                     .default = "Other"))
funda_cf_balance |>
  count(artic_desc) |>
  collect() |>
  rename(`Description` = artic_desc,
         `Number of cases` = n)
Table 8.9: Data on cash flow statement articulation
Description Number of cases
Articulates using CHE 16,682
Articulates (CHE == CH) 137,638
Articulates using CH 80,462
Does not articulate 42,078

In Table 8.9 we see that there is a non-trivial number of cases where the cash flow statement appears not to articulate with changes in cash balances on the balance sheet. We explore some cases of this failure to articulate in the exercises.

8.4.1 Exercises

  1. In checking cash flow statement articulation, we used the across() function (the documentation ? dplyr::across provides more detail). Rewrite the code to create funda_na_fixed above to use a similar approach. Check that you get the same results with respect to count(balance) as you get using the original code.

  2. Consider the case of American Airlines (GVKEY: 001045) for the year ended 31 December 2020:

funda_cf_balance |> 
  filter(datadate == "2020-12-31", gvkey == "001045") |>
  select(gvkey, datadate, chech, d_che, d_ch)
gvkey datadate chech d_che d_ch
001045 2020-12-31 109 3489 -35

Look up the relevant American Airlines 10-K on sec.gov. (Hint: You can get the CIK from comp.company.) What explains the gaps above?

8.5 Missing R & D

In this section, we discuss Koh and Reeb (2015), who “investigate whether missing R&D expenditures in financial statements indicates a lack of innovation activity. Patent records reveal that 10.5% of missing R&D firms file and receive patents, which is 14 times greater than zero R&D firms.”

8.5.1 Discussion questions

  1. What is the following code doing? Koh and Reeb (2015) use a similar filter. What is their rationale for this?
included_firms <-
  company |>
  mutate(sic = as.integer(sic)) |>
  filter(!between(sic, 4900, 4999), !between(sic, 6000, 6999)) |>

rd_data <-
  funda_mod |>
  semi_join(included_firms, by = "gvkey") |>
  mutate(missing_rd = is.na(xrd),
         zero_rd = xrd == 0,
         year = year(datadate)) |>
  select(gvkey, datadate, year, xrd, missing_rd, zero_rd, at) |>
  1. Is there a balancing model that includes R&D spending?

  2. Focusing on years 1980-2019, what have been the trends in the proportion of firms not reporting R&D and the “average” amount of R&D (scaled by assets)?6 (In calculating the “average” does it make sense to use the mean() or median()?) Speculate as to what might explain these trends.

  3. Koh and Reeb (2015) find that many “studies in The Accounting Review use R&D in their analysis and code the missing values as zero, implicitly assuming that blank R&D is equal to zero R&D, [while many] articles in the Journal of Finance [code] … the blank values as zero and including a dummy variable to indicate blank R&D firms. … In contrast, 42% of the studies in the Strategic Management Journal use R&D and they take a very different approach, often replacing the missing R&D values with either the industry average R&D, or a historical value from prior years.” Why do scholars from different fields make such different choices? In light of Koh and Reeb (2015), do you think that one approach is more correct than the others?

  4. Consider the 10-K filing made by IHS Markit Ltd on 2021-01-22. Based on the information in the filing, do you think that IHS Markit engages in research and development activity? Does IHS Markit generate patents? Are there other forms of intellectual property protection that IHS Markit relies on? Does IHS Markit report an amount for research and development expenditure? Who is IHS Markit’s external auditor? Do you think that IHS Markit is in violation of GAAP? Or is its reporting choice with respect to R&D a within-GAAP use of reporting discretion?

  5. Koh and Reeb (2015) state that “our first set of tests compare patent activity between non-reporting R&D firms and firms that report zero R&D. … We use both full sample and propensity score matched samples. … While full sample tests allow for greater external validity, the matched sample tests potentially improve the local treatment effect. … Based on the propensity score matched sample, our multivariate tests indicate that, on average, non-reporting R&D firms file about 14 times more patent applications than the matched zero R&D firms.” In speaking of “treatment effects” Koh and Reeb (2015) implicitly view disclosure of R&D as a treatment variable and patent applications as an outcome. Does this make sense to you? Do you think that this is really what Koh and Reeb (2015) want to do? If not, what is the inference they are trying to draw?

  6. What exactly is the take-away from the analysis in section 4.4 which uses “the rapid demise of Arthur Andersen (AA) as a quasi-natural experiment”? For example, what is the treatment? What assumptions are needed to generalize from the specific treatment to a treatment of wider applicability? What additional analyses can you suggest that might provide additional assurance that the results reflect a causal effect of interest?

  7. What are the suggested implications of the simulation analysis of Section 4.5 of Koh and Reeb (2015)? What are the inherent limitations in a simulation analysis like this one?

  1. As discussed in its documentation, the coalesce() function from dplyr is “inspired by the SQL COALESCE function which does the same thing for NULLs”.↩︎

  2. See https://www.sec.gov/edgar/searchedgar/companysearch.↩︎

  3. See https://www.sec.gov/Archives/edgar/data/1074874/000093041302001117/c23823_10ksb.txt.↩︎

  4. While the income statement needs to articulate with retained earnings, it does not provide a complete account of changes in that balance sheet account.↩︎

  5. A careful reader might have noticed that we actually used the window function fill() in Chapter 2. However, there we used arrange() instead of window_order(). The window_order() function is only available for remote data frames because it relies on the SQL backed to provide functionality not available with local data frames. Readers coming from an SQL background might observe that dplyr’s group_by() is “overloaded” in the sense that it does the work of both the GROUP BY statement and the PARTITION BY clause in SQL.↩︎

  6. At the time of initial writing, 2020 was too incomplete to be meaningful.↩︎