8 Financial statements: A second look

In this chapter, we will dive into financial statements a bit more deeply than we did in Chapter 7. The focus of this chapter is an exploration of core attributes of financial accounting data (e.g., that balance sheets balance). After exploring how financial statement data are represented in Compustat, we dive into our first detailed exploration of an accounting research paper, Koh and Reeb (2015), which explores issues regarding how R&D spending is shown in Compustat.

This chapters is optional in the sense that subsequent chapters do not depend on an understanding of the content of this chapter. However, the material of this chapter provides a good opportunity for readers to better understand the messiness of representing seemingly tidy relationships in accounting data in a database. Some of the exercises help readers to understand better where data in Compustat come from. Working through this chapter also serves to dispel any notion that commercial databases are error-free, as we find discrepancies that are apparent even from just the information in Compustat.

8.1 Core attributes of financial statements

Three core attributes of financial statements that should hold are the following:

  1. Balance-sheet balance. Balance sheets should balance (i.e., for total assets should equal the sum of total liabilities and shareholders’ equity). This is a sine qua non of double-entry bookkeeping.
  2. Within-statement articulation. The various sub-totals within a financial statement make sense with regard to the items they comprise. For example, total assets should equal the sum of its component assets.
  3. Across-statement articulation. Financial statements should articulate. For example, the amounts reported for cash on the beginning and ending balance sheets should be explained by the statement of cash flows.

Below we explore how these attributes are reflected in Compustat. As in Chapter 7, we will focus on two tables from Compustat’s North American database: comp.funda and comp.company and we will construct a table funda_mod that contains the “standard” set of observations for Compustat.

We begin by setting up objects for the remote data frames we used in Chapter 7.

library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(dbplyr)       # For window_order()
library(ggplot2)
pg <- dbConnect(RPostgres::Postgres(), 
                bigint = "integer", 
                check_interrupts = TRUE)

funda <- tbl(pg, sql("SELECT * FROM comp.funda"))
company <- tbl(pg, sql("SELECT * FROM comp.company"))

funda_mod <-
  funda %>%
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D")

8.2 Balance sheets

An essential feature of statements of financial position is that they balance, hence the more traditional term “balance sheets”. Let’s check that this holds on Compustat. Total assets is at and the sum of liabilities and shareholders’ equity is lse.

funda_mod %>%
  filter(at != lse) %>%
  select(gvkey, datadate, at, lse)
## # Source:   SQL [0 x 4]
## # Database: postgres  [iangow@/tmp:5432/iangow]
## # … with 4 variables: gvkey <chr>, datadate <date>, at <dbl>, lse <dbl>

So everything balances at this level. Note that if either at or lse is NA, then the filter will not apply. For completeness, let’s look into missingness of at and lse:

funda_mod %>%
  mutate(missing_at = is.na(at), missing_lse = is.na(lse)) %>%
  count(missing_at, missing_lse) %>%
  ungroup() %>%
  collect()
## # A tibble: 2 × 3
##   missing_at missing_lse      n
##   <lgl>      <lgl>        <int>
## 1 FALSE      FALSE       471127
## 2 TRUE       TRUE         83492

In short, if at is missing, so is lse; if at is present, so is lse.

8.3 Within-statement articulation

Having established that balance sheets always balance on Compustat, the next question we study is whether the various balancing models for funda hold. Compustat provides balancing models that explain how various items on financial statements in its database relate to each other. WRDS provides these balancing models as Excel files here (requires WRDS access).

One example of a balancing model relates to the decomposition of total assets (at) as follows:

\[ \textit{ACT} + \textit{PPENT} + \textit{IVAEQ} + \textit{IVAO} + \textit{INTAN} + \textit{AO} = \textit{AT} \] where the description of each item in the equation is as follows:

Item Description
ACT Current Assets - Total
PPENT Property Plant and Equipment - Total (Net)
IVAEQ Investment and Advances - Equity
IVAO Investment and Advances - Other
INTAN Intangible Assets - Total
AO Assets - Other - Total
DC Deferred Charges (component of AO)
AOX Assets - Other (Sundry) (component of AO)

The first issue we need to think about is the presence of NA values in the components of at even when at itself is not NA. Let’s look at data for one company where this creates issues.

na_sample_firm_years <-
  funda_mod %>%
  filter(gvkey == "008902", datadate >= "2000-01-01") %>%
  select(gvkey, datadate)

funda_mod %>%
  semi_join(na_sample_firm_years, by = c("gvkey", "datadate")) %>%
  mutate(at_calc = act + ppent + ivaeq + ivao + intan + ao) %>%
  select(gvkey, datadate, at, at_calc, act, ppent, ivaeq, ivao, intan, ao) %>%
  collect()
## # A tibble: 23 × 10
##    gvkey  datadate      at at_calc   act ppent ivaeq  ivao intan    ao
##    <chr>  <date>     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 008902 2000-05-31 2099.      NA  785.  366.    NA     0  916.  32.1
##  2 008902 2001-05-31 2078.      NA  819.  362.    NA     0  872.  25.4
##  3 008902 2002-05-31 2036.      NA  801.  356.    NA     0  857.  22.4
##  4 008902 2003-05-31 2247.      NA  928.  371.    NA     0  914.  34.1
##  5 008902 2004-05-31 2353.      NA  995.  381.    NA     0  931.  46.8
##  6 008902 2005-05-31 2656.      NA 1271.  390.    NA     0  939.  55.8
##  7 008902 2006-05-31 2980.      NA 1369.  445.    NA     0 1073.  93.7
##  8 008902 2007-05-31 3333.      NA 1570.  473.    NA     0 1182. 108. 
##  9 008902 2008-05-31 3764.      NA 1784.  498.    NA    NA 1293. 189. 
## 10 008902 2009-05-31 3410.      NA 1553.  470.    NA    NA 1214. 173. 
## # … with 13 more rows

Because ivaeq is NA for these years, the sum of the components of at is also NA, even though the value of at is not NA. What’s going on? The answer is that ivaeq is NA because this firm doesn’t report amounts for “Investment and Advances - Equity”, because they’re either zero or not material. Thus it’s reasonable to assume that these should zero. To convert NA amounts to zero, we can use the coalesce function, which takes two arguments and returns the first argument if it is not NA and returns the second argument otherwise.64 So, coalesce(ivaeq, 0) equals ivaeq when ivaeq is not NA, and 0 otherwise.

Let’s see if applying the coalesce function to treat NA values as zeros helps:

tol <- 1e-3

funda_na_fixed <-
  funda_mod %>%
  select(gvkey, datadate, act, ppent, ivaeq, ivao, intan, ao, at,
         dc, aox) %>%
  mutate(at_calc = coalesce(act, 0) + coalesce(ppent, 0) + coalesce(ivaeq, 0) + 
           coalesce(ivao, 0) + coalesce(intan, 0) + coalesce(ao, 0),
         at_diff = at_calc - at,
         balance = abs(at_diff) < tol)
  
funda_na_fixed %>%
  semi_join(na_sample_firm_years,
            by = c("gvkey", "datadate")) %>%
  select(gvkey, datadate, at, at_calc, balance) %>%
  collect()
## # A tibble: 23 × 5
##    gvkey  datadate      at at_calc balance
##    <chr>  <date>     <dbl>   <dbl> <lgl>  
##  1 008902 2000-05-31 2099.   2099. TRUE   
##  2 008902 2001-05-31 2078.   2078. TRUE   
##  3 008902 2002-05-31 2036.   2036. TRUE   
##  4 008902 2003-05-31 2247.   2247. TRUE   
##  5 008902 2004-05-31 2353.   2353. TRUE   
##  6 008902 2005-05-31 2656.   2656. TRUE   
##  7 008902 2006-05-31 2980.   2980. TRUE   
##  8 008902 2007-05-31 3333.   3333. TRUE   
##  9 008902 2008-05-31 3764.   3764. TRUE   
## 10 008902 2009-05-31 3410.   3410. TRUE   
## # … with 13 more rows

Yes, we see that, in this case, this approach allows us to tie the value for at with its components (act, ppent, ivaeq, ivao, intan, and ao). Note that we specified a “tolerance” (tol) equal to 0.001 because testing for equality of floating-point calculations is not always exact.

But, does this approach resolve all issues with articulation of at with its components?

funda_na_fixed %>%
  count(balance) %>%
  collect()
## # A tibble: 3 × 2
##   balance      n
##   <lgl>    <int>
## 1 FALSE    77644
## 2 TRUE    393483
## 3 NA       83492

Alas the answer is “no”. We use the exercises to look more closely at a few of the problem cases.

8.3.1 Exercises

  1. What is the value in funda_na_fixed of at when balance is NA? Is this surprising?
  2. Write code to calculate decade, the decade in which datadate for each observation falls. (Hint: The functions floor and year and the number 10 may be helpful.)
  3. Are there any cases in funda_na_fixed where at_calc is greater than zero and at is NA? Which decades are these mostly found in? If you were doing research with these data, how might you handle these cases?
  4. Consider the firm with gvkey equal to 016476. Write code to obtain the company name and CIK from the company table defined above. Using that CIK, find the 10-K filing for the year ending 2001-12-31 on the SEC website. (Note: The 10-K will be released some time after 2001-12-31.)
  5. Looking at the 10-K you just found, what seems to going on with this observation?
funda_na_fixed %>%
  filter(gvkey=="016476", datadate=="2003-11-30") %>%
  select(datadate, at, at_diff, act, ppent, intan, dc, aox) %>%
  collect()
## # A tibble: 1 × 8
##   datadate      at at_diff   act ppent intan    dc   aox
##   <date>     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2003-11-30 2984.    54.8 1675.  487.  245.  54.8  577.
  1. Using the approach above, we can find the relevant 10-K for the following observation here. What’s going on with this case? What’s the most significant difference between this case and the one above? (Hint: The following additional Compustat balancing model may help: act = ppent + intan + ivao + ao.)
funda_na_fixed %>%
  filter(gvkey=="145003", datadate=="2001-12-31") %>%
  select(datadate, at, at_calc, at_diff, act, ppent, intan, ivao, ao) %>%
  collect()
## # A tibble: 1 × 9
##   datadate      at at_calc at_diff   act ppent intan  ivao    ao
##   <date>     <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2001-12-31  6.64    7.25   0.616  4.45  1.06 0.761     0 0.978

8.4 Across-statement articulation

Under IAS 1 Presentation of Financial Statements, a complete set of financial statements will include the following four statements:

  • A statement of financial position (i.e., a balance sheet)
  • A statement of profit or loss and other comprehensive income (i.e., an income statement)
  • A statement of changes in equity
  • A statement of cash flows

Of these four statements, the balance sheet represents a statement of stocks, or balances at particular points of time, while the other three represent statements of flows, or changes in balances over time. Of the three flow statements, only the last two explicitly reconcile beginning and ending balance sheet items.65 The statement of changes in equity “provides a reconciliation of the opening and closing amounts of each component of equity for the period” and the statement of cash flows performs an analogous function with respect to the opening and closing amounts of cash and cash equivalents.

Compustat’s balancing models only address the balance sheet, income statement, and statement of cash flows. There is no balancing model for the statement of changes in equity on Compustat. So we only really have the statement of cash flows available to test the across-statement articulation within Compustat and therefore focus on that statement in the discussion below.

Before moving on to the cash flow statement, we examine the balancing model for cash on the balance sheet:

\[ \textit{CH} + \textit{IVST} = \textit{CHE} \] where the description of each item in the equation is as follows

Item Description
CH Cash
IVST Short-Term Investments
CHE Cash and Short-Term Investments- Total

A wrinkle here is that there is small number of cases where we are missing the total (che), but we have values for one of its components (ivst or ch).

funda_mod %>%
  select(gvkey, datadate, che, ch, ivst) %>%
  count(missing_che = is.na(che),
        missing_ch = is.na(ch), 
        missing_ivst = is.na(ivst)) %>%
  collect()
## # A tibble: 7 × 4
##   missing_che missing_ch missing_ivst      n
##   <lgl>       <lgl>      <lgl>         <int>
## 1 FALSE       FALSE      FALSE        402548
## 2 FALSE       FALSE      TRUE           1210
## 3 FALSE       TRUE       FALSE             4
## 4 FALSE       TRUE       TRUE          62164
## 5 TRUE        FALSE      TRUE             75
## 6 TRUE        TRUE       FALSE            59
## 7 TRUE        TRUE       TRUE          88559

While more digging might be appropriate to work out how best to handle these cases, below we take the simple expedient of using ch or ivst as the value for che when che is missing, but ch is not. The following suggests that this modified balancing model holds in almost every case where at least one of components is not missing (though there is a non-trivial number of cases where che, ch and ivst are all missing).

funda_mod %>%
  select(gvkey, datadate, che, ch, ivst) %>%
  mutate(missing_che = is.na(che),
         missing_che_comps = is.na(ch) & is.na(ivst),
         che = case_when(!is.na(che) ~ che,
                         !is.na(ivst) ~ ivst,
                         !is.na(ch) ~ ch,
                         FALSE ~ NA),
         che_calc = coalesce(ch, 0) + coalesce(ivst, 0),
         che_diff = che_calc - che,
         che_balance = abs(che_diff) < tol) %>%
  count(che_balance, missing_che, missing_che_comps) %>%
  collect() %>%
  arrange(desc(n))
## # A tibble: 6 × 4
##   che_balance missing_che missing_che_comps      n
##   <lgl>       <lgl>       <lgl>              <int>
## 1 TRUE        FALSE       FALSE             403761
## 2 NA          TRUE        TRUE               88559
## 3 FALSE       FALSE       TRUE               61907
## 4 TRUE        FALSE       TRUE                 257
## 5 TRUE        TRUE        FALSE                134
## 6 FALSE       FALSE       FALSE                  1

Having explored the Compustat balancing model for cash on the balance sheet, we now consider the Compustat balancing model for the annual cash flow statement, which is expressed as follows:

\[ \textit{OANCF} + \textit{IVNCF} + \textit{FINCF} + \textit{EXRE} = \textit{CHECH} \] where the description of each item in the equation is as follows

Item Description
OANCF Operating Activities - Net Cash Flow
IVNCF Investing Activities - Net Cash Flow
FINCF Financing Activities - Net Cash Flow
EXRE Exchange Rate Effect
CHECH Cash and Cash Equivalents - Increase (Decrease)

Let’s collect some data to check whether the cash flow statement balancing model holds. The first step is to calculate its left-hand side, chech_calc, converting missing values for each of the four elements to zero. (Here we use the across function, which is explained here.) We then compare chech_calc and chech and, if these two value are essentially equal, we set chech_balance to TRUE.

funda_mod %>%
  select(gvkey, datadate, oancf, ivncf, fincf, exre, chech) %>%
  mutate(across(oancf:exre, ~ coalesce(., 0)),
         chech_calc = oancf + ivncf + fincf + exre,
         chech_diff = chech_calc - chech,
         chech_balance = abs(chech_diff) < tol) %>%
  filter(!is.na(chech), chech_calc != 0) %>%
  count(chech_balance) %>%
  collect()
## # A tibble: 2 × 2
##   chech_balance      n
##   <lgl>          <int>
## 1 FALSE            484
## 2 TRUE          295671

So there are relatively few cases where the first equation doesn’t hold.

Note that chech is likely collected from the statement of cash flows itself, so this is really a within-statement relationship. This leaves the final question: Do the amounts provided in Compustat items related to the balance sheet tie to amounts provided in items related to the statement of cash flows? Here things get a little more complicated because we need to compare changes in cash implied by two balance sheets with chech. In practice, it turns out that, while in many cases ch (“cash”) equals che (“cash and cash equivalents”), in cases where ch does not equal che, there are firms whose statements of cash flows reconcile with changes in ch and firms whose statements of cash flows reconcile with changes in ch.

tol <- 0.1

funda_cf_balance <-
  funda_mod %>%
  select(gvkey, datadate, oancf, ivncf, fincf, exre, chech, che, ch, ivst) %>%
  mutate(across(oancf:exre, ~ coalesce(., 0)),
         chech_calc = oancf + ivncf + fincf + exre,
         chech_balance = abs(chech_calc - chech) < tol) %>%
  filter(chech_calc != 0, chech_balance) %>%
  group_by(gvkey) %>% 
  window_order(datadate) %>% 
  mutate(lag_datadate = lag(datadate),
         d_che = che - lag(che), 
         d_ch = ch - lag(ch)) %>%
  ungroup() %>%
  filter(!is.na(d_che) | !is.na(d_ch)) %>%
  mutate(artic_desc = 
           case_when(abs(d_ch - d_che) < tol & abs(chech - d_che) < tol
                        ~ "Articulates (CHE == CH)",
                     abs(chech - d_che) < tol   ~ "Articulates using CHE",
                     abs(chech - d_ch) < tol    ~ "Articulates using CH",
                     abs(chech - d_che) >= tol  ~ "Does not articulate",
                     abs(chech - d_ch) >= tol   ~ "Does not articulate",
                     TRUE ~ "Other"))

funda_cf_balance %>%
  count(artic_desc) %>%
  collect()
## # A tibble: 4 × 2
##   artic_desc                   n
##   <chr>                    <int>
## 1 Does not articulate      39284
## 2 Articulates using CHE    15003
## 3 Articulates using CH     78201
## 4 Articulates (CHE == CH) 134687

Here we see that there is a non-trivial number of cases where the cash flow statement appears not to articulate with changes in cash balances on the balance sheet. We explore some cases of this failure to articulate in the exercises.

8.4.1 Exercises

  1. In checking cash flow statement articulation, we used the across function (see details here). Rewrite the code to create funda_na_fixed to use a similar approach. Check that you get the same results with respect to count(balance) as you get using the original code.

  2. Consider the case of American Airlines (GVKEY: 001045) for the year ended 31 December 2020:

funda_cf_balance %>% 
  filter(datadate == "2020-12-31", gvkey == "001045") %>%
  select(gvkey, datadate, chech, d_che, d_ch)
## # Source:     SQL [1 x 5]
## # Database:   postgres  [iangow@/tmp:5432/iangow]
## # Ordered by: datadate
##   gvkey  datadate   chech d_che  d_ch
##   <chr>  <date>     <dbl> <dbl> <dbl>
## 1 001045 2020-12-31   109  3489   -35

Look up the relevant American Airlines 10-K on sec.gov. (Hint: You can get the CIK from comp.company.) What explains the gaps above?

8.5 Missing R & D

In this section, we discuss Koh and Reeb (2015), who “investigate whether missing R&D expenditures in financial statements indicates a lack of innovation activity. Patent records reveal that 10.5% of missing R&D firms file and receive patents, which is 14 times greater than zero R&D firms.”

8.5.1 Discussion questions

  1. What is the following code doing? Koh and Reeb (2015) use a similar filter. How do they explain their rationale for this?
included_firms <-
  company %>%
  mutate(sic = as.integer(sic)) %>%
  filter(!between(sic, 4900, 4999), !between(sic, 6000, 6999)) %>%
  select(gvkey)

rd_data <-
  funda_mod %>%
  semi_join(included_firms, by = "gvkey") %>%
  mutate(missing_rd = is.na(xrd),
         zero_rd = xrd == 0,
         year = year(datadate)) %>%
  select(gvkey, datadate, year, xrd, missing_rd, zero_rd, at) %>%
  collect()
  1. Is there a balancing model that includes R&D spending?

  2. Koh and Reeb (2015) find that many “studies in The Accounting Review use R&D in their analysis and code the missing values as zero, implicitly assuming that blank R&D is equal to zero R&D, [while many] articles in the Journal of Finance [code] … the blank values as zero and including a dummy variable to indicate blank R&D firms. … In contrast, 42% of the studies in the Strategic Management Journal use R&D and they take a very different approach, often replacing the missing R&D values with either the industry average R&D, or a historical value from prior years.” Why do scholars from different fields make such different choices? In light of Koh and Reeb (2015), do you think that one approach is more correct than the others?

  3. Focusing on years 1980-2019, what have been the trends in the proportion of firms not reporting R&D and the “average” amount of R&D (scaled by assets)?66 (In calculating the “average” does it make sense to use the mean or median?) Speculate as to what might explain these trends.

  4. Consider the 10-K filing made by IHS Markit Ltd on 2021-01-22. Based on the information in the filing, do you think that IHS Markit engages in research and development activity? Does IHS Markit generate patents? Are there other forms of intellectual property protection that IHS Markit relies on? Does IHS Markit report an amount for research and development expenditure? Who is the IHS Markit’s external auditor? Do you think that IHS Markit is in violation of GAAP? Or is its reporting choice with respect to R&D a within-GAAP use of reporting discretion?

  5. Koh and Reeb (2015) state that “our first set of tests compare patent activity between non-reporting R&D firms and firms that report zero R&D. … We use both full sample and propensity score matched samples. … While full sample tests allow for greater external validity, the matched sample tests potentially improve the local treatment effect. … Based on the propensity score matched sample, our multivariate tests indicate that, on average, non-reporting R&D firms file about 14 times more patent applications than the matched zero R&D firms.” In speaking of “treatment effects” Koh and Reeb (2015) implicitly view disclosure of R&D as a treatment variable and patent applications as an outcome. Does this make sense to you? Do you think that this is really what Koh and Reeb (2015) want to do? If not, what is the inference they are trying to draw?

  6. What exactly is the take-away from the analysis in section 4.4 which uses “the rapid demise of Arthur Andersen (AA) as a quasi-natural experiment”? For example, what is the treatment? What assumptions are needed to generalize from the specific treatment to a treatment of wider applicability? What additional analyses can you suggest that might provide additional assurance that the results reflect an causal effect of interest?

  7. What are the suggested implications of the simulation analysis of section 4.5? What are the inherent limitations in a simulation analysis like this one?