7 Financial statements: A first look

In this chapter, we provide an introduction to data on financial statements, which are the primary focus of financial reporting. While a lot of contemporary accounting research has only a tangential connection to financial reporting, it’s not clear there would be a field of accounting research without financial reporting and most accounting research paper use financial reporting data to some degree (e.g., as controls). In this chapter, we will focus on annual financial statement data of North American firms, as provided by Compustat.

7.1 Setting up WRDS

Academic researchers generally get Compustat data through Wharton Research Data Services, more commonly referred to as WRDS (pronounced like “words”). In the following, we assume that you have set up your computer (as described in section 1.2.1), that you have a WRDS account, and that you are connected to the internet.

In this chapter, we will use the R libraries specified in the following code. We have seen dplyr and ggplot2 in earlier chapters, but the DBI library is new. The DBI library provides a standardized interface to relational database systems, including PostgreSQL (the system used by WRDS), MySQL, and SQLite.

First, let’s load the required packages

library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(ggplot2)

Second, let’s connect to the WRDS database. To actually use the code below, you should replace your_WRDS_ID with your actual WRDS ID. You may also need to add a line Sys.setenv(PGPASSWORD="your_password") as the third line of code.53

Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
           PGPORT = 9737L,
           PGUSER = "your_WRDS_ID", 
           PGDATABASE = "wrds")

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

The first line of code above sets up environment variables containing information that R can use as defaults when connecting to a PostgreSQL database. The second line of code creates the actual connection to the database and assigns it to the variable pg.

An alternative to the above code would be to pass connection information to the dbConnect() function directly, as shown in the code below.54

pg <- dbConnect(RPostgres::Postgres(), 
                host = "wrds-pgdata.wharton.upenn.edu", 
                port = 9737L, 
                # user = "your_WRDS_ID",
                # password = "a_really_good_password",
                dbname ="wrds",
                bigint = "integer")

However, for reasons we discuss below, we recommend using the environment variable-based approach above.55

Now that we have established a connection to the WRDS database, we can get some data. Data sources on WRDS are organized into libraries according to their source (in PostgreSQL, these libraries are called schemas, but we will stick with the WRDS terminology for now).

In this chapter we will focus on Compustat data, which is found in the comp library. While there are other sources of financial statement data, in accounting research the pre-eminent data source is Compustat, which is typically obtained via WRDS.

We first look at the table, comp.company, which contains information about the companies in Compustat’s North American database (the equivalent table in Compustat’s “Global” database is comp.g_company). By running the following code, we create R objects that point to the data on the WRDS database and behave a lot like the data frames you have seen in earlier chapters.

company <- tbl(pg, sql("SELECT * FROM comp.company"))

What have we done here? In the first line, we have created an object company that “points to” the data returned by running the SQL query SELECT * FROM comp.company on the data source pg (which is the WRDS database). From the help for the tbl function, we learn that in tbl(src, ...), src refers to the data source. Here pg is an object representing the connection to the WRDS PostgreSQL database. The ... allows us to pass other arguments to the function. In this case, we provide sql("SELECT * FROM comp.company"), which tells the tbl function that we’re passing SQL SELECT * FROM comp.company. This code selecting all variables (*) and all rows from the table comp.company.

But what exactly is company? One way to get information about company is to use the class function:

class(company)
## [1] "tbl_PqConnection" "tbl_dbi"          "tbl_sql"          "tbl_lazy"        
## [5] "tbl"

The output here is a little complicated for a new user, but the basic idea is that company is a tbl_PqConnection object, which is a kind of tbl_sql, which is a kind of tbl_dbi, which is a kind of tbl_lazy, which is a kind of tbl, which means it a tibble of some kind. The critical idea here is that funda is a tbl_sql object, which means it behaves like a tibble, but is actually effectively a reference to a table or query in a relational database. For want of a better term, company is a remote data frame.

While there are some technical details here that are unimportant for our discussion, the important points are that, as a tbl_sql, funda has three important properties:

  1. It’s not a data frame in memory on our computer. This means that we don’t need to download the all of the data and load it into memory.
  2. It’s not a data frame in memory on the remote server (in this case, the WRDS PostgreSQL server). This means that we didn’t load the full data set into memory even onto the remote computer.
  3. Notwithstanding the previous two points, company behaves in many important ways just like a data frame.

We can use filter to get only rows meeting certain conditions and we can use select to indicate which columns we want to retrieve:

company %>% 
  filter(gsubind == "45202030", 
         substr(conm, 1, 1) == "A") %>%
  select(gvkey, conm, state)
## # Source:   SQL [3 x 3]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##   gvkey  conm                state
##   <chr>  <chr>               <chr>
## 1 001690 APPLE INC           CA   
## 2 001820 ASTRONOVA INC       RI   
## 3 027925 AVID TECHNOLOGY INC MA

When we are dealing with a remote data frame, the dplyr code that we write is automatically translated into SQL, a specialized language from manipulating tabular data “used by pretty much every database in existence” (Appendix C has more on SQL and dplyr).

We can inspect the translated SQL using the show_query() function:

company %>% 
  filter(gsubind == "45202030", 
         substr(conm, 1, 1) == "A") %>%
  select(gvkey, conm, state) %>%
  show_query()
## <SQL>
## SELECT "gvkey", "conm", "state"
## FROM (SELECT * FROM comp.company) "q01"
## WHERE ("gsubind" = '45202030') AND (SUBSTR("conm", 1, 1) = 'A')

Here we see that the filter verb is translated into equivalent SQL using WHERE.

While remote data frames behave a lot like local data frames, they are not quite the same thing and, for many purposes, we need to use local data frames. For example, the lm() function for fitting regressions assumes that the object provided as its data argument is a data frame and it will not work if we give it a remote data frame instead. To convert the table from a remote data frame to a local one is achieved using the collect verb.

df_company <-
  company %>% 
  filter(gsubind == "45202030", 
         substr(conm, 1, 1) == "A") %>%
  select(gvkey, conm, state) %>%
  collect()

class(df_company)
## [1] "tbl_df"     "tbl"        "data.frame"
df_company
## # A tibble: 3 × 3
##   gvkey  conm                state
##   <chr>  <chr>               <chr>
## 1 001690 APPLE INC           CA   
## 2 001820 ASTRONOVA INC       RI   
## 3 027925 AVID TECHNOLOGY INC MA

We use collect and store the result in df_company. Looking at class(df_company), we see that it is not only a “tibble” (tbl_df), but also data.frame, meaning that we can pass it to functions (such as lm) that expect data.frames objects.

7.2 Financial statement data

The two core tables for North American data are comp.funda, which contains annual financial statement data, and comp.fundq, which contains quarterly data.

fundq <- tbl(pg, sql("SELECT * FROM comp.fundq"))
funda <- tbl(pg, sql("SELECT * FROM comp.funda"))

We can examine this object in R by typing it’s name in the R console:

funda
## # Source:   SQL [?? x 948]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##    gvkey datadate   fyear indfmt consol popsrc datafmt tic   cusip conm  acctchg
##    <chr> <date>     <int> <chr>  <chr>  <chr>  <chr>   <chr> <chr> <chr> <chr>  
##  1 0012… 2019-12-31  2019 INDL   C      D      STD     ALX   0147… ALEX… ASU16-…
##  2 0012… 2019-12-31  2019 INDL   C      D      SUMM_S… ALX   0147… ALEX… <NA>   
##  3 0012… 2020-12-31  2020 FS     C      D      STD     ALX   0147… ALEX… <NA>   
##  4 0012… 1983-05-31  1982 INDL   C      D      STD     CGYNQ 1391… CAPC… <NA>   
##  5 0017… 1959-06-30  1959 INDL   C      D      STD     WM1   9393… WASH… <NA>   
##  6 0012… 2020-12-31  2020 INDL   C      D      STD     ALX   0147… ALEX… <NA>   
##  7 0012… 2020-12-31  2020 INDL   C      D      SUMM_S… ALX   0147… ALEX… <NA>   
##  8 0012… 2021-12-31  2021 FS     C      D      STD     ALX   0147… ALEX… <NA>   
##  9 0012… 1984-05-31  1983 INDL   C      D      STD     CGYNQ 1391… CAPC… <NA>   
## 10 0012… 2021-12-31  2021 INDL   C      D      STD     ALX   0147… ALEX… <NA>   
## # … with more rows, and 937 more variables: acctstd <chr>, acqmeth <chr>,
## #   adrr <dbl>, ajex <dbl>, ajp <dbl>, bspr <chr>, compst <chr>, curcd <chr>,
## #   curncd <chr>, currtr <dbl>, curuscn <dbl>, final <chr>, fyr <int>,
## #   ismod <int>, ltcm <chr>, ogm <chr>, pddur <int>, scf <int>, src <int>,
## #   stalt <chr>, udpl <chr>, upd <int>, apdedate <date>, fdate <date>,
## #   pdate <date>, acchg <dbl>, acco <dbl>, accrt <dbl>, acdo <dbl>, aco <dbl>,
## #   acodo <dbl>, acominc <dbl>, acox <dbl>, acoxar <dbl>, acqao <dbl>, …

From this code we can see that comp.funda is a very wide table, with 948 columns. With that many columns, it’s a moderately large table (at the time of writing, 1212 MB in PostgreSQL), but by focusing on certain columns, the size can be dramatically reduced. Let’s learn more about the funda table.

First, how many rows does it have?

funda %>%
  count() %>% 
  pull()
## [1] 869854

From this snippet we can see that comp.funda has 869854 rows. As covered in Chapter 5 of R4DS, funda %>% count() is equivalent to count(funda). When given an ungrouped data frame as the first argument, the count function simply counts the number of rows in the data frame.56 In this case funda %>% count() returns a server-side data frame (tbl_df) with a single column (n). The command pull extracts that single column as a vector in R.57

While not evident from casual inspection of the table, the primary key of comp.funda is (gvkey, datadate, indfmt, consol, popsrc, datafmt). One requirement for a valid primary key is that each value of the primary key is associated with only one row of the data set.58 That this is a valid key can be seen from the following code, which counts the number of rows associated with each set of values for (gvkey, datadate, indfmt, consol, popsrc, datafmt) and stores that number in num_rows and then displays the various values of num_rows.

funda %>%
  group_by(gvkey, datadate, indfmt, consol, popsrc, datafmt) %>%
  summarize(num_rows = n(), .groups = "drop") %>%
  count(num_rows) 
## # Source:   SQL [1 x 2]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##   num_rows      n
##      <int>  <int>
## 1        1 869854

Because num_rows is equal to 1 in 869854 cases and this is the number of rows in the dataset, we have (gvkey, datadate, indfmt, consol, popsrc, datafmt) as a valid key.

Another requirement of a primary key is that none of the columns contains null values (in SQL, NULL; in R, NA). The code below checks this.59

funda %>%
  filter(is.na(gvkey), is.na(datadate), is.na(indfmt),
         is.na(consol), is.na(popsrc), is.na(datafmt)) %>%
  count() %>%
  pull()
## [1] 0

Teasing apart the primary key, we have a firm identifier (gvkey) and financial period identifier (datadate), along with four variables that are more technical in nature: indfmt, consol, popsrc, and datafmt.

funda %>%
  count(indfmt, consol, popsrc, datafmt) %>%
  arrange(desc(n)) %>%
  print(n = Inf)
## # Source:     SQL [12 x 5]
## # Database:   postgres  [iangow@/tmp:5432/iangow]
## # Ordered by: desc(n)
##    indfmt consol popsrc datafmt          n
##    <chr>  <chr>  <chr>  <chr>        <int>
##  1 INDL   C      D      STD         554619
##  2 INDL   C      D      SUMM_STD    267949
##  3 FS     C      D      STD          45183
##  4 INDL   P      D      STD           1167
##  5 INDL   R      D      STD            480
##  6 INDL   P      D      SUMM_STD       298
##  7 FS     R      D      STD             51
##  8 INDL   R      D      SUMM_STD        28
##  9 INDL   D      D      SUMM_STD        25
## 10 INDL   D      D      STD             25
## 11 INDL   C      D      PRE_AMENDS      19
## 12 INDL   C      D      PRE_AMENDSS     10

As discussed on the CRSP website, the first set of values, which covers 554619 observations, represents the standard (STD) secondary keyset, as is what is used by most researchers when using Compustat. We can create a version of funda that limits data to this STD secondary keyset as follows:

funda_mod <-
  funda %>%
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D")

For funda_mod, (gvkey, datadate) should represent a primary key. Let’s check this using code like that we used above:

funda_mod %>%
  group_by(gvkey, datadate) %>%
  summarize(num_rows = n(), .groups = "drop") %>%
  count(num_rows) 
## # Source:   SQL [1 x 2]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##   num_rows      n
##      <int>  <int>
## 1        1 554619

Now, let’s look at the table comp.company, which the following code confirms has 39 columns and for which gvkey is a primary key.

company
## # Source:   SQL [?? x 39]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##    conm    gvkey add1  add2  add3  add4  addzip busdesc cik   city  conml costat
##    <chr>   <chr> <chr> <chr> <chr> <chr> <chr>  <chr>   <chr> <chr> <chr> <chr> 
##  1 A & E … 0010… <NA>  <NA>  <NA>  <NA>  <NA>   <NA>    <NA>  <NA>  A & … I     
##  2 A & M … 0010… 1924… <NA>  <NA>  <NA>  94104  <NA>    0000… Tulsa A & … I     
##  3 AAI CO… 0010… 124 … <NA>  <NA>  <NA>  21030… Textro… 0001… Hunt… AAI … I     
##  4 A.A. I… 0010… 7700… <NA>  <NA>  <NA>  63125  A.A. I… 0000… St. … A.A.… I     
##  5 AAR CO… 0010… One … <NA>  <NA>  <NA>  60191  AAR Co… 0000… Wood… AAR … A     
##  6 A.B.A.… 0010… 1026… <NA>  <NA>  <NA>  33782  A.B.A.… <NA>  Pine… A.B.… I     
##  7 ABC IN… 0010… 301 … <NA>  <NA>  <NA>  46590  ABC In… <NA>  Wino… ABC … I     
##  8 ABKCO … 0010… 1700… <NA>  <NA>  <NA>  10019  ABKCO … 0000… New … ABKC… I     
##  9 ABM CO… 0010… 3 Wh… <NA>  <NA>  <NA>  92714  <NA>    <NA>  Irvi… ABM … I     
## 10 ABS IN… 0010… Inte… <NA>  <NA>  <NA>  44904  Makes … 0000… Will… ABS … I     
## # … with more rows, and 27 more variables: county <chr>, dlrsn <chr>,
## #   ein <chr>, fax <chr>, fic <chr>, fyrc <int>, ggroup <chr>, gind <chr>,
## #   gsector <chr>, gsubind <chr>, idbflag <chr>, incorp <chr>, loc <chr>,
## #   naics <chr>, phone <chr>, prican <chr>, prirow <chr>, priusa <chr>,
## #   sic <chr>, spcindcd <int>, spcseccd <int>, spcsrc <chr>, state <chr>,
## #   stko <int>, weburl <chr>, dldte <date>, ipodate <date>
company %>%
  group_by(gvkey) %>%
  summarize(num_rows = n(), .groups = "drop") %>%
  count(num_rows) 
## # Source:   SQL [1 x 2]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##   num_rows     n
##      <int> <int>
## 1        1 52448
company %>%
  filter(is.na(gvkey)) %>%
  count()
## # Source:   SQL [1 x 1]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##       n
##   <int>
## 1     0

7.2.1 Illustration: Microsoft versus Apple

Suppose that we were interested in comparing the profitability of Apple and Microsoft over time. To measure performance, we will calculate a measure of return on assets, here measured as the value of “Income Before Extraordinary Items” scaled by “Total Assets” for Microsoft (GVKEY: 012141) and Apple (GVKEY: 001690) over time.60

sample <-
  company %>%
  filter(gvkey %in% c("001690", "012141")) %>%
  select(conm, gvkey)

sample
## # Source:   SQL [2 x 2]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##   conm           gvkey 
##   <chr>          <chr> 
## 1 APPLE INC      001690
## 2 MICROSOFT CORP 012141

Here we can see that sample is a server-side data frame with the company name (conm) and identifier (gvkey) for Apple and Microsoft. This might be a useful juncture at which to explain what the server-side data frame represents. To this end, we can use the show_query function:

sample %>%
  show_query()
## <SQL>
## SELECT "conm", "gvkey"
## FROM (SELECT * FROM comp.company) "q01"
## WHERE ("gvkey" IN ('001690', '012141'))

What this shows is that sample is actually a shorthand for an SQL query that is equivalent to the following SQL.61

SELECT conm, gvkey
FROM comp.company
WHERE "gvkey" IN ('001690', '012141');

Here filter is translated into a WHERE condition, %in% is translated to IN, and c("001690", "012141") is translated to ('001690', '012141').

It turns out that “Income Before Extraordinary Items” is represented by ib and “Total Assets” is at (see here for details). We can further restrict funda beyond funda_mod so that it contains just the key variables (gvkey, datadate) and the values for ib and at as follows. (In the following code, we will gradually refine the data set assigned to ib_at.)

ib_at <-
  funda_mod %>%
  select(gvkey, datadate, ib, at) 

ib_at
## # Source:   SQL [?? x 4]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##    gvkey  datadate        ib       at
##    <chr>  <date>       <dbl>    <dbl>
##  1 001257 2019-12-31  60.1   1266.   
##  2 001258 1983-05-31  -0.365    1.57 
##  3 001738 1959-06-30   4.85   147.   
##  4 001257 2020-12-31  41.9   1404.   
##  5 001258 1984-05-31  -0.693    0.908
##  6 001257 2021-12-31 131.    1392.   
##  7 001257 2022-12-31  57.6   1398.   
##  8 001258 1985-05-31  -0.345    0.589
##  9 001258 1986-05-31  -0.294    0.291
## 10 001258 1987-05-31  -0.086    0.171
## # … with more rows

and then restrict this by joining it with sample as follows:

ib_at <-
  funda_mod %>%
  select(gvkey, datadate, ib, at) %>%
  inner_join(sample)
## Joining with `by = join_by(gvkey)`
ib_at
## # Source:   SQL [?? x 5]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##    gvkey  datadate      ib     at conm     
##    <chr>  <date>     <dbl>  <dbl> <chr>    
##  1 001690 1981-09-30  39.4  255.  APPLE INC
##  2 001690 1982-09-30  61.3  358.  APPLE INC
##  3 001690 1983-09-30  76.7  557.  APPLE INC
##  4 001690 1980-09-30  11.7   65.4 APPLE INC
##  5 001690 1984-09-30  64.1  789.  APPLE INC
##  6 001690 1985-09-30  61.2  936.  APPLE INC
##  7 001690 1986-09-30 154.  1160.  APPLE INC
##  8 001690 1987-09-30 217.  1478.  APPLE INC
##  9 001690 1988-09-30 400.  2082.  APPLE INC
## 10 001690 1989-09-30 454.  2744.  APPLE INC
## # … with more rows

In the code above, we used a natural join, which joins by variables found in both data sets (in this case, gvkey). In some cases, we want to be explicit about the join variables, which we can do using the by argument to the join function:

ib_at <-
  funda_mod %>%
  select(gvkey, datadate, ib, at) %>%
  inner_join(sample, by = "gvkey")

ib_at
## # Source:   SQL [?? x 5]
## # Database: postgres  [iangow@/tmp:5432/iangow]
##    gvkey  datadate      ib     at conm     
##    <chr>  <date>     <dbl>  <dbl> <chr>    
##  1 001690 1981-09-30  39.4  255.  APPLE INC
##  2 001690 1982-09-30  61.3  358.  APPLE INC
##  3 001690 1983-09-30  76.7  557.  APPLE INC
##  4 001690 1980-09-30  11.7   65.4 APPLE INC
##  5 001690 1984-09-30  64.1  789.  APPLE INC
##  6 001690 1985-09-30  61.2  936.  APPLE INC
##  7 001690 1986-09-30 154.  1160.  APPLE INC
##  8 001690 1987-09-30 217.  1478.  APPLE INC
##  9 001690 1988-09-30 400.  2082.  APPLE INC
## 10 001690 1989-09-30 454.  2744.  APPLE INC
## # … with more rows

We next calculate a value for return on assets (using income before extraordinary items and the ending balance of total assets for simplicity).

ib_at <-
  funda_mod %>%
  select(gvkey, datadate, ib, at) %>%
  inner_join(sample, by = "gvkey") %>%
  mutate(roa = ib/at)

The final step is to bring these data into R. At this stage, ib_at is still a server-side data frame (tbl_sql), which we can see using the class function.62

"tbl_sql" %in% class(ib_at)
## [1] TRUE

To bring the data into R, we can use the collect function, as follows:

ib_at <-
  funda_mod %>%
  select(gvkey, datadate, ib, at) %>%
  inner_join(sample, by = "gvkey") %>%
  mutate(roa = ib/at) %>%
  collect()

Now we have a local data frame that is no longer an instance of the tbl_sql class:

class(ib_at)
## [1] "tbl_df"     "tbl"        "data.frame"

Note that, because we have used select to get to just five fields and filter to focus on two firms, the tibble ib_at is just 5.8 kB in size. Note the placement of collect at the end of this pipeline means that the amount of data we need to retrieve from the database and load into R is quite small. If we had placed collect() immediately after funda_mod, we would have been retrieving data on thousands of observations and 948 variables. If we had placed collect() immediately after the select statement, we would have been retrieving data on thousands of observations, but just 4 variables. But by placing collect() after the inner_join, we are only retrieving data for firm-years related to Microsoft and Apple. Retrieving data into our local R instance requires both reading it off disk and, if our PostgreSQL server is remote, transferring it over the internet. Each of these operations is slow and thus should be avoided where possible.

Having collected our data, we can make a quick plot.63

ib_at %>%
  ggplot(aes(x = datadate, y = roa, color = conm)) +
  geom_line()

7.3 Exercises

  1. Suppose we didn’t have access to Compustat (or an equivalent database) for the analysis above, describe a process you would use to get the data required to make the plot above comparing performance of Microsoft and Apple.

  2. In the following code, how do funda_mod and funda_mod_alt differ? (For example, where are the data for each table?) What does the statement collect(n = 100) at the end of this code do?

funda <- tbl(pg, sql("SELECT * FROM comp.funda"))

funda_mod <-
  funda %>%
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") %>%
  filter(fyear >= 1980)

funda_mod_alt <-
  funda_mod %>%
  collect(n = 100)
  1. The table comp.company has data on SIC (Standard Industrial Classification) codes in the field sic. In words, what is the case_when function doing in the following code? Why do we end up with just two rows?
company <- tbl(pg, sql("SELECT * FROM comp.company"))

sample <-
  company %>%
  select(gvkey, sic) %>%
  mutate(co_name = case_when(gvkey == "001690" ~ "Apple",
                             gvkey == "012141" ~ "Microsoft",
         TRUE ~ NA_character_)) %>%
  filter(!is.na(co_name))
  1. What does the data frame another_sample represent? What happens if we change the inner_join statement below to simply inner_join(sample). What happens if we change it to inner_join(sample, by = "sic") (i.e., omit the suffix = c("", "_other") portion)? Why do you think we want filter(gvkey != gvkey_other)?
another_sample <-
  company %>%
  select(gvkey, sic) %>%
  inner_join(sample, by = "sic", suffix = c("", "_other")) %>%
  filter(gvkey != gvkey_other) %>%
  mutate(group = paste(co_name, "peer")) %>%
  select(gvkey, group)
  1. What is the following code doing?
total_sample <-
  sample %>%
  rename(group = co_name) %>%
  select(gvkey, group) %>%
  union_all(another_sample)
  1. Suppose that we are interested in how firms’ level of R&D activity. One measure of R&D activity is R&D Intensity, which can be defined as “R&D expenses” (Compustat item xrd) scaled by “Total Assets” (Compustat item at). In xrd_at, what’s the difference between rd_intensity and rd_intensity_alt? Does filter(at > 0) seem like a reasonable filter? What happens if we omit it?
xrd_at <-
  funda_mod %>%
  select(gvkey, datadate, fyear, conm, xrd, at) %>%
  filter(at > 0) %>%
  mutate(rd_intensity = xrd/at,
         xrd_alt = coalesce(xrd, 0),
         rd_intensity_alt = xrd_alt/at) %>%
  inner_join(total_sample, by = "gvkey") 
  1. Looking at a sample of rows from xrd_at_sum, it appears that the three R&D intensity measures are always identical for Apple and Microsoft, but generally different for their peer groups. What explains these differences? Can you say that one measure is “correct”? Or would you say “it depends”?
xrd_at_sum <-
  xrd_at %>%
  group_by(group, fyear) %>%
  summarize(total_at = sum(at, na.rm = TRUE),
            total_rd = sum(xrd, na.rm = TRUE),
            rd_intensity1 = mean(xrd/at, na.rm = TRUE),
            rd_intensity2 = mean(xrd_alt/at, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(rd_intensity3 = if_else(total_at > 0, total_rd/total_at, NA_real_)) 
  
xrd_at_sum %>%
  select(-total_at, -total_rd) %>%
  arrange(desc(fyear), group) %>%
  collect(n = 8)
## # A tibble: 8 × 5
##   group          fyear rd_intensity1 rd_intensity2 rd_intensity3
##   <chr>          <int>         <dbl>         <dbl>         <dbl>
## 1 Apple           2022        0.0744        0.0744        0.0744
## 2 Apple peer      2022        0.138         0.138         0.111 
## 3 Microsoft       2022        0.0672        0.0672        0.0672
## 4 Microsoft peer  2022        0.135         0.116         0.0835
## 5 Apple           2021        0.0624        0.0624        0.0624
## 6 Apple peer      2021        0.127         0.122         0.0947
## 7 Microsoft       2021        0.0621        0.0621        0.0621
## 8 Microsoft peer  2021        0.235         0.194         0.0709
  1. Write code to produce the following plot. Also produce plots that use rd_intensity1 and rd_intensity2 as measures of R&D intensity. Do the plots help you think about which of the three measures makes most sense?