6  Financial statements: A first look

In this chapter, we provide an introduction to data on financial statements, which are the primary focus of financial reporting. While a lot of contemporary accounting research has only a tangential connection to financial reporting, it’s not clear there would be a field of accounting research without financial reporting and most accounting research paper use financial reporting data to some degree (e.g., as controls). In this chapter, we will focus on annual financial statement data of North American firms, as provided by Compustat.

Tip

In this chapter, we will use four R libraries. We have seen dplyr, ggplot2, and farr in earlier chapters, but the DBI library is new. The DBI library provides a standardized interface to relational database systems, including PostgreSQL (the system used by WRDS), MySQL, and SQLite. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available on GitHub.

6.1 Setting up WRDS

Academic researchers generally get Compustat data through Wharton Research Data Services, more commonly referred to as WRDS (pronounced “words”). In the following, we assume that you have run the code above, that you have a WRDS account, and that you are connected to the internet.

Now let’s connect to the WRDS database. To actually use the code below, you should replace your_WRDS_ID with your actual WRDS ID. You may also need to add a line Sys.setenv(PGPASSWORD = "your_password") as the third line of code.1

Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
           PGPORT = 9737L,
           PGUSER = "your_WRDS_ID", 
           PGDATABASE = "wrds")

db <- dbConnect(RPostgres::Postgres(), bigint = "integer")

Academic researchers generally get Compustat data through Wharton Research Data Services, more commonly referred to as WRDS (pronounced “words”). In the following, we assume that you have created a local parquet library as described in Section E.6. To actually use the code below, you should replace the value set for DATA_DIR with the location of the data repository on your computer.

Sys.setenv(DATA_DIR = "~/Dropbox/pq_data")

db <- dbConnect(duckdb::duckdb())

The first line of code above sets up environment variables containing information that R can use as defaults when connecting to a PostgreSQL database. The second line of code creates the actual connection to the database and assigns it to the variable db. We set bigint = "integer" here because some packages cannot handle 64-bit integers of PostgreSQL’s bigint type, so we ask these to be mapped to the 32-bit integers used in R.2 We can also set check_interrupts = TRUE if we are running queries that take a long time so that we can interrupt them if we think something has gone wrong. Most queries in this book execute quickly, so this is generally not necessary.3

For users of the PostgreSQL database, an alternative to the code above would pass connection information to the dbConnect() function directly, as shown in the code below.4

db <- dbConnect(RPostgres::Postgres(), 
                host = "wrds-pgdata.wharton.upenn.edu", 
                port = 9737L, 
                # user = "your_WRDS_ID",
                # password = "a_really_good_password",
                dbname ="wrds",
                bigint = "integer")

However, we recommend using the environment variable-based approach above. We also recommend using the approach based on a .pgpass file discussed in detail in the WRDS instructions.5 Putting the connection details outside your code (e.g., in environment variables) makes it possible to write code that works both for you and for others, including your co-authors.

Now that we have established a connection to the WRDS data, we can get some data. Data sources on WRDS are organized into libraries according to their source (in PostgreSQL, these libraries are called schemas, but we will stick with the WRDS terminology for now).

In this chapter we focus on Compustat data, which is found in the comp library. While there are other sources of financial statement data—especially for firms outside North America—in accounting research the pre-eminent data source is Compustat, which is typically obtained via WRDS.

We first look at the table, comp.company, which contains information about the companies in Compustat’s North American database (the equivalent table in Compustat’s “Global” database is comp.g_company). By running the following code, we create R objects that point to the data on the WRDS database and behave a lot like the data frames you have seen in earlier chapters.

company <- tbl(db, Id(schema = "comp", table = "company"))
company <- load_parquet(db, schema = "comp", table = "company")

What have we done here? In the first line, we have created an object company that “points to” the table company in schema comp in the data source pg (which represents the connection to the database). From the help for the tbl() function, we learn that in tbl(src, ...), src refers to the data source. Here pg is an object representing the connection to the WRDS PostgreSQL database. The ... allows us to pass other arguments to the function, and we pass Id(schema = "comp", table = "company") to indicate the underlying database object we are looking for.

But what exactly is company? One way to get information about company is to use the class() function.6

class(company)
[1] "tbl_PqConnection" "tbl_dbi"          "tbl_sql"          "tbl_lazy"        
[5] "tbl"             

The output here is a little complicated for a new user, but the basic idea is that company is a tbl_PqConnection object, and also a tbl_sql object and a tbl object, which means it a tibble of some kind. The critical idea here is that funda is a tbl_sql object, which means it behaves like a tibble, but is actually effectively a reference to a table or query in a relational database. For want of a better term, company is a remote data frame.

While there are some technical details here that are unimportant for our discussion, the important points are that, as a tbl_sql, funda has three important properties:

  1. It’s not a data frame in memory in R. This means that we don’t need to download all of the data and load it into R.
  2. It’s not a data frame in memory in the database (in this case, the remote WRDS PostgreSQL server). This means that we don’t load the full data set into memory even onto the remote computer.
  3. Notwithstanding the previous two points, company behaves in many important ways just like a data frame.

We can use filter() to get only rows meeting certain conditions and we can use select() to indicate which columns we want to retrieve:

company |> 
  filter(gsubind == "45202030", 
         str_sub(conm, 1, 1) == "A") |>
  select(gvkey, conm, state) |>
  head(5)
# Source:   SQL [4 x 3]
# Database: postgres  [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
  gvkey  conm                state
  <chr>  <chr>               <chr>
1 001690 APPLE INC           CA   
2 001820 ASTRONOVA INC       RI   
3 027925 AVID TECHNOLOGY INC MA   
4 324055 ADVANCETC LTD       <NA> 

When we are dealing with a remote data frame, the dplyr code that we write is automatically translated into SQL, a specialized language for manipulating tabular data “used by pretty much every database in existence” (Appendix B has more on SQL and dplyr).

We can inspect the translated SQL using show_query():

company |> 
  filter(gsubind == "45202030", 
         str_sub(conm, 1, 1) == "A") |>
  select(gvkey, conm, state) |>
  head(5) |>
  show_query()
<SQL>
SELECT "gvkey", "conm", "state"
FROM "comp"."company"
WHERE ("gsubind" = '45202030') AND (SUBSTR("conm", 1, 1) = 'A')
LIMIT 5

Here we see that filter() is translated into equivalent SQL using WHERE.

While remote data frames behave a lot like local data frames, they are not quite the same thing and, for many purposes, we need to use local data frames. For example, the lm() function for fitting regressions assumes that the object provided as its data argument is a data frame and it will not work if we give it a remote data frame instead. Conversion of a table from a remote data frame to a local one is achieved using the collect() verb. Here we use collect() and store the result in df_company.

df_company <-
  company |> 
  filter(gsubind == "45202030", 
         str_sub(conm, 1, 1) == "A") |>
  select(gvkey, conm, state) |>
  collect()

Looking at class(df_company), we see that it is not only a “tibble” (tbl_df), but also data.frame, meaning that we can pass it to functions (such as lm()) that expect data.frame objects.

class(df_company)
[1] "tbl_df"     "tbl"        "data.frame"
df_company
# A tibble: 4 × 3
  gvkey  conm                state
  <chr>  <chr>               <chr>
1 001690 APPLE INC           CA   
2 001820 ASTRONOVA INC       RI   
3 027925 AVID TECHNOLOGY INC MA   
4 324055 ADVANCETC LTD       <NA> 

6.2 Financial statement data

Two core tables for the North American Compustat data are comp.funda, which contains annual financial statement data, and comp.fundq, which contains quarterly data.

fundq <- tbl(db, Id(schema = "comp", table = "fundq"))
funda <- tbl(db, Id(schema = "comp", table = "funda"))
fundq <- load_parquet(db, schema = "comp", table = "fundq")
funda <- load_parquet(db, schema = "comp", table = "funda")

We can examine the first object in R by typing its name in the R console:

funda
# Source:   table<"comp"."funda"> [?? x 948]
# Database: postgres  [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
   gvkey  datadate   fyear indfmt consol popsrc datafmt tic   cusip   conm 
   <chr>  <date>     <int> <chr>  <chr>  <chr>  <chr>   <chr> <chr>   <chr>
 1 001000 1961-12-31  1961 INDL   C      D      STD     AE.2  000032… A & …
 2 001000 1962-12-31  1962 INDL   C      D      STD     AE.2  000032… A & …
 3 001000 1963-12-31  1963 INDL   C      D      STD     AE.2  000032… A & …
 4 001000 1964-12-31  1964 INDL   C      D      STD     AE.2  000032… A & …
 5 001000 1965-12-31  1965 INDL   C      D      STD     AE.2  000032… A & …
 6 001000 1966-12-31  1966 INDL   C      D      STD     AE.2  000032… A & …
 7 001000 1967-12-31  1967 INDL   C      D      STD     AE.2  000032… A & …
 8 001000 1968-12-31  1968 INDL   C      D      STD     AE.2  000032… A & …
 9 001000 1969-12-31  1969 INDL   C      D      STD     AE.2  000032… A & …
10 001000 1970-12-31  1970 INDL   C      D      STD     AE.2  000032… A & …
# ℹ more rows
# ℹ 938 more variables: acctchg <chr>, acctstd <chr>, acqmeth <chr>,
#   adrr <dbl>, ajex <dbl>, ajp <dbl>, bspr <chr>, compst <chr>,
#   curcd <chr>, curncd <chr>, currtr <dbl>, curuscn <dbl>, final <chr>,
#   fyr <int>, ismod <int>, ltcm <chr>, ogm <chr>, pddur <int>, scf <int>,
#   src <int>, stalt <chr>, udpl <chr>, upd <int>, apdedate <date>,
#   fdate <date>, pdate <date>, acchg <dbl>, acco <dbl>, accrt <dbl>, …

From this code, we can see that comp.funda is a very wide table, with 948 columns. With so many columns, it’s a moderately large table (at the time of writing, 1212 MB in PostgreSQL), but by focusing on certain columns, the size of the data can be dramatically reduced. Let’s learn more about this table.

First, how many rows does it have?

funda |>
  count() |> 
  pull()
[1] 901636

From this snippet, we can see that comp.funda has 901,636 rows. When given an ungrouped data frame as the first argument, the count() function simply counts the number of rows in the data frame.7 In this case, funda |> count() returns a server-side data frame with a single column (n). The command pull() extracts that single column as a vector in R.8

While not evident from casual inspection of the table, the primary key of comp.funda is (gvkey, datadate, indfmt, consol, popsrc, datafmt). One requirement for a valid primary key is that each value of the primary key is associated with only one row of the data set. (Chapter 19 of R for Data Science discusses this point.) That this requirement is met can be seen with output from the following code, which counts the number of rows associated with each set of values for (gvkey, datadate, indfmt, consol, popsrc, datafmt) and stores that number in num_rows and then displays the various values of num_rows.

funda |>
  count(gvkey, datadate, indfmt, consol, popsrc, datafmt,
        name = "num_rows") |> 
  collect()
# A tibble: 901,636 × 7
   gvkey  datadate   indfmt consol popsrc datafmt  num_rows
   <chr>  <date>     <chr>  <chr>  <chr>  <chr>       <int>
 1 028485 2000-03-31 INDL   C      D      STD             1
 2 004607 2011-06-30 INDL   C      D      SUMM_STD        1
 3 028420 2021-12-31 INDL   C      D      STD             1
 4 013148 1990-12-31 INDL   C      D      STD             1
 5 030501 2019-12-31 INDL   C      D      STD             1
 6 113491 2002-12-31 INDL   C      D      STD             1
 7 016721 2008-03-31 INDL   C      D      SUMM_STD        1
 8 031392 2014-09-30 INDL   C      D      STD             1
 9 026748 2009-09-30 INDL   C      D      SUMM_STD        1
10 043667 2023-12-31 INDL   C      D      STD             1
# ℹ 901,626 more rows

Because num_rows is equal to 1 in 901,636 cases and this is the number of rows in the data set, we have (gvkey, datadate, indfmt, consol, popsrc, datafmt) as a valid key.

The second requirement of a primary key is that none of the columns contains null values (in SQL, NULL; in R, NA). The code below checks this.

funda |>
  filter(is.na(gvkey) | is.na(datadate) | is.na(indfmt) |
         is.na(consol) | is.na(popsrc) | is.na(datafmt)) |>
  count() |>
  pull()
[1] 0

Teasing apart the primary key, we have a firm identifier (gvkey) and financial period identifier (datadate), along with four variables that are more technical in nature: indfmt, consol, popsrc, and datafmt. Table 6.1 provides the number of observations for each set of values of these four variables.

funda |>
  count(indfmt, consol, popsrc, datafmt) |>
  arrange(desc(n)) |>
  collect()
Table 6.1: Observations by (indfmt, consol, popsrc, datafmt) values
indfmt consol popsrc datafmt n
INDL C D STD 573,430
INDL C D SUMM_STD 279,283
FS C D STD 46,814
INDL P D STD 1,171
INDL R D STD 480
INDL P D SUMM_STD 300
FS R D STD 51
INDL R D SUMM_STD 28
INDL D D SUMM_STD 25
INDL D D STD 25
INDL C D PRE_AMENDS 19
INDL C D PRE_AMENDSS 10

As discussed on the CRSP website, the first set of values, which covers 573,430 observations, represents the standard (STD) secondary keyset, as it is what is used by most researchers when using Compustat. We can create a version of funda that limits data to this STD secondary keyset as follows:

funda_mod <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D")

For funda_mod, (gvkey, datadate) should represent a primary key. We can check the first requirement using code like that we used above (we already confirmed the second requirement for the larger data set).

funda_mod |>
  group_by(gvkey, datadate) |>
  summarize(num_rows = n(), .groups = "drop") |>
  count(num_rows) |>
  collect()
# A tibble: 1 × 2
  num_rows      n
     <int>  <int>
1        1 573430

Now, let’s look at the table comp.company, which the following code confirms has 39 columns and for which gvkey is a primary key.

company
# Source:   table<"comp"."company"> [?? x 39]
# Database: postgres  [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
   conm      gvkey add1  add2  add3  add4  addzip busdesc cik   city  conml
   <chr>     <chr> <chr> <chr> <chr> <chr> <chr>  <chr>   <chr> <chr> <chr>
 1 A & E PL… 0010… <NA>  <NA>  <NA>  <NA>  <NA>   A & E … <NA>  <NA>  A & …
 2 A & M FO… 0010… 1924… <NA>  <NA>  <NA>  94104  <NA>    0000… Tulsa A & …
 3 AAI CORP  0010… 124 … <NA>  <NA>  <NA>  21030… Textro… 0001… Hunt… AAI …
 4 A.A. IMP… 0010… 7700… <NA>  <NA>  <NA>  63125  A.A. I… 0000… St. … A.A.…
 5 AAR CORP  0010… One … <NA>  <NA>  <NA>  60191  AAR Co… 0000… Wood… AAR …
 6 A.B.A. I… 0010… 1026… <NA>  <NA>  <NA>  33782  A.B.A.… <NA>  Pine… A.B.…
 7 ABC INDS… 0010… 301 … <NA>  <NA>  <NA>  46590  ABC In… <NA>  Wino… ABC …
 8 ABKCO IN… 0010… 1700… <NA>  <NA>  <NA>  10019  ABKCO … 0000… New … ABKC…
 9 ABM COMP… 0010… 3 Wh… <NA>  <NA>  <NA>  92714  ABM Co… <NA>  Irvi… ABM …
10 ABS INDU… 0010… Inte… <NA>  <NA>  <NA>  44904  ABS In… 0000… Will… ABS …
# ℹ more rows
# ℹ 28 more variables: costat <chr>, county <chr>, dlrsn <chr>, ein <chr>,
#   fax <chr>, fic <chr>, fyrc <int>, ggroup <chr>, gind <chr>,
#   gsector <chr>, gsubind <chr>, idbflag <chr>, incorp <chr>, loc <chr>,
#   naics <chr>, phone <chr>, prican <chr>, prirow <chr>, priusa <chr>,
#   sic <chr>, spcindcd <int>, spcseccd <int>, spcsrc <chr>, state <chr>,
#   stko <int>, weburl <chr>, dldte <date>, ipodate <date>

First, each gvkey value is associated with just one row.

company |>
  group_by(gvkey) |>
  summarize(num_rows = n(), .groups = "drop") |>
  count(num_rows) |>
  collect()
# A tibble: 1 × 2
  num_rows     n
     <int> <int>
1        1 54397

Second, there are no missing values of gvkey.

company |>
  filter(is.na(gvkey)) |>
  count() |>
  collect()
# A tibble: 1 × 1
      n
  <int>
1     0

6.2.1 Illustration: Microsoft versus Apple

Suppose that we were interested in comparing the profitability of Apple and Microsoft over time. To measure performance, we will calculate a measure of return on assets, here measured as the value of “Income Before Extraordinary Items” scaled by “Total Assets” for Microsoft (GVKEY: 012141) and Apple (GVKEY: 001690) over time.9

sample <-
  company |>
  filter(gvkey %in% c("001690", "012141")) |>
  select(conm, gvkey)

sample
# Source:   SQL [2 x 2]
# Database: postgres  [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
  conm           gvkey 
  <chr>          <chr> 
1 APPLE INC      001690
2 MICROSOFT CORP 012141

Here we can see that sample is a server-side data frame with the company name (conm) and identifier (gvkey) for Apple and Microsoft. We can use show_query() to see what the server-side data frame represents.

sample |>
  show_query()
<SQL>
SELECT "conm", "gvkey"
FROM "comp"."company"
WHERE ("gvkey" IN ('001690', '012141'))

Here filter() is translated into a WHERE condition, %in% is translated to IN, and c("001690", "012141") is translated to ('001690', '012141').

It turns out that “Income Before Extraordinary Items” is represented by ib and “Total Assets” is at (see the CRSP manual for details). We can further restrict funda beyond funda_mod so that it contains just the key variables (gvkey, datadate) and the values for ib and at as follows. (In the following code, we will gradually refine the data set assigned to ib_at.)

ib_at <-
  funda_mod |>
  select(gvkey, datadate, ib, at) 

ib_at |> head(5)
# Source:   SQL [5 x 4]
# Database: postgres  [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
  gvkey  datadate       ib    at
  <chr>  <date>      <dbl> <dbl>
1 001000 1961-12-31  0.05  NA   
2 001000 1962-12-31  0.12  NA   
3 001000 1963-12-31  0.003 NA   
4 001000 1964-12-31  0.039  1.42
5 001000 1965-12-31 -0.197  2.31

and then restrict this by joining it with sample as follows:

ib_at <-
  funda_mod |>
  select(gvkey, datadate, ib, at) |>
  inner_join(sample)
Joining with `by = join_by(gvkey)`
ib_at |> head(5)
# Source:   SQL [5 x 5]
# Database: postgres  [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
  gvkey  datadate      ib    at conm     
  <chr>  <date>     <dbl> <dbl> <chr>    
1 001690 1980-09-30  11.7  65.4 APPLE INC
2 001690 1981-09-30  39.4 255.  APPLE INC
3 001690 1982-09-30  61.3 358.  APPLE INC
4 001690 1983-09-30  76.7 557.  APPLE INC
5 001690 1984-09-30  64.1 789.  APPLE INC

In the code above, we used a natural join, which joins by variables found in both data sets (in this case, gvkey). In general, we want to be explicit about the join variables, which we can do using the by argument to the join function:

ib_at <-
  funda_mod |>
  select(gvkey, datadate, ib, at) |>
  inner_join(sample, by = "gvkey")

ib_at |> head(5)
# Source:   SQL [5 x 5]
# Database: postgres  [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
  gvkey  datadate      ib    at conm     
  <chr>  <date>     <dbl> <dbl> <chr>    
1 001690 1980-09-30  11.7  65.4 APPLE INC
2 001690 1981-09-30  39.4 255.  APPLE INC
3 001690 1982-09-30  61.3 358.  APPLE INC
4 001690 1983-09-30  76.7 557.  APPLE INC
5 001690 1984-09-30  64.1 789.  APPLE INC

We next calculate a value for return on assets (using income before extraordinary items and the ending balance of total assets for simplicity).

ib_at <-
  funda_mod |>
  select(gvkey, datadate, ib, at) |>
  inner_join(sample, by = "gvkey") |>
  mutate(roa = ib / at)

The final step is to bring these data into R. At this stage, ib_at is still a server-side data frame (tbl_sql), which we can see using the class function.10

"tbl_sql" %in% class(ib_at)
[1] TRUE

To bring the data into R, we can use the collect() function, as follows:

ib_at <-
  funda_mod |>
  select(gvkey, datadate, ib, at) |>
  inner_join(sample, by = "gvkey") |>
  mutate(roa = ib / at) |>
  collect()

Now we have a local data frame that is no longer an instance of the tbl_sql class:

class(ib_at)
[1] "tbl_df"     "tbl"        "data.frame"

Note that because we have used select() to get to just five fields and filter() to focus on two firms, the tibble ib_at is just 5.9 kB in size. Note the placement of collect() at the end of this pipeline means that the amount of data we need to retrieve from the database and load into R is quite small. If we had placed collect() immediately after funda_mod, we would have been retrieving data on thousands of observations and 948 variables. If we had placed collect() immediately after the select() statement, we would have been retrieving data on thousands of observations, but just 4 variables. But by placing collect() after the inner_join(), we are only retrieving data for firm-years related to Microsoft and Apple. Retrieving data into our local R instance requires both reading it off disk and, if our database server is remote, transferring it over the internet. Each of these operations is (relatively) slow and thus should be avoided where possible.

Having collected our data, we can make Figure 6.1.11

ib_at |>
  ggplot(aes(x = datadate, y = roa, 
             linetype = conm, color = conm)) +
  geom_line() +
  theme(legend.position = "bottom")
Plot of return on assets (ROA) for Microsoft and Apple from early 1980s to early 2020s. Apple's ROA is lower than Microsoft's before 2011, but higher since. Plot is included to build skills in making plots, not for its content.
Figure 6.1: ROA over time for Microsoft and Apple

6.3 Exercises

  1. Suppose we didn’t have access to Compustat (or an equivalent database) for the analysis above, describe a process you would use to get the data required to make the plot above comparing performance of Microsoft and Apple.

  2. In the following code, how do funda_mod and funda_mod_alt differ? (For example, where are the data for each table?) What does the statement collect(n = 100) at the end of this code do?

funda <- tbl(db, Id(schema = "comp", table = "funda"))

funda_mod <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") |>
  filter(fyear >= 1980)

funda_mod_alt <-
  funda_mod |>
  collect(n = 100)
funda <- load_parquet(db, schema = "comp", table = "funda")

funda_mod <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") |>
  filter(fyear >= 1980)

funda_mod_alt <-
  funda_mod |>
  collect(n = 100)
  1. The table comp.company has data on SIC (Standard Industrial Classification) codes in the field sic. In words, what is the case_when() function doing in the following code? Why do we end up with just two rows?
company <- tbl(db, Id(schema = "comp", table = "company"))

sample <-
  company |>
  select(gvkey, sic) |>
  mutate(co_name = case_when(gvkey == "001690" ~ "Apple",
                             gvkey == "012141" ~ "Microsoft")) |>
  filter(!is.na(co_name))
company <- load_parquet(db, schema = "comp", table = "company")

sample <-
  company |>
  select(gvkey, sic) |>
  mutate(co_name = case_when(gvkey == "001690" ~ "Apple",
                             gvkey == "012141" ~ "Microsoft")) |>
  filter(!is.na(co_name))
  1. What does the data frame another_sample represent? What happens if we change the inner_join() below to simply inner_join(sample)? What happens if we change it to inner_join(sample, by = "sic") (i.e., omit the suffix = c("", "_other") portion)? Why do we use filter(gvkey != gvkey_other)?
another_sample <-
  company |>
  select(gvkey, sic) |>
  inner_join(sample, by = "sic", suffix = c("", "_other")) |>
  filter(gvkey != gvkey_other) |>
  mutate(group = str_c(co_name, " peer")) |>
  select(gvkey, group)
  1. What is the following code doing?
total_sample <-
  sample |>
  rename(group = co_name) |>
  select(gvkey, group) |>
  union_all(another_sample)
  1. Suppose that we are interested in firms’ R&D activity. One measure of R&D activity is R&D Intensity, which can be defined as “R&D expenses” (Compustat item xrd) scaled by “Total Assets” (Compustat item at). In xrd_at, what’s the difference between rd_intensity and rd_intensity_alt? Does filter(at > 0) seem like a reasonable filter? What happens if we omit it?
xrd_at <-
  funda_mod |>
  select(gvkey, datadate, fyear, conm, xrd, at) |>
  filter(at > 0) |>
  mutate(rd_intensity = xrd / at,
         xrd_alt = coalesce(xrd, 0),
         rd_intensity_alt = xrd_alt / at) |>
  inner_join(total_sample, by = "gvkey") 
  1. Looking at a sample of rows from xrd_at_sum, it appears that the three R&D intensity measures are always identical for Apple and Microsoft, but generally different for their peer groups. What explains these differences? Can you say that one measure is “correct”? Or would you say “it depends”?
xrd_at_sum <-
  xrd_at |>
  group_by(group, fyear) |>
  summarize(rd_intensity1 = mean(xrd / at),
            rd_intensity2 = mean(xrd_alt / at),
            rd_intensity3 = if_else(sum(at) > 0, sum(xrd) / sum(at), NA),
            .groups = "drop")
  
xrd_at_sum |>
  arrange(desc(fyear), group) |>
  collect(n = 8)
# A tibble: 8 × 5
  group          fyear rd_intensity1 rd_intensity2 rd_intensity3
  <chr>          <int>         <dbl>         <dbl>         <dbl>
1 Apple peer      2024        0.105         0.105         0.113 
2 Microsoft       2024        0.0576        0.0576        0.0576
3 Microsoft peer  2024        0.256         0.228         0.0672
4 Apple           2023        0.0848        0.0848        0.0848
5 Apple peer      2023        0.151         0.147         0.102 
6 Microsoft       2023        0.0660        0.0660        0.0660
7 Microsoft peer  2023        0.224         0.189         0.0796
8 Apple           2022        0.0744        0.0744        0.0744
  1. Write code to produce Figure 6.2. Also produce plots that use rd_intensity1 and rd_intensity2 as measures of R&D intensity. Do the plots help you think about which of the three measures makes most sense?
Plot of research and development (R&D) intensity for Microsoft and Apple and their respective peers from early 1980s to early 2020s. Apple's R&D spending fluctuates more than the other three groups' and is lower than theirs for most years after 2000. Plot is included to build skills in making plots, not for its content.
Figure 6.2: R&D intensity over time for Microsoft and Apple and their peers

  1. We put our passwords in a special password file, as described here, so we don’t need this step. It’s obviously not a good idea to put your password in code.↩︎

  2. The data we use in this book will not exceed the limits of 32-bit integers, so this involves no loss. See the help for RPostgres for more information. You can access this help by typing ? RPostgres::Postgres in the R console.↩︎

  3. We flag longer running queries when they arise.↩︎

  4. Remove the # from the lines beginning with # and replace the password and user name if you try this.↩︎

  5. See https://go.unimelb.edu.au/kv68. Note that we do not recommend setting up a connection to WRDS in your .Rprofile as suggested by WRDS, as this yields code that is less transparent and usable by others. Additionally, it means setting up a connection even when you may not need one.↩︎

  6. If you are using the parquet file–based approach, your output will differ from what is shown here.↩︎

  7. Many dplyr functions take data frames as (first) arguments and return data frames as values.↩︎

  8. As that vector has a single value, it is displayed without any information about the database connection that we used to get the information. This looks cleaner in this setting, as you can see yourself if you run the command yourself without pull() at the end.↩︎

  9. Financial analysts might object to using an end-of-period value as the denominator; we do so here because it simplifies the coding and our goal here is just illustrative.↩︎

  10. An alternative to this line would be inherits(ib_at, "tbl_sql"). More on classes can be found in “Advanced R” at https://adv-r.hadley.nz.↩︎

  11. Note that even if we hadn’t called collect(), recent versions of ggplot2 are aware of remote data frames and will take care of this for us. But other functions (e.g., lm(), which we saw in Chapter 4) are not aware of remote data frames and thus require us to use collect() before passing data to them.↩︎