9 Linking databases
Almost any research project in accounting will involve merging data from multiple sources.
For example, to evaluate the market reaction to earnings announcements, we might start with data on comp.fundq
from Compustat (gvkey
and datadate
to identify firm-quarters plus associated announcement dates in rdq
) and then look to merge with daily stock return data from the Center for Research in Security Prices (CRSP, pronounced as “crisp”) on crsp.dsf
, which uses permno
and date
to identify firms and their daily returns, ret
.
But this raises the question as to which permno
(if any) matches a given gvkey
.
Alternatively, given security price information from CRSP, we might ask what it the most recent financial statement information for that secruity.
This raises the question as to which gvkey
(if any) matches the given permno
.
9.1 Firm identifiers
The idea behind firm identifiers is that they uniquely identify a firm for a particular purpose.67 While Compustat uses GVKEYs, CRSP uses PERMNOs, the SEC uses CIKs, stock exchanges use tickers, and there are also CUSIPs. Of course, identifiers apply not only to firms, but also people.68 Nonetheless, most of this chapter will focus on firm identifiers, in part because of the importance of firms as units of observation in finance and accounting research, but also because identifying firms is much harder than identifying people. While not specific to the platform we describe in this book, the issue of identifiers is one that seems less perplexing and better-handled when a relational database provides the backbone of your data store as it does here.
Data provider | Firm identifiers | Notes |
---|---|---|
Compustat (comp ) |
gvkey |
|
CRSP (crsp ) |
permno , permco
|
permno is a security identifier |
IBES (ibes ) |
ticker |
ticker is not necessarily the ticker assigned to the firm by the exchange on which it trades |
SEC EDGAR | CIK | |
Audit Analytics (audit ) |
company_fkey |
company_fkey is the same as CIK |
Various | CUSIP | CUSIP is a security identifier |
9.1.1 Firm identifiers: A quiz
Our sense is that firm identifiers is one of those topics that seem easy, but is actually fairly tricky. Here is a quick quiz to test your knowledge of firm identifiers.69
- General Motors Corporation declared bankruptcy in June 2009? Does the “successor firm” General Motors Company have the same GVKEY as General Motors Corporation? The same PERMNO?
- Can a CUSIP map to more than one PERMNO? To more than one GVKEY?
- Can a PERMNO map to more than one CUSIP?
- Can a GVKEY map to more than one PERMCO?
- Can a PERMCO map to different CIKs?
- If you have two data sets, \(X\) and \(Y\) and CUSIP is a “firm” identifier on each, can you simply merge using CUSIPs?
- When would a “firm” change CUSIPs?
- When would a “firm” change CIKs?
- If the firm identifier on IBES is
ticker
, should I merge with CRSP usingticker
fromcrsp.stocknames
?
Maybe you know the answers to some questions, but not all. If so, read on; this chapter aims to provide answers to these questions and more.
9.2 Linking CRSP and Compustat
The CRSP/Compustat Merged (CCM) database provides the standard link between CRSP data and Compustat’s fundamental data. The CCM provides three tables that you will see in common use:
crsp.ccmxpf_lnkhist
crsp.ccmxpf_lnkused
crsp.ccmxpf_linktable
The reality is that the only table we need to worry about is crsp.ccmxpf_lnkhist
, as the other two tables can be (and likely are) constructed from it (see here for details).70
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(tidyr)
pg <- dbConnect(RPostgres::Postgres(),
bigint = "integer",
check_interrupts = TRUE)
## # Source: SQL [?? x 8]
## # Database: postgres [iangow@/tmp:5432/iangow]
## gvkey linkprim liid linktype lpermno lpermco linkdt linkenddt
## <chr> <chr> <chr> <chr> <int> <int> <date> <date>
## 1 001000 C 00X NU NA NA 1961-01-01 1970-09-29
## 2 001000 P 01 NU NA NA 1970-09-30 1970-11-12
## 3 001000 P 01 LU 25881 23369 1970-11-13 1978-06-30
## 4 001001 C 00X NU NA NA 1978-01-01 1983-09-19
## 5 001001 P 01 LU 10015 6398 1983-09-20 1986-07-31
## 6 001002 C 00X NR NA NA 1960-01-01 1970-09-29
## 7 001002 C 01 NR NA NA 1970-09-30 1972-12-13
## 8 001002 C 01 NR NA NA 1973-06-06 1973-08-31
## 9 001002 C 01 LC 10023 22159 1972-12-14 1973-06-05
## 10 001003 C 00X NU NA NA 1980-01-01 1983-12-06
## # … with more rows
The basic idea of the table above is that given a gvkey
and a date, one can match that gvkey
-date combination to a PERMNO (here called lpermno
) by merging on gvkey
where the date is between linkdt
and linkenddt
.
One thing you will see is that there are cases where lpermno
is NA
, so “matching” these rows will result in non-matches, which is of no real value.
The only value might be in determining whether the non-match has linktype
of NR
, which means that lack of a link has been “confirmed by research” (presumably by CRSP), or of NU
, which means the link is “not yet confirmed” by research.
## # Source: SQL [3 x 2]
## # Database: postgres [iangow@/tmp:5432/iangow]
## linktype n
## <chr> <int>
## 1 NU 34110
## 2 NR 40533
## 3 NP 4
In practice, we would likely ignore all matches with linktype %in% c("NU", "NR")
or (equivalently) is.na(lpermno)
.
Let’s look at the remaining linktype
values.
ccm_link <-
ccmxpf_lnkhist %>%
filter(!is.na(lpermno))
ccm_link %>%
count(linktype) %>%
arrange(desc(n))
## # Source: SQL [6 x 2]
## # Database: postgres [iangow@/tmp:5432/iangow]
## # Ordered by: desc(n)
## linktype n
## <chr> <int>
## 1 LC 16696
## 2 LU 16017
## 3 LS 4601
## 4 LX 1141
## 5 LN 186
## 6 LD 118
The cases where linktype
is LD
represent cases where two GVKEYs map to a single PERMNO at the same time and, according to WRDS, “this link should not be used.”
Here is one example:
## # Source: SQL [3 x 8]
## # Database: postgres [iangow@/tmp:5432/iangow]
## gvkey linkprim liid linktype lpermno lpermco linkdt linkenddt
## <chr> <chr> <chr> <chr> <int> <int> <date> <date>
## 1 011550 P 01 LC 23536 21931 1962-01-31 NA
## 2 013353 P 01 LD 23536 21931 1962-01-31 1986-12-31
## 3 013353 C 99X LD 23536 21931 1987-01-01 2020-12-31
Here we’ll take the WRDS’s advice and omit these.
The cases where linktype
is LX
represent cases where the security referred to on Compustat is one that trades on a foreign exchange and CRSP is merely “helpfully” linking to a different security that is found on CRSP.
Here is one example:
## # Source: SQL [2 x 8]
## # Database: postgres [iangow@/tmp:5432/iangow]
## gvkey linkprim liid linktype lpermno lpermco linkdt linkenddt
## <chr> <chr> <chr> <chr> <int> <int> <date> <date>
## 1 001186 P 01 LC 78223 26174 1982-11-01 NA
## 2 001186 N 01C LX 78223 26174 1982-11-01 NA
These matches are duplicates and we don’t want them.
The remaining category for discussion is where linktype
is LN
.
These are cases where a link exists, but Compustat does not have price data to allow CRSP to check the quality of the link.
This is a case where researcher discretion may be used to include these, but most researchers appear to exclude these cases and we will do likewise.
Given the above, we are only including cases where linktype
is in LC
(valid, researched link), LU
(unresearched link), or LS
(link valid for this lpermno
only).
ccm_link <-
ccmxpf_lnkhist %>%
filter(linktype %in% c("LC", "LU", "LS"))
ccm_link %>%
count(linkprim) %>%
arrange(desc(n))
## # Source: SQL [4 x 2]
## # Database: postgres [iangow@/tmp:5432/iangow]
## # Ordered by: desc(n)
## linkprim n
## <chr> <int>
## 1 P 28918
## 2 C 7870
## 3 J 428
## 4 N 98
Now, let’s consider, linkprim
. WRDS explains as follows:
linkprim
clarifies the link’s relationship to Compustat’s marked primary security within the related range. “P” indicates a primary link marker, as identified by Compustat in monthly security data. “C” indicates a primary link marker, as identified by CRSP to resolve ranges of overlapping or missing primary markers from Compustat in order to produce one primary security throughout the company history. “J” indicates a joiner secondary issue of a company, identified by Compustat in monthly security data.
And cases where linkprim
equals N
are duplicated links due to the existence of Canadian securities for a US-traded firm; so we ignore these.
A natural question is whether, for any given GVKEY-date, is there only one PERMNO that is matched with linkprim IN ('P', 'C')
?
ccm_link <-
ccmxpf_lnkhist %>%
filter(linktype %in% c("LC", "LU", "LS"),
linkprim %in% c("C", "P"))
# Look for overlapping date ranges
ccm_link %>%
group_by(gvkey) %>%
window_order(linkdt) %>%
mutate(lead_linkdt = lead(linkdt),
lag_linkenddt = lag(linkenddt)) %>%
filter(linkenddt >= lead_linkdt | lag_linkenddt >= linkdt) %>%
ungroup() %>%
count()
## # Source: SQL [1 x 1]
## # Database: postgres [iangow@/tmp:5432/iangow]
## # Ordered by: linkdt
## n
## <int>
## 1 0
So there are no cases of overlapping dates, which means that only one lpermno
is linked to for a given date.
Our sense is that the last iteration of ccm_link
above is more or less the standard approach used by researchers in practice.
You may occasionally see code that filters on usedflag==1
, which is a variable found on crsp.ccmxpf_lnkused
.
But, as can be seen here, using this table and filter yields exactly the same result as ccm_link
above.
Note that the vast majority of GVKEYs map to just one PERMNO even without regard to time.
ccm_link %>%
select(gvkey, lpermno) %>%
distinct() %>%
group_by(gvkey) %>%
mutate(num_permnos = n()) %>%
ungroup() %>%
count(num_permnos) %>%
arrange(num_permnos)
## # Source: SQL [5 x 2]
## # Database: postgres [iangow@/tmp:5432/iangow]
## # Ordered by: num_permnos
## num_permnos n
## <int> <int>
## 1 1 31468
## 2 2 2018
## 3 3 219
## 4 4 12
## 5 5 5
The case with 5 PERMNOs is, if we recall correctly, a total mess with tracking stock, spin-offs, etc. But one observation doesn’t matter much.71
ccm_link <-
ccmxpf_lnkhist %>%
filter(linktype %in% c("LC", "LU", "LS"),
linkprim %in% c("C", "P")) %>%
mutate(linkenddt = coalesce(linkenddt, max(linkenddt, na.rm = TRUE)))
9.2.1 Applying the CRSP-Compustat link
For concreteness in thinking about issues related to linking Compustat and CRSP, let’s imagine we are interested in announcement returns for annual earnings for fiscal years ending after 31 December 1999.
The following code generates from Compustat the table of events that we can link to CRSP.
We start with earnings on comp.funda
and then link with the fourth-quarter data for “report date of quarterly earnings”, which is the variable rdq
.
funda <- tbl(pg, sql("SELECT * FROM comp.funda"))
fundq <- tbl(pg, sql("SELECT * FROM comp.fundq"))
funda_mod <-
funda %>%
filter(indfmt == "INDL", datafmt == "STD",
consol == "C", popsrc == "D")
firm_years <-
funda_mod %>%
select(gvkey, datadate)
rdqs <-
fundq %>%
filter(indfmt == "INDL", datafmt == "STD",
consol == "C", popsrc == "D") %>%
filter(!is.na(rdq), fqtr == 4) %>%
select(gvkey, datadate, rdq)
annc_dates <-
firm_years %>%
inner_join(rdqs, by = c("gvkey", "datadate")) %>%
filter(datadate >= "2000-01-01")
The resulting table has gvkey
and datadate
, as well as rdq
.
But in linking to CRSP, a question arises as to which data to use to check that a link is valid at the relevant time.
This question seems difficult to answer in the abstract, so let’s try both dates (datadate
and rdq
) and drill deeper into cases where the linked permno
values differ (when the linked permno
is the same, we are probably safe).
datadate_permnos <-
annc_dates %>%
inner_join(ccm_link,
by = join_by(gvkey,
datadate >= linkdt, datadate <= linkenddt)) %>%
select(gvkey, datadate, rdq, lpermno, lpermco) %>%
rename(permno = lpermno,
permco = lpermco)
rdq_permnos <-
annc_dates %>%
inner_join(ccm_link,
join_by(gvkey, rdq >= linkdt, rdq <= linkenddt)) %>%
select(gvkey, datadate, rdq, lpermno, lpermco) %>%
rename(permno = lpermno,
permco = lpermco)
link_table_combined <-
rdq_permnos %>%
full_join(datadate_permnos,
by = c("gvkey", "datadate", "rdq"),
suffix = c("_rdq", "_datadate")) %>%
filter(!is.na(permno_rdq) | !is.na(permno_datadate)) %>%
mutate(same_permno = coalesce(permno_rdq == permno_datadate, FALSE),
same_permco = coalesce(permco_rdq == permco_datadate, FALSE),
has_permno_rdq = !is.na(permno_rdq),
has_permno_datadate = !is.na(permno_datadate))
link_table_stats <-
link_table_combined %>%
count(same_permno, same_permco,
has_permno_rdq, has_permno_datadate) %>%
arrange(same_permno, same_permco,
has_permno_rdq, has_permno_datadate) %>%
ungroup() %>%
collect()
link_table_stats
## # A tibble: 5 × 5
## same_permno same_permco has_permno_rdq has_permno_datadate n
## <lgl> <lgl> <lgl> <lgl> <int>
## 1 FALSE FALSE FALSE TRUE 1872
## 2 FALSE FALSE TRUE FALSE 2880
## 3 FALSE FALSE TRUE TRUE 20
## 4 FALSE TRUE TRUE TRUE 20
## 5 TRUE TRUE TRUE TRUE 119100
So in 96.1% cases, we match to the same permno
using either date.
In many cases, we match to a permno
using one date but not the other.
In 2880 cases, we have a permno
link on rdq
, but not on datadate
.
Inspecting some of these cases, they appear to be cases where the stock was not traded on datadate
, but were trading by rdq
, perhaps due an IPO in the meantime.
In 1872 cases, we have a permno
link on dataadate
, but not on rdq
.
One example is the gvkey
of 021998
, which on 2017-12-31
relates to the firm Neothetics Inc.
On January 17, 2018, Neothetics, Inc., or Neothetics, and privately-held Evofem Biosciences Operations, Inc., or Private Evofem, completed … the Merger. … The Merger was structured as a reverse capitalization and Private Evofem was determined to be the accounting acquirer based on the terms of the Merger and other factors.”
So the continuing entity with respect to financial reporting is Evofem, which has a GVKEY of 032961
.
The twist in this case is that “the financial information included in the first 10-K filed after the merger is that of Neothetics prior to the Merger.”
Thus the representation of ni
and at
as continuous series up to 2017-12-31
provided by Compustat for gvkey
of 021998
(see below) and a series starting from 2018-12-31
for gvkey
of 032961
seems correct.
While inspection of the [financial statements] for the period ending 2018-12-31
suggests that there are values that could be supplied for gvkey
of 032961
and datadate
of 2017-12-31
(Compustat leaves this missing), it seems difficult to view the market reaction on rdq
(2018-02-26
) as being to the financial statement information associated with either gvkey
(021998
or 032961
) alone and thus a non-match on ccm_link
for this case seems appropriate.
## # Source: SQL [2 x 8]
## # Database: postgres [iangow@/tmp:5432/iangow]
## gvkey linkprim liid linktype lpermno lpermco linkdt linkenddt
## <chr> <chr> <chr> <chr> <int> <int> <date> <date>
## 1 021998 P 01 LC 15075 55117 2014-11-20 2018-01-17
## 2 032961 P 01 LC 15075 55117 2018-01-18 2023-01-31
ccm_link %>%
filter(lpermno == 15075) %>%
select(gvkey) %>%
inner_join(funda_mod) %>%
select(gvkey, datadate, at, ni) %>%
arrange(datadate)
## Joining with `by = join_by(gvkey)`
## # Source: SQL [?? x 4]
## # Database: postgres [iangow@/tmp:5432/iangow]
## # Ordered by: datadate
## gvkey datadate at ni
## <chr> <date> <dbl> <dbl>
## 1 021998 2012-12-31 12.8 -7.83
## 2 021998 2013-12-31 4.53 -15.0
## 3 021998 2014-12-31 76.9 -10.8
## 4 021998 2015-12-31 40.1 -43.2
## 5 021998 2016-12-31 12.8 -13.0
## 6 021998 2017-12-31 4.12 -9.99
## 7 032961 2017-12-31 NA NA
## 8 032961 2018-12-31 4.01 -126.
## 9 032961 2019-12-31 28.3 -80.0
## 10 032961 2020-12-31 110. -142.
## # … with more rows
Only in 40 cases do we find different permno
values on the two dates.
In 20 cases, we link to different permno
values, but the same permco
value.
One of these cases is for gvkey
of 017010
.
It’s not clear why CRSP switched the primary permno
on 2018-01-01
.72
## # Source: SQL [4 x 8]
## # Database: postgres [iangow@/tmp:5432/iangow]
## gvkey linkprim liid linktype lpermno lpermco linkdt linkenddt
## <chr> <chr> <chr> <chr> <int> <int> <date> <date>
## 1 017010 P 01 LC 15998 54311 2016-04-18 2017-12-31
## 2 017010 J 01 LC 15998 54311 2018-01-01 NA
## 3 017010 J 03 LC 16000 54311 2016-04-18 2017-12-31
## 4 017010 P 03 LC 16000 54311 2018-01-01 NA
What about the 20 cases with different permco
values?
Looking at the case where gvkey
is 183603
, the issue is a merger.
According to the 10-K filed on 2017-02-28
, “on January 10, 2017, NSAM completed the tri-party merger with Colony Capital, Inc., or Colony, and NorthStar Realty Finance Corp., or NorthStar Realty or NRF, under which the companies combined in an all-stock merger of equals transaction, referred to as the Mergers, to create Colony NorthStar, an internally-managed, diversified real estate and investment management company. … Although NSAM is the legal acquirer in the Mergers, Colony has been designated as the accounting acquirer, resulting in a reverse acquisition of NSAM for accounting purposes.”
The issue here is that the gvkey
of 183603
, which related to Colony prior to the merger, survives on Compustat, while the permno
of 14686
, which related to NSAM prior to the merger, is the surviving permno
.
The correct approach here depends to some extent on the research question.
In a sense, the net income reported on 2017-02-28 relates to an entity (a pre-merger Colony) that whose securities ceased to trade on 2017-01-10
, so the correct answer is that there is no appropriate permno
to use for measuring the market reaction to earnings on 2017-02-28
.
In summary, it appears that cases where have a permno
link on datadate
, but not on rdq
, or where the matched permno
changes between the two dates are problematic and should not be used for the purpose of our hypothetical event study.
Cases where have a permno
link on rdq
, but not on datadate
, seem more appropriate, but are likely to be a bit different from other firms.
9.3 All about CUSIPs
According to CUSIP Global Services, “CUSIP identifiers are the universally accepted standard for classifying financial instruments across institutions and exchanges worldwide.”
“Derived from the Committee on Uniform Security Identification Procedures, CUSIPs are 9-character identifiers that capture an issue’s important differentiating characteristics for issuers and their financial instruments in the U.S. and Canada.”
CUSIP Global Services uses the CUSIP of Amazon.com’s common stock, 023135106
, as an example of the components of a 9-character CUSIP.
The first six characters, 023135
represent the issuer, which is Amazon.com, a company, in this case, but could be a municipality or a government agency.
The next two characters (10
) indicate the type of instrument (e.g., debt or equity), but also uniquely identifies the issue among the issuer’s securities.
The final character (6
) is a check digit created by a mathematical formula.
This last character will indicate any corruption of the preceding 8 characters.
Note that the characters need not be digits.
For example, the Class C Common Stock of Dell Technologies Inc. has a CUSIP of 24703L202
, which contains the letter L
.73
While a full CUSIP always comprises nine characters, many data services abbreviate the CUSIP by omitting the check digit (to create an “eight-digit” CUSIP) or both the check digit and the issue identifier (to create a “six-digit” CUSIP).
For example, the CRSP table crsp.stocknames
uses eight-digit CUSIPs.
9.4 Exercises
- The CRSP table
crsp.stocknames
includes two CUSIP-related fields,cusip
andncusip
. What are the differences between the two fields? What does it mean whenncusip
is missing, butcusip
is present?
Like CUSIPs, PERMNOs are security-level identifiers. Can a PERMNO be associated with more than one CUSIP at a given point in time? Can a PERMNO be associated with more than one CUSIP over time? Can a CUSIP be associated with more than one PERMNO over time?
Looking at entries on
crsp.stocknames
whereticker
isDELL
, we see two differentpermno
values. What explains this?
## # Source: SQL [4 x 3]
## # Database: postgres [iangow@/tmp:5432/iangow]
## permno namedt nameenddt
## <int> <date> <date>
## 1 11081 1988-06-22 2003-07-21
## 2 11081 2003-07-22 2013-10-29
## 3 18267 2018-12-28 2020-03-22
## 4 18267 2020-03-23 2022-12-30
- Looking at
permno
of11081
(Dell), we see two different CUSIP values. What change appear to caused the change in CUSIP for what CRSP regards as the same security?
9.5 Linking Compustat with Audit Analytics
The SEC EDGAR databases uses CIKs to identify firms.
Other databases, often derived from SEC EDGAR data, also use CIKs as firm identifiers.
One such database is Audit Analytics (audit
).
We use Audit Analytics to illustrate linking of other databases using CIK links and focus on the table containing data on restatements (audit.auditnonreli
) to do so.
Because accounting restatements most closely relate to financial statements, we focus here on linking Compustat to Audit Analytics.
For the purposes of this chapter, we are not concerned with the detailed contents of audit.auditnonreli
, merely with linking Compustat to that table.
The three fields from audit.auditnonreli
that we will use here are res_notif_key
, company_fkey
, and file_date_num
.
The first field is a primary key for the audit.auditnonreli
table, the second field is what Audit Analytics calls CIKs, and the last field is date on which the SEC filing related to the restatement was filed.
This will be our “test date” for linking GVKEYs to CIKs.
auditnonreli <- tbl(pg, sql("SELECT * FROM audit.auditnonreli"))
aa_lookup <-
auditnonreli %>%
select(res_notif_key, company_fkey, file_date_num) %>%
rename(cik = company_fkey) %>%
collect()
As our source linking GVKEYs to CIKs, we will use the table crsp.comphist
, which is created by CRSP from historical editions of the Compustat database.
comphist <- tbl(pg, sql("SELECT * FROM crsp.comphist"))
gvkey_cik_link <-
comphist %>%
filter(!is.na(hcik)) %>%
group_by(gvkey) %>%
window_order(hchgdt) %>%
# Some fancy code to deal with tricky cases (see exercises)
mutate(new_cik = as.integer(row_number()==1L | hcik != lag(hcik)),
cik_id = cumsum(new_cik)) %>%
group_by(gvkey, hcik, cik_id) %>%
# SQL kills missing hchgenddt values, so this enables me
# to restore these
summarize(missing_hchgenddt = any(is.na(hchgenddt), na.rm = TRUE),
hchgdt = min(hchgdt, na.rm = TRUE),
hchgenddt = max(hchgenddt, na.rm = TRUE),
.groups = "drop") %>%
mutate(hchgenddt = if_else(missing_hchgenddt, NA, hchgenddt)) %>%
select(-missing_hchgenddt, -cik_id) %>%
rename(cik = hcik) %>%
arrange(gvkey, hchgdt) %>%
collect(0)
gvkey_cik_link %>%
select(gvkey, cik) %>%
distinct() %>%
group_by(gvkey) %>%
summarize(num_ciks = n()) %>%
ungroup() %>%
count(num_ciks)
## # A tibble: 3 × 2
## num_ciks n
## <int> <int>
## 1 1 35803
## 2 2 1280
## 3 3 33
gvkey_aa_link <-
aa_lookup %>%
inner_join(gvkey_cik_link, by = "cik") %>%
filter(file_date_num >= hchgdt | hchgdt == min(hchgdt),
file_date_num <= hchgenddt | is.na(hchgenddt)) %>%
select(res_notif_key, gvkey) %>%
right_join(aa_lookup, by = "res_notif_key")
## Warning in inner_join(., gvkey_cik_link, by = "cik"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 72 of `x` matches multiple rows in `y`.
## ℹ Row 22747 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
gvkey_aa_link %>%
count(has_gvkey = !is.na(gvkey),
year = lubridate::year(file_date_num)) %>%
pivot_wider(names_from = has_gvkey, values_from = n,
names_glue = "has_gvkey_{has_gvkey}") %>%
mutate(total = has_gvkey_TRUE + has_gvkey_FALSE,
prop_w_gvkey = has_gvkey_TRUE/total) %>%
rename(num_w_gvkey = has_gvkey_TRUE) %>%
select(-has_gvkey_FALSE)
## # A tibble: 29 × 4
## year num_w_gvkey total prop_w_gvkey
## <dbl> <int> <int> <dbl>
## 1 1996 21 24 0.875
## 2 1997 62 64 0.969
## 3 1998 79 88 0.898
## 4 1999 127 137 0.927
## 5 2000 444 551 0.806
## 6 2001 540 642 0.841
## 7 2002 574 739 0.777
## 8 2003 719 876 0.821
## 9 2004 857 995 0.861
## 10 2005 1413 1683 0.840
## # … with 19 more rows
auditnonreli %>%
group_by(res_notif_key) %>%
summarize(num_rows = n()) %>%
ungroup() %>%
filter(num_rows != 1) %>%
collect() %>%
nrow() == 0
## [1] TRUE
## [1] TRUE