8  Linking databases

Almost any research project in accounting or finance research will involve merging data from multiple sources. When we joined data tables in Chapter 2, we had a common identifier across tables. But this will not always be the case. For example, to evaluate the market reaction to earnings announcements, we might start with data on comp.fundq from Compustat, where gvkey and datadate to identify firm-quarters and associated announcement dates (rdq), and then look to merge with daily stock return data from the Center for Research in Security Prices (CRSP, pronounced “crisp”) on crsp.dsf, which uses permno and date to identify each firm’s trading equity and daily stock returns (ret). But this raises the question as to which permno (if any) matches a given gvkey. Alternatively, given security price information from CRSP, we might ask what it the most recent financial statement information for that security. This raises the question as to which gvkey (if any) matches a given permno and this chapter provides guidance of the standard approaches to linking these databases.

Tip

The code in this chapter uses the following packages. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available at https://github.com/iangow/far_templates.

library(dplyr, warn.conflicts = FALSE)
library(dbplyr)
library(DBI)
library(tidyr)
library(ggplot2)
library(lubridate)        # For floor_date()

8.1 Firm identifiers

The idea behind a firm identifiers is that it uniquely identifies a firm for a particular purpose.1 While Compustat uses GVKEYs, CRSP uses PERMNOs, the SEC uses CIKs, stock exchanges use tickers, and there are also CUSIPs. Of course, identifiers apply not only to firms, but also people.2 Nonetheless, most of this chapter will focus on firm identifiers, in part because of the importance of firms as units of observation in accounting and finance research, but also because identifying firms is much harder than identifying people. While not specific to the platform we describe in this book, the issue of identifiers is one that seems less perplexing and better-handled when a relational database provides the backbone of your data store as it does here.

Data provider Firm identifiers Notes
Compustat (comp) gvkey
CRSP (crsp) permno, permco permno is a security identifier
IBES (ibes) ticker ticker is not necessarily the ticker assigned to the firm by the exchange on which it trades
SEC EDGAR CIK
Audit Analytics (audit) company_fkey company_fkey is the same as CIK
Various CUSIP CUSIP is a security identifier

8.1.1 Firm identifiers: A quiz

Our sense is that firm identifiers is one of those topics that seem easy, but is actually fairly tricky. Here is a quick quiz to test your knowledge of firm identifiers.3

  • General Motors Corporation declared bankruptcy in June 2009? Does the “successor firm” General Motors Company have the same GVKEY as General Motors Corporation? The same PERMNO?
  • Can a CUSIP map to more than one PERMNO? To more than one GVKEY?
  • Can a PERMNO map to more than one CUSIP?
  • Can a GVKEY map to more than one PERMCO?
  • Can a PERMCO map to different CIKs?
  • If you have two data sets, \(X\) and \(Y\) and CUSIP is a “firm” identifier on each, can you simply merge using CUSIPs?
  • When would a “firm” change CUSIPs?
  • When would a “firm” change CIKs?
  • If the firm identifier on IBES is ticker, should I merge with CRSP using ticker from crsp.stocknames?

Maybe you know the answers to some questions, but not all. If so, read on; this chapter aims to provide answers to many of these questions.4

8.2 The CRSP database

According to its website, “the Center for Research in Security Prices, LLC (CRSP) maintains the most comprehensive collection of security price, return, and volume data for the NYSE, AMEX and NASDAQ stock markets. Additional CRSP files provide stock indices, beta-based and cap-based portfolios, treasury bond and risk-free rates, mutual funds, and real estate data. [CRSP] maintains the most comprehensive collection of security price, return, and volume data for the NYSE, AMEX and NASDAQ stock markets. Additional CRSP files provide stock indices, beta-based and cap-based portfolios, treasury bond and risk-free rates, mutual funds, and real estate data.” We will discuss the CRSP/COMPUSTAT Merged Database in Section 8.3.5

CRSP provides PERMNO, its own “permanent identifier” for each security in its database. Additionally, CRSP provides a company-level identifier, PERMCO, for each company. CRSP’s goals in creating these identifiers is to allow “for clean and accurate backtesting, time-series and event studies, measurement of performance, accurate benchmarking, and securities analysis.”

“CRSP contains end-of-day and month-end prices on all listed NYSE, Amex, and NASDAQ common stocks along with basic market indices, and includes the most comprehensive distribution information available, with the most accurate total return calculations.”

End-of-day prices are found on crsp.dsf and month-end prices are on crsp.msf. Let’s take a look at these two tables.

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

dsf <- tbl(pg, Id(schema = "crsp", table = "dsf")) 
msf <- tbl(pg, Id(schema = "crsp", table = "msf"))
dsf |> collect(n = 5)
# A tibble: 5 × 20
  cusip    permno permco issuno hexcd hsiccd date       bidlo askhi   prc    vol
  <chr>     <int>  <int>  <int> <int>  <int> <date>     <dbl> <dbl> <dbl>  <int>
1 92343V10  65875  20288      0     1   4813 1993-05-21  53.2  54.2  53.2 928900
2 92343V10  65875  20288      0     1   4813 1993-05-24  53.4  53.9  53.9 291100
3 92343V10  65875  20288      0     1   4813 1993-05-25  53.5  53.9  53.9 251800
4 92343V10  65875  20288      0     1   4813 1993-05-26  53.8  54.9  54.8 475100
5 92343V10  65875  20288      0     1   4813 1993-05-27  54.4  54.9  54.8 286400
# ℹ 9 more variables: ret <dbl>, bid <dbl>, ask <dbl>, shrout <dbl>,
#   cfacpr <dbl>, cfacshr <dbl>, openprc <dbl>, numtrd <int>, retx <dbl>
msf |> collect(n = 5)
# A tibble: 5 × 21
  cusip    permno permco issuno hexcd hsiccd date       bidlo askhi   prc   vol
  <chr>     <int>  <int>  <int> <int>  <int> <date>     <dbl> <dbl> <dbl> <int>
1 68391610  10000   7952  10396     3   3990 1985-12-31 NA    NA    NA       NA
2 68391610  10000   7952  10396     3   3990 1986-01-31 -2.5  -4.44 -4.38  1771
3 68391610  10000   7952  10396     3   3990 1986-02-28 -3.25 -4.38 -3.25   828
4 68391610  10000   7952  10396     3   3990 1986-03-31 -3.25 -4.44 -4.44  1078
5 68391610  10000   7952  10396     3   3990 1986-04-30 -4    -4.31 -4      957
# ℹ 10 more variables: ret <dbl>, bid <dbl>, ask <dbl>, shrout <dbl>,
#   cfacpr <dbl>, cfacshr <dbl>, altprc <dbl>, spread <dbl>, altprcdt <date>,
#   retx <dbl>

The CRSP Indices database contains a number of CRSP indices. Here we focus on two index tables, crsp.dsi and crsp.msi, which can be viewed complementing crsp.dsf and crsp.msf respectively.

dsi <- tbl(pg, Id(schema = "crsp", table = "dsi")) 
msi <- tbl(pg, Id(schema = "crsp", table = "msi"))
dsi |> collect(n = 5)
# A tibble: 5 × 11
  date          vwretd    vwretx   ewretd   ewretx sprtrn spindx   totval totcnt
  <date>         <dbl>     <dbl>    <dbl>    <dbl>  <dbl>  <dbl>    <dbl>  <int>
1 1925-12-31 NA        NA        NA       NA           NA     NA   2.75e7    503
2 1926-01-02  0.00569   0.00569   0.00952  0.00952     NA     NA   2.76e7    497
3 1926-01-04  0.000706  0.000706  0.00578  0.00578     NA     NA   2.76e7    502
4 1926-01-05 -0.00482  -0.00487  -0.00193 -0.00203     NA     NA   2.75e7    501
5 1926-01-06 -0.000423 -0.000427  0.00118  0.00115     NA     NA   2.76e7    505
# ℹ 2 more variables: usdval <dbl>, usdcnt <int>
msi |> collect(n = 5)
# A tibble: 5 × 11
  date          vwretd   vwretx  ewretd  ewretx  sprtrn spindx    totval totcnt
  <date>         <dbl>    <dbl>   <dbl>   <dbl>   <dbl>  <dbl>     <dbl>  <int>
1 1925-12-31 NA        NA       NA      NA      NA        12.5 27487487.    503
2 1926-01-30  0.000561 -0.00140  0.0232  0.0214  0.0225   12.7 27624241.    506
3 1926-02-27 -0.0330   -0.0366  -0.0535 -0.0555 -0.0440   12.2 26752064.    514
4 1926-03-31 -0.0640   -0.0700  -0.0968 -0.101  -0.0591   11.5 25083173.    519
5 1926-04-30  0.0370    0.0340   0.0330  0.0302  0.0227   11.7 25886744.    521
# ℹ 2 more variables: usdval <dbl>, usdcnt <int>

8.2.1 Exercises

  1. If you look at the stock tables (crsp.dsf and crsp.msf), you will see that prc can be negative on either table. Do negative stock prices make sense economically speaking? What do negative stock prices on CRSP mean? (CRSP documentation can be found here.) What would be some alternative approaches to encode this information? (Write code to recast the data using one of these approaches.) Why do you think that CRSP chose the approach used?

  2. How do ret and retx differ? Which variable are you more likely to use in research?

  3. Looking at the date variable on crsp.msf, is it always the last day of the month? If not, why not?

  4. Suggest the “natural” primary key for these tables. Check that this is a primary key for crsp.msf.

  5. What is being depicted in Figure 8.1 and Figure 8.2? What are the sources of variation across months in the first plot? Looking at the plots, what appears to be the main driver of variation in the first plot. Create an additional plot to visualize the source of variation in the first not depicted below. In the code below, we are using collect() followed by mutate(month = floor_date(date, "month")) to calculate month. What changes in terms of where the processing occurs if we replace these two lines with mutate(month = as.Date(floor_date("month", date)))? Do we get different results? Why do we need the as.Date() function in the second case?

plot_data <-
  dsf |>
  select(date) |>
  filter(between(date, "2017-12-31", "2022-12-31")) |>
  collect() |>
  mutate(month = floor_date(date, "month"))

freqs <-
  plot_data |>
  count(month) |>
  collect()

freqs |>
  ggplot(aes(x = month, y = n)) +
  geom_bar(stat = "identity") +
  scale_x_date(date_breaks = "2 months",
               expand = expansion()) +
  theme(axis.text.x = element_text(angle = 90))
Figure 8.1: Number of observations by month (#1)
freqs_alt <-
  plot_data |>
  distinct() |>
  count(month)

freqs_alt |>
  ggplot(aes(x = month, y = n)) +
  geom_bar(stat = "identity") +
  scale_x_date(date_breaks = "2 months",
               expand = expansion()) +
  theme(axis.text.x = element_text(angle = 90))
Figure 8.2: Number of observations by month (#2)
  1. What is the primary key for crsp.dsi and crsp.msi? Verify that it is a valid key for both tables.

  2. Using the dplyr verb anti_join(), determine if there are any dates on crsp.dsf that do not appear on crsp.dsi or vice versa. Do the same for crsp.msi and crsp.msf.

8.3 Linking CRSP and Compustat

The CRSP/Compustat Merged (CCM) database provides the standard link between CRSP data and Compustat’s fundamental data. The CCM provides three tables that you will see in common use:

  • crsp.ccmxpf_lnkhist
  • crsp.ccmxpf_lnkused
  • crsp.ccmxpf_linktable

The reality is that the only table we need to worry about is crsp.ccmxpf_lnkhist, as the other two tables can be (and likely are) constructed from it (see here for details).6

ccmxpf_lnkhist <- tbl(pg, Id(schema = "crsp", table = "ccmxpf_lnkhist"))
ccmxpf_lnkhist
Table 8.1: A sample of observations on crsp.ccmxpf_lnkhist
gvkey linkprim liid linktype lpermno lpermco linkdt linkenddt
001000 C 00X NU NA NA 1961-01-01 1970-09-29
001000 P 01 NU NA NA 1970-09-30 1970-11-12
001000 P 01 LU 25881 23369 1970-11-13 1978-06-30
001001 C 00X NU NA NA 1978-01-01 1983-09-19
001001 P 01 LU 10015 6398 1983-09-20 1986-07-31
001002 C 00X NR NA NA 1960-01-01 1970-09-29
001002 C 01 NR NA NA 1970-09-30 1972-12-13
001002 C 01 NR NA NA 1973-06-06 1973-08-31
001002 C 01 LC 10023 22159 1972-12-14 1973-06-05
001003 C 00X NU NA NA 1980-01-01 1983-12-06

The basic idea of crsp.ccmxpf_lnkhist is that given a gvkey and a date, one can match that gvkey-date combination to a PERMNO (here called lpermno) by merging on gvkey where the date is between linkdt and linkenddt. One thing you will see is that there are cases where lpermno is NA, so “matching” these rows will result in non-matches, which is of no real value. The only value might be in determining whether the non-match has linktype of NR, which means that lack of a link has been “confirmed by research” (presumably by CRSP), or of NU, which means the link is “not yet confirmed” by research.

ccmxpf_lnkhist |>
  filter(is.na(lpermno)) |>
  count(linktype)
linktype n
NU 34125
NR 43255
NP 4

In practice, we would likely ignore all matches with linktype %in% c("NU", "NR") or (equivalently) is.na(lpermno). Let’s look at the remaining linktype values.

ccm_link <-
  ccmxpf_lnkhist |>
  filter(!is.na(lpermno)) 

ccm_link |>
  count(linktype) |>
  arrange(desc(n))
linktype n
LC 17013
LU 15993
LS 5158
LX 1153
LN 186
LD 118

The cases where linktype is LD represent cases where two GVKEYs map to a single PERMNO at the same time and, according to WRDS, “this link should not be used.” Here is one example:

ccm_link |> filter(lpermno == 23536)
gvkey linkprim liid linktype lpermno lpermco linkdt linkenddt
011550 P 01 LC 23536 21931 1962-01-31 NA
013353 P 01 LD 23536 21931 1962-01-31 1986-12-31
013353 C 99X LD 23536 21931 1987-01-01 2020-12-31

Here we’ll take the WRDS’s advice and omit these.

The cases where linktype is LX represent cases where the security referred to on Compustat is one that trades on a foreign exchange and CRSP is merely “helpfully” linking to a different security that is found on CRSP. Here is one example:

ccm_link |> filter(gvkey == "001186")
gvkey linkprim liid linktype lpermno lpermco linkdt linkenddt
001186 P 01 LC 78223 26174 1982-11-01 NA
001186 N 01C LX 78223 26174 1982-11-01 NA

These matches are duplicates and we don’t want them.

The remaining category for discussion is where linktype is LN. These are cases where a link exists, but Compustat does not have price data to allow CRSP to check the quality of the link. While researcher discretion might be used to include these, most researchers appear to exclude these cases and we will do likewise. Given the above, we are only including cases where linktype is in LC (valid, researched link), LU (unresearched link), or LS (link valid for this lpermno only).

ccm_link <-
  ccmxpf_lnkhist |>
  filter(linktype %in% c("LC", "LU", "LS")) 

ccm_link |>
  count(linkprim) |>
  arrange(desc(n))
linkprim n
P 29689
C 7961
J 420
N 94

Now, let’s consider, linkprim. WRDS explains as follows:

linkprim clarifies the link’s relationship to Compustat’s marked primary security within the related range. “P” indicates a primary link marker, as identified by Compustat in monthly security data. “C” indicates a primary link marker, as identified by CRSP to resolve ranges of overlapping or missing primary markers from Compustat in order to produce one primary security throughout the company history. “J” indicates a joiner secondary issue of a company, identified by Compustat in monthly security data.

This suggests we should omit cases where linkprim equals J. Given that cases where linkprim equals N are duplicated links due to the existence of Canadian securities for a US-traded firm, we will exclude these too. This leaves us with linkprim %in% c("C", "P"):

ccm_link <-
  ccmxpf_lnkhist |>
  filter(linktype %in% c("LC", "LU", "LS"),
         linkprim %in% c("C", "P")) 

A natural question is whether, for any given GVKEY-date, there is only one PERMNO that is matched with linkprim IN ('P', 'C') on any given date. We can examine this by looking for overlapping date ranges between different rows of ccm_link for a given gvkey:

ccm_link |>
  group_by(gvkey) |>
  window_order(linkdt) |>
  mutate(lead_linkdt = lead(linkdt),
         lag_linkenddt = lag(linkenddt)) |>
  filter(linkenddt >= lead_linkdt | lag_linkenddt >= linkdt) |>
  ungroup() |>
  count() |> 
  pull()
[1] 0

So there are no cases of overlapping dates, which means that only one lpermno is linked to a given gvkey for a given date. Our sense is that the last iteration of ccm_link above is more or less the standard approach used by researchers in practice. You may occasionally see code that filters on usedflag==1, which is a variable found on crsp.ccmxpf_lnkused, not on crsp.ccmxpf_lnkhist. But, it can be shown that using this table and filter yields exactly the same result as ccm_link above.

Note that the vast majority of GVKEYs map to just one PERMNO even without regard to date.

ccm_link |>    
  group_by(gvkey) |>
  summarize(num_permnos = n_distinct(lpermno)) |>
  count(num_permnos, sort = TRUE)
num_permnos n
1 32198
2 1056
3 76
4 3
5 1

The case with 5 PERMNOs is a complicated one involving tracking stock, spin-offs, etc.7 But one observation doesn’t matter much.

A final consideration is the presence of missing values on linkenddt. In general, these missing values indicate that there the link remains valid at the time the data set was constructed. One approach would be to merge using code something like datadate <= linkenddt | is.na(linkenddt). Another approach would be to fill in missing values with a date that represents the plausible latest date for the data set. The latter approach does avoid the need for later checks of missing values. In the following code we use the latest value of linkenddt for the entire table to fill in all missing values on linkenddt.

ccm_link <-
  ccmxpf_lnkhist |>
  filter(linktype %in% c("LC", "LU", "LS"),
         linkprim %in% c("C", "P")) |>
  mutate(linkenddt = coalesce(linkenddt,
                              max(linkenddt, na.rm = TRUE)))

8.4 All about CUSIPs

According to CUSIP Global Services, “CUSIP identifiers are the universally accepted standard for classifying financial instruments across institutions and exchanges worldwide. Derived from the Committee on Uniform Security Identification Procedures, CUSIPs are 9-character identifiers that capture an issue’s important differentiating characteristics for issuers and their financial instruments in the U.S. and Canada.”

CUSIP Global Services uses the CUSIP of Amazon.com’s common stock, 023135106, as an example of the components of a 9-character CUSIP. The first six characters—023135—represent the issuer, which is Amazon.com, a company, in this case, but could be a municipality or a government agency. The next two characters (10) indicate the type of instrument (e.g., debt or equity), but also uniquely identifies the issue among the issuer’s securities. The final character (6) is a check digit created by a mathematical formula. This last character will indicate any corruption of the preceding 8 characters. Note that the characters need not be digits. For example, the Class C Common Stock of Dell Technologies Inc. has a CUSIP of 24703L202, which contains the letter L.9

While a full CUSIP always comprises nine characters, many data services abbreviate the CUSIP by omitting the check digit (to create an “eight-digit” CUSIP) or both the check digit and the issue identifier (to create a “six-digit” CUSIP). For example, the CRSP table crsp.stocknames uses eight-digit CUSIPs.

Notwithstanding the existence of crsp.ccmxpf_lnkhist, some researchers choose to link CRSP and Compustat using CUSIPs. For example, the code supplied with Jame et al. (2016) merges CRSP daily data on returns, prices, volume and shares outstanding with Compustat data on shareholders’ equity using CUSIP and “year”.10

To evaluate the appropriateness of using CUSIPs to link CRSP and Compustat, we can construct a link table for comp.funda using CUSIPs (funda_cusip_link below) and compare with with an analogous link table constructed using ccm_link (funda_ccm_link below). First, let’s construct the subset of comp.funda of interest.

funda_mod <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") |>
  mutate(mkt_cap = prcc_f * csho) |>
  select(gvkey, datadate, cusip, at, mkt_cap) 

Our source for PERMNO-CUSIP links is stocknames. There are some cases where there is no value on ncusip, but there is a value on cusip and we use coalesce() to fill in missing values in such cases.

stocknames <- tbl(pg, Id(schema = "crsp", table = "stocknames"))

stocknames_plus <-
  stocknames |>
  mutate(ncusip = coalesce(ncusip, cusip))

Now we can construct funda_cusip_link containing CUSIP-based matches for each (gkvey, datadate).

funda_cusip_link <-
  funda_mod |>
  mutate(ncusip = substr(cusip, 1L, 8L)) |>
  inner_join(stocknames_plus, 
             join_by(ncusip, 
                     between(datadate, namedt, nameenddt))) |>
  select(gvkey, datadate, permno, permco)

Similarly, we can construct our matches using ccm_link.

funda_ccm_link <-
  funda_mod |>
  select(gvkey, datadate) |>
  inner_join(ccm_link, 
             join_by(gvkey, 
                     between(datadate, linkdt, linkenddt))) |>
  select(gvkey, datadate, lpermno, lpermco) |>
  rename(permno = lpermno, permco = lpermco)

Finally, we combine both sets of matches in funda_link_combined for comparison.

funda_link_combined <-
  funda_mod |>
  select(-cusip) |>
  left_join(funda_ccm_link, by = join_by(gvkey, datadate)) |>
  left_join(funda_cusip_link,
            by = join_by(gvkey, datadate), 
            suffix = c("_ccm", "_cusip")) |>
  mutate(same_permno = permno_ccm == permno_cusip,
         same_permco = permco_ccm == permco_cusip,
         has_permno_ccm = !is.na(permno_ccm),
         has_permno_cusip = !is.na(permno_cusip)) |>
  filter(has_permno_ccm | has_permno_cusip) |>
  collect()

Regarding Table 8.3, we can probably view the cases with same_permno as valid matches, but would probably need to check the cases where same_permno is FALSE.

funda_link_combined |>
  count(same_permno, same_permco)
Table 8.3: Comparison of CCM- and CUSIP-based links
same_permno same_permco n
FALSE FALSE 130
FALSE TRUE 160
TRUE TRUE 244017
NA NA 113429

The cases where same_permno is NA in Table 8.3 are explored in Table 8.4. We would need to investigate the cases where one of permno_ccm or permno_cusip is NA to understand the source of the non-matches in one table or the other. However, a reasonable view seems to be that ccm_link provides many valid matches that are lost when matching using CUSIPs and for this reason crsp.ccmxpf_lnkhist should be preferred to CUSIP-based matches.

funda_link_combined |>
  count(has_permno_ccm, has_permno_cusip)
Table 8.4: Differences in coverage of CCM- and CUSIP-based link tables
has_permno_ccm has_permno_cusip n
FALSE TRUE 3252
TRUE FALSE 110177
TRUE TRUE 244307

8.5 Exercises

  1. The CRSP table crsp.stocknames includes two CUSIP-related fields, cusip and ncusip. What are the differences between the two fields? What does it mean when ncusip is missing, but cusip is present?

  2. Like CUSIPs, PERMNOs are security-level identifiers. Can a PERMNO be associated with more than one CUSIP at a given point in time? Can a PERMNO be associated with more than one CUSIP over time? Can a CUSIP be associated with more than one PERMNO over time?

  3. Looking at entries on crsp.stocknames where ticker is DELL, we see two different permno values. What explains this?

stocknames |> 
  filter(grepl('^DELL ', comnam)) |> 
  select(permno, cusip, ncusip, comnam, siccd, namedt, nameenddt)
permno cusip ncusip comnam siccd namedt nameenddt
11081 24702R10 24702510 DELL COMPUTER CORP 3570 1988-06-22 2003-07-21
11081 24702R10 24702R10 DELL INC 3570 2003-07-22 2013-10-29
16267 24703L10 24703L10 DELL TECHNOLOGIES INC 3824 2016-09-07 2018-12-27
18267 24703L20 24703L20 DELL TECHNOLOGIES INC 3824 2018-12-28 2020-03-22
18267 24703L20 24703L20 DELL TECHNOLOGIES INC 3571 2020-03-23 2023-12-29
  1. Looking at permno of 11081 (Dell), we see two different CUSIP values. What change appears to have caused a change in CUSIP for what CRSP regards as the same security?

  2. Based on the results shown in Table 8.3 and Table 8.4, what would be the main issue with using a CUSIP-based match of CRSP with Compustat? Choose two rows from the two tables where one of the columns is FALSE. Can you discern from the underlying tables what the issue is creating a FALSEvalue and which match (if any) is valid? Can you conclude that the CCM-based match is the preferred one in each case?


  1. Of course, PERMNOs, CUSIPs, and tickers (at best) identify securities, not firms. More on this below.↩︎

  2. In the United States, a Social Security Number (SSN) is a pretty robust identifier of people, as would be a Tax File Number (TFN) in Australia. Though as researchers, we generally don’t have access to SSNs or TFNs.↩︎

  3. Obviously, we are assuming that you recognize the various identifiers. If not, read on.↩︎

  4. Coverage of CIKs is deferred to Chapter 23 and we do not use IBES data in this book.↩︎

  5. See the CRSP documentation for details on the other CRSP databases, such as CRSP US Treasury and Inflation Series, CRSP Mutual Funds, and CRSP/Ziman Real Estate Data Series.↩︎

  6. WRDS says “SAS programmers should use the Link History dataset (ccmxpf_lnkhist) from CRSP”.↩︎

  7. See here for details.↩︎

  8. The same seems to be true for the case with gvkey of 003581. Again, it’s not clear why CRSP switched the primary permno on 2018-01-01.↩︎

  9. See here for details.↩︎

  10. Only the last observation for a calendar year is kept for CRSP and year means year(datadate) for Compustat.↩︎