5  An intermezzo

I take a brief pause in this chapter to organize some bigger-picture thoughts about Chapter 4 of Tanimura (2021) before launching into a to-be-written alternative version of that chapter in Chapter 6. This chapter is not intended to be particularly coherent; it is meant more as a dumping ground for some raw materials that might be used to create an alternative version of Chapter 4 of Tanimura (2021).

I think there is a lot to like about Chapter 4 of Tanimura (2021). First, I love the use of a real data set, as I think it gets readers and students thinking about the data more deeply than is possible with simulated [fake] data. Also, real data are more apt to cough up interesting complexities that are not easily anticipated with fake data. Second, the focus on the steps needed to get the data ready for analysis distinguishes Chapter 4 of Tanimura (2021)—indeed the whole book—both from books that focus on more elementary SQL (e.g., CREATE TABLE) and simpler data and from books that get hung up on statistical methods and neglect the all-important steps of preparing the data. Third, there is a richness to the examples. Tanimura (2021) pushes the legislators data hard.

That said, there are some weaknesses. First, at 54 pages, the chapter is probably too long. While it is notionally shorter than Chapter 3 of Tanimura (2021), Chapter 3 is really two chapters: one on data types (the first 13 pages) and another on the retail sales data set (47 pages). My sense is that it might make sense to drop a couple of queries and perhaps use a different data set for some aspects (e.g., the cumulative analysis feels a bit forced with the legislators data). Second, not every query is motivated by a precise question. As such, it can be hard to understand why various choices are being made. Third, I think this chapter nicely illustrates the outer limits of what one wants to be doing with straight SQL and nested subqueries. I think one can do better using dbplyr to generate SQL, as dbplyr allows more of a modular, building-block approach to some of the queries. For example, one can split apart the formation of cohorts from the arrangement of the basic survival data; I illustrate some ideas along these lines at the end of this chapter. My guess is that this could also lead to a shorter chapter. Fourth, Tanimura (2021) does not discuss approaches to handling censored data. While censoring is something more often found in an advanced statistics book, I think it can be handled at a level that does not demand high levels of statistical sophistication (e.g., see Chapter 10 of Singer and Willett 2003).1

Finally, I feel that the taxonomy provided in Tanimura (2021) is not as helpful as it might be. Perhaps an alternative taxonomy would be better. So I start with a taxonomy of analyses using survival analysis as the basic platform.

5.0.1 An alternative taxonomy?

The situations for which survival analysis was named can be framed either in terms of the period of survival or in terms of the event which denotes the end of the survival period (e.g., death). While many sources (e.g., Singer and Willett 2003, 306) focus on the event, one can equally well focus on the states (i.e., “alive” before the event and [euphemistically] “not alive” after the event).

With the terminology established, the first dichotomy is between once-only events and possibly repeated events. With sensu stricto survivorship analysis, where the “event” is death, the event occurs only once.

In other situations, the “once-only” nature of the event studied is a function of how it is defined by the analyst. Singer and Willett (2003) provide examples such as time to relapse among recently treated alcoholics and age at first suicide ideation. While one could track alcoholics over an extended period that allows them the opportunity to relapse, re-enter treatment, and then perhaps relapse again, defining relapse as the first such event after treatment means that it can occur at most once. Because of the way these events have been defined, if the event had not occurred by some date later than \(t\), then it had not occurred by \(t\).

Often defining the event in such a way that it can happen at most once provides clear analytical benefits. Apart from simplifying data collation, a focus on once-only events allows an analyst to use one of the many statistical models adapted to such situations. For example, Singer and Willett (2003) “focuses exclusively on the most common and fundamental problem: the analysis of single spells [i.e., once-only events]” [p. 311]. Singer and Willett (2003) provides tools for discrete-time and continuous-time analyses, including the Cox regression model and extensions to that model. While such models are beyond the scope of this book, they can be implemented once an analyst has organized the data into the canonical form for survival data. My focus below is on organizing data for such analysis.

A second dichotomy relates to whether the states are inherently continuous or involve repeated actions. Conventional understanding has it that a person begins life at (or near) the date of birth and is alive until the date of death. “Being alive” is a continuous state. In the context of Chapter 4 of Tanimura (2021), a legislator is elected to a term and remains a legislator until the term ends, either through expiration of the term, resignation, or death.

But in other situations, there may be no inherently continuous state. For example, suppose (very hypothetically) that I were a regular Starbucks customer. Almost every morning I stop by the local Starbucks for a muffin and my regular coffee. Sometimes I pick up an afternoon caramel macchiato and I might grab breakfast at the airport Starbucks when travelling. Almost never do I partake in the Seattle-based company’s wares on weekends.

In this scenario, I might reasonably be described as a “regular Starbucks customer” even in the evening or on weekends. The challenge for the data analyst is that, while there is no continuous state observed in the data as there is with congressional terms or human lifetimes, we might want to treat “regular Starbucks customer” as the state of interest and therefore need to define “regular Starbucks customer” in terms of actions (e.g., at least one purchase per month) or through related states (e.g., being a member of the Starbucks rewards program).
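To make this concrete, here is a minimal, hypothetical sketch of defining a state from discrete actions. The purchases table (with columns customer_id and purchase_date) is assumed purely for illustration and is not part of the data used in this chapter; the definition adopted is “at least one purchase in a calendar month”.

regular_months <-
  purchases |>
  # Assign each purchase to a calendar month
  mutate(month = floor_date(purchase_date, "month")) |>
  # A customer is deemed "regular" in any month with at least one purchase
  distinct(customer_id, month) |>
  mutate(regular = TRUE)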

In some cases, there will be underlying states of interest that are continuous, but the data analyst might need to infer these states from discrete observations in the data.

My first thought is that the mapping from these two dichotomies to the four types of cohort analysis identified by Tanimura (2021) is not entirely clear. Tanimura (2021, 124) suggests that retention analysis is particularly relevant when the activity engaged in involves “repeated actions”, but it is not clear whether it is the first or the second dichotomy above that is relevant here. Depending on the revenue model, “playing an online game” might involve either discrete actions over time or maintaining a subscription.

In other places, the rationale for the taxonomy of Tanimura (2021) is unclear. Tanimura (2021, 124) suggests that retention analysis is “concerned with whether the cohort member has a record in the time series on a particular date, expressed as a number of periods from the starting date.” But it is not clear that this is the natural choice to make in all cases.

An illustration of the implications of this definition can be seen in Table 5.11 near the end of this chapter. There it can be seen that John Quincy Adams (“JQA”) would be considered to have been “retained” in periods 28 through 45 notwithstanding his absence during periods 6 through 27, including a stint as president from 1825–1829. If we think of this analysis as being one about customer retention, we might have incurred significant customer acquisition costs (e.g., trial periods or discounts) to get JQA onboard in period 0 and then incurred those same costs all over again to sign up JQA in period 28. For the purposes of customer retention analysis (e.g., evaluating the effectiveness of AOL-style CDs distributed in the mail … we are talking about the olden days, after all), we might want to either ignore periods after losing JQA as a customer, or treat these as a completely different stint (see below). While this does not make the retention analysis in Tanimura (2021) necessarily wrong, it does suggest that the taxonomy is not exhaustive of the ways we might want to structure an analysis.

Tanimura (2021, 124) suggests that what distinguishes survivorship analysis is that “instead of calculating whether an entity was present in a certain period [as in retention analysis], we calculate whether the entity is present in that period or later in the time series.” One rationale for this distinction is that, if we observe that an individual is in the “alive” state at both time \(0\) and time \(2\), we can conclude that that individual was alive at time \(1\). In this sense, the distinction may be about making inferences about continuous states from discrete events or actions. Unfortunately, the legislators data are not a good setting for making such inferences in general, as one can be a senator, then not a senator, then a senator again and—though we learn below that some “filling” may be needed—the data on the terms of legislators are provided to us in a way that allows us to deduce the continuous state (“being a legislator”) without needing to fill in gaps between discrete events.

It is also not clear that the survivorship-versus-retention dichotomy of Tanimura (2021) has anything to do with whether the events are once-only events or not. For once-only events, being alive at time \(t\) is implied by being alive at time \(t\) “or later in the time series.” Elsewhere Tanimura (2021, 127) suggests that often “a cohort can lose but cannot gain new members once it is formed.” So for many purposes the survivorship-versus-retention dichotomy only matters when the events are not once-only.

The other two categories of analysis in Tanimura (2021, 124) are returnship and cumulative. Returnship examines repeated actions occurring “more than some minimum threshold of times … during a fixed window of time” and cumulative analyses “are concerned with the total number of times or amounts measured at one or more fixed time windows”. Returnship seems to involve counting over windows, while cumulative analysis likely requires summing some measure over windows. Given that a count is simply a sum of ones indicating an action, this suggests that returnship is perhaps a species of cumulative analysis.

My hunch is that the latter two categories are not at the same level as the first two. One could do returnship and cumulative analysis in situations where there are once-only events or where multiple events are possible. Similarly, one could do returnship and cumulative analysis in situations where the states are inherently continuous or involve repeated actions.

I argue that what distinguishes the latter two categories from the other analyses is the metric chosen. For survival, we are in effect integrating the function \(f(t) = 1\) from the start of the window to the end. With cumulative analysis, the function we are integrating might be, say, \(f(t) = \text{Revenue}(t)\) if we are looking to accumulate revenue, and for returnship we might be summing a count function over some time window. This view gets some support from Tanimura (2021, 127), where retention analysis is described as potentially including “the count of entities or sum of money or actions” [i.e., things discussed in the sections on returnship and cumulative analyses].
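To put this view in symbols (my formalization, not one found in Tanimura 2021): with entry at time \(0\) and a window ending at \(T\),

\[ \text{survival: } \int_0^T 1 \, dt = T, \qquad \text{cumulative: } \int_0^T \text{Revenue}(t) \, dt, \qquad \text{returnship: } \sum_{t \in [0, T]} \mathbb{1}[\text{action at } t] \geq k, \]

where \(k\) is the minimum threshold of times.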

That said, my sense is that returnship and cumulative analysis can present novel challenges from a data analysis perspective. The legislators data set does appear to provide an interesting example for returnship in the form of the question: “What proportion of legislators who start as representatives go on to become senators?” But this is just a survival analysis with the “event” being “becoming senator” rather than “ceasing to be a legislator” and the population limited to “legislators who start as a representative” rather than including all legislators. In contrast, the “cumulative” example using legislators (average number of terms) feels a bit more contrived and likely quite different in terms of the structure of the code from what one would need for “cumulative revenue per customer by cohort” or similar.
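To illustrate, the following sketch (my code, not from Tanimura 2021, and using the legislators_terms table loaded below) organizes the representative-to-senator question in the canonical survival form described in Section 5.1.1: the cohort comprises legislators whose first term was as a representative, entry_date is the start of that first term, and event_date is the start of the first Senate term, with NA indicating that no Senate term is observed.

rep_starters <-
  legislators_terms |>
  group_by(id_bioguide) |>
  window_order(term_start) |>
  # Flag legislators whose first term was as a representative
  mutate(first_type = first(term_type)) |>
  ungroup() |>
  filter(first_type == "rep")

rep_to_sen <-
  rep_starters |>
  group_by(id_bioguide) |>
  summarize(entry_date = min(term_start, na.rm = TRUE),
            # Earliest Senate term, if any (NA when none observed)
            event_date = min(if_else(term_type == "sen",
                                     term_start, NA),
                             na.rm = TRUE),
            .groups = "drop")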

My hunch is that the most elemental analysis is one mirroring the original kind of survival analysis: once-only events with continuous states. Once this setting has been covered, one could cover issues involved in handling potentially repeated events and those arising when one observes discrete actions rather than events delineating boundaries between continuous states.

My other hunch is that the analysis is much easier to understand when it is motivated by a clear question, such as the one used in the section covering returnship as discussed above. It isn’t clear why the retention curve in Figure 4-4 of Tanimura (2021, 135) includes legislators at time \(t\) even if they have had a spell out of office prior to \(t\) (“survivorship” analyses would seem to exclude these), yet does not include legislators at time \(t\) if they are out of office at that time but will be in office at a later date (“retention” analyses would seem to include these). If the question that Figure 4-4 of Tanimura (2021, 135) intends to answer were stated more clearly, then the choices to be made might also be clearer.

library(DBI)
library(tidyverse)
library(dbplyr)
library(ggplot2)
library(farr)
library(jsonlite)
# Connect to an in-memory DuckDB database and load the data sets
db <- dbConnect(duckdb::duckdb())
legislators_terms <- load_parquet(db, "legislators_terms")
legislators <- load_parquet(db, "legislators")

In addition to the two tables above, in this chapter I will use an additional data frame with data on congressional terms.

congress_terms <- load_parquet(db, "congress_terms")

5.1 Survival data: The basic building block

Singer and Willett (2003, 306) suggest that “researchers use survival analysis in a variety of contexts that share a common characteristic: interest centers on describing whether events occur or when events occur.”

5.1.1 Survival: Once-only events

I suggest that the canonical form of data for survival analysis is a table with the following columns:

  • Identifier: One or more columns that identify the individuals in the sample. In the legislators data, this would be id_bioguide.
  • Entry date: The relevant start date for the individual.
  • Event date: The date, if any, on which the relevant event occurs for the individual.

Most queries involving cohorts in Chapter 2 do not use a survivorship approach.2 The queries using a survivorship approach are given in Section 2.4.1.

Before moving on to discuss the census-date approach, I spend some time considering the approaches used in Section 2.4.1. The first query (Tanimura 2021, 154–55) effectively organizes the data into the canonical form for survival data with first_century as the cohort, except that the exit date is measured as term_start for each legislator’s last term.3 In effect, any gaps in a legislator’s service are filled in, in the sense that any legislator with a term covering any period after five years would be deemed to have “survived” for five years.

What is not clear from Tanimura (2021) is why one would prefer “survivorship” analysis (with filling-in) over “retention” analysis, as there is no real context given for the queries. One example that is given does not help much: “if we are concerned about the number of students who are still in school after a certain number of years, we could look for the absence of a graduation event in a data set” (Tanimura 2021, 154). But suppose our cohort comprises students who begin high school, where high school is understood as covering grades 9 through 12. It would not seem correct to treat a student who completed grades 9 and 10, then dropped out for a year before returning to complete the eleventh grade and then dropping out again as having “survived” four years of high school.

5.1.2 Survival: Censoring

Note that we will often not observe event dates due to censoring. While events may be inevitable for any given individual (the proverbial death and taxes), such events need not occur during the period covered by the sample under study. This fact may not always be obvious from the data set in question.

To illustrate this point, I consider the legislators data. It is important to recognize that the data in legislators and legislators_terms were obtained at some unspecified date in the past.

max_term_start <-
  legislators_terms |>
  summarize(max(term_start, na.rm = TRUE)) |>
  pull()

max_term_start
[1] "2020-05-19"

As we can see, there are no terms in the data starting after the November 2020 election, suggesting that the data were extracted before terms related to that election started. Looking at the term_end dates, we see that no member of the House of Representatives (term_type == "rep") has a term ending after 3 January 2021.

legislators_terms |>
  filter(term_end >= max_term_start) |>
  count(term_type, term_end) |>
  arrange(term_type, term_end)
term_type term_end n
rep 2020-05-22 1
rep 2021-01-03 437
sen 2020-11-03 1
sen 2021-01-03 34
sen 2023-01-03 32
sen 2025-01-03 33

As we are in “the future” relative to whenever the legislators data were extracted, we can gain some insight into when the data were collected by comparing them with the data we can get today from the GitHub repository from which they were obtained. The data on legislators’ terms are divided into two JSON files—one for current legislators and one for past legislators. Using JSON wrangling of the kind I have written about in the past, the function get_terms() below processes either of these two files and returns a data frame.

get_terms <- function(file) {

  url <- str_c("https://theunitedstates.io/congress-legislators/", file)
  
  # Read the JSON file, then unnest to one row per legislator,
  # keeping the id and terms list columns
  tibble(json = list(read_json(url))) |>
    unnest_longer(json) |>
    unnest_wider(json) |>
    select(id, terms) |>
    # Extract the bioguide identifier and unnest to one row per term
    unnest_wider(id) |> 
    select(bioguide, terms) |>
    unnest_longer(terms) |>
    unnest_wider(terms) |>
    select(bioguide, type, start, end, state, district, 
           party, url, class) |>
    # Align column names with those used in legislators_terms
    rename_with(\(x) str_c("term_", x), c(type, start, end)) |>
    rename(id_bioguide = bioguide)
}

I apply get_terms() twice to get the data, combine the results, and push them to the DuckDB database instance containing the older data sets.

updated_terms <-
  get_terms("legislators-current.json") |>
  bind_rows(get_terms("legislators-historical.json")) |>
  copy_to(db, df = _, name = "updated_terms", overwrite = TRUE)

Finally, I join the old and new data and look for cases where the updated term_end differs from that in the older data set. Such cases are shown in Table 5.1.

updated_term_ends <-
  legislators |>
  select(id_bioguide, full_name) |>
  inner_join(updated_terms, by = "id_bioguide") |>
  inner_join(legislators_terms,
             by = c("id_bioguide", "term_start"),
             suffix = c("_updated", "")) |>
  filter(term_end != term_end_updated) |>
  select(full_name, term_type, term_start, term_end,
         term_end_updated) |>
  arrange(term_end_updated) 
Table 5.1: Legislators with changed term_end values in updated data
full_name term_type term_start term_end term_end_updated
John Lewis rep 2019-01-03 2021-01-03 2020-07-17
Tom Graves rep 2019-01-03 2021-01-03 2020-10-04
Martha McSally sen 2019-01-03 2021-01-03 2020-12-02
Paul Cook rep 2019-01-03 2021-01-03 2020-12-07
Kamala D. Harris sen 2017-01-03 2023-01-03 2021-01-18
Kelly Loeffler sen 2020-01-06 2020-11-03 2021-01-20
Dianne Feinstein sen 2019-01-03 2025-01-03 2023-09-28
Robert Menendez sen 2019-01-03 2025-01-03 2024-08-20

Looking at the table, we see that the earliest updated term_end relates to John Lewis, a civil rights icon and long-time legislator who passed away on 17 July 2020. It seems reasonable to assume that the data were collated prior to that date and, lacking more precise information, I assume that the extraction date was 30 June 2020.

extract_date <- as.Date("2020-06-30")

Once we fix extract_date, we should consider all term_end dates after that date as merely expected term_end dates, as subsequent events might have led some members of Congress to finish their terms early. For this reason, it might seem appropriate to code such term_end values as “don’t know” or NA (NULL in SQL). This is especially so when we view last_term as representing the end of a legislator’s career, as we simply don’t know when a legislator’s career will end. But coding the data in this way would lead to a loss of the information impounded in the fact that a legislator’s term has extended at least as far as 30 June 2020. For this reason, the standard approach is to code term_end as the last observed date and create an indicator variable that informs users of the data that the observation has been censored. The following code creates survival_data, a data frame containing legislator data in canonical survival form. A sample of these data is provided in Table 5.2.

survival_data <-
  legislators_terms |>
  group_by(id_bioguide) |>
  summarize(entry_date = min(term_start, na.rm = TRUE),
            exit_date = max(term_end, na.rm = TRUE),
            .groups = "drop") |>
  mutate(censored = exit_date > extract_date,
         exit_date = if_else(censored, extract_date, exit_date))
survival_data
Table 5.2: Legislator data in canonical survival form
id_bioguide entry_date exit_date censored
T000464 2007-01-04 2020-06-30 TRUE
W000802 2007-01-04 2020-06-30 TRUE
M001176 2009-01-06 2020-06-30 TRUE
R000584 2009-01-06 2020-06-30 TRUE
C001110 2017-01-03 2020-06-30 TRUE
B001267 2009-01-22 2020-06-30 TRUE
C001094 2013-01-03 2020-06-30 TRUE
B001250 2003-01-07 2020-06-30 TRUE
S001193 2013-01-03 2020-06-30 TRUE
B000574 1996-05-21 2020-06-30 TRUE

Tanimura (2021) does not address the issue of censoring. This likely reflects the emphasis of that book on preparing data for descriptive analysis in business contexts rather than on more sophisticated statistical analysis, such as that appearing in scientific journals. However, it is important to account for censoring in many practical business questions. For instance, Chapter 7 of Tanimura (2021) discusses experimental analysis, and one could imagine looking at the effects on customer retention of some marketing initiative using A/B testing of the kind studied there. In this situation, it will be common for the retention periods of some customers to be censored.

It is worth noting that what I am calling a canonical form for survival data differs from what Singer and Willett (2003) call the person-level form of data. The main difference in the person-level form seen in Figure 10.4 of Singer and Willett (2003, 353) is that (using variable names from our current context) entry_date and exit_date have been converted into a “time to event” form reflecting the number of units of time between entry_date and exit_date. One corollary of this aspect of the person-level form is, as pointed out by Singer and Willett (2003, 310), a need to specify a “metric for clocking time” (e.g., days or years). With the canonical data form I present here, I effectively defer the need to choose the metric, though, as will be seen, one cannot avoid this choice altogether.
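As a sketch of what such a conversion might look like, measuring time in years and reusing the year(age(...)) idiom that appears elsewhere in this chapter:

person_level <-
  survival_data |>
  # Convert entry and exit dates into a "time to event" in years
  mutate(years_survived = year(age(exit_date, entry_date)))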

5.1.3 Events that can happen more than once: The census-date approach

In many contexts, the event of interest will be a once-only event. In some cases, the one-off nature of the event will be a function of the underlying realities (e.g., sensu stricto survivorship analysis in which individuals die only once). In other cases, the one-off nature of the event will derive from the definition employed by the researcher (e.g., first relapse in the study of alcoholism above). Such definitions might be driven by what is of greatest interest to researchers, or they may lead to simplifications with analytical benefits.

However, in many cases it will be important to recognize that events can occur for an individual more than once. A subscription streaming service such as Netflix or Max will have customers who sign up for a monthly subscription, drop off some months later, then rejoin at a later date. Ignoring this behaviour in analysis would be problematic if it is common enough to represent a significant portion of a service’s business.

There are two approaches to addressing these situations. One approach is to focus analysis on “whether the cohort member has a record in the time series on a particular date, expressed as a number of periods from the starting date” (Tanimura 2021, 124). There is a little ambiguity embedded in this sentence regarding how the term “period” is defined. One definition turns on dates specific to an individual, such as age, which is typically reckoned as the number of birthdays an individual has experienced.4 Another definition, which I call the census-date approach, reckons periods using a common date across individuals or cohorts.

Each census-date analysis in Chapter 2 uses 31 December of each year as the census date. For reasons we will learn about in Section 5.1.4, the choice of 31 December is appropriate because the apparent gaps that appear in the tenures of legislators in certain parts of the year do not affect 31 December. To implement this approach, I use the data frame year_ends below, which I create in R before copying to DuckDB.

year_ends <-
  tibble(year = 1770:2030L) |>
  mutate(date = ymd(str_c(year, "-12-31"))) |>
  select(date) |>
  copy_to(db, df = _, name = "year_ends", overwrite = TRUE)

5.1.4 Events that can happen more than once: Stints

The Oxford English Dictionary defines a stint as “a person’s fixed or allotted period of work.” In the current context, I define a stint as a continuous period in office, either as a representative or as a senator, including periods that comprise multiple terms. The critical thing for current purposes is that one can have more than one stint as a legislator. One could have one stint as a representative, followed by a stint as a senator, followed by a second stint as a representative. An alternative term for a stint might be a spell, as used by Singer and Willett (2003, 311) in the quote above.

As always, to better understand the issues one faces in practice, it is helpful to look at real data. I start with data in legislators_terms related to Nancy Pelosi, a long-serving representative from California.

pelosi_terms <-
  legislators_terms |> 
  filter(id_bioguide == "P000197") |>
  select(id_bioguide, term_type, term_start, term_end, state, district)
Table 5.3: Congressional terms of Nancy Pelosi
id_bioguide term_type term_start term_end state district
P000197 rep 1987-01-06 1989-01-03 CA 5
P000197 rep 1989-01-03 1991-01-03 CA 5
P000197 rep 1991-01-03 1993-01-03 CA 5
P000197 rep 1993-01-05 1995-01-03 CA 8
P000197 rep 1995-01-04 1997-01-03 CA 8
P000197 rep 1997-01-07 1999-01-03 CA 8
P000197 rep 1999-01-06 2001-01-03 CA 8
P000197 rep 2001-01-03 2003-01-03 CA 8
P000197 rep 2003-01-07 2005-01-03 CA 8
P000197 rep 2005-01-04 2007-01-03 CA 8
P000197 rep 2007-01-04 2009-01-03 CA 8
P000197 rep 2009-01-06 2011-01-03 CA 8
P000197 rep 2011-01-05 2013-01-03 CA 8
P000197 rep 2013-01-03 2015-01-03 CA 12
P000197 rep 2015-01-06 2017-01-03 CA 12
P000197 rep 2017-01-03 2019-01-03 CA 12
P000197 rep 2019-01-03 2021-01-03 CA 12

As can be seen in Table 5.3, Nancy Pelosi had served continuously since 1987.5 She started out as the representative for California’s 5th congressional district, then for its 8th, then for its 12th.6 In other words, Pelosi has had a single stint as a legislator.

It might seem from Nancy Pelosi’s later terms that we could identify continuing stints by comparing term_start with the preceding term_end value and coding two terms as part of the same stint if the dates are the same. This approach would work, for example, for the term starting 2019-01-03 and the preceding term ending the same day.

However, if we calculate gap, the number of days between the end of the previous term and the start of the current one, we see that there is often a non-zero gap between terms as coded in the legislators_terms data set. This can be seen in Table 5.4, which presents data from pelosi_term_gaps calculated below.

pelosi_term_gaps <-
  pelosi_terms |>
  group_by(id_bioguide) |>
  window_order(term_start) |>
  mutate(gap = term_start - lag(term_end)) |>
  ungroup() |>
  select(id_bioguide, term_type, term_start, term_end, gap)
Table 5.4: Congressional terms of Nancy Pelosi with gap
id_bioguide term_type term_start term_end gap
P000197 rep 1987-01-06 1989-01-03 NA
P000197 rep 1989-01-03 1991-01-03 0
P000197 rep 1991-01-03 1993-01-03 0
P000197 rep 1993-01-05 1995-01-03 2
P000197 rep 1995-01-04 1997-01-03 1
P000197 rep 1997-01-07 1999-01-03 4
P000197 rep 1999-01-06 2001-01-03 3
P000197 rep 2001-01-03 2003-01-03 0

In Table 5.4, we can see that there are gaps of as much as 4 days between the end of one term and the start of the next. These gaps would be problematic for a census-date approach that used 4 January as the census date, as Nancy Pelosi would appear not to be a legislator on that date in 1993, 1997, or 1999. In contrast, the census-date approach using 31 December would not be affected by these gaps.
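To see this concretely, the following sketch builds a hypothetical jan_4s spine of 4 January dates (analogous to year_ends) and uses an anti_join to list the census dates on which Pelosi was not in office. These include the 4 January dates falling in the 1993, 1997, and 1999 gaps shown above (as well as 4 January 1987, which precedes her first term).

jan_4s <-
  tibble(year = 1987:2020L) |>
  mutate(date = ymd(str_c(year, "-01-04"))) |>
  select(date) |>
  copy_to(db, df = _, name = "jan_4s", overwrite = TRUE)

# Census dates not covered by any of Pelosi's terms
jan_4s |>
  anti_join(pelosi_terms,
            join_by(between(x$date, y$term_start, y$term_end))) |>
  arrange(date)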

The shortness of the gaps above raises the possibility that one approach would be to consider two terms to be part of the same stint if there is just a short gap between them. But what is a “short gap” for this purpose? Would four days suffice? Looking at the terms of James Abdnor (id_bioguide == "A000009") in Table 5.5 suggests not.

Table 5.5: Congressional terms of James Abdnor
id_bioguide term_type term_start term_end gap
A000009 rep 1973-01-03 1975-01-03 NA
A000009 rep 1975-01-14 1977-01-03 11
A000009 rep 1977-01-04 1979-01-03 1
A000009 rep 1979-01-15 1981-01-03 12
A000009 sen 1981-01-05 1987-01-03 2

So maybe a month (say 30 days)? Looking at the terms of John Quincy Adams (id_bioguide == "A000041") in Table 5.6, it seems this would not work.

Table 5.6: Congressional terms of John Quincy Adams
id_bioguide term_type term_start term_end gap
A000041 sen 1803-10-17 1809-03-03 NA
A000041 rep 1831-12-05 1833-03-03 8312
A000041 rep 1833-12-02 1835-03-03 274
A000041 rep 1835-12-07 1837-03-03 279
A000041 rep 1837-09-04 1839-03-03 185
A000041 rep 1839-12-02 1841-03-03 274
A000041 rep 1841-05-31 1843-03-03 89
A000041 rep 1843-12-04 1845-03-03 276
A000041 rep 1845-12-01 1847-03-03 273
A000041 rep 1847-12-06 1849-03-03 278

What’s going on here? If we go to the website of the United States Senate, we can see that sessions of Congress often began later in the year in the 19th century. Furthermore, the term_start data in legislators_terms appear to be based on the dates on which legislative sessions started. Data on these dates—derived from the United States Senate website—are contained in congress_terms.

congress_terms |> 
  filter(congress <= 5)
congress term_start term_end
1 1789-03-04 1791-03-03
2 1791-10-24 1793-03-03
3 1793-12-02 1795-03-03
4 1795-12-07 1797-03-03
5 1797-05-15 1799-03-03
congress_terms |> 
  filter(congress >= 114)
congress term_start term_end
114 2015-01-06 2017-01-03
115 2017-01-03 2019-01-03
116 2019-01-03 2021-01-03
117 2021-01-03 2023-01-03
118 2023-01-03 2025-01-03

Fortunately, each congress ends in the same year as the succeeding one starts:

congress_terms |> 
  window_order(term_start) |> 
  count(same_year = year(term_start) == year(lag(term_end)))
same_year n
TRUE 117
NA 1

Using this fact and going term by term for each legislator and term type, we can identify a new stint for a legislator as one where there is no preceding term (i.e., the first stint) or the preceding term does not end in the same year as the current one starts. Note that, because I group by id_bioguide and term_type, if the preceding term was of a different type (rep or sen) from the current one, then the current term will be considered as starting a new stint. We can then number each stint by summing the numeric values of new_stint.

stints_raw <-
  legislators_terms |>
  group_by(id_bioguide, term_type) |>
  window_order(term_start) |>
  mutate(same_year = year(term_start) == year(lag(term_end)),
         new_stint = is.na(same_year) | !same_year) |>
  group_by(id_bioguide) |>
  window_order(term_start) |>
  mutate(stint_number = cumsum(as.integer(new_stint))) |>
  ungroup() |>
  arrange(id_bioguide, term_start) |>
  select(id_bioguide, term_type, term_start, term_end, 
         same_year, new_stint, stint_number) |>
  compute()

After adding stint_number to the data on legislator terms, we can organize the data by stints, which reduces the size of the data and facilitates analysis by stint. I store this reduced data set in stints.

stints <-
  stints_raw |>
  group_by(id_bioguide, term_type, stint_number) |>
  summarize(stint_start = min(term_start, na.rm = TRUE),
            stint_end = max(term_end, na.rm = TRUE),
            .groups = "drop")

We can revisit the three legislators discussed at the start of this section to confirm that their terms are organized into stints in the way we would expect. As can be seen in Table 5.7, we have two stints each for James Abdnor and John Quincy Adams and one stint for Nancy Pelosi.

Table 5.7: Congressional stints of example legislators
id_bioguide full_name term_type stint_number stint_start stint_end
A000009 James Abdnor rep 1 1973-01-03 1981-01-03
A000009 James Abdnor sen 2 1981-01-05 1987-01-03
A000041 John Quincy Adams sen 1 1803-10-17 1809-03-03
A000041 John Quincy Adams rep 2 1831-12-05 1849-03-03
P000197 Nancy Pelosi rep 1 1987-01-06 2021-01-03
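
Table 5.8 tabulates the number of legislators reaching each stint number. The query behind it is not shown in this draft, but a minimal sketch using stints would be:

stints |>
  count(stint_number) |>
  arrange(stint_number)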

Looking at Table 5.8, we see that there are two legislators with six stints each.

Table 5.8: Distribution of number of stints
stint_number n
1 12518
2 1775
3 250
4 33
5 7
6 2

Is this an error? I collect data on these two legislators in six_stinters.

six_stinters <-
  stints |>
  filter(stint_number == 6) |>
  distinct(id_bioguide) |>
  inner_join(legislators, by = "id_bioguide") |>
  select(id_bioguide, full_name) |>
  inner_join(stints, by = "id_bioguide") |> 
  arrange(id_bioguide, stint_start) 

From Table 5.9, it seems that these are legitimately six-stint legislators. Henry Clay and John Randolph were major political figures over much of the first half of the nineteenth century. Some of their stints were broken by switches from one congressional body to another. Other stints were broken by stints in the executive branch: Clay was Secretary of State and Randolph spent a short time as the US ambassador to Russia.7

Table 5.9: Legislators with six stints
id_bioguide full_name term_type stint_number stint_start stint_end
C000482 Henry Clay sen 1 1806-01-01 1807-03-03
C000482 Henry Clay sen 2 1810-01-01 1811-03-03
C000482 Henry Clay rep 3 1811-11-04 1821-03-03
C000482 Henry Clay rep 4 1823-12-01 1827-03-03
C000482 Henry Clay sen 5 1831-12-05 1843-03-03
C000482 Henry Clay sen 6 1849-12-03 1853-03-03
R000047 John Randolph rep 1 1799-12-02 1813-03-03
R000047 John Randolph rep 2 1815-12-04 1817-03-03
R000047 John Randolph rep 3 1819-12-06 1825-12-26
R000047 John Randolph sen 4 1825-12-26 1827-03-03
R000047 John Randolph rep 5 1827-12-03 1829-03-03
R000047 John Randolph rep 6 1833-12-02 1835-03-03

The fact that congressional terms always end in the same year as the following term begins means that we won’t miss terms that continue previous terms using the year(term_start) == year(lag(term_end)) rule. That is, we should have no “false negatives”.

But do we need to worry about “false positives”: cases where we classify a term as continuing the previous term when there is actually a gap between the two terms? To examine the possibility of false positives, I adapt the stints code above to create problem_stints, which contains terms that began after the start of the congressional term in that year.

problem_stints <-
  legislators_terms |>
  group_by(id_bioguide, term_type) |>
  window_order(term_start) |>
  mutate(year_start = year(term_start),
         same_year = year(term_start) == year(lag(term_end))) |>
  inner_join(
    congress_terms |>
      mutate(year_start = year(term_start)),
    by = "year_start", 
    suffix = c("", "_congress")) |>
  filter(same_year, 
         term_start > term_start_congress) |>
  inner_join(legislators |> select(id_bioguide, full_name),
             by = "id_bioguide") |>
  select(id_bioguide, full_name, term_type, 
         term_start_congress, term_start) |>
  arrange(term_start)

Candidate false positives are presented in Table 5.10.

Table 5.10: Possibly problematic stint data
id_bioguide full_name term_type term_start_congress term_start
J000191 John Warfield Johnston sen 1871-03-04 1871-03-15
T000116 Henry Moore Teller sen 1877-10-15 1877-12-05
W000088 Monrad Charles Wallgren rep 1939-01-03 1939-01-05
D000585 Henry Clarence Dworshak sen 1949-01-03 1949-10-14
G000465 Robert Paul Griffin rep 1965-01-04 1965-01-09
S001180 Kurt Schrader rep 2017-01-03 2017-01-10
J000255 Walter B. Jones rep 2019-01-03 2019-01-04

As with many data tasks, what is required to handle these cases is not some fancy coding or analysis, but simple “shoe leather”. For example, according to The Oregonian, Kurt Schrader missed being sworn in on the first day of the 115th Congress in January 2017 because he was on his honeymoon. Walter B. Jones is another who missed the first day of a Congress. The gaps are somewhat longer in the cases of Henry Moore Teller and Henry Clarence Dworshak.

Teller was one of the initial pair of senators from Colorado, which was admitted as the 38th state of the United States on 1 August 1876. Initial digging into the Teller case turns up the claim that “the Senate confirmed the nomination of Henry Moore Teller as secretary of the interior under President Chester A. Arthur. Teller resigned his Senate seat on April 17, 1876, and served as interior secretary until March 3, 1885.” Hmm? So Teller resigned before Colorado was even a state to serve under Chester Arthur, who didn’t even become president until 1881. This is clearly a mistake (a giveaway is the apparent nine-year stint as cabinet secretary).

5.1.5 Why use census dates?

With events that happen at most once, there is no clear rationale for preferring either census dates or stints. Each observation will have just one stint (albeit with a potentially censored exit date) and, assuming that the date of entry is \(t=0\), being in the sample at or after date \(t=T\) implies being in the sample on any census date between \(0\) and \(T\).

Thus the differences between analyses using census dates and stints will arise when individuals can have multiple stints. So why use census dates rather than calculating stints?

I think one reason for using a census-date approach is that it can be much easier to implement. Constructing stints for legislators required additional data (i.e., congress_terms) and a little data-wrangling gymnastics. In contrast, by using 31 December as the census date (a date unlikely to fall in a “gap”), we can construct measures of tenure with much less effort.

Additionally, the census-date approach makes it easy to implement the chosen “metric for clocking time” (Singer and Willett 2003, 310). By using year_ends, which contains one census date per year, time is effectively measured in years, the metric used in almost all examples in Tanimura (2021).

5.2 Organizing cohort data

There are several different cohorts used in analyses in Chapter 4 of Tanimura (2021).

  • first_year
  • first_century (also called cohort_century or century)
  • first_state
  • gender
  • first_state and gender
  • term_type
  • first_century and first_type
  • date (last day of the year) and century (last_century?)

If you read Chapter 2 carefully, you will see that any individual legislator is placed in no more than one of the cohorts above in any analysis. As such, one particularly clean (and clear) approach would be to create a table comprising the individual identifier (id_bioguide) and the cohort assignment for each individual. For example, the following code creates a first_centuries table with cohort assignments for first_century (here named cohort for reasons that will become clear in a moment).

first_centuries <-
  legislators_terms |>
  group_by(id_bioguide) |>
  summarize(cohort = century(min(term_start, na.rm = TRUE)),
            .groups = "drop")

Note that not all analyses in Chapter 2 involve cohorts. For example, the analysis of tenures (Tanimura 2021, 170–74) is described as “rather than cohorting on first_term, we can cohort on tenure instead.” But I feel it is a little confusing to describe this analysis as involving “cohorting”, as any individual legislator will appear in multiple cohorts over time.8

The other building block is the census data. I first calculate each legislator’s first_term and then create census_data by pairing each term with the 31 December census dates it spans, measuring period in years elapsed since first_term.

first_terms <-
  legislators_terms |>
  group_by(id_bioguide) |>
  summarize(first_term = min(term_start, na.rm = TRUE),
            .groups = "drop")

census_data <-
  legislators_terms |>
  inner_join(first_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(
              between(y$date, x$term_start, x$term_end))) |>
  mutate(period = coalesce(year(age(date, first_term)), 0))

The cohort analysis can now proceed by combining these two data frames and calculating pct_retained after a few data steps.

cen_pct_retaineds <-
  first_centuries |>
  inner_join(census_data, by = "id_bioguide") |>
  group_by(cohort, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop") |>
  group_by(cohort) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained)) |>
  ungroup() |>
  mutate(pct_retained = cohort_retained / cohort_size) |>
  select(cohort, period, cohort_size, cohort_retained,
         pct_retained) |>
  arrange(cohort, period)

cen_pct_retaineds
cohort period cohort_size cohort_retained pct_retained
18 0 368 368 1.0000000
18 1 368 360 0.9782609
18 2 368 242 0.6576087
18 3 368 233 0.6331522
18 4 368 149 0.4048913
18 5 368 144 0.3913043
18 6 368 99 0.2690217
18 7 368 101 0.2744565
18 8 368 73 0.1983696
18 9 368 70 0.1902174

Of course, this logic could be embedded in a function:

get_pct_retained <- function(cohorts, census_data) {

  cohorts |>
    inner_join(census_data, by = "id_bioguide") |>
    group_by(cohort, period) |>
    summarize(cohort_retained = n_distinct(id_bioguide),
              .groups = "drop") |>
    group_by(cohort) |>
    window_order(period) |>
    mutate(cohort_size = first(cohort_retained)) |>
    ungroup() |>
    mutate(pct_retained = cohort_retained / cohort_size) |>
    select(cohort, period, cohort_size, cohort_retained,
           pct_retained) |>
    arrange(cohort, period)
}

As can be seen from the following output, we get the same results using get_pct_retained() as we got above.

get_pct_retained(first_centuries, census_data)
cohort period cohort_size cohort_retained pct_retained
18 0 368 368 1.0000000
18 1 368 360 0.9782609
18 2 368 242 0.6576087
18 3 368 233 0.6331522
18 4 368 149 0.4048913
18 5 368 144 0.3913043
18 6 368 99 0.2690217
18 7 368 101 0.2744565
18 8 368 73 0.1983696
18 9 368 70 0.1902174

The benefit of using a function like this is that it makes it easy to do the analysis using different cohorts. Here I re-run the analysis using cohorts formed on gender:

genders <-
  legislators |>
  distinct(id_bioguide, gender) |>
  rename(cohort = gender)

get_pct_retained(genders, census_data)
cohort period cohort_size cohort_retained pct_retained
F 0 366 366 1.0000000
F 1 366 349 0.9535519
F 2 366 261 0.7131148
F 3 366 256 0.6994536
F 4 366 223 0.6092896
F 5 366 222 0.6065574
F 6 366 178 0.4863388
F 7 366 174 0.4754098
F 8 366 143 0.3907104
F 9 366 139 0.3797814

It is relatively straightforward to apply get_pct_retained() to replicate the “retention for legislators who were in office in the year 2000” analysis (Tanimura 2021, 151–53). Defining the cohorts is fairly straightforward:

test_year <- 2000
year_start <- ymd(str_c(test_year, "-01-01"))
year_end <- ymd(str_c(test_year, "-12-31"))

legislators_2000 <-
  legislators_terms |>
  filter(term_start <= year_start, term_end >= year_end) |>
  distinct(id_bioguide, term_type) |>
  rename(cohort = term_type)

The main tweak is to the calculation of the census data, as "2000-01-01" is used in place of first_term and we need to focus on dates in or after 2000 (this is achieved using filter(period >= 0)).

census_data_2000 <-
  legislators_terms |>
  select(id_bioguide, term_type, term_start, term_end) |>
  inner_join(year_ends, 
             join_by(
               between(y$date, x$term_start, x$term_end))) |>
  mutate(period = coalesce(year(age(date, year_start)), 0)) |>
  filter(period >= 0)

Now I just need to call the same function. Note that I sort the data to match the output in the book (Tanimura 2021, 152).

get_pct_retained(legislators_2000, census_data_2000) |> 
  arrange(period, cohort)
cohort period cohort_size cohort_retained pct_retained
rep 0 439 439 1.0000000
sen 0 100 100 1.0000000
rep 1 439 392 0.8929385
sen 1 100 88 0.8800000
rep 2 439 389 0.8861048
sen 2 100 85 0.8500000
rep 3 439 340 0.7744875
sen 3 100 79 0.7900000
rep 4 439 338 0.7699317
sen 4 100 79 0.7900000
Finally, I return to the case of John Quincy Adams discussed earlier in this chapter. The following code extracts his census data into jqa_census_data; the results in Table 5.11 show the jump in period from 5 to 28 created by his long absence from Congress.

jqa_census_data <-
  census_data |>
  filter(id_bioguide == "A000041") |>
  select(id_bioguide, term_type, term_start, term_end,
         date, period)
Table 5.11: Census data for John Quincy Adams
id_bioguide term_type term_start term_end date period
A000041 sen 1803-10-17 1809-03-03 1803-12-31 0
A000041 sen 1803-10-17 1809-03-03 1804-12-31 1
A000041 sen 1803-10-17 1809-03-03 1805-12-31 2
A000041 sen 1803-10-17 1809-03-03 1806-12-31 3
A000041 sen 1803-10-17 1809-03-03 1807-12-31 4
A000041 sen 1803-10-17 1809-03-03 1808-12-31 5
A000041 rep 1831-12-05 1833-03-03 1831-12-31 28
A000041 rep 1831-12-05 1833-03-03 1832-12-31 29
A000041 rep 1833-12-02 1835-03-03 1833-12-31 30
A000041 rep 1833-12-02 1835-03-03 1834-12-31 31
A000041 rep 1835-12-07 1837-03-03 1835-12-31 32
A000041 rep 1835-12-07 1837-03-03 1836-12-31 33
A000041 rep 1837-09-04 1839-03-03 1837-12-31 34
A000041 rep 1837-09-04 1839-03-03 1838-12-31 35
A000041 rep 1839-12-02 1841-03-03 1839-12-31 36
A000041 rep 1839-12-02 1841-03-03 1840-12-31 37
A000041 rep 1841-05-31 1843-03-03 1841-12-31 38
A000041 rep 1841-05-31 1843-03-03 1842-12-31 39
A000041 rep 1843-12-04 1845-03-03 1843-12-31 40
A000041 rep 1843-12-04 1845-03-03 1844-12-31 41
A000041 rep 1845-12-01 1847-03-03 1845-12-31 42
A000041 rep 1845-12-01 1847-03-03 1846-12-31 43
A000041 rep 1847-12-06 1849-03-03 1847-12-31 44
A000041 rep 1847-12-06 1849-03-03 1848-12-31 45

  1. By “advanced” I mean relative to the standards implicitly expected of readers of Tanimura (2021).↩︎

  2. Most queries use what I call a census-date approach, which I discuss in Section 5.1.3.↩︎

  3. The choice of term_start rather than term_end is an odd one that is justified by the notion that “we don’t need to know the specific dates of each term” because of a focus on “the share of legislators who survived in office for a decade or more after their first term started” (Tanimura 2021, 154). Of course, a two-term senator will serve for 12 years, but the tenure calculated in this way would be six years rather than 12, which would seem to be a material error.↩︎

  4. Things might be different for people born on 29 February.↩︎

  5. The legislators_terms data set ends before Pelosi’s term ending in January 2021. From that data set, it is not clear that Pelosi was elected to additional terms beginning in 2021 and 2023.↩︎

  6. According to Wikipedia, the shifts from one district to another were due to redistricting, or the redrawing of electoral boundaries as population numbers changed.↩︎

  7. According to Wikipedia, Randolph did not present his credentials and soon resigned for health reasons.↩︎

  8. It might be viewed as the concatenation of multiple cohort analyses, one for each period studied.↩︎