4  Cohorts

4.1 Cohorts: A Useful Analysis Framework

Chapter 4 examines the fascinating topic of cohorts, where a cohort is a group of observations (often people) who acquire a shared characteristic at (approximately) the same time. For example, children entering kindergarten in New Zealand in 2017 or the Harvard MBA Class of 2002.

While cohort analysis has some attractive features, I guess that data do not often come in a format that facilitates such analysis. Instead, as is the case with the legislators data set studied in Chapter 4, the data analyst needs to rearrange the data to support cohort analysis.

I found Chapter 4 a little confusing on a first pass through it.1 The chapter launches into some SQL code intended to create cohorts, but it’s a little unclear why we’re doing what we’re doing, and we quickly see that our cohort analysis does not make sense (e.g., we have more of our original cohort in period 5 than we had in period 4) and we must have done something wrong. I think I see the idea Cathy is going for here: one needs to think carefully about how to arrange the data to avoid subtle mistakes. The challenge I see is that it’s not obvious that everyone would make the same mistake and one is too deep in the weeds of the code to really see the forest for the trees.

So before I launch into the code, I will spend a little time thinking about things conceptually. I will start with a different example from that used in Chapter 4, but one that I think brings out some of the issues.

For some reason I associate cohort analysis with life expectancy. The people who are born today form a cohort and one might ask: How long do we expect members of this cohort to live? One often hears life expectancy statistics quoted as something like: “In Australia, a boy born in 2018–2020 can expect to live to the age of 81.2 years and a girl would be expected to live to 85.3 years compared to 51.1 for boys and 54.8 years for girls born in in 1891–1900.”2

The people who construct the life expectancies for the children must be veritable polymaths. They need to anticipate future developments in medical care and technology. Skin cancer is a significant cause of death in Australia, due to a mismatch between the median complexion and the intensity of the sun. But the analysts calculating life expectancy need to think about how medical technology is likely to affect rates of death from carcinoma in the future. I can imagine whole-body scanners a bit like the scanners in US airports that detect skin cancers before they become problematic. These analysts also need to understand how road safety will evolve. Will children today all be in driverless vehicles in fifty years time and will accidents then be a rarity? And what about war? The data analyst needs to be able to forecast the possibility of World War III breaking out and shortening life spans. Who are these people?

Of course it seems unlikely these über-analysts exist. Rather they surely do something more prosaic. Here is my guess as to how life expectancies are constructed.3 I guess that the data analyst gathers data on cohorts notionally formed at some point in the past and then looks at survival rates for that cohort over some period, then aggregates those data into a life expectancy.

For example, the data analyst might gather data on people who turned 21 in 2018 and then data on whether those people surived to their 22nd birthday. The proportion of such people who make their 22nd birthday could be interpreted as a survival probability \(p_{21}\). Repeat that for each age-based cohort to get probabilities \(\left\{p_i: i = 0, 1, \dots, 119, 120 \right\}\). Now to find the median life expectancy, we could calculate something like this:4

\[ \left\{j: \arg \min_{i} \left(\prod_{0}^{i} p_i\right) \leq \frac{1}{2} \right\} \] So we have a (fairly) well-defined procedure here. There are obviously some details to be worked out. For example, do we focus on one year (2018 in this case)? Or collect data over multiple years? Does it make sense to form cohorts by years? Or would grouping into larger cohorts (e.g., 20–25) make more sense? Do we identify people by birthdays? Or just use some kind of census date? (People who are 21 on 1 July might have just turned 21, or might be about to turn 22.)

But what exactly have we calculated? In a sense it’s a nonsensical number. Why would survival rates for 88-year-olds in 2018 be relevant for the life expectancy of newborns today, who will face a very different world when they turn 88 in 2111. First, perhaps the analysts really don’t calculate it in this way (though I’m doubtful they are polymaths). Second, even though it’s a “meaningless” number, it probably still has attractive properties, such as the ability to represent in a one or two numbers a lot about the quality of life in Australia.

A final note is that it is not clear to me where the “51.1 for boys and 54.8 years for girls born in in 1891–1900” values come from. Are these the equivalent life expectancies calculated using data available around 1900? Or are these the observed lifespans of people born in 1891–1900? If the latter, how accurate were the former as estimates of these values?

library(DBI)
library(tidyverse)
library(dbplyr)
library(ggplot2)
library(duckdb)

4.2 The Legislators Data Set

4.2.1 Reading the data

Let’s move onto the legislators data set.

db <- dbConnect(duckdb::duckdb())

legislators <-
  tbl(db, "read_csv_auto('data/legislators.csv')") |>
  compute(name = "legislators")

legislators_terms <-
    tbl(db, "read_csv_auto('data/legislators_terms.csv')") |>
  compute(name = "legislators_terms")

4.2.2 Exploring the data

There are two tables in the legislators data set. The legislators data looks to have id_bioguide as a primary key.

WITH

id_rows AS (
  SELECT id_bioguide, count(*) AS rows_per_id
  FROM legislators
  GROUP BY 1)

SELECT rows_per_id, count(*) AS num_ids
FROM id_rows
GROUP BY 1;
1 records
rows_per_id num_ids
1 12518
SELECT COUNT(*) AS num_missing_ids
FROM legislators
WHERE id_bioguide IS NULL
1 records
num_missing_ids
0
WITH id_rows AS (
  SELECT id_bioguide, term_start,
    count(*) AS rows_per_id
  FROM legislators_terms
  GROUP BY 1, 2)

SELECT rows_per_id, count(*) AS num_ids
FROM id_rows
GROUP BY 1;
1 records
rows_per_id num_ids
1 44063
SELECT COUNT(*) AS num_missing_ids
FROM legislators_terms
WHERE id_bioguide IS NULL OR term_start IS NULL
1 records
num_missing_ids
0
WITH id_rows AS (
  SELECT id_bioguide, term_number,
    count(*) AS rows_per_id
  FROM legislators_terms
  GROUP BY 1, 2)

SELECT rows_per_id, count(*) AS num_ids
FROM id_rows
GROUP BY 1;
1 records
rows_per_id num_ids
1 44063
SELECT COUNT(*) AS num_missing_ids
FROM legislators_terms
WHERE term_end IS NULL
1 records
num_missing_ids
0
SELECT max(term_start) AS max_term_start,
  max(term_end) AS max_term_end
FROM legislators_terms
1 records
max_term_start max_term_end
2020-05-19 2025-01-03
legislators_terms |>
  filter(term_end > '2020-05-19') |>
  count(term_end, term_type)
term_end term_type n
2021-01-03 rep 437
2021-01-03 sen 34
2023-01-03 sen 32

Clearly we have term_end dates that are anticipated based on

4.3 Retention

4.3.1 SQL for a Basic Retention Curve

SELECT id_bioguide, min(term_start) AS first_term
FROM legislators_terms 
GROUP BY 1
LIMIT 3;
3 records
id_bioguide first_term
F000062 1992-11-10
T000464 2007-01-04
A000360 2003-01-07
WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1) 

SELECT date_part('year', age(b.term_start, a.first_term)) AS period,
  COUNT(DISTINCT a.id_bioguide) AS cohort_retained
FROM first_terms a
INNER JOIN legislators_terms b
USING (id_bioguide)
GROUP BY 1
ORDER BY 1
LIMIT 4;
4 records
period cohort_retained
0 12518
1 3600
2 3619
3 1831
first_terms <- 
  legislators_terms |>
  group_by(id_bioguide) |>
  summarize(first_term = min(term_start, na.rm = TRUE),
            .groups = "drop")

cohorts <-
  legislators_terms |>
  inner_join(first_terms, by = "id_bioguide") |>
  mutate(period = year(age(term_start, first_term))) |>
  group_by(period) |>
  summarize(cohort_retained = sql("count(distinct id_bioguide)")) 

cohorts |>
  arrange(period)
period cohort_retained
0 12518
1 3600
2 3619
WITH 
first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),

cohorts AS (
  SELECT date_part('year', age(b.term_start, a.first_term)) AS period,
    COUNT(DISTINCT a.id_bioguide) AS cohort_retained
  FROM first_terms a
  JOIN legislators_terms b 
  USING (id_bioguide)
  GROUP BY 1)
  
SELECT period,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained,
  cohort_retained * 1.0 / 
    first_value(cohort_retained) OVER w AS prop_retained
FROM cohorts
WINDOW w AS (ORDER BY period)
LIMIT 3;
3 records
period cohort_size cohort_retained prop_retained
0 12518 12518 1.0000000
1 12518 3600 0.2875859
2 12518 3619 0.2891037
retained_data <-
  cohorts |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained)) |>
  mutate(pct_retained = cohort_retained * 1.0/cohort_size) |>
  select(period, cohort_size, cohort_retained, pct_retained) 

retained_data
period cohort_size cohort_retained pct_retained
0 12518 12518 1.0000000
1 12518 3600 0.2875859
2 12518 3619 0.2891037
retained_data |>
  ggplot(aes(x = period, y = pct_retained)) +
  geom_line()

WITH 
first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),

cohorts AS (
  SELECT date_part('year', age(b.term_start, a.first_term)) AS period,
      COUNT(DISTINCT a.id_bioguide) AS cohort_retained
  FROM first_terms a
  JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
  GROUP BY 1),
  
retained_data AS (
  SELECT period,
    first_value(cohort_retained) OVER w AS cohort_size,
    cohort_retained,
    cohort_retained * 1.0 / first_value(cohort_retained) OVER w AS pct_retained
  FROM cohorts
  WINDOW w AS (ORDER BY period))

SELECT cohort_size,
  max(CASE WHEN period = 0 THEN pct_retained END) AS yr0,
  max(CASE WHEN period = 1 THEN pct_retained END) AS yr1,
  max(CASE WHEN period = 2 THEN pct_retained END) AS yr2,
  max(CASE WHEN period = 3 THEN pct_retained END) AS yr3,
  max(CASE WHEN period = 4 THEN pct_retained END) AS yr4
FROM retained_data
GROUP BY 1;
1 records
cohort_size yr0 yr1 yr2 yr3 yr4
12518 1 0.2875859 0.2891037 0.1462694 0.2564307
retained_data |>
  select(period, pct_retained) |>
  filter(period <= 4) |>
  collect() |>
  arrange(period) |>
  pivot_wider(names_from = period, 
              names_prefix = "yr",
              values_from = pct_retained)
yr0 yr1 yr2 yr3 yr4
1 0.2875859 0.2891037 0.1462694 0.2564307

4.3.2 Adjusting Time Series to Increase Retention Accuracy

Use copy_inline() here if using a read-only database.

year_ends <-
  tibble(date = seq(as.Date('1770-12-31'), 
                    as.Date('2030-12-31'), 
                    by = "1 year")) |>
  copy_to(db, df = _, overwrite = TRUE, name = "year_ends")
WITH

first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1)
  
SELECT a.id_bioguide, a.first_term, b.term_start, b.term_end,
  c.date,
  date_part('year', age(c.date, a.first_term)) AS period
FROM first_terms a
JOIN legislators_terms b USING (id_bioguide)
LEFT JOIN year_ends c
ON c.date BETWEEN b.term_start and b.term_end
ORDER BY id_bioguide
LIMIT 3;
3 records
id_bioguide first_term term_start term_end date period
A000001 1951-01-03 1951-01-03 1953-01-03 1951-12-31 0
A000001 1951-01-03 1951-01-03 1953-01-03 1952-12-31 1
A000002 1947-01-03 1971-01-21 1973-01-03 1972-12-31 25
cohorts <-
  first_terms |>
  inner_join(legislators_terms, by = join_by(id_bioguide)) |>
  left_join(year_ends, 
            by = join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(period = date_part('year', age(date, first_term))) |>
  select(id_bioguide, first_term, term_start, term_end, date, period) 

cohorts
id_bioguide first_term term_start term_end date period
T000165 2020-05-19 2020-05-19 2021-01-03 2020-12-31 0
G000061 2020-05-19 2020-05-19 2021-01-03 2020-12-31 0
M000687 1987-01-06 2020-05-05 2021-01-03 2020-12-31 33
WITH

first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),

cohorts_retained AS (        
    SELECT coalesce(date_part('year', age(c.date, a.first_term)), 0) AS period,
      COUNT(DISTINCT a.id_bioguide) AS cohort_retained
    FROM first_terms a
    JOIN legislators_terms b
    USING (id_bioguide)
    LEFT JOIN year_ends c ON c.date BETWEEN b.term_start AND b.term_end
    GROUP BY 1)
    
SELECT period,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained,
  cohort_retained * 1.0/first_value(cohort_retained) OVER w AS pct_retained
FROM cohorts_retained
WINDOW w AS (ORDER BY period);
Displaying records 1 - 10
period cohort_size cohort_retained pct_retained
0 12518 12518 1.0000000
1 12518 12328 0.9848219
2 12518 8166 0.6523406
3 12518 8069 0.6445918
4 12518 5862 0.4682857
5 12518 5795 0.4629334
6 12518 4361 0.3483783
7 12518 4339 0.3466209
8 12518 3521 0.2812750
9 12518 3485 0.2783991
cohorts_retained <-
  cohorts |>
  mutate(period = coalesce(date_part('year', age(date, first_term)), 0)) |>
  select(period, id_bioguide) |>
  distinct() |>
  group_by(period) |>
  summarize(cohort_retained = n()) 

pct_retained <-
  cohorts_retained |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained),
         cohort_retained = as.double(cohort_retained),
         pct_retained = cohort_retained/cohort_size) 

pct_retained |>
  arrange(period)
period cohort_retained cohort_size pct_retained
0 12518 12518 1.0000000
1 12328 12518 0.9848219
2 8166 12518 0.6523406
pct_retained |>
  ggplot(aes(x = period, y = pct_retained)) + 
  geom_line()

WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1)
  
SELECT id_bioguide, a.first_term, b.term_start,
  CASE WHEN b.term_type = 'rep' THEN b.term_start + interval '2 years'
       WHEN b.term_type = 'sen' THEN b.term_start + interval '6 years'
  END AS term_end
FROM first_terms a
JOIN legislators_terms b USING (id_bioguide)
LIMIT 3;
3 records
id_bioguide first_term term_start term_end
F000062 1992-11-10 2019-01-03 2025-01-03
T000464 2007-01-04 2019-01-03 2025-01-03
A000360 2003-01-07 2015-01-06 2021-01-06
first_terms |>
  inner_join(legislators_terms, by = join_by(id_bioguide)) |>
  mutate(term_end = 
           case_when(term_type == 'rep' ~ term_start + years(2),
                     term_type == 'sen' ~ term_start + years(6))) |>
  select(id_bioguide, first_term, term_start, term_end)
id_bioguide first_term term_start term_end
F000062 1992-11-10 2019-01-03 2025-01-03
T000464 2007-01-04 2019-01-03 2025-01-03
A000360 2003-01-07 2015-01-06 2021-01-06

For now, I have omitted the query after the paragraph starting “A second option …”.

4.3.3 Cohorts Derived from the Time Series Itself

WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1)
  
SELECT date_part('year', a.first_term) AS first_year,
  COALESCE(date_part('year', age(c.date, a.first_term)), 0) AS period,
  COUNT(DISTINCT a.id_bioguide) AS cohort_retained
FROM first_terms a
INNER JOIN legislators_terms b 
USING (id_bioguide)
LEFT JOIN year_ends c 
ON c.date BETWEEN b.term_start AND b.term_end 
GROUP BY 1, 2
ORDER BY 1, 2
LIMIT 3;
3 records
first_year period cohort_retained
1789 0 89
1789 1 89
1789 2 57
yr_cohort_retaineds <-
  first_terms |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(first_year = year(first_term),
         period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(first_year, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop") 

yr_cohort_retaineds |>
  arrange(first_year, period)
first_year period cohort_retained
1789 0 89
1789 1 89
1789 2 57
yr_cohort_retaineds |>
  group_by(first_year) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained),
         pct_retained = 1.0 * cohort_retained/cohort_size) |>
  select(first_year, period, cohort_size, cohort_retained, pct_retained) |>
  arrange(first_year, period)
first_year period cohort_size cohort_retained pct_retained
1789 0 89 89 1.0000000
1789 1 89 89 1.0000000
1789 2 89 57 0.6404494
WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),
                
first_centuries AS (
  SELECT 
    date_part('century', a.first_term) AS first_century,
    coalesce(date_part('year', age(c.date, a.first_term)), 0) AS period,
    COUNT(DISTINCT a.id_bioguide) AS cohort_retained
  FROM first_terms AS a
  INNER JOIN legislators_terms b 
  USING (id_bioguide)
  LEFT JOIN year_ends c 
  ON c.date BETWEEN b.term_start AND b.term_end 
  GROUP BY 1, 2)

SELECT first_century, period,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained,
  cohort_retained * 1.0 / first_value(cohort_retained) OVER w AS prop_retained
FROM first_centuries
WINDOW w AS (PARTITION BY first_century ORDER BY period)
ORDER BY 1, 2
LIMIT 3;
3 records
first_century period cohort_size cohort_retained prop_retained
18 0 368 368 1.0000000
18 1 368 360 0.9782609
18 2 368 242 0.6576087
cen_cohort_retaineds <-
  first_terms |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(first_century = century(first_term),
         period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(first_century, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop")
cen_pct_retaineds <-
  cen_cohort_retaineds |>
  group_by(first_century) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained),
         pct_retained = 1.0 * cohort_retained/cohort_size) |>
  ungroup() |>
  select(first_century, period, cohort_size, cohort_retained, pct_retained) 

cen_pct_retaineds |>
  arrange(first_century, period)
first_century period cohort_size cohort_retained pct_retained
18 0 368 368 1.0000000
18 1 368 360 0.9782609
18 2 368 242 0.6576087
cen_pct_retaineds |>
  collect() |>
  mutate(first_century = factor(first_century)) |>
  ggplot(aes(x = period, y = pct_retained,
             color = first_century,
             linetype = first_century,
             group = first_century)) + 
  geom_line()

SELECT DISTINCT id_bioguide,
  min(term_start) OVER w AS first_term,
  first_value(state) OVER w AS first_state
FROM legislators_terms 
WINDOW w AS (PARTITION BY id_bioguide ORDER BY term_start)
ORDER BY id_bioguide
Displaying records 1 - 10
id_bioguide first_term first_state
A000001 1951-01-03 ND
A000002 1947-01-03 VA
A000003 1817-12-01 GA
A000004 1843-12-04 MA
A000005 1887-12-05 TX
A000006 1868-01-01 NC
A000007 1875-12-06 MA
A000008 1857-12-07 ME
A000009 1973-01-03 SD
A000010 1954-01-01 NE
first_states <-
  first_terms <- 
  legislators_terms |>
  group_by(id_bioguide) |>
  window_order(term_start) |>
  mutate(first_term = min(term_start, na.rm = TRUE),
         first_state = first(state)) |>
  ungroup() |>
  select(id_bioguide, first_term, first_state) |>
  distinct()
WITH first_states AS (
  SELECT DISTINCT id_bioguide,
    min(term_start) OVER w AS first_term,
    first_value(state) OVER w AS first_state
  FROM legislators_terms 
  WINDOW w AS (PARTITION BY id_bioguide ORDER BY term_start)),

state_first_periods AS (
  SELECT 
    first_state,
    coalesce(date_part('year', age(c.date, a.first_term)), 0) AS period,
    COUNT(DISTINCT a.id_bioguide) AS cohort_retained
  FROM first_states AS a
  INNER JOIN legislators_terms b 
  USING (id_bioguide)
  LEFT JOIN year_ends c 
  ON c.date BETWEEN b.term_start AND b.term_end 
  GROUP BY 1, 2)

SELECT first_state, period,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained,
  cohort_retained * 1.0 / first_value(cohort_retained) OVER w AS prop_retained
FROM state_first_periods
WINDOW w AS (PARTITION BY first_state ORDER BY period)
ORDER BY 1, 2
LIMIT 3;
3 records
first_state period cohort_size cohort_retained prop_retained
AK 0 19 19 1.0000000
AK 1 19 19 1.0000000
AK 2 19 15 0.7894737
state_first_periods <-
  first_states |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(first_state, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop")
state_pct_retaineds <- 
  state_first_periods |>
  select(first_state, period, cohort_retained) |>
  group_by(first_state) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained),
         prop_retained = cohort_retained * 1.0 / cohort_size) |>
  ungroup()

state_pct_retaineds |>
  arrange(first_state, period) 
first_state period cohort_retained cohort_size prop_retained
AK 0 19 19 1.0000000
AK 1 19 19 1.0000000
AK 2 15 19 0.7894737
top_5_states <- c("NY", "PA", "OH", "IL", "MA")

state_pct_retaineds |>
  filter(first_state %in% top_5_states) |>
  ggplot(aes(x = period, 
             y = prop_retained,
             color = first_state,
             linetype = first_state,
             group = first_state)) + 
  geom_line()

4.3.4 Defining the Cohort from a Separate Table

WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1)
  
SELECT d.gender,
  coalesce(date_part('year',age(c.date,a.first_term)),0) AS period,
  count(distinct a.id_bioguide) AS cohort_retained
FROM first_terms a
JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide 
LEFT JOIN year_ends c ON c.date BETWEEN b.term_start AND b.term_end 
INNER JOIN legislators d ON a.id_bioguide = d.id_bioguide
GROUP BY 1,2
ORDER BY 2,1
LIMIT 4;
4 records
gender period cohort_retained
F 0 366
M 0 12152
F 1 349
M 1 11979
cohorts <-
  first_terms |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, join_by(between(y$date, x$term_start, x$term_end))) |>
  inner_join(legislators, by = "id_bioguide") |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(gender, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop")

cohorts |>
  arrange(period, gender) |>
  collect(n = 4) 
gender period cohort_retained
F 0 366
M 0 12152
F 1 349
M 1 11979
WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),
  
cohorts AS (
  SELECT d.gender,
    coalesce(date_part('year',age(c.date,a.first_term)),0) as period,
    count(distinct a.id_bioguide) as cohort_retained
  FROM first_terms a
  JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
  LEFT JOIN year_ends c on c.date between b.term_start and b.term_end 
  JOIN legislators d on a.id_bioguide = d.id_bioguide
  GROUP BY 1, 2)
  
SELECT gender, period,
  cohort_retained,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained * 1.0 / 
  first_value(cohort_retained) OVER w AS pct_retained
FROM cohorts aa
WINDOW w AS (partition by gender order by period)
ORDER BY 2, 1
LIMIT 4;
4 records
gender period cohort_retained cohort_size pct_retained
F 0 366 366 1.0000000
M 0 12152 12152 1.0000000
F 1 349 366 0.9535519
M 1 11979 12152 0.9857637
cohorts |>
  group_by(gender) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained)) |>
  ungroup() |>
  mutate(pct_retained = 1.0 * cohort_retained / cohort_size) |>
    arrange(period, gender) |>
  collect(n = 4)
gender period cohort_retained cohort_size pct_retained
F 0 366 366 1.0000000
M 0 12152 12152 1.0000000
F 1 349 366 0.9535519
M 1 11979 12152 0.9857637
WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),
  
cohorts AS (
  SELECT d.gender,
    coalesce(date_part('year',age(c.date, a.first_term)),0) as period,
    count(distinct a.id_bioguide) as cohort_retained
  FROM first_terms a
  JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
  LEFT JOIN year_ends c on c.date between b.term_start and b.term_end 
  JOIN legislators d on a.id_bioguide = d.id_bioguide
  WHERE first_term BETWEEN '1917-01-01' AND '1999-12-31'
  GROUP BY 1, 2)
  
SELECT gender, period,
  cohort_retained,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained * 1.0 / 
  first_value(cohort_retained) OVER w AS pct_retained
FROM cohorts aa
WINDOW w AS (PARTITION BY gender ORDER BY period)
ORDER BY 2, 1
LIMIT 4;
4 records
gender period cohort_retained cohort_size pct_retained
F 0 200 200 1.0000000
M 0 3833 3833 1.0000000
F 1 187 200 0.9350000
M 1 3769 3833 0.9833029
cohorts <-
  first_terms |>
  filter(between(first_term, '1917-01-01', '1999-12-31')) |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, join_by(between(y$date, x$term_start, x$term_end))) |>
  inner_join(legislators, by = "id_bioguide") |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(gender, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop") 

cohorts |>
  group_by(gender) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained)) |>
  ungroup() |>
  mutate(pct_retained = 1.0 * cohort_retained / cohort_size) |>
  arrange(period, gender) |>
  collect(n = 4)
gender period cohort_retained cohort_size pct_retained
F 0 200 200 1.0000000
M 0 3833 3833 1.0000000
F 1 187 200 0.9350000
M 1 3769 3833 0.9833029

4.3.5 Dealing with Sparse Cohorts

WITH 
first_terms AS (
  SELECT id_bioguide, min(term_start) AS term_start,
  FROM legislators_terms
  GROUP BY 1),

first_states AS (
  SELECT id_bioguide, gender,
    term_start AS first_term, 
    state AS first_state
  FROM first_terms
  INNER JOIN legislators_terms
  USING (id_bioguide, term_start)
  INNER JOIN legislators
  USING (id_bioguide)
  WHERE term_start between '1917-01-01' and '1999-12-31'),
  
cohorts AS (
  SELECT first_state, gender,
    coalesce(date_part('year', age(date, first_term)), 0) AS period,
    COUNT(DISTINCT id_bioguide) AS cohort_retained
  FROM first_states
  INNER JOIN legislators_terms
  USING (id_bioguide)
  LEFT JOIN year_ends
  ON date BETWEEN term_start AND term_end
  GROUP BY 1, 2, 3),

periods AS (
  SELECT generate_series as period 
   FROM generate_series(0, 20, 1)),

cohort_sizes AS (
  SELECT gender, first_state,
    COUNT(DISTINCT id_bioguide) AS cohort_size
  FROM first_states
  GROUP BY 1, 2),
  
pct_retaineds AS (
  SELECT gender, first_state, period, 
    cohort_size,
    coalesce(cohort_retained, 0) AS cohort_retained,
    coalesce(cohort_retained, 0) * 1.0 / cohort_size AS pct_retained
  FROM cohort_sizes
  CROSS JOIN periods
  LEFT JOIN cohorts
  USING (gender, first_state, period))
  
SELECT gender, first_state, cohort_size,
  max(case when period = 0 then pct_retained end) as yr0,
  max(case when period = 2 then pct_retained end) as yr2,
  max(case when period = 4 then pct_retained end) as yr4,
  max(case when period = 6 then pct_retained end) as yr6,
  max(case when period = 8 then pct_retained end) as yr8,
  max(case when period = 10 then pct_retained end) as yr10
FROM pct_retaineds
WHERE first_state IN ('AL', 'AR', 'CA')
GROUP BY 1, 2, 3
ORDER BY 1, 2
LIMIT 3;
3 records
gender first_state cohort_size yr0 yr2 yr4 yr6 yr8 yr10
F AL 3 1 0.00 0.0 0.00 0.00 0.00
F AR 5 1 0.80 0.2 0.40 0.40 0.40
F CA 25 1 0.92 0.8 0.64 0.68 0.68
first_terms <- 
  legislators_terms |>
  group_by(id_bioguide) |>
  summarize(term_start = min(term_start, na.rm = TRUE),
            .groups = "drop")

first_states <-
  first_terms |>
  inner_join(legislators_terms, join_by(id_bioguide, term_start)) |>
  inner_join(legislators, by = "id_bioguide") |>
  rename(first_term = term_start,
         first_state = state) |>
  filter(between(first_term, '1917-01-01', '1999-12-31')) |>
  select(id_bioguide, gender, first_term, first_state)
cohorts <-
  first_states |> 
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(first_state, gender, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide), 
            .groups = "drop")

From the query below, it seems we’re having an issue whereby legislators are “returning from the dead” in a sense.

cohorts |> 
  filter(first_state == "AR", gender == "F") |> 
  select(period, cohort_retained) |>
  arrange(period) |>
  collect(n = 8)
period cohort_retained
0 5
1 5
2 4
3 4
4 1
5 1
6 2
7 2

Does this mean something is wrong with our query? Let’s check.

Of course, unlike cohorts of living people, dropping out of the cohort of legislators does not prevent you from returning later on. We can pull out a portion of the query creating cohorts and take a closer look. We first get the id_bioguide values for the female representatives from Arkansas (AR) who are around in period == 4. We then look at some data from legislators and legislators_terms for these two representatives.

weird_id_bioguides <-
  first_states |> 
  filter(first_state == "AR", gender == "F") |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  filter(period >= 4) |>
  select(id_bioguide) |>
  distinct() |>
  pull()

legislators |>
  filter(id_bioguide %in% weird_id_bioguides) |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  select(id_bioguide, full_name, term_type, term_start, term_end) |>
  arrange(id_bioguide, term_start) |>
  collect()
id_bioguide full_name term_type term_start term_end
C000138 Hattie Wyatt Caraway sen 1931-12-07 1933-03-03
C000138 Hattie Wyatt Caraway sen 1933-03-09 1939-01-03
C000138 Hattie Wyatt Caraway sen 1939-01-03 1945-01-03
L000035 Blanche Lambert Lincoln rep 1993-01-05 1995-01-03
L000035 Blanche Lambert Lincoln rep 1995-01-04 1997-01-03
L000035 Blanche Lambert Lincoln sen 1999-01-06 2005-01-03
L000035 Blanche Lambert Lincoln sen 2005-01-04 2011-01-03

From the above, it seems that Blanche Lambert Lincoln would have dropped out of the cohort in periods 4 and 5, then returned in 6 and 7.

To aid this kind of “debugging” of our queries, we could easily have retained some underlying data in cohorts. For example, by added cohort_ids = array_agg(id_bioguide) to the query, we can easily interrogate cohorts for the underlying legislator IDs.

cohorts <-
  first_states |> 
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(first_state, gender, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            cohort_ids = array_agg(id_bioguide),
            .groups = "drop")
cohorts |>
  filter(first_state == "AR", gender == "F", 
         between(period, 3, 8)) |>
  mutate(cohort_ids = as.character(cohort_ids)) |>
  select(period, cohort_retained, cohort_ids) |>
  arrange(period) |>
  collect()
period cohort_retained cohort_ids
3 4 [L000035, C000138, W000634, O000061]
4 1 [C000138]
5 1 [C000138]
6 2 [L000035, C000138]
7 2 [L000035, C000138]
8 2 [L000035, C000138]
periods <- 
  tibble(period = 1:20) |>
  copy_to(db, df = _)
cohort_sizes <-
  first_states |>
  group_by(gender, first_state) |>
  summarize(cohort_size = n_distinct(id_bioguide),
            .groups = "drop")
pct_retaineds <-
  cohort_sizes |>
  cross_join(periods) |>
  left_join(cohorts, 
            join_by(gender, first_state, period)) |>
  mutate(cohort_retained = coalesce(cohort_retained, 0),
         pct_retained = cohort_retained * 1.0/ cohort_size) |>
  select(gender, first_state, period, cohort_size,
         cohort_retained, pct_retained)
pct_retaineds |>
  filter(period %in% c(2, 4, 6, 8, 10),
         first_state %in% c('AL', 'AR', 'CA')) |>
  select(gender, first_state, cohort_size, period, pct_retained) |>
  pivot_wider(names_from = "period",
              names_prefix = "yr",
              values_from = "pct_retained") |>
  arrange(gender, first_state)
gender first_state cohort_size yr2 yr4 yr6 yr8 yr10
F AL 3 0.00 0.0 0.00 0.00 0.00
F AR 5 0.80 0.2 0.40 0.40 0.40
F CA 25 0.92 0.8 0.64 0.68 0.68
pct_retaineds |>
  filter(first_state %in% c('AL', 'AR', 'CA')) |>
  ggplot(aes(x = period, y = pct_retained, 
             colour = gender, group = gender)) +
  geom_line() +
  facet_wrap(vars(first_state), ncol = 1)

4.3.6 Defining Cohorts from Dates other than the First Date

WITH first_terms AS (
  SELECT DISTINCT id_bioguide, term_type,
    '2000-01-01'::date AS first_term,
    min(term_start) AS min_start
  FROM legislators_terms 
  WHERE term_start <= '2000-12-31' and term_end >= '2000-01-01'
  GROUP BY 1, 2, 3),
  
cohort_dates AS (
  SELECT id_bioguide, date
  FROM legislators_terms
  LEFT JOIN year_ends ON date BETWEEN term_start AND term_end),
  
cohorts AS (
  SELECT term_type,
    coalesce(date_part('year', age(date, first_term)), 0) AS period,
    COUNT(DISTINCT id_bioguide) AS cohort_retained
  FROM first_terms
  JOIN cohort_dates
  USING (id_bioguide)
  GROUP BY 1, 2)
  
SELECT term_type, period,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained,
  cohort_retained * 1.0 / 
    first_value(cohort_retained) OVER w AS pct_retained
FROM cohorts
WHERE period >= 0
WINDOW w AS (PARTITION BY term_type ORDER BY period)
ORDER BY 2, 1
LIMIT 4;
4 records
term_type period cohort_size cohort_retained pct_retained
rep 0 440 440 1.0000000
sen 0 101 101 1.0000000
rep 1 440 392 0.8909091
sen 1 101 89 0.8811881
min_year <- 2000L

first_terms <-
  legislators_terms |>
  filter(year(term_start) <= min_year, 
         year(term_end) >= min_year) |>
  mutate(first_term = as.Date(paste0(min_year, '-01-01'))) |>
  group_by(id_bioguide, term_type, first_term) |>
  summarize(min_start = min(term_start, na.rm = TRUE),
            .groups = "drop")
cohort_dates <-
  legislators_terms |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  filter(year(date) >= min_year) |>
  select(id_bioguide, date)
cohorts <-
  first_terms |>
  inner_join(cohort_dates, by = "id_bioguide") |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(term_type, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop")
pct_retained_2000 <-
  cohorts |> 
  group_by(term_type) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained),
         pct_retained = cohort_retained * 1.0 / cohort_size) |>
  select(term_type, period, cohort_size, cohort_retained,
         pct_retained)

pct_retained_2000 |>
  arrange(period, term_type) |>
  collect(n = 4)
# A tibble: 4 × 5
# Groups:   term_type [2]
  term_type period cohort_size cohort_retained pct_retained
  <chr>      <dbl>       <dbl>           <dbl>        <dbl>
1 rep            0         439             439        1    
2 sen            0         101             101        1    
3 rep            1         439             392        0.893
4 sen            1         101              89        0.881
pct_retained_2000 |>
  filter(period <= 20) |>
  mutate(pct_retained = pct_retained * 100) |>
  ggplot(aes(x = period, y = pct_retained,
             colour = term_type, group = term_type)) +
  geom_line()

4.5 Cross-Section Analysis, Through a Cohort Lens

SELECT b.date, count(distinct a.id_bioguide) as legislators
FROM legislators_terms a
JOIN year_ends b on b.date between a.term_start and a.term_end
and b.date <= '2020-12-31'
GROUP BY 1
;
Displaying records 1 - 10
date legislators
1855-12-31 306
1856-12-31 306
1853-12-31 315
1854-12-31 316
1851-12-31 303
1852-12-31 310
1849-12-31 310
1850-12-31 313
1848-12-31 306
1847-12-31 301
SELECT b.date
,date_part('century',first_term)::int as century
,count(distinct a.id_bioguide) as legislators
FROM legislators_terms a
JOIN year_ends b on b.date between a.term_start and a.term_end
and b.date <= '2020-12-31'
JOIN
(
        SELECT id_bioguide, min(term_start) as first_term
        FROM legislators_terms
        GROUP BY 1
) c on a.id_bioguide = c.id_bioguide        
GROUP BY 1,2
;
Displaying records 1 - 10
date century legislators
1919-12-31 20 516
1965-12-31 20 546
1995-12-31 20 539
2005-12-31 20 388
1975-12-31 20 547
1811-12-31 19 153
1893-12-31 19 470
1909-12-31 20 344
1937-12-31 20 547
2010-12-31 21 262
SELECT date
,century
,legislators
,sum(legislators) over (partition by date) as cohort
,legislators * 100.0 / sum(legislators) over (partition by date) as pct_century
FROM
(
        SELECT b.date
        ,date_part('century',first_term)::int as century
        ,count(distinct a.id_bioguide) as legislators
        FROM legislators_terms a
        JOIN year_ends b on b.date between a.term_start and a.term_end
and b.date <= '2020-01-01'
        JOIN
        (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms
                GROUP BY 1
        ) c on a.id_bioguide = c.id_bioguide        
        GROUP BY 1,2
) a
ORDER BY 1,2
;
Displaying records 1 - 10
date century legislators cohort pct_century
1789-12-31 18 89 89 100
1790-12-31 18 95 95 100
1791-12-31 18 99 99 100
1792-12-31 18 101 101 100
1793-12-31 18 141 141 100
1794-12-31 18 140 140 100
1795-12-31 18 145 145 100
1796-12-31 18 150 150 100
1797-12-31 18 152 152 100
1798-12-31 18 155 155 100
WITH 

first_terms AS (
  SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms
                GROUP BY 1),
                
aa AS (
  SELECT b.date,
    date_part('century',first_term)::int as century,
    count(distinct a.id_bioguide) as legislators
  FROM legislators_terms a
  JOIN year_ends b 
  ON b.date between a.term_start and a.term_end
    and b.date <= '2020-01-01'
  JOIN first_terms c 
  USING (id_bioguide)
  GROUP BY 1,2) 
                
SELECT date,
  coalesce(sum(case when century = 18 then legislators end) * 100.0 / sum(legislators),0) as pct_18,
  coalesce(sum(case when century = 19 then legislators end) * 100.0 / sum(legislators),0) as pct_19,
  coalesce(sum(case when century = 20 then legislators end) * 100.0 / sum(legislators),0) as pct_20,
  coalesce(sum(case when century = 21 then legislators end) * 100.0 / sum(legislators),0) as pct_21
FROM aa
GROUP BY 1
ORDER BY 1;
Displaying records 1 - 10
date pct_18 pct_19 pct_20 pct_21
1789-12-31 100 0 0 0
1790-12-31 100 0 0 0
1791-12-31 100 0 0 0
1792-12-31 100 0 0 0
1793-12-31 100 0 0 0
1794-12-31 100 0 0 0
1795-12-31 100 0 0 0
1796-12-31 100 0 0 0
1797-12-31 100 0 0 0
1798-12-31 100 0 0 0
SELECT id_bioguide, date,
  count(date) over (partition by id_bioguide order by date rows between unbounded preceding and current row) as cume_years
FROM (
  SELECT distinct id_bioguide, date
  FROM legislators_terms
  JOIN year_ends 
  ON date between term_start and term_end
    and date <= '2020-01-01');
Displaying records 1 - 10
id_bioguide date cume_years
A000009 1973-12-31 1
A000009 1974-12-31 2
A000009 1975-12-31 3
A000009 1976-12-31 4
A000009 1977-12-31 5
A000009 1978-12-31 6
A000009 1979-12-31 7
A000009 1980-12-31 8
A000009 1981-12-31 9
A000009 1982-12-31 10
SELECT date, cume_years,
  count(distinct id_bioguide) as legislators
FROM
(
    SELECT id_bioguide, date
    ,count(date) over (partition by id_bioguide order by date rows between unbounded preceding and current row) as cume_years
    FROM
    (
        SELECT distinct a.id_bioguide, b.date
        FROM legislators_terms a
        JOIN year_ends b 
        on b.date between a.term_start and a.term_end and b.date <= '2020-01-01'
        GROUP BY 1,2
    ) aa
) aaa
GROUP BY 1,2
;
Displaying records 1 - 10
date cume_years legislators
1995-12-31 17 30
1999-12-31 5 68
2018-12-31 6 71
1988-12-31 2 54
1978-12-31 2 79
1903-12-31 7 75
2018-12-31 2 61
1948-12-31 3 10
1868-12-31 2 119
1999-12-31 9 30
WITH aa AS (
  SELECT DISTINCT a.id_bioguide, b.date
  FROM legislators_terms a
  JOIN year_ends b 
    on b.date BETWEEN a.term_start AND a.term_end AND b.date <= '2020-01-01'
  GROUP BY 1,2),
  
aaa AS (
  SELECT id_bioguide, date,
    count(date) OVER (PARTITION BY id_bioguide ORDER BY date) AS cume_years
  FROM aa),
  
aaaa AS (
  SELECT date, cume_years,
    COUNT(DISTINCT id_bioguide) as legislators
  FROM aaa
  GROUP BY 1,2)
SELECT date, count(*) as tenures
FROM aaaa
GROUP BY 1;
Displaying records 1 - 10
date tenures
1973-12-31 35
1974-12-31 36
1975-12-31 36
1976-12-31 36
1977-12-31 34
1978-12-31 34
1979-12-31 31
1980-12-31 32
1981-12-31 31
1982-12-31 32
WITH term_dates AS (
  SELECT distinct a.id_bioguide, b.date
  FROM legislators_terms a
  JOIN year_ends b 
  on b.date between a.term_start and a.term_end and b.date <= '2020-01-01'),

cum_term_dates AS (
  SELECT id_bioguide, date,
    count(date) over (partition by id_bioguide order by date rows between unbounded preceding and current row) as cume_years
  FROM term_dates),
  
cum_term_bands AS (
  SELECT date,
    case when cume_years <= 4 then '1 to 4'
         when cume_years <= 10 then '5 to 10'
        when cume_years <= 20 then '11 to 20'
         else '21+' end as tenure,
    count(distinct id_bioguide) as legislators
  FROM cum_term_dates
  GROUP BY 1,2)
  
SELECT date, tenure,
  legislators * 100.0 / sum(legislators) over w as pct_legislators 
FROM cum_term_bands
WINDOW w AS (partition by date)
ORDER BY date DESC;
Displaying records 1 - 10
date tenure pct_legislators
2019-12-31 1 to 4 29.98138
2019-12-31 5 to 10 32.02980
2019-12-31 21+ 17.87710
2019-12-31 11 to 20 20.11173
2018-12-31 11 to 20 21.33581
2018-12-31 5 to 10 33.76623
2018-12-31 1 to 4 25.60297
2018-12-31 21+ 19.29499
2017-12-31 11 to 20 21.37546
2017-12-31 5 to 10 34.75836
term_dates <-
  legislators_terms |>
  inner_join(year_ends |> filter(date <= '2020-01-01'),
             join_by(between(y$date, x$term_start, x$term_end))) |>
  distinct(id_bioguide, date) 
cum_term_dates <-
  term_dates |>
  group_by(id_bioguide) |>
  window_order(date) |>
  window_frame(-Inf, 0) |>
  mutate(cume_years = n()) |>
  ungroup() |>
  select(id_bioguide, date, cume_years) 
cum_term_bands <-
  cum_term_dates |> 
  mutate(tenure = case_when(cume_years <= 4 ~ '1 to 4',
                            cume_years <= 10 ~ '5 to 10',
                            cume_years <= 20 ~ '11 to 20',
                            TRUE ~ '21+')) |>
  group_by(date, tenure) |>
  summarize(legislators = n_distinct(id_bioguide),
            .groups = "drop") 
total_legs <-
  cum_term_bands |>
  group_by(date) |>
  summarize(num_legs = sum(legislators, na.rm = TRUE),
            .groups = "drop")
cum_term_bands |>
  inner_join(total_legs, by = "date") |>
  mutate(pct_legislators = legislators * 100.0 / num_legs) |>
  select(date, tenure, pct_legislators) |>
  arrange(desc(date)) |>
  collect(n = 8)
date tenure pct_legislators
2019-12-31 5 to 10 32.02980
2019-12-31 1 to 4 29.98138
2019-12-31 11 to 20 20.11173
2019-12-31 21+ 17.87710
2018-12-31 11 to 20 21.33581
2018-12-31 5 to 10 33.76623
2018-12-31 21+ 19.29499
2018-12-31 1 to 4 25.60297
dbDisconnect(db, shutdown = TRUE)

  1. In writing this sentence, I am still working through the chapter.↩︎

  2. See here for the source for these data.↩︎

  3. I made no effort to research how they are constructed because (1) I am lazy and (2) my imagined version is probably better for my current purposes.↩︎

  4. There are some details I’m glossing over here, such as the fact that there will be some \(j\) where the cumulative survival probability is just above one-half, but where that for \(j + 1\) is just below one-half, so some interpolation will be required.↩︎