4 Cohorts

4.1 Cohorts: A Useful Analysis Framework

Chapter 4 examines the fascinating topic of cohorts, where a cohort is a group of observations (often people) who acquire a shared characteristic at (approximately) the same time. For example, children entering kindergarten in New Zealand in 2017 or the Harvard MBA Class of 2002.

While cohort analysis has some attractive features, I guess that data do not often come in a format that facilitates such analysis. Instead, as is the case with the legislators data set studied in Chapter 4, the data analyst needs to rearrange the data to support cohort analysis.

I found Chapter 4 a little confusing on a first pass through it.¹ The chapter launches into some SQL code intended to create cohorts, but it’s a little unclear why we’re doing what we’re doing, and we quickly see that our cohort analysis does not make sense (e.g., we have more of our original cohort in period 5 than we had in period 4) and we must have done something wrong. I think I see the idea Cathy is going for here: one needs to think carefully about how to arrange the data to avoid subtle mistakes. The challenge I see is that it’s not obvious that everyone would make the same mistake and one is too deep in the weeds of the code to really see the forest for the trees.

So before I launch into the code, I will spend a little time thinking about things conceptually. I will start with a different example from that used in Chapter 4, but one that I think brings out some of the issues.

For some reason I associate cohort analysis with life expectancy. The people who are born today form a cohort and one might ask: How long do we expect members of this cohort to live? One often hears life expectancy statistics quoted as something like: “In Australia, a boy born in 2018–2020 can expect to live to the age of 81.2 years and a girl would be expected to live to 85.3 years compared to 51.1 for boys and 54.8 years for girls born in in 1891–1900.”²

The people who construct the life expectancies for the children must be veritable polymaths. They need to anticipate future developments in medical care and technology. Skin cancer is a significant cause of death in Australia, due to a mismatch between the median complexion and the intensity of the sun. But the analysts calculating life expectancy need to think about how medical technology is likely to affect rates of death from carcinoma in the future. I can imagine whole-body scanners a bit like the scanners in US airports that detect skin cancers before they become problematic. These analysts also need to understand how road safety will evolve. Will children today all be in driverless vehicles in fifty years time and will accidents then be a rarity? And what about war? The data analyst needs to be able to forecast the possibility of World War III breaking out and shortening life spans. Who are these people?

Of course it seems unlikely these über-analysts exist. Rather they surely do something more prosaic. Here is my guess as to how life expectancies are constructed.³ I guess that the data analyst gathers data on cohorts notionally formed at some point in the past and then looks at survival rates for that cohort over some period, then aggregates those data into a life expectancy.

For example, the data analyst might gather data on people who turned 21 in 2018 and then data on whether those people surived to their 22nd birthday. The proportion of such people who make their 22nd birthday could be interpreted as a survival probability \(p_{21}\). Repeat that for each age-based cohort to get probabilities \(\left\{p_i: i = 0, 1, \dots, 119, 120 \right\}\). Now to find the median life expectancy, we could calculate something like this:⁴

\[ \left\{j: \arg \min_{i} \left(\prod_{0}^{i} p_i\right) \leq \frac{1}{2} \right\} \] So we have a (fairly) well-defined procedure here. There are obviously some details to be worked out. For example, do we focus on one year (2018 in this case)? Or collect data over multiple years? Does it make sense to form cohorts by years? Or would grouping into larger cohorts (e.g., 20–25) make more sense? Do we identify people by birthdays? Or just use some kind of census date? (People who are 21 on 1 July might have just turned 21, or might be about to turn 22.)

But what exactly have we calculated? In a sense it’s a nonsensical number. Why would survival rates for 88-year-olds in 2018 be relevant for the life expectancy of newborns today, who will face a very different world when they turn 88 in 2111. First, perhaps the analysts really don’t calculate it in this way (though I’m doubtful they are polymaths). Second, even though it’s a “meaningless” number, it probably still has attractive properties, such as the ability to represent in a one or two numbers a lot about the quality of life in Australia.

A final note is that it is not clear to me where the “51.1 for boys and 54.8 years for girls born in in 1891–1900” values come from. Are these the equivalent life expectancies calculated using data available around 1900? Or are these the observed lifespans of people born in 1891–1900? If the latter, how accurate were the former as estimates of these values?

library(DBI)
library(tidyverse)
library(dbplyr)
library(ggplot2)
library(duckdb)

4.2 The Legislators Data Set

4.2.1 Reading the data

Let’s move onto the legislators data set.

db <- dbConnect(duckdb::duckdb())

legislators <-
  tbl(db, "read_csv_auto('data/legislators.csv')") |>
  compute(name = "legislators")

legislators_terms <-
    tbl(db, "read_csv_auto('data/legislators_terms.csv')") |>
  compute(name = "legislators_terms")

4.2.2 Exploring the data

There are two tables in the legislators data set. The legislators data looks to have id_bioguide as a primary key.

WITH

id_rows AS (
  SELECT id_bioguide, count(*) AS rows_per_id
  FROM legislators
  GROUP BY 1)

SELECT rows_per_id, count(*) AS num_ids
FROM id_rows
GROUP BY 1;

1 records
rows_per_id	num_ids
1	12518

SELECT COUNT(*) AS num_missing_ids
FROM legislators
WHERE id_bioguide IS NULL

1 records
num_missing_ids
0

WITH id_rows AS (
  SELECT id_bioguide, term_start,
    count(*) AS rows_per_id
  FROM legislators_terms
  GROUP BY 1, 2)

SELECT rows_per_id, count(*) AS num_ids
FROM id_rows
GROUP BY 1;

1 records
rows_per_id	num_ids
1	44063

SELECT COUNT(*) AS num_missing_ids
FROM legislators_terms
WHERE id_bioguide IS NULL OR term_start IS NULL

1 records
num_missing_ids
0

WITH id_rows AS (
  SELECT id_bioguide, term_number,
    count(*) AS rows_per_id
  FROM legislators_terms
  GROUP BY 1, 2)

SELECT rows_per_id, count(*) AS num_ids
FROM id_rows
GROUP BY 1;

1 records
rows_per_id	num_ids
1	44063

SELECT COUNT(*) AS num_missing_ids
FROM legislators_terms
WHERE term_end IS NULL

1 records
num_missing_ids
0

SELECT max(term_start) AS max_term_start,
  max(term_end) AS max_term_end
FROM legislators_terms

1 records
max_term_start	max_term_end
2020-05-19	2025-01-03

legislators_terms |>
  filter(term_end > '2020-05-19') |>
  count(term_end, term_type)

term_end	term_type	n
2021-01-03	rep	437
2021-01-03	sen	34
2023-01-03	sen	32

Clearly we have term_end dates that are anticipated based on

4.3 Retention

4.3.1 SQL for a Basic Retention Curve

SELECT id_bioguide, min(term_start) AS first_term
FROM legislators_terms 
GROUP BY 1
LIMIT 3;

3 records
id_bioguide	first_term
F000062	1992-11-10
T000464	2007-01-04
A000360	2003-01-07

WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1) 

SELECT date_part('year', age(b.term_start, a.first_term)) AS period,
  COUNT(DISTINCT a.id_bioguide) AS cohort_retained
FROM first_terms a
INNER JOIN legislators_terms b
USING (id_bioguide)
GROUP BY 1
ORDER BY 1
LIMIT 4;

4 records
period	cohort_retained
0	12518
1	3600
2	3619
3	1831

first_terms <- 
  legislators_terms |>
  group_by(id_bioguide) |>
  summarize(first_term = min(term_start, na.rm = TRUE),
            .groups = "drop")

cohorts <-
  legislators_terms |>
  inner_join(first_terms, by = "id_bioguide") |>
  mutate(period = year(age(term_start, first_term))) |>
  group_by(period) |>
  summarize(cohort_retained = sql("count(distinct id_bioguide)")) 

cohorts |>
  arrange(period)

period	cohort_retained
0	12518
1	3600
2	3619

WITH 
first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),

cohorts AS (
  SELECT date_part('year', age(b.term_start, a.first_term)) AS period,
    COUNT(DISTINCT a.id_bioguide) AS cohort_retained
  FROM first_terms a
  JOIN legislators_terms b 
  USING (id_bioguide)
  GROUP BY 1)
  
SELECT period,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained,
  cohort_retained * 1.0 / 
    first_value(cohort_retained) OVER w AS prop_retained
FROM cohorts
WINDOW w AS (ORDER BY period)
LIMIT 3;

3 records
period	cohort_size	cohort_retained	prop_retained
0	12518	12518	1.0000000
1	12518	3600	0.2875859
2	12518	3619	0.2891037

retained_data <-
  cohorts |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained)) |>
  mutate(pct_retained = cohort_retained * 1.0/cohort_size) |>
  select(period, cohort_size, cohort_retained, pct_retained) 

retained_data

period	cohort_size	cohort_retained	pct_retained
0	12518	12518	1.0000000
1	12518	3600	0.2875859
2	12518	3619	0.2891037

retained_data |>
  ggplot(aes(x = period, y = pct_retained)) +
  geom_line()

WITH 
first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),

cohorts AS (
  SELECT date_part('year', age(b.term_start, a.first_term)) AS period,
      COUNT(DISTINCT a.id_bioguide) AS cohort_retained
  FROM first_terms a
  JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
  GROUP BY 1),
  
retained_data AS (
  SELECT period,
    first_value(cohort_retained) OVER w AS cohort_size,
    cohort_retained,
    cohort_retained * 1.0 / first_value(cohort_retained) OVER w AS pct_retained
  FROM cohorts
  WINDOW w AS (ORDER BY period))

SELECT cohort_size,
  max(CASE WHEN period = 0 THEN pct_retained END) AS yr0,
  max(CASE WHEN period = 1 THEN pct_retained END) AS yr1,
  max(CASE WHEN period = 2 THEN pct_retained END) AS yr2,
  max(CASE WHEN period = 3 THEN pct_retained END) AS yr3,
  max(CASE WHEN period = 4 THEN pct_retained END) AS yr4
FROM retained_data
GROUP BY 1;

1 records
cohort_size	yr0	yr1	yr2	yr3	yr4
12518	1	0.2875859	0.2891037	0.1462694	0.2564307

retained_data |>
  select(period, pct_retained) |>
  filter(period <= 4) |>
  collect() |>
  arrange(period) |>
  pivot_wider(names_from = period, 
              names_prefix = "yr",
              values_from = pct_retained)

yr0	yr1	yr2	yr3	yr4
1	0.2875859	0.2891037	0.1462694	0.2564307

4.3.2 Adjusting Time Series to Increase Retention Accuracy

Use copy_inline() here if using a read-only database.

year_ends <-
  tibble(date = seq(as.Date('1770-12-31'), 
                    as.Date('2030-12-31'), 
                    by = "1 year")) |>
  copy_to(db, df = _, overwrite = TRUE, name = "year_ends")

WITH

first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1)
  
SELECT a.id_bioguide, a.first_term, b.term_start, b.term_end,
  c.date,
  date_part('year', age(c.date, a.first_term)) AS period
FROM first_terms a
JOIN legislators_terms b USING (id_bioguide)
LEFT JOIN year_ends c
ON c.date BETWEEN b.term_start and b.term_end
ORDER BY id_bioguide
LIMIT 3;

3 records
id_bioguide	first_term	term_start	term_end	date	period
A000001	1951-01-03	1951-01-03	1953-01-03	1951-12-31	0
A000001	1951-01-03	1951-01-03	1953-01-03	1952-12-31	1
A000002	1947-01-03	1971-01-21	1973-01-03	1972-12-31	25

cohorts <-
  first_terms |>
  inner_join(legislators_terms, by = join_by(id_bioguide)) |>
  left_join(year_ends, 
            by = join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(period = date_part('year', age(date, first_term))) |>
  select(id_bioguide, first_term, term_start, term_end, date, period) 

cohorts

id_bioguide	first_term	term_start	term_end	date	period
T000165	2020-05-19	2020-05-19	2021-01-03	2020-12-31	0
G000061	2020-05-19	2020-05-19	2021-01-03	2020-12-31	0
M000687	1987-01-06	2020-05-05	2021-01-03	2020-12-31	33

WITH

first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),

cohorts_retained AS (        
    SELECT coalesce(date_part('year', age(c.date, a.first_term)), 0) AS period,
      COUNT(DISTINCT a.id_bioguide) AS cohort_retained
    FROM first_terms a
    JOIN legislators_terms b
    USING (id_bioguide)
    LEFT JOIN year_ends c ON c.date BETWEEN b.term_start AND b.term_end
    GROUP BY 1)
    
SELECT period,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained,
  cohort_retained * 1.0/first_value(cohort_retained) OVER w AS pct_retained
FROM cohorts_retained
WINDOW w AS (ORDER BY period);

Displaying records 1 - 10
period	cohort_size	cohort_retained	pct_retained
0	12518	12518	1.0000000
1	12518	12328	0.9848219
2	12518	8166	0.6523406
3	12518	8069	0.6445918
4	12518	5862	0.4682857
5	12518	5795	0.4629334
6	12518	4361	0.3483783
7	12518	4339	0.3466209
8	12518	3521	0.2812750
9	12518	3485	0.2783991

cohorts_retained <-
  cohorts |>
  mutate(period = coalesce(date_part('year', age(date, first_term)), 0)) |>
  select(period, id_bioguide) |>
  distinct() |>
  group_by(period) |>
  summarize(cohort_retained = n()) 

pct_retained <-
  cohorts_retained |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained),
         cohort_retained = as.double(cohort_retained),
         pct_retained = cohort_retained/cohort_size) 

pct_retained |>
  arrange(period)

period	cohort_retained	cohort_size	pct_retained
0	12518	12518	1.0000000
1	12328	12518	0.9848219
2	8166	12518	0.6523406

pct_retained |>
  ggplot(aes(x = period, y = pct_retained)) + 
  geom_line()

WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1)
  
SELECT id_bioguide, a.first_term, b.term_start,
  CASE WHEN b.term_type = 'rep' THEN b.term_start + interval '2 years'
       WHEN b.term_type = 'sen' THEN b.term_start + interval '6 years'
  END AS term_end
FROM first_terms a
JOIN legislators_terms b USING (id_bioguide)
LIMIT 3;

3 records
id_bioguide	first_term	term_start	term_end
F000062	1992-11-10	2019-01-03	2025-01-03
T000464	2007-01-04	2019-01-03	2025-01-03
A000360	2003-01-07	2015-01-06	2021-01-06

first_terms |>
  inner_join(legislators_terms, by = join_by(id_bioguide)) |>
  mutate(term_end = 
           case_when(term_type == 'rep' ~ term_start + years(2),
                     term_type == 'sen' ~ term_start + years(6))) |>
  select(id_bioguide, first_term, term_start, term_end)

id_bioguide	first_term	term_start	term_end
F000062	1992-11-10	2019-01-03	2025-01-03
T000464	2007-01-04	2019-01-03	2025-01-03
A000360	2003-01-07	2015-01-06	2021-01-06

For now, I have omitted the query after the paragraph starting “A second option …”.

4.3.3 Cohorts Derived from the Time Series Itself

WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1)
  
SELECT date_part('year', a.first_term) AS first_year,
  COALESCE(date_part('year', age(c.date, a.first_term)), 0) AS period,
  COUNT(DISTINCT a.id_bioguide) AS cohort_retained
FROM first_terms a
INNER JOIN legislators_terms b 
USING (id_bioguide)
LEFT JOIN year_ends c 
ON c.date BETWEEN b.term_start AND b.term_end 
GROUP BY 1, 2
ORDER BY 1, 2
LIMIT 3;

3 records
first_year	period	cohort_retained
1789	0	89
1789	1	89
1789	2	57

yr_cohort_retaineds <-
  first_terms |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(first_year = year(first_term),
         period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(first_year, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop") 

yr_cohort_retaineds |>
  arrange(first_year, period)

first_year	period	cohort_retained
1789	0	89
1789	1	89
1789	2	57

yr_cohort_retaineds |>
  group_by(first_year) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained),
         pct_retained = 1.0 * cohort_retained/cohort_size) |>
  select(first_year, period, cohort_size, cohort_retained, pct_retained) |>
  arrange(first_year, period)

first_year	period	cohort_size	cohort_retained	pct_retained
1789	0	89	89	1.0000000
1789	1	89	89	1.0000000
1789	2	89	57	0.6404494

WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),
                
first_centuries AS (
  SELECT 
    date_part('century', a.first_term) AS first_century,
    coalesce(date_part('year', age(c.date, a.first_term)), 0) AS period,
    COUNT(DISTINCT a.id_bioguide) AS cohort_retained
  FROM first_terms AS a
  INNER JOIN legislators_terms b 
  USING (id_bioguide)
  LEFT JOIN year_ends c 
  ON c.date BETWEEN b.term_start AND b.term_end 
  GROUP BY 1, 2)

SELECT first_century, period,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained,
  cohort_retained * 1.0 / first_value(cohort_retained) OVER w AS prop_retained
FROM first_centuries
WINDOW w AS (PARTITION BY first_century ORDER BY period)
ORDER BY 1, 2
LIMIT 3;

3 records
first_century	period	cohort_size	cohort_retained	prop_retained
18	0	368	368	1.0000000
18	1	368	360	0.9782609
18	2	368	242	0.6576087

cen_cohort_retaineds <-
  first_terms |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(first_century = century(first_term),
         period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(first_century, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop")

cen_pct_retaineds <-
  cen_cohort_retaineds |>
  group_by(first_century) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained),
         pct_retained = 1.0 * cohort_retained/cohort_size) |>
  ungroup() |>
  select(first_century, period, cohort_size, cohort_retained, pct_retained) 

cen_pct_retaineds |>
  arrange(first_century, period)

first_century	period	cohort_size	cohort_retained	pct_retained
18	0	368	368	1.0000000
18	1	368	360	0.9782609
18	2	368	242	0.6576087

cen_pct_retaineds |>
  collect() |>
  mutate(first_century = factor(first_century)) |>
  ggplot(aes(x = period, y = pct_retained,
             color = first_century,
             linetype = first_century,
             group = first_century)) + 
  geom_line()

SELECT DISTINCT id_bioguide,
  min(term_start) OVER w AS first_term,
  first_value(state) OVER w AS first_state
FROM legislators_terms 
WINDOW w AS (PARTITION BY id_bioguide ORDER BY term_start)
ORDER BY id_bioguide

Displaying records 1 - 10
id_bioguide	first_term	first_state
A000001	1951-01-03	ND
A000002	1947-01-03	VA
A000003	1817-12-01	GA
A000004	1843-12-04	MA
A000005	1887-12-05	TX
A000006	1868-01-01	NC
A000007	1875-12-06	MA
A000008	1857-12-07	ME
A000009	1973-01-03	SD
A000010	1954-01-01	NE

first_states <-
  first_terms <- 
  legislators_terms |>
  group_by(id_bioguide) |>
  window_order(term_start) |>
  mutate(first_term = min(term_start, na.rm = TRUE),
         first_state = first(state)) |>
  ungroup() |>
  select(id_bioguide, first_term, first_state) |>
  distinct()

WITH first_states AS (
  SELECT DISTINCT id_bioguide,
    min(term_start) OVER w AS first_term,
    first_value(state) OVER w AS first_state
  FROM legislators_terms 
  WINDOW w AS (PARTITION BY id_bioguide ORDER BY term_start)),

state_first_periods AS (
  SELECT 
    first_state,
    coalesce(date_part('year', age(c.date, a.first_term)), 0) AS period,
    COUNT(DISTINCT a.id_bioguide) AS cohort_retained
  FROM first_states AS a
  INNER JOIN legislators_terms b 
  USING (id_bioguide)
  LEFT JOIN year_ends c 
  ON c.date BETWEEN b.term_start AND b.term_end 
  GROUP BY 1, 2)

SELECT first_state, period,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained,
  cohort_retained * 1.0 / first_value(cohort_retained) OVER w AS prop_retained
FROM state_first_periods
WINDOW w AS (PARTITION BY first_state ORDER BY period)
ORDER BY 1, 2
LIMIT 3;

3 records
first_state	period	cohort_size	cohort_retained	prop_retained
AK	0	19	19	1.0000000
AK	1	19	19	1.0000000
AK	2	19	15	0.7894737

state_first_periods <-
  first_states |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(first_state, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop")

state_pct_retaineds <- 
  state_first_periods |>
  select(first_state, period, cohort_retained) |>
  group_by(first_state) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained),
         prop_retained = cohort_retained * 1.0 / cohort_size) |>
  ungroup()

state_pct_retaineds |>
  arrange(first_state, period)

first_state	period	cohort_retained	cohort_size	prop_retained
AK	0	19	19	1.0000000
AK	1	19	19	1.0000000
AK	2	15	19	0.7894737

top_5_states <- c("NY", "PA", "OH", "IL", "MA")

state_pct_retaineds |>
  filter(first_state %in% top_5_states) |>
  ggplot(aes(x = period, 
             y = prop_retained,
             color = first_state,
             linetype = first_state,
             group = first_state)) + 
  geom_line()

4.3.4 Defining the Cohort from a Separate Table

WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1)
  
SELECT d.gender,
  coalesce(date_part('year',age(c.date,a.first_term)),0) AS period,
  count(distinct a.id_bioguide) AS cohort_retained
FROM first_terms a
JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide 
LEFT JOIN year_ends c ON c.date BETWEEN b.term_start AND b.term_end 
INNER JOIN legislators d ON a.id_bioguide = d.id_bioguide
GROUP BY 1,2
ORDER BY 2,1
LIMIT 4;

4 records
gender	period	cohort_retained
F	0	366
M	0	12152
F	1	349
M	1	11979

cohorts <-
  first_terms |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, join_by(between(y$date, x$term_start, x$term_end))) |>
  inner_join(legislators, by = "id_bioguide") |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(gender, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop")

cohorts |>
  arrange(period, gender) |>
  collect(n = 4)

gender	period	cohort_retained
F	0	366
M	0	12152
F	1	349
M	1	11979

WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),
  
cohorts AS (
  SELECT d.gender,
    coalesce(date_part('year',age(c.date,a.first_term)),0) as period,
    count(distinct a.id_bioguide) as cohort_retained
  FROM first_terms a
  JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
  LEFT JOIN year_ends c on c.date between b.term_start and b.term_end 
  JOIN legislators d on a.id_bioguide = d.id_bioguide
  GROUP BY 1, 2)
  
SELECT gender, period,
  cohort_retained,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained * 1.0 / 
  first_value(cohort_retained) OVER w AS pct_retained
FROM cohorts aa
WINDOW w AS (partition by gender order by period)
ORDER BY 2, 1
LIMIT 4;

4 records
gender	period	cohort_retained	cohort_size	pct_retained
F	0	366	366	1.0000000
M	0	12152	12152	1.0000000
F	1	349	366	0.9535519
M	1	11979	12152	0.9857637

cohorts |>
  group_by(gender) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained)) |>
  ungroup() |>
  mutate(pct_retained = 1.0 * cohort_retained / cohort_size) |>
    arrange(period, gender) |>
  collect(n = 4)

gender	period	cohort_retained	cohort_size	pct_retained
F	0	366	366	1.0000000
M	0	12152	12152	1.0000000
F	1	349	366	0.9535519
M	1	11979	12152	0.9857637

WITH first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms 
  GROUP BY 1),
  
cohorts AS (
  SELECT d.gender,
    coalesce(date_part('year',age(c.date, a.first_term)),0) as period,
    count(distinct a.id_bioguide) as cohort_retained
  FROM first_terms a
  JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
  LEFT JOIN year_ends c on c.date between b.term_start and b.term_end 
  JOIN legislators d on a.id_bioguide = d.id_bioguide
  WHERE first_term BETWEEN '1917-01-01' AND '1999-12-31'
  GROUP BY 1, 2)
  
SELECT gender, period,
  cohort_retained,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained * 1.0 / 
  first_value(cohort_retained) OVER w AS pct_retained
FROM cohorts aa
WINDOW w AS (PARTITION BY gender ORDER BY period)
ORDER BY 2, 1
LIMIT 4;

4 records
gender	period	cohort_retained	cohort_size	pct_retained
F	0	200	200	1.0000000
M	0	3833	3833	1.0000000
F	1	187	200	0.9350000
M	1	3769	3833	0.9833029

cohorts <-
  first_terms |>
  filter(between(first_term, '1917-01-01', '1999-12-31')) |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, join_by(between(y$date, x$term_start, x$term_end))) |>
  inner_join(legislators, by = "id_bioguide") |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(gender, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop") 

cohorts |>
  group_by(gender) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained)) |>
  ungroup() |>
  mutate(pct_retained = 1.0 * cohort_retained / cohort_size) |>
  arrange(period, gender) |>
  collect(n = 4)

gender	period	cohort_retained	cohort_size	pct_retained
F	0	200	200	1.0000000
M	0	3833	3833	1.0000000
F	1	187	200	0.9350000
M	1	3769	3833	0.9833029

4.3.5 Dealing with Sparse Cohorts

WITH 
first_terms AS (
  SELECT id_bioguide, min(term_start) AS term_start,
  FROM legislators_terms
  GROUP BY 1),

first_states AS (
  SELECT id_bioguide, gender,
    term_start AS first_term, 
    state AS first_state
  FROM first_terms
  INNER JOIN legislators_terms
  USING (id_bioguide, term_start)
  INNER JOIN legislators
  USING (id_bioguide)
  WHERE term_start between '1917-01-01' and '1999-12-31'),
  
cohorts AS (
  SELECT first_state, gender,
    coalesce(date_part('year', age(date, first_term)), 0) AS period,
    COUNT(DISTINCT id_bioguide) AS cohort_retained
  FROM first_states
  INNER JOIN legislators_terms
  USING (id_bioguide)
  LEFT JOIN year_ends
  ON date BETWEEN term_start AND term_end
  GROUP BY 1, 2, 3),

periods AS (
  SELECT generate_series as period 
   FROM generate_series(0, 20, 1)),

cohort_sizes AS (
  SELECT gender, first_state,
    COUNT(DISTINCT id_bioguide) AS cohort_size
  FROM first_states
  GROUP BY 1, 2),
  
pct_retaineds AS (
  SELECT gender, first_state, period, 
    cohort_size,
    coalesce(cohort_retained, 0) AS cohort_retained,
    coalesce(cohort_retained, 0) * 1.0 / cohort_size AS pct_retained
  FROM cohort_sizes
  CROSS JOIN periods
  LEFT JOIN cohorts
  USING (gender, first_state, period))
  
SELECT gender, first_state, cohort_size,
  max(case when period = 0 then pct_retained end) as yr0,
  max(case when period = 2 then pct_retained end) as yr2,
  max(case when period = 4 then pct_retained end) as yr4,
  max(case when period = 6 then pct_retained end) as yr6,
  max(case when period = 8 then pct_retained end) as yr8,
  max(case when period = 10 then pct_retained end) as yr10
FROM pct_retaineds
WHERE first_state IN ('AL', 'AR', 'CA')
GROUP BY 1, 2, 3
ORDER BY 1, 2
LIMIT 3;

3 records
gender	first_state	cohort_size	yr0	yr2	yr4	yr6	yr8	yr10
F	AL	3	1	0.00	0.0	0.00	0.00	0.00
F	AR	5	1	0.80	0.2	0.40	0.40	0.40
F	CA	25	1	0.92	0.8	0.64	0.68	0.68

first_terms <- 
  legislators_terms |>
  group_by(id_bioguide) |>
  summarize(term_start = min(term_start, na.rm = TRUE),
            .groups = "drop")

first_states <-
  first_terms |>
  inner_join(legislators_terms, join_by(id_bioguide, term_start)) |>
  inner_join(legislators, by = "id_bioguide") |>
  rename(first_term = term_start,
         first_state = state) |>
  filter(between(first_term, '1917-01-01', '1999-12-31')) |>
  select(id_bioguide, gender, first_term, first_state)

cohorts <-
  first_states |> 
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(first_state, gender, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide), 
            .groups = "drop")

From the query below, it seems we’re having an issue whereby legislators are “returning from the dead” in a sense.

cohorts |> 
  filter(first_state == "AR", gender == "F") |> 
  select(period, cohort_retained) |>
  arrange(period) |>
  collect(n = 8)

period	cohort_retained
0	5
1	5
2	4
3	4
4	1
5	1
6	2
7	2

Does this mean something is wrong with our query? Let’s check.

Of course, unlike cohorts of living people, dropping out of the cohort of legislators does not prevent you from returning later on. We can pull out a portion of the query creating cohorts and take a closer look. We first get the id_bioguide values for the female representatives from Arkansas (AR) who are around in period == 4. We then look at some data from legislators and legislators_terms for these two representatives.

weird_id_bioguides <-
  first_states |> 
  filter(first_state == "AR", gender == "F") |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  filter(period >= 4) |>
  select(id_bioguide) |>
  distinct() |>
  pull()

legislators |>
  filter(id_bioguide %in% weird_id_bioguides) |>
  inner_join(legislators_terms, by = "id_bioguide") |>
  select(id_bioguide, full_name, term_type, term_start, term_end) |>
  arrange(id_bioguide, term_start) |>
  collect()

id_bioguide	full_name	term_type	term_start	term_end
C000138	Hattie Wyatt Caraway	sen	1931-12-07	1933-03-03
C000138	Hattie Wyatt Caraway	sen	1933-03-09	1939-01-03
C000138	Hattie Wyatt Caraway	sen	1939-01-03	1945-01-03
L000035	Blanche Lambert Lincoln	rep	1993-01-05	1995-01-03
L000035	Blanche Lambert Lincoln	rep	1995-01-04	1997-01-03
L000035	Blanche Lambert Lincoln	sen	1999-01-06	2005-01-03
L000035	Blanche Lambert Lincoln	sen	2005-01-04	2011-01-03

From the above, it seems that Blanche Lambert Lincoln would have dropped out of the cohort in periods 4 and 5, then returned in 6 and 7.

To aid this kind of “debugging” of our queries, we could easily have retained some underlying data in cohorts. For example, by added cohort_ids = array_agg(id_bioguide) to the query, we can easily interrogate cohorts for the underlying legislator IDs.

cohorts <-
  first_states |> 
  inner_join(legislators_terms, by = "id_bioguide") |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(first_state, gender, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            cohort_ids = array_agg(id_bioguide),
            .groups = "drop")

cohorts |>
  filter(first_state == "AR", gender == "F", 
         between(period, 3, 8)) |>
  mutate(cohort_ids = as.character(cohort_ids)) |>
  select(period, cohort_retained, cohort_ids) |>
  arrange(period) |>
  collect()

period	cohort_retained	cohort_ids
3	4	[L000035, C000138, W000634, O000061]
4	1	[C000138]
5	1	[C000138]
6	2	[L000035, C000138]
7	2	[L000035, C000138]
8	2	[L000035, C000138]

periods <- 
  tibble(period = 1:20) |>
  copy_to(db, df = _)

cohort_sizes <-
  first_states |>
  group_by(gender, first_state) |>
  summarize(cohort_size = n_distinct(id_bioguide),
            .groups = "drop")

pct_retaineds <-
  cohort_sizes |>
  cross_join(periods) |>
  left_join(cohorts, 
            join_by(gender, first_state, period)) |>
  mutate(cohort_retained = coalesce(cohort_retained, 0),
         pct_retained = cohort_retained * 1.0/ cohort_size) |>
  select(gender, first_state, period, cohort_size,
         cohort_retained, pct_retained)

pct_retaineds |>
  filter(period %in% c(2, 4, 6, 8, 10),
         first_state %in% c('AL', 'AR', 'CA')) |>
  select(gender, first_state, cohort_size, period, pct_retained) |>
  pivot_wider(names_from = "period",
              names_prefix = "yr",
              values_from = "pct_retained") |>
  arrange(gender, first_state)

gender	first_state	cohort_size	yr2	yr4	yr6	yr8	yr10
F	AL	3	0.00	0.0	0.00	0.00	0.00
F	AR	5	0.80	0.2	0.40	0.40	0.40
F	CA	25	0.92	0.8	0.64	0.68	0.68

pct_retaineds |>
  filter(first_state %in% c('AL', 'AR', 'CA')) |>
  ggplot(aes(x = period, y = pct_retained, 
             colour = gender, group = gender)) +
  geom_line() +
  facet_wrap(vars(first_state), ncol = 1)

4.3.6 Defining Cohorts from Dates other than the First Date

WITH first_terms AS (
  SELECT DISTINCT id_bioguide, term_type,
    '2000-01-01'::date AS first_term,
    min(term_start) AS min_start
  FROM legislators_terms 
  WHERE term_start <= '2000-12-31' and term_end >= '2000-01-01'
  GROUP BY 1, 2, 3),
  
cohort_dates AS (
  SELECT id_bioguide, date
  FROM legislators_terms
  LEFT JOIN year_ends ON date BETWEEN term_start AND term_end),
  
cohorts AS (
  SELECT term_type,
    coalesce(date_part('year', age(date, first_term)), 0) AS period,
    COUNT(DISTINCT id_bioguide) AS cohort_retained
  FROM first_terms
  JOIN cohort_dates
  USING (id_bioguide)
  GROUP BY 1, 2)
  
SELECT term_type, period,
  first_value(cohort_retained) OVER w AS cohort_size,
  cohort_retained,
  cohort_retained * 1.0 / 
    first_value(cohort_retained) OVER w AS pct_retained
FROM cohorts
WHERE period >= 0
WINDOW w AS (PARTITION BY term_type ORDER BY period)
ORDER BY 2, 1
LIMIT 4;

4 records
term_type	period	cohort_size	cohort_retained	pct_retained
rep	0	440	440	1.0000000
sen	0	101	101	1.0000000
rep	1	440	392	0.8909091
sen	1	101	89	0.8811881

min_year <- 2000L

first_terms <-
  legislators_terms |>
  filter(year(term_start) <= min_year, 
         year(term_end) >= min_year) |>
  mutate(first_term = as.Date(paste0(min_year, '-01-01'))) |>
  group_by(id_bioguide, term_type, first_term) |>
  summarize(min_start = min(term_start, na.rm = TRUE),
            .groups = "drop")

cohort_dates <-
  legislators_terms |>
  left_join(year_ends, 
            join_by(between(y$date, x$term_start, x$term_end))) |>
  filter(year(date) >= min_year) |>
  select(id_bioguide, date)

cohorts <-
  first_terms |>
  inner_join(cohort_dates, by = "id_bioguide") |>
  mutate(period = coalesce(year(age(date, first_term)), 0)) |>
  group_by(term_type, period) |>
  summarize(cohort_retained = n_distinct(id_bioguide),
            .groups = "drop")

pct_retained_2000 <-
  cohorts |> 
  group_by(term_type) |>
  window_order(period) |>
  mutate(cohort_size = first(cohort_retained),
         pct_retained = cohort_retained * 1.0 / cohort_size) |>
  select(term_type, period, cohort_size, cohort_retained,
         pct_retained)

pct_retained_2000 |>
  arrange(period, term_type) |>
  collect(n = 4)

# A tibble: 4 × 5
# Groups:   term_type [2]
  term_type period cohort_size cohort_retained pct_retained
  <chr>      <dbl>       <dbl>           <dbl>        <dbl>
1 rep            0         439             439        1    
2 sen            0         101             101        1    
3 rep            1         439             392        0.893
4 sen            1         101              89        0.881

pct_retained_2000 |>
  filter(period <= 20) |>
  mutate(pct_retained = pct_retained * 100) |>
  ggplot(aes(x = period, y = pct_retained,
             colour = term_type, group = term_type)) +
  geom_line()

4.4 Related Cohort Analyses

4.4.1 Survivorship

WITH first_centuries AS (
   SELECT id_bioguide,
    date_part('century', min(term_start)) AS first_century,
    min(term_start) AS first_term,
    max(term_start) AS last_term,
    date_part('year', age(max(term_start), min(term_start))) AS tenure
  FROM legislators_terms
  GROUP BY 1)
  
SELECT first_century,
  count(distinct id_bioguide) as cohort_size,
  count(distinct case when tenure >= 10 then id_bioguide end) as survived_10,
  count(distinct case when tenure >= 10 then id_bioguide end) * 1.0 
 / count(distinct id_bioguide) as pct_survived_10
FROM first_centuries
GROUP BY 1
ORDER BY 1;

4 records
first_century	cohort_size	survived_10	pct_survived_10
18	368	83	0.2255435
19	6299	892	0.1416098
20	5091	1853	0.3639756
21	760	119	0.1565789

WITH first_centuries AS (
  SELECT id_bioguide,
    date_part('century', min(term_start)) AS first_century,
    count(term_start) AS total_terms
  FROM legislators_terms
  GROUP BY 1)
  
SELECT first_century,
  COUNT(DISTINCT id_bioguide) AS cohort_size,
  COUNT(DISTINCT CASE WHEN total_terms >= 5 THEN id_bioguide END) AS survived_5,
  COUNT(DISTINCT CASE WHEN total_terms >= 5 THEN id_bioguide END) * 1.0
    / count(distinct id_bioguide) AS pct_survived_5_terms
FROM first_centuries
GROUP BY 1
ORDER BY 1;

4 records
first_century	cohort_size	survived_5	pct_survived_5_terms
18	368	63	0.1711957
19	6299	711	0.1128751
20	5091	2153	0.4229032
21	760	205	0.2697368

WITH first_centuries AS (
  SELECT id_bioguide,
    date_part('century',min(term_start)) AS first_century,
    count(term_start) AS total_terms
  FROM legislators_terms
  GROUP BY 1),

terms AS (
  SELECT generate_series as terms 
  FROM generate_series(1, 20, 1))
  
SELECT first_century,
  terms,
  count(distinct id_bioguide) as cohort,
  count(distinct case when total_terms >= terms then id_bioguide end) as cohort_survived,
  count(distinct case when total_terms >= terms then id_bioguide end) * 1.0 
 / count(distinct id_bioguide) as pct_survived
FROM first_centuries
CROSS JOIN terms
GROUP BY 1, 2
ORDER BY first_century, terms
LIMIT 3;

3 records
first_century	terms	cohort	cohort_survived	pct_survived
18	1	368	368	1.0000000
18	2	368	249	0.6766304
18	3	368	153	0.4157609

first_centuries <-
  legislators_terms |>
  group_by(id_bioguide) |>
  summarize(first_century = century(min(term_start, na.rm = TRUE)),
            total_terms = n())

terms <- 
  tibble(terms = 1:20) |>
  copy_to(db, df = _, name = "terms")

first_centuries |>
  cross_join(terms) |>
  mutate(survived_id = if_else(total_terms >= terms, id_bioguide, NA)) |>
  group_by(first_century, terms) |>
  summarize(cohort = n_distinct(id_bioguide),
            cohort_survived = n_distinct(survived_id), 
            .groups = "drop") |>
  mutate(pct_survived = 1.0 * cohort_survived / cohort) |>
  arrange(first_century, terms)

first_century	terms	cohort	cohort_survived	pct_survived
18	1	368	368	1.0000000
18	2	368	249	0.6766304
18	3	368	153	0.4157609

4.4.2 Returnship, or Repeat Purchase Behaviour

WITH rep_first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms
  WHERE term_type = 'rep'
  GROUP BY 1)
      
SELECT date_part('century', first_term)::int AS cohort_century,
  count(id_bioguide) AS reps
FROM rep_first_terms
GROUP BY 1
ORDER BY 1

4 records
cohort_century	reps
18	299
19	5773
20	4481
21	683

WITH rep_first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms
  WHERE term_type = 'rep'
  GROUP BY 1)

SELECT date_part('century', first_term) AS cohort_century,
  count(id_bioguide) as reps
FROM rep_first_terms
GROUP BY 1
ORDER BY 1;

4 records
cohort_century	reps
18	299
19	5773
20	4481
21	683

WITH rep_first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms
  WHERE term_type = 'rep'
  GROUP BY 1),

num_reps AS (
  SELECT date_part('century', first_term) as cohort_century,
    count(id_bioguide) as reps
  FROM rep_first_terms
  GROUP BY 1),
  
sens AS (
  SELECT date_part('century',b.first_term) as cohort_century,
    count(distinct b.id_bioguide) as rep_and_sen
  FROM rep_first_terms b
  JOIN legislators_terms c on b.id_bioguide = c.id_bioguide
  and c.term_type = 'sen' and c.term_start > b.first_term
  GROUP BY 1)
  
SELECT aa.cohort_century,
  bb.rep_and_sen * 1.0 / aa.reps as pct_rep_and_sen
FROM num_reps aa
LEFT JOIN sens bb 
on aa.cohort_century = bb.cohort_century;

4 records
cohort_century	pct_rep_and_sen
20	0.0566838
21	0.0366032
18	0.1906355
19	0.0569894

WITH 

rep_first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms
  WHERE term_type = 'rep'
  GROUP BY 1),
  
reps AS (
  SELECT date_part('century',a.first_term) as cohort_century,
    count(id_bioguide) as reps
  FROM rep_first_terms a
  WHERE first_term <= '2009-12-31'
  GROUP BY 1),
  
sens AS (
  SELECT date_part('century',b.first_term) as cohort_century,
    count(distinct b.id_bioguide) as rep_and_sen
  FROM rep_first_terms b
  JOIN legislators_terms c 
  on b.id_bioguide = c.id_bioguide
    and c.term_type = 'sen' and c.term_start > b.first_term
    WHERE age(c.term_start, b.first_term) <= interval '10 years'
  GROUP BY 1)
  
SELECT aa.cohort_century
,bb.rep_and_sen * 1.0 / aa.reps as pct_rep_and_sen
FROM reps aa
LEFT JOIN sens bb 
USING (cohort_century);

4 records
cohort_century	pct_rep_and_sen
20	0.0348137
21	0.0763636
18	0.0969900
19	0.0244240

WITH 

rep_first_terms AS (
  SELECT id_bioguide, min(term_start) AS first_term
  FROM legislators_terms
  WHERE term_type = 'rep'
  GROUP BY 1),
  
reps AS (
  SELECT date_part('century', first_term) as cohort_century,
    count(id_bioguide) as reps
  FROM rep_first_terms
  WHERE first_term <= '2009-12-31'
  GROUP BY 1),
  
sen_terms AS (
  SELECT id_bioguide, term_type, term_start
  FROM legislators_terms
  WHERE term_type = 'sen'),
  
gaps AS (
  SELECT id_bioguide, first_term, 
    age(term_start, first_term) AS gap
  FROM rep_first_terms 
  JOIN sen_terms 
  USING (id_bioguide)
  WHERE term_start > first_term),
  
gap_indicators AS (
  SELECT date_part('century', first_term) as cohort_century,
    count(distinct case when gap <= interval '5 years' then id_bioguide end) as rep_and_sen_5_yrs,
    count(distinct case when gap <= interval '10 years' then id_bioguide end) as rep_and_sen_10_yrs,
    count(distinct case when gap <= interval '15 years' then id_bioguide end) as rep_and_sen_15_yrs
  FROM gaps
  GROUP BY 1)

SELECT cohort_century::int as cohort_century,
  round(rep_and_sen_5_yrs * 1.0 / reps, 4) as pct_5_yrs,
  round(rep_and_sen_10_yrs * 1.0 / reps, 4) as pct_10_yrs,
  round(rep_and_sen_15_yrs * 1.0 / reps, 4) as pct_15_yrs
FROM reps
LEFT JOIN gap_indicators
USING (cohort_century)
ORDER BY cohort_century

4 records
cohort_century	pct_5_yrs	pct_10_yrs	pct_15_yrs
18	0.0502	0.0970	0.1438
19	0.0088	0.0244	0.0409
20	0.0100	0.0348	0.0478
21	0.0400	0.0764	0.0873

4.4.3 Cumulative Calculations

WITH types AS (
  SELECT distinct id_bioguide,
    first_value(term_type) over w as first_type,
    min(term_start) over w as first_term,
    (min(term_start) over w) + interval '10 years' as first_plus_10
  FROM legislators_terms
  WINDOW w AS (partition by id_bioguide order by term_start))
  
SELECT date_part('century',a.first_term)::int as century,
  first_type,
  count(distinct a.id_bioguide) as cohort,
  count(b.term_start) as terms
FROM types a
LEFT JOIN legislators_terms b 
on a.id_bioguide = b.id_bioguide and 
  b.term_start between a.first_term and a.first_plus_10
GROUP BY 1, 2;

8 records
century	first_type	cohort	terms
20	rep	4473	16203
19	rep	5744	12165
20	sen	618	1008
19	sen	555	795
18	sen	71	101
21	sen	77	118
21	rep	683	2203
18	rep	297	760

WITH a AS (
  SELECT DISTINCT id_bioguide,
    first_value(term_type) OVER w AS first_type,
    min(term_start) OVER w AS first_term
  FROM legislators_terms
  WINDOW w AS (PARTITION BY id_bioguide ORDER BY term_start)),
  
cohort_terms AS (
  SELECT date_part('century',a.first_term)::int AS century,
    first_type,
    COUNT(distinct a.id_bioguide) AS cohort,
    COUNT(b.term_start) as terms,
    COUNT(b.term_start) * 1.0 / COUNT(DISTINCT a.id_bioguide) AS cohort_term
  FROM a
  LEFT JOIN legislators_terms b 
  on a.id_bioguide = b.id_bioguide AND 
    b.term_start BETWEEN a.first_term AND a.first_term + interval '10 years'
  GROUP BY 1, 2)
  
SELECT century,
  max(case when first_type = 'rep' then cohort end) as rep_cohort,
  max(case when first_type = 'rep' then cohort_term end) as avg_rep_terms,
  max(case when first_type = 'sen' then cohort end) as sen_cohort,
  max(case when first_type = 'sen' then cohort_term end) as avg_sen_terms
FROM cohort_terms
GROUP BY 1;

4 records
century	rep_cohort	avg_rep_terms	sen_cohort	avg_sen_terms
20	4473	3.622401	618	1.631068
19	5744	2.117862	555	1.432432
21	683	3.225476	77	1.532468
18	297	2.558923	71	1.422535

4.5 Cross-Section Analysis, Through a Cohort Lens

SELECT b.date, count(distinct a.id_bioguide) as legislators
FROM legislators_terms a
JOIN year_ends b on b.date between a.term_start and a.term_end
and b.date <= '2020-12-31'
GROUP BY 1
;

Displaying records 1 - 10
date	legislators
1855-12-31	306
1856-12-31	306
1853-12-31	315
1854-12-31	316
1851-12-31	303
1852-12-31	310
1849-12-31	310
1850-12-31	313
1848-12-31	306
1847-12-31	301

SELECT b.date
,date_part('century',first_term)::int as century
,count(distinct a.id_bioguide) as legislators
FROM legislators_terms a
JOIN year_ends b on b.date between a.term_start and a.term_end
and b.date <= '2020-12-31'
JOIN
(
        SELECT id_bioguide, min(term_start) as first_term
        FROM legislators_terms
        GROUP BY 1
) c on a.id_bioguide = c.id_bioguide        
GROUP BY 1,2
;

Displaying records 1 - 10
date	century	legislators
1919-12-31	20	516
1965-12-31	20	546
1995-12-31	20	539
2005-12-31	20	388
1975-12-31	20	547
1811-12-31	19	153
1893-12-31	19	470
1909-12-31	20	344
1937-12-31	20	547
2010-12-31	21	262

SELECT date
,century
,legislators
,sum(legislators) over (partition by date) as cohort
,legislators * 100.0 / sum(legislators) over (partition by date) as pct_century
FROM
(
        SELECT b.date
        ,date_part('century',first_term)::int as century
        ,count(distinct a.id_bioguide) as legislators
        FROM legislators_terms a
        JOIN year_ends b on b.date between a.term_start and a.term_end
and b.date <= '2020-01-01'
        JOIN
        (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms
                GROUP BY 1
        ) c on a.id_bioguide = c.id_bioguide        
        GROUP BY 1,2
) a
ORDER BY 1,2
;

Displaying records 1 - 10
date	century	legislators	cohort	pct_century
1789-12-31	18	89	89	100
1790-12-31	18	95	95	100
1791-12-31	18	99	99	100
1792-12-31	18	101	101	100
1793-12-31	18	141	141	100
1794-12-31	18	140	140	100
1795-12-31	18	145	145	100
1796-12-31	18	150	150	100
1797-12-31	18	152	152	100
1798-12-31	18	155	155	100

WITH 

first_terms AS (
  SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms
                GROUP BY 1),
                
aa AS (
  SELECT b.date,
    date_part('century',first_term)::int as century,
    count(distinct a.id_bioguide) as legislators
  FROM legislators_terms a
  JOIN year_ends b 
  ON b.date between a.term_start and a.term_end
    and b.date <= '2020-01-01'
  JOIN first_terms c 
  USING (id_bioguide)
  GROUP BY 1,2) 
                
SELECT date,
  coalesce(sum(case when century = 18 then legislators end) * 100.0 / sum(legislators),0) as pct_18,
  coalesce(sum(case when century = 19 then legislators end) * 100.0 / sum(legislators),0) as pct_19,
  coalesce(sum(case when century = 20 then legislators end) * 100.0 / sum(legislators),0) as pct_20,
  coalesce(sum(case when century = 21 then legislators end) * 100.0 / sum(legislators),0) as pct_21
FROM aa
GROUP BY 1
ORDER BY 1;

Displaying records 1 - 10
date	pct_18	pct_19	pct_20	pct_21
1789-12-31	100	0	0	0
1790-12-31	100	0	0	0
1791-12-31	100	0	0	0
1792-12-31	100	0	0	0
1793-12-31	100	0	0	0
1794-12-31	100	0	0	0
1795-12-31	100	0	0	0
1796-12-31	100	0	0	0
1797-12-31	100	0	0	0
1798-12-31	100	0	0	0

SELECT id_bioguide, date,
  count(date) over (partition by id_bioguide order by date rows between unbounded preceding and current row) as cume_years
FROM (
  SELECT distinct id_bioguide, date
  FROM legislators_terms
  JOIN year_ends 
  ON date between term_start and term_end
    and date <= '2020-01-01');

Displaying records 1 - 10
id_bioguide	date	cume_years
A000009	1973-12-31	1
A000009	1974-12-31	2
A000009	1975-12-31	3
A000009	1976-12-31	4
A000009	1977-12-31	5
A000009	1978-12-31	6
A000009	1979-12-31	7
A000009	1980-12-31	8
A000009	1981-12-31	9
A000009	1982-12-31	10

SELECT date, cume_years,
  count(distinct id_bioguide) as legislators
FROM
(
    SELECT id_bioguide, date
    ,count(date) over (partition by id_bioguide order by date rows between unbounded preceding and current row) as cume_years
    FROM
    (
        SELECT distinct a.id_bioguide, b.date
        FROM legislators_terms a
        JOIN year_ends b 
        on b.date between a.term_start and a.term_end and b.date <= '2020-01-01'
        GROUP BY 1,2
    ) aa
) aaa
GROUP BY 1,2
;

Displaying records 1 - 10
date	cume_years	legislators
1995-12-31	17	30
1999-12-31	5	68
2018-12-31	6	71
1988-12-31	2	54
1978-12-31	2	79
1903-12-31	7	75
2018-12-31	2	61
1948-12-31	3	10
1868-12-31	2	119
1999-12-31	9	30

WITH aa AS (
  SELECT DISTINCT a.id_bioguide, b.date
  FROM legislators_terms a
  JOIN year_ends b 
    on b.date BETWEEN a.term_start AND a.term_end AND b.date <= '2020-01-01'
  GROUP BY 1,2),
  
aaa AS (
  SELECT id_bioguide, date,
    count(date) OVER (PARTITION BY id_bioguide ORDER BY date) AS cume_years
  FROM aa),
  
aaaa AS (
  SELECT date, cume_years,
    COUNT(DISTINCT id_bioguide) as legislators
  FROM aaa
  GROUP BY 1,2)
SELECT date, count(*) as tenures
FROM aaaa
GROUP BY 1;

Displaying records 1 - 10
date	tenures
1973-12-31	35
1974-12-31	36
1975-12-31	36
1976-12-31	36
1977-12-31	34
1978-12-31	34
1979-12-31	31
1980-12-31	32
1981-12-31	31
1982-12-31	32

WITH term_dates AS (
  SELECT distinct a.id_bioguide, b.date
  FROM legislators_terms a
  JOIN year_ends b 
  on b.date between a.term_start and a.term_end and b.date <= '2020-01-01'),

cum_term_dates AS (
  SELECT id_bioguide, date,
    count(date) over (partition by id_bioguide order by date rows between unbounded preceding and current row) as cume_years
  FROM term_dates),
  
cum_term_bands AS (
  SELECT date,
    case when cume_years <= 4 then '1 to 4'
         when cume_years <= 10 then '5 to 10'
        when cume_years <= 20 then '11 to 20'
         else '21+' end as tenure,
    count(distinct id_bioguide) as legislators
  FROM cum_term_dates
  GROUP BY 1,2)
  
SELECT date, tenure,
  legislators * 100.0 / sum(legislators) over w as pct_legislators 
FROM cum_term_bands
WINDOW w AS (partition by date)
ORDER BY date DESC;

Displaying records 1 - 10
date	tenure	pct_legislators
2019-12-31	1 to 4	29.98138
2019-12-31	5 to 10	32.02980
2019-12-31	21+	17.87710
2019-12-31	11 to 20	20.11173
2018-12-31	11 to 20	21.33581
2018-12-31	5 to 10	33.76623
2018-12-31	1 to 4	25.60297
2018-12-31	21+	19.29499
2017-12-31	11 to 20	21.37546
2017-12-31	5 to 10	34.75836

term_dates <-
  legislators_terms |>
  inner_join(year_ends |> filter(date <= '2020-01-01'),
             join_by(between(y$date, x$term_start, x$term_end))) |>
  distinct(id_bioguide, date)

cum_term_dates <-
  term_dates |>
  group_by(id_bioguide) |>
  window_order(date) |>
  window_frame(-Inf, 0) |>
  mutate(cume_years = n()) |>
  ungroup() |>
  select(id_bioguide, date, cume_years)

cum_term_bands <-
  cum_term_dates |> 
  mutate(tenure = case_when(cume_years <= 4 ~ '1 to 4',
                            cume_years <= 10 ~ '5 to 10',
                            cume_years <= 20 ~ '11 to 20',
                            TRUE ~ '21+')) |>
  group_by(date, tenure) |>
  summarize(legislators = n_distinct(id_bioguide),
            .groups = "drop")

total_legs <-
  cum_term_bands |>
  group_by(date) |>
  summarize(num_legs = sum(legislators, na.rm = TRUE),
            .groups = "drop")

cum_term_bands |>
  inner_join(total_legs, by = "date") |>
  mutate(pct_legislators = legislators * 100.0 / num_legs) |>
  select(date, tenure, pct_legislators) |>
  arrange(desc(date)) |>
  collect(n = 8)

date	tenure	pct_legislators
2019-12-31	5 to 10	32.02980
2019-12-31	1 to 4	29.98138
2019-12-31	11 to 20	20.11173
2019-12-31	21+	17.87710
2018-12-31	11 to 20	21.33581
2018-12-31	5 to 10	33.76623
2018-12-31	21+	19.29499
2018-12-31	1 to 4	25.60297

dbDisconnect(db, shutdown = TRUE)

In writing this sentence, I am still working through the chapter.↩︎
See here for the source for these data.↩︎
I made no effort to research how they are constructed because (1) I am lazy and (2) my imagined version is probably better for my current purposes.↩︎
There are some details I’m glossing over here, such as the fact that there will be some \(j\) where the cumulative survival probability is just above one-half, but where that for \(j + 1\) is just below one-half, so some interpolation will be required.↩︎