Cohort analysis
1 Introduction
This “book” is an iterative reworking of Chapter 4 of Tanimura (2021). I start with Chapter 2, which is a rendition of the original Chapter 4 using Quarto. From a conversation with the author, my understanding is that Tanimura (2021) was written as a Google Docs document with copied-and-pasted output from SQL (for tabular output) and Tableau or Python (for plots). In contrast, I am able to reproduce Chapter 4 without leaving the confines of RStudio … and with no copying-and-pasting.
I then move on to Chapter 3, which is a reworking of Chapter 2 using common table expressions (CTEs) in place of subqueries. I have argued elsewhere that CTEs make for simpler and more accurate SQL than subqueries.[^1] I also think that CTEs improve the chapter from a pedagogical perspective, as the queries, and their constituents, are much clearer.
Chapter 4 builds on Chapter 3 by translating the CTE-based queries into equivalent `dbplyr` code. The `dbplyr` package is a largely behind-the-scenes R package that is invoked by `dplyr` code when the data source is a remote data frame. The `dplyr` package is arguably the foundation for the Tidyverse, an “opinionated collection of R packages designed for data science … [that] share an underlying design philosophy, grammar, and data structures.”[^2] As I wrote here, `dplyr` has deep connections with SQL and `dbplyr` brings these connections to life by literally translating code written using `dplyr` functions into SQL.
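To make this translation concrete, here is a minimal sketch using `dbplyr::lazy_frame()`, which simulates a remote table so that the generated SQL can be inspected without connecting to a live database. The table and column names are mine, purely for illustration:

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# lazy_frame() simulates a remote table, so we can see the SQL that
# dbplyr would send to the database. (Names here are illustrative.)
legislators <- lazy_frame(id_bioguide = character(),
                          term_start = as.Date(character()),
                          con = simulate_postgres())

# show_query() prints the SELECT ... GROUP BY statement that dbplyr
# generates from this dplyr pipeline.
legislators |>
  group_by(id_bioguide) |>
  summarize(first_term = min(term_start, na.rm = TRUE)) |>
  show_query()
```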
1.1 Why `dbplyr`?
But `dbplyr` is more than just a way for users of `dplyr` to access SQL data sources without having to learn a new language. First, even if one is working with local data sources, `dbplyr` allows one to replace the in-memory “back-end” of R with much more performant back-ends, such as that provided by DuckDB. I have found cases where performance increases a thousand-fold by pushing the data from local data frames using `dplyr` to DuckDB with no more than a couple of lines of code.[^3]
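As a rough sketch of what those couple of lines look like, the following copies a local data frame into an in-memory DuckDB database and runs an aggregate query there; `nycflights13::flights` is just a stand-in for whatever local data one happens to have:

```r
library(dplyr, warn.conflicts = FALSE)
library(duckdb)  # also attaches DBI

# Copy a local data frame into an in-memory DuckDB database.
con <- dbConnect(duckdb())
flights_db <- copy_to(con, nycflights13::flights, "flights")

# This pipeline now runs inside DuckDB; collect() retrieves the result.
flights_db |>
  group_by(carrier) |>
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```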
Second, `dbplyr` allows one to do things that cannot be done easily with `dplyr`. For example, using `dbplyr`-specific functions such as `window_frame()` and `window_order()` opens up the rich world of window functions in a way that is not possible using `dplyr` alone. An illustration of this is provided in our book chapter on the seminal “FFJR” paper, where `window_frame()` makes it easy to compute total dividends for the 12 months before and the 12 months after a stock split.[^4]
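Here is a sketch of the idea, again on a simulated table with made-up column names rather than the actual FFJR data: `window_order()` supplies the `ORDER BY` of the window and `window_frame()` the `ROWS BETWEEN` clause, together expressing a rolling 12-month sum.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# A simulated table of monthly dividends (illustrative column names).
dividends <- lazy_frame(permno = integer(), month = integer(),
                        dividend = double(),
                        con = simulate_postgres())

# window_order() sets ORDER BY; window_frame() sets the frame
# (here: the current row and the 11 preceding rows, i.e. 12 months).
dividends |>
  group_by(permno) |>
  window_order(month) |>
  window_frame(from = -11, to = 0) |>
  mutate(div_12m = sum(dividend, na.rm = TRUE)) |>
  show_query()
```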
My view is that `dbplyr` is the “killer app” of the Tidyverse. Combining the power of SQL (and more) with the ability to make documents using Quarto and plots using `ggplot2`, along with all the statistical power of R, means that one rarely has to leave the confines of RStudio (or whatever interface you choose) to get all sorts of things done.
I also think that writing SQL queries using `dbplyr` is easier than writing SQL directly. And there are few limitations to `dbplyr` relative to SQL. An illustration of this can be seen here, where I translate essentially all of the SQL queries in Tanimura (2021) into equivalent `dplyr`/`dbplyr` queries.[^5]
While Python is the preferred data science platform of many users, I think `dplyr` is one reason to learn a little R. There is no exact equivalent of `dplyr`/`dbplyr` in Python. While a user of `dplyr` could generally use pandas to achieve many of the same tasks, one cannot point pandas to an SQL data source and almost seamlessly move to a database back-end. While libraries like Ibis have recently emerged to allow users to interact with a variety of database backends in a consistent way that feels a lot like using `dplyr`, a user of pandas would need to learn new approaches.[^6]
1.2 Thoughts on Chapter 4 of Tanimura (2021)
I am a big fan of Tanimura (2021). Too many books on SQL are written as though the readers will become database administrators. As such, they tend to start with `CREATE TABLE` statements and discussions of database design. While I think such an approach covers material that data analysts might ultimately benefit from knowing, it’s the wrong place to start for someone who wants to learn how to analyse data. In this regard, Tanimura (2021) is excellent, as it’s aimed squarely at data analysts, not database administrators.
Another feature of Tanimura (2021) that I like is that it is very hands-on, with data and code being made available to allow readers to run all queries shown in the book. Furthermore, the queries generally use real data, including data on US legislators (explored here), UFO sightings, and earthquakes. While these settings might sound exotic, Tanimura (2021) does an excellent job of motivating the business contexts in which the analyses examined would be relevant. For example, retention analysis might be used to study factors that affect whether customers renew subscriptions.
However, there are weaknesses. The first is that not all readers will see the parallels between analysing data on the tenures of legislators and business contexts such as customer retention as easily as the author, Cathy Tanimura, evidently does.[^7]
Second, speaking specifically to Chapter 4 of Tanimura (2021), at 54 pages, the chapter is very long. I suspect that even very dedicated readers would take a long time to work through that chapter and that many readers would struggle to recall the early parts of the chapter by the time they got to the end.
Third, the ideas covered in Chapter 4 are difficult, and at times the presentation is hard to follow. For example, the chapter starts out with an incorrect query before showing how to do things correctly, then works through several permutations before returning to a discussion of incorrect approaches (i.e., cross-sectional analyses that introduce possible survivorship biases). I feel that the complexity is increased by the use of subqueries, which makes the parallels between the various analyses less clear.
So one aim of this “book” is to develop an alternative approach to the analysis that is clearer for readers.
1.3 DuckDB and learning SQL
Apart from an orientation to database administrators, another weakness of books on SQL is that they inevitably need to deal with at least one implementation of SQL if readers are to have the opportunity to run SQL code themselves. PostgreSQL, SQL Server, MySQL, and SQLite are popular choices.
Tanimura (2021) uses PostgreSQL, a solid choice given its rich type system, modern SQL dialect and price (it’s free). Nonetheless, a reader of Tanimura (2021) needs to install PostgreSQL, set up a server, download data, then edit and run SQL scripts to stand up a database. This is a barrier to entry, particularly for a data analyst who is looking to learn SQL to do analysis.
DeBarros (2022) also uses PostgreSQL. He starts with a couple of chapters on setting up PostgreSQL and the usual `CREATE TABLE` fare.
Beaulieu (2020) uses MySQL, and Chapter 2 of that book is the usual “creating and populating a database” fare, with discussion of data types and `CREATE TABLE` queries.
Another option would be SQL Server, a Microsoft product. A student of SQL might encounter SQL Server because many of those teaching SQL come from a corporate IT background, where Microsoft has cornered the market by catering to the interests of IT departments rather than the needs of end-users. But this is a non-starter if you’re using macOS or Linux.[^8]
PostgreSQL, SQL Server, and MySQL all require a user to set up a database server to use them. Nield (2016) calls these centralized databases. In contrast, SQLite can be run inside another process, such as within Python or R. While this obviously lowers the set-up costs, my casual search for SQL learning materials suggests that SQLite is not an especially popular system for authors of such materials.
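To see what “inside another process” looks like in practice, here is a minimal R sketch using the RSQLite package; no server is installed or configured, and the example table is just R's built-in `mtcars`:

```r
library(DBI)
library(RSQLite)

# No server setup: the database lives inside the R process itself.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# An ordinary SQL query against the in-process database.
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")

dbDisconnect(con)
```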
Nield (2016) is one resource that uses SQLite, which he calls a lightweight database. While Chapter 3 of Nield (2016) looks a lot like similarly situated chapters in other books, the simplicity that SQLite brings is evident from the absence of `CREATE TABLE` discussion in that chapter.
Of course, SQLite has some weaknesses. Its type system is relatively weak, especially relative to PostgreSQL, which supports a rich set of types for date-times, geometry, JSON, and XML, as well as arrays and functions related to these. While the SQL offered by SQLite used to lack features of PostgreSQL, such as window functions, recent versions of SQLite appear to have closed the gap significantly. Still, gaps remain, such as the lack of support for regular expression functions.
DuckDB is another lightweight database. It is probably not as lightweight as SQLite, but it offers a rich SQL dialect a lot like PostgreSQL’s, along with many features, such as regular expression functions and a strong type system, that SQLite lacks.
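As a small illustration (assuming the duckdb package is installed), DuckDB is stood up in one line and happily runs PostgreSQL-flavoured SQL, such as regular expressions and interval arithmetic, that stock SQLite does not provide:

```r
library(DBI)
library(duckdb)

# DuckDB also runs in-process: one line stands up a database.
con <- dbConnect(duckdb())

# Regular expressions and date/interval arithmetic, PostgreSQL-style.
dbGetQuery(con, "
  SELECT regexp_matches('release 0.10.2', '[0-9]+\\.[0-9]+') AS has_version,
         DATE '2024-01-31' + INTERVAL 1 MONTH AS one_month_later
")

dbDisconnect(con, shutdown = TRUE)
```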
[^1]: In fact, the example I use to illustrate this point comes from Chapter 4 of Tanimura (2021).

[^2]: The Tidyverse was earlier known as the Hadleyverse after its principal architect, Hadley Wickham.

[^3]: For example, the query producing `risk_asymmetry` here took 15 minutes before I moved the query inside the database (either PostgreSQL or DuckDB), where it took about one second.

[^4]: While there are packages that allow such rolling averages, these are outside the Tidyverse itself and likely do not offer the same performance in many cases.

[^5]: I say essentially all because it seems I did not translate all the queries from Chapter 4 of Tanimura (2021); but I do those translations here.

[^6]: In some ways, pandas supplies functionality that comes with base R, so comparing it with `dplyr` is not entirely appropriate. But I think the comparison is valid in that many data analysts using Python in 2024 would use pandas in the same way that many data analysts using R would use `dplyr`.

[^7]: There’s a certain genius behind the data sets that Tanimura selects for each chapter.

[^8]: Even installing an ODBC driver for SQL Server requires compiling code on macOS because Microsoft doesn’t deign to provide a packaged driver itself.