SQL for Data Analysis using R

Author

Ian Gow

Published

19 May 2023

Preface

This “book” is a collection of notes made as I work through Tanimura (2021). An alternative title might have been “Data Analysis with Data Stored in Databases”; while SQL is the language used in Tanimura (2021) to do the analysis, this is in some ways a minor detail. My view is that one can easy do “SQL” analysis using the dplyr package in R (this uses dbplyr behind the scenes … mostly). The dplyr package will quietly translate R code into SQL for the user.

An advantage of using dplyr rather than SQL directly is that one doesn’t need to learn as much SQL. In principle, one could use dplyr without knowing any SQL. Given that dplyr and R can be used to analyse data from other data sources, this reduces the amount that is needed to be learnt to do analysis. Additionally, one could write code to analyse data from a database and then easily reuse the code for data from a CSV file.

Notwithstanding the discussion in the previous paragraph, I recommend that people who find themselves using SQL-driven databases a lot learn some SQL. This view is implicitly endorsed by Hadley Wickham with the inclusion of a significant amount of material intended to teach SQL to readers of “R for Data Science”. While we include a brief primer on SQL for dplyr users here, clearly an excellent source for learning SQL is Tanimura (2021) itself.

Another benefit of using R is that we can do more with R. Tanimura (2021) includes many plots of data produced using SQL, but the code to make the plots is not included. In contrast, here I will include code to produce the data as well as code to make plots (where applicable). In Chapter 7, we will see that using R avoids the need to go off to an “online calculator” to calculate $p$-values and the like.

My view is that the material here might be useful to someone who is using Tanimura (2021) and wants to see how the ideas there can be implemented using R, while at the same time being useful to someone who knows (or is looking to learn) R and is looking for realistic data sets to work on.

The structure of this book

First of all, I should note that this “book” is not a standalone book. To make sense of what I have here, you really need to get a copy of Tanimura (2021) itself. This book is more of a companion guide to someone who is more familiar with R than SQL, or coming from SQL and looking to learn R in the most effective way possible, or simply looking for a way to get more out of Tanimura (2021).

With that out of the way, the book evolves in its relation with Tanimura (2021). Chapters 1 and 2 differ from their counterparts in Tanimura (2021). Chapter 1 provides a tutorial on SQL. Chapter 2 provides a tutorial on dbplyr.

In Chapters 3 and 4, I focus on the task of translating SQL to dbplyr equivalents. As a convenience, I include SQL and equivalent R code for each code example provided in Tanimura (2021). In some cases, I tweak the SQL found in Tanimura (2021) to better conform to my tastes or to make the translation exercise easier (e.g., I tend to use CTEs much more than Tanimura (2021) does).

From Chapter 5 onwards, I generally omit SQL and just present the dplyr equivalents of the SQL provided in Tanimura (2021).

SQL and me

Computer Information Systems I was a required subject in my undergraduate degree program.¹ While I do not recall many of the details of the subject, I do recall a lecturer whose English was difficult to understand (I am not sure how many lectures I attended, as I thought at the time that lectures are a very inefficient way to learn … as I still do) and a group project.

The group project required the construction of an SQL database for a hypothetical business. My group comprised three people—myself, a good friend, and someone who was working full-time and studying a degree specializing in information systems. I was a self-supporting full-time student, so I could not afford the software we used to build the database (my memory tells me it was about $60, but I know enough not to trust memories), but my friend had a copy. I recall that some parts of the project were completed over the phone (this was before the internet), perhaps not the best way to write SQL. In the end, I guess we got the project done and submitted.²

In any case, after that subject I continued on with the rest of my degree, but I don’t think I used SQL or any data analysis tools except for Excel for the remainder of my Bachelor of Commerce (the Bachelor of Laws was free of quantitative analysis as far as I recall).

When I graduated I joined a “strategy consulting” firm. My first project involved trying to explain the factors affecting branch profitability for a regional firm offering banking and insurance products. I don’t remember all the details, but there were many data analyses and some involved making requests for SQL queries to be run to supply the data that I used.

My second significant project involved analyses of product and customer profitability. Again the details are hazy, but I recall that analyses required joining multiple tables involving data on mapping of products to cost centres, products to orders, orders to customers, and so on. I would guess that I used Microsoft Access on a laptop that was underpowered (for its time).³

A later project had a reputation that preceded it. Several people had joined this project before deciding to leave the firm. My recollection is that a partner had been hired from another firm with the understanding that he would lead an “insurance benchmarking study”. The assigned junior person on the project would be waiting until the very busy partner had managed to extract complete surveys from various insurance firms. Faced with this prospect, I tried to engineer my way off this project by creating a Visual Basic program that would take completed surveys and, with assistance of a Microsoft Access database, create a Powerpoint slide automatically. My program worked OK, but I ended up needing to resign to get off the project myself.

A later freelance consulting project had me working with the financial planning team of a major Australian bank. Their existing planning process involved sending out numerous Excel spreadsheet templates to various units (branches, project managers) and then carefully stitching back the completed spreadsheets into a single large spreadsheet. This process would take weeks because the returned spreadsheets had been quite mangled. I helped the team reengineer the process to use a Microsoft Access database fed by locked templates. The resulting process took hours instead of weeks.

Making the fateful—and much regretted—decision to leave business for a “career” in business academia, my exposure to SQL has not gone down. For the first several years after entering the PhD program at Stanford, PROC SQL in SAS was a mainstay of my analysis pipeline. In 2011, I decided to migrate to an alternative database, as I found SAS restrictive. After trying a number of alternatives (SQLite and MySQL), I settled on PostgreSQL as the backend of my data analysis workflow.

Initially, I was writing a lot of SQL and using other programs (e.g., Stata and R) to analyse data. When dplyr and dbplyr emerged, I immediately found this to be very intuitive and facile for analysis. Nowadays, I rarely write SQL directly and rely on dbplyr to do that for me. Of course, it’s still very helpful to know SQL well and I suspect I still “think in SQL” even though I type in dbplyr commands. Today, I essentially never use “data files”; all my data go through a PostgreSQL database.

The point of boring any reader with the autobiographical details above is to illustrate that one can go a lot of places in data analysis and not get very far from SQL.

Bachelor of Commerce/Bachelor of Laws at the University of New South Wales.↩︎
I vaguely recall getting a lower grade than my friend for the group assignment, which was inexplicable since there was no indication who did what. I survived.↩︎
I recall that partners had the higher-powered laptops—necessary for writing Lotus Notes and reading Powerpoint slides—while analysts like me had older laptops. It didn’t make sense to me at the time, and still doesn’t.↩︎