1  Introduction to SQL

1.1 What is Data Analysis?

The focus of Tanimura (2021) is on preparing data sets for analysis by business practitioners, who might work in accounting, marketing, financial analysis, human resources, or product management. Data analysis blends computing, statistics, and background business knowledge. A successful data analyst will be able to write queries to get the right data for a particular question, but will also have skills in data visualization and statistical analysis.

1.2 Why SQL?

SQL is the lanaguage of databases. So one answer to the question “why use SQL?” invokes Sutton’s Law. An apocryphal story has it that, when asked why he robbed banks, famed bank robber Willie Sutton replied “because that’s where the money is.” We use SQL because databases are where the data are.

1.3 SQL: The basics

1.3.1 Setting up your computer

Assuming that you have the ability to install software and a WRDS account, setting up your computer so that you can run the code in this book is straightforward and takes just a few minutes. We list the required steps below and also provide a video demonstrating these steps here.

  1. Download and install R. R is available for all major platforms (Windows, Linux, and MacOS) here.

  2. Download and install RStudio. An open-source version of RStudio is available here.

  3. Install required packages from CRAN. CRAN stands for “Comprehensive R Archive Network” and is the official repository for packages (also known as libraries) made available for R. In this book, we will make use of a number of R packages. These can be installed easily by running the following code in RStudio.1

    install.packages(c("DBI", "dbplyr", "dplyr", "duckdb", "flextable",
        "janitor", "knitr", "readxl", "stringr", "tidyverse"))

1.3.2 Reading in data

If you copy and paste the following code into your R console and run it, you will load data into a database table retail_sales, which can be queried using SQL. You may find it easiest to download the template prepared for this chapter.

library(DBI)
library(tidyverse)
library(dbplyr)
library(knitr)

db <- dbConnect(duckdb::duckdb())

retail_sales <-
  tbl(db, "read_csv_auto('data/us_retail_sales.csv')") |>
  compute(name = "retail_sales")

1.3.3 Table expressions

A table expression contains a FROM clause and (optionally) WHERE, GROUP BY, and HAVING clauses. The WHERE, GROUP BY, and HAVING clauses specify a pipeline of successive transformations performed on the table specified in the FROM clause. These transformations produce a virtual table containing the rows that are passed to SELECT to compute the output rows of the query.

Perhaps the most basic form of the SELECT statement is SELECT *, which asks for all columns from the table specified in the FROM clause.

SELECT *
FROM retail_sales
Displaying records 1 - 10
sales_month naics_code kind_of_business reason_for_null sales
1992-01-01 441 Motor vehicle and parts dealers NA 29811
1992-01-01 4411 Automobile dealers NA 25800
1992-01-01 4411, 4412 Automobile and other motor vehicle dealers NA 26788
1992-01-01 44111 New car dealers NA 24056
1992-01-01 44112 Used car dealers NA 1744
1992-01-01 4413 Automotive parts, acc., and tire stores NA 3023
1992-01-01 442 Furniture and home furnishings stores NA 3846
1992-01-01 442, 443 Furniture, home furn, electronics, and appliance stores NA 7503
1992-01-01 4421 Furniture stores NA 2392
1992-01-01 4422 Home furnishings stores NA 1454

But we can also specify columns that we are interested in by name.

SELECT sales_month, kind_of_business, sales
FROM retail_sales
Displaying records 1 - 10
sales_month kind_of_business sales
1992-01-01 Motor vehicle and parts dealers 29811
1992-01-01 Automobile dealers 25800
1992-01-01 Automobile and other motor vehicle dealers 26788
1992-01-01 New car dealers 24056
1992-01-01 Used car dealers 1744
1992-01-01 Automotive parts, acc., and tire stores 3023
1992-01-01 Furniture and home furnishings stores 3846
1992-01-01 Furniture, home furn, electronics, and appliance stores 7503
1992-01-01 Furniture stores 2392
1992-01-01 Home furnishings stores 1454

The queries above retrieve all records or rows from a table.2 Often we want to focus on rows meeting certain conditions and the WHERE clause allows us to do this. Here we retrieve all sales occurring in or after January 2002.3

SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE sales_month >= '2002-01-01'
Displaying records 1 - 10
sales_month kind_of_business sales
2002-01-01 Motor vehicle and parts dealers 60565
2002-01-01 Automobile dealers 52948
2002-01-01 Automobile and other motor vehicle dealers 55799
2002-01-01 New car dealers 48146
2002-01-01 Used car dealers 4802
2002-01-01 Automotive parts, acc., and tire stores 4766
2002-01-01 Furniture and home furnishings stores 7149
2002-01-01 Furniture, home furn, electronics, and appliance stores 14342
2002-01-01 Furniture stores 4097
2002-01-01 Home furnishings stores 3052

1.3.4 Functions and operators

In the WHERE clause above we specified sales_month >= '2002-01-01'. We naturally think of this as being either TRUE or FALSE for a given observation, but there is also a third option that we will discuss in a moment and which allows us to handle cases where we do not know the value for sales_month`.

SQL contains the usual range of comparison operators: <, >, <=, >=, =, and !=, where = means “equal” and both <> and != mean “not equal”.

SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
Displaying records 1 - 10
sales_month kind_of_business sales
1992-01-01 Used car dealers 1744
1992-02-01 Used car dealers 1990
1992-03-01 Used car dealers 2177
1992-04-01 Used car dealers 2601
1992-05-01 Used car dealers 2171
1992-06-01 Used car dealers 2207
1992-07-01 Used car dealers 2251
1992-08-01 Used car dealers 2087
1992-09-01 Used car dealers 2016
1992-10-01 Used car dealers 2149

Apart from the comparison operators, SQL includes the usual mathematical operators, including +, -, *, and /.

All SQL implementations include a multitude of functions, including mathematical functions such as abs(), exp(), log(), ln(), and sqrt().

1.3.5 Summarizing data

An important class

SELECT kind_of_business, sum(sales) AS total_sales
FROM retail_sales
GROUP BY kind_of_business
ORDER BY total_sales DESC;
Displaying records 1 - 10
kind_of_business total_sales
Retail and food services sales, total 118053993
Retail sales and food services excl gasoline stations 107701613
Retail sales, total 105580364
Retail sales and food services excl motor vehicle and parts 93509935
Retail sales and food services excl motor vehicle and parts and gasoline stations 83157555
Retail sales, total (excl. motor vehicle and parts dealers) 81036306
GAFO(1) 29041144
Motor vehicle and parts dealers 24544058
Automobile and other motor vehicle dealers 22462364
Automobile dealers 20963805
SELECT kind_of_business, sum(sales) AS total_sales
FROM retail_sales
GROUP BY 1
ORDER BY 2 DESC;
Displaying records 1 - 10
kind_of_business total_sales
Retail and food services sales, total 118053993
Retail sales and food services excl gasoline stations 107701613
Retail sales, total 105580364
Retail sales and food services excl motor vehicle and parts 93509935
Retail sales and food services excl motor vehicle and parts and gasoline stations 83157555
Retail sales, total (excl. motor vehicle and parts dealers) 81036306
GAFO(1) 29041144
Motor vehicle and parts dealers 24544058
Automobile and other motor vehicle dealers 22462364
Automobile dealers 20963805
SELECT year(sales_month) AS year,
  sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
ORDER BY 2 DESC;
Displaying records 1 - 10
year total_sales
2020 124040
2019 119067
2018 114390
2017 112174
2016 105905
2015 98330
2014 90424
2013 83916
2012 80228
2007 79696
SELECT year(sales_month) AS year,
  sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
HAVING total_sales >= 100000
ORDER BY 2 DESC;
5 records
year total_sales
2020 124040
2019 119067
2018 114390
2017 112174
2016 105905
SELECT kind_of_business, reason_for_null,
  count(*)
FROM retail_sales
WHERE sales IS NULL
GROUP BY 1, 2
Displaying records 1 - 10
kind_of_business reason_for_null count_star()
All other home furnishings stores Not Available 108
Electronics stores Not Available 60
Paint and wallpaper stores Not Available 108
Supermarkets and other grocery (except convenience) stores Not Available 108
Other clothing stores Not Available 108
Floor covering stores Supressed 46
Drinking places Supressed 34
Jewelry stores Supressed 3
Full service restaurants Supressed 11
Home furnishings stores Supressed 3

1.3.6 What have we not covered?

  • Joins
  • Subqueries and common table expressions (CTEs)

  1. You can copy and paste the code into the “Console” in RStudio.↩︎

  2. Though only the first 10 rows are displayed in this book.↩︎

  3. Note that we specify the date as a string literal and in ISO 8601 format, in which the year, month, and day of the date are presented in that order, including leading zeros for the month and day and with the components separated by dashes. This format will become familiar with you if you continue to do any work in data science.↩︎