library(DBI)
library(tidyverse)
library(dbplyr)
library(knitr)
<- dbConnect(duckdb::duckdb())
db
<-
retail_sales tbl(db, "read_csv_auto('data/us_retail_sales.csv')") |>
compute(name = "retail_sales")
1 Introduction to SQL
1.1 What is Data Analysis?
The focus of Tanimura (2021) is on preparing data sets for analysis by business practitioners, who might work in accounting, marketing, financial analysis, human resources, or product management. Data analysis blends computing, statistics, and background business knowledge. A successful data analyst will be able to write queries to get the right data for a particular question, but will also have skills in data visualization and statistical analysis.
1.2 Why SQL?
SQL is the lanaguage of databases. So one answer to the question “why use SQL?” invokes Sutton’s Law. An apocryphal story has it that, when asked why he robbed banks, famed bank robber Willie Sutton replied “because that’s where the money is.” We use SQL because databases are where the data are.
1.3 SQL: The basics
1.3.1 Setting up your computer
Assuming that you have the ability to install software and a WRDS account, setting up your computer so that you can run the code in this book is straightforward and takes just a few minutes. We list the required steps below and also provide a video demonstrating these steps here.
Download and install R. R is available for all major platforms (Windows, Linux, and MacOS) here.
Download and install RStudio. An open-source version of RStudio is available here.
Install required packages from CRAN. CRAN stands for “Comprehensive R Archive Network” and is the official repository for packages (also known as libraries) made available for R. In this book, we will make use of a number of R packages. These can be installed easily by running the following code in RStudio.1
install.packages(c("DBI", "dbplyr", "dplyr", "duckdb", "flextable",
"janitor", "knitr", "readxl", "stringr", "tidyverse"))
1.3.2 Reading in data
If you copy and paste the following code into your R console and run it, you will load data into a database table retail_sales
, which can be queried using SQL. You may find it easiest to download the template prepared for this chapter.
1.3.3 Table expressions
A table expression contains a FROM
clause and (optionally) WHERE
, GROUP BY
, and HAVING
clauses. The WHERE
, GROUP BY
, and HAVING
clauses specify a pipeline of successive transformations performed on the table specified in the FROM
clause. These transformations produce a virtual table containing the rows that are passed to SELECT
to compute the output rows of the query.
Perhaps the most basic form of the SELECT
statement is SELECT *
, which asks for all columns from the table specified in the FROM
clause.
SELECT *
FROM retail_sales
sales_month | naics_code | kind_of_business | reason_for_null | sales |
---|---|---|---|---|
1992-01-01 | 441 | Motor vehicle and parts dealers | NA | 29811 |
1992-01-01 | 4411 | Automobile dealers | NA | 25800 |
1992-01-01 | 4411, 4412 | Automobile and other motor vehicle dealers | NA | 26788 |
1992-01-01 | 44111 | New car dealers | NA | 24056 |
1992-01-01 | 44112 | Used car dealers | NA | 1744 |
1992-01-01 | 4413 | Automotive parts, acc., and tire stores | NA | 3023 |
1992-01-01 | 442 | Furniture and home furnishings stores | NA | 3846 |
1992-01-01 | 442, 443 | Furniture, home furn, electronics, and appliance stores | NA | 7503 |
1992-01-01 | 4421 | Furniture stores | NA | 2392 |
1992-01-01 | 4422 | Home furnishings stores | NA | 1454 |
But we can also specify columns that we are interested in by name.
SELECT sales_month, kind_of_business, sales
FROM retail_sales
sales_month | kind_of_business | sales |
---|---|---|
1992-01-01 | Motor vehicle and parts dealers | 29811 |
1992-01-01 | Automobile dealers | 25800 |
1992-01-01 | Automobile and other motor vehicle dealers | 26788 |
1992-01-01 | New car dealers | 24056 |
1992-01-01 | Used car dealers | 1744 |
1992-01-01 | Automotive parts, acc., and tire stores | 3023 |
1992-01-01 | Furniture and home furnishings stores | 3846 |
1992-01-01 | Furniture, home furn, electronics, and appliance stores | 7503 |
1992-01-01 | Furniture stores | 2392 |
1992-01-01 | Home furnishings stores | 1454 |
The queries above retrieve all records or rows from a table.2 Often we want to focus on rows meeting certain conditions and the WHERE
clause allows us to do this. Here we retrieve all sales occurring in or after January 2002.3
SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE sales_month >= '2002-01-01'
sales_month | kind_of_business | sales |
---|---|---|
2002-01-01 | Motor vehicle and parts dealers | 60565 |
2002-01-01 | Automobile dealers | 52948 |
2002-01-01 | Automobile and other motor vehicle dealers | 55799 |
2002-01-01 | New car dealers | 48146 |
2002-01-01 | Used car dealers | 4802 |
2002-01-01 | Automotive parts, acc., and tire stores | 4766 |
2002-01-01 | Furniture and home furnishings stores | 7149 |
2002-01-01 | Furniture, home furn, electronics, and appliance stores | 14342 |
2002-01-01 | Furniture stores | 4097 |
2002-01-01 | Home furnishings stores | 3052 |
1.3.4 Functions and operators
In the WHERE
clause above we specified sales_month >= '2002-01-01'
. We naturally think of this as being either TRUE
or FALSE
for a given observation, but there is also a third option that we will discuss in a moment and which allows us to handle cases where we do not know the value for sales_month`.
SQL contains the usual range of comparison operators: <
, >
, <=
, >=
, =
, and !=
, where =
means “equal” and both <>
and !=
mean “not equal”.
SELECT sales_month, kind_of_business, sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
sales_month | kind_of_business | sales |
---|---|---|
1992-01-01 | Used car dealers | 1744 |
1992-02-01 | Used car dealers | 1990 |
1992-03-01 | Used car dealers | 2177 |
1992-04-01 | Used car dealers | 2601 |
1992-05-01 | Used car dealers | 2171 |
1992-06-01 | Used car dealers | 2207 |
1992-07-01 | Used car dealers | 2251 |
1992-08-01 | Used car dealers | 2087 |
1992-09-01 | Used car dealers | 2016 |
1992-10-01 | Used car dealers | 2149 |
Apart from the comparison operators, SQL includes the usual mathematical operators, including +
, -
, *
, and /
.
All SQL implementations include a multitude of functions, including mathematical functions such as abs()
, exp()
, log()
, ln()
, and sqrt()
.
1.3.5 Summarizing data
An important class
SELECT kind_of_business, sum(sales) AS total_sales
FROM retail_sales
GROUP BY kind_of_business
ORDER BY total_sales DESC;
kind_of_business | total_sales |
---|---|
Retail and food services sales, total | 118053993 |
Retail sales and food services excl gasoline stations | 107701613 |
Retail sales, total | 105580364 |
Retail sales and food services excl motor vehicle and parts | 93509935 |
Retail sales and food services excl motor vehicle and parts and gasoline stations | 83157555 |
Retail sales, total (excl. motor vehicle and parts dealers) | 81036306 |
GAFO(1) | 29041144 |
Motor vehicle and parts dealers | 24544058 |
Automobile and other motor vehicle dealers | 22462364 |
Automobile dealers | 20963805 |
SELECT kind_of_business, sum(sales) AS total_sales
FROM retail_sales
GROUP BY 1
ORDER BY 2 DESC;
kind_of_business | total_sales |
---|---|
Retail and food services sales, total | 118053993 |
Retail sales and food services excl gasoline stations | 107701613 |
Retail sales, total | 105580364 |
Retail sales and food services excl motor vehicle and parts | 93509935 |
Retail sales and food services excl motor vehicle and parts and gasoline stations | 83157555 |
Retail sales, total (excl. motor vehicle and parts dealers) | 81036306 |
GAFO(1) | 29041144 |
Motor vehicle and parts dealers | 24544058 |
Automobile and other motor vehicle dealers | 22462364 |
Automobile dealers | 20963805 |
SELECT year(sales_month) AS year,
sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
ORDER BY 2 DESC;
year | total_sales |
---|---|
2020 | 124040 |
2019 | 119067 |
2018 | 114390 |
2017 | 112174 |
2016 | 105905 |
2015 | 98330 |
2014 | 90424 |
2013 | 83916 |
2012 | 80228 |
2007 | 79696 |
SELECT year(sales_month) AS year,
sum(sales) AS total_sales
FROM retail_sales
WHERE kind_of_business = 'Used car dealers'
GROUP BY 1
HAVING total_sales >= 100000
ORDER BY 2 DESC;
year | total_sales |
---|---|
2020 | 124040 |
2019 | 119067 |
2018 | 114390 |
2017 | 112174 |
2016 | 105905 |
SELECT kind_of_business, reason_for_null,
count(*)
FROM retail_sales
WHERE sales IS NULL
GROUP BY 1, 2
kind_of_business | reason_for_null | count_star() |
---|---|---|
All other home furnishings stores | Not Available | 108 |
Electronics stores | Not Available | 60 |
Paint and wallpaper stores | Not Available | 108 |
Supermarkets and other grocery (except convenience) stores | Not Available | 108 |
Other clothing stores | Not Available | 108 |
Floor covering stores | Supressed | 46 |
Drinking places | Supressed | 34 |
Jewelry stores | Supressed | 3 |
Full service restaurants | Supressed | 11 |
Home furnishings stores | Supressed | 3 |
1.3.6 What have we not covered?
- Joins
- Subqueries and common table expressions (CTEs)
You can copy and paste the code into the “Console” in RStudio.↩︎
Though only the first 10 rows are displayed in this book.↩︎
Note that we specify the date as a string literal and in ISO 8601 format, in which the year, month, and day of the date are presented in that order, including leading zeros for the month and day and with the components separated by dashes. This format will become familiar with you if you continue to do any work in data science.↩︎