9 Importing data
Thus far, our data have come from either the farr package or WRDS. In most cases, the WRDS data are nicely rectangular, meaning that we have \(N\) observations of \(K\) variables, and they already have assigned data types, etc.1 In practice, researchers often obtain data from other sources, especially the internet, and such data are often rather messy. This chapter provides an introduction to importing data from such sources.
While this chapter is fairly task-oriented, we think that it serves to reinforce two deeper ideas.
- Reproducibility. This is a big idea—certainly bigger than we can cover in a chapter. But what we do here will emphasize approaches that are completely reproducible, whether by people you don’t know, by co-authors, or even by yourself at some subsequent date.
- Tidy data. A lot of the data we see out there—including the data we work with in this chapter—are essentially tabular and relational in nature. Once you recognize this, many data problems become much more straightforward.
Even dedicated users of SAS or Stata may find this chapter useful, as R handles tricky data-munging tasks quite easily and the end results are easily exported in formats that other software can read (e.g., write_dta() from the haven package creates Stata data files).
We also introduce regular expressions. Briefly speaking, a regular expression is a sequence of characters that defines a pattern, which can be used in a kind of search (or search-and-replace) on steroids. Regular expressions are very useful when working with data in many contexts.2 The chapters on strings and regular expressions in R for Data Science provide excellent introductions that complement material in this chapter. We recommend that you refer to those chapters as you work through this chapter.
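As a small foretaste (an illustration of ours, using stringr, which is loaded with the Tidyverse below), str_detect() reports whether each string matches a pattern:

str_detect(c("0100-0199", "Agric"), "^[0-9]{4}-[0-9]{4}$")
# TRUE for the SIC-code range, FALSE for the word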
The code in this chapter uses the packages listed below. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available on GitHub.
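At a minimum, that list includes the tidyverse meta-package (which loads the stringr, readr, dplyr, tidyr, and lubridate functions used below) and pdftools:

library(tidyverse)
library(pdftools)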
9.1 Reading (seemingly) non-tabular data
9.1.1 Fama-French industry definitions
The first task that we will explore is collecting data on Fama-French industry definitions, which are widely used in finance and accounting research to map SIC codes, of which there are hundreds, into a smaller number of industry groups for analysis.3 For example, we might want to group firms into 48, 12, or even 5 industry groups.
The basic data on Fama-French industry definitions are available from Ken French’s website at Tuck School of Business.
There are multiple classifications, starting with 5 industries, then 10, 12, 17, 30, 38, 48, and finally 49 industries. The data are supplied as zipped text files. For example, the 48-industry data can be found on this page, by clicking the link displayed as Download industry definitions.
If we download that linked file and unzip it, we can open it in a text editor or even Excel. The first ten lines of the file are as follows:
 1 Agric          Agriculture
          0100-0199 Agricultural production - crops
          0200-0299 Agricultural production - livestock
          0700-0799 Agricultural services
          0910-0919 Commercial fishing
          2048-2048 Prepared feeds for animals

 2 Food           Food Products
          2000-2009 Food and kindred products
          2010-2019 Meat products
Looking at the second row, we interpret this as saying that firms with SIC codes between 0100 and 0199 are assigned to industry group 1 (let’s call this field ff_ind), which has a label or short description (ff_ind_short_desc) of Agric and a full industry description (ff_ind_desc) of Agriculture.
One approach to this task might be to write a function like the following (this one is woefully incomplete, as it only covers the first two lines of data above):
get_ff_ind_48 <- function(sic) {
  case_when(sic >= 100 & sic <= 199 ~ 1,
            sic >= 200 & sic <= 299 ~ 1)
}
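Applied to an SIC code, this stub returns the industry group where it has a match and NA otherwise (a quick check of ours; case_when() comes from dplyr, loaded with the Tidyverse):

get_ff_ind_48(150)   # 1 (within 0100-0199)
get_ff_ind_48(2048)  # NA: not yet covered by this incomplete stub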
While tedious and time-consuming, this is perfectly feasible. In fact, this is essentially the approach taken in code you can find on the internet (e.g., SAS code here or here or Stata code here).
However, doing this would only solve the problem for the 48-industry grouping. And it certainly could not be described as particularly robust to, for example, changes in Fama-French industry definitions.4
Part of the solution that we use below recognizes that the data are really tabular in nature. A relational database purist would likely look at the data above as representing two tables. One table relates Fama-French industries to short and full industry descriptions. The first two rows in this table would look something like this:
| ff_ind | ff_ind_short_desc | ff_ind_desc   |
|--------|-------------------|---------------|
| 1      | Agric             | Agriculture   |
| 2      | Food              | Food Products |
The second table would relate Fama-French industries to ranges of SIC codes, and the first few rows of this table would look something like this:
| ff_ind | sic_min | sic_max | sic_desc                             |
|--------|---------|---------|--------------------------------------|
| 1      | 0100    | 0199    | Agricultural production - crops      |
| 1      | 0200    | 0299    | Agricultural production - livestock  |
| 1      | 0700    | 0799    | Agricultural services                |
| 1      | 0910    | 0919    | Commercial fishing                   |
To keep things simple for this exercise, we will disappoint the purists and make a single table with all six fields: ff_ind, ff_ind_short_desc, ff_ind_desc, sic_min, sic_max, and sic_desc.5
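Nothing is lost by this choice: a purist can always recover the two tables from the combined one. Here is a sketch, using ff48 as a hypothetical name for the table we construct below:

# ff48 is a hypothetical name for the combined six-column table
# (e.g., ff48 <- get_ff_ind(48), using the function we write below)
industries <- distinct(ff48, ff_ind, ff_ind_short_desc, ff_ind_desc)
sic_ranges <- select(ff48, ff_ind, sic_min, sic_max, sic_desc)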
So how do we make this table? One approach to this task might be to download the linked file, unzip it, open it up in some program (e.g., Excel), and then massage the data manually into the desired form. But this would have the same issues as the approach above.
We can do better by using R and tools from the Tidyverse. The first step is to download the data. While one can easily do this manually, we want to automate this process as much as possible. And we probably don’t have any reason to keep the .zip file once we have used it. R provides two functions that we can use here: download.file() and tempfile(). The tempfile() function creates a random file name in a location that will be cleaned up by our system automatically once we’re no longer using it.
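For the 48-industry file, these two steps look as follows (the same URL appears again, in generalized form, later in this section):

t <- tempfile(fileext = ".zip")
url <- str_c("http://mba.tuck.dartmouth.edu",
             "/pages/faculty/ken.french/ftp/Siccodes48.zip")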
Here t is something like file68c866a5c8.zip, which is random except for the .zip extension, something we need for our code to recognize the supplied file as a zipped file. The download.file() function downloads the file at url and saves it as t.
download.file(url, t)
If you look at Ken French’s website, you will see that all the industry-definition files have URLs that follow a certain pattern, with just the number of industry groups (in this case, 48) changing. Recognizing this, we can rewrite the code above as follows:
ind <- 48
t <- tempfile(fileext = ".zip")
url <- str_c("http://mba.tuck.dartmouth.edu",
             "/pages/faculty/ken.french/ftp/Siccodes", ind, ".zip")
download.file(url, t)
Here the str_c() function from the stringr package pastes all the pieces back together.6
From visual inspection, we can see that our text file is a fixed-width format text file. So to read the data, we will use the function read_fwf() from the readr package.
There are two required arguments to read_fwf(): file and col_positions. From the help for read_fwf() (type ? readr::read_fwf in the R console to see this), we see that col_positions refers to “Column positions, as created by fwf_empty(), fwf_widths() or fwf_positions(). If the width of the last column is variable (i.e., we have a ragged fixed-width format file), we can supply the last end position as NA.” We can also see that fwf_widths() is itself a function: fwf_widths(widths, col_names = NULL).
Given that we have a very simple file, we can identify the column widths pretty easily. Manually adding a “ruler” of sorts at the top of the file, we can see below that the first column covers columns 1-3, the second column covers 4-10, and the third column starts at 11.
123456789-123456789-123456789-123456789-123456789-...
 1 Agric          Agriculture
          0100-0199 Agricultural production - crops
          0200-0299 Agricultural production - livestock
          0700-0799 Agricultural services
So we have widths of 3, 7, and we can use NA to have R figure out the width of the last column. The first two columns should be named ff_ind and ff_ind_short_desc, but the third column is problematic, as some rows provide information on ff_ind_desc and some rows provide data that we will eventually put into sic_min and sic_max; so let’s call that column temp for now. Finally, as the first column contains integer values, while the other two are text columns, we can supply a string to the col_types argument of read_fwf() to ensure that the columns are read as those types.
t |>
  read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                      c("ff_ind", "ff_ind_short_desc",
                                        "temp")),
           col_types = "icc")
# A tibble: 693 × 3
ff_ind ff_ind_short_desc temp
<int> <chr> <chr>
1 1 Agric Agriculture
2 NA <NA> 0100-0199 Agricultural production - crops
3 NA <NA> 0200-0299 Agricultural production - livestock
4 NA <NA> 0700-0799 Agricultural services
5 NA <NA> 0910-0919 Commercial fishing
6 NA <NA> 2048-2048 Prepared feeds for animals
7 NA <NA> <NA>
8 2 Food Food Products
9 NA <NA> 2000-2009 Food and kindred products
10 NA <NA> 2010-2019 Meat products
# ℹ 683 more rows
The next step is to handle the problematic temp column. We can see that if ff_ind is NA, then temp contains (if it contains anything) a range of SIC codes, but if ff_ind is not NA, then temp contains the value we want to store in ff_ind_desc. We can use mutate() along with if_else() statements to extract the data into the appropriate columns (we’ll call the range of SIC codes sic_range for now) and, having done so, we can drop the column temp.7
t |>
  read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                      c("ff_ind", "ff_ind_short_desc",
                                        "temp")),
           col_types = "icc") |>
  mutate(ff_ind_desc = if_else(!is.na(ff_ind), temp, NA),
         sic_range = if_else(is.na(ff_ind), temp, NA)) |>
  select(-temp)
# A tibble: 693 × 4
ff_ind ff_ind_short_desc ff_ind_desc sic_range
<int> <chr> <chr> <chr>
1 1 Agric Agriculture <NA>
2 NA <NA> <NA> 0100-0199 Agricultural productio…
3 NA <NA> <NA> 0200-0299 Agricultural productio…
4 NA <NA> <NA> 0700-0799 Agricultural services
5 NA <NA> <NA> 0910-0919 Commercial fishing
6 NA <NA> <NA> 2048-2048 Prepared feeds for ani…
7 NA <NA> <NA> <NA>
8 2 Food Food Products <NA>
9 NA <NA> <NA> 2000-2009 Food and kindred produ…
10 NA <NA> <NA> 2010-2019 Meat products
# ℹ 683 more rows
We are getting closer. Now, we see that the issue is that our sic_range column does not line up with the other three columns. To solve this, we can use fill() from the tidyr package. The fill() function accepts arguments for the columns to “fill”. In this case, we can fill missing values with the previous non-missing value.
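To see fill() in isolation, consider this toy example of ours (not part of the pipeline): each NA is replaced by the last non-missing value above it.

tibble(x = c(1, NA, NA, 2, NA)) |>
  fill(x)  # default is .direction = "down"; x becomes 1, 1, 1, 2, 2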
t |>
  read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                      c("ff_ind", "ff_ind_short_desc",
                                        "temp")),
           col_types = "icc") |>
  mutate(ff_ind_desc = if_else(!is.na(ff_ind), temp, NA),
         sic_range = if_else(is.na(ff_ind), temp, NA)) |>
  select(-temp) |>
  fill(ff_ind, ff_ind_short_desc, ff_ind_desc)
# A tibble: 693 × 4
ff_ind ff_ind_short_desc ff_ind_desc sic_range
<int> <chr> <chr> <chr>
1 1 Agric Agriculture <NA>
2 1 Agric Agriculture 0100-0199 Agricultural productio…
3 1 Agric Agriculture 0200-0299 Agricultural productio…
4 1 Agric Agriculture 0700-0799 Agricultural services
5 1 Agric Agriculture 0910-0919 Commercial fishing
6 1 Agric Agriculture 2048-2048 Prepared feeds for ani…
7 1 Agric Agriculture <NA>
8 2 Food Food Products <NA>
9 2 Food Food Products 2000-2009 Food and kindred produ…
10 2 Food Food Products 2010-2019 Meat products
# ℹ 683 more rows
At this point, we have no further use for the rows where sic_range is NA, so we can filter them out.
t |>
  read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                      c("ff_ind", "ff_ind_short_desc",
                                        "temp")),
           col_types = "icc") |>
  mutate(ff_ind_desc = if_else(!is.na(ff_ind), temp, NA),
         sic_range = if_else(is.na(ff_ind), temp, NA)) |>
  select(-temp) |>
  fill(ff_ind, ff_ind_short_desc, ff_ind_desc) |>
  filter(!is.na(sic_range))
# A tibble: 598 × 4
ff_ind ff_ind_short_desc ff_ind_desc sic_range
<int> <chr> <chr> <chr>
1 1 Agric Agriculture 0100-0199 Agricultural productio…
2 1 Agric Agriculture 0200-0299 Agricultural productio…
3 1 Agric Agriculture 0700-0799 Agricultural services
4 1 Agric Agriculture 0910-0919 Commercial fishing
5 1 Agric Agriculture 2048-2048 Prepared feeds for ani…
6 2 Food Food Products 2000-2009 Food and kindred produ…
7 2 Food Food Products 2010-2019 Meat products
8 2 Food Food Products 2020-2029 Dairy products
9 2 Food Food Products 2030-2039 Canned & preserved fru…
10 2 Food Food Products 2040-2046 Flour and other grain …
# ℹ 588 more rows
The last issue to address is the column sic_range. We want to split that into three target columns (sic_min, sic_max, and sic_desc). To do this, we can use extract() from the tidyr package. The two required arguments for extract() are col, the column from which data are being extracted, and into, the columns that will get the data.
The regex argument to extract() allows us to specify a regular expression that is used to split the data. We will discuss regular expressions more carefully later in the chapter, so it is fine if you don’t fully follow what’s going on here. For now, we just need to know that the portions contained in each pair of parentheses (( and )) are what is captured for each field. The first pair of parentheses surrounds [0-9]+, which we can read as “one or more digit characters”. So, for the case of 0100-0199 Agricultural production - crops, this would match 0100. This is followed by a -, a literal dash that matches the - after 0100. The next pair of parentheses is again “one or more digit characters” and captures 0199. This is followed by \\s*. The \\s represents “any space” and the * means “zero or more”, so \\s* means “zero or more spaces”, which are matched, but not captured. Finally, we have (.*)$. The . represents “any character”, so .* means “zero or more of any character”, which are captured as the third variable. The $ just means “the end of the string”; this mirrors the ^ at the start of the regular expression, which means “the start of the string”. Together, ^ and $ ensure that we are considering the whole string in our analysis.8
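If you would like to test the regular expression in isolation before handing it to extract(), str_match() from stringr works well (an aside of ours, not part of the pipeline). The first element of the result is the full match; the remaining elements are the three captured groups.

str_match("0100-0199 Agricultural production - crops",
          "^([0-9]+)-([0-9]+)\\s*(.*)$")
# Captures: "0100", "0199", "Agricultural production - crops"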
The argument convert = TRUE asks extract() to convert the data types of the extracted fields to types that seem appropriate (e.g., integers for sic_min and sic_max).
t |>
  read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                      c("ff_ind", "ff_ind_short_desc",
                                        "temp")),
           col_types = "icc") |>
  mutate(ff_ind_desc = if_else(!is.na(ff_ind), temp, NA),
         sic_range = if_else(is.na(ff_ind), temp, NA)) |>
  select(-temp) |>
  fill(ff_ind, ff_ind_short_desc, ff_ind_desc) |>
  filter(!is.na(sic_range)) |>
  extract(sic_range,
          into = c("sic_min", "sic_max", "sic_desc"),
          regex = "^([0-9]+)-([0-9]+)\\s*(.*)$",
          convert = TRUE)
# A tibble: 598 × 6
ff_ind ff_ind_short_desc ff_ind_desc sic_min sic_max sic_desc
<int> <chr> <chr> <int> <int> <chr>
1 1 Agric Agriculture 100 199 Agricultural pro…
2 1 Agric Agriculture 200 299 Agricultural pro…
3 1 Agric Agriculture 700 799 Agricultural ser…
4 1 Agric Agriculture 910 919 Commercial fishi…
5 1 Agric Agriculture 2048 2048 Prepared feeds f…
6 2 Food Food Products 2000 2009 Food and kindred…
7 2 Food Food Products 2010 2019 Meat products
8 2 Food Food Products 2020 2029 Dairy products
9 2 Food Food Products 2030 2039 Canned & preserv…
10 2 Food Food Products 2040 2046 Flour and other …
# ℹ 588 more rows
Lastly, we can put all of the above into a function. But, as we do so, let’s take a gamble that the same code will work for any of the Fama-French industry classifications if we only change the URL. To do this, we use an argument ind that reflects the industry grouping of interest and insert its value into the URL. Here we use the str_c() function from the stringr package to create the URL using the value supplied as ind.
get_ff_ind <- function(ind) {
  t <- tempfile(fileext = ".zip")
  url <- str_c("https://mba.tuck.dartmouth.edu/pages/",
               "faculty/ken.french/ftp/Siccodes", ind, ".zip")
  download.file(url, t)

  t |>
    read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                        c("ff_ind", "ff_ind_short_desc",
                                          "temp")),
             col_types = "icc") |>
    mutate(ff_ind_desc = if_else(!is.na(ff_ind), temp, NA),
           sic_range = if_else(is.na(ff_ind), temp, NA)) |>
    select(-temp) |>
    fill(ff_ind, ff_ind_short_desc, ff_ind_desc) |>
    filter(!is.na(sic_range)) |>
    extract(sic_range,
            into = c("sic_min", "sic_max", "sic_desc"),
            regex = "^([0-9]+)-([0-9]+)\\s*(.*)$",
            convert = TRUE)
}
We can test it out for the 48-industry classification above (just to make sure we didn’t mess up what we already had working):
get_ff_ind(48)
# A tibble: 598 × 6
ff_ind ff_ind_short_desc ff_ind_desc sic_min sic_max sic_desc
<int> <chr> <chr> <int> <int> <chr>
1 1 Agric Agriculture 100 199 Agricultural pro…
2 1 Agric Agriculture 200 299 Agricultural pro…
3 1 Agric Agriculture 700 799 Agricultural ser…
4 1 Agric Agriculture 910 919 Commercial fishi…
5 1 Agric Agriculture 2048 2048 Prepared feeds f…
6 2 Food Food Products 2000 2009 Food and kindred…
7 2 Food Food Products 2010 2019 Meat products
8 2 Food Food Products 2020 2029 Dairy products
9 2 Food Food Products 2030 2039 Canned & preserv…
10 2 Food Food Products 2040 2046 Flour and other …
# ℹ 588 more rows
And now let’s try it for 5- and 12-industry classifications:
get_ff_ind(5)
# A tibble: 58 × 6
ff_ind ff_ind_short_desc ff_ind_desc sic_min sic_max sic_desc
<int> <chr> <chr> <int> <int> <chr>
1 1 Cnsmr Consumer Durables, No… 100 999 ""
2 1 Cnsmr Consumer Durables, No… 2000 2399 ""
3 1 Cnsmr Consumer Durables, No… 2700 2749 ""
4 1 Cnsmr Consumer Durables, No… 2770 2799 ""
5 1 Cnsmr Consumer Durables, No… 3100 3199 ""
6 1 Cnsmr Consumer Durables, No… 3940 3989 ""
7 1 Cnsmr Consumer Durables, No… 2500 2519 ""
8 1 Cnsmr Consumer Durables, No… 2590 2599 ""
9 1 Cnsmr Consumer Durables, No… 3630 3659 ""
10 1 Cnsmr Consumer Durables, No… 3710 3711 ""
# ℹ 48 more rows
get_ff_ind(12)
# A tibble: 49 × 6
ff_ind ff_ind_short_desc ff_ind_desc sic_min sic_max sic_desc
<int> <chr> <chr> <int> <int> <lgl>
1 1 NoDur Consumer Nondurables … 100 999 NA
2 1 NoDur Consumer Nondurables … 2000 2399 NA
3 1 NoDur Consumer Nondurables … 2700 2749 NA
4 1 NoDur Consumer Nondurables … 2770 2799 NA
5 1 NoDur Consumer Nondurables … 3100 3199 NA
6 1 NoDur Consumer Nondurables … 3940 3989 NA
7 2 Durbl Consumer Durables -- … 2500 2519 NA
8 2 Durbl Consumer Durables -- … 2590 2599 NA
9 2 Durbl Consumer Durables -- … 3630 3659 NA
10 2 Durbl Consumer Durables -- … 3710 3711 NA
# ℹ 39 more rows
9.1.2 Exercises
1. Follow the steps below to produce a data set where each column has the appropriate data type.

   - Go to the MSCI GICS website.
   - Get the link to the file under historical GICS structures that is “Effective until Sep 28, 2018”.
   - Use this link and the tempfile() and download.file() functions to download the linked file.
   - Use read_excel() from the readxl package to read the downloaded file.
   - Identify any variables that need to be handled like temp in the Fama-French data set above and process them accordingly.
   - Use the fill() function from the tidyr package to fill in rows as necessary.
   - Make sure that each column has the appropriate data type.

   Hints:

   - You may find it helpful to look at the Excel file so you can see how the data are structured.
   - The read_excel() function has skip and col_names arguments that you will probably want to use.
   - Your final column names should be sector, sector_desc, ind_group, ind_group_desc, industry, industry_desc, sub_ind, sub_ind_desc, and sub_ind_details.
   - The following code snippets might be useful:

   filter(!is.na(sub_ind_details))
   fill(sector:sub_ind_desc, .direction = "down")
   mutate(across(where(is.numeric), as.integer))
9.2 Extracting data from messy formats
Sometimes data are provided in formats even messier than fixed-width text files. For example, we may want to extract data (perhaps tabular data) from a PDF. While the data may appear tabular to our eyes, the reality is that PDFs retain very little information about the structure of data, as the PDF format is a lightweight way of making a computer or printer present text for consumption by humans, not statistical software packages.
In this section, we will examine a case study in extracting tabular data from a PDF. Our initial focus will be on extracting data about money “left on the table” by firms in initial public offerings (IPOs). These data are provided by Jay Ritter at the University of Florida in a PDF found here.
In this case, we will use pdf_text() from the pdftools package. From the help for pdf_text(), we learn that “the pdf_text function renders all text boxes on a text canvas and returns a character vector of equal length to the number of pages in the PDF file.” We also learn that the function accepts an argument for the “pdf file path or raw vector with pdf data”. Since the URL below provides an absolute file path for the PDF, we can simply call pdf_text(url) to download the PDF from the URL and convert it to a text representation in R.
url <- str_c("https://site.warrington.ufl.edu/ritter/files/",
             "money-left-on-the-table.pdf")
Looking at the PDF, we see that the first page is text and the table starts on page 2. We can omit the first page returned by pdf_text(url) below by appending [-1].
output <- pdf_text(url)[-1]
While the output from pdf_text(url) or even pdf_text(url)[-1] uses up too much space to be shown here, if you run the code immediately above and inspect each of these alternatives, you can see that the latter returns data that appear quite regular. In this case, it makes sense to run the text through read_lines() from readr. We will do this and store the result in temp so that we can examine it here more closely.9
temp <-
  output |>
  read_lines()
Let’s look at a few selected lines. Lines 1:5 represent the top of the table, with the column names unfortunately split into two rows.10
print_width <- 70
str_sub(temp[1:5], 1, print_width)
[1] " Dollar amount IPO Offer F"
[2] "left on the table Company date Price m"
[3] ""
[4] "$5,075,000,000 Visa 080319 $44.00 $"
[5] "$3,937,028,063 Airbnb 201210 $68.00 $"
Rather than trying to deal with the column-names-spread-over-two-rows issue with code, we can just manually specify the column names and skip the first two rows when we import the data. For now, we merely create variables to reflect those choices; we will use these variables later.
col_names <- c("amount_left_on_table", "company", "ipo_date", "offer_price",
               "first_close_price", "shares_offered", "ticker")

skip_rows <- 2
While the top portion of the table looks like it might work with read_fwf(), once we look at lines 56 through 62 at the bottom of the first page and top of the second page (pp. 2–3 of the original PDF), we can see that they are not aligned.
str_sub(temp[56:62], 1, print_width)
[1] " $541,875,000 Charter Communications* 991109 $19.00 $"
[2] " $541,328,968 Blackstone Group LP 070621 $31.00 $"
[3] ""
[4] " 2"
[5] "$540,929,382 XP Inc 191211 $27.00 $34.46 "
[6] "$538,630,000 Gigamedia 000218 $27.00 $88.00 "
[7] "$531,999,999 CureVac B.V. 200814 $16.00 $55.90 "
The same is true of lines 114 through 120 at the bottom of the second page and top of the third page (pp. 3–4 of the original PDF).
str_sub(temp[114:120], 1, print_width)
[1] ""
[2] "$360,000,000 Niku 000229 $24.00 $69.00 "
[3] ""
[4] " 3"
[5] "$355,384,125 Lucent Technologies* 960403 $27.00 $30.625 "
[6] "$354,046,000 Fitbit 150618 $20.00 $29.68 "
[7] "$352,400,000 Tradeweb Markets 190404 $27.00 $35.81 "
Finally, rows 461 through 463 contain a footer that is not part of the table.
str_sub(temp[458:464], 1, print_width)
[1] " $128,724,000 HCA Holdings 110310 $30.00 $"
[2] " $128,562,500 Netscape Communications* 950808 $28.00 $"
[3] ""
[4] "Source: Prof. Jay R. Ritter, Warrington College of Business, Universit"
[5] "jay.ritter@warrington.ufl.edu). Data come from IPOScoop.com, Bloomberg"
[6] "WSJ.com, and the S.E.C.’s Edgar electronic database of 424B forms (fin"
[7] ""
We can flag that by looking for Source: at the start of a row. To match at the start of a string, we need to anchor the regular expression using ^ to match the start.11 We want to read only up to one row before 461, and we will be skipping skip_rows at the start. So we can calculate the maximum number of rows we want to read as follows:
max_rows <- str_which(temp, "^Source:") - 1 - skip_rows
Also, as we will use Tidyverse tools to munge the data, we will find it useful to put the data in a tibble, albeit one with just a single column, which we call temp.
ritter_data_raw <-
  output |>
  read_lines(skip = skip_rows, n_max = max_rows) |>
  tibble(temp = _)
Coming as it does from a PDF, the data have some messy elements. First, there are empty rows, which we can detect by matching on ^$ (i.e., the start and end of the string with nothing in between), and we will want to filter out these rows.
ritter_data_raw |>
  filter(str_detect(temp, "^$"))
# A tibble: 49 × 1
temp
<chr>
1 ""
2 ""
3 ""
4 ""
5 ""
6 ""
7 ""
8 ""
9 ""
10 ""
# ℹ 39 more rows
Second, there are rows with just spaces and page numbers, which we can match with:
ritter_data_raw |>
  filter(str_detect(temp, "^\\s+\\d+$"))
# A tibble: 7 × 1
temp
<chr>
1 " 2"
2 " 3"
3 " 4"
4 " 5"
5 " 6"
6 " 7"
7 " 8"
So we want to filter out these rows too:
ritter_data_raw <-
  output |>
  read_lines(skip = skip_rows, n_max = max_rows) |>
  tibble(temp = _) |>
  filter(!str_detect(temp, "^$"),
         !str_detect(temp, "^\\s+\\d+$"))
The next step will be, as before, to use extract() from tidyr and a regular expression to arrange the data into columns. However, the regular expression that we will need to use will be a bit more complicated than the one above. If it weren’t for the second column, we could use a function from the readr package such as read_delim() that is designed to read delimited text data, such as comma-separated values (CSVs), tab-separated values, or (most relevant here) values separated by spaces. Unfortunately, the second column (company) contains spaces (e.g., United Parcel Service*), so it would be read as three columns (for this row, at least).12
Fortunately, company is the only column with embedded spaces and it is followed by a column (ipo_date) that is strictly six digits (it has the form yymmdd, where yy is the last two digits of the year, mm is the two-digit month, and dd is the two-digit day of the month). So we can use this to effectively “delimit” the company column from the rest of the data.
The first column (amount_left_on_table) contains non-space characters, which we can represent as [^\\s]. Here the ^ negates the expression following it, \\s, which is the general way of representing spaces in regular expressions. So [^\\s]+ denotes “one or more non-space characters” and embedding this in parentheses (i.e., as ([^\\s]+)) allows us to capture the matching characters. The second column could contain pretty much anything (that is, . in regex terms), followed by one or more spaces, which we can represent as (.+)\\s+. The third column (ipo_date) can be captured by a regular expression for six digits (again followed by one or more spaces that we don’t capture): ([0-9]{6})\\s+. The next four columns are like the first column, so we can use ([^\\s]+) to capture these, with all but the last column being followed by one or more spaces (\\s+). One issue that is not evident initially is that some, but not all, rows begin with spaces. We don’t want to capture these spaces, but we want to allow for them to be there, so we can use ^\\s* to represent “the start of the line followed by zero or more spaces”.
Now we can put all of this together as follows:
regex <- str_c("^\\s*",           # Start of string (perhaps followed by spaces)
               "([^\\s]+)\\s+",   # Non-space characters (followed by spaces)
               "(.+)\\s+",        # Any characters, which may include spaces
                                  #   (followed by spaces)
               "([0-9]{6})\\s+",  # Six digits (followed by spaces)
               "([^\\s]+)\\s+",   # Non-space characters (followed by spaces)
               "([^\\s]+)\\s+",   # Non-space characters (followed by spaces)
               "([^\\s]+)\\s+",   # Non-space characters (followed by spaces)
               "([^\\s]+)",       # Non-space characters
               "$")               # End of string
We can then run this through extract().
ritter_data_raw |>
  # Here we use the regular expression to split the data into columns
  extract(temp, col_names, regex)
# A tibble: 402 × 7
amount_left_on_table company ipo_date offer_price first_close_price
<chr> <chr> <chr> <chr> <chr>
1 $5,075,000,000 "Visa … 080319 $44.00 $56.50
2 $3,937,028,063 "Airbnb … 201210 $68.00 $144.71
3 $3,750,040,000 "Snowflake … 200916 $120 $253.93
4 $3,477,690,000 "Rivian Aut… 211110 $78.00 $100.73
5 $2,887,830,000 "DoorDash … 201209 $102 $189.51
6 $1,852,500,000 "Coupang … 210311 $35.00 $49.25
7 $1,586,300,000 "United Par… 991110 $50.00 $68.125
8 $1,540,730,469 "Corvis … 000728 $36.00 $84.71875
9 $1,496,000,000 "Snap … 170302 $17.00 $24.48
10 $1,365,500,000 "Bumble … 210211 $43.00 $70.31
# ℹ 392 more rows
# ℹ 2 more variables: shares_offered <chr>, ticker <chr>
We are getting close. Now we want to convert the fields amount_left_on_table, offer_price, first_close_price, and shares_offered to numerical values. For this task, the parse_number() function from readr is ideal. Note that we can use across() to apply a single function (parse_number()) to multiple columns (in this case, all_of() these four columns) and, by default, the results will replace the values previously found in those columns. We also want to convert ipo_date to an actual date, which we can do using ymd() from the lubridate package. At this stage, we store the result in a data frame ritter_data.
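As a quick illustration of why parse_number() (rather than, say, as.numeric()) is the right tool here:

parse_number("$5,075,000,000")  # strips "$" and "," to give 5075000000
as.numeric("$5,075,000,000")    # NA (with a coercion warning)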
ritter_data <-
  ritter_data_raw |>
  # Here we use the regular expression to split the data into columns
  extract(temp, col_names, regex) |>
  # Finally, fix up the data types of the columns
  mutate(across(all_of(c("amount_left_on_table", "first_close_price",
                         "offer_price", "shares_offered")),
                parse_number),
         ipo_date = ymd(ipo_date),
         company = str_trim(company))
ritter_data
# A tibble: 402 × 7
amount_left_on_table company ipo_date offer_price first_close_price
<dbl> <chr> <date> <dbl> <dbl>
1 5075000000 Visa 2008-03-19 44 56.5
2 3937028063 Airbnb 2020-12-10 68 145.
3 3750040000 Snowflake 2020-09-16 120 254.
4 3477690000 Rivian Au… 2021-11-10 78 101.
5 2887830000 DoorDash 2020-12-09 102 190.
6 1852500000 Coupang 2021-03-11 35 49.2
7 1586300000 United Pa… 1999-11-10 50 68.1
8 1540730469 Corvis 2000-07-28 36 84.7
9 1496000000 Snap 2017-03-02 17 24.5
10 1365500000 Bumble 2021-02-11 43 70.3
# ℹ 392 more rows
# ℹ 2 more variables: shares_offered <dbl>, ticker <chr>
At this stage, we should check whether we have any rows with NA values, as the existence of such rows would suggest possible parsing issues.
ritter_data |>
  filter(if_any(.cols = everything(), .fns = is.na))
# A tibble: 0 × 7
# ℹ 7 variables: amount_left_on_table <dbl>, company <chr>,
# ipo_date <date>, offer_price <dbl>, first_close_price <dbl>,
# shares_offered <dbl>, ticker <chr>
We have none!
9.2.1 Exercises
1. In the PDF, Ritter states “IPOs marked with an * also had international tranches (and probably some others, too).” Write code to remove this “*” when present and create an additional column intl_tranche that indicates an international tranche. (Hint: The extract() function can be used here. In a first pass, you might use into = c("company", "intl_tranche") and regex = "^(.*?)(\\*?)$" with this function. Can you see what the ? in .*? is doing? There is some explanation here. Does the regex work without this ?? What does \\*? match?)

2. Ritter defines money left on the table as “the difference between the closing price on the first day of trading and the offer price, multiplied by the number of shares sold.” Can you calculate this from the data provided? Is the calculated amount (amount) equal to the amount in amount_left_on_table in each case? What explains the differences? (Hints: There will be more than one reason. You may find it helpful to calculate ratio = amount/amount_left_on_table and to focus on differences of more than 1% with filter(abs(ratio - 1) > 0.01).)

3. In words, what do each of the following regular expressions match? What do they capture?

   "^\\s*"
   "(.+)\\s+"
   "([^\\s]+)\\s+"
   "([0-9]{1,2}/[0-9]{4})\\s+"
   "([0-9,]+)\\s+"

4. The online appendix for “The Customer Knows Best: The Investment Value of Consumer Opinions” contains Table OA.1, entitled List of firms with customer reviews on Amazon.com. Using an approach similar to the one we used for ritter_data above, create a data frame huang_data with columns company_name, industry, start, end, months, and reviews.

   - months and reviews should be numerical values.
   - start and end should be dates (use the first day of the month if only the month and year are specified).
   - You can solve this by combining the partial regular expressions above into a single regular expression (in the way we combined partial regular expressions above for ritter_data using str_c()).

5. Using the following code, where the URL provides a sample of observations on mergers and acquisitions from SDC, create a first-pass import of the data.13 What is the first issue you see when you look at the data in ma_sdc? (Hint: Look at the first five rows.) Adapt the code to address this issue. (Hint: You may have to experiment with different values for the skip argument to get the right setting.)

   col_names <- c("date_announced", "date_effective", "tgt_name",
                  "tgt_nation", "acq_name", "acq_nation",
                  "status", "pct_of_shares_acq", "pct_owned_after_transaction",
                  "acq_cusip", "tgt_cusip", "value_of_transaction_mil",
                  "acq_prior_mktval", "tgt_prior_mktval",
                  "acq_nation_code", "tgt_nation_code")

   url <- str_c("https://gist.githubusercontent.com/iangow/",
                "eb7dfe1cd0913821429bdf0566465d41/raw/",
                "358d60a4429f5747abc61f8acc026d335fc165f3/sap_sample.txt")

   ma_sdc_file <- tempfile()
   download.file(url, ma_sdc_file)

6. Open the file found at url in your browser (browseURL(url) will help here) and locate the row containing the word Coffey. What do you see there as relates to the variable status? How does this compare with what you see in status if you filter using filter(tgt_name == "Coffey International Ltd")? What do you think has happened here? How can setting a value for n in fwf_empty() help here? (Hint: Using which(ma_sdc$status == "Unconditi") might help here.)

7. Using an appropriate function from the lubridate package, fix the variables date_announced and date_effective so that they have type Date.

8. What are the minimum and maximum values of date_announced and date_effective? What explains missing values (if any) here?

9. What do you observe about acq_cusip and tgt_cusip? Can you write some code to check that these variables have been read in correctly? (Hint: The function str_length() might be useful here.)
9.3 Further reading
Chapter 7 of R for Data Science provides an introduction to importing data, including functions such as read_csv() and read_fwf(). Chapter 14 in R for Data Science covers strings and Chapter 15 covers regular expressions.
The topic of regular expressions is surprisingly deep and regular expressions are useful in more contexts than might be apparent when first learning about them. Friedl (2006) provides a deep treatment of the topic. Goyvaerts and Levithan (2009) is full of examples of solutions to common cases.
1. Some WRDS data sets have “incorrect” data types and additional work is needed to address these cases.↩︎
2. Regular expressions are available in pretty much every platform, including Python, R, PostgreSQL, and SAS. Learning regular expressions on one platform largely carries over to any other. Stata’s support for regular expressions is much weaker than the other platforms’.↩︎
3. According to https://siccode.com, “Standard Industrial Classification Codes (SIC Codes) identify the primary line of business of a company. It is the most widely used system by the US Government, public, and private organizations.”↩︎
4. While Fama-French industry definitions might not change very often, we will see other benefits from a more robust and general approach below.↩︎
5. We justify this approach using the fact that these data sets are not large by any stretch and the assumption that, in general, only one of ff_ind, ff_ind_short_desc, and ff_ind_desc actually gets used in practice.↩︎

6. Note that an alternative to str_c() would be paste0() from base R; we use str_c() here because the stringr functions seem easier to learn as a group than their base R equivalents.↩︎

7. Throughout this section, we will repeat code, but in practice, we would simply add additional lines as we work through the code. This is simply a way to represent the idea of working through the data interactively, something that the Tidyverse makes very easy.↩︎
8. It seems that omitting ^ and $ has no effect in this case.↩︎

9. Again, you may find it useful to inspect the full contents of temp yourself. We don’t do that here due to space constraints.↩︎

10. We only print the first 70 characters of each row to keep the output on the page.↩︎
11. Note that in other contexts, such as inside [ and ], ^ will act as a kind of “not” operator.↩︎

12. If the values in company had been quoted, e.g., as "United Parcel Service*", then reading the data as space-delimited would work. But these are “wild” data from a PDF and such niceties cannot be expected.↩︎

13. We messed with these data, so they cannot be used for research! But they are a realistic representation of an actual data set.↩︎