9  Importing data

Thus far, our data have come from either the farr package or WRDS. In most cases, the WRDS data are nicely rectangular, meaning that we have N observations of K variables, and they already have assigned data types, etc.1 In practice, researchers often obtain data from other sources, especially the internet, and such data are often rather messy. This chapter provides an introduction to importing data from such sources.

While this chapter is fairly task-oriented, we think that it serves to reinforce two deeper ideas.

Even dedicated users of SAS or Stata may find this chapter useful, as R handles tricky data-munging tasks quite easily and the end results are easily exported in formats that other software can read (e.g., write_dta() from the haven package creates Stata data files).

We also introduce regular expressions. Briefly speaking, a regular expression is a sequence of characters that defines a pattern, which can be used in a kind of search (or search-and-replace) on steroids. Regular expressions are very useful when working with data in many contexts.2 The chapters on strings and regular expressions in R for Data Science provide excellent introductions that complement the material in this chapter. We recommend that you refer to those chapters as you work through this chapter.

Tip

The code in this chapter uses the packages listed below. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available on GitHub.

library(dplyr)
library(tidyr)     
library(readr)     # read_lines(), read_fwf(), etc.
library(stringr)   # str_c(), str_detect()
library(pdftools)  # pdf_text()
library(lubridate) # ymd()

9.1 Reading (seemingly) non-tabular data

9.1.1 Fama-French industry definitions

The first task that we will explore is collecting data on Fama-French industry definitions, which are widely used in finance and accounting research to map SIC codes, of which there are hundreds, into a smaller number of industry groups for analysis.3 For example, we might want to group firms into 48, 12, or even 5 industry groups.

The basic data on Fama-French industry definitions are available from Ken French’s website at Tuck School of Business.

There are multiple classifications, starting with 5 industries, then 10, 12, 17, 30, 38, 48, and finally 49 industries. The data are supplied as zipped text files. For example, the 48-industry data can be found on this page, by clicking the link displayed as Download industry definitions.

If we download that linked file and unzip it, we can open it in a text editor or even Excel. The first ten lines of the file are as follows:

 1 Agric  Agriculture
          0100-0199 Agricultural production - crops
          0200-0299 Agricultural production - livestock
          0700-0799 Agricultural services
          0910-0919 Commercial fishing
          2048-2048 Prepared feeds for animals

 2 Food   Food Products
          2000-2009 Food and kindred products
          2010-2019 Meat products

Looking at the second row, we interpret this as saying that firms with SIC codes between 0100 and 0199 are assigned to industry group 1 (let’s call this field ff_ind), which has a label or short description (ff_ind_short_desc) of Agric and a full industry description (ff_ind_desc) of Agriculture.

One approach to this task might be to write a function like the following (this one is woefully incomplete, as it only covers the first two lines of data above):

get_ff_ind_48 <- function(sic) {
  case_when(sic >= 100 & sic <= 199 ~ 1,
            sic >= 200 & sic <= 299 ~ 1)
}
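
While incomplete, the function already works for the SIC codes it covers. Because case_when() is vectorized, we can pass it a vector of SIC codes (the values here are just for illustration):

get_ff_ind_48(c(150, 250, 2048))
[1]  1  1 NA

The NA for 2048 simply reflects that our function only covers the first two SIC ranges so far.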

While tedious and time-consuming, this is perfectly feasible. In fact, this is essentially the approach taken in code you can find on the internet (e.g., SAS code here or here or Stata code here).

However, doing this would only solve the problem for the 48-industry grouping. And it certainly could not be described as particularly robust to, for example, changes in Fama-French industry definitions.4

Part of the solution that we use below recognizes that the data are really tabular in nature. A relational database purist would likely look at the data above as representing two tables. One table relates Fama-French industries to short and full industry descriptions. The first two rows in this table would look something like this:

ff_ind ff_ind_short_desc ff_ind_desc
1 Agric Agriculture
2 Food Food Products

The second table would relate Fama-French industries to ranges of SIC codes, and the first few rows of this table would look something like this:

ff_ind sic_min sic_max sic_desc
1 0100 0199 Agricultural production - crops
1 0200 0299 Agricultural production - livestock
1 0700 0799 Agricultural services
1 0910 0919 Commercial fishing

To keep things simple for this exercise, we will disappoint the purists and make a single table with all six fields: ff_ind, ff_ind_short_desc, ff_ind_desc, sic_min, sic_max, sic_desc.5

So how do we make this table? One approach to this task might be to download the linked file, unzip it, open it up in some program (e.g., Excel), and then massage the data manually into the desired form. But this would have the same issues as the approach above.

We can do better by using R and tools from the Tidyverse. The first step is to download the data. While one can easily do this manually, we want to automate this process as much as possible. And we probably don’t have any reason to keep the .zip file once we have used it. R provides two functions that we can use here: download.file() and tempfile(). The tempfile() function creates a random file name in a location that our system will clean up automatically once we’re no longer using it.

t <- tempfile(fileext = ".zip")
url <- str_c("https://mba.tuck.dartmouth.edu/pages/faculty/",
             "ken.french/ftp/Siccodes48.zip")
download.file(url, t)

Here t is something like filef68c866a5c8.zip: the name is random apart from the .zip extension, which we need for our code to recognize the supplied file as a zipped file. The download.file() function downloads the file at url and saves it as t.

If you look at Ken French’s website, you will see that all the industry-definition files have URLs that follow a certain pattern, with just the number of industry groups (in this case, 48) changing. Recognizing this, we can rewrite the code above as follows:

ind <- 48
t <- tempfile(fileext = ".zip")
url <- str_c("http://mba.tuck.dartmouth.edu",
             "/pages/faculty/ken.french/ftp/Siccodes", ind, ".zip")
download.file(url, t)

Here the str_c() function from the stringr package pastes all the pieces back together.6

From visual inspection, we can see that our text file is a fixed-width format text file. So to read the data, we will use the function read_fwf() from the readr package.

There are two required arguments to read_fwf(): file and col_positions. From the help for read_fwf() (type ? readr::read_fwf in the R console to see this), we see that col_positions refers to “Column positions, as created by fwf_empty(), fwf_widths() or fwf_positions(). If the width of the last column is variable (i.e., we have a ragged fixed-width format file), we can supply the last end position as NA.” We can also see that fwf_widths() is itself a function: fwf_widths(widths, col_names = NULL).

Given that we have a very simple file, we can identify the column widths pretty easily. Manually adding a “ruler” of sorts at the top of the file, we can see below that the first field covers character positions 1–3, the second covers positions 4–10, and the third starts at position 11.

123456789-123456789-123456789-123456789-123456789-...
 1 Agric  Agriculture
          0100-0199 Agricultural production - crops
          0200-0299 Agricultural production - livestock
          0700-0799 Agricultural services

So we have widths of 3 and 7, and we can use NA to have R figure out the width of the last column. The first two columns should be named ff_ind and ff_ind_short_desc, but the third column is problematic, as some rows provide information on ff_ind_desc and some rows provide data that we will eventually put into sic_min and sic_max; so let’s call that column temp for now. Finally, as the first column contains integer values, while the other two are text columns, we can supply the string "icc" (integer, character, character) to the optional col_types argument of read_fwf() to ensure that the columns are read as those types.

t |>
  read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                      c("ff_ind", "ff_ind_short_desc",
                                        "temp")),
           col_types = "icc") 
# A tibble: 693 × 3
   ff_ind ff_ind_short_desc temp                                         
    <int> <chr>             <chr>                                        
 1      1 Agric             Agriculture                                  
 2     NA <NA>              0100-0199 Agricultural production - crops    
 3     NA <NA>              0200-0299 Agricultural production - livestock
 4     NA <NA>              0700-0799 Agricultural services              
 5     NA <NA>              0910-0919 Commercial fishing                 
 6     NA <NA>              2048-2048 Prepared feeds for animals         
 7     NA <NA>              <NA>                                         
 8      2 Food              Food Products                                
 9     NA <NA>              2000-2009 Food and kindred products          
10     NA <NA>              2010-2019 Meat products                      
# ℹ 683 more rows

The next step is to handle the problematic temp column. We can see that if ff_ind is NA, then temp contains (if it contains anything) a range of SIC codes, but if ff_ind is not NA, then temp contains the value we want to store in ff_ind_desc. We can use mutate() along with if_else() to extract the data into the appropriate columns (we’ll call the range of SIC codes sic_range for now) and, having done so, we can drop the column temp.7

t |>
  read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                      c("ff_ind", "ff_ind_short_desc",
                                        "temp")),
           col_types = "icc") |>
  mutate(ff_ind_desc = if_else(!is.na(ff_ind), temp, NA),
         sic_range = if_else(is.na(ff_ind), temp, NA)) |>
  select(-temp)
# A tibble: 693 × 4
   ff_ind ff_ind_short_desc ff_ind_desc   sic_range                        
    <int> <chr>             <chr>         <chr>                            
 1      1 Agric             Agriculture   <NA>                             
 2     NA <NA>              <NA>          0100-0199 Agricultural productio…
 3     NA <NA>              <NA>          0200-0299 Agricultural productio…
 4     NA <NA>              <NA>          0700-0799 Agricultural services  
 5     NA <NA>              <NA>          0910-0919 Commercial fishing     
 6     NA <NA>              <NA>          2048-2048 Prepared feeds for ani…
 7     NA <NA>              <NA>          <NA>                             
 8      2 Food              Food Products <NA>                             
 9     NA <NA>              <NA>          2000-2009 Food and kindred produ…
10     NA <NA>              <NA>          2010-2019 Meat products          
# ℹ 683 more rows

We are getting closer. Now, we see that the issue is that our sic_range column does not line up with the other three columns. To solve this, we can use fill() from the tidyr package. The fill() function accepts as arguments the columns to “fill”. In this case, we want to fill missing values with the previous non-missing value, which is what fill() does by default (.direction = "down").
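
To see what fill() does in isolation, consider a toy example (the data here are made up):

tibble(ind = c("Agric", NA, NA, "Food", NA)) |>
  fill(ind)
# A tibble: 5 × 1
  ind  
  <chr>
1 Agric
2 Agric
3 Agric
4 Food 
5 Food 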

t |>
  read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                      c("ff_ind", "ff_ind_short_desc",
                                        "temp")),
           col_types = "icc") |>
  mutate(ff_ind_desc = if_else(!is.na(ff_ind), temp, NA),
         sic_range = if_else(is.na(ff_ind), temp, NA)) |>
  select(-temp) |>
  fill(ff_ind, ff_ind_short_desc, ff_ind_desc)
# A tibble: 693 × 4
   ff_ind ff_ind_short_desc ff_ind_desc   sic_range                        
    <int> <chr>             <chr>         <chr>                            
 1      1 Agric             Agriculture   <NA>                             
 2      1 Agric             Agriculture   0100-0199 Agricultural productio…
 3      1 Agric             Agriculture   0200-0299 Agricultural productio…
 4      1 Agric             Agriculture   0700-0799 Agricultural services  
 5      1 Agric             Agriculture   0910-0919 Commercial fishing     
 6      1 Agric             Agriculture   2048-2048 Prepared feeds for ani…
 7      1 Agric             Agriculture   <NA>                             
 8      2 Food              Food Products <NA>                             
 9      2 Food              Food Products 2000-2009 Food and kindred produ…
10      2 Food              Food Products 2010-2019 Meat products          
# ℹ 683 more rows

At this point, we have no further use for the rows where sic_range is NA, so we can filter them out.

t |>
  read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                      c("ff_ind", "ff_ind_short_desc",
                                        "temp")),
           col_types = "icc") |>
  mutate(ff_ind_desc = if_else(!is.na(ff_ind), temp, NA),
         sic_range = if_else(is.na(ff_ind), temp, NA)) |>
  select(-temp) |>
  fill(ff_ind, ff_ind_short_desc, ff_ind_desc) |>
  filter(!is.na(sic_range))
# A tibble: 598 × 4
   ff_ind ff_ind_short_desc ff_ind_desc   sic_range                        
    <int> <chr>             <chr>         <chr>                            
 1      1 Agric             Agriculture   0100-0199 Agricultural productio…
 2      1 Agric             Agriculture   0200-0299 Agricultural productio…
 3      1 Agric             Agriculture   0700-0799 Agricultural services  
 4      1 Agric             Agriculture   0910-0919 Commercial fishing     
 5      1 Agric             Agriculture   2048-2048 Prepared feeds for ani…
 6      2 Food              Food Products 2000-2009 Food and kindred produ…
 7      2 Food              Food Products 2010-2019 Meat products          
 8      2 Food              Food Products 2020-2029 Dairy products         
 9      2 Food              Food Products 2030-2039 Canned & preserved fru…
10      2 Food              Food Products 2040-2046 Flour and other grain …
# ℹ 588 more rows

The last issue to address is the column sic_range. We want to split that into three target columns (sic_min, sic_max, and sic_desc). To do this, we can use extract() from the tidyr package. The two required arguments for extract() are col, the column from which data are being extracted, and into, the columns that will get the data.

The regex argument to extract() allows us to specify a regular expression that is used to split the data. We will discuss regular expressions more carefully later in the chapter, so it is fine if you don’t fully follow what’s going on here. For now, we just need to know that the portions contained in each pair of parentheses (( and )) are what is captured for each field. The first pair of parentheses surround [0-9]+, which we can read as “one or more digit characters”. So, for the case of 0100-0199 Agricultural production - crops, this would match 0100. This is followed by a - which is a literal dash that matches the - after 0100. The next pair of parentheses is again “one or more digit characters” and captures 0199. This is followed by \\s*. The \\s represents “any space” and the * means “zero or more”, so \\s* means “zero or more spaces”, which are matched, but not captured. Finally we have (.*)$. The . represents “any character”, so .* means “zero or more of any character”, which are captured as the third variable. The $ just means “the end of the string”; this mirrors the ^ at the start of the regular expression, which means “the start of the string”. Together, ^ and $ ensure that we are considering the whole string in our analysis.8
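
To see what this regular expression captures, we can test it on a single line of the data using str_match() from the stringr package (a quick check, not part of the final pipeline):

str_match("0100-0199 Agricultural production - crops",
          "^([0-9]+)-([0-9]+)\\s*(.*)$")

The first element of the result is the full match; the remaining three elements are the captured groups: "0100", "0199", and "Agricultural production - crops".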

The argument convert = TRUE asks extract() to convert the data types of the extracted fields to types that seem appropriate (e.g., integers for sic_min and sic_max).

t |>
  read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                      c("ff_ind", "ff_ind_short_desc",
                                        "temp")),
           col_types = "icc") |>
  mutate(ff_ind_desc = if_else(!is.na(ff_ind), temp, NA),
         sic_range = if_else(is.na(ff_ind), temp, NA)) |>
  select(-temp) |>
  fill(ff_ind, ff_ind_short_desc, ff_ind_desc) |>
  filter(!is.na(sic_range)) |>
  extract(sic_range, 
          into = c("sic_min", "sic_max", "sic_desc"),
          regex = "^([0-9]+)-([0-9]+)\\s*(.*)$",
          convert = TRUE) 
# A tibble: 598 × 6
   ff_ind ff_ind_short_desc ff_ind_desc   sic_min sic_max sic_desc         
    <int> <chr>             <chr>           <int>   <int> <chr>            
 1      1 Agric             Agriculture       100     199 Agricultural pro…
 2      1 Agric             Agriculture       200     299 Agricultural pro…
 3      1 Agric             Agriculture       700     799 Agricultural ser…
 4      1 Agric             Agriculture       910     919 Commercial fishi…
 5      1 Agric             Agriculture      2048    2048 Prepared feeds f…
 6      2 Food              Food Products    2000    2009 Food and kindred…
 7      2 Food              Food Products    2010    2019 Meat products    
 8      2 Food              Food Products    2020    2029 Dairy products   
 9      2 Food              Food Products    2030    2039 Canned & preserv…
10      2 Food              Food Products    2040    2046 Flour and other …
# ℹ 588 more rows

Lastly, we can put all of the above into a function. But, as we do so, let’s take a gamble that the same code will work for any of the Fama-French industry classifications if we only change the URL. To do this, we use an argument ind that reflects the industry grouping of interest and insert its value into the URL. Here we use the str_c() function from the stringr package to create the URL using the value supplied as ind.

get_ff_ind <- function(ind) {
  t <- tempfile(fileext = ".zip")
  url <- str_c("https://mba.tuck.dartmouth.edu/pages/",
               "faculty/ken.french/ftp/Siccodes", ind, ".zip")
  download.file(url, t)

  t |>
    read_fwf(col_positions = fwf_widths(c(3, 7, NA),
                                        c("ff_ind", "ff_ind_short_desc",
                                          "temp")),
             col_types = "icc") |>
    mutate(ff_ind_desc = if_else(!is.na(ff_ind), temp, NA),
           sic_range = if_else(is.na(ff_ind), temp, NA)) |>
    select(-temp) |>
    fill(ff_ind, ff_ind_short_desc, ff_ind_desc) |>
    filter(!is.na(sic_range)) |>
    extract(sic_range, 
            into = c("sic_min", "sic_max", "sic_desc"),
            regex = "^([0-9]+)-([0-9]+)\\s*(.*)$",
            convert = TRUE) 
}

We can test it out for the 48-industry classification used above (just to make sure we didn’t mess up what we already had working):

get_ff_ind(48)
# A tibble: 598 × 6
   ff_ind ff_ind_short_desc ff_ind_desc   sic_min sic_max sic_desc         
    <int> <chr>             <chr>           <int>   <int> <chr>            
 1      1 Agric             Agriculture       100     199 Agricultural pro…
 2      1 Agric             Agriculture       200     299 Agricultural pro…
 3      1 Agric             Agriculture       700     799 Agricultural ser…
 4      1 Agric             Agriculture       910     919 Commercial fishi…
 5      1 Agric             Agriculture      2048    2048 Prepared feeds f…
 6      2 Food              Food Products    2000    2009 Food and kindred…
 7      2 Food              Food Products    2010    2019 Meat products    
 8      2 Food              Food Products    2020    2029 Dairy products   
 9      2 Food              Food Products    2030    2039 Canned & preserv…
10      2 Food              Food Products    2040    2046 Flour and other …
# ℹ 588 more rows

And now let’s try it for 5- and 12-industry classifications:

get_ff_ind(5)
# A tibble: 58 × 6
   ff_ind ff_ind_short_desc ff_ind_desc            sic_min sic_max sic_desc
    <int> <chr>             <chr>                    <int>   <int> <chr>   
 1      1 Cnsmr             Consumer Durables, No…     100     999 ""      
 2      1 Cnsmr             Consumer Durables, No…    2000    2399 ""      
 3      1 Cnsmr             Consumer Durables, No…    2700    2749 ""      
 4      1 Cnsmr             Consumer Durables, No…    2770    2799 ""      
 5      1 Cnsmr             Consumer Durables, No…    3100    3199 ""      
 6      1 Cnsmr             Consumer Durables, No…    3940    3989 ""      
 7      1 Cnsmr             Consumer Durables, No…    2500    2519 ""      
 8      1 Cnsmr             Consumer Durables, No…    2590    2599 ""      
 9      1 Cnsmr             Consumer Durables, No…    3630    3659 ""      
10      1 Cnsmr             Consumer Durables, No…    3710    3711 ""      
# ℹ 48 more rows
get_ff_ind(12)
# A tibble: 49 × 6
   ff_ind ff_ind_short_desc ff_ind_desc            sic_min sic_max sic_desc
    <int> <chr>             <chr>                    <int>   <int> <lgl>   
 1      1 NoDur             Consumer Nondurables …     100     999 NA      
 2      1 NoDur             Consumer Nondurables …    2000    2399 NA      
 3      1 NoDur             Consumer Nondurables …    2700    2749 NA      
 4      1 NoDur             Consumer Nondurables …    2770    2799 NA      
 5      1 NoDur             Consumer Nondurables …    3100    3199 NA      
 6      1 NoDur             Consumer Nondurables …    3940    3989 NA      
 7      2 Durbl             Consumer Durables -- …    2500    2519 NA      
 8      2 Durbl             Consumer Durables -- …    2590    2599 NA      
 9      2 Durbl             Consumer Durables -- …    3630    3659 NA      
10      2 Durbl             Consumer Durables -- …    3710    3711 NA      
# ℹ 39 more rows
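
With the definitions in tabular form, assigning firms to industries becomes a join rather than a long chain of case_when() branches. Here is a minimal sketch using a non-equi join (this assumes dplyr 1.1.0 or later for join_by(); the SIC codes are just examples):

ff_48 <- get_ff_ind(48)

tibble(sic = c(150L, 2011L)) |>
  inner_join(ff_48, join_by(between(sic, sic_min, sic_max))) |>
  select(sic, ff_ind, ff_ind_short_desc, sic_desc)

Based on the definitions shown above, SIC 150 should map to industry 1 (Agric) and SIC 2011 to industry 2 (Food).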

9.1.2 Exercises

  1. Follow the steps below to produce a data set where each column has the appropriate data type.

    1. Go to the MSCI GICS website.
    2. Get the link to the file under historical GICS structures that is “Effective until Sep 28, 2018”.
    3. Use this link and the tempfile() and download.file() functions to download the linked file.
    4. Use read_excel() from the readxl package to read the downloaded file.
    5. Identify any variables that need to be handled like temp in the Fama-French data set above and process accordingly.
    6. Use the fill() function from the tidyr package to fill in rows as necessary.
    7. Make sure that each column has the appropriate data type.

Hints:

  1. You may find it helpful to look at the Excel file so you can see how the data are structured.
  2. The read_excel() function has skip and col_names arguments that you will probably want to use.
  3. Your final column names should be sector, sector_desc, ind_group, ind_group_desc, industry, industry_desc, sub_ind, sub_ind_desc, and sub_ind_details.
  4. The following code snippets might be useful:
  • filter(!is.na(sub_ind_details))
  • fill(sector:sub_ind_desc, .direction = "down")
  • mutate(across(where(is.numeric), as.integer))

9.2 Extracting data from messy formats

Sometimes data are provided in formats even messier than fixed-width text files. For example, we may want to extract data (perhaps tabular data) from a PDF. While the data may appear tabular to our eyes, the reality is that PDFs retain very little information about the structure of data, as the PDF format is a way of telling a computer or printer how to present text for consumption by humans, not statistical software packages.

In this section, we will examine a case study in extracting tabular data from a PDF. Our initial focus will be on extracting data about money “left on the table” by firms in initial public offerings (IPOs). These data are provided by Jay Ritter at the University of Florida in a PDF found here.

In this case, we will use pdf_text() from the pdftools package. From the help for pdf_text(), we learn that “the pdf_text function renders all text boxes on a text canvas and returns a character vector of equal length to the number of pages in the PDF file.” We also learn that the function accepts an argument for the “pdf file path or raw vector with pdf data”. Since the URL below provides a valid file path for the PDF, we can simply call pdf_text(url) to download the PDF and convert it to a text representation in R.

url <- str_c("https://site.warrington.ufl.edu/ritter/files/", 
             "money-left-on-the-table.pdf")

Looking at the PDF, we see that the first page is text and the table starts on page 2. We can omit the first page returned by pdf_text(url) below by appending [-1].

output <- pdf_text(url)[-1]

While the output from pdf_text(url) or even pdf_text(url)[-1] uses up too much space to be shown here, if you run the code immediately above and then inspect each of these alternatives, you can see that the latter returns data that appear quite regular. In this case, it makes sense to run the text through read_lines() from readr. We will do this and store the result in temp so that we can examine it here more closely.9

temp <- 
  output |>
  read_lines()

Let’s look at a few selected lines. Lines 1:5 represent the top of the table, with the column names unfortunately split into two rows.10

print_width <- 70
str_sub(temp[1:5], 1, print_width)
[1] " Dollar amount                                   IPO        Offer    F"
[2] "left on the table   Company                      date       Price    m"
[3] ""                                                                      
[4] "$5,075,000,000      Visa                        080319      $44.00   $"
[5] "$3,937,028,063      Airbnb                      201210      $68.00   $"

Rather than trying to deal with the column-names-spread-over-two-rows issue with code, we can just manually specify the column names and skip the first two rows when we import the data. For now, we merely create variables to reflect those choices; we will use these variables later.

col_names <-  c("amount_left_on_table", "company", "ipo_date", "offer_price", 
                "first_close_price", "shares_offered", "ticker")
skip_rows <- 2

While the top portion of the table looks like it might work with read_fwf(), once we look at lines 56 through 62, which span the bottom of the first page and the top of the second page of output (pp. 2–3 of the original PDF), we can see that the pages are not aligned.

str_sub(temp[56:62], 1, print_width)
[1] " $541,875,000       Charter Communications*     991109      $19.00   $"
[2] " $541,328,968       Blackstone Group LP         070621      $31.00   $"
[3] ""                                                                      
[4] "                                                        2"             
[5] "$540,929,382   XP Inc                       191211   $27.00   $34.46  "
[6] "$538,630,000   Gigamedia                    000218   $27.00   $88.00  "
[7] "$531,999,999   CureVac B.V.                 200814   $16.00   $55.90  "

The same is true of lines 114 through 120, which span the bottom of the second page and the top of the third page of output (pp. 3–4 of the original PDF).

str_sub(temp[114:120], 1, print_width)
[1] ""                                                                      
[2] "$360,000,000   Niku                         000229   $24.00   $69.00  "
[3] ""                                                                      
[4] "                                                 3"                    
[5] "$355,384,125   Lucent Technologies*        960403   $27.00   $30.625  "
[6] "$354,046,000   Fitbit                      150618   $20.00   $29.68   "
[7] "$352,400,000   Tradeweb Markets            190404   $27.00   $35.81   "

Finally, rows 461 through 463 contain a footer that is not part of the table.

str_sub(temp[458:464], 1, print_width)
[1] " $128,724,000     HCA Holdings                   110310    $30.00    $"
[2] " $128,562,500     Netscape Communications*       950808    $28.00    $"
[3] ""                                                                      
[4] "Source: Prof. Jay R. Ritter, Warrington College of Business, Universit"
[5] "jay.ritter@warrington.ufl.edu). Data come from IPOScoop.com, Bloomberg"
[6] "WSJ.com, and the S.E.C.’s Edgar electronic database of 424B forms (fin"
[7] ""                                                                      

We can flag the footer by looking for Source: at the start of a row. To match at the start of a string, we need to anchor the regular expression using ^ to match the start.11 We want to read only up to the row before row 461, and we will be skipping skip_rows at the start. So we can calculate the maximum number of rows we want to read as follows:

max_rows <- str_which(temp, "^Source:") - 1 - skip_rows

Also, as we will use Tidyverse tools to munge the data, we will find it useful to put the data in a tibble, albeit one with just a single column, which we call temp.

ritter_data_raw <-
  output |>
  read_lines(skip = skip_rows, n_max = max_rows) |>
  tibble(temp = _)

Coming as they do from a PDF, the data have some messy elements.

First, there are empty rows, which we can detect by matching on ^$ (i.e., the start and the end of the string with nothing in between); we will want to filter out these rows.

ritter_data_raw |> 
  filter(str_detect(temp, "^$")) 
# A tibble: 49 × 1
   temp 
   <chr>
 1 ""   
 2 ""   
 3 ""   
 4 ""   
 5 ""   
 6 ""   
 7 ""   
 8 ""   
 9 ""   
10 ""   
# ℹ 39 more rows

Second, there are rows with just spaces and page numbers, which we can match with:

ritter_data_raw |> 
   filter(str_detect(temp, "^\\s+\\d+$")) 
# A tibble: 7 × 1
  temp                                                       
  <chr>                                                      
1 "                                                        2"
2 "                                                 3"       
3 "                                                4"        
4 "                                                5"        
5 "                                                 6"       
6 "                                                  7"      
7 "                                                8"        

So we want to filter out these rows too:

ritter_data_raw <- 
  output |>
    read_lines(skip = skip_rows, n_max = max_rows) |>
    tibble(temp = _) |>
    filter(!str_detect(temp, "^$"), 
           !str_detect(temp, "^\\s+\\d+$"))

The next step will be, as before, to use extract() from tidyr and a regular expression to arrange the data into columns. However, the regular expression that we will need to use will be a bit more complicated than the one above. If it weren’t for the second column, we could use a function from the readr package such as read_delim() that is designed to read delimited text data, such as comma-separated values (CSVs), tab-separated values, or (most relevant here) values separated by spaces.

Unfortunately, the second column (company) contains spaces (e.g., United Parcel Service*), so this would be read as three columns (for this row, at least).12

Fortunately, company is the only column with embedded spaces and it is followed by a column (ipo_date) that is strictly six digits (it has the form yymmdd, where yy is the last two digits of the year, mm are the month digits, and dd represents the date of the month). So we can use this to effectively “delimit” the company column from the rest of the data.

The first column (amount_left_on_table) contains non-space characters, which we can represent as [^\\s]. Here the ^ inside the square brackets negates the expression following it, \\s, which is the general way of representing whitespace in regular expressions. So [^\\s]+ denotes “one or more non-space characters” and embedding this in parentheses (i.e., as ([^\\s]+)) allows us to capture the matching characters. The second column could contain pretty much anything (that is . in regex terms), followed by one or more spaces, which we could represent as (.+)\\s+. The third column (ipo_date) can be captured by a regular expression for six digits (again followed by one or more spaces that we don’t capture): ([0-9]{6})\\s+. The next four columns are like the first column, so we can use ([^\\s]+) to capture these, with all but the last column being followed by one or more spaces (\\s+). One issue that is not evident initially is that some, but not all, rows begin with spaces. We don’t want to capture these spaces, but we want to allow for them to be there, so we can use ^\\s* to represent “the start of the line followed by zero or more spaces” to do this.

Now we can put all of this together as follows:

regex <- str_c("^\\s*",          # Start string (perhaps followed by spaces)
               "([^\\s]+)\\s+",  # Non-space characters (followed by spaces)
               "(.+)\\s+",       # Any characters, which may include spaces 
                                 #  (followed by spaces)
               "([0-9]{6})\\s+", # Six digits (followed by spaces)
               "([^\\s]+)\\s+",  # Non-space characters (followed by spaces)
               "([^\\s]+)\\s+",  # Non-space characters (followed by spaces)
               "([^\\s]+)\\s+",  # Non-space characters (followed by spaces)
               "([^\\s]+)",      # Non-space characters 
               "$")              # End of string

We can then run this through extract().

ritter_data_raw |>
    # Here we use the regular expression to split the data into columns
    extract(temp, col_names, regex) 
# A tibble: 402 × 7
   amount_left_on_table company      ipo_date offer_price first_close_price
   <chr>                <chr>        <chr>    <chr>       <chr>            
 1 $5,075,000,000       "Visa      … 080319   $44.00      $56.50           
 2 $3,937,028,063       "Airbnb    … 201210   $68.00      $144.71          
 3 $3,750,040,000       "Snowflake … 200916   $120        $253.93          
 4 $3,477,690,000       "Rivian Aut… 211110   $78.00      $100.73          
 5 $2,887,830,000       "DoorDash  … 201209   $102        $189.51          
 6 $1,852,500,000       "Coupang   … 210311   $35.00      $49.25           
 7 $1,586,300,000       "United Par… 991110   $50.00      $68.125          
 8 $1,540,730,469       "Corvis    … 000728   $36.00      $84.71875        
 9 $1,496,000,000       "Snap      … 170302   $17.00      $24.48           
10 $1,365,500,000       "Bumble    … 210211   $43.00      $70.31           
# ℹ 392 more rows
# ℹ 2 more variables: shares_offered <chr>, ticker <chr>

We are getting close. Now we want to convert the fields amount_left_on_table, offer_price, first_close_price, and shares_offered to numerical values. For this task, the parse_number() function from readr is ideal. Note that we can use across() to apply a single function (parse_number()) to multiple columns (in this case, all_of() these four columns) and, by default, the results will replace the values previously found in those columns. We also want to convert ipo_date to an actual date, which we can do using ymd() from the lubridate package, and to use str_trim() to remove the spaces that pad out the values in company. At this stage, we store the result in a data frame ritter_data.
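
As a quick illustration of the two conversion functions, using values from the first row of the table:

parse_number("$5,075,000,000")  # returns the number 5075000000
ymd("080319")                   # returns the date 2008-03-19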

ritter_data <-
  ritter_data_raw |>
  # Here we use the regular expression to split the data into columns
  extract(temp, col_names, regex) |>
  # Finally, fix up the data types of the columns
  mutate(across(all_of(c("amount_left_on_table", "first_close_price",
                         "offer_price", "shares_offered")), 
                parse_number),
         ipo_date = ymd(ipo_date),
         company = str_trim(company)) 

ritter_data
# A tibble: 402 × 7
   amount_left_on_table company    ipo_date   offer_price first_close_price
                  <dbl> <chr>      <date>           <dbl>             <dbl>
 1           5075000000 Visa       2008-03-19          44              56.5
 2           3937028063 Airbnb     2020-12-10          68             145. 
 3           3750040000 Snowflake  2020-09-16         120             254. 
 4           3477690000 Rivian Au… 2021-11-10          78             101. 
 5           2887830000 DoorDash   2020-12-09         102             190. 
 6           1852500000 Coupang    2021-03-11          35              49.2
 7           1586300000 United Pa… 1999-11-10          50              68.1
 8           1540730469 Corvis     2000-07-28          36              84.7
 9           1496000000 Snap       2017-03-02          17              24.5
10           1365500000 Bumble     2021-02-11          43              70.3
# ℹ 392 more rows
# ℹ 2 more variables: shares_offered <dbl>, ticker <chr>

At this stage, we should check if we have any rows with NA values, as the existence of such rows would suggest possible parsing issues.

ritter_data |>
  filter(if_any(.cols = everything(), .fns = is.na))
# A tibble: 0 × 7
# ℹ 7 variables: amount_left_on_table <dbl>, company <chr>,
#   ipo_date <date>, offer_price <dbl>, first_close_price <dbl>,
#   shares_offered <dbl>, ticker <chr>

We have none!
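
As suggested at the start of this chapter, a data frame like ritter_data is also easy to export for use in other statistical packages. A minimal sketch using write_dta() from the haven package (the file name here is just an example):

library(haven)

# Write ritter_data as a Stata data file in the working directory
write_dta(ritter_data, "ritter_data.dta")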

9.2.1 Exercises

  1. In the PDF, Ritter states “IPOs marked with an * also had international tranches (and probably some others, too).” Write code to remove this “*” when present and create an additional column intl_tranche that indicates an international tranche. (Hint: The extract() function can be used here. In a first pass, you might use into = c("company", "intl_tranche") and regex = "^(.*?)(\\*?)$" with this function. Can you see what the ? in .*? is doing? There is some explanation here. Does the regex work without this ?? What does \\*? match?)

  2. Ritter defines money left on the table as “the difference between the closing price on the first day of trading and the offer price, multiplied by the number of shares sold.” Can you calculate this from the data provided? Is the calculated amount (amount) equal to the amount in amount_left_on_table in each case? What explains the differences? (Hints: There will be more than one reason. You may find it helpful to calculate ratio = amount/amount_left_on_table and to focus on differences of more than 1% with filter(abs(ratio - 1) > 0.01).)

  3. In words, what do each of the following regular expressions match? What do they capture?

    • "^\\s*"
    • "(.+)\\s+"
    • "([^\\s]+)\\s+"
    • "([0-9]{1,2}/[0-9]{4})\\s+"
    • "([0-9,]+)\\s+"
  4. The online appendix for “The Customer Knows Best: The Investment Value of Consumer Opinions” contains Table OA.1 entitled List of firms with customer reviews on Amazon.com. Using an approach similar to the one we used for ritter_data above, create a data frame huang_data with columns company_name, industry, start, end, months, reviews.

    • months and reviews should be numerical values.
    • start and end should be dates (use the first day of the month if only the month and year is specified).
    • You can solve this by combining the partial regular expressions above into a single regular expression (in the way we combined partial regular expressions above for ritter_data using str_c()).
  5. Using the following code, where the URL provides a sample of observations on mergers and acquisitions from SDC, create a first-pass import of the data.13 What is the first issue you see when you look at the data in ma_sdc? (Hint: Look at the first five rows.) Adapt the code to address this issue. (Hint: You may have to experiment with different values for the skip argument to get the right setting.)

col_names <- c("date_announced", "date_effective", "tgt_name",
               "tgt_nation", "acq_name", "acq_nation",
               "status", "pct_of_shares_acq", "pct_owned_after_transaction",
               "acq_cusip", "tgt_cusip", "value_of_transaction_mil",
               "acq_prior_mktval", "tgt_prior_mktval",
               "acq_nation_code", "tgt_nation_code")

url <- str_c("https://gist.githubusercontent.com/iangow/",
             "eb7dfe1cd0913821429bdf0566465d41/raw/",
             "358d60a4429f5747abc61f8acc026d335fc165f3/sap_sample.txt")
ma_sdc_file <- tempfile()
download.file(url, ma_sdc_file)
ma_sdc_cols <- fwf_empty(ma_sdc_file, col_names = col_names)
ma_sdc <- read_fwf(ma_sdc_file, col_positions = ma_sdc_cols)
  6. Open the file found at url in your browser (browseURL(url) will help here) and locate the row containing the word Coffey. What do you see there as relates to the variable status? How does this compare with what you see in status if you filter using filter(tgt_name == "Coffey International Ltd")? What do you think has happened here? How can setting a value for n in fwf_empty() help here? (Hint: Using which(ma_sdc$status == "Unconditi") might help here.)

  7. Using an appropriate function from the lubridate package, fix the variables date_announced and date_effective so that they have type Date.

  8. What are the minimum and maximum values of date_announced and date_effective? What explains missing values (if any) here?

  9. What do you observe about acq_cusip and tgt_cusip? Can you write some code to check that these variables have been read in correctly? (Hint: The function str_length() might be useful here.)

9.3 Further reading

Chapter 7 of R for Data Science provides an introduction to importing data, including functions such as read_csv() and read_fwf(). Chapter 14 in R for Data Science covers strings and Chapter 15 covers regular expressions.

The topic of regular expressions is surprisingly deep and regular expressions are useful in more contexts than might be apparent when first learning about them. Friedl (2006) provides a deep treatment of the topic. Goyvaerts and Levithan (2009) is full of examples of solutions to common cases.


  1. Some WRDS data sets have “incorrect” data types and additional work is needed to address these cases.↩︎

  2. Regular expressions are available in pretty much every data analysis environment, including Python, R, PostgreSQL, and SAS. Learning them on one platform largely carries over to any other. Stata’s support for regular expressions is much weaker than the other platforms’.↩︎

  3. According to https://siccode.com, “Standard Industrial Classification Codes (SIC Codes) identify the primary line of business of a company. It is the most widely used system by the US Government, public, and private organizations.”↩︎

  4. While Fama-French industry definitions might not change very often, we will see other benefits from a more robust and general approach below.↩︎

  5. We justify this approach using the fact that these data sets are not large by any stretch and the assumption that, in general, only one of ff_ind, ff_ind_short_desc and ff_ind_desc actually gets used in practice.↩︎

  6. Note that an alternative to str_c() would be paste0() from base R; we use str_c() here because the stringr functions seem easier to learn as a group than their base R equivalents.↩︎

  7. Throughout this section, we will repeat code, but in practice, we would simply add additional lines as we work through the code. This is simply a way to represent the idea of working through the data interactively, something that the Tidyverse makes very easy.↩︎

  8. It seems that omitting ^ and $ has no effect in this case.↩︎

  9. Again, you may find it useful to inspect the full contents of temp yourself. We don’t do that here due to space constraints.↩︎

  10. We only print the first 70 characters of each row to keep the output on the page.↩︎

  11. Note that in other contexts, such as inside [ and ], ^ will act as a kind of “not” operator.↩︎

  12. If the values in company had been quoted, e.g., as "United Parcel Service*", then reading the data as space-delimited would work. But these are “wild” data from a PDF and such niceties cannot be expected.↩︎

  13. We messed with these data, so these cannot be used for research! But they are a realistic representation of an actual data set.↩︎