9  Importing data

Thus far, our data have come from either the era_py package or WRDS. In most cases, the WRDS data are nicely rectangular, meaning that we have \(N\) observations of \(K\) variables, and they already have assigned data types, etc.1 In practice, researchers often obtain data from other sources, especially the internet, and such data are often rather messy. This chapter provides an introduction to importing data from such sources.

While this chapter is fairly task-oriented, we think that it serves to reinforce two deeper ideas.

Even dedicated users of SAS or Stata may find this chapter useful, as Python handles tricky data-munging tasks quite easily and the end results are easily exported in formats that other software can read (e.g., df.to_stata() from pandas creates Stata data files).

We also introduce regular expressions. Briefly speaking, a regular expression is a sequence of characters that define a pattern which can be used in a kind of search (or search-and-replace) on steroids. Regular expressions are very useful when working with data in many contexts.2 The Polars CSV guide, the Polars strings guide, and the Python re module documentation provide excellent introductions that complement material in this chapter. We recommend that you refer to those resources as you work through this chapter.

Tip

The code in this chapter uses the packages listed below. For instructions on how to set up your computer to use this code, go to the support page for this book. The support page also includes Quarto templates for the code and exercises below.

import io
import re
import requests
import zipfile
import polars as pl
import pdfplumber
from era_py import NumberedLines

9.1 Reading (seemingly) non-tabular data

9.1.1 Fama-French industry definitions

The first task that we will explore is collecting data on Fama-French industry definitions, which are widely used in finance and accounting research to map SIC codes, of which there are hundreds, into a smaller number of industry groups for analysis.3 For example, we might want to group firms into 48, 12, or even 5 industry groups.

The basic data on Fama-French industry definitions are available from Ken French’s website at Tuck School of Business.

There are multiple classifications, starting with 5 industries, then 10, 12, 17, 30, 38, 48, and finally 49 industries. The data are supplied as zipped text files. For example, the 48-industry data can be found on this page, by clicking the link displayed as Download industry definitions.

If we download that linked file and unzip it, we can open it in a text editor or even Excel. The first ten lines of the file are as follows:

 1 Agric  Agriculture
          0100-0199 Agricultural production - crops
          0200-0299 Agricultural production - livestock
          0700-0799 Agricultural services
          0910-0919 Commercial fishing
          2048-2048 Prepared feeds for animals

 2 Food   Food Products
          2000-2009 Food and kindred products
          2010-2019 Meat products

Looking at the second row, we interpret this as saying that firms with SIC codes between 0100 and 0199 are assigned to industry group 1 (let’s call this field ff_ind), which has a label or short description (ff_ind_short_desc) Agric and a full industry description (ff_ind_desc) of Agriculture.

One approach to this task might be to write a function like the following (this one is woefully incomplete, as it only covers the first two lines of data above):

def get_ff_ind_48(sic):
    if 100 <= sic <= 299:
        return 1
    if sic >= 200 & sic <= 299:
        return 1
    return None

While tedious and time-consuming, this is perfectly feasible. In fact, this is essentially the approach taken in code you can find on the internet (e.g., SAS code here or here or Stata code here).

However, doing this would only solve the problem for the 48-industry grouping. And it certainly could not be described as particularly robust to, for example, changes in Fama-French industry definitions.4

Part of the solution that we use below recognizes that the data are really tabular in nature. A relational database purist would likely look at the data above as representing two tables. One table relates Fama-French industries to short and full industry descriptions. The first two rows in this table would look something like this:

ff_ind ff_ind_short_desc ff_ind_desc
1 Agric Agriculture
2 Food Food Products

The second table would relate Fama-French industries to ranges of SIC codes, and the first few rows of this table would look something like this:

ff_ind sic_min sic_max sic_desc
1 0100 0199 Agricultural production - crops
1 0200 0299 Agricultural production - livestock
1 0700 0799 Agricultural services
1 0910 0919 Commercial fishing

To keep things simple for this exercise, we will disappoint the purists and make a single table with all six fields: ff_ind, ff_ind_short_desc, ff_ind_desc, sic_min, sic_max, sic_desc.5

So how do we make this table? One approach to this task might be to download the linked file, unzip it, open it up in some program (e.g., Excel), and then massage the data manually into the desired form. But this would have the same issues as the approach above.

We can do better by using Python. The first step is to download the data. While one can easily do this manually, but we want to automate this process as much as possible. And we probably don’t have any reason to keep the .zip file once we have used it. The zip_url_to_file() function uses requests to fetch the ZIP file into memory and then zipfile to open the text file inside it.

def zip_url_to_file(url):
    resp = requests.get(url)
    resp.raise_for_status()
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    return zf.open(zf.namelist()[0])
url = ("http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/"
       "ftp/Siccodes48.zip")

with zip_url_to_file(url) as f:
    txt_raw = f.read().splitlines()

This gives us raw text in memory.

If you look at Ken French’s website, you will see that all the industry-definition files have URLs that follow a certain pattern, with just the number of industry groups (in this case, 48) changing. Recognizing this, we can rewrite the code above as follows:

ind = 48

url = ("http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/"
       f"ftp/Siccodes{ind}.zip")

with zip_url_to_file(url) as f:
    txt_raw = f.read().splitlines()

From visual inspection, we can see that our text file is a fixed-width format text file. So to read the data, we will write a small read_fwf() helper that returns a Polars data frame directly.

def read_fwf(file, widths, names,
             skiprows=0, parse_dates=None, date_format=None):

    pattern = "^" + "".join(
        f"(?P<{name}>.{{{width}}})"
        if i < len(widths) - 1
        else f"(?P<{name}>.{{0,{width}}})"
        for i, (name, width) in enumerate(zip(names, widths))
    ) + "$"

    df = (
        pl.read_lines(file, name="line_text")
        .slice(skiprows)
        .filter(pl.col("line_text").str.strip_chars() != "")
        .with_columns(
            pl.col("line_text").str.extract_groups(pattern).alias("parts"))
        .unnest("parts")
        .with_columns(
            pl.col(names).str.strip_chars().replace("", None)
        )
        .drop("line_text")
    )

    if parse_dates is not None:
        df = df.with_columns(
            pl.col(parse_dates).str.to_date(date_format, strict=False)
        )

    return df

The first argument to read_fwf() is a file-like object (f), and the other required arguments are widths and names.

Given that we have a very simple file, we can identify the column widths pretty easily. Manually adding a “ruler” of sorts at the top of the file, we can see below that the first column covers columns 1-3, the second column covers 4-10, and the third column starts at 11.

123456789-123456789-123456789-123456789-123456789-...
 1 Agric  Agriculture
          0100-0199 Agricultural production - crops
          0200-0299 Agricultural production - livestock
          0700-0799 Agricultural services

So we have widths of 3 and 7 for the first two columns, and we use a large third width (1000) to capture the remainder of the line. The first two columns should be named ff_ind and ff_ind_short_desc, but the third column is problematic, as some rows provide information on ff_ind_short_desc and some rows provide data that we will eventually put into sic_min and sic_max; so let’s call that column temp for now.

with zip_url_to_file(url) as f:
    df_raw = read_fwf(
        f,
        widths=[3, 7, 1000],
        names=["ff_ind", "ff_ind_short_desc", "temp"],
    )

The next step is to handle the problematic temp column. We can see that if ff_ind is null, then temp contains (if it contains anything) a range of SIC codes, but if ff_ind is not null, then temp contains the value we want to store in ff_ind_desc. We can use .with_columns() with pl.when() to split temp into ff_ind_desc and sic_range, and then drop temp once those columns are created.

df = (
    df_raw
    .with_columns(
        ff_ind_desc=pl.when(pl.col("ff_ind").is_not_null())
        .then(pl.col("temp")),
        sic_range=pl.when(pl.col("ff_ind").is_null())
        .then(pl.col("temp")),
    )
    .drop("temp")
)

We are getting closer. Now, we see that the issue is that our sic_range column does not line up with the other three columns. To solve this, we can apply the .forward_fill() method to “fill” missing values with the previous non-missing value for each of the three columns.

df = df.with_columns(
    pl.col("ff_ind", "ff_ind_short_desc", "ff_ind_desc")
    .forward_fill()
)

df
shape: (646, 4)
ff_ind ff_ind_short_desc ff_ind_desc sic_range
str str str str
"1" "Agric" "Agriculture" null
"1" "Agric" "Agriculture" "0100-0199 Agricultural product…
"1" "Agric" "Agriculture" "0200-0299 Agricultural product…
"48" "Other" "Almost Nothing" "4960-4961 Steam & air conditio…
"48" "Other" "Almost Nothing" "4970-4971 Irrigation systems"
"48" "Other" "Almost Nothing" "4990-4991 Cogeneration - SM po…

At this point, we have no further use for the rows where sic_range is null, so we can filter them out.

df = df.filter(pl.col("sic_range").is_not_null())
df
shape: (598, 4)
ff_ind ff_ind_short_desc ff_ind_desc sic_range
str str str str
"1" "Agric" "Agriculture" "0100-0199 Agricultural product…
"1" "Agric" "Agriculture" "0200-0299 Agricultural product…
"1" "Agric" "Agriculture" "0700-0799 Agricultural service…
"48" "Other" "Almost Nothing" "4960-4961 Steam & air conditio…
"48" "Other" "Almost Nothing" "4970-4971 Irrigation systems"
"48" "Other" "Almost Nothing" "4990-4991 Cogeneration - SM po…

The last issue to address is the column sic_range. We want to split that into three target columns (sic_min, sic_max, and sic_desc). To do this, we can use the .extract_groups() method of the Polars string namespace. The required argument for .extract_groups() is a regular expression with capturing groups that is used to split the data. We will discuss regular expressions more carefully later in the chapter, so it is fine if you don’t fully follow what’s going on here. For now, we just need to know that the portions contained in each pair of parentheses (( and )) are what is captured for each field. The first pair of parentheses surround [0-9]+, which we can read as “one or more digit characters”. So, for the case of 0100-0199 Agricultural production - crops, this would match 0100. This is followed by a - which is a literal dash that matches the - after 0100. The next pair of parentheses is again “one or more digit characters” and captures 0199. This is followed by \s*. The \s represents “any space” and the * means “zero or more”, so \s* means “zero or more spaces”, which are matched, but not captured. Finally we have (.*)$. The . represents “any character”, so .* means “zero or more of any character”, which are captured as the third variable. The $ just means “the end of the string”; this mirrors the ^ at the start of the regular expression, which means “the start of the string”. Together, ^ and $ ensure that we are considering the whole string in our analysis.6

Finally, we use named capturing groups (i.e., ?P<var_name>) to name each captured group so that the name of the captured group becomes the column name in the extracted data frame.

df = (
    df
    .with_columns(
        pl.col("sic_range")
        .str.extract_groups(
            r"^(?P<sic_min>[0-9]+)-"
            r"(?P<sic_max>[0-9]+)\s*"
            r"(?P<sic_desc>.*)$"
        )
        .alias("sic_parts")
    )
    .unnest("sic_parts")
)

The columns ff_ind, sic_min and sic_max can be converted into integers.

df.with_columns(
    pl.col("ff_ind", "sic_min", "sic_max").cast(pl.Int32),
)
shape: (598, 7)
ff_ind ff_ind_short_desc ff_ind_desc sic_range sic_min sic_max sic_desc
i32 str str str i32 i32 str
1 "Agric" "Agriculture" "0100-0199 Agricultural product… 100 199 "Agricultural production - crop…
1 "Agric" "Agriculture" "0200-0299 Agricultural product… 200 299 "Agricultural production - live…
1 "Agric" "Agriculture" "0700-0799 Agricultural service… 700 799 "Agricultural services"
48 "Other" "Almost Nothing" "4960-4961 Steam & air conditio… 4960 4961 "Steam & air conditioning suppl…
48 "Other" "Almost Nothing" "4970-4971 Irrigation systems" 4970 4971 "Irrigation systems"
48 "Other" "Almost Nothing" "4990-4991 Cogeneration - SM po… 4990 4991 "Cogeneration - SM power produc…

Lastly, we can put all of the above into a function. But, as we do so, let’s take a gamble that the same code will work for any of the Fama-French industry classifications if we only change the URL. To do this, we use an argument ind that reflects the industry grouping of interest and inserts that in the URL. Here we use an f-string to create the URL using the value supplied as ind.

Also, because of how we constructed each step in building final data set, we can use a single method pipeline to perform all steps.

def get_ff_ind(ind):
    url = (
        "http://mba.tuck.dartmouth.edu"
        f"/pages/faculty/ken.french/ftp/Siccodes{ind}.zip"
    )

    with zip_url_to_file(url) as f:
        df = (
            read_fwf(
                f,
                widths=[3, 7, 1000],
                names=["ff_ind", "ff_ind_short_desc", "temp"],
            )
            .with_columns(
                ff_ind_desc=pl.when(pl.col("ff_ind").is_not_null())
                .then(pl.col("temp")),
                sic_range=pl.when(pl.col("ff_ind").is_null())
                .then(pl.col("temp")),
            )
            .with_columns(
                pl.col("ff_ind", "ff_ind_short_desc", "ff_ind_desc")
                .forward_fill(),
            )
            .filter(pl.col("sic_range").is_not_null())
            .with_columns(
                pl.col("sic_range")
                .str.extract_groups(
                    r"^(?P<sic_min>[0-9]+)-"
                    r"(?P<sic_max>[0-9]+)\s*"
                    r"(?P<sic_desc>.*)$"
                )
                .alias("sic_parts")
            )
            .unnest("sic_parts")
            .with_columns(
                pl.col("ff_ind", "sic_min", "sic_max").cast(pl.Int32),
            )
            .drop("sic_range", "temp")
        )
    return df

We can test out for 48-industry classification above (just to make sure we didn’t mess up what we already had working):

get_ff_ind(48)
shape: (598, 6)
ff_ind ff_ind_short_desc ff_ind_desc sic_min sic_max sic_desc
i32 str str i32 i32 str
1 "Agric" "Agriculture" 100 199 "Agricultural production - crop…
1 "Agric" "Agriculture" 200 299 "Agricultural production - live…
1 "Agric" "Agriculture" 700 799 "Agricultural services"
48 "Other" "Almost Nothing" 4960 4961 "Steam & air conditioning suppl…
48 "Other" "Almost Nothing" 4970 4971 "Irrigation systems"
48 "Other" "Almost Nothing" 4990 4991 "Cogeneration - SM power produc…

And now let’s try it for 5- and 12-industry classifications:

get_ff_ind(5)
shape: (58, 6)
ff_ind ff_ind_short_desc ff_ind_desc sic_min sic_max sic_desc
i32 str str i32 i32 str
1 "Cnsmr" "Consumer Durables, Nondurables… 100 999 ""
1 "Cnsmr" "Consumer Durables, Nondurables… 2000 2399 ""
1 "Cnsmr" "Consumer Durables, Nondurables… 2700 2749 ""
4 "Hlth" "Healthcare, Medical Equipment,… 3693 3693 ""
4 "Hlth" "Healthcare, Medical Equipment,… 3840 3859 ""
4 "Hlth" "Healthcare, Medical Equipment,… 8000 8099 ""
get_ff_ind(12)
shape: (49, 6)
ff_ind ff_ind_short_desc ff_ind_desc sic_min sic_max sic_desc
i32 str str i32 i32 str
1 "NoDur" "Consumer Nondurables -- Food, … 100 999 ""
1 "NoDur" "Consumer Nondurables -- Food, … 2000 2399 ""
1 "NoDur" "Consumer Nondurables -- Food, … 2700 2749 ""
10 "Hlth" "Healthcare, Medical Equipment,… 3840 3859 ""
10 "Hlth" "Healthcare, Medical Equipment,… 8000 8099 ""
11 "Money" "Finance" 6000 6999 ""

9.1.2 Exercises

  1. Follow the steps below to produce a data set where each column has the appropriate data type.

    1. Go to the MSCI GICS website.
    2. Get the link to the file under historical GICS structures that is “Effective until Sep 28, 2018”.
    3. Use pl.read_excel() to read the linked file directly from the URL.
    4. Identify any variables that need to be handled like temp in the Fama-French data set above and process accordingly.
    5. Use .forward_fill() to fill in rows as necessary.
    6. Make sure that each column has the appropriate data type.

Hints:

  1. You may find it helpful to look at the Excel file so you can see how the data are structured.
  2. pl.read_excel() requires an Excel engine package such as fastexcel.
  3. pl.read_excel() has options that can help you skip rows and assign column names.
  4. Your final column names should be sector, sector_desc, ind_group, ind_group_desc, industry, industry_desc, sub_ind, sub_ind_desc, and sub_ind_details.
  5. The following code snippets might be useful:
  • df[df["sub_ind_details"].notna()]
  • df[cols] = df[cols].ffill() (where cols is a list of columns to forward-fill)
  • df.convert_dtypes()

9.2 Extracting data from messy formats

Sometimes data are provided in formats even messier than fixed-width text files. For example, we may want to extract data (perhaps tabular data) from a PDF. While the data may appear tabular to our eyes, the reality is that PDFs retain very little information about the structure of data, as the PDF format is a lightweight way of making a computer or printer present text for consumption by humans, not statistical software packages.

In this section, we will examine a case study in extracting tabular data from a PDF. Our initial focus will be on extracting data about money “left on the table” by firms in initial public offerings (IPOs). These data are provided by Jay Ritter at the University of Florida in a PDF found here.

In this case, we will extract page text from the PDF in Python The pdf_url_to_text() below uses the pdfplumber package to extract the text as a set of lines, and stores it in a convenient NumberedLines object so we can examine it more closely.7

def pdf_url_to_text(url, start_page=0):
    resp = requests.get(url)
    resp.raise_for_status()
    pdf = pdfplumber.open(io.BytesIO(resp.content))
    return NumberedLines(page.extract_text() 
                         for page in pdf.pages[start_page:])

Looking at the PDF, we see that the first page is text and the table starts on page 2. So we drop the first page when creating pdf_lines.

url = (
    "https://site.warrington.ufl.edu/ritter/files/"
    "money-left-on-the-table-in-IPOs.pdf"
)

pdf_lines = pdf_url_to_text(url, start_page=1)

While the entirety of the extracted text uses up too much space to be shown here, the output appears quite regular. By default, typing pdf_lines shows the first few and last few lines.

pdf_lines
NumberedLines([
  00: Dollar amount IPO Offer First closing Number of Ticker
  01: left on the table Company date Price market price shares offered symbol
  02: $5,075,000,000 Visa 080319 $44.00 $56.50 406,000,000 V
  03: $3,937,028,063 Airbnb 201210 $68.00 $144.71 51,323,531 ABNB
  04: $3,750,040,000 Snowflake 200916 $120 $253.93 28,000,000 SNOW
  …
  441: $128,562,500 Netscape Communications* 950808 $28.00 $58.25 4,250,000 NSCP
  442: Source: Prof. Jay R. Ritter, Warrington College of Business, University of Florida (352-846-2837 voice,
  443: jay.ritter@warrington.ufl.edu). Data come from IPOScoop.com, Bloomberg, Thomson Financial, Dealogic,
  444: WSJ.com, and the S.E.C.’s Edgar electronic database of 424B forms (final prospectuses).
  445: 10
])

From the output above, we see that the column headings are spread over the first two rows and the data items themselves start from line 2. Rather than trying to deal with the column-names-spread-over-two-rows issue with code, we skip the first rows when we import the data, using the regular expression r"^\$" to start reading data from the first row beginning with dollar sign ($). To match at the start of a string, as did above, we use ^ to anchor the regular expression to the start of the string. We use \ before $ because $ usually has a special meaning in regular expressions. For now, we merely create variables to reflect those choices; we will use these variables later.

skip_rows = min(pdf_lines.filter(re.compile(r"^\$")).index)

While the top portion of the table looks like it might be able to work with read_fwf(), once we look at the lines around the bottom of the first page and the top of the second page, we can see that they are not aligned.

re_page_nums = re.compile(r"^\s*\d+\s*$")
page_nums = pdf_lines.filter(re_page_nums).index
pdf_lines[(page_nums[0] - 3):(page_nums[0] + 3)]
NumberedLines([
  49: $553,181,250 Finisar 991112 $19.00 $86.875 8,150,000 FNSR
  50: $552,000,000 Intrepid Potash 080421 $32.00 $50.40 30,000,000 IPI
  51: $551,325,000 Newsmax 250331 $10.00 $83.51 7,500,000 NMAX
  52: 2
  53: $550,375,000 Vroom 200609 $22.00 $47.90 21,250,000 VRM
  54: $547,200,000 Petco Health and Wellness 210114 $18.00 $29.40 48,000,000 WOOF
])

The same is true going from the bottom of the second page the top of the third page.

pdf_lines[(page_nums[1] - 3):(page_nums[1] + 3)]
NumberedLines([
  101: $392,150,000 Verisk Analytics 091006 $22.00 $26.60 85,250,000 VRSK
  102: $391,352,500 Beyond Meat 190502 $25.00 $65.66 9,625,000 BYND
  103: $391,250,000 Firepond 000204 $22.00 $100.25 5,000,000 FIRE
  104: 3
  105: $390,625,000 Kailera Therapeutics 260417 $16.00 $26.00 39,062,500 KLRA
  106: $386,120,000 LinkedIn 110519 $45.00 $94.25 7,840,000 LNKD
])

Finally, the last few rows contain a footer that is not part of the table.

pdf_lines[-6:]
NumberedLines([
  440: $128,724,000 HCA Holdings 110310 $30.00 $31.02 126,200,00 HCA
  441: $128,562,500 Netscape Communications* 950808 $28.00 $58.25 4,250,000 NSCP
  442: Source: Prof. Jay R. Ritter, Warrington College of Business, University of Florida (352-846-2837 voice,
  443: jay.ritter@warrington.ufl.edu). Data come from IPOScoop.com, Bloomberg, Thomson Financial, Dealogic,
  444: WSJ.com, and the S.E.C.’s Edgar electronic database of 424B forms (final prospectuses).
  445: 10
])

We can flag that by looking for Source: at the start of a row. To match at the start of a string, we need to anchor the regular expression using ^ to match the start.8 We want to read only up to one row before the row starting with `Source:.

last_row = min(pdf_lines.filter(re.compile("Source:")).index)
pdf_lines[skip_rows:last_row]
NumberedLines([
  02: $5,075,000,000 Visa 080319 $44.00 $56.50 406,000,000 V
  03: $3,937,028,063 Airbnb 201210 $68.00 $144.71 51,323,531 ABNB
  04: $3,750,040,000 Snowflake 200916 $120 $253.93 28,000,000 SNOW
  05: $3,477,690,000 Rivian Automotive 211110 $78.00 $100.73 153,000,000 RIVN
  06: $3,047,309,018 Figma 250731 $33.00 $115.50 36,937,079 FIG
  …
  437: $130,000,000 Autoweb 990323 $14.00 $40.00 5,000,000 AWEB
  438: $129,250,000 Polo Ralph Lauren* 970611 $26.00 $31.50 23,500,000 RL
  439: $128,920,000 Triton Network Systems 000713 $15.00 $38.4375 5,500,000 TNSI
  440: $128,724,000 HCA Holdings 110310 $30.00 $31.02 126,200,00 HCA
  441: $128,562,500 Netscape Communications* 950808 $28.00 $58.25 4,250,000 NSCP
])

As we saw above, there are lines that are just page numbers.

pdf_lines.filter(re_page_nums)
NumberedLines([
  052: 2
  104: 3
  157: 4
  210: 5
  263: 6
  316: 7
  368: 8
  421: 9
  445: 10
])

We can use .filter_out() to eliminate these lines.

filtered = (
    pdf_lines
    [skip_rows:last_row]
    .filter_out(re_page_nums)
    .reset_index()
)

filtered
NumberedLines([
  00: $5,075,000,000 Visa 080319 $44.00 $56.50 406,000,000 V
  01: $3,937,028,063 Airbnb 201210 $68.00 $144.71 51,323,531 ABNB
  02: $3,750,040,000 Snowflake 200916 $120 $253.93 28,000,000 SNOW
  03: $3,477,690,000 Rivian Automotive 211110 $78.00 $100.73 153,000,000 RIVN
  04: $3,047,309,018 Figma 250731 $33.00 $115.50 36,937,079 FIG
  …
  427: $130,000,000 Autoweb 990323 $14.00 $40.00 5,000,000 AWEB
  428: $129,250,000 Polo Ralph Lauren* 970611 $26.00 $31.50 23,500,000 RL
  429: $128,920,000 Triton Network Systems 000713 $15.00 $38.4375 5,500,000 TNSI
  430: $128,724,000 HCA Holdings 110310 $30.00 $31.02 126,200,00 HCA
  431: $128,562,500 Netscape Communications* 950808 $28.00 $58.25 4,250,000 NSCP
])

The next step will be, as before, to use .str.extract_groups() and a regular expression to arrange the data into columns. However, the regular expression that we will need to use will be a bit more complicated than the one above. If it weren’t for the second column, we could use a function that is designed to read delimited text data, such as comma-separated values (CSVs), tab-separated values, or (most relevant here) values separated by spaces.

Unfortunately, the second column (company) contains spaces (e.g., United Parcel Service*), so this would be read as three columns (for this row, at least).9

Fortunately, company is the only column with embedded spaces and it is followed by a column (ipo_date) that is strictly six digits (it has the form yymmdd, where yy is the last two digits of the year, mm are the month digits, and dd represents the date of the month). So we can use this to effectively “delimit” the company column from the rest of the data.

The first column (left_on_table) contains non-space characters, which we can represent as [^\s]+. Here the ^ functions to negate the expression following it \s, which is the general way of representing spaces in regular expressions. So [^\s]+ denotes “one or more non-space characters” and embedding this in parentheses (i.e., as ([^\s]+)) allows us to capture the matching characters as a group. We can go further and use (?P<left_on_table>[^\s]+) so that the captured group is named left_on_table. This will be followed by one or more spaces which we write as \s+ (here the + indicates “one or more”).

The second column is company and could contain pretty much anything (that is . in regex terms), followed by one or more spaces, which we could represent as (?P<company>.+)\s+.

The third column (ipo_date) can be captured by a regular expression for six digits (again followed by one or more spaces that we don’t capture): (?P<ipo_date>[0-9]{6})\s+.

The next four columns (first_close, offer_price, shares_offered, and ticker) are like the first column, so we can again use ([^\s]+) to capture these, with all but the last column being followed by one or more spaces up to the end of the string (\s+$, here the $ represents the end of the string). One issue that is not evident initially is that some, but not all, rows begin with spaces. We don’t want to capture these spaces, but we want to allow for them to be there, so we can use ^\s* to represent “the start of the line followed by zero or more spaces” to do this.

Now we can put all of this together as follows:

regex = (
    r"^\s*"                           # Start string (then perhaps spaces)
    r"(?P<left_on_table>[^\s]+)\s+"   # Non-space characters (then spaces)
    r"(?P<company>.+)\s+"             # Any characters, perhaps spaces
                                      #  (then spaces)
    r"(?P<ipo_date>[0-9]{6})\s+"      # Six digits (then spaces)
    r"(?P<first_close>[^\s]+)\s+"     # Non-space characters (then spaces)
    r"(?P<offer_price>[^\s]+)\s+"     # Non-space characters (then spaces)
    r"(?P<shares_offered>[^\s]+)\s+"  # Non-space characters (then spaces)
    r"(?P<ticker>[^\s]+)"             # Non-space characters
    r"\s*$"                           # Perhaps spaces, then nd of string
)

We can now run this through extract_groups().

(
    pl.DataFrame({"line": list(filtered)})
    .select(pl.col("line").str.extract_groups(regex).alias("parts"))
    .unnest("parts")
)
shape: (432, 7)
left_on_table company ipo_date first_close offer_price shares_offered ticker
str str str str str str str
"$5,075,000,000" "Visa" "080319" "$44.00" "$56.50" "406,000,000" "V"
"$3,937,028,063" "Airbnb" "201210" "$68.00" "$144.71" "51,323,531" "ABNB"
"$3,750,040,000" "Snowflake" "200916" "$120" "$253.93" "28,000,000" "SNOW"
"$128,920,000" "Triton Network Systems" "000713" "$15.00" "$38.4375" "5,500,000" "TNSI"
"$128,724,000" "HCA Holdings" "110310" "$30.00" "$31.02" "126,200,00" "HCA"
"$128,562,500" "Netscape Communications*" "950808" "$28.00" "$58.25" "4,250,000" "NSCP"

We are getting close. Now we want to convert the fields left_on_table, offer_price, first_close, and shares_offered to numerical values. We do this with str.replace_all() and casting in Polars. We also convert ipo_date to an actual date with str.strptime(). At this stage, we store the result in a data frame ritter_data.

ritter_data = (
        pl.DataFrame({"line": list(filtered)})
        .select(pl.col("line").str.extract_groups(regex).alias("parts"))
        .unnest("parts")
        .with_columns(
            pl.col("company").str.strip_chars(),
            pl.col("ipo_date").str.strptime(
                pl.Date,
                format="%y%m%d",
                strict=False,
            ),
        pl.col("shares_offered", "left_on_table", 
               "first_close", "offer_price")
        .str.replace_all(r"[^0-9.\-]", "")
        .cast(pl.Float64, strict=False)
    )
)

ritter_data
shape: (432, 7)
left_on_table company ipo_date first_close offer_price shares_offered ticker
f64 str date f64 f64 f64 str
5.0750e9 "Visa" 2008-03-19 44.0 56.5 4.06e8 "V"
3.9370e9 "Airbnb" 2020-12-10 68.0 144.71 5.1323531e7 "ABNB"
3.7500e9 "Snowflake" 2020-09-16 120.0 253.93 2.8e7 "SNOW"
1.2892e8 "Triton Network Systems" 2000-07-13 15.0 38.4375 5.5e6 "TNSI"
1.28724e8 "HCA Holdings" 2011-03-10 30.0 31.02 1.262e7 "HCA"
1.285625e8 "Netscape Communications*" 1995-08-08 28.0 58.25 4.25e6 "NSCP"

At this stage, we should check if we have any rows with null values, as the existence of such rows would suggest possible parsing issues.

ritter_data.filter(pl.any_horizontal(pl.all().is_null()))
shape: (1, 7)
left_on_table company ipo_date first_close offer_price shares_offered ticker
f64 str date f64 f64 f64 str
2.4583e8 "EquipmentShare.com" 2026-01-23 24.5 32.56 null "EQPT"

We have none!

9.2.1 Exercises

  1. In the PDF, Ritter states “IPOs marked with an * also had international tranches (and probably some others, too).” Write code to remove this “*” when present and create an additional column intl_tranche that indicates an international tranche. (Hint: The extract_groups() function can be used here. In a first pass, you might use named capturing groups and regex = r"^(?P<company>.*?)(?P<intl_tranche>\\*?)$". Can you see what the ? in .*? is doing? There is some explanation here. Does the regex work without this ?? What does \\*? match?)

  2. Ritter defines money left on the table as “the difference between the closing price on the first day of trading and the offer price, multiplied by the number of shares sold.” Can you calculate this from the data provided? Is the calculated amount (amount) equal to the amount in left_on_table in each case? What explains the differences? (Hints: There will be more than one reason. You may find it helpful to calculate ratio = amount / left_on_table and to focus on differences of more than 1% with .filter(abs(ratio - 1) > 0.01).)

  3. In words, what do each of the following regular expressions match? What do they capture?

    • r"^\s*"
    • r"(.+)\s+"
    • r"([^\s]+)\s+"
    • r"([0-9]{1,2}/[0-9]{4})\s+"
    • r"([0-9,]+)\s+"
  4. What happens if we parse shares_offered as floating-point numbers rather than integers?

  5. What happens if we apply .str.extract_groups(regex) directly to pl.DataFrame({"line": list(pdf_lines)})?

  6. The online appendix for “The Customer Knows Best: The Investment Value of Consumer Opinions” contains Table OA.1 entitled List of firms with customer reviews on Amazon.com. Using an approach similar to that we used for ritter_data above, create a data frame huang_data with columns company_name, industry, start, end, months, reviews.

    • months and reviews should be numerical values.
    • start and end should be dates (use the first day of the month if only the month and year is specified).
    • You can solve this by combining the partial regular expressions above into a single regular expression by joining Python raw strings, as we did for ritter_data.
  7. Using the following code, where the URL provides a sample of observations on mergers and acquisitions from SDC, create a first-pass import of the data.10 What is the first issue you see when you look at the data in ma_sdc? (Hint: Look at the first few rows.) Adapt the code to address this issue. (Hint: You may have to experiment with different values for skiprows to get the right setting.)

url = ("https://gist.githubusercontent.com/iangow/"
       "eb7dfe1cd0913821429bdf0566465d41/raw/"
       "358d60a4429f5747abc61f8acc026d335fc165f3/sap_sample.txt")

def url_to_file(url):
    resp = requests.get(url)
    resp.raise_for_status()
    return io.BytesIO(resp.content)

with url_to_file(url) as f:
    ma_sdc = read_fwf(
        f,
        widths=[13, 13, 40, 19, 37, 16, 17, 10, 10, 12, 12, 16, 21, 23, 10, 1000],
        names=
            ["date_announced", "date_effective", "tgt_name",
             "tgt_nation", "acq_name", "acq_nation",
             "status", "pct_of_shares_acq", "pct_owned_after_transaction",
             "acq_cusip", "tgt_cusip", "value_of_transaction_mil",
             "acq_prior_mktval", "tgt_prior_mktval",
             "acq_nation_code", "tgt_nation_code"],
    )
ma_sdc
  1. Fix the variables date_announced and date_effective so that they have type date. (Hint: Use parse_dates=["date_announced", "date_effective"] and date_format="%m/%d/%y".)

  2. What are the minimum and maximum values of date_announced and date_effective? What explains missing values (if any) here?

  3. What do you observe about acq_cusip and tgt_cusip? Can you write some code to check that these variables have been read in correctly? (Hint: The .str.len() method might be useful here.)

9.3 Further reading

The Polars CSV guide provides an introduction to importing delimited data. The Polars strings guide covers common string operations, including pattern matching and extraction. The documentation for polars.Series.str.extract_groups() is especially relevant for the examples in this chapter. Python’s re module documentation provides a broader reference for regular expressions.

The topic of regular expressions is surprisingly deep and regular expressions are useful in more contexts than might be apparent when first learning about them. Friedl (2006) provides a deep treatment of the topic. Goyvaerts and Levithan (2009) is full of examples of solutions to common cases.


  1. Some WRDS data sets have “incorrect” data types and additional work is needed to address these cases.↩︎

  2. Regular expressions are available in pretty much every package, including Python, R, PostgreSQL, and SAS. Learning on one platform largely carries over to any other. Stata’s support for regular expressions is much weaker than the other platforms’.↩︎

  3. According to https://siccode.com, “Standard Industrial Classification Codes (SIC Codes) identify the primary line of business of a company. It is the most widely used system by the US Government, public, and private organizations.”↩︎

  4. While Fama-French industry definitions might not change very often, we will see other benefits from a more robust and general approach below.↩︎

  5. We justify this approach using the fact that these data sets are not large by any stretch and the assumption that, in general, only one of ff_ind, ff_ind_short_desc and ff_ind_desc actually gets used in practice.↩︎

  6. It seems that omitting ^ and $ has no effect in this case.↩︎

  7. Again, you may find it useful to inspect the full contents yourself. We don’t do that here due to space constraints.↩︎

  8. Note that in other contexts, such as inside [ and ], ^ will act as a kind of “not” operator.↩︎

  9. If the values in company had been quoted, e.g., as "United Parcel Service*", then reading the data as space-delimited would work. But these are “wild” data from a PDF and such niceties cannot be expected.↩︎

  10. We messed with these data, so these cannot be used for research! But they are a realistic representation of an actual data set.↩︎