WRDS to Parquet

library(db2pq)

Overview

The main helper on this path is wrds_update_pq(). It downloads a WRDS PostgreSQL table as a Parquet file and skips the download when the existing local file already appears current.

Related helpers are:

wrds_sql_to_pq() exports a custom SQL query against WRDS into the standard Parquet layout.
wrds_schema_to_pq() updates all or selected tables in a WRDS schema.
pq_last_modified(), pq_archive(), and pq_restore() inspect and manage local Parquet files after they have been created.

Using `wrds_update_pq()`

An illustrative first run

A good table to start with is comp.company, which is small enough for testing:

wrds_update_pq("company", "comp")
#> Updated comp.company is available.
#> Beginning file download at 2026-06-02 21:24:12 UTC.
#> Completed file download at 2026-06-02 21:24:13 UTC.

If you run the function above and have not set up your details for a WRDS connection, you will be asked to provide these (e.g., your WRDS username and password). More discussion on this topic can be found in the article on authentication.

In addition, if you have not established DATA_DIR, as discussed in the article on data management, you will be asked to provide that.

Running the same command later should usually skip the download if WRDS metadata indicate that the local file is already current.

company_file <- wrds_update_pq("company", "comp")
#> comp.company already up to date.

The resulting file is an ordinary Parquet file. It can be inspected with arrow, dplyr, DuckDB, Python, or any other tool that reads Parquet.

library(arrow)
library(dplyr, warn.conflicts = FALSE)

read_parquet(company_file) |> select(gvkey, conm, sic)
#> # A tibble: 57,652 × 3
#>    gvkey  conm                     sic  
#>    <chr>  <chr>                    <chr>
#>  1 001000 A & E PLASTIK PAK INC    3089 
#>  2 001001 A & M FOOD SERVICES INC  5812 
#>  3 001002 AAI CORP                 3825 
#>  4 001003 A.A. IMPORTING CO INC    5712 
#>  5 001004 AAR CORP                 5080 
#>  6 001005 A.B.A. INDUSTRIES INC    3724 
#>  7 001006 ABC INDS INC             2711 
#>  8 001007 ABKCO INDUSTRIES INC     3652 
#>  9 001008 ABM COMPUTER SYSTEMS INC 3577 
#> 10 001009 ABS INDUSTRIES INC       3460 
#> # ℹ 57,642 more rows

Identifying Table Names and Schemas

WRDS uses schema names such as crsp, comp, audit, and bank.

wrds_update_pq("funda", "comp", obs = 1000)

You can list tables in a WRDS schema from R:

crsp_tables <- wrds_get_tables("crsp", views = TRUE)
crsp_tables[1:10]
#>  [1] "acti"     "asia"     "asib"     "asic"     "asio"     "asix"    
#>  [7] "bmdebt"   "bmheader" "bmpaymts" "bmquotes"

Use the PostgreSQL schema name exposed by WRDS, which can differ from a short product name or SAS library name. For example, the Fama-French PostgreSQL base-table schema is ff_all, so a schema download starts with:

wrds_get_tables("ff_all")
#>  [1] "factors_china"       "factors_daily"       "factors_monthly"    
#>  [4] "fivefactors_daily"   "fivefactors_monthly" "industry12"         
#>  [7] "industry48"          "liq_ps"              "liq_sadka"          
#> [10] "portfolios"          "portfolios25"        "portfolios_d"

However, these tables are exposed as views in the database schema ff. Setting views = TRUE shows the underlying data made available there.

wrds_schema_to_pq("ff", views = TRUE)
#> Processing 12 table(s) in schema 'ff'.
#> No comment found for ff.factors_china.
#> Beginning file download at 2026-06-02 21:24:17 UTC.
#> Completed file download at 2026-06-02 21:24:17 UTC.
#> Updated ff.factors_daily is available.
#> Beginning file download at 2026-06-02 21:24:18 UTC.
#> Completed file download at 2026-06-02 21:24:18 UTC.
#> Updated ff.factors_monthly is available.
#> Beginning file download at 2026-06-02 21:24:19 UTC.
#> Completed file download at 2026-06-02 21:24:19 UTC.
#> Updated ff.fivefactors_daily is available.
#> Beginning file download at 2026-06-02 21:24:20 UTC.
#> Completed file download at 2026-06-02 21:24:21 UTC.
#> No comment found for ff.fivefactors_monthly.
#> Beginning file download at 2026-06-02 21:24:23 UTC.
#> Completed file download at 2026-06-02 21:24:23 UTC.
#> Updated ff.industry12 is available.
#> Beginning file download at 2026-06-02 21:24:24 UTC.
#> Completed file download at 2026-06-02 21:24:24 UTC.
#> Updated ff.industry48 is available.
#> Beginning file download at 2026-06-02 21:24:25 UTC.
#> Completed file download at 2026-06-02 21:24:25 UTC.
#> Updated ff.liq_ps is available.
#> Beginning file download at 2026-06-02 21:24:26 UTC.
#> Completed file download at 2026-06-02 21:24:27 UTC.
#> Updated ff.liq_sadka is available.
#> Beginning file download at 2026-06-02 21:24:28 UTC.
#> Completed file download at 2026-06-02 21:24:28 UTC.
#> Updated ff.portfolios is available.
#> Beginning file download at 2026-06-02 21:24:29 UTC.
#> Completed file download at 2026-06-02 21:24:29 UTC.
#> Updated ff.portfolios25 is available.
#> Beginning file download at 2026-06-02 21:24:30 UTC.
#> Completed file download at 2026-06-02 21:24:31 UTC.
#> Updated ff.portfolios_d is available.
#> Beginning file download at 2026-06-02 21:24:31 UTC.
#> Completed file download at 2026-06-02 21:24:32 UTC.

The WRDS web query interface can also be useful for identifying the underlying schema and table behind a dataset, but once the table is known, you should download with a db2pq script for reproducibility.

Cleaning and Shaping Data

Correcting Data Types

Some WRDS columns are stored in PostgreSQL with a type that is not ideal for analysis. For example, link-history identifiers in crsp.ccmxpf_lnkhist are often better represented as integers:

wrds_update_pq(
  "ccmxpf_lnkhist", "crsp",
  col_types = list(lpermno = "int32", lpermco = "int32")
)
#> Updated crsp.ccmxpf_lnkhist is available.
#> Beginning file download at 2026-06-02 21:24:33 UTC.
#> Completed file download at 2026-06-02 21:24:33 UTC.

Setting Time Zones

WRDS PostgreSQL often stores timestamps without time-zone information. The tz argument tells db2pq how to interpret those values before writing normalized timestamps to Parquet.

For example, in the FFIEC Call Report data in the WRDS bank schema, timestamps should be interpreted as US Eastern time:

wrds_update_pq(
  "wrds_call_rcfa_1",
  "bank",
  tz = "America/New_York"
)

Some timestamp-like fields are stored as strings. Use col_types with tz to coerce selected fields:

wrds_update_pq(
  "feed03_audit_fees",
  "audit",
  keep = c("auditor_fkey", "file_accepted"),
  col_types = list(
    auditor_fkey = "int32",
    file_accepted = "timestamp"
  ),
  tz = "America/New_York",
  force = TRUE
)

Working With Large Tables

Dropping Unwanted Variables

Some WRDS tables contain blocks of columns that are not needed for most analyses. The drop argument removes columns whose names match one or more regular expressions:

wrds_update_pq(
  "feed02_auditor_changes",
  "audit",
  drop = "^(match|prior|closest)"
)

Selecting Only Certain Variables

Use keep when you want a small subset of columns:

dsf_file = wrds_update_pq(
  "dsf",
  "crsp",
  obs = 100,
  force = TRUE,
  keep = c("permno", "date", "^ret$")
)
#> Forcing update based on user request.
#> Beginning file download at 2026-06-02 21:24:34 UTC.
#> Completed file download at 2026-06-02 21:24:34 UTC.

read_parquet(dsf_file)
#> # A tibble: 100 × 3
#>    permno date           ret
#>     <int> <date>       <dbl>
#>  1  10000 1986-01-07 NA     
#>  2  10000 1986-01-08 -0.0244
#>  3  10000 1986-01-09  0     
#>  4  10000 1986-01-10  0     
#>  5  10000 1986-01-13  0.05  
#>  6  10000 1986-01-14  0.0476
#>  7  10000 1986-01-15  0.0455
#>  8  10000 1986-01-16  0.0435
#>  9  10000 1986-01-17  0     
#> 10  10000 1986-01-20  0     
#> # ℹ 90 more rows

Note that the arguments to keep are evaluated as regular expressions, so you want to use "^ret$" to match ret rather than any column containing ret. For more on regular expressions, see Chapter 9 of Empirical Research in Accounting: Tools and Methods.

Renaming Variables

Use rename to write output columns with preferred names:

wrds_update_pq(
  "company",
  "comp",
  keep = c("gvkey", "conm", "sic"),
  alt_table_name = "sic_codes",
  rename = c(conm = "company_name")
)

For a small documentation example, the renamed file can be written to the temporary repository:

renamed_company_file <- wrds_update_pq(
  "company",
  "comp",
  alt_table_name = "company_names",
  keep = c("gvkey", "conm", "sic"),
  rename = c(conm = "company_name"),
  force = TRUE
)
#> Forcing update based on user request.
#> Beginning file download at 2026-06-02 21:24:35 UTC.
#> Completed file download at 2026-06-02 21:24:36 UTC.

read_parquet(renamed_company_file) |>
  select(gvkey, company_name, sic)
#> # A tibble: 57,652 × 3
#>    gvkey  company_name             sic  
#>    <chr>  <chr>                    <chr>
#>  1 001000 A & E PLASTIK PAK INC    3089 
#>  2 001001 A & M FOOD SERVICES INC  5812 
#>  3 001002 AAI CORP                 3825 
#>  4 001003 A.A. IMPORTING CO INC    5712 
#>  5 001004 AAR CORP                 5080 
#>  6 001005 A.B.A. INDUSTRIES INC    3724 
#>  7 001006 ABC INDS INC             2711 
#>  8 001007 ABKCO INDUSTRIES INC     3652 
#>  9 001008 ABM COMPUTER SYSTEMS INC 3577 
#> 10 001009 ABS INDUSTRIES INC       3460 
#> # ℹ 57,642 more rows

If rename and col_types are used together, names in col_types refer to the output names after renaming:

wrds_update_pq(
  "ccmxpf_lnkhist",
  "crsp",
  rename = c(lpermno = "permno_link"),
  col_types = list(permno_link = "int32")
)

Selecting Only Certain Rows

Use where for SQL row filters. For Compustat annual fundamentals, a common filter is the standard industrial format, standard data format, consolidated, domestic-population set:

wrds_update_pq(
  "funda",
  "comp",
  where = paste(
    "indfmt = 'INDL'",
    "AND datafmt = 'STD'",
    "AND consol = 'C'",
    "AND popsrc = 'D'"
  )
)

Limiting Rows for a Test Download

Use obs to test a download before running the full export:

wrds_update_pq("funda", "comp", obs = 1000)

Combining keep, where, and obs is a useful way to produce a small local file while developing a refresh script.

funda_sample_file <- wrds_update_pq(
  "funda",
  "comp",
  alt_table_name = "funda_sample",
  keep = c("gvkey", "datadate", "fyear", "^at$", "^sale$"),
  where = paste(
    "indfmt = 'INDL'",
    "AND datafmt = 'STD'",
    "AND consol = 'C'",
    "AND popsrc = 'D'"
  ),
  obs = 1000,
  force = TRUE
)
#> Forcing update based on user request.
#> Beginning file download at 2026-06-02 21:24:36 UTC.
#> Completed file download at 2026-06-02 21:24:37 UTC.

read_parquet(funda_sample_file) |> glimpse()
#> Rows: 1,000
#> Columns: 5
#> $ gvkey    <chr> "001000", "001000", "001000", "001000", "001000", "001000", "…
#> $ datadate <date> 1961-12-31, 1962-12-31, 1963-12-31, 1964-12-31, 1965-12-31, …
#> $ fyear    <int> 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1…
#> $ at       <dbl> NA, NA, NA, 1.416, 2.310, 2.430, 2.456, 5.922, 28.712, 33.450…
#> $ sale     <dbl> 0.900, 1.600, 1.457, 2.032, 1.688, 4.032, 3.594, 7.400, 37.39…

Update Logic

Forcing an Update

Use force = TRUE when you want to re-download even if local metadata suggest the file is current. This is useful after changing keep, drop, rename, col_types, or tz.

wrds_update_pq("funda", "comp", force = TRUE, obs = 1000)

Using SAS Metadata

By default, wrds_update_pq() compares the local Parquet metadata with WRDS PostgreSQL table comments. Some WRDS PostgreSQL tables do not have useful comments. In those cases, use_sas = TRUE can retrieve update metadata from the corresponding WRDS SAS dataset.

Some schemas have different names on the PostgreSQL and SAS sides. For example, RavenPack’s PostgreSQL schema can differ from its SAS library:

wrds_update_pq(
  "rpa_entity_mappings",
  "ravenpack_common",
  use_sas = TRUE,
  sas_schema = "rpa"
)

This path requires SSH access to the WRDS SAS server. See the WRDS SSH setup article.

Archiving Data When Updating

Use archive = TRUE when you want replaced files moved into the schema’s archive directory before the new file is written:

wrds_update_pq("company", "comp", archive = TRUE)
pq_last_modified("company", "comp", archive = TRUE)

Archives live under:

<DATA_DIR>/<schema>/archive/

archived_company <- pq_archive("company", "comp")
#> Archived to: db2pq-wrds-to-parquet-examples/comp/archive/company_20260602T060000Z.parquet
basename(archived_company)
#> [1] "company_20260602T060000Z.parquet"

pq_last_modified("company", "comp", archive = TRUE) |>
  select(file_name, last_mod)
#> # A tibble: 1 × 2
#>   file_name                last_mod           
#>   <chr>                    <dttm>             
#> 1 company_20260602T060000Z 2026-06-02 06:00:00

The archived copy can be restored, again without touching the main DATA_DIR.

archived_company_name <- basename(archived_company)

pq_restore(
  archived_company_name,
  "comp",
  archive = FALSE
)
#> Restored to: db2pq-wrds-to-parquet-examples/comp/company.parquet

pq_last_modified("company", "comp")
#> [1] "Company (Updated 2026-06-02)"

Custom SQL

Use wrds_sql_to_pq() when the object you want is better expressed as a query than as a full table export:

wrds_sql_to_pq(
  "SELECT permno, date, ret FROM crsp.dsf WHERE date >= '2024-01-01'",
  table_name = "dsf_recent",
  schema = "crsp"
)

For a small query result, use LIMIT:

sql_query <- "
    SELECT gvkey, datadate, fyear, at, sale
    FROM comp.funda
    WHERE indfmt = 'INDL'
    AND datafmt = 'STD'
    AND consol = 'C'
    AND popsrc = 'D'
    ORDER BY gvkey, datadate
    LIMIT 20
  "

sql_file <- wrds_sql_to_pq(
  sql_query,
  table_name = "funda_sql_sample",
  schema = "comp",
  archive = TRUE
)

read_parquet(sql_file)
#> # A tibble: 20 × 5
#>    gvkey  datadate   fyear    at  sale
#>    <chr>  <date>     <int> <dbl> <dbl>
#>  1 001000 1961-12-31  1961 NA     0.9 
#>  2 001000 1962-12-31  1962 NA     1.6 
#>  3 001000 1963-12-31  1963 NA     1.46
#>  4 001000 1964-12-31  1964  1.42  2.03
#>  5 001000 1965-12-31  1965  2.31  1.69
#>  6 001000 1966-12-31  1966  2.43  4.03
#>  7 001000 1967-12-31  1967  2.46  3.59
#>  8 001000 1968-12-31  1968  5.92  7.4 
#>  9 001000 1969-12-31  1969 28.7  37.4 
#> 10 001000 1970-12-31  1970 33.4  45.3 
#> 11 001000 1971-12-31  1971 29.3  47.0 
#> 12 001000 1972-12-31  1972 19.9  34.4 
#> 13 001000 1973-12-31  1973 21.8  37.8 
#> 14 001000 1974-12-31  1974 25.6  50.3 
#> 15 001000 1975-12-31  1975 23.9  51.2 
#> 16 001000 1976-12-31  1976 38.6  66.4 
#> 17 001000 1977-12-31  1977 44.0  77.9 
#> 18 001001 1978-12-31  1978 NA    NA   
#> 19 001001 1979-12-31  1979 NA    NA   
#> 20 001001 1980-12-31  1980 NA    NA

wrds_update_pq()
wrds_sql_to_pq()
wrds_schema_to_pq()
pq_last_modified()
pq_archive()
Authentication article
Data management article