Export a WRDS table to Parquet, skipping if already up to date
Source:R/wrds_update_pq.R
wrds_update_pq.RdExports a table from the WRDS PostgreSQL database to a Parquet file. Before
downloading, the WRDS table comment is compared against the last_modified
metadata embedded in any existing local Parquet file. The download is skipped
if the local file is already up to date, making this function safe to call
repeatedly as part of a scheduled data refresh.
Usage
wrds_update_pq(
table_name,
schema,
data_dir = NULL,
out_file = NULL,
force = FALSE,
where = NULL,
obs = NULL,
keep = NULL,
drop = NULL,
rename = NULL,
alt_table_name = NULL,
chunk_size = NULL,
col_types = NULL,
tz = "UTC",
archive = FALSE,
archive_dir = "archive",
use_sas = FALSE,
sas_schema = NULL,
encoding = "utf-8",
wrds_id = NULL,
transfer_method = c("dbi", "adbc"),
numeric_mode = c("decimal", "float64", "text", "raw")
)Arguments
- table_name
Name of the table in the WRDS PostgreSQL database.
- schema
Name of the database schema (e.g.
"crsp","comp").- data_dir
Root directory of the local Parquet data repository. Defaults to the
DATA_DIRenvironment variable, with interactive setup when needed. The output file is written to<data_dir>/<schema>/<table_name>.parquet.- out_file
Optional. Full path for the output Parquet file. Overrides the path derived from
data_dir,schema, andtable_name.- force
If
TRUE, download proceeds regardless of the date comparison result.- where
Optional SQL
WHEREclause (without theWHEREkeyword) to filter rows before export. For example,where = "date > '2020-01-01'".- obs
Optional integer. Limits the number of rows imported using SQL
LIMIT. Useful for testing with large tables (e.g.obs = 1000).- keep
Optional character vector of regex patterns. Only columns whose names match at least one pattern are retained. Applied after
drop.- drop
Optional character vector of regex patterns. Columns whose names match at least one pattern are removed. Applied before
keep.- rename
Optional named character vector or list mapping source column names to output column names.
col_typesnames should refer to output names after renaming.- alt_table_name
Optional. Alternative basename for the output Parquet file, used when the file should have a different name from
table_name.- chunk_size
Number of rows fetched and written per chunk. If omitted, defaults to
100000fortransfer_method = "dbi"and250000fortransfer_method = "adbc".- col_types
Optional named list specifying column type overrides. Values may be string type names (e.g.
"int32","float32","date") or ArrowDataTypeobjects. Only columns that need to differ from their inferred types need to be supplied. Seecol_types = list(permno = "int32", ret = "float32"). String names such as"int32","float32","date","timestamp", and"timestamptz"are supported.- tz
Time zone used to interpret
TIMESTAMP WITHOUT TIME ZONEcolumns. Such columns are cast toTIMESTAMPTZin the SQL query usingAT TIME ZONE, so they are written as UTC-normalised timestamps in the Parquet file. Defaults to"UTC". Set toNULLto leave naive timestamps as-is.- archive
If
TRUE, the existing local Parquet file (if any) is moved to the archive subdirectory before being replaced. The archived filename is<table>_<YYYYMMDDTHHMMSSz>.parquet, where the timestamp suffix is derived from thelast_modifiedmetadata embedded in the existing file (i.e. the date it was last downloaded, not the incoming WRDS table comment).- archive_dir
Name of the archive subdirectory relative to the schema directory. Defaults to
"archive".- use_sas
If
TRUE, the last-modified date is obtained by runningPROC CONTENTSon the SAS dataset via SSH, rather than reading the PostgreSQL table comment. Requires the ssh package and SSH key access towrds-cloud-sshkey.wharton.upenn.edu.- sas_schema
SAS library name to use when
use_sas = TRUE. Defaults toschema.- encoding
Character encoding passed to
PROC CONTENTSwhenuse_sas = TRUE. Default is"utf-8".- wrds_id
WRDS user ID used for the SSH connection when
use_sas = TRUE. Defaults to theWRDS_IDenvironment variable.- transfer_method
Transfer backend used for the SQL-to-Parquet step. Use
"dbi"for the existing DBI/data-frame path or"adbc"for the experimental ADBC/Arrow path.- numeric_mode
Numeric handling mode for PostgreSQL
numericcolumns. The default"decimal"transfers values as text and writes boundednumeric(p, s)columns as Arrow decimals. Use"float64"to cast values toDOUBLE PRECISIONbefore fetching,"text"to retain text values, or"raw"to keep the transfer backend's default representation.
Value
Invisibly returns the path to the active Parquet file. If the local file is already current, no download is performed but the same path is returned.
Examples
if (FALSE) { # \dontrun{
wrds_update_pq("dsi", "crsp")
wrds_update_pq("feed21_bankruptcy_notification", "audit")
# Force re-download even if local file is current
wrds_update_pq("dsi", "crsp", force = TRUE)
# Limit columns and rows (useful for testing)
wrds_update_pq("dsf", "crsp", obs = 1000, keep = c("permno", "date", "ret"))
} # }