core.wrds_pg_to_pq

core.wrds_pg_to_pq(
    table_name,
    schema,
    *,
    wrds_id=None,
    data_dir=None,
    col_types=None,
    row_group_size=1048576,
    obs=None,
    modified=None,
    alt_table_name=None,
    keep=None,
    drop=None,
    rename=None,
    where=None,
    batched=True,
    threads=3,
    tz='UTC',
    engine=None,
    numeric_mode=None,
    adbc_batch_size_hint_bytes=None,
    adbc_use_copy=None,
    archive=False,
    archive_dir=None,
)

Export a table from the WRDS PostgreSQL database to a parquet file.

Parameters

Name Type Description Default
table_name Name of table in database. required
schema Name of database schema. required
wrds_id str WRDS user ID used to access WRDS services. This parameter is required and must be provided either explicitly or via the WRDS_ID environment variable. None
data_dir str Root directory of parquet data repository. The default is to use the environment variable DATA_DIR or (if not set) the current directory. None
col_types dict Dictionary of PostgreSQL data types to be used when importing data to PostgreSQL or writing to Parquet files. For Parquet files, conversion from PostgreSQL to PyArrow types is handled by DuckDB. Only a subset of columns needs to be supplied. Supplied types should be compatible with data emitted by PostgreSQL (i.e., one can’t “fix” arbitrary type issues using this argument). For example, col_types = {'permno': 'int32', 'permco': 'int32'}. None
row_group_size int Maximum number of rows in each written row group. Default is 1024 * 1024. 1048576
obs int Number of observations to import from database table. Implemented using SQL LIMIT. Setting this to a modest value (e.g., obs=1000) can be useful for testing wrds_pg_to_pq() with large tables. None
modified str Last modified string to embed in parquet metadata. If omitted, use the WRDS PostgreSQL table comment as parquet last_modified metadata when available. None
alt_table_name str Basename of parquet file. Used when file should have different name from table_name. None
keep str or iterable Regex pattern(s) indicating columns to keep. None
drop str or iterable Regex pattern(s) indicating columns to drop. If both drop and keep are provided, drop is applied first. None
rename dict Mapping from source WRDS PostgreSQL column names to output column names. col_types entries should refer to the output names after renaming. None
batched bool Indicates whether data will be extracted in batches using to_pyarrow_batches() instead of a single call to to_pyarrow(). Using batches degrades performance slightly, but dramatically reduces memory requirements for large tables. True
threads int The number of threads DuckDB is allowed to use. Setting this may be necessary due to limits imposed on the user by the PostgreSQL database server. 3
engine (duckdb, adbc) Query execution engine used to read PostgreSQL data before writing Parquet. "duckdb"
numeric_mode (text, float64, decimal) Handling for PostgreSQL NUMERIC columns. None keeps the engine-specific default: native decimals on DuckDB, text-backed numerics on ADBC. Explicit col_types entries take precedence. None
adbc_batch_size_hint_bytes int On the ADBC path, hint the PostgreSQL ADBC driver about the desired Arrow batch size in bytes. None
adbc_use_copy bool On the ADBC path, enable or disable the PostgreSQL driver’s COPY optimization explicitly. None
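
The interaction of drop, keep, and rename described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name select_columns is hypothetical, and matching a pattern anywhere in the column name (re.search rather than a full match) is an assumption.

```python
import re


def select_columns(columns, keep=None, drop=None, rename=None):
    """Illustrative only: apply drop, then keep, then rename to column names."""

    def as_patterns(value):
        # Accept a single regex string or an iterable of regex strings.
        if value is None:
            return []
        if isinstance(value, str):
            return [value]
        return list(value)

    def matches(name, patterns):
        # Assumption: a pattern matches if it is found anywhere in the name.
        return any(re.search(p, name) for p in patterns)

    selected = list(columns)
    drop_patterns = as_patterns(drop)
    keep_patterns = as_patterns(keep)

    # drop is applied first, as stated in the parameter description.
    if drop_patterns:
        selected = [c for c in selected if not matches(c, drop_patterns)]
    if keep_patterns:
        selected = [c for c in selected if matches(c, keep_patterns)]

    # rename maps source column names to output column names.
    rename = rename or {}
    return [rename.get(c, c) for c in selected]


columns = ["permno", "permco", "date", "ret", "retx"]
result = select_columns(
    columns,
    keep=["^perm", "^ret"],
    drop="^retx$",
    rename={"ret": "return"},
)
print(result)
```

Under these assumptions, a col_types entry for the renamed column would use the output name 'return' as its key, not the source name 'ret'.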

Returns

Name Type Description
pq_file str Name of parquet file created.

Examples

>>> wrds_pg_to_pq("dsi", "crsp")
>>> wrds_pg_to_pq("feed21_bankruptcy_notification", "audit")
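
A call that combines several of the optional parameters might look like the sketch below. It is hedged in a few ways: the wrapper name export_dsi_sample is hypothetical, the column names and the output basename are illustrative choices, and the wrds_pg_to_pq callable is passed in as an argument so that the sketch does not assume a particular import path. Running it against WRDS requires a valid WRDS ID, supplied explicitly or via WRDS_ID.

```python
def export_dsi_sample(export):
    """Export a small test extract of crsp.dsi (illustrative sketch).

    `export` is expected to be wrds_pg_to_pq, imported by the caller.
    """
    return export(
        "dsi",
        "crsp",
        obs=1000,                      # limit rows while testing (SQL LIMIT)
        keep=["^date$", "^vwretd$"],   # regex patterns for columns to keep
        alt_table_name="dsi_sample",   # basename of the parquet file
    )
```

All options after schema are keyword-only in the signature above, so they must be passed by name.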