core.wrds_pg_to_pq

core.wrds_pg_to_pq(
    table_name,
    schema,
    *,
    wrds_id=None,
    data_dir=None,
    col_types=None,
    row_group_size=1048576,
    obs=None,
    modified=None,
    alt_table_name=None,
    keep=None,
    drop=None,
    rename=None,
    where=None,
    batched=True,
    threads=3,
    tz='UTC',
    engine=None,
    numeric_mode=None,
    adbc_batch_size_hint_bytes=None,
    adbc_use_copy=None,
    archive=False,
    archive_dir=None,
)

Export a table from the WRDS PostgreSQL database to a parquet file.

Parameters

Name	Type	Description	Default
table_name	str	Name of table in database.	required
schema	str	Name of database schema.	required
wrds_id	str	WRDS user ID used to access WRDS services. This parameter is required and must be provided either explicitly or via the `WRDS_ID` environment variable.	`None`
data_dir	str	Root directory of parquet data repository. The default is to use the environment value `DATA_DIR` or (if not set) the current directory.	`None`
col_types	dict	Dictionary of PostgreSQL data types to be used when importing data to PostgreSQL or writing to Parquet files. For Parquet files, conversion from PostgreSQL to PyArrow types is handled by DuckDB. Only a subset of columns needs to be supplied. Supplied types should be compatible with data emitted by PostgreSQL (i.e., one can’t “fix” arbitrary type issues using this argument). For example, `col_types = {'permno': 'int32', 'permco': 'int32'}`.	`None`
row_group_size	int	Maximum number of rows in each written row group. Default is `1024 * 1024`.	`1048576`
obs	int	Number of observations to import from database table. Implemented using SQL `LIMIT`. Setting this to a modest value (e.g., `obs=1000`) can be useful for testing `wrds_pg_to_pq()` with large tables.	`None`
modified	str	Last modified string to embed in parquet metadata. If omitted, use the WRDS PostgreSQL table comment as parquet `last_modified` metadata when available.	`None`
alt_table_name	str	Basename of parquet file. Use when the file should have a different name from `table_name`.	`None`
keep	str or iterable	Regex pattern(s) indicating columns to keep.	`None`
drop	str or iterable	Regex pattern(s) indicating columns to drop. If both `drop` and `keep` are provided, `drop` is applied first.	`None`
rename	dict	Mapping from source WRDS PostgreSQL column names to output column names. `col_types` entries should refer to the output names after renaming.	`None`
batched	bool	Indicates whether data will be extracted in batches using `to_pyarrow_batches()` instead of a single call to `to_pyarrow()`. Using batches degrades performance slightly, but dramatically reduces memory requirements for large tables.	`True`
threads	int	The number of threads DuckDB is allowed to use. Setting this may be necessary due to limits imposed on the user by the PostgreSQL database server.	`3`
engine	(duckdb, adbc)	Query execution engine used to read PostgreSQL data before writing Parquet. `None` selects the default engine, `"duckdb"`.	`None`
numeric_mode	(text, float64, decimal)	Handling for PostgreSQL `NUMERIC` columns. `None` keeps the engine-specific default: native decimals on DuckDB, text-backed numerics on ADBC. Explicit `col_types` entries take precedence.	`None`
adbc_batch_size_hint_bytes	int	On the ADBC path, hint the PostgreSQL ADBC driver about the desired Arrow batch size in bytes.	`None`
adbc_use_copy	bool	On the ADBC path, enable or disable the PostgreSQL driver’s `COPY` optimization explicitly.	`None`
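
The interaction between `keep`, `drop`, and `rename` can be easier to follow with a small model. The sketch below is not the library's implementation: `select_columns` is a hypothetical helper, and it assumes that patterns are matched with `re.search` and that renaming happens after column selection, keyed by the source column name. It only illustrates the documented rule that `drop` is applied before `keep`.

```python
import re


def _as_patterns(value):
    """Normalize None, a single pattern, or an iterable of patterns to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def select_columns(columns, keep=None, drop=None, rename=None):
    """Hypothetical model of column selection: drop first, then keep, then rename."""
    drop_patterns = _as_patterns(drop)
    keep_patterns = _as_patterns(keep)

    # Step 1: remove every column matching any drop pattern.
    selected = [
        c for c in columns
        if not any(re.search(p, c) for p in drop_patterns)
    ]

    # Step 2: of the survivors, retain only columns matching a keep pattern.
    if keep_patterns:
        selected = [
            c for c in selected
            if any(re.search(p, c) for p in keep_patterns)
        ]

    # Step 3 (assumed ordering): map source names to output names.
    rename = rename or {}
    return [rename.get(c, c) for c in selected]


cols = ["permno", "permco", "date", "ret", "retx"]
print(select_columns(cols, keep="^(perm|ret)", drop="x$", rename={"ret": "ret_total"}))
# prints ['permno', 'permco', 'ret_total']
```

Because `drop` runs first, `retx` is removed even though it also matches the `keep` pattern. Under this model, a `col_types` entry for the renamed column would use the key `ret_total`, in line with the note in the `rename` description.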

Returns

Name	Type	Description
pq_file	str	Name of parquet file created.
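
The fallbacks described for `wrds_id` and `data_dir`, together with `alt_table_name`, determine who connects and where the file ends up. The sketch below models those rules with two hypothetical helpers, `resolve_wrds_id` and `resolve_pq_path`. The `<data_dir>/<schema>/<basename>.parquet` layout is an assumption made for illustration and is not stated in the parameter descriptions above.

```python
import os
from pathlib import Path


def resolve_wrds_id(wrds_id=None):
    """Use the explicit argument, else the WRDS_ID environment variable."""
    wrds_id = wrds_id or os.environ.get("WRDS_ID")
    if not wrds_id:
        raise ValueError("Provide wrds_id or set the WRDS_ID environment variable.")
    return wrds_id


def resolve_pq_path(table_name, schema, data_dir=None, alt_table_name=None):
    """Hypothetical model of the output path for an exported table."""
    # data_dir falls back to DATA_DIR, then to the current directory.
    root = Path(data_dir or os.environ.get("DATA_DIR") or ".")
    # alt_table_name, when given, replaces table_name as the file basename.
    basename = alt_table_name or table_name
    # Assumed layout: one subdirectory per schema under the repository root.
    return root / schema / f"{basename}.parquet"


print(resolve_pq_path("dsi", "crsp", data_dir="data"))
print(resolve_pq_path("dsi", "crsp", data_dir="data", alt_table_name="dsi_test"))
```

An explicit argument always wins over the environment in this model, which makes it straightforward to point a one-off export at a scratch directory without changing `DATA_DIR`.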

Examples

>>> wrds_pg_to_pq("dsi", "crsp")
>>> wrds_pg_to_pq("feed21_bankruptcy_notification", "audit")
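
Before exporting a large table in full, it can help to run a small trial that exercises several of the keyword-only parameters at once. The sketch below uses only parameters documented above. `export_sample` is a hypothetical wrapper that takes the export function as an argument, so it makes no assumption about the import path, and the column patterns assume a table with `permno`, `permco`, `date`, and `ret` columns.

```python
def export_sample(export_fn, table_name, schema, n_rows=1000):
    """Hypothetical helper: export a small, column-limited sample of a table.

    export_fn is expected to have the signature of wrds_pg_to_pq.
    """
    return export_fn(
        table_name,
        schema,
        # Limit the export to a modest number of rows (SQL LIMIT).
        obs=n_rows,
        # Write to a separate file so a full export is not overwritten.
        alt_table_name=f"{table_name}_sample",
        # Keep only identifier, date, and return columns (regex patterns).
        keep=["^perm", "^date$", "^ret$"],
        # Override types for a subset of columns, as in the col_types example.
        col_types={"permno": "int32", "permco": "int32"},
    )


# Usage (requires WRDS credentials, so it is left commented out):
# pq_file = export_sample(wrds_pg_to_pq, "dsf", "crsp")
```

Passing the function in also makes the wrapper easy to check without a database connection, since any callable that records its arguments can stand in for `wrds_pg_to_pq`.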