core.db_to_pq

core.db_to_pq(
    table_name,
    schema,
    *,
    user=None,
    host=None,
    database=None,
    port=None,
    data_dir=None,
    col_types=None,
    row_group_size=1048576,
    obs=None,
    modified=None,
    alt_table_name=None,
    keep=None,
    drop=None,
    rename=None,
    where=None,
    batched=True,
    threads=None,
    tz='UTC',
    engine=None,
    numeric_mode=None,
    adbc_batch_size_hint_bytes=None,
    adbc_use_copy=None,
    archive=False,
    archive_dir=None,
)

Export a PostgreSQL table to a Parquet file.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| table_name | str | Name of the source PostgreSQL table. | required |
| schema | str | Name of the source PostgreSQL schema. | required |
| user | str | Source PostgreSQL user role. | None |
| host | str | Source PostgreSQL host name. | None |
| database | str | Source PostgreSQL database name. | None |
| port | int | Source PostgreSQL port. | None |
| data_dir | str | Root directory of the Parquet data repository. If omitted, use DATA_DIR or the current working directory. | None |
| col_types | dict | Explicit output column types. Only the columns that need an override have to be listed. Keys refer to the exported column names, that is, the names after any renaming. | None |
| row_group_size | int | Maximum number of rows per written Parquet row group. The default is 1024 * 1024. | 1048576 |
| obs | int | Maximum number of rows to export. Implemented with SQL LIMIT. | None |
| modified | str | Last-modified string to embed in the Parquet metadata. If omitted, use the source PostgreSQL table comment when available. | None |
| alt_table_name | str | Output Parquet basename. If omitted, defaults to table_name. | None |
| keep | str or iterable | Regex pattern(s) matching the source columns to keep. If both keep and drop are supplied, drop is applied first. | None |
| drop | str or iterable | Regex pattern(s) matching the source columns to drop. If both keep and drop are supplied, drop is applied first. | None |
| rename | dict | Mapping from source column names to output column names. Keys are the original PostgreSQL column names and values are the exported names. When rename is used, col_types should refer to the output names after renaming. | None |
| where | str | SQL WHERE condition used to filter source rows before export. | None |
| batched | bool | If True, stream Arrow batches instead of materializing the full result at once. This typically reduces memory use for large tables. | True |
| threads | int | Maximum number of DuckDB worker threads. Applies only to the DuckDB path. | None |
| tz | str | Time zone assumed for timestamp without time zone source columns before normalizing output timestamps. | 'UTC' |
| engine | {"duckdb", "adbc"} | Query execution engine used to read PostgreSQL data before writing Parquet. | "duckdb" |
| numeric_mode | {"text", "float64", "decimal"} | Handling for PostgreSQL NUMERIC columns. None keeps the engine-specific default behavior. Explicit col_types entries take precedence. | "text" |
| adbc_batch_size_hint_bytes | int | On the ADBC path, hint the PostgreSQL driver about the desired Arrow batch size in bytes. | None |
| adbc_use_copy | bool | On the ADBC path, explicitly enable or disable the PostgreSQL driver’s COPY optimization. | None |
| archive | bool | Whether an existing Parquet file should be archived before replacement. | False |
| archive_dir | str | Name of the archive directory relative to data_dir/schema. | None |
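
The where and obs parameters both act on the SQL issued against PostgreSQL: where filters rows and obs becomes a LIMIT. The following is a minimal sketch of how such a query could be assembled. The helper build_select and its quoting rules are hypothetical and only illustrate the idea; they are not taken from the library's source.

```python
def build_select(table_name, schema, columns=None, where=None, obs=None):
    """Hypothetical helper: assemble a SELECT statement from export options."""

    def quote(name):
        # Double-quote identifiers and escape embedded quotes.
        return '"' + name.replace('"', '""') + '"'

    # Select every column unless an explicit list is given.
    col_sql = ", ".join(quote(c) for c in columns) if columns else "*"
    sql = f"SELECT {col_sql} FROM {quote(schema)}.{quote(table_name)}"

    # `where` is passed through as a raw SQL condition.
    if where:
        sql += f" WHERE {where}"

    # `obs` caps the number of exported rows via LIMIT.
    if obs is not None:
        sql += f" LIMIT {int(obs)}"
    return sql


print(build_select("dsi", "crsp", where="date >= '2020-01-01'", obs=1000))
```

Because where is described as a SQL condition rather than a parameterized filter, it should presumably come only from trusted input.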

Returns

| Name | Type | Description |
| --- | --- | --- |
| | str or None | Path to the written Parquet file, or None if the query returns no rows. |
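
The returned path depends on data_dir, schema, and alt_table_name. The sketch below shows one way the location could be resolved. It assumes the layout data_dir/schema/basename.parquet, which is inferred from the description of archive_dir and is not confirmed by this page; the helper resolve_output_path is hypothetical.

```python
import os
from pathlib import Path


def resolve_output_path(table_name, schema, data_dir=None, alt_table_name=None):
    """Hypothetical helper: guess where the exported Parquet file lands."""
    # Fall back to the DATA_DIR environment variable, then to the
    # current working directory, as the data_dir description states.
    root = Path(data_dir or os.environ.get("DATA_DIR") or os.getcwd())

    # alt_table_name overrides the output basename when supplied.
    basename = alt_table_name or table_name

    # Assumed layout: <data_dir>/<schema>/<basename>.parquet
    return root / schema / f"{basename}.parquet"


print(resolve_output_path("dsi", "crsp", data_dir="/data"))
print(resolve_output_path("dsi", "crsp", data_dir="/data", alt_table_name="dsi_small"))
```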

Examples

Export a table using the default DuckDB-backed path:

>>> from db2pq import db_to_pq
>>> db_to_pq("dsi", "crsp")

Rename a column and apply an output type override:

>>> db_to_pq(
...     "company",
...     "public",
...     rename={"conm": "company_name"},
...     col_types={"company_name": "string"},
... )
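
The interaction between keep, drop, rename, and col_types can be illustrated without a database. The following is a rough sketch of the documented ordering: drop is applied first, then keep, then rename, and col_types keys refer to the output names. The helper select_columns and the use of re.search for matching are assumptions made for illustration; the library may match patterns differently.

```python
import re


def _as_list(patterns):
    # Accept None, a single pattern string, or an iterable of patterns.
    if patterns is None:
        return []
    if isinstance(patterns, str):
        return [patterns]
    return list(patterns)


def select_columns(columns, keep=None, drop=None, rename=None):
    """Hypothetical helper: map kept source columns to output names."""
    selected = list(columns)

    # Step 1: remove columns matching any drop pattern.
    drop_pats = _as_list(drop)
    if drop_pats:
        selected = [c for c in selected
                    if not any(re.search(p, c) for p in drop_pats)]

    # Step 2: of the remainder, retain columns matching any keep pattern.
    keep_pats = _as_list(keep)
    if keep_pats:
        selected = [c for c in selected
                    if any(re.search(p, c) for p in keep_pats)]

    # Step 3: apply renames; keys are source names, values output names.
    rename = rename or {}
    return {c: rename.get(c, c) for c in selected}


source_columns = ["gvkey", "conm", "fyear", "fyr"]
mapping = select_columns(
    source_columns,
    keep=["^gvkey$", "^conm$", "^fy"],
    drop="^fyr$",
    rename={"conm": "company_name"},
)
print(mapping)

# col_types should be keyed by the output names, not the source names.
col_types = {"company_name": "string"}
print(set(col_types) <= set(mapping.values()))
```

Here fyr matches the keep pattern ^fy but is still excluded, because drop runs before keep.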