Create a lazy tbl backed by DuckDB by scanning one or more FFIEC
Parquet files. Files may be selected either by FFIEC schedule name
or by an explicit Parquet filename or glob pattern. No data is read
eagerly; evaluation occurs only when the result is collected.
Usage
ffiec_scan_pqs(
  conn,
  schedule = NULL,
  pq_file = NULL,
  data_dir = NULL,
  schema = "ffiec",
  prefix = "",
  union_by_name = TRUE,
  keep_filename = FALSE
)
Arguments
- conn
A valid DuckDB connection.
- schedule
Optional character scalar giving the FFIEC schedule name (e.g. "rc", "rci", "rcb"). All matching Parquet files of the form {prefix}{schedule}_YYYYMMDD.parquet are scanned.
- pq_file
Optional character scalar giving a Parquet filename or glob pattern. May be a base name or relative path within the resolved Parquet directory, or a fully qualified path when data_dir is not supplied.
- data_dir
Optional parent directory for Parquet files. If NULL, the environment variable DATA_DIR is used.
- schema
Schema name used to resolve the Parquet directory (default "ffiec"). Set to NULL to treat data_dir (or DATA_DIR) as the final Parquet directory.
- prefix
Optional filename prefix used when the Parquet files were created (default "").
- union_by_name
Logical; whether to union Parquet files by column name when scanning multiple files (passed to DuckDB's read_parquet()). Default is TRUE.
- keep_filename
Logical; whether to keep the original Parquet file name (passed to DuckDB's read_parquet()). Default is FALSE.
Details
This function is intended for use with Parquet files produced by ffiec_process(). It validates that the requested files exist on disk before constructing the DuckDB query.
Exactly one of schedule or pq_file must be supplied.
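The "exactly one" rule can be pictured as an exclusive-or on whether each argument is NULL. The helper below is a hypothetical illustration in base R, not the package's own validation code; its name and return value are assumptions made for this sketch.

```r
# Hypothetical helper (not part of the package): TRUE when exactly one of
# the two file selectors is non-NULL.
exactly_one_selector <- function(schedule = NULL, pq_file = NULL) {
  xor(is.null(schedule), is.null(pq_file))
}

exactly_one_selector(schedule = "rc")               # TRUE: schedule only
exactly_one_selector(pq_file = "rc_*.parquet")      # TRUE: pq_file only
exactly_one_selector()                              # FALSE: neither supplied
exactly_one_selector("rc", "rc_20231231.parquet")   # FALSE: both supplied
```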
Directory resolution
By default, Parquet files are read from a schema subdirectory:
file.path(data_dir, schema). If data_dir is NULL,
DATA_DIR is used as the parent directory.
If schema = NULL, no schema subdirectory is appended; in this case,
data_dir (or DATA_DIR) is treated as the final directory
containing Parquet files.
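The resolution rules above can be sketched in a few lines of base R. This is an illustration of the documented behavior only; resolve_pq_dir() is a hypothetical name, and returning NULL when no parent directory is available is an assumption made for the sketch.

```r
# Hypothetical helper mirroring the documented rules; not the package's code.
resolve_pq_dir <- function(data_dir = NULL, schema = "ffiec") {
  # Fall back to the DATA_DIR environment variable when data_dir is NULL.
  if (is.null(data_dir)) {
    data_dir <- Sys.getenv("DATA_DIR", unset = NA_character_)
  }
  # Assumption: signal "no parent directory available" by returning NULL.
  if (is.na(data_dir) || !nzchar(data_dir)) {
    return(NULL)
  }
  # schema = NULL means data_dir is already the final Parquet directory.
  if (is.null(schema)) data_dir else file.path(data_dir, schema)
}

resolve_pq_dir("/data")                  # "/data/ffiec"
resolve_pq_dir("/data", schema = NULL)   # "/data"
```

With DATA_DIR set to "/env" and no data_dir argument, the same sketch would return "/env/ffiec".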
When schedule is used, files are selected with the glob pattern
{prefix}{schedule}_*.parquet in the resolved directory.
When pq_file is used, the value is treated as a filename or glob
pattern relative to the resolved directory. If neither data_dir nor
DATA_DIR is available, pq_file may instead be a fully qualified
path to an existing Parquet file.
A fast filesystem check is performed using Sys.glob(). If no
matching files are found, the function errors before issuing any DuckDB query.
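A minimal sketch of the schedule glob and the fail-fast check, using only base R, might look like the following. The helper name find_schedule_files() and its error message are hypothetical, and the files created here are empty placeholders rather than real Parquet files.

```r
# Hypothetical helper: build the schedule glob and stop early if no file matches.
find_schedule_files <- function(pq_dir, schedule, prefix = "") {
  pattern <- file.path(pq_dir, paste0(prefix, schedule, "_*.parquet"))
  files <- Sys.glob(pattern)
  if (length(files) == 0L) {
    stop("No Parquet files match: ", pattern, call. = FALSE)
  }
  files
}

# Demonstration with empty placeholder files in a temporary directory.
pq_dir <- file.path(tempdir(), "ffiec")
dir.create(pq_dir, showWarnings = FALSE)
file.create(file.path(pq_dir, c("rc_20231231.parquet",
                                "rc_20240331.parquet",
                                "rci_20231231.parquet")))

# Matches the two rc_ files only; rci_20231231.parquet is not selected.
basename(find_schedule_files(pq_dir, "rc"))
```

The underscore that follows the schedule name in the pattern is what keeps a request for "rc" from also picking up "rci" files.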