Parquet utility functions

This page collects examples for the Parquet utility helpers pq_data_dir(), pq_last_modified(), pq_archive(), pq_restore(), and pq_remove(), and shows how they fit together in the practical workflow of maintaining a local Parquet repository.

When to use these helpers

These functions help when you already have a Parquet repository and want to:

inspect what is in it
check which vintage of a table you currently have
archive a current active file before replacing it
restore an archived vintage
remove files

Related articles and reference pages are listed at the end of this page.

Setup

Load the packages used in the examples:

library(db2pq)
library(dplyr)

Set DATA_DIR to the Parquet repository you want to inspect:

Sys.setenv(DATA_DIR = "~/Dropbox/pq_data")

Use pq_data_dir() when you want to confirm which repository db2pq will use by default:

pq_data_dir(prompt = FALSE)
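
Independently of db2pq, it can be useful to confirm that DATA_DIR is set and resolves to an existing directory before running the examples. The following is a minimal base R sketch; check_data_dir() is a hypothetical helper written for this page, not a db2pq function.

```r
# Hypothetical helper (not part of db2pq): report whether an environment
# variable is set and names an existing directory.
check_data_dir <- function(var = "DATA_DIR") {
  path <- Sys.getenv(var, unset = "")
  if (!nzchar(path)) {
    return(list(set = FALSE, path = NA_character_, exists = FALSE))
  }
  path <- path.expand(path)  # expand a leading "~" to the home directory
  list(set = TRUE, path = path, exists = dir.exists(path))
}

str(check_data_dir())
```

If exists is FALSE, the variable probably points at a path that has not been created or synced yet.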

The output shown below was rendered against a local Parquet repository. The examples that move files run only during a local pkgdown render, and only when the repository contains comp.company and crsp.dsi.

Inspect a Schema Directory

Use pq_last_modified() without table_name to summarize the active Parquet files in a schema:

pq_last_modified(schema = "comp") |>
  select(file_name, table, last_mod) |>
  head(10)
#> # A tibble: 7 × 3
#>   file_name  table      last_mod           
#>   <chr>      <chr>      <dttm>             
#> 1 aco_pnfnda aco_pnfnda 2026-06-02 06:00:00
#> 2 company    company    2026-06-02 06:00:00
#> 3 funda      funda      2026-06-02 06:00:00
#> 4 funda_fncd funda_fncd 2026-06-02 06:00:00
#> 5 fundq      fundq      2026-06-02 06:00:00
#> 6 idx_daily  idx_daily  2026-06-02 06:00:00
#> 7 r_auditors r_auditors 2026-06-02 06:00:00

To inspect archived files instead, set archive = TRUE:

pq_last_modified(schema = "comp", archive = TRUE) |>
  select(file_name, table, last_mod) |>
  head(10)
#> # A tibble: 1 × 3
#>   file_name              table last_mod           
#>   <chr>                  <chr> <dttm>             
#> 1 funda_20260406T060000Z funda 2026-04-06 06:00:00
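
In the output above, the archived file name looks like the table name followed by a compact UTC timestamp. Assuming that pattern holds (it is inferred from this output, not a documented db2pq contract), a sketch such as the following splits a name into its table and timestamp parts. parse_archive_name() is a hypothetical helper.

```r
# Assumption: archived names look like "<table>_<YYYYMMDD>T<HHMMSS>Z".
# Names that do not match (for example active files) yield NA.
parse_archive_name <- function(file_name) {
  pattern <- "^(.*)_([0-9]{8}T[0-9]{6}Z)$"
  matched <- grepl(pattern, file_name)
  stamp <- ifelse(matched, sub(pattern, "\\2", file_name), NA_character_)
  data.frame(
    file_name = file_name,
    table = ifelse(matched, sub(pattern, "\\1", file_name), NA_character_),
    archived_at = as.POSIXct(stamp, format = "%Y%m%dT%H%M%SZ", tz = "UTC"),
    stringsAsFactors = FALSE
  )
}

parse_archive_name(c("funda_20260406T060000Z", "funda"))
```

The greedy first group keeps underscores that belong to the table name, so a name such as funda_fncd followed by a timestamp would still be split at the last underscore.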

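If your project uses a repository other than the default DATA_DIR, pass data_dir explicitly. The example below passes the current DATA_DIR value only for illustration; substitute your own path: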

pq_last_modified(schema = "crsp", data_dir = Sys.getenv("DATA_DIR")) |>
  select(file_name, table, last_mod) |>
  head(5)
#> # A tibble: 5 × 3
#>   file_name      table          last_mod           
#>   <chr>          <chr>          <dttm>             
#> 1 ccmxpf_lnkhist ccmxpf_lnkhist 2026-02-06 07:00:00
#> 2 comphist       comphist       2026-02-06 07:00:00
#> 3 dsedelist      dsedelist      2025-02-08 07:00:00
#> 4 dsedist        dsedist        2025-02-08 07:00:00
#> 5 dseexchdates   dseexchdates   NA
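
A listing like this can be filtered to find tables that are older than some cutoff or that have no last_mod value at all. The sketch below uses base R on rows copied from the output above; the UTC time zone and the cutoff date are assumptions made for illustration. In practice, listing would be the result of the pq_last_modified() call.

```r
# Rows copied from the output above; the time zone is an assumption.
listing <- data.frame(
  table = c("ccmxpf_lnkhist", "comphist", "dsedelist", "dsedist", "dseexchdates"),
  last_mod = as.POSIXct(
    c("2026-02-06 07:00:00", "2026-02-06 07:00:00", "2025-02-08 07:00:00",
      "2025-02-08 07:00:00", NA),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

cutoff <- as.POSIXct("2026-01-01", tz = "UTC")  # hypothetical cutoff

# Keep rows with missing metadata or a vintage older than the cutoff.
needs_review <- listing[is.na(listing$last_mod) | listing$last_mod < cutoff, ]
needs_review$table
```

The is.na() test comes first so that rows with missing metadata are flagged rather than silently dropped by the comparison.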

Check the Current Active Vintage

With table_name and schema, pq_last_modified() returns the raw embedded last_modified metadata string for the active file:

pq_last_modified(table_name = "dsi", schema = "crsp")
#> [1] "Stock - Market Indexes Daily NYSE/AMEX/NASDAQ/ARCA (Updated 2025-02-08)"

This is often the fastest way to confirm what vintage a local Parquet file represents before starting analysis.
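
For programmatic checks, the date can be pulled out of that string. The sketch below assumes the string contains "(Updated YYYY-MM-DD)", as in the examples on this page; other tables may format their metadata differently. extract_updated_date() is a hypothetical helper, not a db2pq function.

```r
# Assumption: the metadata string contains "(Updated YYYY-MM-DD)".
# Strings without that pattern yield NA.
extract_updated_date <- function(x) {
  pattern <- "\\(Updated ([0-9]{4}-[0-9]{2}-[0-9]{2})\\)"
  hits <- regmatches(x, regexec(pattern, x))
  dates <- vapply(
    hits,
    function(m) if (length(m) == 2) m[2] else NA_character_,
    character(1)
  )
  as.Date(dates)
}

extract_updated_date(
  "Stock - Market Indexes Daily NYSE/AMEX/NASDAQ/ARCA (Updated 2025-02-08)"
)
```

The resulting Date can then be compared with a project cutoff, for example with extract_updated_date(x) >= as.Date("2025-01-01").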

Inspect Archived Vintages

If you archive replaced files, you can ask for the archived versions of a table:

pq_last_modified(table_name = "company", schema = "comp", archive = TRUE) |>
  select(file_name, table, last_mod, last_mod_str) |>
  tail(10)
#> # A tibble: 0 × 4
#> # ℹ 4 variables: file_name <chr>, table <chr>, last_mod <dttm>,
#> #   last_mod_str <chr>

This returns a tibble with one row per archived vintage of the requested table. It is empty here because the repository had no archived comp.company files when the page was rendered. To inspect archived files for a whole schema, use schema without table_name:

pq_last_modified(schema = "comp", archive = TRUE) |>
  select(file_name, table, last_mod) |>
  head(10)
#> # A tibble: 1 × 3
#>   file_name              table last_mod           
#>   <chr>                  <chr> <dttm>             
#> 1 funda_20260406T060000Z funda 2026-04-06 06:00:00

Archive the Currently Active File

You can archive a file manually even outside an update workflow:

pq_archive(table_name = "company", schema = "comp")

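Or archive an exact file path. Here company_file is assumed to hold the full path to the active comp.company Parquet file; its definition is not shown on this page. During the live render, this moves the current comp.company file into its archive directory: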

company_archive <- pq_archive(file_name = company_file)
basename(company_archive)
#> [1] "company_20260602T060000Z.parquet"

This is useful when you want to preserve the current active vintage before running an experimental refresh or downstream transformation.

Restore an Archived Vintage

To promote an archived file back into the active schema directory:

restored_company <- pq_restore(
  tools::file_path_sans_ext(basename(company_archive)),
  "comp",
  archive = FALSE
)
basename(restored_company)
#> [1] "company.parquet"

The archived basename may be given with or without the .parquet suffix. By default (archive = TRUE), pq_restore() first archives any active file that already exists at the destination; the call above passes archive = FALSE instead. To confirm which vintage is now active:

pq_last_modified(table_name = "company", schema = "comp")
#> [1] "Company (Updated 2026-06-02)"

Remove a File Explicitly

Use pq_remove() when you want to delete an active or archived file rather than archive it. The removal examples below are shown but not run during the documentation build. For an active file:

pq_remove(table_name = "dsi", schema = "crsp")

To remove an archived file:

pq_remove(
  table_name = "company_20260407T060000Z",
  schema = "comp",
  archive = TRUE
)

Or remove a file by exact path:

pq_remove(file_name = company_archive)
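
Because removal by exact path cannot be undone, a small guard can check that the path really lies inside the intended repository before it is handed to pq_remove(). The following base R sketch is one way to do that; is_inside_repo() is a hypothetical helper, and the demo uses a temporary directory rather than a real repository.

```r
# Hypothetical guard (not part of db2pq): TRUE only if `path` exists and is
# located inside the directory `root`.
is_inside_repo <- function(path, root) {
  if (!file.exists(path) || !dir.exists(root)) return(FALSE)
  path <- normalizePath(path, winslash = "/")
  root <- normalizePath(root, winslash = "/")
  startsWith(path, paste0(sub("/+$", "", root), "/"))
}

# Demo in a throwaway directory; the file is an empty placeholder.
repo <- file.path(tempdir(), "pq_demo")
dir.create(file.path(repo, "comp"), recursive = TRUE, showWarnings = FALSE)
demo_file <- file.path(repo, "comp", "company.parquet")
file.create(demo_file)

is_inside_repo(demo_file, repo)  # a file under the demo repository
is_inside_repo(tempdir(), repo)  # a path outside the demo repository
```

A call such as if (is_inside_repo(p, repo)) pq_remove(file_name = p) would then skip paths that resolve outside the repository.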

Data management article
WRDS to Parquet article
PostgreSQL to Parquet article
Parquet file utility reference pages