This page collects examples for the Parquet utility helpers pq_data_dir(), pq_last_modified(), pq_archive(), pq_restore(), and pq_remove(), and shows how they fit together in the practical workflow of maintaining a local Parquet repository.

When to use these helpers

These functions help when you already have a Parquet repository and want to:

  • inspect what is in it
  • check which vintage of a table you currently have
  • archive a current active file before replacing it
  • restore an archived vintage
  • remove files

Related articles and reference pages are listed at the end of this page.

Setup

Load the packages used in the examples (db2pq for the helpers, dplyr for select()):

library(db2pq)
library(dplyr)

Set DATA_DIR to the Parquet repository you want to inspect:

Sys.setenv(DATA_DIR = "~/Dropbox/pq_data")

Use pq_data_dir() when you want to confirm which repository db2pq will use by default:

pq_data_dir(prompt = FALSE)

The examples below are rendered from a local Parquet repository. The examples that move files run only during a local pkgdown render when the repository contains comp.company and crsp.dsi.

Inspect a Schema Directory

Use pq_last_modified() without table_name to summarize the active Parquet files in a schema:

pq_last_modified(schema = "comp") |>
  select(file_name, table, last_mod) |>
  head(10)
#> # A tibble: 10 × 3
#>    file_name     table         last_mod           
#>    <chr>         <chr>         <dttm>             
#>  1 aco_pnfnda    aco_pnfnda    2026-05-25 06:00:00
#>  2 adsprate      adsprate      2026-04-09 06:00:00
#>  3 co_adesind    co_adesind    2026-04-09 06:00:00
#>  4 co_afnd2      co_afnd2      2026-04-09 06:00:00
#>  5 co_filedate   co_filedate   2026-04-09 06:00:00
#>  6 co_hgic       co_hgic       2026-04-09 06:00:00
#>  7 co_ifndq      co_ifndq      2026-04-09 06:00:00
#>  8 company       company       2026-05-26 06:00:00
#>  9 company_names company_names 2026-05-26 06:00:00
#> 10 funda         funda         2026-05-25 06:00:00

If you want to inspect archived files instead:

pq_last_modified(schema = "comp", archive = TRUE) |>
  select(file_name, table, last_mod) |>
  head(10)
#> # A tibble: 10 × 3
#>    file_name                   table      last_mod           
#>    <chr>                       <chr>      <dttm>             
#>  1 aco_pnfnda_20260330T060000Z aco_pnfnda 2026-03-30 06:00:00
#>  2 company_20260105T070000Z    company    2026-01-05 07:00:00
#>  3 company_20260107T070000Z    company    2026-01-07 07:00:00
#>  4 company_20260209T070000Z    company    2026-02-09 07:00:00
#>  5 company_20260218T070000Z    company    2026-02-18 07:00:00
#>  6 company_20260224T070000Z    company    2026-02-24 07:00:00
#>  7 company_20260225T000000Z    company    2026-02-24 07:00:00
#>  8 company_20260225T070000Z    company    2026-02-25 07:00:00
#>  9 company_20260226T070000Z    company    2026-02-26 07:00:00
#> 10 company_20260303T070000Z    company    2026-03-03 07:00:00

If your project uses a repository outside the default DATA_DIR, pass data_dir explicitly:

pq_last_modified(schema = "crsp", data_dir = Sys.getenv("DATA_DIR")) |>
  select(file_name, table, last_mod) |>
  head(5)
#> # A tibble: 5 × 3
#>   file_name        table            last_mod           
#>   <chr>            <chr>            <dttm>             
#> 1 ccmxpf_linktable ccmxpf_linktable 2026-02-06 07:00:00
#> 2 ccmxpf_lnkhist   ccmxpf_lnkhist   2026-02-06 07:00:00
#> 3 ccmxpf_lnkused   ccmxpf_lnkused   2026-02-06 07:00:00
#> 4 comphist         comphist         2026-02-06 07:00:00
#> 5 dse              dse              2025-02-08 07:00:00

Check the Current Active Vintage

With table_name and schema, pq_last_modified() returns the raw embedded last_modified metadata string for the active file:

pq_last_modified(table_name = "dsi", schema = "crsp")
#> [1] "Stock - Market Indexes Daily NYSE/AMEX/NASDAQ/ARCA (Updated 2025-02-08)"

This is often the fastest way to confirm what vintage a local Parquet file represents before starting analysis.
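
When the vintage is needed as a date, for example to compare it against a cutoff, the string can be parsed with base R. The sketch below assumes the metadata string contains a date in the form YYYY-MM-DD, as in the examples on this page; other tables may format it differently. The string is copied from the output above rather than read from a file.

```r
# Metadata string copied from the example output above.
last_mod_str <- "Stock - Market Indexes Daily NYSE/AMEX/NASDAQ/ARCA (Updated 2025-02-08)"

# Assumption: the string embeds a date as YYYY-MM-DD.
# Extract the first such pattern and convert it to a Date.
date_txt <- regmatches(
  last_mod_str,
  regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", last_mod_str)
)
vintage_date <- as.Date(date_txt)
vintage_date
#> [1] "2025-02-08"
```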

Inspect Archived Vintages

If you archive replaced files, you can ask for the archived versions of a table:

pq_last_modified(table_name = "company", schema = "comp", archive = TRUE) |>
  select(file_name, table, last_mod, last_mod_str) |>
  tail(10)
#> # A tibble: 10 × 4
#>    file_name                table   last_mod            last_mod_str            
#>    <chr>                    <chr>   <dttm>              <chr>                   
#>  1 company_20260225T000000Z company 2026-02-24 07:00:00 Company (Updated 2026-0…
#>  2 company_20260225T070000Z company 2026-02-25 07:00:00 Company (Updated 2026-0…
#>  3 company_20260226T070000Z company 2026-02-26 07:00:00 Company (Updated 2026-0…
#>  4 company_20260303T070000Z company 2026-03-03 07:00:00 Company (Updated 2026-0…
#>  5 company_20260315T060000Z company 2026-03-15 06:00:00 Company (Updated 2026-0…
#>  6 company_20260322T060000Z company 2026-03-22 06:00:00 Company (Updated 2026-0…
#>  7 company_20260323T060000Z company 2026-03-23 06:00:00 Company (Updated 2026-0…
#>  8 company_20260331T060000Z company 2026-03-31 06:00:00 Company (Updated 2026-0…
#>  9 company_20260402T060000Z company 2026-04-02 06:00:00 Company (Updated 2026-0…
#> 10 company_20260407T060000Z company 2026-04-07 06:00:00 Company (Updated 2026-0…

This returns a tibble with one row per archived vintage of the requested table. To inspect archived files for a whole schema, use schema without table_name:

pq_last_modified(schema = "comp", archive = TRUE) |>
  select(file_name, table, last_mod) |>
  head(10)
#> # A tibble: 10 × 3
#>    file_name                   table      last_mod           
#>    <chr>                       <chr>      <dttm>             
#>  1 aco_pnfnda_20260330T060000Z aco_pnfnda 2026-03-30 06:00:00
#>  2 company_20260105T070000Z    company    2026-01-05 07:00:00
#>  3 company_20260107T070000Z    company    2026-01-07 07:00:00
#>  4 company_20260209T070000Z    company    2026-02-09 07:00:00
#>  5 company_20260218T070000Z    company    2026-02-18 07:00:00
#>  6 company_20260224T070000Z    company    2026-02-24 07:00:00
#>  7 company_20260225T000000Z    company    2026-02-24 07:00:00
#>  8 company_20260225T070000Z    company    2026-02-25 07:00:00
#>  9 company_20260226T070000Z    company    2026-02-26 07:00:00
#> 10 company_20260303T070000Z    company    2026-03-03 07:00:00
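
Archived file names in the listings above end in a timestamp suffix such as _20260330T060000Z. A minimal base R sketch for splitting such names into table and timestamp follows. It assumes the suffix always has the form YYYYMMDDTHHMMSSZ and denotes UTC, which matches the examples shown here but is an assumption about the naming scheme. The file names are copied from the output above.

```r
# Archived file names copied from the listing above.
archived <- c(
  "aco_pnfnda_20260330T060000Z",
  "company_20260105T070000Z",
  "company_20260407T060000Z"
)

# Assumption: names follow <table>_<YYYYMMDD>T<HHMMSS>Z.
pattern <- "^(.*)_([0-9]{8}T[0-9]{6}Z)$"
table_name <- sub(pattern, "\\1", archived)
stamp <- sub(pattern, "\\2", archived)

# Assumption: the trailing Z means the timestamp is in UTC.
archived_at <- as.POSIXct(stamp, format = "%Y%m%dT%H%M%SZ", tz = "UTC")

data.frame(file_name = archived, table = table_name, archived_at = archived_at)
```

The suffix need not equal last_mod: in the listing above, company_20260225T000000Z has a last_mod of 2026-02-24 07:00:00, so the last_mod column is the safer choice when the exact vintage matters.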

Archive the Currently Active File

You can archive a file manually even outside an update workflow:

pq_archive(table_name = "company", schema = "comp")

Or archive a file by its exact path, here held in company_file (assumed to be the full path to the active comp.company file). During the live render, this moves that file into its archive directory:

company_archive <- pq_archive(file_name = company_file)
basename(company_archive)
#> [1] "company_20260526T060000Z.parquet"

This is useful when you want to preserve the current active vintage before running an experimental refresh or downstream transformation.

Restore an Archived Vintage

To promote an archived file back into the active schema directory:

restored_company <- pq_restore(
  tools::file_path_sans_ext(basename(company_archive)),
  "comp",
  archive = FALSE
)
basename(restored_company)
#> [1] "company.parquet"

The archived basename may include or omit the .parquet suffix. By default (archive = TRUE), pq_restore() first archives any active file that already exists at the destination; the example above passes archive = FALSE to skip that step. You can then confirm which vintage is active:

pq_last_modified(table_name = "company", schema = "comp")
#> [1] "Company (Updated 2026-05-26)"
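
Because the archived basename may be given with or without the .parquet suffix, it can help to normalize names before comparing them. A small base R sketch, using the file name from the archive example above:

```r
# The same archived file, written with and without its extension.
candidates <- c(
  "company_20260526T060000Z.parquet",
  "company_20260526T060000Z"
)

# file_path_sans_ext() drops a trailing extension and leaves other names as-is.
stems <- tools::file_path_sans_ext(candidates)
stems
#> [1] "company_20260526T060000Z" "company_20260526T060000Z"

# Both spellings refer to the same archived vintage.
length(unique(stems)) == 1
#> [1] TRUE
```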

Remove a File Explicitly

Use pq_remove() when you want to delete an active or archived file rather than archive it. The removal examples below are shown but not run during the documentation build. For an active file:

pq_remove(table_name = "dsi", schema = "crsp")

To remove an archived file:

pq_remove(
  table_name = "company_20260407T060000Z",
  schema = "comp",
  archive = TRUE
)

Or remove a file by exact path:

pq_remove(file_name = company_archive)
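
Before deleting several archived files, it can be safer to build the list of candidates first and review it. The sketch below is illustrative only: the data frame stands in for the output of pq_last_modified(schema = "comp", archive = TRUE), with rows copied from the listings earlier on this page, the time zone is assumed to be UTC, and the cutoff date is hypothetical. The pq_remove() call is left commented out so that nothing is deleted.

```r
# Stand-in for the archived listing; rows copied from the output shown earlier.
# Assumption: last_mod values are in UTC.
archived <- data.frame(
  file_name = c(
    "company_20260105T070000Z",
    "company_20260303T070000Z",
    "company_20260407T060000Z"
  ),
  last_mod = as.POSIXct(
    c("2026-01-05 07:00:00", "2026-03-03 07:00:00", "2026-04-07 06:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: select vintages older than 1 March 2026.
cutoff <- as.POSIXct("2026-03-01", tz = "UTC")
to_remove <- archived$file_name[archived$last_mod < cutoff]
to_remove
#> [1] "company_20260105T070000Z"

# Review to_remove, then uncomment to delete (not run here):
# for (f in to_remove) pq_remove(table_name = f, schema = "comp", archive = TRUE)
```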

Related pages

  • Data management article
  • WRDS to Parquet article
  • PostgreSQL to Parquet article
  • Parquet file utility reference pages