vectra is an R-native columnar query engine for datasets larger than RAM.
Write dplyr-style pipelines against multi-GB files on a laptop. Data streams through a C11 pull-based engine one row group at a time, so peak memory stays bounded regardless of file size.
Point vectra at any file and query it with dplyr verbs. Nothing runs
until collect().
```r
library(vectra)

# CSV — lazy scan with type inference
tbl_csv("measurements.csv") |>
  filter(temperature > 30, year >= 2020) |>
  group_by(station) |>
  summarise(avg_temp = mean(temperature), n = n()) |>
  collect()

# GeoTIFF — climate rasters as tidy data
tbl_tiff("worldclim_bio1.tif") |>
  filter(band1 > 0) |>
  mutate(temp_c = band1 / 10) |>
  collect()

# Point extraction — sample raster values at coordinates, no terra needed
tiff_extract_points("worldclim_bio1.tif",
                    x = c(10.5, 11.2), y = c(47.1, 47.3))

# SQLite — zero-dependency, no DBI required
tbl_sqlite("survey.db", "responses") |>
  filter(year == 2025) |>
  left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
  collect()
```

For repeated queries, convert to vectra’s native .vtr
format for faster reads:
```r
write_vtr(big_df, "data.vtr", batch_size = 100000)

tbl("data.vtr") |>
  filter(x > 0, region == "EU") |>
  group_by(region) |>
  summarise(total = sum(value), n = n()) |>
  collect()
```

Append new data without rewriting the file, or do a key-based diff between two snapshots:
```r
# Append new rows as a new row group — existing data untouched
append_vtr(new_rows_df, "data.vtr")

# Logical diff: what was added or deleted between two snapshots?
d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added)  # rows present in new but not old
d$deleted         # key values present in old but not new
```

Fuzzy string matching runs inside the C engine, with no round-trip to R:
```r
tbl("taxa.vtr") |>
  filter(levenshtein(species, "Quercus robur") <= 2) |>
  mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
  arrange(desc(similarity)) |>
  collect()
```

Register a star schema to avoid flat-table column creep. Define the links once, then pull only what you need:
```r
s <- vtr_schema(
  fact    = tbl("observations.vtr"),
  species = link("sp_id", tbl("species.vtr")),
  site    = link("site_id", tbl("sites.vtr"))
)

# Pull columns from any dimension — joins are built automatically
lookup(s, count, species$name, site$habitat) |> collect()
#> species: all 500 keys matched
#> site: 3/500 unmatched keys (X1, X2, X3)
```

Use explain() to inspect the optimized plan:
```r
tbl("data.vtr") |>
  filter(x > 0) |>
  select(id, x) |>
  explain()
#> vectra execution plan
#>
#> ProjectNode [streaming]
#>   FilterNode [streaming]
#>     ScanNode [streaming, 2/5 cols (pruned), predicate pushdown, v3 stats]
#>
#> Output columns (2):
#>   id <int64>
#>   x  <double>
```

Querying large datasets in R usually means Arrow (requires compiled binaries matching your platform), DuckDB (links a 30 MB bundled library), or Spark (requires a JVM and cluster configuration).
vectra is a self-contained C11 engine compiled as a standard R extension. No external libraries, no JVM, no runtime configuration. It provides:

- Window functions: row_number(), rank(), dense_rank(), lag(), lead(), cumsum(), cummean(), cummin(), cummax()
- In-engine strings: nchar(), substr(), grepl() evaluated in the engine without round-tripping to R
- Uniform sources: .vtr, CSV, SQLite, GeoTIFF — all produce the same lazy query nodes
- Compact rasters: int16/int32/uint8/uint16/float32 with embedded GDAL metadata for 5-10x smaller files

| Category | Verbs |
|---|---|
| Transform | filter(), select(), mutate(), transmute(), rename(), relocate() |
| Aggregate | group_by(), summarise() (n, sum, mean, min, max, sd, var, first, last, any, all, median, n_distinct), count(), tally(), distinct() |
| Join | left_join(), inner_join(), right_join(), full_join(), semi_join(), anti_join(), cross_join(), lookup() |
| Order | arrange(), slice_head(), slice_tail(), slice_min(), slice_max(), slice() |
| Window | row_number(), rank(), dense_rank(), lag(), lead(), cumsum(), cummean(), cummin(), cummax(), ntile(), percent_rank(), cume_dist() |
| Date/Time | year(), month(), day(), hour(), minute(), second(), as.Date() (in filter()/mutate()) |
| String | nchar(), substr(), grepl(), tolower(), toupper(), trimws(), paste0(), gsub(), sub(), startsWith(), endsWith() (in filter()/mutate()) |
| String similarity | levenshtein(), levenshtein_norm(), dl_dist(), dl_dist_norm(), jaro_winkler() — fuzzy matching in filter()/mutate(), with optional max_dist early termination |
| Expression | abs(), sqrt(), log(), exp(), floor(), ceiling(), round(), log2(), log10(), sign(), trunc(), if_else(), between(), %in%, as.numeric(), pmin(), pmax(), resolve(), propagate() (in filter()/mutate()) |
| Combine | bind_rows(), bind_cols(), across() |
| Schema | vtr_schema(), link(), lookup() — star schema definition and dimension lookup with match reporting |
| I/O | tbl(), tbl_csv(), tbl_sqlite(), tbl_tiff(), write_vtr(), write_csv(), write_sqlite(), write_tiff(), tiff_extract_points(), tiff_metadata(), append_vtr(), delete_vtr(), diff_vtr() |
| Inspect | explain(), glimpse(), print(), pull() |
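These verbs compose into a single lazy pipeline. As a sketch only — the file and column names below are hypothetical, not from the package docs — a grouped window computation might look like:

```r
library(vectra)

# Hypothetical file/columns: per-station change versus the previous
# reading, with the date filter and lag() both evaluated in the engine.
tbl("measurements.vtr") |>
  filter(year(date) >= 2020) |>
  group_by(station) |>
  mutate(delta = temperature - lag(temperature)) |>
  collect()
```

Nothing is read from disk until the final collect(); the filter and the window computation stream one row group at a time.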
Full tidyselect support in select(),
rename(), relocate(), and
across(): starts_with(),
ends_with(), contains(),
matches(), where(), everything(),
all_of(), any_of().
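A minimal tidyselect sketch (the file name and bio* columns are assumed for illustration, and this assumes across() accepts purrr-style lambdas as in dplyr):

```r
# Select an id column plus every bio* column, then apply the same
# in-engine transformation across all of them at once.
tbl("worldclim.vtr") |>
  select(id, starts_with("bio")) |>
  mutate(across(starts_with("bio"), ~ .x / 10)) |>
  collect()
```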
```r
# CRAN
install.packages("vectra")

# Development version
pak::pak("gcol33/vectra")
```

“Software is like sex: it’s better when it’s free.” – Linus Torvalds
If this package saved you some time, buying me a coffee is a nice way to say thanks.
MIT (see the LICENSE.md file)
```bibtex
@software{vectra,
  author = {Colling, Gilles},
  title  = {vectra: Columnar Query Engine for Larger-Than-RAM Data},
  year   = {2026},
  url    = {https://github.com/gcol33/vectra}
}
```