A lightweight toolkit for text retrieval and NLP with a consistent API built around four stages: Fetch, Read, Process, and Search. Functions cover the full pipeline from web data to text processing and indexing. Search strategies include regex, BM25, cosine similarity, and dictionary matching. Function names follow a verb_noun pattern, everything is pipe-friendly, there are no heavy dependencies, and outputs are plain data frames.
From CRAN:
install.packages("textpress")
Development version:
remotes::install_github("jaytimm/textpress")

Fetch (fetch_*)
These functions talk to the outside world to find where information lives. They return URLs or metadata, not full text.
- fetch_urls(): general web. Queries search engines for a list of relevant links.
- fetch_wiki_urls(): Wikipedia. Finds specific page titles/URLs.
- fetch_wiki_refs(): Wikipedia. Extracts the external “References” URLs from a page.

Read (read_*)
Once you have locations, bring the data into R.
- read_urls(): takes a character vector of URLs and returns a data frame of cleaned text/markdown.

Process (nlp_*)
Prepare raw text for analysis or indexing. Designed to be used with the pipe |>.
- nlp_split_paragraphs(): break large documents into structural blocks.
- nlp_split_sentences(): refine blocks into individual sentences.
- nlp_tokenize_text(): normalize text into a clean token stream.
- nlp_index_tokens(): build a weighted BM25 index for ranked search.
- nlp_roll_chunks(): roll units (e.g. sentences) into fixed-size chunks with optional context (RAG-style).

Search (search_*)
Four ways to query your data. Each function is subject-first: the first argument is the data (corpus, index, or embeddings) and the second is the query/needle, which keeps the calls pipe-friendly.
| Function | Primary input (needle) | Use case |
|---|---|---|
| search_regex(corpus, query, …) | Character (pattern) | Specific strings/patterns, KWIC. |
| search_dict(corpus, terms, …) | Character (vector of terms) | Exact phrases/MWEs; no partial-match risk. |
| search_index(index, query, …) | Character (keywords) | BM25 ranked retrieval. |
| search_vector(embeddings, query, …) | Numeric (vector/matrix) | Semantic neighbors. |
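To make the BM25 row more concrete, the sketch below scores a toy corpus in base R with the standard Okapi BM25 formula (using the smoothed, always-positive IDF variant). It only illustrates the ranking idea and is not the code behind nlp_index_tokens() or search_index(); the function name bm25_scores(), the simple tokenizer, and the k1 and b defaults are assumptions made for this example.

```r
# Toy BM25 scorer in base R (illustrative only; not textpress internals).
bm25_scores <- function(docs, query, k1 = 1.2, b = 0.75) {
  # Lowercase, split on anything that is not a letter or digit, drop empties.
  tokenize <- function(x) {
    lapply(strsplit(tolower(x), "[^a-z0-9]+"), function(t) t[nzchar(t)])
  }
  doc_toks <- tokenize(docs)
  q_terms  <- unique(tokenize(query)[[1]])

  n_docs  <- length(docs)
  doc_len <- lengths(doc_toks)   # tokens per document
  avg_len <- mean(doc_len)       # average document length

  scores <- numeric(n_docs)
  for (term in q_terms) {
    tf  <- vapply(doc_toks, function(t) sum(t == term), numeric(1))
    df  <- sum(tf > 0)           # number of documents containing the term
    idf <- log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Term-frequency saturation (k1) and length normalization (b).
    scores <- scores + idf * tf * (k1 + 1) /
      (tf + k1 * (1 - b + b * doc_len / avg_len))
  }
  scores
}

docs <- c(
  "parallel computing in R",
  "cooking pasta at home",
  "distributed parallel parallel computing"
)
scores <- bm25_scores(docs, "parallel computing")

# Rank documents from most to least relevant.
ranked <- data.frame(doc_id = seq_along(docs), score = scores)
ranked <- ranked[order(-ranked$score), ]
```

Documents that contain none of the query terms score zero, and repeated matches raise the score with diminishing returns, which is why keyword-heavy technical text tends to rank well under BM25.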
search_dict is the exact n-gram matcher: pass a vector of terms and get back a table of where they appeared. It is optimized for high-speed extraction of thousands of specific terms (MWEs) across large corpora. Add categories later with a left_join on the matching key column.
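As a sketch of that last step, the example below attaches categories to a table of dictionary hits. The hits table is hand-built to stand in for search_dict() output, and its column names (doc_id, term) and the lookup table are hypothetical; check the actual output of search_dict() for the real join key. Base R merge() with all.x = TRUE is used here as a dependency-free equivalent of dplyr::left_join().

```r
# Hypothetical stand-in for a search_dict() result: one row per match.
hits <- data.frame(
  doc_id = c(1, 1, 2, 3),
  term   = c("OpenMP", "Socket", "OpenMP", "MPI"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table mapping terms to categories.
categories <- data.frame(
  term     = c("OpenMP", "Socket"),
  category = c("shared memory", "networking"),
  stringsAsFactors = FALSE
)

# Left join: keep every hit, add a category where one exists.
tagged <- merge(hits, categories, by = "term", all.x = TRUE)
tagged <- tagged[order(tagged$doc_id, tagged$term), ]

# "MPI" has no entry in the lookup table, so its category is NA.
print(tagged)
```

Keeping the categories in a separate table means the term list passed to the matcher stays a plain character vector, and the categorization can change without re-running the search.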
Quick start (all four stages):
library(textpress)
links <- fetch_urls("R high performance computing")
corpus <- read_urls(links$url)
corpus$doc_id <- seq_len(nrow(corpus))
toks <- nlp_tokenize_text(corpus, by = "doc_id", include_spans = FALSE)
index <- nlp_index_tokens(toks)
search_regex(corpus, "parallel|future", by = "doc_id")
search_dict(corpus, terms = c("OpenMP", "Socket"), by = "doc_id")
search_index(index, "distributed computing")
# search_vector(embeddings, query) # use util_fetch_embeddings() for embeddings

While textpress is a general-purpose text toolkit, its design fits LLM-based workflows (e.g. RAG) and autonomous agents.
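The RAG-style chunking that nlp_roll_chunks() is described as performing can be pictured with a small base R stand-in. The helper roll_chunks() and its chunk_size and context arguments are hypothetical and do not reflect the real function's signature; the sketch only shows fixed-size grouping of sentences, with neighbouring sentences attached as context.

```r
# Hypothetical helper: group sentences into fixed-size chunks and attach
# `context` sentences on either side of each chunk.
roll_chunks <- function(sentences, chunk_size = 2, context = 1) {
  stopifnot(length(sentences) > 0, chunk_size >= 1, context >= 0)
  n      <- length(sentences)
  starts <- seq(1, n, by = chunk_size)

  rows <- lapply(seq_along(starts), function(i) {
    s  <- starts[i]
    e  <- min(s + chunk_size - 1, n)   # last sentence in the chunk
    cs <- max(1, s - context)          # first sentence of the context window
    ce <- min(n, e + context)          # last sentence of the context window
    data.frame(
      chunk_id           = i,
      chunk              = paste(sentences[s:e], collapse = " "),
      chunk_plus_context = paste(sentences[cs:ce], collapse = " "),
      stringsAsFactors   = FALSE
    )
  })
  do.call(rbind, rows)
}

sents  <- c("A.", "B.", "C.", "D.", "E.")
chunks <- roll_chunks(sents, chunk_size = 2, context = 1)
print(chunks)
```

A common pattern is to search against the narrow chunk column and pass the wider context column to the model, so retrieval stays precise while the prompt keeps enough surrounding text.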
Lightweight RAG (retrieval-augmented generation)
You can build a local-first RAG pipeline without a heavy vector DB:
- Use search_index() (BM25) to pull relevant chunks by keyword; this is often more accurate for technical data than semantic search alone.
- Use nlp_split_paragraphs() and related functions so you send only relevant snippets to an LLM, cutting token cost and improving answers.
- Use search_dict() to extract known entities or IDs before calling an LLM, so those core facts come from the text itself and are less likely to be hallucinated by the model.

Tool-use for autonomous agents
If you are building an agent with an R framework, textpress functions work well as tools: flat naming and predictable data-frame outputs make them easy for a model to call.
- fetch_urls(): agent “Search” tool.
- read_urls(): agent “Browse” tool.
- search_regex(): agent “Find in page” tool.
- search_dict(): agent “Entity extraction” tool (deterministic; reduces hallucination).

MIT © Jason Timm, MA, PhD
If you use this package in your research, please cite:
citation("textpress")
Report bugs or request features at https://github.com/jaytimm/textpress/issues
Contributions welcome! Please open an issue or submit a pull request.