| Title: | Scientific Code Lint for R Analyses |
| Version: | 0.1.1 |
| Description: | Static analysis for R scientific data analysis code. Flags patterns that often correspond to hidden scientific commitments – silent error swallowing, smuggled defaults, label leakage in selection-stage code, magic-eps floors in 'BIC' formulas, and shadow-overwrite of sourced helpers. Designed for agentic coding workflows; high recall over precision; structured 'ANALYSIS_OK' waivers as the audit trail. |
| License: | MIT + file LICENSE |
| URL: | https://github.com/arjunrajlaboratory/scilintr |
| BugReports: | https://github.com/arjunrajlaboratory/scilintr/issues |
| Encoding: | UTF-8 |
| Depends: | R (≥ 4.1) |
| Imports: | lintr (≥ 3.0.0), xml2, xmlparsedata, yaml |
| Suggests: | testthat (≥ 3.0.0), roxygen2 |
| Config/testthat/edition: | 3 |
| Config/roxygen2/version: | 8.0.0 |
| NeedsCompilation: | no |
| Packaged: | 2026-06-06 12:33:15 UTC; arjunraj |
| Author: | Arjun Raj [aut, cre] |
| Maintainer: | Arjun Raj <arjunraj@seas.upenn.edu> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-12 11:20:02 UTC |
Construct a Finding record.
Description
The unified finding schema is shared with the Python linter so downstream tooling (CI reporters, agent prompts) is language-agnostic.
Usage
Finding(
file,
line,
rule,
message,
severity = "warning",
suggested_fix = NA_character_,
waiver_status = "none"
)
Shared trivial-literal allowlist.
Description
These values are too generic to count as scientifically meaningful: loop sentinels, length/presence checks, polarity flags, common Inf fences, NA fillers, and booleans. Multiple rules consult this set to avoid over-flagging idiomatic R.
Usage
TRIVIAL_LITERALS
Details
Used by R001 (positional access – drop = FALSE etc.) and R002 (magic
threshold – length(x) > 0L etc.). Rule R024 (smuggled default)
uses an equivalent inline allowlist; consolidate if/when extending.
Flag GetAssayData(obj) calls missing explicit assay= and layer=.
Description
Seurat::GetAssayData() silently falls back to the object's
"default assay" and "default layer", which depend on global state set
elsewhere in the pipeline. Pulling expression values without naming
the assay/layer is a common source of "I got counts when I wanted
data" bugs. Require at least one of assay= or layer= to be named
at the call site.
Usage
ambiguous_layer_linter()
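A minimal sketch of the call-site check, assuming it is enough to inspect the argument names of a parsed call. The helper names_assay_or_layer() is hypothetical and not part of scilintr, which works on lintr's XML parse tree instead. Seurat does not need to be installed, because the code is only parsed and never evaluated.

```r
# Hypothetical helper (not scilintr's implementation): TRUE when a call
# names at least one of assay= or layer= among its arguments.
names_assay_or_layer <- function(code) {
  call <- parse(text = code)[[1]]
  stopifnot(is.call(call))
  arg_names <- names(as.list(call))[-1]  # drop the function slot
  any(c("assay", "layer") %in% arg_names)
}

names_assay_or_layer("GetAssayData(obj)")
names_assay_or_layer("Seurat::GetAssayData(obj, assay = 'RNA', layer = 'counts')")
```

The first call relies on the object's defaults and is the shape the rule targets; the second names both arguments at the call site.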
Filter findings by removing those covered by a nearby ANALYSIS_OK waiver.
Description
Filter findings by removing those covered by a nearby ANALYSIS_OK waiver.
Usage
apply_waivers(findings, file_text)
R040 – "no-circularity" / "blind" name antipattern
Description
A function whose name contains blind, no_circularity, unsupervised,
label_free, independent, or honest is asserting that it operates
without consulting labels / ground truth. If the body nevertheless
references known label columns (e.g. legacy_branch2_clone, truth_c1,
cell_type) or label files (e.g. cell_labels.tsv), the lexical
promise of the name is contradicted by the body – a classic source of
silent circularity in unsupervised analyses.
Usage
blind_name_antipattern_linter()
Details
Detection (v1, hardcoded patterns):
- function defs found via <- function(...), <<- function(...), or = function(...) (the equal_assign form).
- a function name matching (blind|no_circularity|unsupervised|label_free|independent|honest) is required to trigger inspection.
- body label refs: $<label> where the symbol matches a known label column, or any STR_CONST mentioning cell_labels / labels.<ext>.
Reports at the line of the offending label reference (not the
function definition), so a waiver placed near the body works as
expected. A # ANALYSIS_OK[blind-name]: comment within the waiver
window suppresses the finding via the shared waiver layer.
Flag try(..., silent = TRUE).
Description
try(expr, silent = TRUE) swallows errors without logging or
rethrowing – the caller continues with a try-error object that
often gets coerced or ignored downstream. v1 detects exactly this
pattern; broader fallback heuristics are deferred.
Usage
broad_exception_linter()
Details
This is distinct from R030, which targets tryCatch(..., error = function(e) <literal>) handlers.
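The failure mode can be reproduced in base R. The sketch below shows try(..., silent = TRUE) handing back a try-error object that downstream code can consume without noticing. The load_or_stop() wrapper is a hypothetical alternative for illustration, not something scilintr prescribes.

```r
# try(silent = TRUE) returns a "try-error" object and execution continues.
res <- try(stop("input file missing"), silent = TRUE)
class(res)

# Code that never inspects the class keeps going on a meaningless value:
# here the error object is quietly counted as one item.
n_items <- length(res)

# Hypothetical alternative: add context and rethrow so the failure stays loud.
load_or_stop <- function(loader) {
  tryCatch(
    loader(),
    error = function(e) stop("load failed: ", conditionMessage(e), call. = FALSE)
  )
}
```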
Build a project-wide index for cross-file analysis.
Description
Walks every .R file once and collects:
Usage
build_project_index(files, config = NULL)
Details
- top-level function definitions (name, file, line)
- source() edges (file, sourced_file)
- stage tags per file (from # STAGE: <name> headers)
All file paths are normalized to absolute paths so cross-file joins match cleanly regardless of how the caller passed them in.
Flag w1*x1 + w2*x2 + w3*x3 + ... composite scores with literal weights.
Description
Hand-tuned weights on composite scores are a provenance landmine:
each literal weight is a degree of freedom, and three or more
compounded weights make post-hoc tuning trivially deniable. We flag
any +-joined expression with three-or-more terms where at least
one term has an explicit numeric coefficient (<NUM_CONST> * <expr>)
or is itself a bare <NUM_CONST>.
Usage
composite_weights_linter()
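The criterion in the description can be sketched on R language objects. This is an illustration under assumptions: plus_leaves(), is_weighted(), and flags_composite() are hypothetical names, the input is a bare expression rather than a whole file, and scilintr itself inspects the XML parse tree.

```r
# Collect the leaf terms of a chain of binary `+` calls.
plus_leaves <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3L) {
    c(plus_leaves(e[[2]]), plus_leaves(e[[3]]))
  } else {
    list(e)
  }
}

# A leaf counts as weighted if it is a bare number or `<number> * <expr>`.
is_weighted <- function(leaf) {
  is.numeric(leaf) ||
    (is.call(leaf) && identical(leaf[[1]], as.name("*")) && is.numeric(leaf[[2]]))
}

# Flag when there are three or more leaves and at least one is weighted.
flags_composite <- function(code) {
  leaves <- plus_leaves(parse(text = code)[[1]])
  length(leaves) >= 3L && any(vapply(leaves, is_weighted, logical(1)))
}

flags_composite("0.5 * a + 0.3 * b + 0.2 * c")  # three terms with literal weights
flags_composite("a + b + c")                    # three terms, no literal weights
flags_composite("0.5 * a + 0.5 * b")            # only two terms
```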
Details
Allowed via ANALYSIS_OK[composite-weights] waiver that cites the
weight-provenance record and a sensitivity check.
Detection:
- Find each top-level + <expr> (i.e., an <expr> whose direct children include OP-PLUS but whose parent does NOT, so we only anchor on the outermost + in a chain).
- Recurse through the + tree to enumerate plus-joined leaf terms.
- If >= 3 leaves and at least one leaf is <NUM_CONST> * <expr> or a literal <NUM_CONST>, flag the outer expression.
R039 – Recursive calls with constant hyperparameters across depth
Description
Flag a recursive function whose self-call passes one of the function's
own formal arguments through unchanged. When a hyperparameter that was
tuned at the root is inherited verbatim at every recursion depth, the
parent-tuned gate becomes a smuggled default at child nodes – the
resulting null at depth then reads as "no further structure" when it
really reflects parent-tuned parameters being applied to a different
sub-population (see analysis_lint_strategy.md section 39).
Usage
constant_gates_recursion_linter()
Details
V1 is intentionally conservative and file-local: it finds top-level
function definitions X <- function(...) { ... }, looks for direct
self-calls X(...) in the body, and flags when any argument
expression is a bare SYMBOL whose name matches one of X's formal
arguments. A bare formal pass-through (recurse(child, gates)) fires;
a transformed value (recurse(child, child_gates)) does not.
Allowed with # ANALYSIS_OK[constant-gates-at-depth] waiver.
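The pass-through check can be sketched by walking a function body as a language object. This is a hypothetical illustration (passes_formal_unchanged() and both toy functions are invented here), not the XML-based implementation, and it does not try to handle calls with empty arguments such as x[, 1].

```r
# Hypothetical sketch: does `fn` call itself with one of its own formals
# passed through as a bare symbol?
passes_formal_unchanged <- function(fn_name, fn) {
  formal_names <- names(formals(fn))
  found <- FALSE
  walk <- function(e) {
    if (!is.call(e)) return(invisible(NULL))
    if (identical(e[[1]], as.name(fn_name))) {
      args <- as.list(e)[-1]
      bare <- vapply(
        args,
        function(a) is.name(a) && as.character(a) %in% formal_names,
        logical(1)
      )
      if (any(bare)) found <<- TRUE
    }
    lapply(as.list(e), walk)  # descend into every sub-expression
    invisible(NULL)
  }
  walk(body(fn))
  found
}

# `gates` is inherited verbatim at every depth: the shape the rule targets.
recurse_fixed <- function(node, gates) {
  if (length(node) < 2L) return(node)
  recurse_fixed(node[-1], gates)
}

# Here the child call receives a value re-derived for the sub-population.
recurse_adapted <- function(node, gates) {
  if (length(node) < 2L) return(node)
  child_gates <- gates / 2
  recurse_adapted(node[-1], child_gates)
}

passes_formal_unchanged("recurse_fixed", recurse_fixed)
passes_formal_unchanged("recurse_adapted", recurse_adapted)
```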
Registry of cross-file (project-level) rules.
Description
Each entry is R<NN> = function(idx) <findings>. The rule receives
the project index built by build_project_index() and returns a
list of scilintr_finding records.
Usage
cross_file_rules()
Details
Add new cross-file rules here as they are implemented.
Detect the # STAGE: <name> tag in a file's first 10 lines.
Description
Detect the # STAGE: <name> tag in a file's first 10 lines.
Usage
detect_file_stage(path)
Arguments
path |
Path to an R file. |
Value
The stage name as a string, or NA_character_ if no tag.
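A rough base-R equivalent of the tag scan, for illustration only. find_stage_tag() is a hypothetical stand-in, and the exact pattern scilintr matches (whitespace tolerance, allowed characters in the name) is an assumption.

```r
# Hypothetical re-implementation: read the first 10 lines and return the
# first `# STAGE: <name>` tag, or NA_character_ when none is present.
find_stage_tag <- function(path) {
  head_lines <- readLines(path, n = 10L, warn = FALSE)
  pattern <- "^\\s*#\\s*STAGE:\\s*(\\S+)"
  hits <- regmatches(head_lines, regexec(pattern, head_lines, perl = TRUE))
  hits <- Filter(function(m) length(m) == 2L, hits)  # keep matching lines only
  if (length(hits) == 0L) NA_character_ else hits[[1]][2]
}

tmp <- tempfile(fileext = ".R")
writeLines(c("# STAGE: selection", "x <- 1"), tmp)
find_stage_tag(tmp)
unlink(tmp)
```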
R027 – Asymmetric env_* validators within a single file
Description
Flag when a single source file defines two or more env_*-named
helper functions that mix two failure styles: some return a default
(silent fall-through) while others halt loudly via stop().
Inconsistency here breeds silent misconfiguration: callers cannot
predict whether a missing env var crashes or quietly drifts.
Usage
env_validator_asymmetry_linter()
Details
Pragmatic v1: regex over the file text plus a top-level XML check.
Emit a single lint at the line of the FIRST env_* function definition
in the file (the source_expression covering that def is the only one
that emits, to keep results stable across lintr's per-expression model).
Top-level function definitions in a parsed file.
Description
Handles both name <- function(...) and name = function(...).
Skips -> (right-assign) and <<- (super-assign) for v0; those
are rare in research code and produce a different XML shape.
Usage
extract_fn_defs(xml, file_abs)
source() / sys.source() edges out of a file.
Description
Only handles the literal-string form (source("file.R")); dynamic
source(get(...)) etc. is ignored. Paths are resolved relative to
the consumer file's directory and normalized to absolute.
Usage
extract_sources(xml, file_path, file_abs)
Apply waivers to a list of cross-file Findings.
Description
Reads each finding's file from disk (cached across findings to the
same file) and removes findings covered by a nearby ANALYSIS_OK[...].
Usage
filter_waivers_cross_file(findings)
Flag inline design = ~ ... formula literals.
Description
v1 heuristic: find any named-argument design = <expr> where the
RHS expression is a tilde formula literal (<OP-TILDE> in the XML
parse tree). Variables resolving to formulas (e.g.,
design = design_formula) are not flagged – only inline tildes.
Waiver suppression (ANALYSIS_OK[contrast-definition]) is handled
by the dispatcher; this rule always fires on inline tildes.
Usage
hardcoded_design_formula_linter()
Flag c("S17", "S23", ...) style hardcoded sample-ID exclusion lists.
Description
Heuristic (v1): any c(...) call that contains two or more STR_CONST
children whose unquoted value matches ^[A-Z]+\\d+$ (e.g., "S17",
"A191", "P001") is flagged at the line of the c() call. This catches
the canonical exclude <- c("S17", "S23") pattern from the strategy
doc. The rule fires regardless of waivers; waiver suppression is
applied by the dispatcher reading ANALYSIS_OK[sample-exclusion]
ledger comments – so good_ledgered.R still emits a Lint here.
Usage
hardcoded_sample_id_linter()
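To see which strings the ID pattern accepts, it can be applied directly with grepl(). The candidate strings below are made up for illustration.

```r
# The pattern from the description: one or more capitals followed by digits.
sample_id_pattern <- "^[A-Z]+\\d+$"
candidates <- c("S17", "A191", "P001", "sample_17", "s17", "batch")
looks_like_id <- grepl(sample_id_pattern, candidates, perl = TRUE)
candidates[looks_like_id]

# The rule needs two or more ID-shaped strings inside the same c(...) call.
sum(looks_like_id) >= 2L
```

Lower-case or underscore-separated identifiers such as "s17" and "sample_17" fall outside the pattern, so lists written in that style would not trigger the v1 heuristic.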
Flag implicit file selection patterns.
Description
Two patterns flagged:
- String literals containing suspicious filename tokens (latest, old, old, backup, previous, tmp_, tmp, temp, temp, copy, _copy, final_final, archive). Case-insensitive.
- mtime-based file picking: an expression containing both file.info and mtime symbols (e.g. files[which.max(file.info(files)$mtime)]).
Usage
implicit_file_selection_linter()
Details
Waivers (# ANALYSIS_OK[...]) are handled by the harness layer;
this linter just emits Lints.
Flag references to label / ground-truth column names in files that
are tagged # STAGE: selection (or any non-evaluation blind stage).
Description
Selection-stage code (PCA, HVG, clustering, embedding, etc.) must be
blind to outcome/label columns; touching metadata$treatment while
choosing genes leaks the answer key into the unsupervised pipeline.
Usage
label_in_blind_stage_linter()
Details
Stage detection: scan the first 10 lines of the file for a comment
# STAGE: <name>. If <name> is evaluation (or no tag is present)
the linter does not fire. Anything else (including selection) is
treated as a blind stage.
Detection patterns (label names hardcoded for v1):
- metadata$<label> – OP-DOLLAR followed by SYMBOL matching a label
- df[["<label>"]] – string literal inside [[ ]] matching a label
- pull(<label>) – bare SYMBOL argument to pull()
Flag any reference to label / ground-truth columns from inside a file
tagged # STAGE: selection.
Description
Selection-stage code (module scoring, clustering, embedding, calling
filters, etc.) must be blind to outcome labels. The project enforces a
two-file split: selected_calls.csv (labels-free) is written by the
selection-stage script, and selected_calls_evaluated.csv is produced
downstream by an evaluation-stage script that joins labels. Even a
read-only label reference in a selection-stage file makes the leak
"one careless edit away."
Usage
label_ref_in_selection_linter()
Details
Stage detection: scan the first 10 lines of the file for a comment
# STAGE: <name>. The linter fires only when <name> is
exactly selection. Other stages (including untagged files) are
handled by other rules (e.g. R012 covers all blind stages).
Detection patterns (label names hardcoded for v1):
- df$<label> – OP-DOLLAR followed by SYMBOL matching a label
- df[["<label>"]] – string literal inside [[ ]] matching a label
- pull(<label>) – bare SYMBOL argument to pull() / select()
Flag a data.frame(...) literal that mixes label columns with
computed-score columns and is subsequently written to disk via
write.csv / write_csv from within a # STAGE: selection file.
Description
Rationale: a CSV that pairs labels with discovery scores propagates
the leakage risk downstream – any later script that re-reads the
file is "one careless edit away" from training on labels. The
project-prescribed remediation is the two-file split (selection
writes selected_calls.csv, labels-free; evaluation joins labels in
a separate stage). See analysis_lint_strategy.md #34.
Usage
label_score_coresidence_linter()
Details
v1 detection (single file):
- Stage gate – only fire when the first 10 lines contain # STAGE: selection.
- Find data.frame(...) calls whose argument list contains both a SYMBOL_SUB matching a label name and a SYMBOL_SUB matching a score-name regex.
- Require the file to also contain a write.csv / write_csv / readr::write_csv call (proxy for "written to disk"). Flag at the line of the data.frame call.
Flag read.csv / read_csv / read.table / read.delim / fread calls
in a # STAGE: selection file whose path argument's basename matches a
label-tainted pattern (e.g. gt_*, *_oracle_*, *_recall_*, etc.).
Description
Even when a selection-stage script only reads one "harmless" column from such a file, the rows / panel that ended up in the CSV were chosen using labels – so the downstream selection is laundered through the file name. See analysis_lint_strategy.md #35.
Usage
label_tainted_input_linter()
Details
Stage gate: the linter scans the first 10 lines for a # STAGE: <name>
marker and fires only when <name> is exactly selection. Untagged or
otherwise-tagged files are left to other rules.
Hardcoded label-tainted regex (v1, applied to basename(path)):
(^|_)(gt|oracle|truth|evaluated|recall|label)(_|\d|\.)
This anchors on _/start and accepts a trailing digit (so gt17_*.csv
fires), underscore (so *_oracle_*.csv fires), or . (so gt.csv would
fire too). Project-config extension is left for a future iteration.
Waivers (e.g. ANALYSIS_OK[oracle-file-read]) are honoured by the waiver
layer; this linter just emits the raw finding.
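The hardcoded regex can be tried against a few basenames to see where it fires. The paths below are invented for illustration; only the pattern comes from the text above.

```r
# Pattern as documented, applied to basename(path).
tainted_pattern <- "(^|_)(gt|oracle|truth|evaluated|recall|label)(_|\\d|\\.)"
paths <- c(
  "data/gt17_calls.csv",
  "data/panel_oracle_v2.csv",
  "data/gt.csv",
  "data/selected_calls.csv",
  "data/cell_labels.tsv"
)
is_tainted <- grepl(tainted_pattern, basename(paths), perl = TRUE)
setNames(is_tainted, basename(paths))
```

Under this pattern the first three names match and selected_calls.csv does not. cell_labels.tsv also does not match, because label is followed by s rather than an underscore, digit, or dot.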
Flag order(...) / arrange(...) calls whose secondary sort key
references a label-named column (e.g. is_gt_label, truth_c1,
is_target). A secondary key drives ranking for ties – if the
tie-breaker is the answer key, the ranking is leaky.
Description
Detection:
- Call is order(...) or arrange(...).
- Walk all arguments after the first; if any of their descendant SYMBOLs match the label-name regex below, flag.
- Match is case-insensitive and matches names containing is_gt, gt_label, truth, is_target, or trailing _label.
Usage
label_tiebreak_linter()
Lint a single R file.
Description
Runs every registered per-file linter against path, converts the
resulting lintr::Lint objects to scilintr_finding records, and
applies the ANALYSIS_OK[...] waiver filter.
Usage
lint_file(path, config = NULL)
Arguments
path |
Path to a |
config |
Optional configuration list (loaded from |
Value
A list of scilintr_finding records.
Examples
# Lint a tiny self-contained file in the session temp directory.
tmp <- tempfile(fileext = ".R")
writeLines("x <- 1", tmp)
findings <- lint_file(tmp)
length(findings)
unlink(tmp)
Lint an entire project directory.
Description
Walks every .R file, runs the per-file linters, then builds the
project index and runs the cross-file rules against it.
Usage
lint_project(root = ".", config = NULL)
Arguments
root |
Project root directory. |
config |
Optional configuration list. |
Value
A list of scilintr_finding records aggregated across files.
Examples
# Build a throwaway one-file project inside the session temp directory.
proj <- file.path(tempdir(), "scilintr-demo")
dir.create(proj, showWarnings = FALSE)
writeLines("y <- 2", file.path(proj, "analysis.R"))
findings <- lint_project(proj)
length(findings)
unlink(proj, recursive = TRUE)
Convert a single lintr::Lint to a scilintr_finding.
Description
Pulls the rule ID from lint$linter (set by the registry key when
the linter is dispatched). Strips a trailing _linter suffix if
lintr added one.
Usage
lint_to_finding(lint)
Load scilintr configuration from a project root.
Description
Looks for .scilintr.yml, analysis_labels.yml, and
analysis_identifiers.yml at the project root.
Usage
load_config(root = ".")
Flag log(pmax(x, <tiny-literal>)) and friends.
Description
A floor like 1e-12 or .Machine$double.eps (~2.2e-16) inside a
pmax() immediately before log() is a numerical-stability
landmine when the floor sits below the data's natural
discretisation grid. The floor then dominates the score and
confuses ranking comparisons. Use a domain-motivated floor (e.g.
half the smallest non-zero increment), not a generic safety
constant.
Usage
magic_eps_floor_linter()
Details
Detection:
- Outer call is log / log1p / log10 / log2.
- Inner call is pmax / pmax.int.
- Second pmax argument is either a NUM_CONST with numeric value < 1e-6, or the expression .Machine$double.eps.
- pmax(x, 1 / (2 * N)) and similar compound expressions are not flagged – the second argument is not a single literal.
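A small numeric illustration of why the floor matters, using hypothetical proportions measured on a grid of 1/N. With a generic 1e-12 floor, the zero entry sits far from every observed value on the log scale; with a floor of half the smallest increment, the gap is log(2).

```r
# Hypothetical data: proportions observed on a grid of 1/N with N = 100.
N <- 100
p <- c(0, 0.01, 0.02, 0.05)

generic_floor <- log(pmax(p, 1e-12))        # the pattern the rule flags
domain_floor  <- log(pmax(p, 1 / (2 * N)))  # half the smallest non-zero increment

# Log-scale gap between the zero entry and the smallest non-zero entry.
gap_generic <- generic_floor[2] - generic_floor[1]
gap_domain  <- domain_floor[2] - domain_floor[1]
c(generic = gap_generic, domain = gap_domain)
```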
Flag magic numeric thresholds in comparison expressions.
Description
Detects patterns like padj < 0.05, counts > 10, zscores > 3
where a bare numeric literal sits on either side of a comparison
operator (<, >, <=, >=, ==, !=). Named constants
(e.g. padj < FDR_THRESHOLD) are not flagged because no
NUM_CONST appears.
Usage
magic_threshold_linter()
Details
Trivial literals (0, 1, -1, NA, TRUE, FALSE, Inf,
etc.) are filtered out – loop sentinels (length(x) > 0L) and
presence checks (nrow(df) > 0) are not scientific thresholds.
V1.1 still over-flags relative to the strict spec; legitimate
bare-number cases are handled by the orchestrator's
ANALYSIS_OK[threshold] waiver.
Main CLI entry point.
Description
Invoked from inst/bin/scilintr or
Rscript -e 'scilintr::main()'.
Usage
main(args = commandArgs(trailingOnly = TRUE))
Arguments
args |
Character vector of command-line arguments. |
Value
Invisibly returns 0L when no findings are reported and 1L
otherwise. Called for its side effect of printing findings.
Examples
# Run the CLI entry point over a throwaway project in tempdir().
proj <- file.path(tempdir(), "scilintr-cli-demo")
dir.create(proj, showWarnings = FALSE)
writeLines("z <- 3", file.path(proj, "analysis.R"))
main(proj)
unlink(proj, recursive = TRUE)
Detect a structured ANALYSIS_OK waiver near a given line.
Description
Scans up to window lines around line_no for a
# ANALYSIS_OK[category]: comment. Returns the category name if
found, NA otherwise. Roxygen comments (#' @...) are ignored so
an unrelated tag cannot spoof a waiver.
Usage
nearby_waiver(file_lines, line_no, window = 8L)
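The lookup can be approximated in a few lines of base R. This is a hypothetical re-implementation for illustration: find_waiver() is an invented name, and both the symmetric window and the exact comment pattern are assumptions about behaviour the text does not spell out.

```r
# Hypothetical sketch: return the category of the first ANALYSIS_OK waiver
# within `window` lines of `line_no`, ignoring roxygen comment lines.
find_waiver <- function(file_lines, line_no, window = 8L) {
  lo <- max(1L, line_no - window)
  hi <- min(length(file_lines), line_no + window)
  nearby <- file_lines[lo:hi]
  nearby <- nearby[!grepl("^\\s*#'", nearby, perl = TRUE)]  # skip roxygen lines
  pattern <- "#\\s*ANALYSIS_OK\\[([^\\]]+)\\]:"
  hits <- regmatches(nearby, regexec(pattern, nearby, perl = TRUE))
  hits <- Filter(function(m) length(m) == 2L, hits)
  if (length(hits) == 0L) NA_character_ else hits[[1]][2]
}

src <- c(
  "# ANALYSIS_OK[threshold]: FDR cutoff pre-registered in the protocol",
  "sig <- res[res$padj < 0.05, ]"
)
find_waiver(src, line_no = 2L)
```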
Parse a single file and pull defs, sources, and stage tag.
Description
Parse a single file and pull defs, sources, and stage tag.
Usage
parse_file_info(path, path_abs)
Flag digest::digest(<single_SYMBOL>) inside a function with 2+
formals. A cache key that fingerprints a single variable when the
enclosing function takes multiple inputs is a partial-fingerprint
hazard: un-fingerprinted inputs can change and the cache silently
returns stale results.
Description
Heuristic: catches digest::digest(idx_e) but not
digest::digest(list(N = N, Y = Y, idx_e = idx_e)) because the
latter has a function call (list(...)) as its argument, not a
bare SYMBOL.
Usage
partial_cache_fingerprint_linter()
Flag hardcoded patient/sample identifiers (and A191-specific SNP literals)
in files declared # STAGE: library.
Description
Stage detection: the rule scans the first 10 lines of the file for a
# STAGE: <name> directive. If <name> is not library, the rule
returns no findings. This keeps the rule scoped to sample-agnostic
library helpers; analysis scripts can legitimately mention "A191",
191L, etc.
Usage
patient_id_in_lib_linter()
Details
v1 forbidden literal set (hardcoded):
- NUM_CONST: 191, 191L, 193, 193L
- STR_CONST: "A191", "A193"
- STR_CONST matching SNP-name pattern ^X\d+\.\d+[A-Z]+\.[A-Z]+$ (e.g., "X17.76565019G.A" – R-mangled SNP-id, which is necessarily dataset-specific and should not live in library code).
Waiver suppression (ANALYSIS_OK[sample-specific-default] and similar)
is applied by the orchestrator, not here – this rule fires on every
offending literal regardless of nearby waiver comments.
Registry of per-file lintr::Linter() factories.
Description
Each entry is R<NN> = factory(). The registry key becomes the
rule ID on every emitted Lint (via lintr's name propagation).
Usage
per_file_linters()
Details
Add new per-file rules here as they are implemented. Cross-file
rules live in R/cross_file_rules.R.
Flag pmax(<expr>, 0) and ylim(0, ...).
Description
Both patterns silently clip negative values, hiding informative
signed ranges (e.g. ARI dropping below 0). For v1 we look for the
literal 0 argument; later versions may also catch pmin(x, 1),
coord_cartesian(ylim = c(0, NA)), etc.
Usage
plot_clip_linter()
Flag filtering assignments whose LHS name starts with plot_.
Description
Pattern:
plot_df <- de_results[de_results$padj < 0.05, ]
plot_df <- df %>% filter(...)
Usage
plot_data_filter_linter()
Details
Visual filtering (subsetting for a single chart) silently mutates the
analysis population if the same plot_df is later treated as the DE
result. We flag the assignment; a # ANALYSIS_OK[plot-filter]: waiver
on a neighbouring line suppresses the finding, and that suppression is
applied by the shared waiver layer rather than by this linter.
Flag raw positional dataframe access by integer literal.
Description
Detects patterns like metadata[, 4], df[[3]], and df[, 2:5]
where a bare integer literal sits in the column index slot of a
single-bracket access (after the first comma) or as the sole
index of a double-bracket access. Named-constant indices
(e.g. metadata[, TREATMENT_COL_INDEX]) use SYMBOL, not
NUM_CONST, and are not flagged.
Usage
positional_access_linter()
Details
Trivial literals (TRUE, FALSE, NA, 0, 1, Inf, etc.) are
filtered out – they parse as NUM_CONST in xmlparsedata but are
never positional indices. The drop = FALSE argument of
df[, j, drop = FALSE] is the canonical false-positive this guards
against.
V1.1 is still trigger-happy on real integer indices. Legitimate
uses are handled by the orchestrator's
ANALYSIS_OK[positional-access] waiver layer.
Flag positional re-indexing of a dataframe by row/column count of another.
Description
Detects patterns like metadata <- metadata[1:ncol(counts), ] or
metadata[seq_len(ncol(counts)), ] where rows of one structure are
positionally trimmed/aligned to the column or row count of another.
Such alignment is fragile: it silently succeeds even when the two
structures are in different orders. ID-based alignment
(e.g. counts[, metadata$sample_id]) is preferred.
Usage
positional_alignment_linter()
Details
V1 finds any single-bracket access whose row-index expression contains
a call to ncol, nrow, or seq_len. Legitimate uses are handled
by the orchestrator's ANALYSIS_OK[id-alignment] waiver layer.
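The fragility is easy to reproduce with toy data. In the sketch below (all object names and values are hypothetical), the positional trim runs without complaint even though the metadata rows are in a different order from the count-matrix columns, while the ID-based version reorders explicitly and stops on any missing sample.

```r
# Hypothetical toy data: metadata rows are ordered differently from the
# columns of the count matrix.
counts <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("g1", "g2"), c("s1", "s2", "s3"))
)
metadata <- data.frame(
  sample_id = c("s3", "s1", "s2"),
  treatment = c("drug", "ctrl", "ctrl")
)

# Positional trim: succeeds, but row 1 describes s3 while column 1 is s1.
positional <- metadata[seq_len(ncol(counts)), ]

# ID-based alignment: reorder metadata to the column order, fail on gaps.
idx <- match(colnames(counts), metadata$sample_id)
stopifnot(!anyNA(idx))
aligned <- metadata[idx, ]
```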
Flag read.csv() then string-keyed column lookup with mangled chars.
Description
R's read.csv() / read.table() rewrite column names: -, >,
: become ., and names starting with a digit get an X prefix.
A subsequent df[["17-38733306C>T"]] or df$`17-38733306C>T`
then silently returns NULL with no error or warning. Passing
check.names = FALSE (or using readr::read_csv) avoids the
rewrite.
Usage
readcsv_mangling_linter()
Details
Two-phase, single-pass detection:
- File-level: if the script contains no read.csv / read.table call, or every such call passes check.names = FALSE, bail.
- Per-expression: flag string-literal column lookups (df[["..."]]) and backtick-symbol lookups (df$`...`) whose name carries -, >, :, or a leading digit.
Both df[["..."]] (STR_CONST) and df$`...` (OP-DOLLAR /
SYMBOL) forms are handled.
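The name rewrite and the silent NULL can be reproduced with an in-memory CSV. The column values are made up; the header reuses the SNP-style name from the description.

```r
csv_text <- "17-38733306C>T,depth\n1,30\n0,12"

# Default read.csv(): the header is rewritten into a syntactically valid name.
mangled <- read.csv(text = csv_text)
names(mangled)
is.null(mangled[["17-38733306C>T"]])  # the lookup misses, with no error or warning

# check.names = FALSE keeps the original header, so the lookup succeeds.
intact <- read.csv(text = csv_text, check.names = FALSE)
intact[["17-38733306C>T"]]
```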
Flag if (!file.exists(path)) return(NULL/NA).
Description
This is a silent fallback wearing a different costume – the explicit
fallback rule (R007) catches try(...) swallowing, but it doesn't
catch the cleaner-looking if (!file.exists(path)) return(NULL).
Same end state: the missing input is treated as an acceptable empty
signal and propagates downstream.
Usage
return_null_on_missing_linter()
Details
Detection: an if(!file.exists(...)) whose body is – or contains –
a return(NULL), return(NA), return(NA_real_), etc. Returns of
other values (return(0), return(x)) are not flagged.
Ported from the Python rule return-none-on-missing-input.
Flag a top-level function definition that shadows a name imported
via source().
Description
Joins the project's defs table (one row per top-level function
definition) with its sources table (one row per source() edge)
and looks for names defined in both the consumer file and the
sourced file. The finding lands on the consumer's redefinition line.
Usage
rule_R020_shadow_overwrite(idx)
Flag function names defined in more than one file.
Description
Groups idx$defs by function name; if a name has definitions in
more than one file, emits a Finding at each definition site so the
user sees every drift location. Waivers (ANALYSIS_OK[...]) at any
individual site are applied by the cross-file waiver filter and can
suppress that site independently (e.g. for intentional v1/v2 kept
alongside each other).
Usage
rule_R025_def_drift(idx)
Flag top-level function definitions that have no callers in any other project file.
Description
For each function in idx$defs, scans the text of every other .R
file in idx$files for a token matching \bfn_name\s*\(. If no
such call site is found, the definition is reported as dead code at
its def line. Callers within the defining file are tolerated (they
usually indicate internal helpers); intentional API stubs should
carry an ANALYSIS_OK[unused-fn] waiver near the def. This is a
v1 textual check, so matches inside comments or strings are not
excluded.
Usage
rule_R026_dead_code(idx)
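The textual check and its stated caveat can be illustrated with grepl(). The function name and file contents below are hypothetical; the pattern follows the \bfn_name\s*\( form given in the description.

```r
fn_name <- "score_cells"
call_pattern <- paste0("\\b", fn_name, "\\s*\\(")

# Hypothetical contents of another project file: the only mention of
# score_cells() sits inside a comment.
other_file <- c(
  "# TODO: wire up score_cells() once inputs are final",
  "res <- run_pipeline(x)"
)
any(grepl(call_pattern, other_file, perl = TRUE))
```

Because the match is purely textual, the commented-out mention counts as a caller here, so a definition that is effectively dead would not be reported in this situation.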
Run every registered cross-file rule against the project index.
Description
Run every registered cross-file rule against the project index.
Usage
run_cross_file_rules(idx)
Flag set.seed(...) calls inside a function literal.
Description
v1 heuristic per analysis_lint_strategy.md R022: a set.seed()
that is not at script top level – i.e. nested inside any enclosing
function(...) <body> – pollutes the global RNG state when the
function is invoked from a loop or parallel worker. The reproducibility
contract belongs to the dispatcher (top-level seed or L'Ecuyer streams),
not the per-task callee.
Usage
seed_in_loop_linter()
Details
Detection: any <SYMBOL_FUNCTION_CALL>set.seed</SYMBOL_FUNCTION_CALL>
that has an ancestor <expr> containing a <FUNCTION> child.
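A short base-R demonstration of the hazard, using hypothetical helper names: a callee that reseeds returns the same draw on every invocation, whereas seeding once at the top level lets the stream advance.

```r
# A per-task callee that reseeds: every call returns the same "random" draw.
draw_reseeded <- function() {
  set.seed(1)
  rnorm(1)
}
draws <- vapply(1:3, function(i) draw_reseeded(), numeric(1))
length(unique(draws))

# Seeding once at the top level leaves the stream free to advance.
draw_plain <- function() rnorm(1)
set.seed(1)
draws_top <- vapply(1:3, function(i) draw_plain(), numeric(1))
length(unique(draws_top))
```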
Flag <name> <- <expr> != "" / <name> <- <expr> == "".
Description
The mask is the partitioner downstream code uses; treating "" as
a missingness sentinel hides the upstream contract and silently
drops or keeps rows depending on which side of the comparison the
empty string falls on.
Usage
sentinel_mask_linter()
Details
Scope, on purpose:
- Only the empty-string sentinel is matched. != 0 / !is.na(x) are out of scope (too common and usually legitimate).
- Only top-level assignments to a plain symbol. The inline form df[df$col != "", ] is R005 (unledger-filter) territory.
- Composition with &, |, !, and parentheses is unwrapped so a compound mask like (df$a != "") & (df$b != "") still fires.
Ported from the Python rule sentinel-mask-assignment.
Flag a tryCatch(..., error = ...) handler that silently degrades.
Description
The silent-fallback family in three costumes, all sharing one hidden commitment – "on failure this quietly proceeds on a meaningless value instead of stopping":
Usage
silent_trycatch_linter()
Details
- return: the handler's return value is a bare literal (NUM_CONST, NULL_CONST, STR_CONST – NA, NA_real_, NULL, 0, "", ...), whether returned directly (function(e) NA) or as the last statement of a multi-statement block (function(e) { warning(...); NA }). Caller-side code then continues with a numerically-valid stand-in.
- rebind: the handler superassigns (<<-) an outer name to a degraded default (cohort <<- NULL), so downstream stages run on a placeholder. <<- escapes the handler scope, so it is flagged wherever it sits in the body.
- stub: the handler superassigns an outer name to a no-op stub function (score_fn <<- function(...) NULL), silently disabling behavior on the failure path.
Doing real work and returning a genuine recovered value (a cached
object, an alternate computation, stop(e) to rethrow) is left
alone – only bare placeholders and no-op stubs are flagged. Local
(<-) rebinds are not flagged: in R they die with the handler frame
and have no external effect.
R024 – Smuggled function-signature defaults
Description
Flag function(arg = <literal>) where the literal is "interesting":
a NUM_CONST whose value is not in {0, 1, -1, NA*, Inf, -Inf, TRUE, FALSE} (with optional L suffix), or a STR_CONST whose unwrapped
value is not the empty string.
Usage
smuggled_default_linter()
Details
Boring defaults (0, 1, NA, NULL, TRUE, "") are sentinels,
not scientific choices. Calls (e.g. c("a","b")) and absent defaults
are not literals and are skipped by construction.
Stage tag row for the index – wraps detect_file_stage().
Description
Stage tag row for the index – wraps detect_file_stage().
Usage
stage_row(path, path_abs)
Flag if (file.exists(<X>)) return(read*(...)) single-statement bodies.
Description
A cache short-circuit that returns a previously serialized result without comparing an input fingerprint is a stale-cache hazard: if upstream inputs change, the cached value is silently returned and the analysis quietly drifts from its inputs.
Usage
stale_cache_linter()
Details
Detection (v1): an if whose condition calls file.exists(...) and
whose body is a single statement calling return(readRDS(...))
(or read_rds, read.csv, read_csv). Multi-statement {...}
bodies are spared on the assumption that the extra statements
implement a fingerprint check – the orchestrator's waiver layer
handles any remaining false positives.
R038 – Symmetric "best of either side" reporting
Description
Flag pmax(...), max(...), or which.max(c(...)) whose argument
subtree contains two or more SYMBOLs whose names look like
side/polarity labels (target_*, rest_*, left_*, right_*,
*_aligned, *_complement, *_c1, *_side).
Usage
symmetric_best_linter()
Details
Picking "the better of two label-named sides" after labels are joined
is a hidden test-multiplication / label-aware fishing pattern. Fix is
to pre-declare which side is the target side via a label-independent
rule and freeze the orientation before label joins
(see analysis_lint_strategy.md section 38).
V1 is file-local and intentionally conservative: it fires only when the heuristic finds at least two side-label-shaped SYMBOLs inside the max-like call's subtree. Pre-declared if/else polarity (no pmax/max/which.max) does not match.
Flag calls to random-data generators (rnorm, runif, rpois,
rbinom, rexp, rgamma, rbeta, rmultinom, sample,
sample_n, sample_frac).
Description
v1 heuristic: any <SYMBOL_FUNCTION_CALL> whose text matches one of
the known random-data generators is flagged at the line of the call.
We deliberately don't try to distinguish "data-like" assignments from
diagnostic randomness – false positives are cheap and the waiver
layer (ANALYSIS_OK[...]) handles legitimate uses.
Usage
synthetic_data_linter()
Flag uppercase-constant assignments that are defined within a few lines
of a label-tainted read or label-column reference, inside a file tagged
# STAGE: selection.
Description
Selection-stage code must not pick its thresholds/bands by maximizing
a label-aware metric (e.g. ground-truth recall). When a constant like
BAND <- sweep[which.max(sweep$gt_recall), ...] sits next to a
read.csv("..._gt_...") call or a label column reference, the
threshold is effectively label-tuned. The fix is to either (a) move
the constant to analysis_constants.yml with a documented value, or
(b) move the threshold-selection code into a # STAGE: evaluation
script and pass the resulting number in by hand.
Usage
threshold_near_label_linter()
Details
Stage detection: scan the first 10 lines for # STAGE: <name>. Fires
only when <name> is exactly selection. Other stages (including
untagged files) are not flagged here.
Detection: regex on the raw file text.
Constant lines: ^[A-Z][A-Z_0-9]+\\s*<-
Label-tainted reads: paths/names containing gt, oracle, truth, evaluated, or recall as a token.
Label column refs: a small hardcoded vocabulary (kept in sync with R033 for consistency).
A constant is flagged when at least one label-tainted line falls within +/- 6 lines of it.
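The windowed check can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helper name find_label_tuned_constants is hypothetical, the token regex is a guess at what "as a token" means, and the label-column vocabulary is omitted.

```r
# Sketch of the +/- 6 line window check on raw file text.
find_label_tuned_constants <- function(lines, window = 6L) {
  # Only files tagged "# STAGE: selection" in the first 10 lines qualify.
  if (!any(grepl("^#\\s*STAGE:\\s*selection\\s*$", head(lines, 10L)))) {
    return(integer(0))
  }
  # Uppercase-constant assignments.
  const_idx <- grep("^[A-Z][A-Z_0-9]+\\s*<-", lines)
  # Assumed token pattern: a label word bounded by non-alphanumerics.
  taint_idx <- grep(
    "(^|[^A-Za-z0-9])(gt|oracle|truth|evaluated|recall)([^A-Za-z0-9]|$)",
    lines
  )
  # Keep constants with at least one tainted line inside the window.
  near <- vapply(
    const_idx,
    function(i) any(abs(taint_idx - i) <= window),
    logical(1)
  )
  const_idx[near]
}

script <- c(
  "# STAGE: selection",
  "sweep <- read.csv(\"results_gt_sweep.csv\")",
  "BAND <- 0.25",
  rep("x <- 1", 10),
  "CUTOFF <- 3"
)
flagged <- find_label_tuned_constants(script)
```

In this toy script BAND sits one line below the tainted read and is returned, while CUTOFF is twelve lines away and is not.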
Flag iterative-fit calls that lack a nearby convergence check.
Description
Iterative optimisers (lme4's glmer/lmer/nlmer, base nls and
optim, and optimx::optimx) can silently return non-converged
fits. Downstream inference on a non-converged model is unreliable.
Analyses should inspect fit@optinfo$conv$lme4$messages, the
convergence component of optim() output, or otherwise programmatically
confirm convergence before using the fit.
Usage
unchecked_convergence_linter()
Details
v1 heuristic: if any flagged fit call appears in the file AND the file
body contains none of the tokens converged, conv$lme4, or
convergence, every fit call in the file is flagged. The waiver layer
silences this elsewhere when a justified ANALYSIS_OK[model-fit]
comment is present.
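As an illustration of a programmatic convergence check, the sketch below uses base optim() on a toy objective. The objective function and starting values are made up for the example; the same idea applies to the other fitters listed above, each with its own convergence field.

```r
# Toy objective with a known minimum at c(1, 2).
objective <- function(par) sum((par - c(1, 2))^2)
fit <- optim(par = c(0, 0), fn = objective, method = "BFGS")

# optim() returns a list; convergence == 0 means successful completion.
if (fit$convergence != 0L) {
  stop("optim() did not converge; code = ", fit$convergence)
}
estimate <- fit$par
```

Because the check mentions convergence by name, a file containing it would also satisfy the token heuristic described in Details.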
Flag join/merge calls without a follow-up cardinality assertion.
Description
Detects calls to left_join, right_join, inner_join, full_join,
anti_join, semi_join, or merge. If the file contains a row-count
or duplicate-key check (stopifnot(...), anyDuplicated, or a
validate = ... argument), the rule treats the joins as covered and
emits nothing. Otherwise, every join call is flagged.
Usage
unchecked_join_linter()
Details
V1 is intentionally file-level and trigger-happy. Upstream assertions
in another file are handled by the orchestrator's ANALYSIS_OK[join]
waiver layer, not here.
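One way to pair a join with cardinality assertions, sketched with base merge(). The data frames and column names are hypothetical.

```r
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(1, 2, 3), y = c(10, 20, 30))

# Assert that the lookup side has unique keys before joining.
stopifnot(anyDuplicated(right$id) == 0)

joined <- merge(left, right, by = "id", all.x = TRUE)

# A left join against unique keys must preserve the row count.
stopifnot(nrow(joined) == nrow(left))
```

Either assertion alone contains one of the tokens the file-level heuristic looks for. Together they also document which side is expected to be unique.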
Flag optparse::make_option("--kebab-name", ...) calls whose
parsed dest is never read.
Description
Walks every make_option("--kebab", ...) (with or without the
optparse:: qualifier), computes the destination name optparse
would assign (--kebab-name -> kebab_name), then checks whether
any <args>$<dest> or <args>[["<dest>"]] access reads that
name – where <args> is the variable bound to a parse_args(...)
result. If no parse_args() anchor exists in the file we can't tell
what's consumed, so the file is skipped.
Usage
unconsumed_cli_flag_linter()
Details
Ported from the Python rule unconsumed-cli-flag.
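The flag-to-destination mapping and the consumption check can be sketched without optparse. Both helpers (flag_to_dest, is_read) are hypothetical, and the substring match is cruder than a parse-tree walk: for example, it would treat args$seed_value as a read of seed.

```r
# Derive the destination name for a long flag, following the mapping
# in the text (--kebab-name -> kebab_name).
flag_to_dest <- function(flag) {
  gsub("-", "_", sub("^--", "", flag), fixed = TRUE)
}

# Check whether a destination is read as args$dest or args[["dest"]].
is_read <- function(dest, lines, args_name = "args") {
  patterns <- c(
    paste0(args_name, "$", dest),
    paste0(args_name, "[[\"", dest, "\"]]")
  )
  any(vapply(
    patterns,
    function(p) any(grepl(p, lines, fixed = TRUE)),
    logical(1)
  ))
}

dests <- flag_to_dest(c("--kebab-name", "--min-cells", "--seed"))
script <- c(
  "args <- parse_args(parser)",
  "n <- args$min_cells",
  "set.seed(args[[\"seed\"]])"
)
unconsumed <- dests[!vapply(dests, is_read, logical(1), lines = script)]
```

Here only kebab_name is never read, so it is the one flag a reviewer would want to either wire up or delete.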
Flag scientifically consequential transforms without explanation.
Description
Batch correction, residualization, and similar transforms silently
reshape downstream analysis. Require an ANALYSIS_OK[...] waiver
(handled elsewhere) that documents what was – and was not –
included as a covariate.
Usage
unexplained_transform_linter()
Details
Flagged calls (matched by function symbol): ComBat, combat, residualize, regress_out, remove_batch_effect, removeBatchEffect, regressOut
Flag self-assignments that silently narrow a dataframe.
Description
Patterns like df <- df[<cond>, ], df <- subset(df, ...), and
df <- df %>% filter(...) drop rows without recording how many were
dropped or why. The matched rule (R005 in the strategy doc) is that
filtering is allowed when ledgered – a sibling ANALYSIS_OK[<tag>]
waiver comment plus an observable drop record – but unannotated
self-narrowing should be surfaced.
Usage
unledger_filter_linter()
Details
Heuristic: LEFT_ASSIGN where the LHS SYMBOL name reappears as a SYMBOL
on the RHS, and the RHS contains either a [ (OP-LEFT-BRACKET) or a
call to filter / subset. Waiver-comment suppression is the
responsibility of the cross-file aggregator.
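A sketch of a filter that leaves an observable drop record. The data, the threshold of 5, and the message wording are all hypothetical; the waiver comment itself is omitted because its exact syntax is defined elsewhere.

```r
df <- data.frame(id = 1:6, depth = c(12, 3, 25, 8, 40, 1))

# Record the row count before filtering so the drop is observable.
n_before <- nrow(df)
keep <- df$depth >= 5

# Assign to a new name rather than overwriting df in place.
filtered <- df[keep, , drop = FALSE]
n_dropped <- n_before - nrow(filtered)
message("depth filter: dropped ", n_dropped, " of ", n_before, " rows")
```

Writing to a new name sidesteps the self-assignment shape entirely. If overwriting df is preferred, the same counts can be logged around the assignment.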
Flag unledgered missingness coercion/imputation calls.
Description
v1 detects bare na.omit(...) calls. Stripping NAs silently
discards rows without an audit trail; analyses should either
scope the drop (df[!is.na(df$col), ]) or carry an
ANALYSIS_OK[missingness] ledger comment documenting the
exclusion. Future versions should extend to na.exclude,
tidyr::replace_na, drop_na, and top-level as.numeric
coercions; ledger-comment recognition is also deferred.
Usage
unledger_missingness_linter()
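The difference between a blanket na.omit() and a scoped drop is easiest to see on a small example. The data frame below is hypothetical.

```r
df <- data.frame(
  id = 1:5,
  score = c(0.9, NA, 0.4, 0.7, NA),
  note = c(NA, "ok", NA, "ok", "ok")
)

# Blanket form: drops every row with an NA in any column, including
# columns the analysis may never use.
n_blanket <- nrow(na.omit(df))

# Scoped form: drops only rows missing the column that matters here.
scoped <- df[!is.na(df$score), ]
n_scoped <- nrow(scoped)
```

In this toy data the blanket form keeps one row and the scoped form keeps three, which is the kind of silent difference the rule is intended to expose.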
Flag stochastic function calls (kmeans, umap, Rtsne, tsne) that lack a
nearby set.seed(...) call.
Description
v1 heuristic: if the file contains any set.seed(...) call, suppress all
findings for that file. Otherwise emit one Lint per known stochastic call.
The waiver comment ANALYSIS_OK[random-seed-only] is handled by the
orchestrator's apply_waivers, not here.
Usage
unseeded_stochastic_linter()
Details
Known stochastic call names: kmeans, umap (covers uwot::umap –
SYMBOL_FUNCTION_CALL matches the bare function name even under pkg::),
Rtsne, tsne.
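A short illustration of why the seed matters, using base kmeans() on the built-in iris data. The seed value is arbitrary.

```r
x <- as.matrix(iris[, 1:4])

# Same seed before each call gives the same random initial centres.
set.seed(1)
fit_a <- kmeans(x, centers = 3)

set.seed(1)
fit_b <- kmeans(x, centers = 3)

same_assignment <- identical(fit_a$cluster, fit_b$cluster)
```

Without the set.seed() calls the two fits may start from different centres and can label or even partition the points differently between runs.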
Flag yaml::read_yaml(...) / yaml::yaml.load_file(...) /
jsonlite::fromJSON(...) results that are later accessed via
$ or [[ ]] without an intervening validator call.
Description
Track variables whose value comes from one of the loader calls
above. If the variable is then read by <var>$<key> or
<var>[["<key>"]], flag the loader call. If the variable is
passed to a validate_* or _schema function before being
read, no finding fires.
Usage
unvalidated_config_linter()
Details
Ported from the Python rule unvalidated-config.
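A sketch of the validate-before-read pattern. The validator name validate_config and the config keys are hypothetical, and a plain list stands in for the result of a loader call such as yaml::read_yaml() so the example runs without extra packages.

```r
# Hypothetical validator; the name follows the validate_* shape.
validate_config <- function(cfg) {
  stopifnot(
    is.list(cfg),
    is.numeric(cfg$min_cells),
    length(cfg$min_cells) == 1,
    cfg$min_cells > 0
  )
  invisible(cfg)
}

# Stand-in for a loaded config (assumed keys).
cfg <- list(min_cells = 50, assay = "RNA")

# Validate first, then read individual keys.
validate_config(cfg)
min_cells <- cfg$min_cells
```

A mistyped or missing key then fails at load time with a clear error instead of propagating NULL into the analysis.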
Flag suppressWarnings(...) and suppressMessages(...) calls.
Description
Blanket suppression hides diagnostics that often signal real
problems (e.g. glm.fit non-convergence, NA coercion, dropped
factor levels). Narrow, justified suppressions should be paired
with an ANALYSIS_OK[warning-suppression] waiver comment; the
orchestrator's waiver layer handles silencing those.
Usage
warning_suppression_linter()
Details
v1 scope: only the suppress* functions. options(warn = -1)
is deferred.
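As one alternative to blanket suppression, a calling handler can muffle only warnings that match an expected message and let everything else through. The helper muffle_matching, the noisy_step function, and its messages are hypothetical. The outer handler exists only to record what escaped the filter for this demonstration.

```r
# Muffle only warnings whose message matches `pattern`.
muffle_matching <- function(expr, pattern) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(pattern, conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Hypothetical step that emits one expected and one unexpected warning.
noisy_step <- function() {
  warning("expected: column 'batch' recoded")
  warning("unexpected: 3 rows had negative counts")
  42
}

# Collect whatever the narrow filter lets through.
seen <- character(0)
result <- withCallingHandlers(
  muffle_matching(noisy_step(), "^expected:"),
  warning = function(w) {
    seen <<- c(seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

Only the unexpected warning reaches the outer handler, so the diagnostic that signals a real problem stays visible.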