Help for package scilintr

Title:

Scientific Code Lint for R Analyses

Version:

0.1.1

Description:

Static analysis for R scientific data analysis code. Flags patterns that often correspond to hidden scientific commitments – silent error swallowing, smuggled defaults, label leakage in selection-stage code, magic-eps floors in 'BIC' formulas, and shadow-overwrite of sourced helpers. Designed for agentic coding workflows; high recall over precision; structured 'ANALYSIS_OK' waivers as the audit trail.

License:

MIT + file LICENSE

URL:

https://github.com/arjunrajlaboratory/scilintr

BugReports:

https://github.com/arjunrajlaboratory/scilintr/issues

Encoding:

UTF-8

Depends:

R (≥ 4.1)

Imports:

lintr (≥ 3.0.0), xml2, xmlparsedata, yaml

Suggests:

testthat (≥ 3.0.0), roxygen2

Config/testthat/edition:

Config/roxygen2/version:

8.0.0

NeedsCompilation:

Packaged:

2026-06-06 12:33:15 UTC; arjunraj

Author:

Arjun Raj [aut, cre]

Maintainer:

Arjun Raj <arjunraj@seas.upenn.edu>

Repository:

CRAN

Date/Publication:

2026-06-12 11:20:02 UTC

Construct a Finding record.

Description

The unified finding schema is shared with the Python linter so downstream tooling (CI reporters, agent prompts) is language-agnostic.

Usage

Finding(
  file,
  line,
  rule,
  message,
  severity = "warning",
  suggested_fix = NA_character_,
  waiver_status = "none"
)

Shared trivial-literal allowlist.

Description

These values are too generic to count as scientifically meaningful: loop sentinels, length/presence checks, polarity flags, common Inf fences, NA fillers, and booleans. Multiple rules consult this set to avoid over-flagging idiomatic R.

Usage

TRIVIAL_LITERALS

Details

Used by R001 (positional access – drop = FALSE etc.) and R002 (magic threshold – length(x) > 0L etc.). Rule R024 (smuggled default) uses an equivalent inline allowlist; consolidate if/when extending.

Flag `GetAssayData(obj)` calls missing explicit `⁠assay=⁠` and `⁠layer=⁠`.

Description

Seurat::GetAssayData() silently falls back to the object's "default assay" and "default layer", which depend on global state set elsewhere in the pipeline. Pulling expression values without naming the assay/layer is a common source of "I got counts when I wanted data" bugs. Require at least one of ⁠assay=⁠ or ⁠layer=⁠ to be named at the call site.

Usage

ambiguous_layer_linter()

Filter findings by removing those covered by a nearby ANALYSIS_OK waiver.

Description

Filter findings by removing those covered by a nearby ANALYSIS_OK waiver.

Usage

apply_waivers(findings, file_text)

R040 – "no-circularity" / "blind" name antipattern

Description

A function whose name contains blind, no_circularity, unsupervised, label_free, independent, or honest is asserting that it operates without consulting labels / ground truth. If the body nevertheless references known label columns (e.g. legacy_branch2_clone, truth_c1, cell_type) or label files (e.g. cell_labels.tsv), the lexical promise of the name is contradicted by the body – a classic source of silent circularity in unsupervised analyses.

Usage

blind_name_antipattern_linter()

Details

Detection (v1, hardcoded patterns):

function defs found via ⁠<- function(...)⁠, ⁠<<- function(...)⁠, or ⁠= function(...)⁠ (the equal_assign form).
a function name matching (blind|no_circularity|unsupervised|label_free|independent|honest) is required to trigger inspection.
body label refs: ⁠$<label>⁠ where the symbol matches a known label column, or any STR_CONST mentioning cell_labels / ⁠labels.<ext>⁠.

Reports at the line of the offending label reference (not the function definition), so a waiver placed near the body works as expected. A ⁠# ANALYSIS_OK[blind-name]:⁠ comment within the waiver window suppresses the finding via the shared waiver layer.

Flag `try(..., silent = TRUE)`.

Description

try(expr, silent = TRUE) swallows errors without logging or rethrowing – the caller continues with a try-error object that often gets coerced or ignored downstream. v1 detects exactly this pattern; broader fallback heuristics are deferred.

Usage

broad_exception_linter()

Details

This is distinct from R030, which targets ⁠tryCatch(..., error = function(e) <literal>)⁠ handlers.
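
As a minimal base-R sketch of why the pattern is risky, the snippet below shows what a caller actually receives from try(..., silent = TRUE), next to a louder alternative that adds context and rethrows. The helpers risky_mean and checked_mean are hypothetical names used only for illustration; they are not part of scilintr.

```r
# Hypothetical helper that fails on non-numeric input.
risky_mean <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  mean(x)
}

# Flagged pattern: the error is swallowed and the caller receives a
# "try-error" object instead of a number.
res <- try(risky_mean("a"), silent = TRUE)
inherits(res, "try-error")

# Louder alternative: add context and rethrow so the pipeline stops.
checked_mean <- function(x) {
  tryCatch(
    risky_mean(x),
    error = function(e) {
      stop("risky_mean failed: ", conditionMessage(e), call. = FALSE)
    }
  )
}
```

Nothing in the first form forces the caller to inspect res, which is how the try-error object ends up coerced or ignored downstream.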

Build a project-wide index for cross-file analysis.

Description

Walks every .R file once and collects the three record types listed under Details.

Usage

build_project_index(files, config = NULL)

Details

top-level function definitions (name, file, line)
source() edges (file, sourced_file)
stage tags per file (from ⁠# STAGE: <name>⁠ headers)

All file paths are normalized to absolute paths so cross-file joins match cleanly regardless of how the caller passed them in.

Flag `w1*x1 + w2*x2 + w3*x3 + ...` composite scores with literal weights.

Description

Hand-tuned weights on composite scores are a provenance landmine: each literal weight is a degree of freedom, and three or more compounded weights make post-hoc tuning trivially deniable. We flag any +-joined expression with three-or-more terms where at least one term has an explicit numeric coefficient (⁠<NUM_CONST> * <expr>⁠) or is itself a bare ⁠<NUM_CONST>⁠.

Usage

composite_weights_linter()

Details

Allowed via ANALYSIS_OK[composite-weights] waiver that cites the weight-provenance record and a sensitivity check.

Detection:

Find each top-level + ⁠<expr>⁠ (i.e., an ⁠<expr>⁠ whose direct children include OP-PLUS but whose parent does NOT, so we only anchor on the outermost + in a chain).
Recurse through the + tree to enumerate plus-joined leaf terms.
If ⁠>= 3⁠ leaves and at least one leaf is ⁠<NUM_CONST> * <expr>⁠ or a literal ⁠<NUM_CONST>⁠, flag the outer expression.
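
The three detection steps can be sketched on base-R language objects. This is an illustration only: the linter itself works on the XML parse tree, whereas this sketch substitutes quote()d expressions, and the function names plus_leaves, is_weighted_leaf, and looks_like_composite are hypothetical. Parenthesised sub-expressions are not unwrapped here.

```r
# Collect the leaf terms of a chain of binary `+` calls.
plus_leaves <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3L) {
    c(plus_leaves(e[[2]]), plus_leaves(e[[3]]))
  } else {
    list(e)
  }
}

# A leaf counts as "weighted" if it is a bare number or `<number> * <expr>`.
is_weighted_leaf <- function(leaf) {
  if (is.numeric(leaf)) return(TRUE)
  is.call(leaf) && identical(leaf[[1]], as.name("*")) &&
    length(leaf) == 3L && is.numeric(leaf[[2]])
}

# Three or more plus-joined leaves, at least one of them weighted.
looks_like_composite <- function(e) {
  leaves <- plus_leaves(e)
  length(leaves) >= 3L && any(vapply(leaves, is_weighted_leaf, logical(1)))
}

looks_like_composite(quote(0.5 * a + 0.3 * b + 0.2 * c))  # three weighted terms
looks_like_composite(quote(a + b))                        # only two terms
```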

R039 – Recursive calls with constant hyperparameters across depth

Description

Flag a recursive function whose self-call passes one of the function's own formal arguments through unchanged. When a hyperparameter that was tuned at the root is inherited verbatim at every recursion depth, the parent-tuned gate becomes a smuggled default at child nodes – the resulting null at depth then reads as "no further structure" when it really reflects parent-tuned parameters being applied to a different sub-population (see analysis_lint_strategy.md section 39).

Usage

constant_gates_recursion_linter()

Details

V1 is intentionally conservative and file-local: it finds top-level function definitions X <- function(...) { ... }, looks for direct self-calls X(...) in the body, and flags when any argument expression is a bare SYMBOL whose name matches one of X's formal arguments. A bare formal pass-through (recurse(child, gates)) fires; a transformed value (recurse(child, child_gates)) does not.

Allowed with ⁠# ANALYSIS_OK[constant-gates-at-depth]⁠ waiver.

Registry of cross-file (project-level) rules.

Description

Each entry is ⁠R<NN> = function(idx) <findings>⁠. The rule receives the project index built by build_project_index() and returns a list of scilintr_finding records.

Usage

cross_file_rules()

Details

Add new cross-file rules here as they are implemented.

Detect the `⁠# STAGE: <name>⁠` tag in a file's first 10 lines.

Description

Detect the ⁠# STAGE: <name>⁠ tag in a file's first 10 lines.

Usage

detect_file_stage(path)

Arguments

path

Path to an R file.

Value

The stage name as a string, or NA_character_ if no tag.
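
A rough sketch of the same idea in base R follows. The function name sketch_detect_stage and the exact regular expression are assumptions made for illustration; the package's own matching rules may differ (for example in how whitespace or trailing text is treated).

```r
# Hypothetical re-implementation: look for "# STAGE: <name>" in the
# first 10 lines and return <name>, or NA_character_ if absent.
sketch_detect_stage <- function(path) {
  head_lines <- readLines(path, n = 10L, warn = FALSE)
  pattern <- "^\\s*#\\s*STAGE:\\s*(\\S+)"
  hits <- regmatches(head_lines, regexec(pattern, head_lines))
  hits <- Filter(function(m) length(m) == 2L, hits)  # keep matching lines only
  if (length(hits) == 0L) return(NA_character_)
  hits[[1]][2]                                       # first capture group
}

tmp <- tempfile(fileext = ".R")
writeLines(c("# STAGE: selection", "x <- 1"), tmp)
stage <- sketch_detect_stage(tmp)
unlink(tmp)
```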

R027 – Asymmetric env_* validators within a single file

Description

Flag when a single source file defines two or more ⁠env_*⁠-named helper functions that mix two failure styles: some return a default (silent fall-through) while others halt loudly via stop(). Inconsistency here breeds silent misconfiguration: callers cannot predict whether a missing env var crashes or quietly drifts.

Usage

env_validator_asymmetry_linter()

Details

Pragmatic v1: regex over the file text plus a top-level XML check. Emit a single lint at the line of the FIRST ⁠env_*⁠ function definition in the file (the source_expression covering that def is the only one that emits, to keep results stable across lintr's per-expression model).

Top-level function definitions in a parsed file.

Description

Handles both ⁠name <- function(...)⁠ and ⁠name = function(...)⁠. Skips ⁠->⁠ (right-assign) and ⁠<<-⁠ (super-assign) for v0; those are rare in research code and produce a different XML shape.

Usage

extract_fn_defs(xml, file_abs)

`source()` / `sys.source()` edges out of a file.

Description

Only handles the literal-string form (source("file.R")); dynamic source(get(...)) etc. is ignored. Paths are resolved relative to the consumer file's directory and normalized to absolute.

Usage

extract_sources(xml, file_path, file_abs)

Apply waivers to a list of cross-file Findings.

Description

Reads each finding's file from disk (cached across findings to the same file) and removes findings covered by a nearby ANALYSIS_OK[...].

Usage

filter_waivers_cross_file(findings)

Flag inline `design = ~ ...` formula literals.

Description

v1 heuristic: find any named-argument ⁠design = <expr>⁠ where the RHS expression is a tilde formula literal (⁠<OP-TILDE>⁠ in the XML parse tree). Variables resolving to formulas (e.g., design = design_formula) are not flagged – only inline tildes. Waiver suppression (ANALYSIS_OK[contrast-definition]) is handled by the dispatcher; this rule always fires on inline tildes.

Usage

hardcoded_design_formula_linter()

Flag `c("S17", "S23", ...)` style hardcoded sample-ID exclusion lists.

Description

Heuristic (v1): any c(...) call that contains two or more STR_CONST children whose unquoted value matches ⁠^[A-Z]+\\d+$⁠ (e.g., "S17", "A191", "P001") is flagged at the line of the c() call. This catches the canonical exclude <- c("S17", "S23") pattern from the strategy doc. The rule fires regardless of waivers; waiver suppression is applied by the dispatcher reading ANALYSIS_OK[sample-exclusion] ledger comments – so good_ledgered.R still emits a Lint here.

Usage

hardcoded_sample_id_linter()
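
The pattern quoted in the heuristic can be tried directly in base R. The vector ids below is made-up illustration data, and perl = TRUE is used here only so that \d is interpreted unambiguously; how the linter itself applies the regex is not shown in this manual.

```r
# Made-up candidate strings, as they might appear inside a c(...) call.
ids <- c("S17", "A191", "P001", "treated", "s17", "S17b")

# Uppercase letters followed by digits, nothing else.
looks_like_id <- grepl("^[A-Z]+\\d+$", ids, perl = TRUE)
looks_like_id

# The heuristic needs two or more such strings in the same c(...) call.
would_flag <- sum(looks_like_id) >= 2L
```

Lowercase prefixes and trailing letters do not match, so the pattern is narrower than "anything that looks like a sample name".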

Flag implicit file selection patterns.

Description

Two patterns flagged:

String literals containing suspicious filename tokens (latest, old, old, backup, previous, tmp_, tmp, temp, temp, copy, _copy, final_final, archive). Case-insensitive.
mtime-based file picking: an expression containing both file.info and mtime symbols (e.g. files[which.max(file.info(files)$mtime)]).

Usage

implicit_file_selection_linter()

Details

Waivers (⁠# ANALYSIS_OK[...]⁠) are handled by the harness layer; this linter just emits Lints.

Flag references to label / ground-truth column names in files that are tagged `⁠# STAGE: selection⁠` (or any non-evaluation blind stage).

Description

Selection-stage code (PCA, HVG, clustering, embedding, etc.) must be blind to outcome/label columns; touching metadata$treatment while choosing genes leaks the answer key into the unsupervised pipeline.

Usage

label_in_blind_stage_linter()

Details

Stage detection: scan the first 10 lines of the file for a comment ⁠# STAGE: <name>⁠. If ⁠<name>⁠ is evaluation (or no tag is present) the linter does not fire. Anything else (including selection) is treated as a blind stage.

Detection patterns (label names hardcoded for v1):

⁠metadata$<label>⁠ – OP-DOLLAR followed by SYMBOL matching a label
df[["<label>"]] – string literal inside ⁠[[ ]]⁠ matching a label
⁠pull(<label>)⁠ – bare SYMBOL argument to pull()

Flag any reference to label / ground-truth columns from inside a file tagged `⁠# STAGE: selection⁠`.

Description

Selection-stage code (module scoring, clustering, embedding, calling filters, etc.) must be blind to outcome labels. The project enforces a two-file split: selected_calls.csv (labels-free) is written by the selection-stage script, and selected_calls_evaluated.csv is produced downstream by an evaluation-stage script that joins labels. Even a read-only label reference in a selection-stage file makes the leak "one careless edit away."

Usage

label_ref_in_selection_linter()

Details

Stage detection: scan the first 10 lines of the file for a comment ⁠# STAGE: <name>⁠. The linter fires only when ⁠<name>⁠ is exactly selection. Other stages (including untagged files) are handled by other rules (e.g. R012 covers all blind stages).

Detection patterns (label names hardcoded for v1):

⁠df$<label>⁠ – OP-DOLLAR followed by SYMBOL matching a label
df[["<label>"]] – string literal inside ⁠[[ ]]⁠ matching a label
⁠pull(<label>)⁠ – bare SYMBOL argument to pull() / select()

Flag a `data.frame(...)` literal that mixes label columns with computed-score columns and is subsequently written to disk via `write.csv` / `write_csv` from within a `⁠# STAGE: selection⁠` file.

Description

Rationale: a CSV that pairs labels with discovery scores propagates the leakage risk downstream – any later script that re-reads the file is "one careless edit away" from training on labels. The project-prescribed remediation is the two-file split (selection writes selected_calls.csv, labels-free; evaluation joins labels in a separate stage). See analysis_lint_strategy.md #34.

Usage

label_score_coresidence_linter()

Details

v1 detection (single file):

Stage gate – only fire when the first 10 lines contain ⁠# STAGE: selection⁠.
Find data.frame(...) calls whose argument list contains both a SYMBOL_SUB matching a label name and a SYMBOL_SUB matching a score-name regex.
Require the file to also contain a write.csv / write_csv / readr::write_csv call (proxy for "written to disk"). Flag at the line of the data.frame call.

Flag `read.csv` / `read_csv` / `read.table` / `read.delim` / `fread` calls in a `⁠# STAGE: selection⁠` file whose path argument's basename matches a label-tainted pattern (e.g. `⁠gt_⁠`, `⁠_oracle_⁠`, `⁠_recall_*⁠`, etc.).

Description

Even when a selection-stage script only reads one "harmless" column from such a file, the rows / panel that ended up in the CSV were chosen using labels – so the downstream selection is laundered through the file name. See analysis_lint_strategy.md #35.

Usage

label_tainted_input_linter()

Details

Stage gate: the linter scans the first 10 lines for a ⁠# STAGE: <name>⁠ marker and fires only when ⁠<name>⁠ is exactly selection. Untagged or otherwise-tagged files are left to other rules.

Waivers (e.g. ANALYSIS_OK[oracle-file-read]) are honoured by the waiver layer; this linter just emits the raw finding.

Flag `order(...)` / `arrange(...)` calls whose secondary sort key references a label-named column (e.g. `is_gt_label`, `truth_c1`, `is_target`). A secondary key drives ranking for ties – if the tie-breaker is the answer key, the ranking is leaky.

Description

Detection:

Call is order(...) or arrange(...).
Walk all arguments after the first; if any of their descendant SYMBOLs match the label-name regex below, flag.
Match is case-insensitive and matches names containing is_gt, gt_label, truth, is_target, or trailing ⁠_label⁠.

Usage

label_tiebreak_linter()
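
To see why a label-named secondary key matters, consider the small made-up table below (the data frame calls and its columns are hypothetical). With tied scores, the tie-breaker alone decides which row ranks first.

```r
calls <- data.frame(
  id          = c("a", "b", "c", "d"),
  score       = c(2, 2, 1, 1),
  is_gt_label = c(FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Flagged shape: the label column is the secondary sort key, so within
# each tie the labelled row is promoted.
leaky <- calls[order(-calls$score, -calls$is_gt_label), ]
leaky$id

# Label-free tie-breaker: ties are resolved by id instead.
neutral <- calls[order(-calls$score, calls$id), ]
neutral$id
```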

Lint a single R file.

Description

Runs every registered per-file linter against path, converts the resulting lintr::Lint objects to scilintr_finding records, and applies the ANALYSIS_OK[...] waiver filter.

Usage

lint_file(path, config = NULL)

Arguments

path

Path to a .R file.

config

Optional configuration list (loaded from .scilintr.yml).

Value

A list of scilintr_finding records.

Examples

# Lint a tiny self-contained file in the session temp directory.
tmp <- tempfile(fileext = ".R")
writeLines("x <- 1", tmp)
findings <- lint_file(tmp)
length(findings)
unlink(tmp)

Lint an entire project directory.

Description

Walks every .R file, runs the per-file linters, then builds the project index and runs the cross-file rules against it.

Usage

lint_project(root = ".", config = NULL)

Arguments

root

Project root directory.

config

Optional configuration list.

Value

A list of scilintr_finding records aggregated across files.

Examples

# Build a throwaway one-file project inside the session temp directory.
proj <- file.path(tempdir(), "scilintr-demo")
dir.create(proj, showWarnings = FALSE)
writeLines("y <- 2", file.path(proj, "analysis.R"))
findings <- lint_project(proj)
length(findings)
unlink(proj, recursive = TRUE)

Convert a single `lintr::Lint` to a `scilintr_finding`.

Description

Pulls the rule ID from lint$linter (set by the registry key when the linter is dispatched). Strips a trailing ⁠_linter⁠ suffix if lintr added one.

Usage

lint_to_finding(lint)

Load scilintr configuration from a project root.

Description

Looks for .scilintr.yml, analysis_labels.yml, and analysis_identifiers.yml at the project root.

Usage

load_config(root = ".")

Flag `⁠log(pmax(x, <tiny-literal>))⁠` and friends.

Description

A floor like 1e-12 or .Machine$double.eps (~2.2e-16) inside a pmax() immediately before log() is a numerical-stability landmine when the floor sits below the data's natural discretisation grid. The floor then dominates the score and confuses ranking comparisons. Use a domain-motivated floor (e.g. half the smallest non-zero increment), not a generic safety constant.

Usage

magic_eps_floor_linter()

Details

Detection:

Outer call is log / log1p / log10 / log2.
Inner call is pmax / pmax.int.
Second pmax argument is either a NUM_CONST with numeric value ⁠< 1e-6⁠, or the expression .Machine$double.eps.
pmax(x, 1 / (2 * N)) and similar compound expressions are not flagged – the second argument is not a single literal.
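
A small numeric sketch of the problem, using made-up values on a 0.01 grid: with a generic 1e-12 floor, an exact zero lands roughly 23 log units below the smallest real observation, whereas a floor of half the smallest non-zero increment keeps the gap at log(2).

```r
# Made-up proportions on a 0.01 grid, including an exact zero.
p <- c(0, 0.01, 0.02, 0.05)

generic <- log(pmax(p, 1e-12))  # flagged pattern: generic safety constant
domain  <- log(pmax(p, 0.005))  # half the smallest non-zero increment

# Gap between the floored zero and the smallest real observation.
gap_generic <- generic[2] - generic[1]  # log(0.01 / 1e-12) = log(1e10)
gap_domain  <- domain[2] - domain[1]    # log(0.01 / 0.005) = log(2)
```

Any score that sums or ranks these log values is then driven mostly by how many zeros a row has, not by the measured values.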

Flag magic numeric thresholds in comparison expressions.

Description

Detects patterns like padj < 0.05, counts > 10, zscores > 3 where a bare numeric literal sits on either side of a comparison operator (<, >, <=, >=, ==, !=). Named constants (e.g. padj < FDR_THRESHOLD) are not flagged because no NUM_CONST appears.

Usage

magic_threshold_linter()

Details

Trivial literals (0, 1, -1, NA, TRUE, FALSE, Inf, etc.) are filtered out – loop sentinels (length(x) > 0L) and presence checks (nrow(df) > 0) are not scientific thresholds.

V1.1 still over-flags relative to the strict spec; legitimate bare-number cases are handled by the orchestrator's ANALYSIS_OK[threshold] waiver.

Main CLI entry point.

Description

Invoked from inst/bin/scilintr or ⁠Rscript -e 'scilintr::main()'⁠.

Usage

main(args = commandArgs(trailingOnly = TRUE))

Arguments

args

Character vector of command-line arguments.

Value

Invisibly returns 0L when no findings are reported and 1L otherwise. Called for its side effect of printing findings.

Examples


# Run the CLI entry point over a throwaway project in tempdir().
proj <- file.path(tempdir(), "scilintr-cli-demo")
dir.create(proj, showWarnings = FALSE)
writeLines("z <- 3", file.path(proj, "analysis.R"))
main(proj)
unlink(proj, recursive = TRUE)

Detect a structured ANALYSIS_OK waiver near a given line.

Description

Scans up to window lines around line_no for a ⁠# ANALYSIS_OK[category]:⁠ comment. Returns the category name if found, NA otherwise. Roxygen comments (⁠#' @...⁠) are ignored so an unrelated tag cannot spoof a waiver.

Usage

nearby_waiver(file_lines, line_no, window = 8L)
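
The scan can be pictured roughly as below. This is a hypothetical sketch, not the package's code: the name sketch_nearby_waiver, the symmetric window (window lines before and after), and the exact regular expression are all assumptions, and file_lines is assumed to be non-empty.

```r
sketch_nearby_waiver <- function(file_lines, line_no, window = 8L) {
  # Clamp the window to the file bounds (assumed symmetric).
  lo <- max(1L, line_no - window)
  hi <- min(length(file_lines), line_no + window)
  candidates <- file_lines[lo:hi]

  # Ignore roxygen comment lines so they cannot spoof a waiver.
  candidates <- candidates[!grepl("^\\s*#'", candidates)]

  # Capture the category inside "# ANALYSIS_OK[category]:".
  pattern <- "#\\s*ANALYSIS_OK\\[([^]]+)\\]:"
  m <- regmatches(candidates, regexec(pattern, candidates))
  m <- Filter(function(x) length(x) == 2L, m)
  if (length(m) == 0L) NA_character_ else m[[1]][2]
}

lines <- c(
  "# ANALYSIS_OK[threshold]: pre-registered FDR cutoff",
  "keep <- padj < 0.05"
)
category <- sketch_nearby_waiver(lines, 2L)
```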

Parse a single file and pull defs, sources, and stage tag.

Description

Parse a single file and pull defs, sources, and stage tag.

Usage

parse_file_info(path, path_abs)

Flag `⁠digest::digest(<single_SYMBOL>)⁠` inside a function with 2+ formals. A cache key that fingerprints a single variable when the enclosing function takes multiple inputs is a partial-fingerprint hazard: un-fingerprinted inputs can change and the cache silently returns stale results.

Description

Heuristic: catches digest::digest(idx_e) but not digest::digest(list(N = N, Y = Y, idx_e = idx_e)) because the latter has a function call (list(...)) as its argument, not a bare SYMBOL.

Usage

partial_cache_fingerprint_linter()

Flag hardcoded patient/sample identifiers (and A191-specific SNP literals) in files declared `⁠# STAGE: library⁠`.

Description

Stage detection: the rule scans the first 10 lines of the file for a ⁠# STAGE: <name>⁠ directive. If ⁠<name>⁠ is not library, the rule returns no findings. This keeps the rule scoped to sample-agnostic library helpers; analysis scripts can legitimately mention "A191", 191L, etc.

Usage

patient_id_in_lib_linter()

Details

v1 forbidden literal set (hardcoded):

NUM_CONST: 191, 191L, 193, 193L
STR_CONST: "A191", "A193"
STR_CONST matching SNP-name pattern ⁠^X\d+\.\d+[A-Z]+\.[A-Z]+$⁠ (e.g., "X17.76565019G.A" – R-mangled SNP-id, which is necessarily dataset-specific and should not live in library code).

Waiver suppression (ANALYSIS_OK[sample-specific-default] and similar) is applied by the orchestrator, not here – this rule fires on every offending literal regardless of nearby waiver comments.

Registry of per-file `lintr::Linter()` factories.

Description

Each entry is ⁠R<NN> = factory()⁠. The registry key becomes the rule ID on every emitted Lint (via lintr's name propagation).

Usage

per_file_linters()

Details

Add new per-file rules here as they are implemented. Cross-file rules live in R/cross_file_rules.R.

Flag `⁠pmax(<expr>, 0)⁠` and `ylim(0, ...)`.

Description

Both patterns silently clip negative values, hiding informative signed ranges (e.g. ARI dropping below 0). For v1 we look for the literal 0 argument; later versions may also catch pmin(x, 1), coord_cartesian(ylim = c(0, NA)), etc.

Usage

plot_clip_linter()

Flag filtering assignments whose LHS name starts with `plot_`.

Description

Pattern:

  plot_df <- de_results[de_results$padj < 0.05, ]
  plot_df <- df %>% filter(...)

Usage

plot_data_filter_linter()

Details

Visual filtering (subsetting for a single chart) silently mutates the analysis population if the same plot_df is later treated as the DE result. We flag the assignment; a # ANALYSIS_OK[plot-filter]: waiver on a neighbouring line suppresses the finding. That suppression happens in the shared waiver layer, not in this linter.

Flag raw positional dataframe access by integer literal.

Description

Detects patterns like metadata[, 4], df[[3]], and df[, 2:5] where a bare integer literal sits in the column index slot of a single-bracket access (after the first comma) or as the sole index of a double-bracket access. Named-constant indices (e.g. metadata[, TREATMENT_COL_INDEX]) use SYMBOL, not NUM_CONST, and are not flagged.

Usage

positional_access_linter()

Details

Trivial literals (TRUE, FALSE, NA, 0, 1, Inf, etc.) are filtered out – they parse as NUM_CONST in xmlparsedata but are never positional indices. The drop = FALSE argument of df[, j, drop = FALSE] is the canonical false-positive this guards against.

V1.1 is still trigger-happy on real integer indices. Legitimate uses are handled by the orchestrator's ANALYSIS_OK[positional-access] waiver layer.

Flag positional re-indexing of a dataframe by row/column count of another.

Description

Detects patterns like metadata <- metadata[1:ncol(counts), ] or metadata[seq_len(ncol(counts)), ] where rows of one structure are positionally trimmed/aligned to the column or row count of another. Such alignment is fragile: it silently succeeds even when the two structures are in different orders. ID-based alignment (e.g. counts[, metadata$sample_id]) is preferred.

Usage

positional_alignment_linter()

Details

V1 finds any single-bracket access whose row-index expression contains a call to ncol, nrow, or seq_len. Legitimate uses are handled by the orchestrator's ANALYSIS_OK[id-alignment] waiver layer.
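
The fragility is easy to reproduce with made-up data in base R. In the sketch below, counts and metadata are hypothetical objects whose samples are deliberately in different orders; the positional trim runs without any error or warning, while ID-based column selection lines the two structures up.

```r
# Count matrix with samples in the order s3, s1, s2.
counts <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("g1", "g2"), c("s3", "s1", "s2"))
)

# Metadata in a different order: s1, s2, s3.
metadata <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  treatment = c("ctrl", "ctrl", "drug"),
  stringsAsFactors = FALSE
)

# Flagged pattern: succeeds silently, but rows and columns do not correspond.
trimmed <- metadata[1:ncol(counts), ]
positional_ok <- identical(trimmed$sample_id, colnames(counts))

# ID-based alignment: reorder the count columns by sample ID.
counts_aligned <- counts[, metadata$sample_id]
id_ok <- identical(colnames(counts_aligned), metadata$sample_id)
```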

Flag `read.csv()` then string-keyed column lookup with mangled chars.

Description

R's read.csv() / read.table() rewrite column names: -, >, : become ., and names starting with a digit get an X prefix. A subsequent df[["17-38733306C>T"]] or df$`17-38733306C>T` then silently returns NULL with no error or warning. Passing check.names = FALSE (or using readr::read_csv) avoids the rewrite.

Usage

readcsv_mangling_linter()

Details

Two-phase, single-pass detection:

File-level: if the script contains no read.csv / read.table call, or every such call passes check.names = FALSE, bail.
Per-expression: flag string-literal column lookups (df[["..."]]) and backtick-symbol lookups (df$`...`) whose name carries -, >, :, or a leading digit.

Both df[["..."]] (STR_CONST) and df$`...` (OP-DOLLAR / SYMBOL) forms are handled.
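
The rewrite and the silent NULL can be reproduced with an inline CSV in base R. The two-row table below is made-up illustration data.

```r
csv_text <- "sample,17-38733306C>T\ns1,0\ns2,1"

# Default read: the column name is rewritten by check.names = TRUE.
mangled <- read.csv(text = csv_text)
names(mangled)

# The original name no longer exists, and the lookup returns NULL
# without any error or warning.
lookup_is_null <- is.null(mangled[["17-38733306C>T"]])

# With check.names = FALSE the name survives and the lookup works.
intact <- read.csv(text = csv_text, check.names = FALSE)
intact[["17-38733306C>T"]]
```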

Flag `if (!file.exists(path)) return(NULL/NA)`.

Description

This is a silent fallback wearing a different costume – the explicit fallback rule (R007) catches try(...) swallowing, but it doesn't catch the cleaner-looking if (!file.exists(path)) return(NULL). Same end state: the missing input is treated as an acceptable empty signal and propagates downstream.

Usage

return_null_on_missing_linter()

Details

Detection: an ⁠if(!file.exists(...))⁠ whose body is – or contains – a return(NULL), return(NA), return(NA_real_), etc. Returns of other values (return(0), return(x)) are not flagged.

Ported from the Python rule return-none-on-missing-input.
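
A minimal sketch of the flagged shape next to a louder alternative. Both loader functions and the path are hypothetical names chosen for illustration.

```r
# Flagged shape: a missing input quietly becomes NULL.
load_counts_quiet <- function(path) {
  if (!file.exists(path)) return(NULL)
  read.csv(path)
}

# Louder alternative: a missing input stops the pipeline.
load_counts_strict <- function(path) {
  if (!file.exists(path)) {
    stop("input file not found: ", path, call. = FALSE)
  }
  read.csv(path)
}

missing_path <- file.path(tempdir(), "scilintr-sketch-does-not-exist.csv")
quiet_result <- load_counts_quiet(missing_path)
is.null(quiet_result)
```

With the quiet version, downstream code such as nrow(quiet_result) also returns NULL rather than failing, so the missing input can travel a long way before anyone notices.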

Flag a top-level function definition that shadows a name imported via `source()`.

Description

Joins the project's defs table (one row per top-level function definition) with its sources table (one row per source() edge) and looks for names defined in both the consumer file and the sourced file. The finding lands on the consumer's redefinition line.

Usage

rule_R020_shadow_overwrite(idx)

Flag function names defined in more than one file.

Description

Groups idx$defs by function name; if a name has definitions in more than one file, emits a Finding at each definition site so the user sees every drift location. Waivers (ANALYSIS_OK[...]) at any individual site are applied by the cross-file waiver filter and can suppress that site independently (e.g. for intentional v1/v2 kept alongside each other).

Usage

rule_R025_def_drift(idx)

Flag top-level function definitions that have no callers in any other project file.

Description

For each function in idx$defs, scans the text of every other .R file in idx$files for a token matching ⁠\bfn_name\s*\(⁠. If no such call site is found, the definition is reported as dead code at its def line. Callers within the defining file are tolerated (they usually indicate internal helpers); intentional API stubs should carry an ANALYSIS_OK[unused-fn] waiver near the def. This is a v1 textual check, so matches inside comments or strings are not excluded.

Usage

rule_R026_dead_code(idx)
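
The textual check can be approximated in a few lines of base R. The helper has_call_site is a hypothetical name, and the sketch assumes fn_name contains no regex metacharacters (a dotted name such as read.csv would need escaping first).

```r
# TRUE if any element of `text` contains "<fn_name>(" as a whole-word call.
has_call_site <- function(fn_name, text) {
  pattern <- paste0("\\b", fn_name, "\\s*\\(")
  any(grepl(pattern, text, perl = TRUE))
}

has_call_site("normalize", "y <- normalize(x)")             # real call site
has_call_site("normalize", "y <- renormalize(x)")           # different function
has_call_site("normalize", "# normalize(x) is deprecated")  # comment still matches
```

The last call illustrates the v1 limitation noted above: a mention inside a comment counts as a caller, so such a function would not be reported as dead.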

Run every registered cross-file rule against the project index.

Description

Run every registered cross-file rule against the project index.

Usage

run_cross_file_rules(idx)

Flag `set.seed(...)` calls inside a function literal.

Description

v1 heuristic per analysis_lint_strategy.md R022: a set.seed() that is not at script top level – i.e. nested inside any enclosing ⁠function(...) <body>⁠ – pollutes the global RNG state when the function is invoked from a loop or parallel worker. The reproducibility contract belongs to the dispatcher (top-level seed or L'Ecuyer streams), not the per-task callee.

Usage

seed_in_loop_linter()

Details

Detection: any ⁠<SYMBOL_FUNCTION_CALL>set.seed</SYMBOL_FUNCTION_CALL>⁠ that has an ancestor ⁠<expr>⁠ containing a ⁠<FUNCTION>⁠ child.
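
One visible consequence of the flagged pattern, sketched in base R with hypothetical task functions: when the callee reseeds on every invocation, repeated calls from a loop return the same draw each time.

```r
# Flagged shape: the callee resets the global RNG on every call.
noisy_task <- function(i) {
  set.seed(42)
  rnorm(1)
}
draws_bad <- vapply(1:3, noisy_task, numeric(1))
length(unique(draws_bad))  # every iteration produced the same value

# Seed owned by the dispatcher: set once at top level, callee stays clean.
clean_task <- function(i) rnorm(1)
set.seed(42)
draws_ok <- vapply(1:3, clean_task, numeric(1))
length(unique(draws_ok))
```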

Flag `⁠<name> <- <expr> != ""⁠` / `⁠<name> <- <expr> == ""⁠`.

Description

The mask is the partitioner downstream code uses; treating "" as a missingness sentinel hides the upstream contract and silently drops or keeps rows depending on which side of the comparison the empty string falls on.

Usage

sentinel_mask_linter()

Details

Scope, on purpose:

Only the empty-string sentinel is matched. ⁠!= 0⁠ / !is.na(x) are out of scope (too common and usually legitimate).
Only top-level assignments to a plain symbol. The inline form df[df$col != "", ] is R005 (unledger-filter) territory.
Composition with &, |, !, and parentheses is unwrapped so a compound mask like (df$a != "") & (df$b != "") still fires.

Ported from the Python rule sentinel-mask-assignment.

Flag a `tryCatch(..., error = ...)` handler that silently degrades.

Description

The silent-fallback family comes in three costumes (return, rebind, and stub, described under Details), all sharing one hidden commitment – "on failure this quietly proceeds on a meaningless value instead of stopping".

Usage

silent_trycatch_linter()

Details

return: the handler's return value is a bare literal (NUM_CONST, NULL_CONST, STR_CONST – NA, NA_real_, NULL, 0, "", ...), whether returned directly (function(e) NA) or as the last statement of a multi-statement block (function(e) { warning(...); NA }). Caller-side code then continues with a numerically-valid stand-in.
rebind: the handler superassigns (⁠<<-⁠) an outer name to a degraded default (cohort <<- NULL), so downstream stages run on a placeholder. ⁠<<-⁠ escapes the handler scope, so it is flagged wherever it sits in the body.
stub: the handler superassigns an outer name to a no-op stub function (score_fn <<- function(...) NULL), silently disabling behavior on the failure path.

Doing real work and returning a genuine recovered value (a cached object, an alternate computation, stop(e) to rethrow) is left alone – only bare placeholders and no-op stubs are flagged. Local (⁠<-⁠) rebinds are not flagged: in R they die with the handler frame and have no external effect.
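
The shapes above can be sketched in base R as follows. fit_model, cohort, and est_loud are hypothetical names used only to make the snippet runnable; fit_model always fails so that the handlers are exercised.

```r
# Hypothetical step that always fails.
fit_model <- function(x) stop("singular design")

# "return" costume: the handler hands back a bare placeholder.
est_quiet <- tryCatch(fit_model(1), error = function(e) NA_real_)
is.na(est_quiet)

# "rebind" costume: the handler superassigns a degraded default.
cohort <- "loaded"
invisible(tryCatch(fit_model(1), error = function(e) cohort <<- NULL))
is.null(cohort)

# Left alone by the rule: the handler rethrows, so the caller sees the error.
est_loud <- function(x) {
  tryCatch(fit_model(x), error = function(e) stop(e))
}
```

In the first two cases the script keeps running with NA or NULL standing in for a result that was never computed.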

R024 – Smuggled function-signature defaults

Description

Flag ⁠function(arg = <literal>)⁠ where the literal is "interesting": a NUM_CONST whose value is not in ⁠{0, 1, -1, NA*, Inf, -Inf, TRUE, FALSE}⁠ (with optional L suffix), or a STR_CONST whose unwrapped value is not the empty string.

Usage

smuggled_default_linter()

Details

Boring defaults (0, 1, NA, NULL, TRUE, "") are sentinels, not scientific choices. Calls (e.g. c("a","b")) and absent defaults are not literals and are skipped by construction.

Stage tag row for the index – wraps `detect_file_stage()`.

Description

Stage tag row for the index – wraps detect_file_stage().

Usage

stage_row(path, path_abs)

Flag `⁠if (file.exists(<X>)) return(read*(...))⁠` single-statement bodies.

Description

A cache short-circuit that returns a previously serialized result without comparing an input fingerprint is a stale-cache hazard: if upstream inputs change, the cached value is silently returned and the analysis quietly drifts from its inputs.

Usage

stale_cache_linter()

Details

Detection (v1): an if whose condition calls file.exists(...) and whose body is a single statement calling return(readRDS(...)) (or read_rds, read.csv, read_csv). Multi-statement {...} bodies are spared on the assumption that the extra statements implement a fingerprint check – the orchestrator's waiver layer handles any remaining false positives.
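
One possible shape for a fingerprint-checked cache, sketched with base R and tools::md5sum. The function cached_summary, the column name x, and the choice of an MD5 checksum of the input file as the fingerprint are all assumptions for illustration; other fingerprints may suit a given pipeline better.

```r
# Hypothetical cached computation keyed on the input file's checksum.
cached_summary <- function(input_path, cache_path) {
  key <- unname(tools::md5sum(input_path))
  if (file.exists(cache_path)) {
    cached <- readRDS(cache_path)
    # Reuse the cache only if it was built from identical input bytes.
    if (identical(cached$key, key)) return(cached$value)
  }
  value <- sum(read.csv(input_path)$x)
  saveRDS(list(key = key, value = value), cache_path)
  value
}

input <- tempfile(fileext = ".csv")
cache <- tempfile(fileext = ".rds")

write.csv(data.frame(x = c(1, 2)), input, row.names = FALSE)
first <- cached_summary(input, cache)   # computed and cached

write.csv(data.frame(x = c(10, 20)), input, row.names = FALSE)
second <- cached_summary(input, cache)  # input changed, so recomputed

unlink(c(input, cache))
```

A bare if (file.exists(cache)) return(readRDS(cache)) would have returned the first value on the second call.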

R038 – Symmetric "best of either side" reporting

Description

Flag pmax(...), max(...), or which.max(c(...)) whose argument subtree contains two or more SYMBOLs whose names look like side/polarity labels (⁠target_*⁠, ⁠rest_*⁠, ⁠left_*⁠, ⁠right_*⁠, ⁠*_aligned⁠, ⁠*_complement⁠, ⁠*_c1⁠, ⁠*_side⁠).

Usage

symmetric_best_linter()

Details

Picking "the better of two label-named sides" after labels are joined is a hidden test-multiplication / label-aware fishing pattern. Fix is to pre-declare which side is the target side via a label-independent rule and freeze the orientation before label joins (see analysis_lint_strategy.md section 38).

V1 is file-local and intentionally conservative: it fires only when the heuristic finds at least two side-label-shaped SYMBOLs inside the max-like call's subtree. Pre-declared if/else polarity (no pmax/max/which.max) does not match.

Flag calls to random-data generators (`rnorm`, `runif`, `rpois`, `rbinom`, `rexp`, `rgamma`, `rbeta`, `rmultinom`, `sample`, `sample_n`, `sample_frac`).

Description

v1 heuristic: any ⁠<SYMBOL_FUNCTION_CALL>⁠ whose text matches one of the known random-data generators is flagged at the line of the call. We deliberately don't try to distinguish "data-like" assignments from diagnostic randomness – false positives are cheap and the waiver layer (ANALYSIS_OK[...]) handles legitimate uses.

Usage

synthetic_data_linter()

Flag uppercase-constant assignments that are defined within a few lines of a label-tainted read or label-column reference, inside a file tagged `⁠# STAGE: selection⁠`.

Description

Selection-stage code must not pick its thresholds/bands by maximizing a label-aware metric (e.g. ground-truth recall). When a constant like BAND <- sweep[which.max(sweep$gt_recall), ...] sits next to a read.csv("..._gt_...") call or a label column reference, the threshold is effectively label-tuned. The fix is to either (a) move the constant to analysis_constants.yml with a documented value, or (b) move the threshold-selection code into a ⁠# STAGE: evaluation⁠ script and pass the resulting number in by hand.

Usage

threshold_near_label_linter()

Details

Stage detection: scan the first 10 lines for ⁠# STAGE: <name>⁠. Fires only when ⁠<name>⁠ is exactly selection. Other stages (including untagged files) are not flagged here.

Detection: regex on the raw file text.

Constant lines: ⁠^[A-Z][A-Z_0-9]+\\s*<-⁠
Label-tainted reads: paths/names containing gt, oracle, truth, evaluated, recall as a token.
Label column refs: a small hardcoded vocabulary (kept in sync with R033 for consistency).
A constant is flagged when at least one label-tainted line falls within +/- 6 lines of it.
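
The detection steps above can be approximated in a few lines of base R. This is a sketch under stated assumptions, not the package's implementation: the helper name is hypothetical, the token rule for label-tainted words is a guess at "as a token", and the label-column vocabulary is omitted.

```r
# Hypothetical helper: returns line numbers of uppercase constants that sit
# within `window` lines of a label-tainted line in a selection-stage file.
flag_label_adjacent_constants <- function(lines, window = 6L) {
  head_lines <- utils::head(lines, 10L)
  stage <- sub("^\\s*#\\s*STAGE:\\s*(\\S+).*$", "\\1",
               grep("^\\s*#\\s*STAGE:", head_lines, value = TRUE))
  if (length(stage) == 0L || stage[1] != "selection") return(integer(0))

  constant_idx <- grep("^[A-Z][A-Z_0-9]+\\s*<-", lines)
  # Assumed token rule: the word is bounded by non-letters, so "_gt_" matches.
  tainted_idx <- grep(
    "(^|[^A-Za-z])(gt|oracle|truth|evaluated|recall)([^A-Za-z]|$)", lines
  )

  near <- vapply(constant_idx,
                 function(i) any(abs(tainted_idx - i) <= window),
                 logical(1))
  constant_idx[near]
}
```

Called on the lines of a file, the helper returns only those constants that have a label-tainted neighbour inside the window; files without a selection tag in their first 10 lines return nothing.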

Flag iterative-fit calls that lack a nearby convergence check.

Description

Iterative optimisers (lme4's glmer/lmer/nlmer, base nls and optim, and optimx::optimx) can silently return non-converged fits. Downstream inference on a non-converged model is unreliable. Analyses should inspect fit@optinfo$conv$lme4$messages, the convergence element of the list returned by optim(), or otherwise programmatically confirm convergence before using the fit.

Usage

unchecked_convergence_linter()

Details

v1 heuristic: if any flagged fit call appears in the file AND the file body contains none of the tokens converged, conv$lme4, or convergence, every fit call in the file is flagged. The waiver layer silences this elsewhere when a justified ANALYSIS_OK[model-fit] comment is present.
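
A minimal sketch of a programmatic convergence check for optim(), using a toy least-squares objective invented for the example. Because the code mentions convergence, a file containing it would also satisfy the v1 token heuristic described above.

```r
# Toy data on an exact line, y = 1 + 2 * x.
x <- 1:10
y <- 1 + 2 * x
sse <- function(par) sum((y - (par[1] + par[2] * x))^2)

fit <- optim(c(0, 0), sse, method = "BFGS")

# optim() reports its status in `convergence`; 0 indicates successful completion.
if (fit$convergence != 0) {
  stop("optim() did not converge (code ", fit$convergence, ")")
}
estimates <- fit$par
```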

Flag join/merge calls without a follow-up cardinality assertion.

Description

Detects calls to left_join, right_join, inner_join, full_join, anti_join, semi_join, or merge. If the file contains a row-count or duplicate-key check (stopifnot(...), anyDuplicated, or a validate = ... argument), the rule treats the joins as covered and emits nothing. Otherwise, every join call is flagged.

Usage

unchecked_join_linter()

Details

V1 is intentionally file-level and trigger-happy. Upstream assertions in another file are handled by the orchestrator's ANALYSIS_OK[join] waiver layer, not here.
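
A short example of a join accompanied by cardinality assertions, using base merge() and two made-up data frames. Either assertion on its own would be enough for the file-level check described above.

```r
samples <- data.frame(sample_id = c("a", "b", "c"), dose = c(1, 2, 3))
batches <- data.frame(sample_id = c("a", "b", "c"), batch = c("x", "x", "y"))

# Guard against duplicate keys on the lookup side before joining.
stopifnot(anyDuplicated(batches$sample_id) == 0)

joined <- merge(samples, batches, by = "sample_id", all.x = TRUE)

# A left join on a unique key must preserve the row count.
stopifnot(nrow(joined) == nrow(samples))
```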

Flag `optparse::make_option("--kebab-name", ...)` calls whose parsed dest is never read.

Description

Walks every make_option("--kebab", ...) (with or without the ⁠optparse::⁠ qualifier), computes the destination name optparse would assign (--kebab-name -> kebab_name), then checks whether any ⁠<args>$<dest>⁠ or ⁠<args>[["<dest>"]]⁠ access reads that name – where ⁠<args>⁠ is the variable bound to a parse_args(...) result. If no parse_args() anchor exists in the file we can't tell what's consumed, so the file is skipped.

Usage

unconsumed_cli_flag_linter()

Details

Ported from the Python rule unconsumed-cli-flag.
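
The name mapping and the consumption check described above can be approximated at the text level. Both helpers below are hypothetical and only mirror the description; the actual rule works on the parse tree, and this sketch will also count a longer name that merely starts with the destination as a read.

```r
# Hypothetical helper: "--kebab-name" -> "kebab_name".
flag_to_dest <- function(flag) {
  gsub("-", "_", sub("^--", "", flag), fixed = TRUE)
}

# Hypothetical helper: which flags are never read as args$dest or args[["dest"]]?
unconsumed_flags <- function(flags, source_text, args_name = "args") {
  dests <- flag_to_dest(flags)
  consumed <- vapply(dests, function(dest) {
    dollar  <- paste0(args_name, "$", dest)
    bracket <- paste0(args_name, "[[\"", dest, "\"]]")
    any(grepl(dollar, source_text, fixed = TRUE),
        grepl(bracket, source_text, fixed = TRUE))
  }, logical(1))
  flags[!consumed]
}
```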

Flag scientifically consequential transforms without explanation.

Description

Batch correction, residualization, and similar transforms silently reshape downstream analysis. Require an ANALYSIS_OK[...] waiver (handled elsewhere) that documents what was – and was not – included as a covariate.

Usage

unexplained_transform_linter()

Details

Flagged calls (matched by function symbol): ComBat, combat, residualize, regress_out, remove_batch_effect, removeBatchEffect, regressOut

Flag self-assignments that silently narrow a dataframe.

Description

Patterns like ⁠df <- df[<cond>, ]⁠, df <- subset(df, ...), and df <- df %>% filter(...) drop rows without recording how many were dropped or why. The matched rule (R005 in the strategy doc) is that filtering is allowed when ledgered – a sibling ⁠ANALYSIS_OK[<tag>]⁠ waiver comment plus an observable drop record – but unannotated self-narrowing should be surfaced.

Usage

unledger_filter_linter()

Details

Heuristic: LEFT_ASSIGN where the LHS SYMBOL name reappears as a SYMBOL on the RHS, and the RHS contains either a [ (OP-LEFT-BRACKET) or a call to filter / subset. Waiver-comment suppression is the responsibility of the cross-file aggregator.
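
One way to make a row drop observable, sketched with a made-up data frame. The ledger layout and the step name are assumptions for illustration; the waiver comment itself is not shown because its exact syntax is defined elsewhere.

```r
df <- data.frame(id = 1:6, depth = c(12, 3, 25, 8, 40, 1))

# Flagged shape would be: df <- df[df$depth >= 10, ]
# Alternative: compute the mask once, record the drop, then narrow.
keep <- df$depth >= 10
drop_ledger <- data.frame(
  step      = "min_depth_10",
  n_before  = nrow(df),
  n_dropped = sum(!keep)
)
df_filtered <- df[keep, , drop = FALSE]
```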

Flag unledgered missingness coercion/imputation calls.

Description

v1 detects bare na.omit(...) calls. Stripping NAs silently discards rows without an audit trail; analyses should either scope the drop (df[!is.na(df$col), ]) or carry an ANALYSIS_OK[missingness] ledger comment documenting the exclusion. Future versions should extend to na.exclude, tidyr::replace_na, drop_na, and top-level as.numeric coercions; ledger-comment recognition is also deferred.

Usage

unledger_missingness_linter()
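
A small comparison of a bare na.omit() with a scoped drop, on a made-up data frame. It shows why the scoped form is easier to audit: only rows missing the column the analysis needs are removed, and the count is reported.

```r
df <- data.frame(
  id   = 1:5,
  expr = c(1.2, NA, 0.7, 3.1, NA),
  note = c(NA, "ok", "ok", NA, "ok")
)

# Bare na.omit() drops every row with an NA in any column.
n_bare <- nrow(na.omit(df))

# Scoped drop: only rows missing `expr`, with the count made visible.
missing_expr <- is.na(df$expr)
df_scoped <- df[!missing_expr, ]
message("Dropped ", sum(missing_expr), " of ", nrow(df), " rows with missing expr")
```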

Flag stochastic function calls (kmeans, umap, Rtsne, tsne) that lack a nearby `set.seed(...)` call.

Description

v1 heuristic: if the file contains any set.seed(...) call, suppress all findings for that file. Otherwise emit one Lint per known stochastic call. The waiver comment ANALYSIS_OK[random-seed-only] is handled by the orchestrator's apply_waivers, not here.

Usage

unseeded_stochastic_linter()

Details

Known stochastic call names: kmeans, umap (covers uwot::umap – SYMBOL_FUNCTION_CALL matches the bare function name even under ⁠pkg::⁠), Rtsne, tsne.
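
A brief example of seeding a stochastic call, using stats::kmeans() on the built-in iris measurements. The seed value is arbitrary.

```r
x <- as.matrix(datasets::iris[, 1:4])

set.seed(1)
fit_a <- kmeans(x, centers = 3, nstart = 5)

# Re-seeding with the same value reproduces the same cluster assignment.
set.seed(1)
fit_b <- kmeans(x, centers = 3, nstart = 5)
```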

Flag `yaml::read_yaml(...)` / `yaml::yaml.load_file(...)` / `jsonlite::fromJSON(...)` results that are later accessed via `$` or `[[ ]]` without an intervening validator call.

Description

Track variables whose value comes from one of the loader calls above. If the variable is then read by ⁠<var>$<key>⁠ or ⁠<var>[["<key>"]]⁠, flag the loader call. If the variable is passed to a ⁠validate_*⁠ or ⁠_schema⁠ function before being read, no finding fires.

Usage

unvalidated_config_linter()

Details

Ported from the Python rule unvalidated-config.
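
A sketch of the loader-then-validator ordering that avoids a finding. The validator, its required keys, and the config contents are hypothetical; only the name follows the validate_* shape described above. The loader call is guarded so the example still runs when the yaml package is not installed.

```r
# Hypothetical validator: checks presence and type of the keys the analysis uses.
validate_config <- function(cfg) {
  required <- c("min_cells", "resolution")
  missing_keys <- setdiff(required, names(cfg))
  if (length(missing_keys) > 0L) {
    stop("config is missing: ", paste(missing_keys, collapse = ", "))
  }
  stopifnot(is.numeric(cfg$min_cells), is.numeric(cfg$resolution))
  invisible(cfg)
}

cfg_path <- tempfile(fileext = ".yml")
writeLines(c("min_cells: 3", "resolution: 0.8"), cfg_path)

if (requireNamespace("yaml", quietly = TRUE)) {
  cfg <- yaml::read_yaml(cfg_path)
  validate_config(cfg)        # validator runs before any `$` access
  min_cells <- cfg$min_cells
}
```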

Flag `suppressWarnings(...)` and `suppressMessages(...)` calls.

Description

Blanket suppression hides diagnostics that often signal real problems (e.g. glm.fit non-convergence, NA coercion, dropped factor levels). Narrow, justified suppressions should be paired with an ANALYSIS_OK[warning-suppression] waiver comment; the orchestrator's waiver layer handles silencing those.

Usage

warning_suppression_linter()

Details

v1 scope: only the ⁠suppress*⁠ functions. options(warn = -1) is deferred.
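
As an alternative to blanket suppression, a handler can muffle one expected warning and let everything else through. The sketch below uses base R only; the helper name is hypothetical, and the message match assumes an English locale.

```r
# Muffle only the expected coercion warning; any other warning still surfaces.
parse_numeric <- function(x) {
  withCallingHandlers(
    as.numeric(x),
    warning = function(w) {
      expected <- grepl("NAs introduced by coercion", conditionMessage(w),
                        fixed = TRUE)
      if (expected) invokeRestart("muffleWarning")
    }
  )
}

values <- parse_numeric(c("1.5", "n/a", "2"))
```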
