---
title: "Getting started with soilKey"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with soilKey}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>"
)
```

`soilKey` provides automated soil profile classification under WRB 2022 (4th edition), SiBCS 5ª ed. (2018), and USDA Soil Taxonomy (13th edition, 2022). The taxonomic key itself is implemented as deterministic R code driven by versioned YAML rules; vision-language extraction, spatial priors, and OSSL-based attribute prediction sit alongside it as modular layers, never inside it.

# 0. The 30-second on-ramp

If you just want to see soilKey work end-to-end on a real profile -- without writing any R code -- there are two paths.

## A. Zero-code GUI

```{r run-demo, eval = FALSE}
library(soilKey)
run_demo()  # opens a one-screen Shiny app in your browser
```

Pick one of 31 canonical profiles from the dropdown (or upload your own horizons CSV), click **Classify**, and read the WRB / SiBCS / USDA names plus the deterministic key trace and the evidence grade.

## B. One R call, one fixture

```{r quick-start}
library(soilKey)

pedon <- make_ferralsol_canonical()      # canonical Latossolo Vermelho
classify_wrb2022(pedon, on_missing = "silent")$name
classify_sibcs(pedon)$name
classify_usda(pedon, on_missing = "silent")$name
```

That is the whole package: `PedonRecord` in, classification out. The remaining sections walk through how to build your own pedon and how the side modules (VLM, spatial, spectral) fit together.

# 1. Building a PedonRecord from scratch

`PedonRecord` is the central data carrier. It bundles site metadata, the horizons table (with a fixed canonical schema -- see `horizon_column_spec()` for the full list of columns), and optional spectra, images, documents, and a per-attribute provenance log.

```{r build-pedon}
my_pedon <- PedonRecord$new(
  site = list(
    id              = "example-001",
    lat             = -22.5,
    lon             = -43.7,
    country         = "BR",
    parent_material = "gneiss"
  ),
  horizons = data.frame(
    top_cm      = c(0,  15, 65, 130),
    bottom_cm   = c(15, 65, 130, 200),
    designation = c("A", "Bw1", "Bw2", "C"),
    clay_pct    = c(50, 60,  65,  60),
    silt_pct    = c(15, 10,  8,   8),
    sand_pct    = c(35, 30,  27,  32),
    cec_cmol    = c(8,  5,   4.5, 4),
    bs_pct      = c(20, 12,  10,  11),
    ph_h2o      = c(4.8, 4.9, 5.0, 5.1),
    oc_pct      = c(2.0, 0.4, 0.2, 0.1)
  )
)

my_pedon$validate()
```

The validator catches inverted depths, texture sums far from 100, implausible pH, sum of bases above CEC, Munsell out-of-range, and a handful of other soil-physical sanity checks.

# 2. Canonical fixtures

`soilKey` ships sixteen canonical fixtures designed so that exactly one of the eleven v0.2 diagnostics passes on each. Each profile also classifies cleanly through the wired WRB key.

```{r fixture-list}
fixtures <- list(
  Ferralsol  = make_ferralsol_canonical(),
  Luvisol    = make_luvisol_canonical(),
  Acrisol    = make_acrisol_canonical(),
  Lixisol    = make_lixisol_canonical(),
  Alisol     = make_alisol_canonical(),
  Chernozem  = make_chernozem_canonical(),
  Kastanozem = make_kastanozem_canonical(),
  Phaeozem   = make_phaeozem_canonical(),
  Calcisol   = make_calcisol_canonical(),
  Gypsisol   = make_gypsisol_canonical(),
  Solonchak  = make_solonchak_canonical(),
  Cambisol   = make_cambisol_canonical(),
  Plinthosol = make_plinthosol_canonical(),
  Podzol     = make_podzol_canonical(),
  Gleysol    = make_gleysol_canonical(),
  Vertisol   = make_vertisol_canonical()
)

ferralsol <- fixtures$Ferralsol
ferralsol
```

# 3. Calling the diagnostics directly

Every diagnostic returns a `DiagnosticResult` carrying the per-sub-test evidence, missing-attribute report, layer indices that satisfied, and the WRB literature reference.

```{r ferralic}
ferralic(ferralsol)
```

# 4. Diagnostic matrix across the canonical fixtures

```{r matrix, results='asis'}
diagnostics <- c("argic", "ferralic", "mollic", "calcic", "gypsic", "salic",
                  "cambic", "plinthic", "spodic",
                  "gleyic_properties", "vertic_properties")

mat <- vapply(fixtures, function(p) {
  vapply(diagnostics, function(d) {
    fn <- get(d, envir = asNamespace("soilKey"))
    isTRUE(fn(p)$passed)
  }, logical(1))
}, logical(length(diagnostics)))

knitr::kable(t(mat))
```

Every fixture activates exactly one diagnostic (or, for the argic-derived RSGs Acrisol / Lixisol / Alisol / Luvisol, just the shared `argic`).

# 5. RSG-derived diagnostics: argic and mollic families

The argic horizon is shared by four RSGs -- Acrisols, Lixisols, Alisols, Luvisols -- which differ by clay activity (CEC per kg clay) and chemistry (BS or Al saturation). soilKey provides one diagnostic per RSG that runs `argic()` internally, then applies the activity and chemistry tests on the argic layer:

```{r argic-derived}
acrisol(make_acrisol_canonical())$passed
lixisol(make_lixisol_canonical())$passed
alisol (make_alisol_canonical())$passed
luvisol(make_luvisol_canonical())$passed
```

Same pattern for the mollic-derived family (Chernozems / Kastanozems / Phaeozems):

```{r mollic-derived}
chernozem (make_chernozem_canonical())$passed
kastanozem(make_kastanozem_canonical())$passed
phaeozem  (make_phaeozem_canonical())$passed
```

# 6. End-to-end WRB classification

`classify_wrb2022()` consumes a `PedonRecord` and runs it through the YAML key (`inst/rules/wrb2022/key.yaml`). v0.2 wires 16 of 32 RSGs end-to-end; the other 16 are stubbed with `not_implemented_v01:` markers and return NA in the trace.

```{r classify-fr}
classify_wrb2022(ferralsol)
```

```{r classify-all}
classifications <- vapply(fixtures, function(p) {
  classify_wrb2022(p, on_missing = "silent")$rsg_or_order
}, character(1))
data.frame(fixture = names(classifications), assigned_rsg = classifications)
```

Each canonical fixture maps to its intended RSG. The trace shows which RSGs were tested, in canonical key order, before the assigned one.

# 7. Provenance and evidence grade

`PedonRecord$add_measurement()` records a value's provenance in a structured log. The final `ClassificationResult$evidence_grade` summarises that log on an A--D scale: A means every recorded value was laboratory-measured, D means the result rests on attributes extracted by VLM or assumed by the user.

```{r provenance}
ferralsol_v <- make_ferralsol_canonical()

# Mark the Bw1 clay value as predicted from spectroscopy
ferralsol_v$add_measurement(
  horizon_idx = 4,
  attribute   = "clay_pct",
  value       = 60,
  source      = "predicted_spectra",
  confidence  = 0.85,
  overwrite   = TRUE
)

classify_wrb2022(ferralsol_v)$evidence_grade
```

```{r provenance-vlm}
ferralsol_w <- make_ferralsol_canonical()
ferralsol_w$add_measurement(1, "clay_pct", 50, "extracted_vlm",
                              confidence = 0.7, overwrite = TRUE)
classify_wrb2022(ferralsol_w)$evidence_grade
```

# 8. Interoperability with `aqp`

`PedonRecord$to_aqp()` returns an `aqp::SoilProfileCollection`, allowing soilKey to plug into aqp's plotting and aggregation tooling without owning that infrastructure:

```{r to-aqp, eval = requireNamespace("aqp", quietly = TRUE)}
spc <- ferralsol$to_aqp()
class(spc)
aqp::profile_id(spc)
```

# 9. Module 4 -- OSSL spectroscopy bridge (gap-filling)

When some horizon attributes are missing, but the profile carries
Vis-NIR or MIR spectra, soilKey can fill the gaps via the Open Soil
Spectral Library. The pipeline preprocesses the spectra (SNV / SG1 /
trim), dispatches to a memory-based or PLSR backend, and writes each
prediction into the `PedonRecord` with provenance
`predicted_spectra` -- which the authority hierarchy treats as below
laboratory-measured but above VLM-extracted values. The PI95
prediction interval is mapped to a `[0, 1]` confidence score via
`pi_to_confidence()`.

```{r ossl, eval = FALSE}
# Synthetic example -- a profile with measured spectra but missing CEC.
pr_spec <- make_synthetic_pedon_with_spectra(n_horizons = 4)
pr_spec$horizons$cec_cmol <- NA_real_   # erase CEC

# Predict via memory-based learning against the OSSL global library.
pr_filled <- fill_from_spectra(
  pedon   = pr_spec,
  backend = "mbl",         # or "plsr_local" / "pretrained"
  attrs   = c("cec_cmol")  # which attributes to gap-fill
)

# Each predicted cell is logged with provenance source = "predicted_spectra".
pr_filled$provenance
classify_wrb2022(pr_filled)$evidence_grade   # B (predicted_spectra present)
```

# 10. Module 3 -- SoilGrids / Embrapa spatial prior (sanity check)

Once the deterministic key has reached a verdict, soilKey can
**cross-check** that verdict against a spatial prior derived from
ISRIC SoilGrids (global) or the Embrapa raster (Brazil). The prior
*never* overrides the key -- it only attaches a `prior_check` entry
to the result and emits a warning if the deterministic outcome lies
in a low-probability region of the prior.

```{r prior, eval = FALSE}
prior <- spatial_prior(lon = -43.7, lat = -22.5, source = "auto")
prior   # data.table of (rsg_code, probability)

res <- classify_wrb2022(
  pedon           = ferralsol,
  prior           = prior,
  prior_threshold = 0.01   # warn if assigned RSG has prior < 1%
)
res$prior_check
```

# 11. Module 2 -- Multimodal extraction via `ellmer`

A field PDF or photo can be turned into a `PedonRecord` via the
`extract_*` functions, each driven by an `ellmer` chat object
(Anthropic, OpenAI, Google, or Ollama). The output is a
schema-validated JSON (draft-07, in `inst/schemas/`) with
`{value, confidence, source_quote}` per attribute, then merged into
the `PedonRecord` with provenance `extracted_vlm`.

The package ships a `MockVLMProvider` (R6) so the validation +
retry loop can be exercised in tests without an API key:

```{r vlm-mock, eval = FALSE}
mock <- MockVLMProvider$new(
  responses = list(
    list(horizons = list(
      list(top_cm = 0,  bottom_cm = 15, designation = "A",
            clay_pct = list(value = 30, confidence = 0.9,
                            source_quote = "30% clay (table 1)")),
      list(top_cm = 15, bottom_cm = 65, designation = "Bw",
            clay_pct = list(value = 55, confidence = 0.85,
                            source_quote = "Bw horizon, 55% clay"))
    ))
  )
)
pr_extracted <- extract_horizons_from_pdf(
  pdf_path = "fieldsheet.pdf",
  provider = mock          # in production: vlm_provider("anthropic")
)
classify_wrb2022(pr_extracted)$evidence_grade   # C or D depending on cell coverage
```

For real use:

```{r vlm-real, eval = FALSE}
chat <- vlm_provider("anthropic", model = "claude-sonnet-4-5")
pr   <- extract_horizons_from_pdf("RADAMBRASIL_perfil_007.pdf",
                                    provider = chat)
res  <- classify_wrb2022(pr)
res
```

# 12. SiBCS 5ª edição (Embrapa, 2018)

soilKey ships the parallel SiBCS key alongside WRB 2022. The 13
ordens are wired in canonical Cap 4 order; calling
`classify_sibcs()` on any `PedonRecord` runs the same engine that
backs `classify_wrb2022()`.

```{r sibcs-demo}
# A canonical Latossolo (Brazilian Ferralsol equivalent)
pr_lat <- make_latossolo_canonical()
classify_sibcs(pr_lat, on_missing = "silent")$rsg_or_order

# A canonical Argissolo (B textural, low BS)
pr_arg <- make_argissolo_canonical()
classify_sibcs(pr_arg, on_missing = "silent")$rsg_or_order

# A canonical Nitossolo (clay >=35% throughout, B/A <=1.5, cerosidade)
pr_nit <- make_nitossolo_canonical()
classify_sibcs(pr_nit, on_missing = "silent")$rsg_or_order

# Cross-system: the SAME profile classified by both keys
classify_wrb2022(pr_lat, on_missing = "silent")$rsg_or_order
classify_sibcs(pr_lat, on_missing = "silent")$rsg_or_order
```

The diagnostic helpers also have Portuguese names that match the SiBCS
literature. For example:

```{r sibcs-atributos}
# Atividade da fração argila (Ta vs Tb) per Cap 1, p 30
atividade_argila_alta(make_luvissolo_canonical())$passed   # TRUE  -> Ta
atividade_argila_alta(make_nitossolo_canonical())$passed   # FALSE -> Tb

# Caráter alítico (Cap 1, p 32): Al >= 4 cmol_c/kg + sat Al >= 50% + V < 50%
carater_alitico(make_argissolo_canonical())$passed
```


# 13. v0.7 scope and the v0.3.3+ roadmap

| Version  | Scope                                                                                                                |
|----------|----------------------------------------------------------------------------------------------------------------------|
| v0.1     | Core classes; argic, ferralic, mollic; Ferralsols path                                                               |
| v0.2     | +calcic, gypsic, salic, cambic, plinthic, spodic, gleyic, vertic; +AC/LX/AL/LV/CH/KS/PH RSG diagnostics; 16/32 wired |
| v0.3     | +histic, leptic, arenic, umbric, duric, technic, andic, fluvic, natric, nitic, planic, stagnic, retic, cryic, anthric; full WRB key (32/32 RSGs wired); 31 canonical fixtures |
| v0.3.1   | Tier-1 corrections vs WRB 2022 Ch 3.1: argic 6/1.4/20 + band 50, ferralic drops ECEC, duric 10/10, vertic >=25 cm, salic alkaline + product gate |
| v0.3.2   | RSG order in `key.yaml` aligned to canonical WRB 2022 Ch 4 (PL/ST before NT/FR; FL before AR)                        |
| **v0.4** | **Module 4 -- OSSL spectroscopy bridge (MBL, PLSR-local, pretrained)**                                               |
| **v0.5** | **Module 3 -- SoilGrids spatial prior + Embrapa raster (sanity-check, never overrides)**                              |
| **v0.6** | **Module 2 -- Multimodal extraction (PDF / photo / fieldsheet) via `ellmer`, schema-validated**                      |
| **v0.3.3** | **Complete WRB Ch 3.1 / 3.2 / 3.3 coverage** -- +18 horizons, +12 properties, +16 materials. Schema +24 columns.  |
| **v0.3.4** | **Tier-2 RSG gate strengthening** -- vertisol, andosol, gleysol, planosol, ferralsol, chernozem_strict, kastanozem_strict wired into key.yaml; spodic refined to disambiguate from andic. |
| **v0.3.5** | **Closes WRB Ch 3.1 -- 32/32 horizons** (+tsitelic, panpaic, limonic, protovertic).                                 |
| **v0.7**   | **Module 6 -- SiBCS 5ª ed. (Embrapa, 2018) implemented in full**: 17 atributos diagnósticos + 24 horizontes diagnósticos + 13 ordens RSG-level following the canonical Cap 4 key (O→R→V→E→S→G→L→M→C→F→T→N→P). 13 fixtures canônicas, all classify correctly; 30 new tests; +830 expectations total in the suite. |
| v0.8     | Module 5 -- USDA Soil Taxonomy parallel key (12 orders)                                                              |
| v0.9     | All ~202 WRB qualifiers + 10 specifiers; vignettes 05-09; WoSIS benchmark                                            |
| v1.0     | CRAN submission and methodological paper                                                                             |

See `ARCHITECTURE.md` (in the package root) for the full design rationale.