---
title: "4. Advanced analysis: smoothing, tuning, comparison, and mining"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{4. Advanced analysis: smoothing, tuning, comparison, and mining}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.width = 8, fig.height = 5, out.width = "100%")
```

This vignette covers the parts of `transitiontrees` beyond the basic
fit-prune-predict loop: choosing a smoother and a pruning rule, picking
hyperparameters by cross-validation, quantifying pathway reliability,
comparing cohorts with a permutation test, introspecting the fitted tree,
and mining it for contexts and sequences of interest.

## Setup

We work throughout from one fit on the bundled `trajectories` data and its
pruned form -- the same starting point as *Getting started*.

```{r setup}
library(transitiontrees)
data(trajectories)
set.seed(1)

tree   <- context_tree(trajectories, max_depth = 4L, min_count = 5L)
pruned <- prune_tree(tree, criterion = "G2", alpha = 0.05)
```

## 1. Smoothing schemes

Smoothing decides what probability an *unseen* next state receives. Five
schemes are implemented (`floor`, `laplace`, `kneser_ney`, `witten_bell`,
`jelinek_mercer`). `compare_smoothing()` refits under each and reports
in-sample perplexity in one call.

```{r smoothing-grid}
compare_smoothing(trajectories, max_depth = 4L, min_count = 5L)
```

Two things to read from the output. First, `n_nodes` is identical across
schemes -- smoothing changes *probabilities*, never *which contexts exist*;
topology is set by `max_depth` and `min_count`, not the smoother. Second, do
**not** pick a smoother on in-sample perplexity (it rewards memorisation); the
cross-validation in section 3 is the verdict that counts.
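
Both points can be seen on a single context with a few lines of base R. This
is a sketch, not the package's implementation: the counts are made up, and
`laplace_smooth()` and `perplexity()` are hypothetical helpers that use
textbook add-one smoothing and the usual definition of perplexity.

```r
# Hypothetical next-state counts for one context; "Disengaged" was never observed.
counts <- c(Active = 8, Average = 2, Disengaged = 0)

# Unsmoothed (maximum-likelihood) estimate: the unseen state gets probability 0.
mle <- counts / sum(counts)

# Textbook add-one (Laplace) smoothing; an illustrative helper, not a package function.
laplace_smooth <- function(counts, pseudo = 1) {
  (counts + pseudo) / (sum(counts) + pseudo * length(counts))
}
smoothed <- laplace_smooth(counts)

# Perplexity of the probabilities assigned to the observations that occurred.
perplexity <- function(p) exp(-mean(log(p)))

# Score both estimates on the very observations they were fitted to.
observed <- rep(names(counts), counts)
c(mle = perplexity(mle[observed]), smoothed = perplexity(smoothed[observed]))
```

The unsmoothed estimate always scores at least as well on its own training
data, which is why in-sample perplexity cannot tell you whether smoothing
helps; only held-out data can.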

Handed a *fitted* tree, `compare_smoothing()` re-smooths it under every
scheme (via `smooth_tree()`, without re-counting) instead of refitting -- a
smoothing sweep on the already-pruned model in one call:

```{r resmooth}
compare_smoothing(pruned)
```

## 2. Pruning criteria

`prune_tree()` supports four criteria. `compare_pruning()` applies each --
holding `alpha`/`threshold` fixed -- and reports how hard each one trims.

```{r pruning-grid}
compare_pruning(tree)
```

`G2` (the likelihood-ratio test) and `AIC` ask "is the extra depth justified
given its sample size?"; `BIC` punishes parameters harder (its penalty scales
with `log n`); `KL` at a lenient absolute `threshold` keeps almost
everything. Use `G2` (or `AIC`) unless you have a specific reason, and report
the reduction -- "most grown contexts were unjustified" is itself a finding.
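
To see what these criteria weigh against each other, here is a base-R sketch
of the textbook quantities for one child context and its parent. The counts
are invented, and how `prune_tree()` turns each criterion into a keep/prune
decision internally is an assumption here, not something this sketch
reproduces.

```r
# Hypothetical next-state counts at a child context and at its (shorter) parent.
child  <- c(Active = 30,  Average = 12,  Disengaged = 18)
parent <- c(Active = 400, Average = 350, Disengaged = 250)

p_child  <- child / sum(child)
p_parent <- parent / sum(parent)

# Likelihood-ratio statistic: how much better the child's own distribution
# explains the child's counts than the parent's would. Zero counts add nothing.
nz <- child > 0
G2 <- 2 * sum(child[nz] * log(p_child[nz] / p_parent[nz]))

# Extra free parameters the child costs: number of states minus one.
df     <- length(child) - 1
cutoff <- qchisq(1 - 0.05, df)   # chi-square bar at alpha = 0.05
n      <- sum(child)

# Textbook penalties (assumed decision rule: keep the child if G2 exceeds them).
c(G2 = G2, chisq_cutoff = cutoff, AIC_penalty = 2 * df, BIC_penalty = log(n) * df)
```

With these made-up counts the statistic clears the chi-square cutoff and the
AIC-style penalty but not the BIC-style one, which is the "punishes parameters
harder" behaviour in miniature.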

## 3. Cross-validated tuning

`tune_tree()` runs k-fold CV at the **sequence level** over a grid of
`(max_depth, min_count, smoothing, prune)` and returns a ranked data.frame
with the winner on `attr(., "best")`.

```{r tune}
tg <- tune_tree(trajectories, max_depth = 1L:4L, folds = 5L, seed = 42L)
head(tg, 6)
attr(tg, "best")
```

(`min_count` and `prune` are swept by their defaults; add `smoothing =` or a
wider `min_count =` to grow the grid.)
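
"Sequence level" means whole sequences, not individual transitions, are
assigned to folds, so no held-out sequence leaks transitions into training. A
minimal base-R sketch of that split, on toy data and with hypothetical
variable names (the package's own fold assignment may differ in detail):

```r
set.seed(42)

# Toy data: 10 sequences of unequal length (hypothetical, not the bundled data).
states <- c("Active", "Average", "Disengaged")
seqs <- lapply(sample(5:9, 10, replace = TRUE),
               function(len) sample(states, len, replace = TRUE))

# Assign each *sequence* to one of k folds, in balanced random order.
k    <- 5L
fold <- sample(rep_len(seq_len(k), length(seqs)))

# For fold 1: every transition of a held-out sequence stays in the test set.
test_seqs  <- seqs[fold == 1L]
train_seqs <- seqs[fold != 1L]

table(fold)
```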

```{r tune-plot, fig.height = 5}
plot(tg)
```

The shape of the curve is as informative as the winning point: if perplexity
keeps falling with `max_depth` the process has long memory; if it flattens
early (as engagement data tends to) the useful memory is short and deeper
trees just overfit. Refit at the chosen configuration on the full data for
downstream use.

## 4. Bootstrap pathway reliability

`bootstrap_pathways()` resamples whole sequences and reports, per pathway,
`stability_rate` (how often the count reproduces across resamples) and
`informative_rate` (how often the G-squared against the parent clears the
chi-square bar). Keeping the raw resamples also lets you inspect the full
distribution of any statistic.

```{r boot}
boot <- bootstrap_pathways(pruned, iter = 100L, stat = "count",
                          seed = 1L, keep_resamples = TRUE)
boot
```

`summary()` returns the tidy per-pathway table, sorted so the trustworthy
(stable *and* informative) pathways come first. Each tracked statistic
(`count`, `next_probability`, `divergence`, `G2`) carries the same
`mean / sd / ci_lo / ci_hi` quartet, so you can report a bootstrap CI for any
pathway statistic rather than a bare point estimate:

```{r boot-cis}
head(summary(boot))
```
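
The resampling idea behind those columns can be sketched in base R. This is
an illustration under stated assumptions: the corpus is simulated,
`count_pathway()` is a hypothetical helper, and the percentile interval is
one common choice that may not match how the package computes `ci_lo` and
`ci_hi`.

```r
set.seed(1)

# Toy corpus (hypothetical): 40 short sequences over three states.
states <- c("Active", "Average", "Disengaged")
seqs <- replicate(40, sample(states, 8, replace = TRUE), simplify = FALSE)

# Statistic of interest: how often "Active" is followed by "Disengaged".
count_pathway <- function(seqs) {
  sum(vapply(seqs, function(s) {
    sum(head(s, -1) == "Active" & tail(s, -1) == "Disengaged")
  }, integer(1)))
}

# Resample whole sequences with replacement and recompute the statistic.
boot_counts <- replicate(200, {
  idx <- sample(length(seqs), replace = TRUE)
  count_pathway(seqs[idx])
})

# Summarise the resamples with a percentile interval.
c(observed = count_pathway(seqs),
  mean     = mean(boot_counts),
  sd       = sd(boot_counts),
  ci_lo    = unname(quantile(boot_counts, 0.025)),
  ci_hi    = unname(quantile(boot_counts, 0.975)))
```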

`plot_pathway_resamples()` draws the full resample distribution per pathway.
A tight unimodal peak means the estimate is well-determined; a bimodal or
heavy-tailed panel is the tell that the pathway is *carrier-driven* -- a few
sequences account for it, and dropping them in a resample collapses it.

```{r boot-resamples, fig.height = 4.5}
plot_pathway_resamples(boot, stat = "divergence", top = 6L)
```

## 5. Comparing two cohorts

Name an **external** group column and `context_tree(group = )` fits one tree
per group in a single call, returning a `transitiontrees_group` that
`prune_tree()` and `compare_trees()` consume directly -- no manual splitting
or label-building. We compare high- and low-achieving students on the bundled
`group_regulation_long` log.

```{r group-fit}
data(group_regulation_long)
grp <- prune_tree(context_tree(group_regulation_long,
                              actor = "Actor", time = "Time", action = "Action",
                              group = "Achiever", max_depth = 2L, min_count = 10L))
cmp <- compare_trees(grp, iter = 199L, seed = 1L)
cmp
```

The printed comparison reports the observed distance (`pdist`, a
count-weighted symmetric-KL between the cohorts' pathway distributions) and
the `p_value` from permuting the sequence-to-cohort labels. A small `p_value`
says the observed distance is larger than label shuffling alone tends to
produce: the cohorts generate genuinely different pathway dynamics, not a
relabelling artefact.
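
The logic of the permutation test fits in a short base-R sketch. Everything
here is a simplification: the data are simulated, the distance is a plain
symmetric KL between state distributions rather than the package's
count-weighted `pdist`, and the add-one p-value formula is a standard
convention that is assumed, not confirmed, to match `compare_trees()`.

```r
set.seed(1)
states <- c("Active", "Average", "Disengaged")

# Toy cohorts (hypothetical): 30 sequences each, with different state mixes.
make_seqs <- function(n, prob) {
  replicate(n, sample(states, 10, replace = TRUE, prob = prob), simplify = FALSE)
}
seqs   <- c(make_seqs(30, c(0.6, 0.3, 0.1)), make_seqs(30, c(0.3, 0.3, 0.4)))
cohort <- rep(c("High", "Low"), each = 30)

# Simplified stand-in for the distance: symmetric KL between the cohorts'
# state distributions, add-one smoothed so no probability is zero.
state_dist <- function(seqs) {
  counts <- as.vector(table(factor(unlist(seqs), levels = states)))
  (counts + 1) / (sum(counts) + length(states))
}
sym_kl <- function(p, q) sum(p * log(p / q)) + sum(q * log(q / p))
distance <- function(seqs, labels) {
  sym_kl(state_dist(seqs[labels == "High"]), state_dist(seqs[labels == "Low"]))
}

observed <- distance(seqs, cohort)

# Null distribution: shuffle which sequence belongs to which cohort.
iter     <- 199L
permuted <- replicate(iter, distance(seqs, sample(cohort)))

# Add-one p-value, so it can never be exactly zero.
p_value <- (1 + sum(permuted >= observed)) / (1 + iter)
c(observed = observed, p_value = p_value)
```

Shuffling labels at the sequence level keeps each sequence intact, so the
null distribution reflects only the arbitrariness of cohort membership.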

```{r compare-plot, fig.height = 4.5}
plot(cmp)
```

For the full per-axis decomposition (behavioural vs usage) and a tidy
pairwise `distance_matrix`, `compare_groups()` consumes the same
`group =`-fitted tree -- see the *Complete analysis case* vignette.

## 6. Tree introspection

Three accessors treat the tree as a queryable object.

```{r query}
query_pathway(pruned, c("Active", "Active"))               # full distribution
query_pathway(pruned, "Disengaged", next_state = "Disengaged")  # one cell
pathway_exists(pruned, "Active -> Disengaged")             # membership (no backoff)
```

By default an unseen context backs off to its longest matching suffix; pass
`exact = TRUE` to demand the literal node (returns `NA` if it is not one) --
the tool for auditing *which* contexts the tree actually holds.

```{r query-exact}
query_pathway(pruned, c("Active", "Average", "Active"), exact = TRUE)
```
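
The difference between the two lookup modes can be sketched in base R. The
stored contexts, the oldest-to-newest label order, and `lookup_context()` are
all hypothetical; the sketch only illustrates "longest matching suffix"
versus "literal node or nothing".

```r
# Hypothetical stored contexts, labelled oldest -> newest.
# "" stands for the root (no history), which always exists.
contexts <- c("", "Active", "Average", "Active -> Active", "Average -> Active")

# Try the full history first, then drop the oldest state until a context matches.
lookup_context <- function(history, contexts, exact = FALSE) {
  for (start in seq_along(history)) {
    key <- paste(history[start:length(history)], collapse = " -> ")
    if (key %in% contexts) return(key)
    if (exact) return(NA_character_)   # literal node required: no backoff
  }
  ""                                   # nothing matched: fall back to the root
}

lookup_context(c("Active", "Average", "Active"), contexts)
# backs off to "Average -> Active"
lookup_context(c("Active", "Average", "Active"), contexts, exact = TRUE)
# NA: the three-state context is not stored
```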

`subtree()` extracts the slice rooted at a context -- the same pathway API
then runs on the slice:

```{r subtree}
sub <- subtree(pruned, "Active")   # its banner reads "subtree of: Active"
sub
head(tree_pathways(sub), 4)
```

## 7. Mining contexts and sequences

`mine_contexts()` scans the tree for contexts where a chosen state is
unusually likely (or unlikely):

```{r mine-contexts}
mine_contexts(pruned, state = "Disengaged", min_prob = 0.5)
```
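
Conceptually this is a filter over each context's next-state distribution. A
base-R sketch on an invented probability table (the context labels and
numbers are hypothetical, and the real function may report more columns):

```r
# Hypothetical next-state probabilities, one row per context.
states <- c("Active", "Average", "Disengaged")
probs <- matrix(c(0.70, 0.20, 0.10,
                  0.10, 0.20, 0.70,
                  0.05, 0.15, 0.80),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("Active", "Disengaged", "Average -> Disengaged"),
                                states))

# Keep contexts where the target state is at least as likely as the threshold,
# most extreme first.
target <- probs[, "Disengaged"]
hits   <- sort(target[target >= 0.5], decreasing = TRUE)
hits
```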

`mine_sequences()` ranks supplied sequences by how well the model predicts
them -- the `surprising` ones are atypical trajectories worth a closer look:

```{r mine-sequences}
mine_sequences(pruned, newdata = trajectories, which = "surprising", n = 5L)
```
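
One way to picture "how well the model predicts" a sequence is its mean
negative log-probability per transition. The sketch below assumes that score
and a first-order transition matrix for simplicity; the matrix, the sequences,
and `surprisal()` are hypothetical, and the package's actual ranking
statistic may differ.

```r
# Hypothetical first-order transition probabilities (rows: current, columns: next).
states <- c("Active", "Average", "Disengaged")
P <- matrix(c(0.7, 0.2, 0.1,
              0.3, 0.5, 0.2,
              0.1, 0.2, 0.7),
            nrow = 3, byrow = TRUE, dimnames = list(states, states))

# Mean negative log-probability per transition: higher means more surprising.
surprisal <- function(s, P) {
  from <- head(s, -1)
  to   <- tail(s, -1)
  -mean(log(P[cbind(from, to)]))
}

seqs <- list(
  typical  = c("Active", "Active", "Active", "Average", "Average"),
  atypical = c("Active", "Disengaged", "Active", "Disengaged", "Active")
)

# Rank sequences from most to least surprising.
sort(vapply(seqs, surprisal, numeric(1), P = P), decreasing = TRUE)
```

Averaging per transition keeps long sequences from looking surprising merely
because they are long.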

## 8. Imputing gaps

`impute_sequences()` fills *internal* missing states from the fitted tree --
`modal` takes the most likely state at each gap, `prob` samples from the
predicted distribution:

```{r impute}
gappy <- list(c("Active", "Active", NA, "Disengaged"),
              c("Average", NA, "Average"))
impute_sequences(pruned, gappy, method = "modal")
```
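
The two methods differ only in how they turn a predicted distribution into a
single state. A base-R sketch for one gap, using a made-up distribution
(how the package derives that distribution from the surrounding states is not
shown here):

```r
# Hypothetical predicted next-state distribution at a gap.
pred <- c(Active = 0.55, Average = 0.30, Disengaged = 0.15)

# "modal": always take the most likely state, so the result is deterministic.
modal_fill <- names(pred)[which.max(pred)]

# "prob": draw one state in proportion to its predicted probability.
set.seed(1)
prob_fill <- sample(names(pred), size = 1, prob = pred)

c(modal = modal_fill, prob = prob_fill)
```

Modal filling is reproducible but understates variability; probabilistic
filling preserves it at the cost of needing a seed for reproducibility.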

## 9. Generating sequences

Every fitted tree is also a generative model. `generate_sequences()` samples
by walking the conditional distributions; `simulate()` is the R-standard
generic wrapping it with `nsim` and a `seed`.

```{r generate}
generate_sequences(pruned, n = 4L, length = 10L)
simulate(pruned, nsim = 4L, seed = 42L, length = 10L)
```

Generated sequences should look plausibly like the real ones -- a sanity
check that the model captured the gross dynamics -- and give you a null
behavioural corpus for stress-testing a downstream pipeline.
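
"Walking the conditional distributions" can be sketched in base R with a
first-order model. The transition matrix and `generate_one()` are
hypothetical; a fitted tree would condition on longer contexts where it has
them, which this sketch does not attempt.

```r
set.seed(42)

# Hypothetical first-order transition probabilities (rows: current, columns: next).
states <- c("Active", "Average", "Disengaged")
P <- matrix(c(0.7, 0.2, 0.1,
              0.3, 0.5, 0.2,
              0.1, 0.2, 0.7),
            nrow = 3, byrow = TRUE, dimnames = list(states, states))

# Start somewhere, then repeatedly sample the next state given the current one.
generate_one <- function(n_steps, P, start = sample(rownames(P), 1)) {
  s <- character(n_steps)
  s[1] <- start
  for (t in seq_len(n_steps)[-1]) {
    s[t] <- sample(colnames(P), 1, prob = P[s[t - 1], ])
  }
  s
}

generate_one(10, P)
```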