Getting started with clusteredMSM

Overview

clusteredMSM provides nonparametric analysis of clustered multistate process data, based on Bakoyannis (2021) and Bakoyannis & Bandyopadhyay (2022). It exposes one user-facing function, patp(), that:

The package depends only on survival.

Input data format

clusteredMSM expects data in interval format: one row per mutually-exclusive time interval per subject, with columns for interval start time (Tstart), end time (Tstop), starting state (Sstart), and ending state (Sstop). The column names are flexible — they are passed by name through the formula.

Censoring is encoded as Sstart == Sstop on the final row of a subject’s record. Within each subject, intervals must be temporally contiguous (Tstop[k] == Tstart[k+1]) and state contiguous (Sstop[k] == Sstart[k+1]).

A progressive illness-death example

library(clusteredMSM)

# 1=Healthy, 2=Ill, 3=Dead. Allowed: 1->2, 1->3, 2->3.
tmat <- trans_mat(
  list(c(2, 3), 3, integer(0)),
  names = c("Healthy", "Ill", "Dead")
)
tmat
#>          to
#> from      Healthy Ill Dead
#>   Healthy      NA   1    2
#>   Ill          NA  NA    3
#>   Dead         NA  NA   NA

A small example dataset in interval format:

mydata <- data.frame(
  pid     = c(1, 1, 2, 3, 4, 4),
  site    = c(1, 1, 1, 2, 2, 2),
  treatment = c("A", "A", "A", "B", "B", "B"),
  t0      = c(0.0, 1.5, 0.0, 0.0, 0.0, 1.0),
  t1      = c(1.5, 3.0, 2.0, 1.0, 1.0, 2.5),
  s0      = c(1,   2,   1,   1,   1,   2),
  s1      = c(2,   3,   1,   3,   2,   3)
)
mydata
#>   pid site treatment  t0  t1 s0 s1
#> 1   1    1         A 0.0 1.5  1  2
#> 2   1    1         A 1.5 3.0  2  3
#> 3   2    1         A 0.0 2.0  1  1
#> 4   3    2         B 0.0 1.0  1  3
#> 5   4    2         B 0.0 1.0  1  2
#> 6   4    2         B 1.0 2.5  2  3

Subject 1: H -> I -> D. Subject 2: censored healthy at t = 2.0 (s0 == s1). Subject 3: died at t = 1.0. Subject 4: H -> I -> D.

One-sample estimation

fit <- patp(
  msm(t0, t1, s0, s1) ~ 1,
  data    = mydata,
  tmat    = tmat,
  id      = "pid",
  cluster = "site",
  h = 1, j = 2, s = 0,
  B = 1000, cband = TRUE,
  seed = 1
)
fit

fit$curves contains the time-indexed point estimate of \[\begin{eqnarray*} P(X(t) = j \,|\, X(s) = h) &=& P(X(t) = 2 \,|\, X(0) = 1) \\ &=& P(X(t) = 2), \end{eqnarray*}\] (second equality due to the fact that all start at state = 1 in the classical illness-death model), standard error, pointwise confidence interval, and (with cband = TRUE) simultaneous band limits.

Two-sample comparison (estimation + test in one call)

If the formula’s right-hand side names a binary grouping variable, patp() automatically estimates both curves and conducts a Kolmogorov-Smirnov-type test:

tt <- patp(
  msm(t0, t1, s0, s1) ~ treatment,
  data    = mydata,
  tmat    = tmat,
  id      = "pid",
  cluster = "site",
  h = 1, j = 2, s = 0,
  B = 1000,
  seed = 1
)
tt

The $curves slot now has one row block per group. The $test slot contains the observed K-S statistic and the bootstrap p-value.

Recovery models (non-monotone processes)

For processes with cyclic transitions (e.g., illness-death with recovery), use the long event format directly. The same patp() call works:

tmat_rec <- trans_mat(
  list(c(2, 3), c(1, 3), integer(0)),
  names = c("Healthy", "Ill", "Dead")
)

# Subject who went Healthy -> Ill -> Healthy -> censored:
recovery_data <- data.frame(
  pid = c(1, 1, 1),
  t0  = c(0.0, 1.0, 2.0),
  t1  = c(1.0, 2.0, 3.5),
  s0  = c(1,   2,   1),
  s1  = c(2,   1,   1)    # last row: censored healthy
)

patp(msm(t0, t1, s0, s1) ~ 1,
     data = recovery_data, tmat = tmat_rec,
     id = "pid",
     h = 1, j = 2, s = 0,
     B = 500, seed = 1)

Landmark variant

When s > 0 and the Markov assumption is questionable, use LMAJ = TRUE:

patp(msm(t0, t1, s0, s1) ~ 1,
     data = mydata, tmat = tmat,
     id = "pid", cluster = "site",
     h = 1, j = 2, s = 1.0,
     LMAJ = TRUE, B = 1000, seed = 1)

Inverse-cluster-size weighting

When cluster size is potentially informative (e.g., sicker centers contribute more patients), use the weighted estimator:

patp(msm(t0, t1, s0, s1) ~ 1,
     data = mydata, tmat = tmat,
     id = "pid", cluster = "site",
     h = 1, j = 2, s = 0,
     weighted = TRUE, B = 1000, seed = 1)

weighted = TRUE requires the cluster argument.

What B = 0 does

Setting B = 0 returns the point estimate without any bootstrap inference. This is permitted only for one-sample formulas; two-sample formulas always require B > 0 (since the entire point of a two-sample analysis is the test). Use B = 0 for fast exploratory work, then switch to B = 1000 for inference.

References

Bakoyannis G (2021). Nonparametric analysis of nonhomogeneous multistate processes with clustered observations. Biometrics 77(2):533-546. doi:10.1111/biom.13327

Bakoyannis G, Bandyopadhyay D (2022). Nonparametric tests for multistate processes with clustered data. Annals of the Institute of Statistical Mathematics 74(5):837-867. doi:10.1007/s10463-021-00819-x

Putter H, Spitoni C (2018). Non-parametric estimation of transition probabilities in non-Markov multi-state models: the landmark Aalen-Johansen estimator. Statistical Methods in Medical Research 27(7):2081-2092. doi:10.1177/0962280216674497