Introduction to FastSurvival

Overview

FastSurvival provides fast alternatives to three standard functions in the survival package (survfit(), survdiff(), coxph()) together with a clinical trial data simulator (simdata_fast()). All functions are designed for repeated evaluation inside simulation loops, and the core computations are implemented in C++ via Rcpp.

This vignette has three parts:

  1. Numerical agreement: confirm that survfit_fast(), survdiff_fast(), and coxph_fast() return the same numerical results as the corresponding functions in the survival package.
  2. Speed comparison: measure execution times against the survival package using microbenchmark.
  3. Simulation example: generate data with simdata_fast() (nsim = 100) and summarize the results of applying the three analysis functions to each simulated trial.
library(FastSurvival)
library(survival)

1. Numerical agreement with the survival package

We use the ovarian dataset bundled with the survival package for the agreement checks.

ord <- order(ovarian$futime)
t_s <- ovarian$futime[ord]
e_s <- ovarian$fustat[ord]
g_s <- ovarian$rx[ord]

1.1 survfit_fast() versus survfit() + summary()

We evaluate the Kaplan-Meier estimate at the landmark time t = 500.

# FastSurvival
fit_fast <- survfit_fast(t_s, e_s, t_eval = 500, conf.type = "log")
fit_fast
#> Kaplan-Meier estimate at a single time point
#> 
#>         survival std.err lower 95% upper 95%
#> t = 500   0.5961  0.0999    0.4291    0.8279
#> 
#>  Confidence interval type: log

# survival
fit_ref <- survfit(Surv(futime, fustat) ~ 1, data = ovarian, conf.type = "log")
summary(fit_ref, times = 500)
#> Call: survfit(formula = Surv(futime, fustat) ~ 1, data = ovarian, conf.type = "log")
#> 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>   500     12      10    0.596  0.0999        0.429        0.828

The survival estimate, standard error, and confidence interval at t = 500 agree between the two implementations.

1.2 survdiff_fast() versus survdiff()

# FastSurvival (two-sided chi-square statistic)
sd_fast <- survdiff_fast(ovarian$futime, ovarian$fustat, ovarian$rx,
                         control = 1, side = 2)
sd_fast
#> Log-rank test (two-group)
#> 
#>   N = 26
#> 
#>           Observed Expected (O-E)^2/E (O-E)^2/V
#> control          7   5.2335    0.5962    1.0627
#> treatment        5   6.7665    0.4612    1.0627
#> 
#>  Chi-square = 1.063 on 1 df,  p-value = 0.3026

# survival
sd_ref <- survdiff(Surv(futime, fustat) ~ rx, data = ovarian)
sd_ref
#> Call:
#> survdiff(formula = Surv(futime, fustat) ~ rx, data = ovarian)
#> 
#>       N Observed Expected (O-E)^2/E (O-E)^2/V
#> rx=1 13        7     5.23     0.596      1.06
#> rx=2 13        5     6.77     0.461      1.06
#> 
#>  Chisq= 1.1  on 1 degrees of freedom, p= 0.3

cat("survdiff_fast chi-square:", as.numeric(sd_fast), "\n")
#> survdiff_fast chi-square: 1.06274
cat("survdiff      chi-square:", sd_ref$chisq,       "\n")
#> survdiff      chi-square: 1.06274

The observed and expected counts, the chi-square statistic, and the p-value agree.

1.3 coxph_fast() versus coxph()

coxph() treats rx as numeric with rx = 1 as the reference level, so we set control = 1 for a consistent comparison. coxph_fast() is based on the Pike-Halley Estimator (Homma, 2025), a closed-form approximation to the Cox partial likelihood maximizer; the residual difference is of order O_p(n^{-3/2}).

# FastSurvival
cx_fast <- coxph_fast(ovarian$futime, ovarian$fustat, ovarian$rx, control = 1)
cx_fast
#> Pike-Halley Estimator (closed-form hazard ratio)
#> 
#> Coefficients:
#>          coef exp(coef) se(coef)      z Pr(>|z|)
#> group -0.5964    0.5508   0.5868 -1.016     0.31
#> 
#> Hazard ratio and 95% Wald confidence interval:
#>       exp(coef) exp(-coef) lower .95 upper .95
#> group    0.5508     1.8155    0.1744    1.7399

# survival
cx_ref <- summary(coxph(Surv(futime, fustat) ~ rx, data = ovarian))
cx_ref$coefficients
#>        coef exp(coef)  se(coef)         z  Pr(>|z|)
#> rx -0.59638 0.5508019 0.5869895 -1.015998 0.3096304
cx_ref$conf.int
#>    exp(coef) exp(-coef) lower .95 upper .95
#> rx 0.5508019   1.815535 0.1743207  1.740371

cat("coxph_fast HR :", cx_fast["exp(coef)"], "\n")
#> coxph_fast HR : 0.5508019
cat("coxph      HR :", cx_ref$coefficients[, "exp(coef)"], "\n")
#> coxph      HR : 0.5508019

The hazard ratio, standard error, and Wald confidence interval agree to the expected order of accuracy.

2. Speed comparison

The following benchmark compares FastSurvival against the survival package using microbenchmark on the ovarian dataset (n = 26). Because the FastSurvival functions are designed for use inside simulation loops where the data are generated in sorted order, we set presorted = TRUE and sort the input once outside the benchmark.

library(microbenchmark)

ord <- order(ovarian$futime)
t_s <- ovarian$futime[ord]
e_s <- ovarian$fustat[ord]
g_s <- ovarian$rx[ord]

bm <- microbenchmark(
  survfit_fast  = survfit_fast(t_s, e_s, t_eval = 500),
  survfit       = summary(survfit(Surv(futime, fustat) ~ 1, data = ovarian),
                          times = 500),
  survdiff_fast = survdiff_fast(t_s, e_s, g_s, control = 1, side = 2,
                                presorted = TRUE),
  survdiff      = survdiff(Surv(futime, fustat) ~ rx, data = ovarian),
  coxph_fast    = coxph_fast(t_s, e_s, g_s, control = 1, presorted = TRUE),
  coxph         = coxph(Surv(futime, fustat) ~ rx, data = ovarian),
  times = 100
)
print(bm)
#> Unit: microseconds
#>           expr    min      lq     mean  median      uq    max neval
#>   survfit_fast   13.3   24.00   34.196   29.80   39.10   96.9   100
#>        survfit  910.6 1172.15 1454.820 1315.40 1613.55 4482.6   100
#>  survdiff_fast    7.8   18.05   24.337   23.50   28.60   51.7   100
#>       survdiff  714.7  922.60 1108.567  992.70 1187.30 5118.9   100
#>     coxph_fast   13.9   24.45   35.933   35.05   42.75  109.4   100
#>          coxph 1053.9 1400.95 1743.869 1522.45 2067.00 4252.7   100

The exact timings depend on the hardware and the random fluctuation inherent in microbenchmarks, but on the ovarian dataset all three FastSurvival functions are roughly 40 times faster than their counterparts in the survival package. The speed gain typically grows with the sample size; on a phase-3 trial sized dataset (n = 600, event rate 80%) the speed gains observed in development reach roughly 50x for survfit_fast(), 40x for survdiff_fast(), and 30x for coxph_fast().

3. Simulation example

We now generate nsim = 100 trials with simdata_fast() and apply the three analysis functions to each simulated dataset.

3.1 Generate simulated data

We consider a two-group parallel trial with 100 patients per group, 12-month accrual, exponential survival with median 18 months in the control group and 24 months in the treatment group, and a small constant dropout hazard.

set.seed(42)
df <- simdata_fast(
  nsim     = 100,
  n        = c(100, 100),
  a.time   = c(0, 12),
  a.rate   = 200 / 12,
  e.median = list(18, 24),
  d.hazard = list(0.01, 0.01),
  seed     = 42
)
head(df)
#>   sim group accrual_time surv_time dropout_time       tte event calendar_time
#> 1   1     1    10.900445  8.599582    48.405809  8.599582     1     19.500027
#> 2   1     1     3.967349 14.797314    23.985600 14.797314     1     18.764663
#> 3   1     1    11.411225 21.757974    38.095037 21.757974     1     33.169199
#> 4   1     1     2.453655  6.568218   197.635604  6.568218     1      9.021873
#> 5   1     1     4.113958 15.996459    13.580202 13.580202     0     17.694159
#> 6   1     1     1.103466  1.092207     7.924918  1.092207     1      2.195673

3.2 Apply the three analysis functions to each simulated trial

For each of the 100 simulated trials, we run:

nsim   <- 100L
t_land <- 18

km_mat <- matrix(NA_real_, nrow = nsim, ncol = 4L,
                 dimnames = list(NULL,
                                 c("surv", "std.err", "lower", "upper")))
ld_vec <- numeric(nsim)   # log-rank chi-square
hr_mat <- matrix(NA_real_, nrow = nsim, ncol = 5L,
                 dimnames = list(NULL,
                                 c("coef", "exp(coef)", "se(coef)",
                                   "lower .95", "upper .95")))

for (s in seq_len(nsim)) {
  d <- df[df$sim == s, ]

  # Treatment-group KM at t = 18 (sort once)
  d_trt <- d[d$group == 2L, ]
  ord_t <- order(d_trt$tte)
  km_mat[s, ] <- unclass(
    survfit_fast(d_trt$tte[ord_t], d_trt$event[ord_t], t_eval = t_land)
  )

  # Pooled sort for log-rank and Cox
  ord_p <- order(d$tte)
  t_p   <- d$tte[ord_p]
  e_p   <- d$event[ord_p]
  g_p   <- d$group[ord_p]

  ld_vec[s] <- as.numeric(
    survdiff_fast(t_p, e_p, g_p, control = 1L, side = 2L, presorted = TRUE)
  )
  hr_mat[s, ] <- unclass(
    coxph_fast(t_p, e_p, g_p, control = 1L, presorted = TRUE)
  )
}

3.3 Summarize the results across 100 simulations

# Landmark survival in the treatment group at t = 18
km_summary <- c(
  mean_surv    = mean(km_mat[, "surv"]),
  sd_surv      = sd(km_mat[, "surv"]),
  mean_std_err = mean(km_mat[, "std.err"])
)
km_summary
#>    mean_surv      sd_surv mean_std_err 
#>   0.60114248   0.05336939   0.05108699

# Log-rank test: empirical power at alpha = 0.05 (two-sided)
p_logrank   <- pchisq(ld_vec, df = 1L, lower.tail = FALSE)
power_empir <- mean(p_logrank < 0.05)
cat(sprintf("Empirical power (log-rank, two-sided, alpha = 0.05): %.2f\n",
            power_empir))
#> Empirical power (log-rank, two-sided, alpha = 0.05): 0.47

# Hazard ratio summary
hr_summary <- c(
  mean_HR     = mean(hr_mat[, "exp(coef)"]),
  sd_HR       = sd(hr_mat[, "exp(coef)"]),
  mean_log_HR = mean(hr_mat[, "coef"]),
  sd_log_HR   = sd(hr_mat[, "coef"]),
  cover_95    = mean(hr_mat[, "lower .95"] <= (18 / 24) &
                     hr_mat[, "upper .95"] >= (18 / 24))
)
hr_summary
#>     mean_HR       sd_HR mean_log_HR   sd_log_HR    cover_95 
#>   0.7487242   0.1274661  -0.3031164   0.1655298   0.9300000

Under exponential survival with medians 18 and 24 months in the control and treatment groups, the true hazard ratio is (log(2) / 24) / (log(2) / 18) = 18 / 24 = 0.75. Across the 100 simulated trials, the average treatment-group landmark survival at t = 18 months was around 0.60, consistent with exp(-log(2) * 18 / 24) after accounting for censoring from the administrative cutoff and dropout. The average hazard ratio estimate from coxph_fast() was close to the true value 0.75, and the empirical coverage of the 95% Wald confidence interval was near the nominal 95% level. The empirical power of the two-sided log-rank test at alpha = 0.05 was around 0.5, reflecting the modest sample size (200 patients total) relative to the modest treatment effect (HR = 0.75) under the chosen accrual and dropout settings.

References

Homma, G. (2025). One step from Pike to Cox: a closed-form hazard ratio estimator. Manuscript under review.

Collett, D. (2014). Modelling Survival Data in Medical Research (3rd ed.). Chapman and Hall/CRC.