EpiNow2 benchmarks for select models: speed versus nowcast/forecast performance

Code
library(EpiNow2)
library(scoringutils)
library(data.table)
library(rstan)
library(ggplot2)
library(dplyr)
library(purrr)
library(lubridate)
library(scales)
library(posterior)
library(parallel)
library(patchwork)
set.seed(9876)

In using {EpiNow2}, users will often need to balance fast model runs against good forecast and nowcast performance. {EpiNow2} provides a range of customisations of the default model to suit these trade-offs.

The aim of this vignette is to show the trade-offs between select model customisations in terms of run times and nowcasting and real-time forecasting performance. We will explore four (4) {EpiNow2} model options, including the default model. The models, chosen to cover typical use cases, differ in the prior on how \(R_t\) is generated over time.

We will evaluate how well the models perform when fitted with the MCMC sampling algorithm in Stan, as MCMC is the state-of-the-art algorithm for fitting these kinds of models.

Data

To compare the models, we will simulate an epidemic with waves capturing the growth, peak, and decline phase. We will then extract subsets of the data capturing the three phases for use as scenarios. All the models will be fit to the three phases and evaluated.

Throughout this vignette, several argument values, including the observation model options and the \(R_t\) model prior, will be reused, so we will define them here. Note that we use all but one of the available cores for parallelisation (7 of 8 on the machine used here).

Code
# Observation model options
obs <- obs_opts(
  scale = Normal(0.1, 0.025),
  return_likelihood = TRUE
)
# Rt prior
rt_prior_default <- Normal(2, 0.1)
# Number of cores
options(mc.cores = detectCores() - 1)
Data simulation steps in detail

Let’s start by creating the “true” \(R_t\) and infections data/trajectories.

We will use {EpiNow2}’s forecast_infections() function. This function re-uses the posterior from a fitted model to generate infections from a supplied \(R_t\) trajectory.

forecast_infections() requires a fitted “estimates” object from epinow() run with the output argument set to “fit”, the trajectory of the reproduction number, R, and the number of samples to simulate.

To obtain the estimates object, we will run the epinow() function using real-world observed data and delay distributions to recover realistic parameter values. For the data, we will use the first \(60\) observations of the example_confirmed data set. We will use the example_generation_time for the generation time, and the sum of the incubation period (example_incubation_period) and reporting delay (example_reporting_delay) as the delay. These delays come with the package.

For the \(R_t\) prior, we will use a 14-day random walk, with a mean of \(2\) and standard deviation of \(0.1\). Lastly, as we only want to generate estimates, we will turn off forecasting by setting horizon = 0.

We’ll now generate the estimates object from the observed data (example_confirmed).

Code
cases <- example_confirmed[1:60]
estimates <- epinow(
  data = cases,
  generation_time = generation_time_opts(example_generation_time),
  delays = delay_opts(example_incubation_period + example_reporting_delay),
  rt = rt_opts(prior = rt_prior_default, rw = 14),
  gp = NULL,
  obs = obs,
  forecast = forecast_opts(horizon = 0), # no forecasting
  output = "fit"
)

That’s it for the estimates object. Next, we’ll create the R data using an arbitrary trajectory with some Gaussian noise added. We’ll use it to simulate the true infections data from a single posterior sample.

Code
# Arbitrary reproduction number trajectory
R <- c(
    seq(1, 1.5, length.out = 15),  # Rising to peak 1
    seq(1.5, 1, length.out = 15),  # Falling back to 1
    seq(1, 0.5, length.out = 15),  # Dropping to valley
    seq(0.5, 1, length.out = 15),  # Rising back to 1
    seq(1, 1.4, length.out = 10),  # Smaller peak
    seq(1.4, 1, length.out = 10),  # Back to 1
    seq(1, 0.8, length.out = 10),  # Small dip
    seq(0.8, 1, length.out = 10)   # Returning to 1
)
# Add multiplicative Gaussian noise (mean 1, sd 0.05)
R_noisy <- R * rnorm(length(R), 1, 0.05)

# Forecast infections and the trajectory of Rt
forecast <- forecast_infections(
  estimates$estimates,
  R = R_noisy,
  samples = 1
)

Now, let’s extract the true \(R_t\) and infections data.

Code
# Extract and prepare the simulated true infections
infections_true <- forecast$summarised[variable == "infections", .(date, confirm = ceiling(mean))]

# Prepare the simulated true Rt
R_true <- data.frame(date = infections_true$date, R = R_noisy)

Below is the simulated data with dashed lines showing the chosen growth, peak, and decline phases in infections. We use the second wave because we want enough data to fit/train the models. The chosen dates also define the scenarios that the models will be fit to and evaluated on.

Code
snapshot_dates <- c(
    "growth" = as.Date("2020-05-02"),
    "peak" = as.Date("2020-05-09"),
    "decline" = as.Date("2020-05-21")
)
# Rt plot
R_traj <- ggplot(data = R_true) +
  geom_line(aes(x = date, y = R)) +
    labs(x = "Date", y = "Rt")

# Infections plot
infections_traj <- ggplot(data = infections_true) +
  geom_line(aes(x = date, y = confirm)) +
  geom_vline(xintercept = snapshot_dates, linetype = "dashed") +
  annotate("text", x = snapshot_dates["growth"], y = 7500, label = "Growth", color = "blue",
           angle = 90, vjust = -0.5) +
  annotate("text", x = snapshot_dates["peak"], y = 7500, label = "Peak", color = "blue",
           angle = 90, vjust = -0.5) +
  annotate("text", x = snapshot_dates["decline"], y = 7500, label = "Decline", color = "blue",
           angle = 90, vjust = -0.5) +
  scale_y_continuous(labels = scales::label_comma()) +
    labs(x = "Date", y = "Infections")

# Compose the plots
(R_traj/infections_traj) +
    plot_layout(axes = "collect") &
    scale_x_date(date_labels = "%b %d", date_breaks = "1 week") &
    theme_minimal()

Let’s proceed to define the models, fit them to the true data, and evaluate their performance.

Models

Descriptions

Below we describe each model.

Model descriptions

| model | description |
|---|---|
| default | Default model (non-stationary prior on Rt) |
| non_mechanistic | No mechanistic prior on Rt |
| rw7 | 7-day random walk prior on Rt |
| non_residual | Stationary prior on Rt |

Configurations

We will now define the {EpiNow2} configurations for each model, which are modifications of the default model.

Code
model_configs <- list(
  # The default model
  default = list(
    rt = rt_opts(
      prior = rt_prior_default
    )
  ),
  # The non-mechanistic model
  non_mechanistic = list(
    rt = NULL
  ),
  # The 7-day Random Walk Rt model
  rw7 = list(
    rt = rt_opts(
      prior = rt_prior_default,
      rw = 7
    ),
    gp = NULL
  ),
  # The non_residual model
  non_residual = list(
    rt = rt_opts(
      prior = rt_prior_default,
      gp_on = "R0"
    )
  )
)

Inputs

All the models will share the configuration for the generation time, incubation period, reporting delay, and the forecast horizon, so we will define them once and pass them to the models.

Code
# Combine the example COVID-19 incubation period and reporting delay (from EpiNow2) into one delay
delay <- example_incubation_period + example_reporting_delay

# 7-day forecast window
horizon <- 7

# Combine the shared model inputs into a list for use across all the models
model_inputs <- list(
  generation_time = generation_time_opts(example_generation_time),
  delays = delay_opts(delay),
  obs = obs,
  forecast = forecast_opts(horizon = horizon),
  verbose = FALSE
)

Running the models

Now, we’re ready to run the models. We will use snapshots of the true infections data, each covering the last 10 weeks up to the snapshot date and together spanning the growth, peak, and decline phases of the second wave.

Code
data_length <- 70
# create the data snapshots for fitting the models using the snapshot dates.
data_snaps <- lapply(
  snapshot_dates,
  function(snap_date) {
    tail(infections_true[date <= snap_date], data_length)
  }
)

# Create a version of epinow() that, like base::try(), captures errors so that
# the loop continues even if some model runs fail.
safe_epinow <- purrr::safely(epinow)
# Run the models over the different dates
results <- lapply(
  data_snaps, function(data) {
    lapply(
      model_configs,
      function(model) {
        do.call(
          safe_epinow,
          c(
            data = list(data),
            model_inputs,
            model
          )
        )
      }
    )
  }
)

Evaluating model performance

We will now evaluate the models.

Extraction functions

We’ll begin by setting up the following post-processing functions:

Code
# Function to extract the "timing", "Rt", "infections", and "reports" variables from an
# epinow() run. It expects a model run, x, which contains a "results" or "error" component.
# If the model run successfully, "error" should be NULL.
extract_results <- function(x, variable) {
  stopifnot(
    "variable must be one of c(\"timing\", \"R\", \"infections\", \"reports\")" =
      variable %in% c("timing", "R", "infections", "reports")
  )
  # Return NA if there's an error
  if (!is.null(x$error)) {
    return(NA)
  }

  if (variable == "timing") {
    return(round(as.duration(x$result$timing), 1))
  } else {
    obj <- x$result$estimates$fit
  }

  # Extracting "Rt", "infections", and "reports" is different based on the object's class and
  # other settings
  if (inherits(obj, "stanfit")) {
    # Depending on rt_opts(use_rt = TRUE/FALSE), R shows up as R or gen_R
    if (variable == "R") {
      # The non-mechanistic model returns "gen_R" whereas the others sample "R".
      if ("R[1]" %in% names(obj)) {
        return(rstan::extract(obj, "R")$R)
      } else {
        return(rstan::extract(obj, "gen_R")$gen_R)
      }
    } else {
      return(rstan::extract(obj, variable)[[variable]])
    }
  } else {
    obj_mat <- as_draws_matrix(obj)
    # Extracting R depends on the value of rt_opts(use_rt = )
    if (variable == "R") {
      if ("R[1]" %in% variables(obj_mat)) {
          return(subset_draws(obj_mat, "R"))
      } else {
        return(subset_draws(obj_mat, "gen_R"))
      }
    } else {
        return(subset_draws(obj_mat, variable))
      }
    }
}

# Apply `extract_results()` to a nested list of model runs per snapshot date.
get_model_results <- function(results_by_snapshot, variable) {
  # Get model results list
  purrr::map_depth(results_by_snapshot, 2, extract_results, variable)
}

# Function to convert all columns to factor except the specified cols in `except`
make_cols_factors <- function(data, except){
  data[
    ,
    (setdiff(names(data), except)) :=
      lapply(.SD, as.factor),
    .SDcols = setdiff(names(data), except)
  ]
  data[]
}

# Add factor levels to the `epidemic_phase` column to allow for easy ordering.
add_epidemic_phase_levels <- function(data){
  data[, epidemic_phase := factor(epidemic_phase, levels = c("growth", "peak", "decline"))]
  data[]
}

# Calculate the CRPS using the [scoringutils](https://epiforecasts.io/scoringutils/)
# R package. This ensures that the estimates and truth data are the same length
# before calculating the CRPS, and returns NAs if the estimates object is not a
# matrix (i.e., the run failed), since the extraction function above returns a
# matrix on success.
calc_crps <- function(estimates, truth) {
    # if the object is not a matrix, then it's an NA (failed run)
    if (!inherits(estimates, c("matrix"))) return(rep(NA_real_, length(truth)))
    # Assumes that the estimates object is structured with the samples as rows
    shortest_obs_length <- min(ncol(estimates), length(truth))
    reduced_truth <- head(truth, shortest_obs_length)
    estimates_transposed <- t(estimates) # transpose to have samples as columns
    reduced_estimates <- head(estimates_transposed, shortest_obs_length)
    crps_sample(reduced_truth, reduced_estimates)
}

# Calculate CRPS estimates for the nested list of model runs per snapshot date and flatten into a simple list.
process_crps <- function(results, variable, truth) {
    # Extract values
    results_by_snapshot <- get_model_results(results, variable = variable)

    # Get the dates reference from the true infections time series
    dates_ref <- infections_true$date
    # For each snapshot (growth, peak, decline)
    crps_by_snapshot <- purrr::imap(
        results_by_snapshot,
        function(results_by_model, snapshot_ref_label) {
            # Get the correct slice of truth data for this snapshot date. Note that we now
            # include the test data, i.e., the forecast horizon
            snapshot_date <- snapshot_dates[snapshot_ref_label]
            truth_slice <- tail(
                truth[1:which(dates_ref == snapshot_date + horizon)],
                data_length
            )

            # For each model in this snapshot, calculate CRPS comparing model estimates to truth slice
            purrr::map(results_by_model, function(res) {
                calc_crps(estimates = res, truth = truth_slice)
            })
        })

    # Add dates column based on snapshot length
    crps_with_dates <- purrr::imap(
        crps_by_snapshot,
        function(results_by_model, snapshot_ref_label) {
            date_end <- snapshot_dates[snapshot_ref_label] + horizon

            purrr::map(results_by_model, function(crps_values) {
                data.table(crps = crps_values)[,
                    date := seq.Date(
                        from = date_end - .N + 1,
                        to = date_end,
                        by = "day"
                    )]
            })
        })
    # Flatten the results into one dt
    crps_flat <- lapply(
        crps_with_dates,
        function(snapshot_results) {
            rbindlist(snapshot_results, idcol = "model")
        }) |>
        rbindlist(idcol = "snapshot_date")

    # Map the snapshot dates to their phase labels
    snapshot_date_labels <- names(snapshot_dates)
    crps_flat[, epidemic_phase := snapshot_date_labels[
        match(snapshot_date, snapshot_date_labels)
    ]]

    return(crps_flat[])
}

# Shared plot settings
plot_caption_custom <- "Where a model is not shown, it means it failed to run"
plot_theme_custom <- theme_minimal() +
    theme(plot.title = element_text(size = 18),
          strip.text = element_text(size = 13),
          axis.title = element_text(size = 13),
          axis.text = element_text(size = 11),
          panel.border = element_rect(color = "black", fill = NA, linewidth = 0.5)
    )

Run times (computational resources)

Let’s see how long each model took to run using MCMC.

Code
# Extract the run times and reshape to dt
runtimes_by_snapshot <- get_model_results(results, "timing")

# Flatten the results
runtimes_dt <- lapply(runtimes_by_snapshot, function(x) as.data.table(x)) |>
  rbindlist(idcol = "snapshot_date", ignore.attr = TRUE)

# snapshot dates dictionary
snapshot_date_labels <- names(snapshot_dates)

# Replace snapshot_date based on the dictionary
runtimes_dt[, epidemic_phase := snapshot_date_labels[match(snapshot_date, snapshot_date_labels)]]

# Add model descriptions
runtimes_dt_long <- melt(
  runtimes_dt,
  id.vars = "epidemic_phase",    # Column to keep as an identifier
  measure.vars = model_descriptions$model,  # Select the model columns by name
  variable.name = "model",      # Name for the 'model' column
  value.name = "timing"         # Name for the 'timing' column
)

runtimes_dt_detailed <- merge(
  runtimes_dt_long,
  model_descriptions,
  by = "model"
)

# Make all columns except timing a factor
runtimes_dt_detailed <- make_cols_factors(runtimes_dt_detailed, except = "timing")

# Add epidemic_phase factor levels to c("growth", "peak", "decline"))
runtimes_dt_detailed <- add_epidemic_phase_levels(runtimes_dt_detailed)

# Plot the timing
timing_plot <- ggplot(data = runtimes_dt_detailed) +
  geom_col(aes(x = epidemic_phase,
                 y = timing,
                 fill = model
                 ),
           position = position_dodge2()
  ) +
  labs(x = "Epidemic phase",
       y = "Runtime (secs)",
       fill = "Model",
       title = "Model runtimes"
  ) +
  scale_fill_brewer(palette = "Dark2") + # the bars use the fill aesthetic
  scale_y_continuous(breaks = seq(0, max(runtimes_dt_detailed$timing) + 20, 25)) +
  plot_theme_custom
timing_plot

We can see that the default model is the slowest in all data scenarios. On the other hand, the non-mechanistic model is the fastest, followed by the 7-day random walk model, and the non-residual model. Let’s see how the model run times compare with forecasting and nowcasting performance.

Evaluating model performance

We will use the continuous ranked probability score (CRPS). CRPS is a proper scoring rule that measures the accuracy of probabilistic forecasts. When comparing models, the smaller the CRPS, the better.
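To make the score concrete, below is a minimal base-R sketch of the sample-based CRPS estimator, \(\mathrm{CRPS} \approx \frac{1}{n}\sum_i |x_i - y| - \frac{1}{2n^2}\sum_{i,j}|x_i - x_j|\), which is the quantity scoringutils::crps_sample() estimates from posterior draws. The function name crps_from_samples is ours, for illustration only.

```r
# Sample-based CRPS estimator: E|X - y| - 0.5 * E|X - X'|,
# approximated from posterior samples x_1, ..., x_n.
crps_from_samples <- function(samples, observed) {
  mean(abs(samples - observed)) -
    0.5 * mean(abs(outer(samples, samples, "-")))
}

set.seed(42)
draws <- rnorm(1000, mean = 1.2, sd = 0.2)
crps_near <- crps_from_samples(draws, observed = 1.2) # forecast centred on the truth
crps_far  <- crps_from_samples(draws, observed = 2.0) # forecast far from the truth
crps_near < crps_far # TRUE: the better-calibrated forecast scores lower
```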

We will evaluate model runtimes versus overall performance out-of-sample, i.e., total CRPS for \(R_t\) and infections in the forecasting window. Additionally, for \(R_t\), we’ll evaluate the nowcast value, i.e., the estimate of \(R_t\) before the forecast horizon, and for infections, we will compare the 7-day forecast as a measure of real-time performance.

If you are interested in the time-varying performance of the models, see the appendix section at the end of this vignette.

Code
# Process CRPS for Rt
rt_crps <- process_crps(results, "R", R_true$R)
rt_crps_full <- merge.data.table(
    rt_crps,
    model_descriptions,
    by = "model"
  )

# Re-categorise fit_type column and convert to factor
rt_crps_dt <- make_cols_factors(rt_crps_full, except = c("date", "crps"))
rt_crps_dt_final <- add_epidemic_phase_levels(rt_crps_dt)

# Process CRPS for infections
infections_crps <- process_crps(results, "infections", infections_true$confirm)
infections_crps_full <- merge.data.table(
  infections_crps,
  model_descriptions,
  by = "model"
)

infections_crps_dt <- make_cols_factors(infections_crps_full, except = c("date", "crps"))
infections_crps_dt_final <- add_epidemic_phase_levels(infections_crps_dt)

Overall model performance

Let’s compare the overall/aggregated out-of-sample (forecast horizon) performance of the models in terms of the total CRPS for \(R_t\) and infections compared with model run times.

Summary and plotting functions
Code
# Calculate total CRPS stratified by the "by" vector
calculate_total_crps <- function(data, by) {
    evaluation_data <- data[, .SD[(.N - horizon + 1):.N], by = by]
    evaluation_data[, .(total_crps = sum(crps, na.rm = TRUE)), by = by]
}
# Plot total CRPS. It returns a ggplot object that can take further layers.
plot_performance_vs_timing <- function(performance_dt, performance_col, title) {
  plot <- ggplot(data = performance_dt) +
    geom_point(
      aes(
        x = timing,
        y = .data[[performance_col]],
        color = model
      ),
      size = 5,
      stroke = 2.2,
      shape = 1
    ) +
    facet_wrap(~ epidemic_phase) +
    guides(
      color = guide_legend(title = "Model")
    ) +
    labs(title = title) +
    plot_theme_custom +
    scale_color_brewer(palette = "Dark2")
  return(plot)
}

In the figure below, we show the model runtimes compared to the total performance in forecasting \(R_t\), grouped by the three epidemic phases. Ideal models would be in the bottom left corner, i.e., fast and with low CRPS.

Below, we show the model run times versus total performance in forecasting infections, grouped by the three epidemic phases. Ideal models would be in the bottom left corner, i.e., fast and with low CRPS.

Nowcast \(R_t\) estimates

Let’s now compare the performance of the models in terms of nowcast estimates of \(R_t\), i.e., the estimate of \(R_t\) at horizon = -1 (the last day of data before the forecast window), by epidemic phase. Ideal models would be in the bottom left corner, i.e., fast and with low CRPS.

Real-time infection forecast

Let’s also see the real-time performance of the models in estimating infections by epidemic phase compared with model run times. Ideal models would be in the bottom left corner, i.e., fast and with low CRPS.

Summary of results

Below is a summary of overall/total out-of-sample performance of each model.

Summary of overall performance out-of-sample by model

| Model | Summary |
|---|---|
| default | Slowest run time with mixed performance for forecasting Rt and infections |
| non_mechanistic | Fastest and best for forecasting Rt, but weakest for real-time infections forecasting |
| non_residual | Slow run time with mixed performance for forecasting Rt and infections |
| rw7 | Moderate run time with good infections forecasting performance |

The next table shows a summary of model performance in nowcasting Rt.

Summary of model performance (nowcasting Rt)

| Model | Summary |
|---|---|
| default | Slowest across all phases with poor accuracy |
| non_mechanistic | Fastest model; better performance relative to the default model |
| non_residual | Slow run time but with consistently the best CRPS across all phases |
| rw7 | Moderate run time but with good performance in the growth and decline phases |

Lastly, we show a summary of model performance in real-time forecasting of infections.

Summary of model performance (real-time infections forecasting)

| Model | Summary |
|---|---|
| default | Slowest across all phases with mixed performance |
| non_mechanistic | Fastest run time but with the worst performance overall |
| non_residual | Slower run time with good performance across all phases |
| rw7 | Moderate run time with mixed performance across all phases |

As can be seen in the summaries above, each model has its strengths and weaknesses, and a balance needs to be struck.

We will now discuss the considerations and recommendations for choosing an appropriate model based on the results of these benchmarks and experience with using the models in practice.

Considerations and recommendations for choosing an appropriate model

Changing default stan controls

Users can consider changing the default stan options set in stan_opts(). Exercise caution here and observe the number of divergences, effective sample size (ESS), and Rhat to ensure that the model is converging well. The following are some options for changing stan controls:
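For illustration, a few knobs one might turn are sketched below. The specific values are arbitrary examples, not recommendations, and argument availability should be checked against ?stan_opts for your installed version of {EpiNow2}.

```r
# Illustrative only: lighter sampling for speed, stricter control for accuracy.
# Fewer samples/chains run faster but give noisier estimates; a higher
# adapt_delta reduces divergent transitions at the cost of speed.
stan_fast <- stan_opts(
  samples = 1000, # fewer post-warmup samples than the default
  chains = 2      # fewer chains, so fewer cores needed
)
stan_careful <- stan_opts(
  control = list(adapt_delta = 0.99, max_treedepth = 15)
)
```

Either object can then be passed to epinow() via its stan argument.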

Non-mechanistic versus mechanistic models

Estimation in {EpiNow2} using the semi-mechanistic approaches (putting a prior on \(R_t\)) is often much slower than the non-mechanistic approach (setting rt = NULL). The mechanistic model is slower because it models aspects of the processes and mechanisms that drive \(R_t\) estimates using the renewal equation. The non-mechanistic model, on the other hand, runs much faster but does not use the renewal equation to generate infections. Because of this, none of the options defining the behaviour of the reproduction number are available in this case, limiting its flexibility.

Faster-fitting non-MCMC algorithms (experimental and unstable)

The default sampling method, set through stan_opts(), performs MCMC sampling using {rstan}. MCMC is accurate but often slow. {EpiNow2} also provides three (3) other algorithms that approximate MCMC: Automatic Differentiation Variational Inference (ADVI), the Pathfinder method, and Laplace sampling (set using stan_opts(method = "laplace", backend = "cmdstanr")). These methods are much faster because they are approximate (see, for example, a detailed explanation of automatic variational inference in Stan). They are, however, currently experimental and unstable, and more research is needed to understand under what conditions they excel and fail. We therefore recommend that users stick with the MCMC sampler.

In {EpiNow2}, you can use variational inference with either the {rstan} or {cmdstanr} backend, but you must install {cmdstanr} to use it as a backend. Set stan_opts(method = "vb") to use the {rstan} backend, or stan_opts(method = "vb", backend = "cmdstanr") for {cmdstanr}. Additionally, {EpiNow2} supports the Laplace algorithm (set using stan_opts(method = "laplace", backend = "cmdstanr")) and the Pathfinder algorithm (set using stan_opts(method = "pathfinder", backend = "cmdstanr")) through the {cmdstanr} R package.
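Collected in one place, the approximate-method configurations described above are set as follows (all experimental; the cmdstanr backend requires {cmdstanr} to be installed):

```r
stan_vb         <- stan_opts(method = "vb")                               # ADVI, rstan backend
stan_vb_cmd     <- stan_opts(method = "vb", backend = "cmdstanr")         # ADVI, cmdstanr backend
stan_laplace    <- stan_opts(method = "laplace", backend = "cmdstanr")    # Laplace approximation
stan_pathfinder <- stan_opts(method = "pathfinder", backend = "cmdstanr") # Pathfinder
```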

The non-MCMC methods can be used in various ways. First, you can initialise the MCMC sampling algorithm with the fit object returned by methods such as Pathfinder; more details can be found in the original Pathfinder paper. This approach speeds up the initialisation phase of the MCMC algorithm. Second, the non-MCMC methods are also useful for prototyping. For example, if you are testing a pipeline setup, it may be more practical to switch to a method like variational Bayes and only use MCMC once the pipeline is up and running.

Faster competitive models at cost of smoothness/granularity of estimates

The random walk model is much faster than the default model and is competitive in all tasks and data scenarios. However, choosing it comes at a cost of reduced smoothness/granularity of the estimates, compared to the other methods.

Caveats of this exercise

We generated the data using an arbitrary R trajectory. The models were also only fit to one time point. Ideally, they would be fit to multiple time windows. This experiment therefore represents only one of many data and time point scenarios that the models can be benchmarked against.

The run times measured here use a crude method that compares the start and end times of each simulation. It only measures the time taken for one model run and may not be accurate. For more accurate run time measurements, we recommend using a more sophisticated approach like those provided by packages like {bench} and {microbenchmark}.
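As a sketch of what that could look like with {bench} (using a cheap stand-in expression; in practice you would substitute the epinow() call of interest, keeping in mind that repeated MCMC runs are costly):

```r
library(bench)

# bench::mark() runs the expression several times and reports summary
# statistics, which is more robust than a single start/end timestamp.
res <- bench::mark(
  Sys.sleep(0.01), # stand-in: replace with the epinow() call being benchmarked
  iterations = 5,
  check = FALSE # MCMC results differ between iterations, so skip the check
)
res[, c("expression", "median", "total_time")]
```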

Lastly, we used 7 cores for between-chain parallelisation, and so using more or fewer cores might change the run time results.

Results appendix

Model performance over time

Let’s see how the \(R_t\) and infections CRPS changed over time.