Efficient Storage of Imputed Data

2026-02-16

1 Introduction

When performing multiple imputation with {rbmi} using many imputations (e.g., 100-1000), the full imputed dataset can become very large. However, most of this data is redundant: observed values are identical across all imputations.

The {rbmiUtils} package provides two functions to address this:

- reduce_imputed_data(): keeps only the rows whose outcome was originally missing, discarding the observed values that are repeated in every imputation
- expand_imputed_data(): reconstructs the full imputed dataset from the reduced form and the original data

This approach can reduce storage requirements by 90% or more, depending on the proportion of missing data.

2 The Storage Problem

Consider a typical clinical trial dataset:

- 500 subjects, each with 5 scheduled visits (2,500 rows)
- 5% of outcome values missing (125 missing values)
- 1,000 imputations

Full storage: 2,500 rows × 1,000 imputations = 2.5 million rows

Reduced storage: 125 missing values × 1,000 imputations = 125,000 rows (5% of full size)
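
As a quick back-of-the-envelope check of these figures (plain arithmetic, nothing package-specific):

subjects <- 500
visits   <- 5
n_imp    <- 1000
missing_values <- subjects * visits * 0.05   # 125 missing outcomes

subjects * visits * n_imp                    # full storage: 2,500,000 rows
missing_values * n_imp                       # reduced storage: 125,000 rows
missing_values * n_imp / (subjects * visits * n_imp)   # 0.05, i.e. 5% of full size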

3 Setup

library(dplyr)
library(rbmi)
library(rbmiUtils)

4 Example with Package Data

The {rbmiUtils} package includes example datasets we can use:

data("ADMI", package = "rbmiUtils")  # Full imputed dataset
data("ADEFF", package = "rbmiUtils") # Original data with missing values

# Check dimensions
cat("Full imputed dataset (ADMI):", nrow(ADMI), "rows\n")
#> Full imputed dataset (ADMI): 100000 rows
cat("Number of imputations:", length(unique(ADMI$IMPID)), "\n")
#> Number of imputations: 100

5 Reducing Imputed Data

First, prepare the original data to match the imputed data structure:

original <- ADEFF |>
 mutate(
   TRT = TRT01P,
   USUBJID = as.character(USUBJID)
 )

# Count missing values
n_missing <- sum(is.na(original$CHG))
cat("Missing values in original data:", n_missing, "\n")
#> Missing values in original data: 44

Define the variables specification:

vars <- set_vars(
 subjid = "USUBJID",
 visit = "AVISIT",
 group = "TRT",
 outcome = "CHG"
)

Now reduce the imputed data:

reduced <- reduce_imputed_data(ADMI, original, vars)

cat("Full imputed rows:", nrow(ADMI), "\n")
#> Full imputed rows: 100000
cat("Reduced rows:", nrow(reduced), "\n")
#> Reduced rows: 4400
cat("Compression ratio:", round(100 * nrow(reduced) / nrow(ADMI), 1), "%\n")
#> Compression ratio: 4.4 %
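
The reduced row count follows directly from the figures above: the number of originally missing values times the number of imputations.

# 44 missing values x 100 imputations = 4,400 rows
n_missing * length(unique(ADMI$IMPID)) == nrow(reduced)
#> [1] TRUE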

6 What’s in the Reduced Data?

The reduced dataset contains only the rows that were originally missing:

# First few rows
head(reduced)
#> # A tibble: 6 × 12
#>   IMPID STRATA REGION  REGIONC TRT    BASE   CHG AVISIT USUBJID CRIT1FLN CRIT1FL
#>   <dbl> <chr>  <chr>     <dbl> <chr> <dbl> <dbl> <chr>  <chr>      <dbl> <chr>  
#> 1     1 A      North …       1 Plac…    12 -1.96 Week … ID011          0 N      
#> 2     1 A      Europe        3 Drug…     3 -3.71 Week … ID014          0 N      
#> 3     1 B      Europe        3 Drug…     9 -1.96 Week … ID018          0 N      
#> 4     1 A      Asia          4 Drug…    10 -5.55 Week … ID033          0 N      
#> 5     1 A      Asia          4 Drug…     0 -1.28 Week … ID061          0 N      
#> 6     1 A      South …       2 Plac…     5 -2.60 Week … ID071          0 N      
#> # ℹ 1 more variable: CRIT <chr>

# Structure matches original imputed data
cat("\nColumns in reduced data:\n")
#> 
#> Columns in reduced data:
cat(paste(names(reduced), collapse = ", "))
#> IMPID, STRATA, REGION, REGIONC, TRT, BASE, CHG, AVISIT, USUBJID, CRIT1FLN, CRIT1FL, CRIT

Each row represents an imputed value for a specific subject-visit-imputation combination.
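
Equivalently, every imputation contributes exactly one row per originally missing value, which can be confirmed with a quick count (using the objects created above):

# Each imputation should contribute n_missing rows (44 here)
all(table(reduced$IMPID) == n_missing)
#> [1] TRUE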

7 Expanding Back to Full Data

When you need to run analyses, expand the reduced data back to full form:

expanded <- expand_imputed_data(reduced, original, vars)

cat("Expanded rows:", nrow(expanded), "\n")
#> Expanded rows: 100000
cat("Original ADMI rows:", nrow(ADMI), "\n")
#> Original ADMI rows: 100000

8 Verifying Data Integrity

Let’s verify that the round-trip preserves data integrity:

# Sort both datasets for comparison
admi_sorted <- ADMI |>
 arrange(IMPID, USUBJID, AVISIT)

expanded_sorted <- expanded |>
 arrange(IMPID, USUBJID, AVISIT)

# Compare CHG values
all_equal <- all.equal(
 admi_sorted$CHG,
 expanded_sorted$CHG,
 tolerance = 1e-10
)

cat("Data integrity check:", all_equal, "\n")
#> Data integrity check: TRUE
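
As an additional check beyond the outcome comparison, the expanded data should contain no missing outcomes, just like the full imputed dataset:

# All originally missing outcomes are filled in after expansion
any(is.na(expanded$CHG))
#> [1] FALSE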

9 Practical Workflow

Here’s how to integrate efficient storage into your workflow:

9.1 Save Reduced Data

# After imputation
impute_obj <- impute(draws_obj, references = c("Placebo" = "Placebo", "Drug A" = "Placebo"))
full_imputed <- get_imputed_data(impute_obj)

# Reduce for storage
reduced <- reduce_imputed_data(full_imputed, original_data, vars)

# Save both (reduced is much smaller)
saveRDS(reduced, "imputed_reduced.rds")
saveRDS(original_data, "original_data.rds")
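
To quantify the savings on disk, you can compare serialized file sizes. Note that saveRDS() compresses by default, so the on-disk ratio will not exactly match the row ratio; the full-data file name below is purely illustrative.

# Optional: write the full data once, only to compare file sizes
saveRDS(full_imputed, "imputed_full.rds")
file.size("imputed_reduced.rds") / file.size("imputed_full.rds")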

9.2 Load and Analyse

# Load saved data
reduced <- readRDS("imputed_reduced.rds")
original_data <- readRDS("original_data.rds")

# Expand when needed for analysis
full_imputed <- expand_imputed_data(reduced, original_data, vars)
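
# Optional sanity check (not part of the package workflow): the expanded
# data should contain one copy of the original rows per imputation
stopifnot(
 nrow(full_imputed) ==
   nrow(original_data) * length(unique(full_imputed$IMPID))
)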

# Run analysis
ana_obj <- analyse_mi_data(
 data = full_imputed,
 vars = vars,
 method = method,  # the rbmi method object used for imputation (e.g. method_bayes())
 fun = ancova
)

10 Storage Comparison

Here’s a comparison of storage requirements for different scenarios:

Subjects  Visits  Missing %  Imputations  Full Rows  Reduced Rows  Savings
500       5       5%         100          250,000    12,500        95%
500       5       5%         1,000        2,500,000  125,000       95%
1,000     8       10%        500          4,000,000  400,000       90%
200       4       20%        1,000        800,000    160,000       80%

The savings scale with:

- the proportion of missing data: the reduced data holds roughly (missing fraction) × (full rows), so the relative savings are approximately 1 − missing fraction
- the size of the full dataset: the percentage saved is unchanged, but the absolute number of rows avoided grows with subjects, visits, and imputations

11 When to Use This Approach

Use reduced storage when:

- you are working with a large number of imputations (hundreds to thousands)
- storage or memory is constrained, or imputed datasets need to be archived or shared
- the proportion of missing data is small, so most rows are redundant observed values

Keep full data when:

- the dataset or the number of imputations is small and the savings are negligible
- you run analyses repeatedly and want to avoid re-expanding the data each time

12 Edge Cases

12.1 No Missing Data

If the original data has no missing values, reduce_imputed_data() returns an empty data.frame:

# If original has no missing values
reduced <- reduce_imputed_data(full_imputed, complete_data, vars)
nrow(reduced)
#> [1] 0

# expand_imputed_data handles this correctly
expanded <- expand_imputed_data(reduced, complete_data, vars)
# Returns original data with IMPID = "1"

12.2 Single Imputation

The functions work with any number of imputations, including just one.
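
For example, reducing a single imputation drawn from the example data above works exactly as before; this simply filters ADMI to one imputation, and the result should contain one row per originally missing value:

single <- ADMI |> filter(IMPID == 1)
reduced_single <- reduce_imputed_data(single, original, vars)
nrow(reduced_single)   # one row per originally missing value (44 here)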

13 Summary

The reduce_imputed_data() and expand_imputed_data() functions provide an efficient way to store imputed datasets:

  1. Reduce after imputation to store only what’s necessary
  2. Expand before analysis to reconstruct full datasets
  3. Verify data integrity is preserved through round-trip

This approach is particularly valuable when working with large numbers of imputations or when storage and memory are constrained.

For the complete analysis workflow using imputed data, see vignette('pipeline').