Bayesian Functional Principal Component Analysis

2026-04-20

Introduction

This vignette provides a detailed guide to the fpca_bayes() function in the refundBayes package, which fits a Bayesian Functional Principal Component Analysis (FPCA) model using Stan. Unlike the regression-style functions in the package (sofr_bayes, fosr_bayes, fcox_bayes, fofr_bayes), fpca_bayes() does not relate a functional response to predictors — it instead decomposes a sample of curves into a smooth population mean function plus a small number of subject-specific functional principal components, with full Bayesian uncertainty quantification on every quantity except the eigenfunctions themselves.

The methodology builds directly on the tutorial by Jiang, Crainiceanu, and Cui (2025), Tutorial on Bayesian Functional Regression Using Stan, published in Statistics in Medicine, and reuses the same penalized-spline + Stan machinery that powers fosr_bayes() for modelling the mean function.

Install the refundBayes Package

The refundBayes package can be installed from CRAN:

install.packages("refundBayes")

For the latest version of the refundBayes package, users can install from GitHub:

library(remotes)
remotes::install_github("https://github.com/ZirenJiang/refundBayes")

Statistical Model

The FPCA Model

For subject \(i = 1, \ldots, n\), let \(Y_i(t)\) be a functional observation recorded at \(M\) common grid points \(t_1, \ldots, t_M\) on a domain \(\mathcal{T}\). The classical FPCA decomposition (Karhunen–Loève expansion plus measurement error) writes

\[Y_i(t_m) \;=\; \mu(t_m) \;+\; \sum_{j=1}^{J} \xi_{ij}\, \phi_j(t_m) \;+\; \epsilon_i(t_m), \qquad \epsilon_i(t_m) \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma_\epsilon^2),\]

where:

  • \(\mu(t)\) is the smooth population mean function;
  • \(\phi_1(t), \ldots, \phi_J(t)\) are the orthonormal eigenfunctions of the covariance operator;
  • \(\xi_{ij}\) are subject-specific scores with \(\xi_{ij} \sim \mathcal{N}(0, \lambda_j^2)\), \(\lambda_1^2 \ge \cdots \ge \lambda_J^2\);
  • \(\epsilon_i(t_m)\) is iid Gaussian measurement error with variance \(\sigma_\epsilon^2\).

The functional domain \(\mathcal{T}\) is not restricted to \([0, 1]\); it is whatever interval the columns of the response matrix represent. The model assumes the same dense grid is observed for every subject.

Fixed Eigenfunctions from Frequentist FPCA

The defining feature of fpca_bayes() is that the eigenfunctions \(\phi_j(t)\) are not sampled. They are estimated once, before Stan ever runs, by calling

init.fpca <- refund::fpca.sc(Y_mat, npc = npc)

inside the function. The resulting \(M \times J\) matrix init.fpca$efunctions is transposed to \(J \times M\) and shipped to Stan as a data matrix called Phi_mat. The number of components \(J\) is chosen automatically by fpca.sc() based on percent-variance-explained when npc = NULL (the default), or pinned by the user via the npc argument.

There are three reasons for this design choice:

  1. Identifiability. Treating the eigenfunctions as parameters introduces label-switching, sign-flipping, and orthogonality constraints that are notoriously hard to handle in MCMC. Fixing them resolves all three issues at once.
  2. Linearity in Stan. With \(\phi_j(t)\) fixed, the model becomes linear-Gaussian conditional on the variance components, which Stan can sample very efficiently using NUTS / HMC.
  3. Consistency with the rest of the package. fosr_bayes() and fofr_bayes() use the same fixed-eigenfunction trick to model the residual functional process. fpca_bayes() simply elevates that idea from “nuisance covariance” to “model of primary interest”.

The trade-off is that uncertainty in the eigenfunction estimation itself is not propagated into the posterior of the scores or the mean. In practice, when \(n\) is moderately large and the eigenfunctions are well-separated (eigenvalues of decreasing magnitude with a clear gap), this is a small price for a fast and stable sampler.

Mean Function via Penalized Splines

The mean function \(\mu(t)\) is represented using a penalized spline expansion identical to the one used for the response-domain coefficient functions in fosr_bayes():

\[\mu(t) \;=\; \sum_{k=1}^{K} \alpha_k\, \psi_k(t) \;=\; \boldsymbol{\alpha}^\top \boldsymbol{\Psi}(t),\]

where \(\psi_1(t), \ldots, \psi_K(t)\) are spline basis functions (B-splines by default, \(K = 10\)). The basis matrix \(\boldsymbol{\Psi}\) and its penalty matrix \(\mathbf{S}\) are constructed by mgcv::smooth.construct() exactly as in fosr_bayes(). The penalty matrix is then rescaled,

\[\mathbf{S} \;\leftarrow\; \mathbf{S} \,/\, \frac{\|\mathbf{S}\|_2}{\|\boldsymbol{\Psi}\|_\infty^2},\]

so that the smoothing parameter \(\sigma_\mu^2\) lives on a numerically sensible scale. This is the same scaling step used throughout the package.
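The basis-and-penalty construction and the rescaling step can be reproduced with mgcv. The sketch below is illustrative only and is not the exact package internals (which may differ in details such as the choice of norm); here the sup norm is taken as the maximum absolute entry of the basis matrix.

```r
## Illustrative sketch (not the exact package internals): construct a
## B-spline basis and its penalty with mgcv, then rescale the penalty.
library(mgcv)
M  <- 50
sm <- smoothCon(s(t, bs = "bs", k = 10), data = data.frame(t = 1:M))[[1]]
Psi <- t(sm$X)        # K x M basis matrix (rows = basis functions)
S   <- sm$S[[1]]      # K x K penalty matrix
## Rescaling step: divide by ||S||_2 / ||Psi||_inf^2
S   <- S / (norm(S, type = "2") / max(abs(Psi))^2)
```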

Smoothness of \(\mu(t)\) is induced through the multivariate-normal prior implied by the quadratic penalty:

\[p(\boldsymbol{\alpha} \mid \sigma_\mu^2) \;\propto\; \exp\!\left\{-\frac{\boldsymbol{\alpha}^\top \mathbf{S}\, \boldsymbol{\alpha}}{2\sigma_\mu^2}\right\}.\]

Matrix Form Sent to Stan

Let \(\mathbf{Y} \in \mathbb{R}^{n \times M}\) be the response matrix, \(\boldsymbol{\Psi} \in \mathbb{R}^{K \times M}\) the spline basis (rows = basis functions, columns = grid points), \(\boldsymbol{\Phi} \in \mathbb{R}^{J \times M}\) the fixed FPC eigenfunctions, \(\boldsymbol{\xi} \in \mathbb{R}^{n \times J}\) the score matrix, and \(\mathbf{1}_n \in \mathbb{R}^n\) a column of ones. The Stan model evaluates

\[\mathbf{Y} \;=\; \mathbf{1}_n (\boldsymbol{\alpha}^\top \boldsymbol{\Psi}) \;+\; \boldsymbol{\xi}\, \boldsymbol{\Phi} \;+\; \boldsymbol{\mathcal{E}}, \qquad \boldsymbol{\mathcal{E}}_{im} \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma_\epsilon^2),\]

implemented as

row_vector[M_num] mu_t = alpha' * Psi_mat;
matrix[N_num, M_num] mu = rep_matrix(mu_t, N_num) + xi * Phi_mat;
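For readers more comfortable in R, the same two Stan lines can be mirrored in base R. The dimensions and inputs below are made up purely for illustration:

```r
## Base-R mirror of the two Stan lines above, with hypothetical dimensions.
N <- 5; M <- 8; K <- 4; J <- 2
alpha   <- rnorm(K)                      # spline coefficients
Psi_mat <- matrix(rnorm(K * M), K, M)    # K x M spline basis
xi      <- matrix(rnorm(N * J), N, J)    # N x J score matrix
Phi_mat <- matrix(rnorm(J * M), J, M)    # J x M fixed eigenfunctions

mu_t <- drop(crossprod(alpha, Psi_mat))           # alpha' * Psi_mat
mu   <- matrix(mu_t, N, M, byrow = TRUE) +        # rep_matrix(mu_t, N)
        xi %*% Phi_mat                            # + xi * Phi_mat
```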

Full Bayesian Model

The complete Bayesian FPCA model is:

\[\begin{cases} \mathbf{Y}_i = \boldsymbol{\Psi}^\top \boldsymbol{\alpha} + \boldsymbol{\Phi}^\top \boldsymbol{\xi}_i + \boldsymbol{\epsilon}_i,\quad i = 1, \ldots, n \\[4pt] \boldsymbol{\Phi} \text{ fixed} \text{ (plug-in from } \texttt{refund::fpca.sc}\text{)} \\[4pt] p(\boldsymbol{\alpha} \mid \sigma_\mu^2) \propto \exp\!\big(- \boldsymbol{\alpha}^\top \mathbf{S}\, \boldsymbol{\alpha} \,/\, 2\sigma_\mu^2 \big) \\[4pt] \xi_{ij} \mid \lambda_j \sim \mathcal{N}(0, \lambda_j^2),\quad j = 1, \ldots, J \\[4pt] \sigma_\mu^2,\ \lambda_j^2,\ \sigma_\epsilon^2 \sim \mathrm{InvGamma}(0.001, 0.001) \end{cases}\]

All inverse-Gamma hyper-priors are non-informative, exactly matching the convention used in sofr_bayes, fosr_bayes, fcox_bayes, and fofr_bayes.

The fpca_bayes() Function

Usage

fpca_bayes(
  formula,
  data,
  npc         = NULL,
  runStan     = TRUE,
  niter       = 3000,
  nwarmup     = 1000,
  nchain      = 3,
  ncores      = 1,
  spline_type = "bs",
  spline_df   = 10
)

Arguments

Argument Description
formula A formula of the form Y_mat ~ 1. Only the LHS is used — the RHS is parsed but ignored, since FPCA has no predictors.
data A data frame containing the functional response variable. The response must be stored as an \(n \times M\) numeric matrix (one row per subject, one column per grid point), typically wrapped with I() when assigning to the data frame.
npc Number of functional principal components. If NULL (default), refund::fpca.sc() selects \(J\) automatically by percent-variance-explained. To pin a specific count, pass an integer.
runStan Logical. Whether to run the Stan program. If FALSE, the function only generates the Stan code and data without sampling — useful for inspecting the generated Stan code. Default is TRUE.
niter Total number of Bayesian posterior sampling iterations (including warmup). Default is 3000.
nwarmup Number of warmup (burn-in) iterations. Default is 1000.
nchain Number of Markov chains for posterior sampling. Default is 3.
ncores Number of CPU cores for parallel execution of the chains. Default is 1.
spline_type Type of spline basis used for the mean function. Default is "bs" (B-splines). Any mgcv basis name is accepted.
spline_df Number of degrees of freedom (basis functions) for the mean-function spline. Default is 10.

Return Value

The function returns a list of class "refundBayes" containing the following elements:

Element Description
stanfit The Stan fit object (class stanfit). Use for convergence diagnostics via the rstan package.
stancode A character string containing the generated Stan model code.
standata A list containing the data passed to Stan (including Phi_mat, Psi_mat, S_mat, Y_mat).
mu A \(Q \times M\) matrix of posterior samples of the population mean function, evaluated on the input grid, where \(Q\) is the number of retained posterior draws. Each row is one posterior draw; each column is one grid point.
efunctions An \(M \times J\) matrix of the fixed eigenfunctions returned by the initial refund::fpca.sc() call. These are inputs to the Bayesian model, not posterior samples.
scores A \(Q \times n \times J\) array of posterior samples of FPC scores \(\xi_{ij}\).
evalues A \(Q \times J\) matrix of posterior samples of the FPC eigenvalue standard deviations \(\lambda_j\).
sigma A length-\(Q\) vector of posterior samples of the residual standard deviation \(\sigma_\epsilon\).
family Always "fpca". Used for class-method dispatch.

When runStan = FALSE, every element except efunctions, stancode, standata, and family is filled with NA.
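As a quick sketch of convergence checking on the stanfit element (this assumes a fitted object fit_fpca, as created in the worked example below; lp__ is the log-posterior, which is present in every Stan fit):

```r
## Sketch: basic convergence diagnostics via rstan.
library(rstan)
summ <- rstan::summary(fit_fpca$stanfit)$summary
range(summ[, "Rhat"], na.rm = TRUE)               # values near 1 are good
rstan::traceplot(fit_fpca$stanfit, pars = "lp__") # chain mixing
```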

Formula Syntax

Because FPCA has no predictors, the formula syntax for fpca_bayes() is intentionally minimal:

Y_mat ~ 1

where Y_mat is the name of the matrix-valued response column in data. This differs sharply from sofr_bayes and fcox_bayes, where tmat, lmat, and wmat encode the Riemann-sum integral against the functional predictor: there is no integral and no functional predictor here, so none of those auxiliary matrices are required. The only requirement is that data[[Y_mat]] is a numeric matrix of dimension \(n \times M\) with the same grid across all subjects.

How the Data Are Auto-Processed

When fpca_bayes() is called, the function performs the following steps internally before invoking Stan:

  1. Parse the formula with brms::brmsformula() to extract the response variable name.
  2. Validate that data[[y_var]] exists and is a matrix.
  3. Run an initial frequentist FPCA via refund::fpca.sc(Y_mat, npc = npc) to obtain the eigenfunction basis \(\boldsymbol{\Phi}\) and the number of components \(J\).
  4. Build the mean-function spline basis \(\boldsymbol{\Psi}\) and penalty matrix \(\mathbf{S}\) using mgcv::smooth.construct() on a dummy grid 1:M. The penalty matrix is rescaled so that the smoothing variance is on a numerically sensible scale.
  5. Assemble the Stan data list containing \(N\), \(M\), \(J\), \(K\), \(\mathbf{Y}\), \(\boldsymbol{\Phi}\), \(\boldsymbol{\Psi}\), and \(\mathbf{S}\).
  6. Generate the Stan code for each block (data, transformed data, parameters, transformed parameters, model) and concatenate them.
  7. Run Stan (or skip if runStan = FALSE) and extract posterior samples.
  8. Reconstruct the mean function at the input grid by computing \(\boldsymbol{\alpha}^\top \boldsymbol{\Psi}\) for every posterior draw of \(\boldsymbol{\alpha}\), so the user receives mu already evaluated at the observed time points.

The user therefore needs to supply only Y_mat ~ 1 and a data frame with a matrix-valued response. No basis construction, no FPCA pre-fit, and no centering are required from the caller.
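Step 8 above can be sketched in a few lines of base R. The dimensions are hypothetical, and alpha_draws stands in for the posterior draws of \(\boldsymbol{\alpha}\):

```r
## Illustrative sketch of step 8: map each posterior draw of alpha
## (Q x K) through the basis (K x M) to a draw of the mean function.
Q <- 4; K <- 10; M <- 50
alpha_draws <- matrix(rnorm(Q * K), Q, K)   # hypothetical posterior draws
Psi         <- matrix(rnorm(K * M), K, M)   # hypothetical spline basis
mu_draws    <- alpha_draws %*% Psi          # Q x M: one mean curve per draw
```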

Example: Bayesian FPCA on Simulated Data

We demonstrate fpca_bayes() using a small simulated example with a known true mean function, two true eigenfunctions, and known noise variance, so that posterior recovery can be checked against the truth.

Simulate Data

set.seed(12345)
library(refundBayes)
library(refund)
library(ggplot2)

## Dimensions
n     <- 200
M     <- 50
tgrid <- seq(0, 1, length.out = M)

## True mean function and eigenfunctions
mu_true <- sin(pi * tgrid)
phi1    <- sqrt(2) * sin(2 * pi * tgrid)
phi2    <- sqrt(2) * cos(2 * pi * tgrid)
phi_true <- rbind(phi1, phi2)            # J x M
eigen_true     <- c(2, 0.5)
sigma_eps_true <- 0.3

## Simulate scores and observations
xi_true <- cbind(rnorm(n, sd = sqrt(eigen_true[1])),
                 rnorm(n, sd = sqrt(eigen_true[2])))
Y_mat <- matrix(rep(mu_true, n), nrow = n, byrow = TRUE) +
         xi_true %*% phi_true +
         matrix(rnorm(n * M, sd = sigma_eps_true), nrow = n)

dat <- data.frame(inx = 1:n)
dat$Y_mat <- Y_mat

The simulated data set dat contains:

  • inx: a subject index \(1, \ldots, n\) (an ordinary placeholder column);
  • Y_mat: the \(n \times M\) matrix of simulated curves, one row per subject and one column per grid point.

Fit the Bayesian FPCA Model

fit_fpca <- refundBayes::fpca_bayes(
  formula     = Y_mat ~ 1,
  data        = dat,
  spline_type = "bs",
  spline_df   = 10,
  niter       = 1500,
  nwarmup     = 500,
  nchain      = 1,
  ncores      = 1
)

In this call:

  • the formula Y_mat ~ 1 names the matrix-valued response and nothing else;
  • npc is left at its default NULL, so refund::fpca.sc() chooses the number of components by percent-variance-explained;
  • the mean function uses a B-spline basis ("bs") with 10 degrees of freedom;
  • a single chain is run for 1500 iterations (500 of them warmup) on one core, which is enough for this small demonstration.

Visualization

The plot() method for refundBayes objects has a dedicated family = "fpca" branch that returns a named list of four ggplot objects:

library(ggplot2)
p <- plot(fit_fpca)        # default prob = 0.95
p$mu                       # posterior mean function with 95% credible band
p$efunctions               # fixed eigenfunctions used as the FPC basis
p$evalues                  # posterior eigenvalue SD with 95% intervals
p$sigma                    # posterior of sigma_eps (histogram)

The prob argument controls the coverage of the credible bands, e.g. plot(fit_fpca, prob = 0.80) produces 80% bands.

Extracting Posterior Summaries

Posterior summaries can be computed directly from the returned arrays:

## Posterior mean of the mean function on the input grid
mu_hat <- apply(fit_fpca$mu, 2, mean)

## Pointwise 95% credible interval for the mean function
mu_lo  <- apply(fit_fpca$mu, 2, function(x) quantile(x, 0.025))
mu_hi  <- apply(fit_fpca$mu, 2, function(x) quantile(x, 0.975))

## Posterior mean of the FPC eigenvalue SDs
lambda_hat <- apply(fit_fpca$evalues, 2, mean)

## Posterior mean of the FPC scores (n x J matrix of point estimates)
scores_hat <- apply(fit_fpca$scores, c(2, 3), mean)

## Posterior mean of the residual SD
sigma_eps_hat <- mean(fit_fpca$sigma)

Reconstructing Fitted Curves \(\hat{Y}_i(t)\)

Because the eigenfunctions are fixed, the Karhunen–Loève reconstruction is computed simply by combining the posterior mean function with the posterior score estimates:

Phi <- fit_fpca$efunctions       # M x J  (fixed)
mu_hat     <- apply(fit_fpca$mu,     2,      mean)   # length M
scores_hat <- apply(fit_fpca$scores, c(2,3), mean)   # n x J

Y_hat <- matrix(rep(mu_hat, n), nrow = n, byrow = TRUE) +
         scores_hat %*% t(Phi)

Y_hat has dimensions \(n \times M\) and is comparable to the observed Y_mat minus measurement error.
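As a quick sanity check using the objects defined above, the RMSE of the residuals Y_mat - Y_hat should be close to the posterior mean of \(\sigma_\epsilon\), since the reconstruction excludes measurement error:

```r
## Residual RMSE vs. estimated noise SD (objects from the code above).
resid_rmse <- sqrt(mean((dat$Y_mat - Y_hat)^2))
c(resid_rmse = resid_rmse, sigma_eps_hat = sigma_eps_hat)
```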

Comparison with Frequentist FPCA

The Bayesian results can be compared with the frequentist FPCA fit obtained from refund::fpca.sc() (or refund::fpca.face(), which is faster on dense grids):

library(refund)

fit_freq <- refund::fpca.sc(dat$Y_mat)
## or: fit_freq <- refund::fpca.face(dat$Y_mat)

## Frequentist mean function vs Bayesian posterior mean
plot(tgrid, fit_freq$mu, type = "l", lwd = 2, col = "darkorange",
     ylab = expression(hat(mu)(t)), xlab = "t")
lines(tgrid, mu_hat, col = "blue", lwd = 2)
legend("topright", legend = c("Frequentist", "Bayesian"),
       col = c("darkorange", "blue"), lwd = 2, bty = "n")

Because fpca_bayes() uses the same eigenfunctions as refund::fpca.sc() by construction, the only differences between the two fits come from how each method estimates the mean function and the FPC scores. The Bayesian posterior provides full uncertainty quantification on \(\mu(t)\), \(\lambda_j\), \(\xi_{ij}\), and \(\sigma_\epsilon\) that the frequentist point estimator does not. The companion files Simulation/FPCA_Simulation_V1.R and Simulation/FPCA_QuickCheck.Rmd provide a more thorough numerical comparison; a summary of that comparison is reported below.

Simulation Study: Bayesian vs Frequentist FPCA

To benchmark fpca_bayes() against the standard frequentist FPCA fit, we ran a small simulation study in which both methods are given the same eigenfunction basis so that they can be compared on an apples-to-apples footing. The full simulation script is shipped as Simulation/FPCA_Simulation_V1.R.

Simulation Setup

Functional observations were generated from the FPCA model

\[Y_i(t_m) \;=\; \mu(t_m) \;+\; \sum_{j=1}^{4} \xi_{ij}\,\phi_j(t_m) \;+\; \epsilon_i(t_m), \qquad \epsilon_i(t_m) \stackrel{\text{iid}}{\sim} N(0, 0.5^2),\]

on a uniform grid of \(M = 50\) points on \([0, 1]\), with \(J = 4\) true Fourier eigenfunctions and true eigenvalues \((2.5, 1.5, 0.8, 0.4)\). The simulation grid varies three factors:

Factor Levels
Sample size \(n\) \(100,\; 200,\; 500\)
Mean-function shape type 1: \(\mu(t) = \tau\,\sin(\pi t)\); type 2: \(\mu(t) = \tau\,\{-0.5 + (t - 0.5)^2\}\)
Mean-signal strength \(\tau\) \(1,\; 5,\; 10\)

Each cell of the \(3 \times 2 \times 3 = 18\) design was replicated 500 times, giving 9000 fits per method.

Comparator Methods

  • Frequentist baseline: refund::fpca.face(Y_mat), with mean function and FPC scores estimated by penalized least squares.
  • Bayesian model: refundBayes::fpca_bayes() (or the equivalent standalone Stan program in Simulation/StanFPCA_Gaussian.stan), 1 chain, 2000 iterations (1000 warmup).

For an apples-to-apples comparison, the same eigenfunction matrix returned by refund::fpca.face() is passed to the Bayesian model as the fixed Phi_mat, so the two methods differ only in how they estimate \(\mu(t)\) and the FPC scores \(\xi_{ij}\).

Performance Metrics

Two relative error measures are reported:

  1. Recovery of the population mean function – the relative integrated squared error \[\text{RISE}\bigl(\hat\mu\bigr) \;=\; \frac{\sum_{m=1}^{M}\bigl\{\hat\mu(t_m) - \mu(t_m)\bigr\}^2}{\sum_{m=1}^{M}\mu(t_m)^2}.\]

  2. Recovery of the noise-free signal – the relative MSE of the reconstructed curves \[\text{relMSE}\bigl(\hat Y\bigr) \;=\; \frac{\frac{1}{nM}\sum_{i,m}\bigl\{\hat Y_i(t_m) - \widetilde Y_i(t_m)\bigr\}^2}{\frac{1}{nM}\sum_{i,m}\widetilde Y_i(t_m)^2},\] where \(\widetilde Y_i(t) = \mu(t) + \sum_j \xi_{ij}\phi_j(t)\) is the noise-free signal and \(\hat Y_i(t)\) is the Karhunen–Loève reconstruction implied by each method.

For the Bayesian fit, both \(\hat\mu\) and the FPC scores are taken to be the posterior median across the MCMC draws.
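The two metrics above are straightforward to express as plain R functions (illustrative helpers, not part of the package API):

```r
## Relative integrated squared error of the estimated mean function.
rise <- function(mu_hat, mu_true) {
  sum((mu_hat - mu_true)^2) / sum(mu_true^2)
}

## Relative MSE of the reconstructed curves against the noise-free signal.
rel_mse <- function(Y_hat, Y_signal) {
  mean((Y_hat - Y_signal)^2) / mean(Y_signal^2)
}

rise(c(1, 1), c(1, 2))   # 0.2
```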

Results

The figures below show boxplots of each metric across the 500 replicates per cell, on a \(\log_{10}\) scale, with rows indexed by mean-function shape and columns indexed by signal strength \(\tau\).

Recovery of \(\mu(t)\) (RISE). Bayesian and frequentist FPCA recover the mean function with essentially indistinguishable accuracy; both error distributions shrink as \(n\) grows at the expected rate, and the gap between methods is well within the Monte Carlo noise.

[Figure: boxplots of RISE for recovery of \(\mu(t)\), by mean-function shape and signal strength \(\tau\).]

Reconstruction of the noise-free signal (relMSE). The same conclusion holds for the curve reconstruction: the two methods coincide up to Monte Carlo error, with the relative MSE driven by the signal-to-noise ratio (governed by \(\tau\)) and shrinking with \(n\). Critically, the Bayesian fit attains this accuracy while also providing posterior credible intervals on \(\mu(t)\), \(\lambda_j\), \(\xi_{ij}\), and \(\sigma_\epsilon\), which the frequentist plug-in does not.

[Figure: boxplots of signal-reconstruction relMSE, by mean-function shape and signal strength \(\tau\).]

In summary, the simulation confirms that switching from refund::fpca.face() to fpca_bayes() does not sacrifice point-estimation accuracy for the mean function or the curve reconstruction — it simply augments the analysis with a coherent posterior distribution over every parameter.

Inspecting the Generated Stan Code

Setting runStan = FALSE allows you to inspect or modify the Stan code before running the model:

fpca_code_only <- refundBayes::fpca_bayes(
  formula = Y_mat ~ 1,
  data    = dat,
  runStan = FALSE
)
cat(fpca_code_only$stancode)

A standalone copy of the same model is also shipped in Simulation/StanFPCA_Gaussian.stan, suitable for stan(file = ..., data = ...) calls.

References

Jiang, Crainiceanu, and Cui (2025). Tutorial on Bayesian Functional Regression Using Stan. Statistics in Medicine.