This vignette provides a detailed guide to the
fpca_bayes() function in the refundBayes
package, which fits a Bayesian Functional Principal Component Analysis
(FPCA) model using Stan. Unlike the regression-style functions in the
package (sofr_bayes, fosr_bayes,
fcox_bayes, fofr_bayes),
fpca_bayes() does not relate a functional response to
predictors — it instead decomposes a sample of curves into a smooth
population mean function plus a small number of subject-specific
functional principal components, with full Bayesian uncertainty
quantification on every quantity except the eigenfunctions
themselves.
The methodology builds directly on the tutorial by Jiang,
Crainiceanu, and Cui (2025), Tutorial on Bayesian Functional
Regression Using Stan, published in Statistics in
Medicine, and reuses the same penalized-spline + Stan machinery
that powers fosr_bayes() for modelling the mean
function.
refundBayes Package

The refundBayes package can be installed from CRAN. For the latest
version of the refundBayes package, users can install from GitHub.
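A typical installation sketch (the GitHub repository path is a placeholder, not stated in this vignette — substitute the actual repository):

```r
## Install the released version from CRAN
install.packages("refundBayes")

## Install the development version from GitHub (repository path is illustrative)
# remotes::install_github("<owner>/refundBayes")
```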
For subject \(i = 1, \ldots, n\), let \(Y_i(t)\) be a functional observation recorded at \(M\) common grid points \(t_1, \ldots, t_M\) on a domain \(\mathcal{T}\). The classical FPCA decomposition (Karhunen–Loève expansion plus measurement error) writes
\[Y_i(t_m) \;=\; \mu(t_m) \;+\; \sum_{j=1}^{J} \xi_{ij}\, \phi_j(t_m) \;+\; \epsilon_i(t_m), \qquad \epsilon_i(t_m) \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma_\epsilon^2),\]
where:

- \(\mu(t)\) is the smooth population mean function,
- \(\phi_1(t), \ldots, \phi_J(t)\) are the functional principal components (eigenfunctions),
- \(\xi_{ij}\) are the subject-specific FPC scores,
- \(\epsilon_i(t_m)\) is iid Gaussian measurement error with variance \(\sigma_\epsilon^2\).
The functional domain \(\mathcal{T}\) is not restricted to \([0, 1]\); it is whatever interval the columns of the response matrix represent. The model assumes the same dense grid is observed for every subject.
The defining feature of fpca_bayes() is that the
eigenfunctions \(\phi_j(t)\)
are not sampled. They are estimated once, before Stan ever
runs, by calling refund::fpca.sc()
inside the function. The resulting \(M
\times J\) matrix init.fpca$efunctions is transposed
to \(J \times M\) and shipped to Stan
as a data matrix called Phi_mat. The
number of components \(J\) is chosen
automatically by fpca.sc() based on
percent-variance-explained when npc = NULL (the default),
or pinned by the user via the npc argument.
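The plug-in step can be mimicked conceptually with a base-R eigendecomposition of the sample covariance — a rough stand-in for refund::fpca.sc(), which additionally smooths the covariance; all names and thresholds below are illustrative:

```r
set.seed(1)
n <- 100; M <- 25
tg <- seq(0, 1, length.out = M)

## Toy curves: one dominant component plus noise
Y <- outer(rnorm(n), sqrt(2) * sin(2 * pi * tg)) +
  matrix(rnorm(n * M, sd = 0.2), n, M)

Yc  <- scale(Y, center = TRUE, scale = FALSE)  # remove the sample mean
eg  <- eigen(crossprod(Yc) / n)                # M x M covariance eigendecomposition
pve <- cumsum(eg$values) / sum(eg$values)
J   <- which(pve >= 0.95)[1]                   # PVE-style choice of J

Phi_mat <- t(eg$vectors[, 1:J, drop = FALSE])  # J x M, analogous to Phi_mat in the Stan data
```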
There are three reasons for this design choice:
fosr_bayes() and fofr_bayes() use the same
fixed-eigenfunction trick to model the residual functional process.
fpca_bayes() simply elevates that idea from “nuisance
covariance” to “model of primary interest”.

The trade-off is that uncertainty in the eigenfunction estimation itself is not propagated into the posterior of the scores or the mean. In practice, when \(n\) is moderately large and the eigenfunctions are well-separated (eigenvalues of decreasing magnitude with a clear gap), this is a small price for a fast and stable sampler.
The mean function \(\mu(t)\) is
represented using a penalized spline expansion identical to the one used
for the response-domain coefficient functions in
fosr_bayes():
\[\mu(t) \;=\; \sum_{k=1}^{K} \alpha_k\, \psi_k(t) \;=\; \boldsymbol{\alpha}^\top \boldsymbol{\Psi}(t),\]
where \(\psi_1(t), \ldots,
\psi_K(t)\) are spline basis functions (B-splines by default,
\(K = 10\)). The basis matrix \(\boldsymbol{\Psi}\) and its penalty matrix
\(\mathbf{S}\) are constructed by
mgcv::smooth.construct() exactly as in
fosr_bayes(). The penalty matrix is then rescaled,
\[\mathbf{S} \;\leftarrow\; \mathbf{S} \,/\, \frac{\|\mathbf{S}\|_2}{\|\boldsymbol{\Psi}\|_\infty^2},\]
so that the smoothing parameter \(\sigma_\mu^2\) lives on a numerically sensible scale. This is the same scaling step used throughout the package.
Smoothness of \(\mu(t)\) is induced through the multivariate-normal prior implied by the quadratic penalty:
\[p(\boldsymbol{\alpha} \mid \sigma_\mu^2) \;\propto\; \exp\!\left\{-\frac{\boldsymbol{\alpha}^\top \mathbf{S}\, \boldsymbol{\alpha}}{2\sigma_\mu^2}\right\}.\]
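The basis-and-penalty construction can be sketched directly with mgcv (following the recipe described above, not the package internals verbatim; the dummy grid and rescaling mirror the description):

```r
library(mgcv)

M <- 50; K <- 10
tg <- 1:M                                   # dummy grid, as described above
sm <- smooth.construct(s(tg, bs = "bs", k = K),
                       data = data.frame(tg = tg), knots = NULL)

Psi <- t(sm$X)                              # K x M basis matrix (rows = basis functions)
S   <- sm$S[[1]]                            # K x K penalty matrix
## Rescale the penalty so the smoothing variance is on a sensible scale
S   <- S / (norm(S, "2") / max(abs(Psi))^2)
```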
Let \(\mathbf{Y} \in \mathbb{R}^{n \times M}\) be the response matrix, \(\boldsymbol{\Psi} \in \mathbb{R}^{K \times M}\) the spline basis (rows = basis functions, columns = grid points), \(\boldsymbol{\Phi} \in \mathbb{R}^{J \times M}\) the fixed FPC eigenfunctions, \(\boldsymbol{\xi} \in \mathbb{R}^{n \times J}\) the score matrix, and \(\mathbf{1}_n \in \mathbb{R}^n\) a column of ones. The Stan model evaluates
\[\mathbf{Y} \;=\; \mathbf{1}_n (\boldsymbol{\alpha}^\top \boldsymbol{\Psi}) \;+\; \boldsymbol{\xi}\, \boldsymbol{\Phi} \;+\; \boldsymbol{\mathcal{E}}, \qquad \boldsymbol{\mathcal{E}}_{im} \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma_\epsilon^2),\]
implemented in the generated Stan code as a single matrix expression.
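As a sanity check, the matrix form can be reproduced in base R with toy dimensions (all objects here are illustrative placeholders, not package outputs):

```r
set.seed(2)
n <- 8; M <- 12; K <- 5; J <- 2
Psi   <- matrix(rnorm(K * M), K, M)  # K x M spline basis
alpha <- rnorm(K)                    # spline coefficients for the mean
Phi   <- matrix(rnorm(J * M), J, M)  # J x M fixed eigenfunctions
xi    <- matrix(rnorm(n * J), n, J)  # n x J FPC scores

mu_row <- crossprod(alpha, Psi)                       # 1 x M mean function
Eta    <- matrix(1, n, 1) %*% mu_row + xi %*% Phi     # n x M noise-free signal
Y      <- Eta + matrix(rnorm(n * M, sd = 0.3), n, M)  # add iid Gaussian noise
```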
The complete Bayesian FPCA model is:
\[\begin{cases} \mathbf{Y}_i = \boldsymbol{\Psi}^\top \boldsymbol{\alpha} + \boldsymbol{\Phi}^\top \boldsymbol{\xi}_i + \boldsymbol{\epsilon}_i,\quad i = 1, \ldots, n \\[4pt] \boldsymbol{\Phi} \text{ fixed} \text{ (plug-in from } \texttt{refund::fpca.sc}\text{)} \\[4pt] p(\boldsymbol{\alpha} \mid \sigma_\mu^2) \propto \exp\!\big(- \boldsymbol{\alpha}^\top \mathbf{S}\, \boldsymbol{\alpha} \,/\, 2\sigma_\mu^2 \big) \\[4pt] \xi_{ij} \mid \lambda_j \sim \mathcal{N}(0, \lambda_j^2),\quad j = 1, \ldots, J \\[4pt] \sigma_\mu^2,\ \lambda_j^2,\ \sigma_\epsilon^2 \sim \mathrm{InvGamma}(0.001, 0.001) \end{cases}\]
All inverse-Gamma hyper-priors are non-informative, exactly matching
the convention used in sofr_bayes, fosr_bayes,
fcox_bayes, and fofr_bayes.
fpca_bayes() Function

| Argument | Description |
|---|---|
formula |
A formula of the form Y_mat ~ 1. Only the
LHS is used — the RHS is parsed but ignored, since FPCA has no
predictors. |
data |
A data frame containing the functional response
variable. The response must be stored as an \(n \times M\) numeric matrix (one row per
subject, one column per grid point), typically wrapped with
I() when assigning to the data frame. |
npc |
Number of functional principal components. If
NULL (default), refund::fpca.sc() selects
\(J\) automatically by
percent-variance-explained. To pin a specific count, pass an
integer. |
runStan |
Logical. Whether to run the Stan program. If
FALSE, the function only generates the Stan code and data
without sampling — useful for inspecting the generated Stan code.
Default is TRUE. |
niter |
Total number of Bayesian posterior sampling iterations
(including warmup). Default is 3000. |
nwarmup |
Number of warmup (burn-in) iterations. Default is
1000. |
nchain |
Number of Markov chains for posterior sampling. Default
is 3. |
ncores |
Number of CPU cores for parallel execution of the
chains. Default is 1. |
spline_type |
Type of spline basis used for the mean function.
Default is "bs" (B-splines). Any mgcv basis
name is accepted. |
spline_df |
Number of degrees of freedom (basis functions) for the
mean-function spline. Default is 10. |
The function returns a list of class "refundBayes"
containing the following elements:
| Element | Description |
|---|---|
stanfit |
The Stan fit object (class stanfit). Use it
for convergence diagnostics via the rstan package. |
stancode |
A character string containing the generated Stan model code. |
standata |
A list containing the data passed to Stan (including
Phi_mat, Psi_mat, S_mat,
Y_mat). |
mu |
A \(Q \times M\) matrix of posterior samples of the population mean function, evaluated on the input grid, where \(Q\) is the number of retained posterior draws. Each row is one posterior draw; each column is one grid point.
efunctions |
An \(M \times J\)
matrix of the fixed eigenfunctions returned by the
initial refund::fpca.sc() call. These are inputs to the
Bayesian model, not posterior samples. |
scores |
A \(Q \times n \times J\) array of posterior samples of FPC scores \(\xi_{ij}\). |
evalues |
A \(Q \times J\) matrix of posterior samples of the FPC eigenvalue standard deviations \(\lambda_j\). |
sigma |
A length-\(Q\) vector of posterior samples of the residual standard deviation \(\sigma_\epsilon\). |
family |
Always "fpca". Used for class-method
dispatch. |
When runStan = FALSE, every element except
efunctions, stancode, standata,
and family is filled with NA.
Because FPCA has no predictors, the formula syntax for
fpca_bayes() is intentionally minimal:

Y_mat ~ 1

where Y_mat is the name of the matrix-valued
response column in data. This differs sharply from
sofr_bayes and fcox_bayes, where
tmat, lmat, and wmat encode the
Riemann-sum integral against the functional predictor: there is no
integral and no functional predictor here, so none of those auxiliary
matrices are required. The only requirement is that
data[[Y_mat]] is a numeric matrix of dimension \(n \times M\) with the same grid across all
subjects.
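Constructing a conforming data frame is straightforward (a minimal sketch; the object names are illustrative):

```r
set.seed(3)
n <- 5; M <- 4
Y <- matrix(rnorm(n * M), n, M)      # n x M functional observations
dat <- data.frame(inx = seq_len(n))  # any n-row data frame works
dat$Y_mat <- I(Y)                    # matrix-valued column, one row per subject
```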
When fpca_bayes() is called, the function performs the
following steps internally before invoking Stan:
1. Parse the formula via brms::brmsformula() to extract the response variable name.
2. Check that data[[y_var]] exists and is a matrix.
3. Call refund::fpca.sc(Y_mat, npc = npc) to obtain the eigenfunction basis \(\boldsymbol{\Phi}\) and the number of components \(J\).
4. Construct the spline basis for the mean function via mgcv::smooth.construct() on a dummy grid 1:M. The penalty matrix is rescaled so that the smoothing variance is on a numerically sensible scale.
5. Generate the Stan code blocks (data, transformed data, parameters, transformed parameters, model) and concatenate them.
6. Run the Stan sampler (unless runStan = FALSE) and extract posterior samples.
7. Return the output list, with mu already evaluated at the observed time points.

The user therefore needs to supply only Y_mat ~ 1 and a
data frame with a matrix-valued response. No basis construction, no FPCA
pre-fit, and no centering are required from the caller.
We demonstrate fpca_bayes() using a small simulated
example with a known true mean function, two true eigenfunctions, and
known noise variance, so that posterior recovery can be checked against
the truth.
set.seed(12345)
library(refundBayes)
library(refund)
library(ggplot2)
## Dimensions
n <- 200
M <- 50
tgrid <- seq(0, 1, length.out = M)
## True mean function and eigenfunctions
mu_true <- sin(pi * tgrid)
phi1 <- sqrt(2) * sin(2 * pi * tgrid)
phi2 <- sqrt(2) * cos(2 * pi * tgrid)
phi_true <- rbind(phi1, phi2) # J x M
eigen_true <- c(2, 0.5)
sigma_eps_true <- 0.3
## Simulate scores and observations
xi_true <- cbind(rnorm(n, sd = sqrt(eigen_true[1])),
rnorm(n, sd = sqrt(eigen_true[2])))
Y_mat <- matrix(rep(mu_true, n), nrow = n, byrow = TRUE) +
xi_true %*% phi_true +
matrix(rnorm(n * M, sd = sigma_eps_true), nrow = n)
dat <- data.frame(inx = 1:n)
dat$Y_mat <- Y_mat

The simulated data set dat contains:

- Y_mat: an \(n \times M\) matrix-valued column of functional observations,
- inx: a dummy index column (only present so the data frame has the right number of rows; it is not used by the model).

fit_fpca <- refundBayes::fpca_bayes(
formula = Y_mat ~ 1,
data = dat,
spline_type = "bs",
spline_df = 10,
niter = 1500,
nwarmup = 500,
nchain = 1,
ncores = 1
)

In this call:

- Y_mat ~ 1 tells the function to treat the column Y_mat as the functional response and to model only its mean and FPC structure.
- npc is left at its default NULL, so the number of components is chosen automatically by refund::fpca.sc() (typically two for this simulation, matching the truth).
- niter = 1500, nwarmup = 500 keeps the example quick; for real analyses the default niter = 3000, nwarmup = 1000 is recommended.

The plot() method for refundBayes objects
has a dedicated family = "fpca" branch that returns a named
list of four ggplot objects:
library(ggplot2)
p <- plot(fit_fpca) # default prob = 0.95
p$mu # posterior mean function with 95% credible band
p$efunctions # fixed eigenfunctions used as the FPC basis
p$evalues # posterior eigenvalue SD with 95% intervals
p$sigma       # posterior of sigma_eps (histogram)

The prob argument controls the coverage of the credible
bands, e.g. plot(fit_fpca, prob = 0.80) produces 80%
bands.
Posterior summaries can be computed directly from the returned arrays:
## Posterior mean of the mean function on the input grid
mu_hat <- apply(fit_fpca$mu, 2, mean)
## Pointwise 95% credible interval for the mean function
mu_lo <- apply(fit_fpca$mu, 2, function(x) quantile(x, 0.025))
mu_hi <- apply(fit_fpca$mu, 2, function(x) quantile(x, 0.975))
## Posterior mean of the FPC eigenvalue SDs
lambda_hat <- apply(fit_fpca$evalues, 2, mean)
## Posterior mean of the FPC scores (n x J matrix of point estimates)
scores_hat <- apply(fit_fpca$scores, c(2, 3), mean)
## Posterior mean of the residual SD
sigma_eps_hat <- mean(fit_fpca$sigma)

Because the eigenfunctions are fixed, the Karhunen–Loève reconstruction is computed simply by combining the posterior mean function with the posterior score estimates:
Phi <- fit_fpca$efunctions # M x J (fixed)
mu_hat <- apply(fit_fpca$mu, 2, mean) # length M
scores_hat <- apply(fit_fpca$scores, c(2,3), mean) # n x J
Y_hat <- matrix(rep(mu_hat, n), nrow = n, byrow = TRUE) +
  scores_hat %*% t(Phi)

Y_hat has dimensions \(n \times M\) and is comparable to the observed Y_mat minus measurement error.
The Bayesian results can be compared with the frequentist FPCA fit
obtained from refund::fpca.sc() (or
refund::fpca.face(), which is faster on dense grids):
library(refund)
fit_freq <- refund::fpca.sc(dat$Y_mat)
## or: fit_freq <- refund::fpca.face(dat$Y_mat)
## Frequentist mean function vs Bayesian posterior mean
plot(tgrid, fit_freq$mu, type = "l", lwd = 2, col = "darkorange",
ylab = expression(hat(mu)(t)), xlab = "t")
lines(tgrid, mu_hat, col = "blue", lwd = 2)
legend("topright", legend = c("Frequentist", "Bayesian"),
       col = c("darkorange", "blue"), lwd = 2, bty = "n")

Because fpca_bayes() uses the same
eigenfunctions as refund::fpca.sc() by construction, the
only differences between the two fits come from how each method
estimates the mean function and the FPC scores. The Bayesian posterior
provides full uncertainty quantification on \(\mu(t)\), \(\lambda_j\), \(\xi_{ij}\), and \(\sigma_\epsilon\) that the frequentist
point estimator does not. The companion files
Simulation/FPCA_Simulation_V1.R and
Simulation/FPCA_QuickCheck.Rmd provide a more thorough
numerical comparison; a summary of that comparison is reported
below.
To benchmark fpca_bayes() against the standard
frequentist FPCA fit, we ran a small simulation study in which both
methods are given the same eigenfunction basis so that
they can be compared on an apples-to-apples footing. The full simulation
script is shipped as Simulation/FPCA_Simulation_V1.R.
Functional observations were generated from the FPCA model
\[Y_i(t_m) \;=\; \mu(t_m) \;+\; \sum_{j=1}^{4} \xi_{ij}\,\phi_j(t_m) \;+\; \epsilon_i(t_m), \qquad \epsilon_i(t_m) \stackrel{\text{iid}}{\sim} N(0, 0.5^2),\]
on a uniform grid of \(M = 50\) points on \([0, 1]\), with \(J = 4\) true Fourier eigenfunctions and true eigenvalues \((2.5, 1.5, 0.8, 0.4)\). The simulation grid varies three factors:
| Factor | Levels |
|---|---|
| Sample size \(n\) | \(100,\; 200,\; 500\) |
| Mean-function shape | type 1: \(\mu(t) = \tau\,\sin(\pi t)\); type 2: \(\mu(t) = \tau\,\{-0.5 + (t - 0.5)^2\}\) |
| Mean-signal strength \(\tau\) | \(1,\; 5,\; 10\) |
Each cell of the \(3 \times 2 \times 3 = 18\) design was replicated 500 times, giving 9000 fits per method.
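The design grid can be enumerated directly (a sketch; the column names are illustrative):

```r
design <- expand.grid(
  n       = c(100, 200, 500),  # sample size
  mu_type = c(1, 2),           # mean-function shape
  tau     = c(1, 5, 10)        # mean-signal strength
)
nrow(design) * 500  # 18 cells x 500 replicates = 9000 fits per method
```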
Two estimators were compared:

- Frequentist: refund::fpca.face(Y_mat), with mean function and FPC scores estimated by penalized least squares.
- Bayesian: refundBayes::fpca_bayes() (or the equivalent standalone Stan program in Simulation/StanFPCA_Gaussian.stan), 1 chain, 2000 iterations (1000 warm-up).

For an apples-to-apples comparison, the same eigenfunction matrix
returned by refund::fpca.face() is passed to the Bayesian
model as the fixed Phi_mat, so the two methods differ
only in how they estimate \(\mu(t)\) and the FPC scores \(\xi_{ij}\).
Two relative error measures are reported:
Recovery of the population mean function – the relative integrated squared error \[\text{RISE}\bigl(\hat\mu\bigr) \;=\; \frac{\sum_{m=1}^{M}\bigl\{\hat\mu(t_m) - \mu(t_m)\bigr\}^2}{\sum_{m=1}^{M}\mu(t_m)^2}.\]
Recovery of the noise-free signal – the relative MSE of the reconstructed curves \[\text{relMSE}\bigl(\hat Y\bigr) \;=\; \frac{\frac{1}{nM}\sum_{i,m}\bigl\{\hat Y_i(t_m) - \widetilde Y_i(t_m)\bigr\}^2}{\frac{1}{nM}\sum_{i,m}\widetilde Y_i(t_m)^2},\] where \(\widetilde Y_i(t) = \mu(t) + \sum_j \xi_{ij}\phi_j(t)\) is the noise-free signal and \(\hat Y_i(t)\) is the Karhunen–Loève reconstruction implied by each method.
For the Bayesian fit, both \(\hat\mu\) and the FPC scores are taken to be the posterior median across the MCMC draws.
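These two metrics translate directly into short helper functions (a sketch following the formulas above):

```r
## Relative integrated squared error of an estimated mean function
rise <- function(mu_hat, mu_true) {
  sum((mu_hat - mu_true)^2) / sum(mu_true^2)
}

## Relative MSE of reconstructed curves against the noise-free signal
rel_mse <- function(Y_hat, Y_signal) {
  mean((Y_hat - Y_signal)^2) / mean(Y_signal^2)
}
```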
The figures below show boxplots of each metric across the 500 replicates per cell, on a \(\log_{10}\) scale, with rows indexed by mean-function shape and columns indexed by signal strength \(\tau\).
Recovery of \(\mu(t)\) (RISE). Bayesian and frequentist FPCA recover the mean function with essentially indistinguishable accuracy; both error distributions decrease at the expected \(\sqrt{n}\) rate, and the gap between methods is well within the Monte Carlo noise.
Reconstruction of the noise-free signal (relMSE). The same conclusion holds for the curve reconstruction: the two methods coincide up to Monte Carlo error, with the relative MSE driven by the signal-to-noise ratio (governed by \(\tau\)) and shrinking with \(n\). Critically, the Bayesian fit attains this accuracy while also providing posterior credible intervals on \(\mu(t)\), \(\lambda_j\), \(\xi_{ij}\), and \(\sigma_\epsilon\), which the frequentist plug-in does not.
In summary, the simulation confirms that switching from
refund::fpca.face() to fpca_bayes() does
not sacrifice point-estimation accuracy for the mean
function or the curve reconstruction — it simply augments the analysis
with a coherent posterior distribution over every parameter.
Setting runStan = FALSE allows you to inspect or modify
the Stan code before running the model:
fpca_code_only <- refundBayes::fpca_bayes(
formula = Y_mat ~ 1,
data = dat,
runStan = FALSE
)
cat(fpca_code_only$stancode)

A standalone copy of the same model is also shipped in
Simulation/StanFPCA_Gaussian.stan, suitable for
stan(file = ..., data = ...) calls.