cvLM 2.0.0
Major Changes & Breaking Updates
- New Default Behavior: Data centering (center = TRUE) is now the
default. This ensures that the intercept is not penalized in ridge
regression, aligning the package with standard statistical
methodologies.
- API Cleanup: Removed the verbose argument from grid.search. The new
C++ backend evaluates the lambda grid analytically, rendering progress
bars unnecessary.
- Refined Object Inheritance: For lm and glm methods, subset and
na.action are now handled by the model object prior to
cross-validation, ensuring consistency with the original model fit.
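
To illustrate the centering default described in the list above, here is a minimal sketch in plain Python for a single predictor. It is not the package's code: the function names and the toy data are invented for the example, and it only shows the general fact that centering keeps the intercept out of the ridge penalty.

```python
def ridge_centered(x, y, lam):
    """Single-predictor ridge with an unpenalized intercept, via centering."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    xc = [xi - x_bar for xi in x]
    yc = [yi - y_bar for yi in y]
    sxx = sum(v * v for v in xc)
    sxy = sum(u * v for u, v in zip(xc, yc))
    slope = sxy / (sxx + lam)          # only the slope is shrunk
    intercept = y_bar - slope * x_bar  # recovered from the means, never penalized
    return intercept, slope


def ridge_penalized_intercept(x, y, lam):
    """Same model without centering: the penalty also hits the intercept."""
    n = len(x)
    # Normal equations (X'X + lam * I) b = X'y for X = [1, x],
    # solved for the 2x2 case with Cramer's rule.
    a11, a12, a22 = n + lam, sum(x), sum(v * v for v in x) + lam
    r1, r2 = sum(y), sum(u * v for u, v in zip(x, y))
    det = a11 * a22 - a12 * a12
    return (r1 * a22 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det


# Hypothetical toy data; the second response is the first shifted by 100.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
shifted = [yi + 100.0 for yi in y]

a0, b0 = ridge_centered(x, y, lam=2.0)
a1, b1 = ridge_centered(x, shifted, lam=2.0)
p0 = ridge_penalized_intercept(x, y, lam=2.0)
p1 = ridge_penalized_intercept(x, shifted, lam=2.0)

print("centered:            slope change =", b1 - b0, " intercept change =", a1 - a0)
print("penalized intercept: slope change =", p1[1] - p0[1])
```

With centering, shifting the response moves only the intercept and leaves the slope alone. When the intercept is penalized, the same shift changes the slope as well, which is the behavior the new default avoids.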
The engine has moved from RcppEigen to RcppArmadillo, allowing the
package to leverage high-performance LAPACK and BLAS libraries for
large-scale matrix operations.
- SVD-Powered Grid Search: grid.search has been entirely rewritten in
C++. It now utilizes a single Singular Value Decomposition (SVD) to
evaluate the entire \(\lambda\) grid analytically.
- Efficiency: Reduces computational complexity from
\(O(np^2)\) per grid point to \(O(\min(n, p))\) after the initial
decomposition.
- Parallel Computation: Extended the RcppParallel integration to
distribute workloads.
  - For K-fold CV, threads are distributed across folds.
  - For GCV/LOOCV grid searches, threads are distributed across the
    \(\lambda\) grid.
- Optimized LOOCV/GCV: Implemented closed-form solutions for
Leave-One-Out and Generalized Cross-Validation using the hat-matrix
diagonal, avoiding \(n\) model refits.
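
The hat-matrix shortcut and the "decompose once, evaluate the whole grid" idea from the list above can be sketched in plain Python for the simplest case, one centered predictor with an unpenalized intercept. This is an illustration under stated assumptions, not the package's C++ implementation: loocv_grid and loocv_refit are hypothetical names, and the data are made up. Here the hat diagonal is \(h_{ii}(\lambda) = 1/n + x_{c,i}^2 / (S_{xx} + \lambda)\), and the leave-one-out residual is the ordinary residual divided by \(1 - h_{ii}\).

```python
def loocv_grid(x, y, lambdas):
    """LOOCV mean squared error for each lambda, without refitting."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    xc = [xi - x_bar for xi in x]
    # Computed once and reused for every lambda (the role the SVD plays
    # in the multi-predictor case).
    sxx = sum(v * v for v in xc)
    sxy = sum(v * (yi - y_bar) for v, yi in zip(xc, y))
    scores = []
    for lam in lambdas:
        slope = sxy / (sxx + lam)
        total = 0.0
        for v, yi in zip(xc, y):
            resid = yi - (y_bar + slope * v)
            h = 1.0 / n + v * v / (sxx + lam)   # hat-matrix diagonal
            total += (resid / (1.0 - h)) ** 2   # leave-one-out residual
        scores.append(total / n)
    return scores


def loocv_refit(x, y, lam):
    """Brute-force LOOCV: refit n times, each time holding out one point."""
    n = len(x)
    total = 0.0
    for i in range(n):
        xt, yt = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        m = n - 1
        xb, yb = sum(xt) / m, sum(yt) / m
        sxx = sum((v - xb) ** 2 for v in xt)
        sxy = sum((v - xb) * (w - yb) for v, w in zip(xt, yt))
        slope = sxy / (sxx + lam)
        total += (y[i] - (yb + slope * (x[i] - xb))) ** 2
    return total / n


# Hypothetical toy data and grid.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]
lambdas = [0.0, 0.1, 1.0, 10.0]

closed = loocv_grid(x, y, lambdas)
brute = [loocv_refit(x, y, lam) for lam in lambdas]
best = lambdas[closed.index(min(closed))]
print("closed form:", closed)
print("brute force:", brute)
print("selected lambda:", best)
```

The two columns agree to rounding error, while the closed form touches each observation once per grid point instead of refitting the model \(n\) times.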
Numerical Robustness
- OLS Evolution: Transitioned from Column-Pivoted QR decomposition to
Complete Orthogonal Decomposition (COD). This enables the computation
of the unique minimum \(L_2\) norm solution for column rank-deficient
or underdetermined (\(p > n\)) systems.
- Ridge Evolution: Transitioned from Cholesky-based
methods to Singular Value Decomposition (SVD). This
avoids the numerical risks associated with forming the cross-product
matrix \(X^TX\) and ensures stability
in ill-conditioned settings.
- Precision Control: Added a tol (tolerance) parameter to define the
threshold for numerical rank estimation during COD and SVD operations.
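
For reference, the solutions named in the list above can be written in terms of the thin SVD \(X = U D V^T\), with singular values \(d_1 \ge \dots \ge d_k\), \(k = \min(n, p)\), and columns \(u_j\), \(v_j\). This is the textbook formulation rather than a transcription of the package internals. In particular, \(\tau\) stands in for whatever cutoff is derived from tol; whether that cutoff is absolute or relative to \(d_1\) is an assumption left open here.

\[
\hat\beta_{\text{ridge}}(\lambda) = \sum_{j=1}^{k} \frac{d_j}{d_j^2 + \lambda}\,(u_j^T y)\, v_j,
\qquad
\hat\beta_{\text{min-norm}} = \sum_{j:\, d_j > \tau} \frac{1}{d_j}\,(u_j^T y)\, v_j .
\]

The second expression is the pseudoinverse solution \(X^{+} y\). In exact arithmetic, and for the same numerical rank, a complete orthogonal decomposition yields the same vector. Neither form requires \(X^TX\), whose condition number is the square of that of \(X\) when \(X\) has full column rank: \(\kappa_2(X^TX) = \kappa_2(X)^2\).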
Internal Improvements
- Template Metaprogramming: Re-engineered the core logic as generic,
templated C++ code, resolving more work at compile time and reducing
runtime overhead.
- C++17 Migration: Upgraded the package build
standard to C++17, enabling more expressive syntax and modern compiler
optimizations.
- Memory Optimization: Refactored multi-threaded workers to utilize
pre-allocated buffers. By eliminating heap allocations within “hot
loops” (specifically during model fitting and out-of-sample
evaluation), the engine achieves higher throughput and lower latency.
- Armadillo Expression Tuning: Optimized the use of Armadillo
expression templates to maximize lazy evaluation. This minimizes the
creation of temporary objects and allows the compiler to generate more
efficient SIMD-vectorized loops.
- Comprehensive Testing Suite:
  - R Integration: Implemented extensive testthat suites to validate
    cvLM and grid.search against manual matrix algebra and established
    packages like boot.
  - Numerical Validation: Tests specifically target edge cases
    including ill-conditioned, rank-deficient, and high-dimensional
    (\(p > n\)) datasets.
- Zero-Copy Interoperability: Uses Armadillo’s advanced
(auxiliary-memory) constructors to wrap R-allocated memory directly,
so data passes between R and C++ without copying.