spboost: Nonlinear Spatial Autoregressive
Models with Boosting and CFE
Ghislain Geniaux
Flexible nonlinear spatial autoregressive models: a gradient boosting
approach with closed-form estimation
Geniaux, 2026.
Scientific motivation
Spatial autoregressive models are a central tool in spatial
econometrics because observations located close to each other often
interact. The spatial lag model introduced in the classical spatial
econometric literature (Ord, 1975; Anselin, 1988) formalizes this idea
through a row-standardized spatial weights matrix \(W\), which defines spatial lags such as
\(Wy\). In empirical work, the
parameter attached to \(Wy\) is usually
interpreted as a spillover or diffusion effect: outcomes observed at one
location partly depend on outcomes observed at neighbouring
locations.
The difficulty is that spatial dependence and nonlinear
misspecification are hard to distinguish empirically. McMillen (2003)
shows that residual spatial autocorrelation may arise because the
regression function is misspecified, not necessarily because there is a
genuine spatial spillover. This point is especially important when
covariates are themselves spatially structured. A flexible regression
component can correctly recover nonlinear covariate effects, but if it
is too flexible it can also absorb the same low-frequency spatial signal
that the researcher wants to attribute to a structural spatial parameter
— a phenomenon analysed as the geoadditive identification trap
in Geniaux (2026, Section 3.4). Recent work on identification in spatial
spillover models emphasizes this tension and the need to distinguish
functional-form flexibility from structural spatial dependence (Debarsy,
2025).
spboost is motivated by this identification problem,
building on the methodological developments in Geniaux (2026),
“Flexible nonlinear spatial autoregressive models: a gradient
boosting approach with closed-form estimation”. It combines
nonlinear regression tools with spatial autoregressive models while
keeping the spatial parameter explicitly identified and amenable to
standard diagnostics. The nonlinear component \(f(X)\) can be fitted with component-wise
boosting (mboost), penalized splines (mgcv),
MARS (earth), or tree/linear boosting
(xgboost). Component-wise boosting performs regularized
variable selection while fitting additive effects (Bühlmann and van de
Geer, 2011), thereby connecting the variable-selection literature in
statistical learning to spatial econometric estimation.
The second ingredient is computational. Likelihood-based spatial
models require repeated evaluation of log-determinants such as \(\log |I-\rho W|\), which becomes expensive
for large samples. Smirnov (2020, doi:10.1111/gean.12268) proposed a Closed-Form Estimator
(CFE) for linear spatial dependence models that avoids this repeated
determinant calculation. The CFE is computationally very fast because it
relies on three auxiliary regressions without spatial dependence, while
achieving accuracy close to maximum-likelihood estimation in the
settings studied by Smirnov (2020, doi:10.1111/gean.12268) for the estimation of the
spatial parameter. The methodological question behind
spboost is whether this determinant-free idea remains valid
when \(f(X)\) is estimated
nonparametrically. Geniaux (2026) gives a positive answer for additive
estimators that admit a fixed smoothing-matrix or projection
representation: a formal asymptotic result establishes the consistency
of the Closed-Form Estimator and the validity of its asymptotic variance
formula under standard regularity conditions on the additive components,
with no additional regularity requirement induced by the spatial
autoregressive structure. For boosting-based estimators, the formal
proof for the greedy operator remains open, but extensive finite-sample
evidence from the Monte Carlo experiments of Geniaux (2026) supports the
same conclusion.
The package focuses on three Gaussian spatial data-generating
structures:
\[
y = \rho Wy + f(X) + \varepsilon \qquad \text{(SAR)}
\]
\[
y = f(X) + (I - \lambda W)^{-1}\varepsilon \qquad \text{(SEM)}
\]
\[
y = \rho Wy + f(X) + (I - \lambda W_2)^{-1}\varepsilon \qquad
\text{(SARAR)}.
\]
Here \(f(X)\) is an additive or
machine-learning regression component. In the boosting backend, it is
written as
\[
f(X) = \beta_0 + \sum_j h_j(X_j),
\]
where each \(h_j\) is selected and
regularized through component-wise gradient boosting. The SARAR
specification supports two distinct row-standardized weight matrices —
\(W\) for the spatial lag and \(W_2\) for the error process — the common
case \(W_2 = W\) corresponding to the
SARAR model studied in Geniaux (2026). The main interface,
spbgam(), allows the user to specify the spatial model
(DGP) and the estimation method (method).
Two families of spatial-parameter estimators are implemented. Methods
ending in ML use a profile maximum-likelihood criterion and
evaluate the spatial log-determinant. Methods ending in CFE
use the determinant-free Closed-Form Estimator of Smirnov (2020, doi:10.1111/gean.12268),
extended to nonlinear additive regression in Geniaux (2026). In the
nonlinear setting, CFE proceeds by estimating auxiliary residuals from
regressions involving \(y\), \(Wy\), and, for SAR, \(W^2y\), and then solving a quadratic
estimating equation for the spatial parameter.
The theoretical status of the estimators differs across estimation
backends. The formal asymptotic result applies to nonlinear additive
estimators that admit a fixed smoothing-matrix representation, including
penalized spline GAM estimators and MARS estimators when the retained
basis dimension is fixed. Component-wise boosting estimators are
supported by consistent finite-sample evidence from the simulations of
Geniaux (2026) and by their asymptotic connection to penalized spline
regularization, but the formal proof for the greedy boosting operator
remains open. Tree-based xgboost variants are provided as
experimental predictive benchmarks whose validity in the CFE framework
requires empirical validation on a case-by-case basis.
Main interface
The executable function signature is:
args(spbgam)
#> function (formula, data, W, W2 = NULL, DGP = "SAR", method = "gamboost_ML",
#> control = list(), debug = NULL, debug_fit_each_iter = NULL,
#> debug_print = NULL)
#> NULL
The key arguments are:
formula |
Regression formula. Its syntax must match the backend:
mboost base learners for BSPA,
mgcv smooths for GAM, ordinary formula terms
for MARS and XGBOOST. |
data |
Data frame containing the response and covariates. |
W |
Row-standardized spatial weights matrix for SAR or SEM
dependence. |
W2 |
Optional second row-standardized matrix for the error process in
SARAR models. |
DGP |
Spatial model: "SAR", "SEM", or
"SARAR". |
method |
Estimator/backend combination, such as "BSPA_SAR_CFE"
or "MARS_SEM_ML". |
control |
List of backend and tuning controls. |
Model and estimator catalogue
The method names follow a compact grammar:
<backend>_<model>_<spatial estimator>
The backend determines how \(f(X)\)
is estimated; the model determines the spatial structure; the suffix
determines how the spatial parameter is estimated.
BSPA |
Component-wise boosting with spline base learners from
mboost |
Y ~ bbs(X1) + bbs(X2) + bols(X3) |
mboost object augmented with class
spboost |
BLA |
Component-wise boosting with linear base learners |
Y ~ bols(X1) + bols(X2) |
mboost object augmented with class
spboost |
GAM |
Penalized spline GAM from mgcv |
Y ~ s(X1) + s(X2) |
gam/bam object augmented with class
spboost |
MARS |
Multivariate Adaptive Regression Splines through
earth::earth |
Y ~ X1 + X2 + X3 |
earth object augmented with class
spboost |
XGBOOST |
Linear or tree-based gradient boosting through
xgboost |
Y ~ X1 + X2 + X3 |
xgb.Booster object augmented with class
spboost |
LM |
Linear benchmark |
Y ~ X1 + X2 + X3 |
Linear spatial model object augmented with spatial fields |
Available combinations through spbgam() are:
SAR |
\(y=\rho Wy+f(X)+\varepsilon\) |
BSPA_SAR_ML, BSPA_SAR_CFE,
BLA_SAR_CFE, MARS_SAR_ML,
MARS_SAR_CFE, GAM_SAR_ML,
GAM_SAR_CFE, GAM_SAR_FIVA,
GAM_SAR_2SLSA, XGBOOST_SAR_ML,
XGBOOST_SAR_CFE, LM_SAR_ML,
BLA_SAR_ML |
SEM |
\(y=f(X)+(I-\lambda
W)^{-1}\varepsilon\) |
BSPA_SEM_ML, BSPA_SEM_CFE,
BSPA_SEM_CFE_iter, BSPA_SEM_CFE_BRUT,
MARS_SEM_ML, MARS_SEM_CFE,
GAM_SEM_CFE, BLA_SEM_ML |
SARAR |
\(y=\rho Wy+f(X)+(I-\lambda
W_2)^{-1}\varepsilon\) |
BSPA_SARAR_ML, BSPA_SARAR_CFE,
BLA_SARAR_ML |
In user-facing discussions, spboost_earth_cfe
corresponds to the MARS_*_CFE branch, because the
implementation uses earth::earth;
spboost_xgboost_cfe corresponds to the
XGBOOST_SAR_CFE branch.
Theoretical status
The table below is intentionally conservative. It separates formal
asymptotic results, finite-sample empirical validation, and experimental
methods.
| Formal asymptotic guarantees |
GAM_SAR_CFE, GAM_SEM_CFE;
MARS_SAR_CFE and MARS_SEM_CFE when
nprune is fixed |
The CFE residual projection argument applies to additive estimators
with a fixed smoothing-matrix or projection representation. Consistency
and asymptotic variance validity follow under the regularity and
identification conditions stated in the methodological paper. |
| Finite-sample validated by simulation evidence |
BSPA_SAR_CFE, BSPA_SEM_CFE,
BSPA_SEM_CFE_iter, BSPA_SEM_CFE_BRUT;
MARS_*_CFE when nprune is selected by
cross-validation |
The formal proof does not directly cover the discrete
model-selection path, but Monte Carlo evidence shows behaviour close to
the corresponding GAM/MARS reference estimators in the studied
designs. |
| Likelihood reference estimators |
*_ML methods such as BSPA_SAR_ML,
MARS_SAR_ML, BSPA_SEM_ML,
BSPA_SARAR_ML |
These are natural robustness checks for CFE results. They are
usually more computationally demanding because they rely on profile
likelihood and log-determinant evaluation. |
| Experimental spatial estimators |
GAM_SAR_FIVA, GAM_SAR_2SLSA,
XGBOOST_SAR_CFE, XGBOOST_SAR_ML,
BSPA_SARAR_CFE |
These are useful for benchmarking, prediction, or exploratory
analysis, but their theoretical status is more limited in the current
framework. Tree-based xgboost methods do not admit the
smoothing-matrix representation used by the CFE proof. |
For SEM models with high spatial error dependence near the boundary,
CFE fallbacks may be triggered. In that case, it is good practice to
compare against the corresponding ML estimator and to
inspect the rho_method field stored in the fitted
object.
Returned objects
All successful fits are given class spboost, in addition
to the class of the underlying backend. Common fields include:
rho |
Estimated spatial autoregressive parameter for SAR; in SEM routines
this field often stores the estimated error parameter \(\lambda\). |
lambda |
Error-process parameter for SARAR methods, when available. |
fitted |
Full fitted value on the response scale. |
fittedYS |
Fitted nonlinear component after the spatial transformation used
internally. |
residuals |
Full-model residuals. |
rmse |
Root mean squared error of full-model residuals. |
var_rho |
CFE asymptotic variance estimate when available and numerically
stable. |
rho_method |
Diagnostic label for the CFE branch, such as cfe_exact,
cfe_ratio_R, or another fallback. |
spbgam_meta |
Metadata added by spbgam(), including the method, DGP,
stopping rule, and CV information. |
The package provides summary.spboost(),
predict.spboost(), and fitted_decomp_spboost()
methods for model inspection, prediction, and decomposition of fitted
values by variable.
Example 1: SAR model with fixed hyperparameters
The first example simulates a nonlinear SAR process and fits two
BSPA estimators with known hyperparameters. The CFE version
is fast and determinant-free; the ML version is a useful reference
because it estimates \(\rho\) by
profile likelihood.
library(spboost)
library(mboost)
set.seed(1)
sim <- dgp(
n = 1000,
rho = 0.6,
betas = c(0, 0.5, 1, -1),
sigma2 = 1,
model = "SAR",
nonlin = TRUE,
myseed = 1
)
fit_sar_cfe <- spbgam(
formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
data = sim$data,
W = sim$W,
DGP = "SAR",
method = "BSPA_SAR_CFE",
control = list(
control_gamboost = boost_control(mstop = 150, nu = 0.1)
)
)
fit_sar_cfe$rho
#> [1] 0.6073672
fit_sar_cfe$rmse
#> [1] 1.032868
fit_sar_cfe$rho_method
#> [1] "cfe_exact"
summary(fit_sar_cfe)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim$data,
#> W = sim$W, DGP = "SAR", method = "BSPA_SAR_CFE", control = list(control_gamboost = boost_control(mstop = 150,
#> nu = 0.1)))
#> Estimator: BSPA_SAR_CFE
#> mstop: 150 (fixed)
#> Spatial parameter (rho/lambda): 0.607367
#> Spatial estimator: cfe_exact
#> Var(rho/lambda): 0.001110
#> SE(rho/lambda): 0.033318
#> Wald z-statistic: 18.229407
#> p-value (H0: rho/lambda = 0): < 2.2e-16
#> Elapsed (s): 0.240
#>
#> Full model diagnostics (original Y scale)
#> n = 1000
#> RMSE: 1.032868
#> MAE : 0.821090
#> R2 : 0.567521
#> Deviance explained (full): 0.567521
#>
#> Underlying model summary (spatially filtered equation)
#>
#> Model-based Boosting
#>
#> Call:
#> gamboost(formula = "Y ~ bbs(X1) + bbs(X2) + bbs(X3)", data = data, control = control)
#>
#>
#> Squared Error (Regression)
#>
#> Loss function: (y - f)^2
#>
#>
#> Number of boosting iterations: mstop = 150
#> Step size: 0.1
#> Offset: 0.4862967
#> Number of baselearners: 3
#>
#> Selection frequencies:
#> bbs(X2) bbs(X3) bbs(X1)
#> 0.4600000 0.3333333 0.2066667
The likelihood counterpart uses the same regression formula and
stopping iteration:
fit_sar_ml <- spbgam(
formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
data = sim$data,
W = sim$W,
DGP = "SAR",
method = "BSPA_SAR_ML",
control = list(
control_gamboost = boost_control(mstop = 150, nu = 0.1)
)
)
c(CFE = fit_sar_cfe$rho, ML = fit_sar_ml$rho)
#> CFE ML
#> 0.6073672 0.6022003
summary(fit_sar_ml)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim$data,
#> W = sim$W, DGP = "SAR", method = "BSPA_SAR_ML", control = list(control_gamboost = boost_control(mstop = 150,
#> nu = 0.1)))
#> Estimator: BSPA_SAR_ML
#> mstop: 150 (fixed)
#> Spatial parameter (rho/lambda): 0.602200
#>
#> Full model diagnostics (original Y scale)
#> n = 1000
#> RMSE: 1.033908
#> MAE : 0.821900
#> R2 : 0.566650
#> Deviance explained (full): 0.566650
#>
#> Underlying model summary (spatially filtered equation)
#>
#> Model-based Boosting
#>
#> Call:
#> gamboost(formula = formula2, data = data, control = control_final)
#>
#>
#> Squared Error (Regression)
#>
#> Loss function: (y - f)^2
#>
#>
#> Number of boosting iterations: mstop = 150
#> Step size: 0.1
#> Offset: 0.4927991
#> Number of baselearners: 3
#>
#> Selection frequencies:
#> bbs(X2) bbs(X3) bbs(X1)
#> 0.4600000 0.3333333 0.2066667
The fitted nonlinear components can be inspected visually. Because
additive components are identifiable up to constants, the true and
estimated component-specific curves are centered before plotting.
decomp_cfe_ex1 <- fitted_decomp_spboost(fit_sar_cfe)
decomp_ml_ex1 <- fitted_decomp_spboost(fit_sar_ml)
center_curve <- function(x) as.numeric(x - mean(x, na.rm = TRUE))
plot_component_ex1 <- function(true_col, fit_col, title) {
ord <- order(sim$data[[fit_col]])
x_component <- sim$data[[fit_col]][ord]
true_component <- center_curve(sim$data[[true_col]])[ord]
cfe_component <- center_curve(decomp_cfe_ex1[[fit_col]])[ord]
ml_component <- center_curve(decomp_ml_ex1[[fit_col]])[ord]
plot(
x_component,
ml_component,
type = "l",
lwd = 1,
col = "#D95F02",
xlab = fit_col,
ylab = "Centered component",
main = title,
ylim = range(c(true_component, cfe_component, ml_component), na.rm = TRUE)
)
lines(x_component, cfe_component, col = "#1B9E77", lwd = 1)
lines(x_component, true_component, col = "black", lwd = 3)
}
old_par_ex1 <- par(mfrow = c(3, 1), mar = c(4, 4, 2, 1))
plot_component_ex1("f1", "X1", "Component f1(X1)")
legend(
"topleft",
legend = c("True component", "BSPA_SAR_CFE", "BSPA_SAR_ML"),
col = c("black", "#1B9E77", "#D95F02"),
lwd = c(3, 1, 1),
bty = "n"
)
plot_component_ex1("f2", "X2", "Component f2(X2)")
plot_component_ex1("f3", "X3", "Component f3(X3)")

Example 2: Internal cross-validation, random and spatial
The same SAR model can be estimated with internal cross-validation
for the boosting stopping iteration. With
cv_mode = "random", folds are random. With
cv_mode = "spatial_block", each fold holds out one spatial
cluster built with blockCV::cv_cluster(). With
cv_mode = "spatial_hex", folds are defined as spatial
hexagonal blocks built from the coordinate columns x and
y. However, this strategy is generally not recommended for
autoregressive models because it may strongly disrupt the underlying
neighborhood graph structure (see Geniaux, 2026). The spatial options
require blockCV and sf; if they are not
installed, the package falls back to a safe random split. Here
mstop = 500 is not the final fixed stopping iteration: it
is the maximum value searched by the internal CV procedure. Use
cv_plot = TRUE to display the spatial folds as points
coloured by block; the plotted coordinates are also stored in
fit$mstop_selection$cv_plot_data.
fit_sar_cv_random <- spbgam(
formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
data = sim$data,
W = sim$W,
DGP = "SAR",
method = "BSPA_SAR_CFE",
control = list(
# Maximum mstop considered by the internal CV search.
control_gamboost = boost_control(mstop = 500, nu = 0.1),
mstop_init = 100,
mstop_criterion = "CV",
nfold = 5,
ncore = 1,
cv_mode = "random",
cv_seed_spatial = 1001
)
)
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
fit_sar_cv_spatial <- spbgam(
formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
data = sim$data,
W = sim$W,
DGP = "SAR",
method = "BSPA_SAR_CFE",
control = list(
# Maximum mstop considered by the internal CV search.
control_gamboost = boost_control(mstop = 500, nu = 0.1),
mstop_init = 100,
mstop_criterion = "CV",
nfold = 5,
ncore = 1,
cv_mode = "spatial_block",
cv_coord_vars_spatial = c("x", "y"),
cv_hex_size_spatial = 0.15,
cv_seed_spatial = 1001,
cv_plot = TRUE
)
)
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
fit_sar_cv_random$spbgam_meta$mstop_final
#> [1] 152
fit_sar_cv_spatial$spbgam_meta$mstop_final
#> [1] 192
summary(fit_sar_cv_random)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim$data,
#> W = sim$W, DGP = "SAR", method = "BSPA_SAR_CFE", control = list(control_gamboost = boost_control(mstop = 500,
#> nu = 0.1), mstop_init = 100, mstop_criterion = "CV",
#> nfold = 5, ncore = 1, cv_mode = "random", cv_seed_spatial = 1001))
#> Estimator: BSPA_SAR_CFE
#> mstop: 152 (CV)
#> CV mode: random
#> CV RMSE on spatially filtered Y (at best mstop): 1.048652
#> CV RMSE on Y (OOF BPN, at best mstop): 1.046238
#> OOF predictions used: 1000 / 1000
#> Spatial parameter (rho/lambda): 0.607354
#> Spatial estimator: cfe_exact
#> Var(rho/lambda): 0.001110
#> SE(rho/lambda): 0.033317
#> Wald z-statistic: 18.229361
#> p-value (H0: rho/lambda = 0): < 2.2e-16
#> Elapsed (s): 0.208
#>
#> Full model diagnostics (original Y scale)
#> n = 1000
#> RMSE: 1.032814
#> MAE : 0.821066
#> R2 : 0.567567
#> Deviance explained (full): 0.567567
#>
#> Underlying model summary (spatially filtered equation)
#>
#> Model-based Boosting
#>
#> Call:
#> gamboost(formula = "Y ~ bbs(X1) + bbs(X2) + bbs(X3)", data = data, control = control)
#>
#>
#> Squared Error (Regression)
#>
#> Loss function: (y - f)^2
#>
#>
#> Number of boosting iterations: mstop = 152
#> Step size: 0.1
#> Offset: 0.4863137
#> Number of baselearners: 3
#>
#> Selection frequencies:
#> bbs(X2) bbs(X3) bbs(X1)
#> 0.4605263 0.3355263 0.2039474
summary(fit_sar_cv_spatial)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim$data,
#> W = sim$W, DGP = "SAR", method = "BSPA_SAR_CFE", control = list(control_gamboost = boost_control(mstop = 500,
#> nu = 0.1), mstop_init = 100, mstop_criterion = "CV",
#> nfold = 5, ncore = 1, cv_mode = "spatial_block", cv_coord_vars_spatial = c("x",
#> "y"), cv_hex_size_spatial = 0.15, cv_seed_spatial = 1001,
#> cv_plot = TRUE))
#> Estimator: BSPA_SAR_CFE
#> mstop: 192 (CV)
#> CV mode: spatial_block
#> CV RMSE on spatially filtered Y (at best mstop): 1.051404
#> CV RMSE on Y (OOF BPN, at best mstop): 1.247931
#> OOF predictions used: 1000 / 1000
#> Spatial parameter (rho/lambda): 0.607215
#> Spatial estimator: cfe_exact
#> Var(rho/lambda): 0.001109
#> SE(rho/lambda): 0.033307
#> Wald z-statistic: 18.231059
#> p-value (H0: rho/lambda = 0): < 2.2e-16
#> Elapsed (s): 0.227
#>
#> Full model diagnostics (original Y scale)
#> n = 1000
#> RMSE: 1.031971
#> MAE : 0.820704
#> R2 : 0.568272
#> Deviance explained (full): 0.568272
#>
#> Underlying model summary (spatially filtered equation)
#>
#> Model-based Boosting
#>
#> Call:
#> gamboost(formula = "Y ~ bbs(X1) + bbs(X2) + bbs(X3)", data = data, control = control)
#>
#>
#> Squared Error (Regression)
#>
#> Loss function: (y - f)^2
#>
#>
#> Number of boosting iterations: mstop = 192
#> Step size: 0.1
#> Offset: 0.4864878
#> Number of baselearners: 3
#>
#> Selection frequencies:
#> bbs(X2) bbs(X3) bbs(X1)
#> 0.4947917 0.3072917 0.1979167
The same fold assignment can be plotted explicitly from the fitted
object. This is useful when knitting the vignette because the spatial CV
diagnostic is then a standard top-level figure.

Example 3: Comparison table for SAR estimators
The next block compares six estimators on the same simulated SAR
sample: linear lagsarlm, non-spatial gamboost,
BSPA_SAR_CFE, BSPA_SAR_ML,
GAM_SAR_CFE, and GAM_SAR_ML. The table reports
the response RMSE, the error on \(\rho\), and the RMSE of the estimated
structural component \(f(X)\). For the
non-spatial gamboost benchmark, \(\rho\) is not estimated.
The GAM_SAR_* versions are especially useful once the
specification is known or has been selected in a preliminary BSPA
workflow. They avoid the variable-selection boosting loop and are
therefore much faster on large samples; in particular,
GAM_SAR_CFE can switch to mgcv::bam()
internally.
library(spatialreg)
#> Loading required package: spData
#> Loading required package: sf
#> Warning: package 'sf' was built under R version 4.3.3
#> Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(spdep)
#>
#> Attaching package: 'spdep'
#> The following objects are masked from 'package:spatialreg':
#>
#> get.ClusterOption, get.VerboseOption, get.ZeroPolicyOption,
#> get.coresOption, get.mcOption, set.ClusterOption,
#> set.VerboseOption, set.ZeroPolicyOption, set.coresOption,
#> set.mcOption
listw <- spdep::mat2listw(sim$W, style = "W")
fit_lagsarlm <- spatialreg::lagsarlm(
Y ~ X1 + X2 + X3,
data = sim$data,
listw = listw,
method = "LU",
zero.policy = TRUE
)
fit_gamboost <- mboost::gamboost(
Y ~ bbs(X1) + bbs(X2) + bbs(X3),
data = sim$data,
control = boost_control(mstop = 150, nu = 0.1)
)
fit_gam_sar_cfe <- spbgam(
formula = Y ~ s(X1, k = 5) + s(X2, k = 5) + s(X3, k = 5),
data = sim$data,
W = sim$W,
DGP = "SAR",
method = "GAM_SAR_CFE",
control = list(ncore = 1)
)
fit_gam_sar_ml <- spbgam(
formula = Y ~ s(X1, k = 5) + s(X2, k = 5) + s(X3, k = 5),
data = sim$data,
W = sim$W,
DGP = "SAR",
method = "GAM_SAR_ML",
control = list(ncore = 1)
)
true_f <- rowSums(sim$data[, c("f0", "f1", "f2", "f3")])
wy <- as.numeric(sim$W %*% sim$data$Y)
f_lagsarlm <- as.numeric(
{
x_lag <- model.matrix(Y ~ X1 + X2 + X3, sim$data)
b_lag <- coef(fit_lagsarlm)[colnames(x_lag)]
x_lag %*% b_lag
}
)
f_gamboost <- as.numeric(fitted(fit_gamboost))
f_cfe <- as.numeric(fit_sar_cfe$fittedYS)
f_ml <- as.numeric(fit_sar_ml$fittedYS)
f_gam_cfe <- as.numeric(fit_gam_sar_cfe$fittedYS)
f_gam_ml <- as.numeric(fit_gam_sar_ml$fittedYS)
comparison_sar <- data.frame(
estimator = c(
"SAR(lagsarlm)", "gamboost", "BSPA_SAR_CFE", "BSPA_SAR_ML",
"GAM_SAR_CFE", "GAM_SAR_ML"
),
rmse_y = c(
sqrt(mean((sim$data$Y - suppressMessages(fitted(fit_lagsarlm)))^2)),
sqrt(mean((sim$data$Y - fitted(fit_gamboost))^2)),
fit_sar_cfe$rmse,
fit_sar_ml$rmse,
fit_gam_sar_cfe$rmse,
fit_gam_sar_ml$rmse
),
rmse_rho = c(
abs(fit_lagsarlm$rho - 0.6),
NA_real_,
abs(fit_sar_cfe$rho - 0.6),
abs(fit_sar_ml$rho - 0.6),
abs(fit_gam_sar_cfe$rho - 0.6),
abs(fit_gam_sar_ml$rho - 0.6)
),
rmse_f = c(
sqrt(mean((true_f - f_lagsarlm)^2)),
sqrt(mean((true_f - f_gamboost)^2)),
sqrt(mean((true_f - f_cfe)^2)),
sqrt(mean((true_f - f_ml)^2)),
sqrt(mean((true_f - f_gam_cfe)^2)),
sqrt(mean((true_f - f_gam_ml)^2))
)
)
comparison_sar
#> estimator rmse_y rmse_rho rmse_f
#> 1 SAR(lagsarlm) 1.173212 2.798108e-02 0.5592926
#> 2 gamboost 1.320820 NA 0.7724488
#> 3 BSPA_SAR_CFE 1.032868 7.367201e-03 0.1268243
#> 4 BSPA_SAR_ML 1.033908 2.200315e-03 0.1261273
#> 5 GAM_SAR_CFE 1.033243 7.861567e-03 0.1208463
#> 6 GAM_SAR_ML 1.034861 9.937418e-05 0.1198516
Example 4: SEM model with spatial-error filtering
For SEM models, the spatial parameter governs the error process. The
SEM-CFE methods build auxiliary residual regressions and can optionally
tune the boosting stopping iteration using a SEM-filtered criterion. In
this example, the simulated DGP is SEM and the target parameter is \(\lambda\), stored in the rho
field by the current SEM routines.
sim_sem <- dgp(
n = 1000,
rho = 0.5,
betas = c(0, 0.5, 1, -1),
sigma2 = 1,
model = "SEM",
nonlin = TRUE,
myseed = 2
)
fit_sem_cfe <- spbgam(
formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
data = sim_sem$data,
W = sim_sem$W,
DGP = "SEM",
method = "BSPA_SEM_CFE",
control = list(
control_gamboost = boost_control(mstop = 250, nu = 0.1),
mstop_criterion = "CV",
cv_mode = "random",
nfold = 5,
ncore = 1
)
)
#> Warning in spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data =
#> sim_sem$data, : BSPA_SEM_CFE with SEM CV: enabling internal CV on CFE auxiliary
#> regressions (Y and WY).
fit_sem_cfe$rho
#> [1] 0.4665136
fit_sem_cfe$rmse
#> [1] 1.130323
fit_sem_cfe$rho_method
#> [1] "cfe_exact"
summary(fit_sem_cfe)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim_sem$data,
#> W = sim_sem$W, DGP = "SEM", method = "BSPA_SEM_CFE", control = list(control_gamboost = boost_control(mstop = 250,
#> nu = 0.1), mstop_criterion = "CV", cv_mode = "random",
#> nfold = 5, ncore = 1))
#> Estimator: BSPA_SEM_CFE
#> mstop: 130 (CV)
#> CV mode: random
#> CV RMSE on spatially filtered Y (at best mstop): 1.022093
#> Spatial parameter (rho/lambda): 0.466514
#> Spatial estimator: cfe_exact
#> CFE exact root selected: 1
#> Var(rho/lambda): 0.001541
#> SE(rho/lambda): 0.039250
#> Wald z-statistic: 11.885648
#> p-value (H0: rho/lambda = 0): < 2.2e-16
#> Elapsed (s): 0.088
#>
#> Full model diagnostics (original Y scale)
#> n = 1000
#> RMSE: 1.130323
#> MAE : 0.910778
#> R2 : 0.268590
#> Deviance explained (full): 0.268590
#>
#> Underlying model summary (spatially filtered equation)
#>
#> Model-based Boosting
#>
#> Call:
#> gamboost(formula = formula, data = data, family = cust_loss_sem(I_rhoW), control = control)
#>
#>
#> SEM with I_rhoW=Diagonal(nrow(W))-rho*W
#>
#> Loss function: as.numeric((I_rhoW %*% (y - f))^2)
#>
#>
#> Number of boosting iterations: mstop = 130
#> Step size: 0.1
#> Offset: 0.3461594
#> Number of baselearners: 3
#>
#> Selection frequencies:
#> bbs(X3) bbs(X2) bbs(X1)
#> 0.4076923 0.4000000 0.1923077
When the estimated spatial error parameter is close to the boundary,
compare with BSPA_SEM_ML:
fit_sem_ml <- spbgam(
formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
data = sim_sem$data,
W = sim_sem$W,
DGP = "SEM",
method = "BSPA_SEM_ML",
control = list(
control_gamboost = boost_control(mstop = 250, nu = 0.1)
)
)
c(CFE = fit_sem_cfe$rho, ML = fit_sem_ml$rho)
#> CFE ML
#> 0.4665136 0.4659820
summary(fit_sem_ml)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim_sem$data,
#> W = sim_sem$W, DGP = "SEM", method = "BSPA_SEM_ML", control = list(control_gamboost = boost_control(mstop = 250,
#> nu = 0.1)))
#> Estimator: BSPA_SEM_ML
#> mstop: 250 (fixed)
#> Spatial parameter (rho/lambda): 0.465982
#>
#> Full model diagnostics (original Y scale)
#> n = 1000
#> RMSE: 1.128568
#> MAE : 0.909587
#> R2 : 0.270859
#> Deviance explained (full): 0.270859
#>
#> Underlying model summary (spatially filtered equation)
#>
#> Model-based Boosting
#>
#> Call:
#> gamboost(formula = formula, data = data, family = cust_loss_sem(I_rhoW), control = control_final)
#>
#>
#> SEM with I_rhoW=Diagonal(nrow(W))-rho*W
#>
#> Loss function: as.numeric((I_rhoW %*% (y - f))^2)
#>
#>
#> Number of boosting iterations: mstop = 250
#> Step size: 0.1
#> Offset: 0.346143
#> Number of baselearners: 3
#>
#> Selection frequencies:
#> bbs(X2) bbs(X3) bbs(X1)
#> 0.492 0.340 0.168
A SEM comparison table can be built in the same spirit as the SAR
table. Here errorsarlm is the linear spatial-error
reference, while gamboost remains a non-spatial nonlinear
benchmark.
fit_errorsarlm <- spatialreg::errorsarlm(
Y ~ X1 + X2 + X3,
data = sim_sem$data,
listw = spdep::mat2listw(sim_sem$W, style = "W"),
method = "LU",
zero.policy = TRUE
)
fit_gamboost_sem <- mboost::gamboost(
Y ~ bbs(X1) + bbs(X2) + bbs(X3),
data = sim_sem$data,
control = boost_control(mstop = 250, nu = 0.1)
)
true_f_sem <- rowSums(sim_sem$data[, c("f0", "f1", "f2", "f3")])
f_errorsarlm <- as.numeric(
{
x_err <- model.matrix(Y ~ X1 + X2 + X3, sim_sem$data)
b_err <- coef(fit_errorsarlm)[colnames(x_err)]
x_err %*% b_err
}
)
f_gamboost_sem <- as.numeric(fitted(fit_gamboost_sem))
f_sem_cfe <- as.numeric(fit_sem_cfe$fittedYS)
f_sem_ml <- as.numeric(fit_sem_ml$fittedYS)
lambda_errorsarlm <- if (!is.null(fit_errorsarlm$lambda)) {
fit_errorsarlm$lambda
} else {
fit_errorsarlm$lambda.se
}
if (length(lambda_errorsarlm) == 0L) lambda_errorsarlm <- NA_real_
scalar_or_na <- function(x) {
if (is.null(x) || length(x) == 0L) return(NA_real_)
as.numeric(x[1])
}
comparison_sem <- data.frame(
estimator = c("SEM(errorsarlm)", "gamboost", "BSPA_SEM_CFE", "BSPA_SEM_ML"),
rmse_y = c(
sqrt(mean((sim_sem$data$Y - suppressMessages(fitted(fit_errorsarlm)))^2)),
sqrt(mean((sim_sem$data$Y - fitted(fit_gamboost_sem))^2)),
scalar_or_na(fit_sem_cfe$rmse),
scalar_or_na(fit_sem_ml$rmse)
),
rmse_lambda = c(
abs(as.numeric(lambda_errorsarlm[1]) - 0.5),
NA_real_,
abs(scalar_or_na(fit_sem_cfe$rho) - 0.5),
abs(scalar_or_na(fit_sem_ml$rho) - 0.5)
),
rmse_f = c(
sqrt(mean((true_f_sem - f_errorsarlm)^2)),
sqrt(mean((true_f_sem - f_gamboost_sem)^2)),
sqrt(mean((true_f_sem - f_sem_cfe)^2)),
sqrt(mean((true_f_sem - f_sem_ml)^2))
)
)
comparison_sem
#> estimator rmse_y rmse_lambda rmse_f
#> 1 SEM(errorsarlm) 1.166148 0.09506476 0.5696819
#> 2 gamboost 1.128129 NA 0.1873721
#> 3 BSPA_SEM_CFE 1.130323 0.03348644 0.1787908
#> 4 BSPA_SEM_ML 1.128568 0.03401802 0.1759179
Example 5: MARS / earth CFE branch
The MARS_*_CFE methods use earth::earth for
the nonlinear component. This branch is useful when piecewise-linear
effects are expected and a compact model is desired.
fit_mars <- spbgam(
formula = Y ~ X1 + X2 + X3,
data = sim$data,
W = sim$W,
DGP = "SAR",
method = "MARS_SAR_CFE",
control = list(
control_earth = list(
degree = 1,
nprune = 12,
trace = 0
)
)
)
fit_mars$rho
#> [1] 0.6039672
fit_mars$rmse
#> [1] 1.036029
fit_mars$rho_method
#> [1] "cfe_exact"
For theory-oriented work, fixing nprune keeps the
estimator within the projection argument described above. For predictive
work, nprune can also be selected by internal
cross-validation:
fit_mars_cv <- spbgam(
formula = Y ~ X1 + X2 + X3,
data = sim$data,
W = sim$W,
DGP = "SAR",
method = "MARS_SAR_CFE",
control = list(
control_earth = list(
degree = 1,
use_cv_nprune = TRUE,
cv_nfold = 5,
cv_ncore = 1,
trace = 0
)
)
)
Example 6: Experimental xgboost CFE branch
The XGBOOST_SAR_CFE estimator can be useful as a
predictive benchmark, but it falls outside the formal smoothing-matrix
framework. Use it as an experimental method and validate it empirically
against CFE/ML additive estimators.
fit_xgb <- spbgam(
formula = Y ~ X1 + X2 + X3,
data = sim$data,
W = sim$W,
DGP = "SAR",
method = "XGBOOST_SAR_CFE",
control = list(
mstop_init = 100,
myparams_xgboost = list(
booster = "gbtree",
eta = 0.05,
max_depth = 2,
subsample = 0.8,
colsample_bytree = 0.8,
nfold = 5,
early_stopping_rounds = 5,
nthread = 1,
verbose = 0
)
)
)
fit_xgb$rho
#> [1] 0.6035466
fit_xgb$rmse
#> [1] 0.9983839
Example 7: Prediction and decomposition
For fitted spboost objects, use the S3 prediction
method. Out-of-sample prediction with spatial correction requires the
full-sample W matrix covering both training and prediction
locations.
pred_in <- predict(fit_sar_cfe)
head(pred_in)
#> [,1]
#> [1,] 1.4229374
#> [2,] 0.3184280
#> [3,] 1.1425926
#> [4,] -0.1092532
#> [5,] 0.1610168
#> [6,] 1.0677343
decomp <- fitted_decomp_spboost(fit_sar_cfe)
head(decomp)
#> Intercept X1 X2 X3 fitted
#> 1 0.4862967 0.39833445 0.2019544 0.33635195 2.4327536
#> 2 0.4862967 0.50759936 -0.7541467 0.07867869 0.6900655
#> 3 0.4862967 0.39272386 0.1642198 0.09935227 2.3494308
#> 4 0.4862967 -0.05307168 -0.6423254 0.09984719 0.9543365
#> 5 0.4862967 -0.27366328 0.2495608 -0.30117731 1.1089248
#> 6 0.4862967 -0.44735972 0.7070805 0.32171686 2.0627377
For a new dataset, refit on a training subset and provide a
full-sample weights matrix covering both training and prediction
locations:
train_id <- 1:800
test_id <- 801:1000
W_train <- sim$W[train_id, train_id, drop = FALSE]
row_sum_train <- Matrix::rowSums(W_train)
W_train <- Matrix::Diagonal(
x = ifelse(row_sum_train > 0, 1 / row_sum_train, 0)
) %*% W_train
fit_sar_train <- spbgam(
formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
data = sim$data[train_id, ],
W = W_train,
DGP = "SAR",
method = "BSPA_SAR_CFE",
control = list(
control_gamboost = boost_control(mstop = 100, nu = 0.1)
)
)
#> Warning in spb_warn_w_quality(W, name = "W"): W contains 2 island row(s) (row
#> sum ~ 0).
pred_new <- predict(
fit_sar_train,
newdata = sim$data[test_id, ],
data = sim$data[train_id, ],
W = sim$W,
type = "BPN"
)
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
head(pred_new)
#> [1] 1.7598530 1.8834854 0.6036237 0.4816779 1.6679928 2.1017742
# diff RMSE train - test
fit_sar_train$rmse
#> [1] 1.078054
rmse_test<-sqrt(mean((pred_new-sim$data[test_id, 'Y'])^2))
rmse_test
#> [1] 1.036808
Example 8: SARAR models
SARAR models include both a spatial lag in the response and a
spatially structured error process. They require two spatial matrices
unless W2 = W is acceptable for the application.
sim_sarar <- dgp(
n = 1000,
rho = 0.5,
lambda = -0.3,
betas = c(0, 0.5, 1, -1),
sigma2 = 1,
model = "SARAR",
nonlin = TRUE,
myseed = 3
)
fit_sarar <- spbgam(
formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
data = sim_sarar$data,
W = sim_sarar$W,
W2 = sim_sarar$W2,
DGP = "SARAR",
method = "BSPA_SARAR_CFE",
control = list(
control_gamboost = boost_control(mstop = 200, nu = 0.1),
iter_max = 3,
tol_lambda = 1e-4
)
)
fit_sarar$rho
#> [1] 0.450844
fit_sarar$lambda
#> [1] -0.2373298
fit_sarar$rmse
#> [1] 1.188259
Because SARAR-CFE involves joint determinant-free estimation of two
spatial parameters, it should be treated more cautiously than the
marginal SAR and SEM branches.
Practical recommendations
Start with the simplest estimator that matches the scientific
question. For inference-oriented nonlinear SAR or SEM work with known
smooth structure, GAM_*_CFE or fixed-nprune
MARS_*_CFE are the closest to the formal asymptotic theory.
For automatic variable selection, use BSPA_* methods and
compare CFE results with their ML counterparts. For
prediction-oriented benchmarks, MARS and xgboost can be
valuable, but xgboost CFE should be reported as
experimental.
For very large datasets, a practical workflow is to search for the
model specification with BSPA_*_CFE on representative
subsamples, using boosting as a variable-selection and formula-discovery
device. Once the relevant covariates, nonlinear terms, and spatial
specification are known, refit a GAM_*_CFE model on the
full sample with the selected formula. On large samples, the GAM-CFE
branch can use mgcv::bam(), which is extremely fast for
large additive models. This combines the exploratory strength of
boosting with the stronger asymptotic status and computational stability
of the GAM-CFE branch at full scale.
Always inspect:
fit <- fit_sar_cfe
summary(fit)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim$data,
#> W = sim$W, DGP = "SAR", method = "BSPA_SAR_CFE", control = list(control_gamboost = boost_control(mstop = 150,
#> nu = 0.1)))
#> Estimator: BSPA_SAR_CFE
#> mstop: 150 (fixed)
#> Spatial parameter (rho/lambda): 0.607367
#> Spatial estimator: cfe_exact
#> Var(rho/lambda): 0.001110
#> SE(rho/lambda): 0.033318
#> Wald z-statistic: 18.229407
#> p-value (H0: rho/lambda = 0): < 2.2e-16
#> Elapsed (s): 0.240
#>
#> Full model diagnostics (original Y scale)
#> n = 1000
#> RMSE: 1.032868
#> MAE : 0.821090
#> R2 : 0.567521
#> Deviance explained (full): 0.567521
#>
#> Underlying model summary (spatially filtered equation)
#>
#> Model-based Boosting
#>
#> Call:
#> gamboost(formula = "Y ~ bbs(X1) + bbs(X2) + bbs(X3)", data = data, control = control)
#>
#>
#> Squared Error (Regression)
#>
#> Loss function: (y - f)^2
#>
#>
#> Number of boosting iterations: mstop = 150
#> Step size: 0.1
#> Offset: 0.4862967
#> Number of baselearners: 3
#>
#> Selection frequencies:
#> bbs(X2) bbs(X3) bbs(X1)
#> 0.4600000 0.3333333 0.2066667
fit$rho
#> [1] 0.6073672
fit$lambda
#> NULL
fit$rho_method
#> [1] "cfe_exact"
fit$rmse
#> [1] 1.032868
fit$spbgam_meta
#> $method
#> [1] "BSPA_SAR_CFE"
#>
#> $DGP
#> [1] "SAR"
#>
#> $mstop_criterion_requested
#> [1] "none"
#>
#> $cv_mode_requested
#> [1] "random"
#>
#> $control_mstop
#> [1] 150
#>
#> $mstop_final
#> [1] 150
#>
#> $mstop_source
#> [1] "fixed"
#>
#> $spectral_overlap_requested
#> [1] FALSE
#>
#> $spectral_k_requested
#> [1] 50
When rho_method is not an exact CFE branch, report the
fallback and compare with the likelihood counterpart. When covariates or
geographic smoothers are strongly spatially structured, keep the model
deliberately parsimonious and check sensitivity to mstop,
formula specification, and the choice of cross-validation regime.
References
Anselin, L. (1988). Spatial Econometrics: Methods and
Models. Kluwer.
Buhlmann, P. and van de Geer, S. (2011). Statistics for
High-Dimensional Data: Methods, Theory and Applications.
Springer.
Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting
system. Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, 785-794.
Debarsy, N. and Le Gallo, J. (2025). Identification of spatial
spillovers: Do’s and don’ts. Journal of Economic Surveys,
39(5), 2152-2173.
Friedman, J. H. (1991). Multivariate adaptive regression splines.
The Annals of Statistics, 19(1), 1-67.
Geniaux, G. (2026). Flexible nonlinear spatial autoregressive
models: a gradient boosting approach with closed-form estimation,
20th WORLD CONFERENCE OF THE SPATIAL ECONOMETRICS ASSOCIATION & 24th
INTERNATIONAL WORKSHOP ON SPATIAL ECONOMETRICS AND STATISTICS, Paris,
June 17-18, 2026
McMillen, D. P. (2003). Spatial autocorrelation or model
misspecification? International Regional Science Review, 26(2),
208-217.
Ord, K. (1975). Estimation methods for models of spatial interaction.
Journal of the American Statistical Association, 70(349),
120-126.
Smirnov, O. A. (2020). A closed-form consistent estimator for linear
models with spatial dependence. Geographical Analysis. doi:10.1111/gean.12268.
Wood, S. N. (2011). Fast stable restricted maximum likelihood and
marginal likelihood estimation of semiparametric generalized linear
models. Journal of the Royal Statistical Society: Series B,
73(1), 3-36.