When test_mean is an R function, it has an (optional)
argument for the covariates. This argument can be an uppercase
X or a lowercase x. Most examples in other
vignettes use the uppercase X, since this is the more
efficient choice, and suffices in most cases. The uppercase
X requires, though, that the expressions used in
test_mean are vectorized.
In the case in which some part of the function is not vectorized, a
lowercase x can be used. Suppose, for example, that
integrate() is required to define test_mean,
and that the covariate value is used as the upper limit of integration.
The example below illustrates this using the integral \[
\int_0^x2t\,dt,
\] whose value is known to be \(x^2\). We can therefore compare the results
of the lowercase example with those of the uppercase example.
set.seed(20240913)
n <- 1e2
func_upper <- function(X, theta) theta[1] + theta[2]*X[,1] + theta[3]*X[,2]^2
func_lower <- function(x, theta) theta[1] + theta[2]*x[1] +
theta[3]*integrate(function(x) 2*x, lower = 0, upper = x[2])$value
theta <- c(2,5,1)
X <- matrix(rexp(2*n, rate = 1), ncol = 2)
Y <- distfreereg:::f2ftheta(f = func_upper, X)(theta) + rnorm(n)
set.seed(20240913)
dfr_upper <- distfreereg(Y = Y, X = X, test_mean = func_upper,
covariance = list(Sigma = 1), theta_init = c(1,1,1))
set.seed(20240913)
dfr_lower <- distfreereg(Y = Y, X = X, test_mean = func_lower,
covariance = list(Sigma = 1), theta_init = c(1,1,1))
all.equal(dfr_upper$observed_stats, dfr_lower$observed_stats)## [1] "Component \"CvM\": Mean relative difference: 2.989899e-08"
## [1] TRUE
The fundamental object involved in this testing procedure is the empirical partial sum process (EPSP); that is, the scaled vector of cumulative sums of transformed residuals. When the model includes only one covariate (that is, when the matrix \(X\) of covariates has only one column), the order in which the residuals are added to form this process is determined by the natural linear order of the covariates. When more than one covariate is present, though, no single order is evidently the “correct” one.
When the null hypothesis is true, the order does not affect the distribution of the statistics of the EPSP. To illustrate this, consider the following example with only one covariate.
set.seed(20241001)
n <- 1e2
func <- function(X, theta) theta[1] + theta[2]*X[,1]
theta <- c(2,5)
X <- matrix(rexp(n))
cdfr_simplex <- compare(true_mean = func,
true_X = X,
true_covariance = list(Sigma = 1),
theta = theta,
test_mean = func,
covariance = list(Sigma = 1),
X = X,
theta_init = c(1,1),
keep = 1)
cdfr_asis <- update(cdfr_simplex, ordering = "asis")The order in which the residuals are added is determined by the
ordering argument of distfreereg(). The
default method of determining the order is the “simplex”
method. In the case of a single covariate, this method orders the
observations in increasing order of covariate values. Another built-in
option is to preserve to given order of the observations; that is, the
order in which the residuals are added is the order of the observations
specified in the input to distfreereg(). The CDF plots of
each method are shown below, and both pairs of curves are mostly
overlapping.
When the null hypothesis is false, however, the order of the
residuals can have a substantial effect on the power of the test. If the
residuals are added in a random order (which they are in the
“asis” objects, because X is randomly
generated and not ordered), the test is generally less able to detect
deviations from the expected null-hypothesis behavior of the EPSP.
Ordering the covariates increases the power of the test to detect such
deviations.
To see this in practice, suppose we now change the previous example so that the true mean function has a quadratic term.
alt_func <- function(X, theta) theta[1] + theta[2]*X[,1] + 0.5*X[,1]^2
cdfr_simplex_alt <- update(cdfr_simplex, true_mean = alt_func)
cdfr_asis_alt <- update(cdfr_asis, true_mean = alt_func)The plots below show that the CDFs for the simplex
method are more separated than those for the asis
method.
The estimated powers of these tests, as expected, highly favor the
default simplex method:
## stat alpha rate mcse
## 1 KS 0.05 1 NA
## stat alpha rate mcse
## 1 KS 0.05 0.236 0.004246222
The previous plots are instructive regarding the importance of ordering, but are not useful when we want to explore model mis-specification. To do that, examining the residuals can be helpful, as suggested by Khmaladze (2021). The plot below uses the example above in which the null hypothesis is true.
Now let us look at the plots when the alternative hypothesis is true.
First, the plot with residuals ordered in the order of
X:
This plot seems similar to the previous one. On the other hand, the plot with the residuals ordered by covariate value show a clear pattern.
The slopes of this plot indicate that the residuals are positive, then negative, and then positive again as the covariate values increase. (This is a common trend when a linear function is used to model data with a quadratic mean function.) This patterns can also be seen by looking at the following plot.
dfr_simplex <- cdfr_simplex_alt$dfrs[[1]]
X <- dfr_simplex$data$X
Y <- dfr_simplex$data$Y[order(X)]
Y_hat <- fitted(dfr_simplex)[order(X)]
X <- sort(X)
ml <- loess(Y ~ X)
plot(X, predict(ml), type = "l", col = "blue", ylab = "Y")
points(X, Y, col = rgb(0,0,1,0.4))
lines(X, Y_hat, col = "red", lty = "dashed")
legend(x = "bottomright", legend = c("smoothed", "fitted"), col = c("blue", "red"),
lty = c("solid", "dashed"))The asis option is included primarily for exploratory
purposes, and for cases in which observations are given in the desired
order. In general, the ordering should arise from a principled mapping
of the observations onto the time domain. One such option is provided by
the simplex method, which scales each covariate to the unit
interval, and then orders the observations in increasing order of the
sums of their scaled values.
Another option is optimal, which orders observations
using optimal transport. This is illustrated below. In this example, the
power is comparable to the simplex method, but the patterns
in the EPSP is not as clear.
Note: this method is computationally expensive when the sample size is large.
## stat alpha rate mcse
## 1 KS 0.05 0.996 0.0006311894
Another option available for ordering observations is the
“natural” ordering, which can also be seen as analogous to
a lexicographic ordering: the observations are ordered by covariate
values from left to right. That is, observations are ordered in
ascending order of the first covariate in the matrix or data frame
containing the covariates; ties are broken by ordering by the second
covariate; remaining ties are broken by the third; and so on.
In the following example, Y is generated using an
expression with a quadratic term, but the function being tested is
linear in its covariates. We hope that the tests reject the null.
set.seed(20241003)
n <- 1e2
theta <- c(2,5,-1)
X <- cbind(sample(1:5, size = n, replace = TRUE),
sample(1:10, size = n, replace = TRUE))
Y <- theta[1] + theta[2]*X[,1] + theta[3]*X[,2] + (2e-1)*X[,2]^2 + rnorm(nrow(X))
func <- function(X, theta) theta[1] + theta[2]*X[,1] + theta[3]*X[,2]
dfr <- distfreereg(Y = Y, X = X, test_mean = func, theta_init = c(1,1,1),
covariance = list(Sigma = 1), ordering = "natural")
dfr## Number of observations: 100
## Length of EPSP: 44
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## theta1 theta2 theta3
## -3.228e+00 5.358e+00 1.154e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 1.380e+00 1.720e-02 1.300e-03
## CvM 3.882e-01 6.030e-02 2.380e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
These results can be compared to the simplex method.
## Number of observations: 100
## Length of EPSP: 44
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## theta1 theta2 theta3
## -3.228e+00 5.358e+00 1.154e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 1.590e+00 5.200e-03 7.192e-04
## CvM 5.734e-01 2.140e-02 1.447e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
The results for these methods are comparable in this example.
More flexibility is possible by specifying a list of column names or numbers that will result in the process above being used except only with the listed columns in the listed order.
The previous example’s covariates contain many repeated value
combinations. These covariate observations are indistinguishable, and
their order among themselves is dependent on the order in which they
happen to appear in the data. To avoid this arbitrariness in the
calculation of test statistics, a grouping option is available. When the
group argument is TRUE, the transformed
residuals for repeated covariate values are combined before forming the
EPSP. This avoids dependence on the order in which the observations
enter the data.
Grouping is done by default. To prevent grouping, use
group = FALSE, as in the following example.
Note that the length of epsp in
dfr_not_grouped is the sample size. In dfr, it
is not. It is instead the number of unique combinations of values in the
two columns of X.
## [1] 100
## [1] 44
## [1] 44
The test results are similar, but not identical to, those in
dfr.
## Number of observations: 100
## Length of EPSP: 100
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## theta1 theta2 theta3
## -3.228e+00 5.358e+00 1.154e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 1.669e+00 3.600e-03 5.989e-04
## CvM 5.402e-01 2.960e-02 1.695e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
override ArgumentThe override argument of distfreereg() is
used by update.distfreereg() to avoid unnecessary
recalculation. This has particular benefits for the performance of
compare(), which uses
update.distfreereg().
This argument can be useful in other cases, too. For example, if the
observations should be ordered in a way that is not among the available
options for ordering, this order can be specified using
res_order.
The example below illustrates a more involved application using
compare() That function uses override to avoid
recalculating the r, res_order, and
mc_stats elements of the distfreereg objects
it creates. Any values specified in the override list
supplied to update() take precedence over all others, and
are therefore used in every repetition of compare(). Let us
use this below to illustrate that using the true value of \(\theta\) instead of \(\hat\theta\) does not produce good
results.
set.seed(20241003)
n <- 2e2
func <- function(X, theta) theta[1] + theta[2]*X[,1]
theta <- c(2,5)
X <- matrix(rexp(n))
cdfr <- compare(true_mean = func, test_mean = func, true_X = X, X = X,
true_covariance = list(Sigma = 1), covariance = list(Sigma = 1),
theta = theta, theta_init = c(1,1))
cdfr_theta <- update(cdfr, override = list(theta_hat = theta))The first plot is what we expect, given that the true mean and test
mean are the same in cdfr:
However, the next plot shows that using \(\theta\) does not yield good results:
An early step in the testing procedure is to sphere the residuals;
that is, to left multiply the residual vector \(\hat\epsilon\) by a matrix to remove
covariances among the residuals. Let \(\Sigma\) denote the covariance matrix of
\(\hat\epsilon\), and let \(Q\) denote any matrix satisfying \(Q^\top Q=\Sigma^{-1}\). Then \(Q\hat\epsilon\) has the identity covariance
matrix. There are infinitely many such matrices \(Q\) that satisfy \(Q^\top Q=\Sigma^{-1}\). (See Kessy et al. (2018)
for a discussion of sphering matrices.) Two considerations are
particularly important here: computational feasibility and performance
within the goodness-of-fit testing procedure. All of the discussion in
this section pertains to the case in which test_mean is a
function, and therefore the covariance structure must be specified by
the user. In particular, it pertains to the case in which the covariance
structure must be specified by a matrix rather than one of the simpler
forms (e.g., a vector).
Matrix square roots calculated by distfreereg are
calculated using eigendecompositions. These are “true” square roots in
the sense that they are symmetric, and therefore satisfy, e.g., \(QQ=\Sigma^{-1}\). Eigendecompositions are
not computationally feasible in all cases (specifically when the
dimensions are large), in which case it is necessary to specify either
SqrtSigma or Q instead of Sigma.
Such a specification must result in a matrix \(Q\) such that \(Q^\top Q=\Sigma^{-1/2}\). One such option
is to use a Cholesky factor. For example, let \(LL^\top\) be a Cholesky decomposition of
\(\Sigma\), and let \(Q=L^{-1}\). Such a value of \(Q\) is acceptable as the Q
element of covariance. Simulations indicate, however, that
this choice of square root leads to lower power than an
eigendecomposition-based square root.
As an example, the following simulation compares the powers of the two default tests (the KS- and CvM-based tests) using the eigendecomposition-based \(Q\) and the Cholesky-based \(Q\). The basics of the simulation are set up below.
theta <- c(1,-1)
true_mean <- function(X, theta) exp(theta[1]*X[,1]) + theta[2]*X[,2]
test_mean <- function(X, theta) theta[1]*X[,1] + theta[2]*X[,2]
n <- 2e2
set.seed(20250903)
Sigma <- rWishart(1, df = n, Sigma = diag(n))[,,1]
SqrtSigma <- distfreereg:::matsqrt(Sigma)
Q_eigen <- solve(SqrtSigma)
Q_chol <- t(solve(chol(Sigma)))
X <- matrix(rnorm(2*n), nrow = n)
Y <- distfreereg:::f2ftheta(true_mean, X)(theta) +
distfreereg:::rmvnorm(n = 1, reps = 1, SqrtSigma = SqrtSigma)To verify that the sample size is adequate for asymptotic results,
asymptotics() is applied to distfreereg
objects that are created using test_mean, one of which uses
Q_eigen and the other of which uses Q_chol.
The plots indicate that the sample size is adequate.
dfr_eigen <- distfreereg(test_mean = test_mean, Y = Y,
X = X, theta_init = c(1,1),
covariance = list(Q = Q_eigen))
dfr_chol <- distfreereg(test_mean = test_mean, Y = Y,
X = X, theta_init = c(1,1),
covariance = list(Q = Q_chol))
cdfr_eigen <- asymptotics(dfr_eigen)
cdfr_chol <- asymptotics(dfr_chol)
plot(cdfr_eigen)Finally, compare objects are created to estimate the
tests’ power. The rejection rates indicate that the Cholesky
decomposition leads to lower power for both tests.
cdfr_eigen_H1 <- update(cdfr_eigen, true_mean = true_mean, theta = theta)
cdfr_chol_H1 <- update(cdfr_chol, true_mean = true_mean, theta = theta)
rejection(cdfr_eigen_H1)## stat alpha rate mcse
## 1 KS 0.05 1 NA
## 2 CvM 0.05 1 NA
## stat alpha rate mcse
## 1 KS 0.05 0.844 0.003628553
## 2 CvM 0.05 0.854 0.003531062