twoPhaseGAS provides tools for the design and analysis of two-phase genetic association studies (GAS) in which a subset of the full cohort is selected for targeted sequencing (phase 2), while the remaining individuals contribute only GWAS-level data (phase 1).
The package offers:
DataGeneration_TPD().twoPhaseDesign() / twoPhase().twoPhaseSPML(), which implements a semiparametric maximum
likelihood (SPML) estimator using the EM algorithm.library(twoPhaseGAS)
data.table::setDTthreads(1)
set.seed(42)
data <- DataGeneration_TPD(
Beta0 = 2, # intercept
Beta1 = 0.5, # genetic effect (G on Y)
Disp = 1, # error variance for default gaussian family
N = 3000, # phase 1 cohort size
LD.r = 0.75, # LD (r) between causal SNP G and GWAS SNP Z
P_g = 0.20, # MAF of G
P_z = 0.30 # MAF of Z
)
#> Beta0=2 Beta1=0.5
head(data)
#> wait_it Y G1 Z G0 S
#> 1 1 1.314338 0 1 0 1
#> 2 1 1.207286 0 2 0 1
#> 3 1 1.592996 0 0 0 1
#> 4 1 1.351329 1 1 1 1
#> 5 1 3.115760 0 0 0 3
#> 6 1 1.120543 0 0 1 1The simulated data frame contains:
| Column | Description |
|---|---|
Y |
Continuous outcome |
G1 |
Causal sequence variant (absent from phase 1 GWAS) |
Z |
GWAS SNP (available for everyone) |
S |
Stratum variable derived from residuals |
Here we use simple random sampling; twoPhaseDesign() can
be used for optimised designs.
set.seed(1)
n2 <- 500 # phase 2 size
R <- rep(0L, nrow(data))
R[sample(nrow(data), n2)] <- 1L # 1 = selected for phase 2
data[R == 0, c("G1")] <- NA # Make all phase 1 data complement G1 values missing
cat("Phase 1 complement:", sum(R == 0), "individuals\n")
#> Phase 1 complement: 2500 individuals
cat("Phase 2: ", sum(R == 1), "individuals\n")
#> Phase 2: 500 individualsWhen the missing-by-design covariate (miscov) is
absent from formula,
twoPhaseSPML() fits the null model and returns score
statistics.
res_Ho <- twoPhaseSPML(
formula = Y ~ Z,
miscov = ~ G1,
auxvar = ~ Z,
data = data
)
print(res_Ho)
#>
#> Call:
#> twoPhaseSPML(formula = Y ~ Z, miscov = ~G1, auxvar = ~Z, family = gaussian)
#>
#> Coefficients:
#> Estimate
#> (Intercept) 2.016
#> Z 0.290
#>
#> Family: gaussian Link: identity
#> Dispersion: 1.074
#> Log-likelihood: -7382 (EM iterations: 45 )
#>
#> Hypothesis test for missing-by-design covariate(s): G1
#> [Null model - Score statistics]
#> Observed score statistic (Sobs): 26.06
#> Expected score statistic (Sexp): 25.51
#> Degrees of freedom: 1
#> p-value (chi-squared): 3.313e-07The print method (modelled on glm)
displays:
Including miscov in formula switches to the
alternative model. The EM algorithm now estimates the effect of
G1 jointly with the regression parameters, and Wald
statistics are reported.
res_Ha <- twoPhaseSPML(
formula = Y ~ Z + G1,
miscov = ~ G1,
auxvar = ~ Z,
data = data
)
print(res_Ha)
#>
#> Call:
#> twoPhaseSPML(formula = Y ~ Z + G1, miscov = ~G1, auxvar = ~Z, family = gaussian)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.00924 0.02548 78.846 < 2e-16 ***
#> Z -0.09125 0.06188 -1.474 0.14
#> G1 0.61235 0.08688 7.048 1.81e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Family: gaussian Link: identity
#> Dispersion: 1.021
#> Log-likelihood: -7367 (EM iterations: 53 )
#>
#> Hypothesis test for missing-by-design covariate(s): G1
#> [Alternative model - Wald statistics]
#> Observed Wald statistic (Wobs): 49.68
#> Expected Wald statistic (Wexp): 38.87
#> Degrees of freedom: 1
#> p-value (chi-squared): 1.811e-12The returned object is a list of class
"twoPhaseSPML".
# Regression coefficients
res_Ha$theta
#> (Intercept) Z G1
#> 2.00923599 -0.09124575 0.61234907
# Log-likelihood
res_Ha$ll
#> [1] -7367.262
# Estimated joint distribution of G1 and Z
head(res_Ha$qGZ)
#> G1 Z q
#> 1 0 0 0.477127116
#> 2 1 0 0.004206217
#> 3 0 1 0.151060698
#> 4 1 1 0.269605968
#> 5 0 2 0.012126314
#> 6 1 2 0.049821574
# Wald statistic and degrees of freedom
cat("Wobs =", res_Ha$Wobs, " df =", res_Ha$df, "\n")
#> Wobs = 49.67874 df = 1
cat("p-value =", pchisq(res_Ha$Wobs, df = res_Ha$df, lower.tail = FALSE), "\n")
#> p-value = 1.810981e-12start.values for warm startsWhen analysing many variants, passing converged estimates from a
nearby model as start.values can reduce computation
time.
# Use null-model estimates as starting point for the alternative model
res_warm <- twoPhaseSPML(
formula = Y ~ Z + G1,
miscov = ~ G1,
auxvar = ~ Z,
data = data,
start.values = list(betas = res_Ho$theta,
q = res_Ho$qGZ)
)sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS 15.7.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] twoPhaseGAS_1.2.5
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.12 enrichwith_0.4.0 bigmemory.sri_0.1.3
#> [4] kofnGA_1.3 rstudioapi_0.13 knitr_1.37
#> [7] magrittr_2.0.3 MASS_7.3-54 lattice_0.20-45
#> [10] R6_2.6.1 rlang_1.1.7 fastmap_1.1.0
#> [13] bigmemory_4.5.36 stringr_1.4.0 tools_4.1.2
#> [16] parallel_4.1.2 grid_4.1.2 dfoptim_2023.1.0
#> [19] data.table_1.18.0 xfun_0.34 cli_3.6.5
#> [22] jquerylib_0.1.4 htmltools_0.5.3 yaml_2.2.1
#> [25] digest_0.6.37 Matrix_1.6-5 nloptr_1.2.2.3
#> [28] sass_0.4.2 evaluate_1.0.5 rmarkdown_2.17
#> [31] stringi_1.7.5 compiler_4.1.2 bslib_0.3.1
#> [34] jsonlite_2.0.0