Type: Package
Title: Knockoffs for Hidden Markov Models and Genetic Data
Version: 0.8.2
Date: 2019-04-30
Description: Generate knockoffs for genetic data and hidden Markov models. For more information, see the website below and the accompanying papers: "Gene hunting with hidden Markov model knockoffs", Sesia et al., Biometrika, 2019, (<doi:10.1093/biomet/asy033>). "Multi-resolution localization of causal variants across the genome", Sesia et al., bioRxiv, 2019, (<doi:10.1101/631390>).
License: GPL-3
Language: en-US
Depends: R (≥ 3.3.0)
Imports: Rcpp (≥ 0.12.13), Rdpack
RdMacros: Rdpack
Suggests: knitr, testthat, parallel, doParallel
LinkingTo: Rcpp, RcppArmadillo, RcppProgress
RoxygenNote: 6.1.1
URL: https://msesia.github.io/snpknock
BugReports: https://github.com/msesia/snpknock/issues
VignetteBuilder: knitr
Encoding: UTF-8
NeedsCompilation: yes
Packaged: 2019-05-16 23:35:28 UTC; msesia
Author: Matteo Sesia [aut, cre]
Maintainer: Matteo Sesia <msesia@stanford.edu>
Repository: CRAN
Date/Publication: 2019-05-17 20:50:14 UTC

Wrapper for GroupGenotypes

Description

Wrapper for GroupGenotypes

Usage

GroupGenotypes_wrapper(X_, r_, alpha_, theta_, groups_, n_, p_, seed_,
  display_progress_)

Wrapper for GroupHaplotypes

Description

Wrapper for GroupHaplotypes

Usage

GroupHaplotypes_wrapper(X_, r_, alpha_, theta_, groups_, n_, p_, seed_,
  display_progress_)

Compute the genotype transition matrices based on the fastPHASE HMM

Description

Compute the genotype transition matrices based on the fastPHASE HMM

Usage

assemble_Q(Q1)

Compute the genotype emission distributions based on the fastPHASE HMM

Description

Compute the genotype emission distributions based on the fastPHASE HMM

Usage

assemble_pEmit(theta)

Compute the haplotype emission distributions based on the fastPHASE HMM

Description

Compute the haplotype emission distributions based on the fastPHASE HMM

Usage

assemble_pEmit_phased(theta)

Compute the genotype initial distributions based on the fastPHASE HMM

Description

Compute the genotype initial distributions based on the fastPHASE HMM

Usage

assemble_pInit(alpha)

Compute the haplotype initial distributions based on the fastPHASE HMM

Description

Compute the haplotype initial distributions based on the fastPHASE HMM

Usage

assemble_pInit_phased(alpha)

Compute the haplotype transition matrices based on the fastPHASE HMM

Description

Compute the haplotype transition matrices based on the fastPHASE HMM

Usage

compute_Q1(r, alpha)

Group knockoffs of discrete Markov chains

Description

This function constructs knockoffs of variables distributed as a discrete Markov chain.

Usage

knockoffDMC(X, pInit, Q, groups = NULL, seed = 123, cluster = NULL,
  display_progress = FALSE)

Arguments

X

an integer matrix of size n-by-p containing the original variables.

pInit

an array of length K, containing the marginal distribution of the states for the first variable.

Q

an array of size (p-1,K,K), containing a list of p-1 transition matrices between the K states of the Markov chain.

groups

an array of length p, describing the group membership of each variable (default: NULL).

seed

an integer random seed (default: 123).

cluster

a computing cluster object created by makeCluster (default: NULL).

display_progress

whether to show progress bar (default: FALSE).

Details

Each element of the matrix X should be an integer value between 0 and K-1. The transition matrices contained in Q are defined such that P[X_{j+1}=k|X_{j}=l]=Q[j,l,k].

Value

An integer matrix of size n-by-p containing the knockoff variables.

References

Sesia M, Sabatti C, Candès EJ (2019). “Gene hunting with hidden Markov model knockoffs.” Biometrika, 106, 1–18. doi: 10.1093/biomet/asy033. Sesia M, Katsevich E, Bates S, Candès E, Sabatti C (2019). “Multi-resolution localization of causal variants across the genome.” bioRxiv. doi: 10.1101/631390.

See Also

Other knockoffs: knockoffGenotypes, knockoffHMM, knockoffHaplotypes

Examples

# Generate data
p = 10; K = 5;
pInit = rep(1/K,K)
Q = array(stats::runif((p-1)*K*K),c(p-1,K,K))
for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) }
X = sampleDMC(pInit, Q, n=20)
# Generate knockoffs
Xk = knockoffDMC(X, pInit, Q)
# Generate group-knockoffs for groups of size 3
groups = rep(seq(p), each=3, length.out=p)
Xk = knockoffDMC(X, pInit, Q, groups=groups)


Wrapper for DMC group knockoffs

Description

Wrapper for DMC group knockoffs

Usage

knockoffDMC_wrapper(X_, pInit_, Q_, n_, p_, K_, seed_, G_,
  display_progress_)

Group-knockoffs of unphased genotypes

Description

This function efficiently constructs group-knockoffs of 0,1,2 variables distributed according to the Li and Stephens model for unphased genotypes.

Usage

knockoffGenotypes(X, r, alpha, theta, groups = NULL, seed = 123,
  cluster = NULL, display_progress = FALSE)

Arguments

X

a 0,1,2 matrix of size n-by-p containing the original variables.

r

a vector of length p containing the "r" parameters estimated by fastPHASE.

alpha

a matrix of size p-by-K containing the "alpha" parameters estimated by fastPHASE.

theta

a matrix of size p-by-K containing the "theta" parameters estimated by fastPHASE.

groups

a vector of length p containing group memberships for each variable. Indices are assumed to be monotone increasing, starting from 1 (default: NULL).

seed

an integer random seed (default: 123).

cluster

a computing cluster object created by makeCluster (default: NULL).

display_progress

whether to show progress bar (default: FALSE).

Details

Generate group-knockoffs of unphased genotypes according to the Li and Stephens HMM. The required model parameters can be obtained through fastPHASE and loaded with loadHMM. This function is more efficient than knockoffHMM for haplotype data.

Value

A 0,1,2 matrix of size n-by-p containing the knockoff variables.

References

Sesia M, Katsevich E, Bates S, Candès E, Sabatti C (2019). “Multi-resolution localization of causal variants across the genome.” bioRxiv. doi: 10.1101/631390.

See Also

Other knockoffs: knockoffDMC, knockoffHMM, knockoffHaplotypes

Examples

# Problem size
p = 10
n = 100
# Load HMM to generate data
r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock")
alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock")
theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock")
char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock")
hmm.data = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE, phased=FALSE)
hmm.data$Q = hmm.data$Q[1:(p-1),,]
hmm.data$pEmit = hmm.data$pEmit[1:p,,]
# Sample X from this HMM
X = sampleHMM(hmm.data$pInit, hmm.data$Q, hmm.data$pEmit, n=n)
# Load HMM to generate knockoffs
hmm = loadHMM(r_file, alpha_file, theta_file, char_file)
hmm$r = hmm$r[1:p]
hmm$alpha = hmm$alpha[1:p,]
hmm$theta = hmm$theta[1:p,]
# Generate knockoffs
Xk = knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta)
# Generate group-knockoffs for groups of size 3
groups = rep(seq(p), each=3, length.out=p)
Xk = knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta, groups=groups)

Group knockoffs of hidden Markov models

Description

This function constructs knockoffs of variables distributed as a hidden Markov model.

Usage

knockoffHMM(X, pInit, Q, pEmit, groups = NULL, seed = 123,
  cluster = NULL, display_progress = FALSE)

Arguments

X

an integer matrix of size n-by-p containing the original variables.

pInit

an array of length K, containing the marginal distribution of the states for the first variable.

Q

an array of size (p-1,K,K), containing a list of p-1 transition matrices between the K states of the Markov chain.

pEmit

an array of size (p,M,K), containing the emission probabilities for each of the M possible emission states, from each of the K hidden states and the p variables.

groups

an array of length p, describing the group membership of each variable (default: NULL).

seed

an integer random seed (default: 123).

cluster

a computing cluster object created by makeCluster (default: NULL).

display_progress

whether to show progress bar (default: FALSE).

Details

Each element of the matrix X should be an integer value between 0 and M-1. The transition matrices contained in Q are defined with the same convention as in knockoffDMC. The emission propability matrices contained in pEmit are defined such that P[X_{j}=k|H_{j}=l]=\mathrm{pEmit}[j,k,l], where H_j is the latent variable associated to X_j.

Value

An integer matrix of size n-by-p containing the knockoff variables.

References

Sesia M, Sabatti C, Candès EJ (2019). “Gene hunting with hidden Markov model knockoffs.” Biometrika, 106, 1–18. doi: 10.1093/biomet/asy033. Sesia M, Katsevich E, Bates S, Candès E, Sabatti C (2019). “Multi-resolution localization of causal variants across the genome.” bioRxiv. doi: 10.1101/631390.

See Also

Other knockoffs: knockoffDMC, knockoffGenotypes, knockoffHaplotypes

Examples

# Generate data
p=10; K=5; M=3;
pInit = rep(1/K,K)
Q = array(stats::runif((p-1)*K*K),c(p-1,K,K))
for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) }
pEmit = array(stats::runif(p*M*K),c(p,M,K))
for(j in 1:p) { pEmit[j,,] = pEmit[j,,] / rowSums(pEmit[j,,]) }
X = sampleHMM(pInit, Q, pEmit, n=20)
# Generate knockoffs
Xk = knockoffHMM(X, pInit, Q, pEmit)
# Generate group-knockoffs for groups of size 3
groups = rep(seq(p), each=3, length.out=p)
Xk = knockoffHMM(X, pInit, Q, pEmit, groups=groups)


Wrapper for HMM group knockoffs

Description

Wrapper for HMM group knockoffs

Usage

knockoffHMM_wrapper(X_, pInit_, Q_, pEmit_, n_, p_, K_, M_, seed_, G_,
  display_progress_)

Group-knockoffs of phased haplotypes

Description

This function efficiently constructs group-knockoffs of binary variables distributed according to the Li and Stephens model for phased haplotypes.

Usage

knockoffHaplotypes(X, r, alpha, theta, groups = NULL, seed = 123,
  cluster = NULL, display_progress = FALSE)

Arguments

X

a binary matrix of size n-by-p containing the original variables.

r

a vector of length p containing the "r" parameters estimated by fastPHASE.

alpha

a matrix of size p-by-K containing the "alpha" parameters estimated by fastPHASE.

theta

a matrix of size p-by-K containing the "theta" parameters estimated by fastPHASE.

groups

a vector of length p containing group memberships for each variable. Indices are assumed to be monotone increasing, starting from 1 (default: NULL).

seed

an integer random seed (default: 123).

cluster

a computing cluster object created by makeCluster (default: NULL).

display_progress

whether to show progress bar (default: FALSE).

Details

Generate group-knockoffs of phased haplotypes according to the Li and Stephens HMM. The required model parameters can be obtained through fastPHASE and loaded with loadHMM. This function is more efficient than knockoffHMM for haplotype data.

Value

A binary matrix of size n-by-p containing the knockoff variables.

References

Sesia M, Katsevich E, Bates S, Candès E, Sabatti C (2019). “Multi-resolution localization of causal variants across the genome.” bioRxiv. doi: 10.1101/631390.

See Also

Other knockoffs: knockoffDMC, knockoffGenotypes, knockoffHMM

Examples

# Problem size
p = 10
n = 100
# Load HMM to generate data
r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock")
alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock")
theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock")
char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock")
hmm.data = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE, phased=TRUE)
hmm.data$Q = hmm.data$Q[1:(p-1),,]
hmm.data$pEmit = hmm.data$pEmit[1:p,,]
# Sample X from this HMM
X = sampleHMM(hmm.data$pInit, hmm.data$Q, hmm.data$pEmit, n=n)
# Load HMM to generate knockoffs
hmm = loadHMM(r_file, alpha_file, theta_file, char_file)
hmm$r = hmm$r[1:p]
hmm$alpha = hmm$alpha[1:p,]
hmm$theta = hmm$theta[1:p,]
# Generate knockoffs
Xk = knockoffHaplotypes(X, hmm$r, hmm$alpha, hmm$theta)
# Generate group-knockoffs for groups of size 3
groups = rep(seq(p), each=3, length.out=p)
Xk = knockoffHaplotypes(X, hmm$r, hmm$alpha, hmm$theta, groups=groups)


Read the files produced by fastPHASE

Description

Read the files produced by fastPHASE

Usage

loadEMParameters(data_path)

Load HMM parameters fitted by fastPHASE

Description

This function loads the parameter estimates obtained by fastPHASE (see runFastPhase) and assembles the Li and Stephens HMM, in the format required by the knockoff generation functions knockoffHaplotypes and knockoffGenotypes.

Usage

loadHMM(r_file, alpha_file, theta_file, char_file, compact = TRUE,
  phased = FALSE)

Arguments

r_file

a string with the path of the "_rhat.txt" file produced by fastPHASE.

alpha_file

a string with the path of the "_alphahat.txt" file produced by fastPHASE.

theta_file

a string with the path of the "_thetahat.txt" file produced by fastPHASE.

char_file

a string with the path of the "_origchars" file produced by fastPHASE.

compact

whether to assemble the explicit transition and emission matrices for the HMM (default: FALSE).

phased

whether to assemble a model for phased haplotypes, if compact==FALSE (default: FALSE).

Details

This function by default returns a structure with three fields:

If the parameter compact is FALSE, this function assembles the HMM model for the genotype data (either unphased or phased), in the format required by the knockoff generation function knockoffHMM.

Value

A structure containing the parameters from the Li and Stephens HMM for phased haplotypes.

References

Scheet P, Stephens M (2006). “A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.” Am. J. Hum. Genet., 78, 629–644. doi: 10.1086/502802.

See Also

Other fastPHASE: runFastPhase, writeXtoInp

Examples

# Specify the location of the fastPHASE output files containing the parameter estimates.
# Example files can be found in the package installation directory.
r_file = system.file("extdata", "genotypes_rhat.txt", package = "SNPknock")
alpha_file = system.file("extdata", "genotypes_alphahat.txt", package = "SNPknock")
theta_file = system.file("extdata", "genotypes_thetahat.txt", package = "SNPknock")
char_file = system.file("extdata", "genotypes_origchars", package = "SNPknock")

# Read the parameter files and load the HMM
hmm = loadHMM(r_file, alpha_file, theta_file, char_file)

# Read the parameter files and load the HMM
hmm.large = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE)


Random sampling from discrete distribution

Description

Random sampling from discrete distribution

Usage

rand_weighted(W)

Fit an HMM to genetic data using fastPHASE

Description

This function provides a wrapper for the fastPHASE executable in order to fit an HMM to either unphased genotype data or phased haplotype data. The software fastPHASE will fit the HMM to the genotype data and write the corresponding parameter estimates in four separate files. Since fastPHASE is not an R package, this executable must be downloaded separately by the user. Visit http://scheet.org/software.html for more information on how to obtain fastPHASE.

Usage

runFastPhase(fp_path, X_file, out_path = NULL, K = 12, numit = 25,
  phased = FALSE, seed = 1)

Arguments

fp_path

a string with the path to the directory with the fastPHASE executable.

X_file

a string with the path of the genotype input file containing X in fastPHASE format (as created by writeXtoInp).

out_path

a string with the path of the directory in which the parameter estimates will be saved (default: NULL). If this is equal to NULL, a temporary file in the R temporary directory will be used.

K

the number of hidden states for each haplotype sequence (default: 12).

numit

the number of EM iterations (default: 25).

phased

whether the data are already phased (default: FALSE).

seed

the random seed for the EM algorithm (default: 1).

Details

The software fastPHASE saves the parameter estimates in four separate files whose names begin with the string contained in 'out_path' and end with:

The HMM for the genotype data can then be loaded from these files by calling loadHMM.

Value

A string containing the path of the directory in which the parameter estimates were saved. This is useful to find the data when the default option for 'out_path' is used and the output is written in an R temporary directory.

References

Scheet P, Stephens M (2006). “A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.” Am. J. Hum. Genet., 78, 629–644. doi: 10.1086/502802.

See Also

Other fastPHASE: loadHMM, writeXtoInp

Examples

fp_path  = "~/bin/fastPHASE" # Path to the fastPHASE executable

# Run fastPHASE on unphased genotypes
# Specify the path to the genotype input file in ".inp" format.
# An example file containing unphased genotypes can be found in the package installation folder.
X_file = system.file("extdata", "genotypes.inp", package = "SNPknock")
fp_outPath = runFastPhase(fp_path, X_file)

# Run fastPHASE on phased haplotypes
# An example file containing phased haplotypes can be found in the package installation folder.
H_file = system.file("extdata", "haplotypes.inp", package = "SNPknock")
fp_outPath = runFastPhase(fp_path, H_file, phased=TRUE)


Sample discrete Markov chains

Description

This function draws independent random samples of a discrete Markov chain.

Usage

sampleDMC(pInit, Q, n = 1)

Arguments

pInit

an array of length K, containing the marginal distribution of the states for the first variable.

Q

an array of size (p-1,K,K), containing a list of p-1 transition matrices between the K states of the Markov chain.

n

the number of independent samples to be drawn (default: 1).

Details

Each element of the output matrix is an integer value between 0 and K-1. The transition matrices contained in Q are defined such that P[X_{j+1}=k|X_{j}=l]=Q[j,l,k].

Value

A matrix of size n-by-p containing the n observed Markov chains of length p.

References

Sesia M, Sabatti C, Candès EJ (2019). “Gene hunting with hidden Markov model knockoffs.” Biometrika, 106, 1–18. doi: 10.1093/biomet/asy033.

See Also

Other models: sampleHMM

Examples

p=10; K=5;
pInit = rep(1/K,K)
Q = array(stats::runif((p-1)*K*K),c(p-1,K,K))
for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) }
X = sampleDMC(pInit, Q, n=20)


Sample hidden Markov models

Description

This function draws independent random samples of an hidden Markov model.

Usage

sampleHMM(pInit, Q, pEmit, n = 1)

Arguments

pInit

an array of length K, containing the marginal distribution of the states for the first variable.

Q

an array of size (p-1,K,K), containing a list of p-1 transition matrices between the K states of the Markov chain.

pEmit

an array of size (p,M,K), containing the emission probabilities for each of the M possible emission states, from each of the K hidden states and the p variables.

n

the number of independent samples to be drawn (default: 1).

Details

Each element of the output matrix is an integer value between 0 and K-1. The transition matrices contained in Q are defined with the same convention as in sampleDMC. The emission propability matrices contained in pEmit are defined such that P[X_{j}=k|H_{j}=l]=\mathrm{pEmit}[j,k,l], where H_j is the latent variable associated to X_j.

Value

A matrix of size n-by-p containing the n observed Markov chains of length p.

References

Sesia M, Sabatti C, Candès EJ (2019). “Gene hunting with hidden Markov model knockoffs.” Biometrika, 106, 1–18. doi: 10.1093/biomet/asy033.

See Also

Other models: sampleDMC

Examples

p=10; K=5; M=3;
pInit = rep(1/K,K)
Q = array(stats::runif((p-1)*K*K),c(p-1,K,K))
for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) }
pEmit = array(stats::runif(p*M*K),c(p,M,K))
for(j in 1:p) { pEmit[j,,] = pEmit[j,,] / rowSums(pEmit[j,,]) }
X = sampleHMM(pInit, Q, pEmit, n=20)


Convert genotypes X into the fastPHASE input format

Description

This function converts a genetic matrix X into the fastPHASE input format and saves it to a user-specified file. Then, an HMM can be fitted by calling fastPHASE with runFastPhase.

Usage

writeXtoInp(X, phased = FALSE, out_file = NULL)

Arguments

X

either a matrix of size n-by-p containing unphased genotypes for n individuals, or a matrix of size 2n-by-p containing phased haplotypes for n individuals.

phased

whether the data are phased (default: FALSE). If this is equal to TRUE, each pair of consecutive rows will be assumed to correspond to phased haplotypes from the same individual.

out_file

a string containing the path of the output file onto which X will be written (default: NULL). If this is equal to NULL, a temporary file in the R temporary directory will be used.

Value

A string containing the path of the output file onto which X was written. This is useful to find the data when the default option for 'out_file' is used and X is written onto a temporary file in the R temporary directory.

References

Scheet P, Stephens M (2006). “A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.” Am. J. Hum. Genet., 78, 629–644. doi: 10.1086/502802.

See Also

Other fastPHASE: loadHMM, runFastPhase

Examples

# Convert unphased genotypes
# Load an example data matrix X from the package installation directory.
X_file = system.file("extdata", "genotypes.RData", package = "SNPknock")
load(X_file)
# Write X in a temporary file
Xinp_file = writeXtoInp(X)

# Convert phased haplotypes
# Load an example data matrix H from the package installation directory.
H_file = system.file("extdata", "haplotypes.RData", package = "SNPknock")
load(H_file)
# Write H in a temporary file
Hinp_file = writeXtoInp(H, phased=TRUE)