Encoding: | UTF-8 |
Type: | Package |
Title: | Analysis of Massive SNP Arrays |
Version: | 1.12.18 |
Date: | 2024-11-26 |
Description: | Easy-to-use, efficient, flexible and scalable tools for analyzing massive SNP arrays. Privé et al. (2018) <doi:10.1093/bioinformatics/bty185>. |
License: | GPL-3 |
LazyData: | TRUE |
Language: | en-US |
ByteCompile: | TRUE |
SystemRequirements: | A few functions from package 'bigsnpr' wrap existing software such as 'PLINK' <www.cog-genomics.org/plink2>. Functions are provided to download these software. Note that these external software might not work for some operating systems (e.g. 'PLINK' might not work on Solaris). |
Depends: | R (≥ 3.4), bigstatsr (≥ 1.5.11) |
Imports: | bigassertr (≥ 0.1.6), bigparallelr, bigsparser (≥ 0.6), bigreadr, bigutilsr (≥ 0.3.3), data.table (≥ 1.12.4), doRNG, foreach, ggplot2, magrittr, Matrix (≥ 1.3.0), methods, Rcpp, runonce (≥ 0.2.3), stats, vctrs |
LinkingTo: | bigsparser, bigstatsr, Rcpp, RcppArmadillo (≥ 0.9.600), rmio, roptim (≥ 0.1.6) |
Suggests: | bindata, covr, dbplyr (≥ 1.4), dplyr, gaston, glue, Hmisc, microbenchmark, pcadapt (≥ 4.1), quadprog, RhpcBLASctl, rmutil, RSpectra, RSQLite, R.utils, spelling, testthat, tibble, xgboost |
RoxygenNote: | 7.3.2 |
URL: | https://privefl.github.io/bigsnpr/ |
BugReports: | https://github.com/privefl/bigsnpr/issues |
NeedsCompilation: | yes |
Packaged: | 2024-11-26 08:46:20 UTC; au639593 |
Author: | Florian Privé [aut, cre], Michael Blum [ths], Hugues Aschard [ths], Bjarni Jóhann Vilhjálmsson [ths] |
Maintainer: | Florian Privé <florian.prive.21@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-11-26 10:30:02 UTC |
bigsnpr: Analysis of Massive SNP Arrays
Description
Easy-to-use, efficient, flexible and scalable tools for analyzing massive SNP arrays. Privé et al. (2018) doi:10.1093/bioinformatics/bty185.
Arguments
G |
A FBM.code256
(typically |
Gna |
A FBM.code256
(typically |
x |
A bigSNP. |
infos.chr |
Vector of integers specifying each SNP's chromosome. |
infos.pos |
Vector of integers specifying the physical position
on a chromosome (in base pairs) of each SNP. |
nploidy |
Number of trials, parameter of the binomial distribution.
Default is |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
is.size.in.bp |
Deprecated. |
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
Author(s)
Maintainer: Florian Privé florian.prive.21@gmail.com
Other contributors:
Michael Blum [thesis advisor]
Hugues Aschard hugues.aschard@pasteur.fr [thesis advisor]
Bjarni Jóhann Vilhjálmsson bjv@econ.au.dk [thesis advisor]
See Also
Useful links:
CODE_012: code genotype calls (3) and missing values.
Description
CODE_012: code genotype calls (3) and missing values.
CODE_DOSAGE: code genotype calls and missing values (4), and imputed calls (3) and imputed allele dosages rounded to two decimal places (201).
CODE_IMPUTE_PRED: code genotype calls and missing values (4), and imputed calls (3).
Usage
CODE_012
CODE_DOSAGE
CODE_IMPUTE_PRED
Format
An object of class numeric
of length 256.
An object of class numeric
of length 256.
An object of class numeric
of length 256.
Long-range LD regions
Description
34 long-range Linkage Disequilibrium (LD) regions for the human genome based on this wiki table.
Usage
LD.wiki34
Format
A data frame with 34 rows (regions) and 4 variables:
-
Chr
: region's chromosome -
Start
: starting position of the region (in bp) -
Stop
: stopping position of the region (in bp) -
ID
: some ID of the region.
Stacked C+T (SCT)
Description
Polygenic Risk Scores for a grid of clumping and thresholding parameters.
Stacking over many Polygenic Risk Scores, corresponding to a grid of many different parameters for clumping and thresholding.
Usage
snp_grid_clumping(
G,
infos.chr,
infos.pos,
lpS,
ind.row = rows_along(G),
grid.thr.r2 = c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95),
grid.base.size = c(50, 100, 200, 500),
infos.imp = rep(1, ncol(G)),
grid.thr.imp = 1,
groups = list(cols_along(G)),
exclude = NULL,
ncores = 1
)
snp_grid_PRS(
G,
all_keep,
betas,
lpS,
n_thr_lpS = 50,
grid.lpS.thr = 0.9999 * seq_log(max(0.1, min(lpS, na.rm = TRUE)), max(lpS, na.rm =
TRUE), n_thr_lpS),
ind.row = rows_along(G),
backingfile = tempfile(),
type = c("float", "double"),
ncores = 1
)
snp_grid_stacking(
multi_PRS,
y.train,
alphas = c(1, 0.01, 1e-04),
ncores = 1,
...
)
Arguments
G |
A FBM.code256
(typically |
infos.chr |
Vector of integers specifying each SNP's chromosome. |
infos.pos |
Vector of integers specifying the physical position
on a chromosome (in base pairs) of each SNP. |
lpS |
Numeric vector of |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
grid.thr.r2 |
Grid of thresholds over the squared correlation between
two SNPs for clumping. Default is |
grid.base.size |
Grid for base window sizes. Sizes are then computed as
|
infos.imp |
Vector of imputation scores. Default is all |
grid.thr.imp |
Grid of thresholds over |
groups |
List of vectors of indices to define your own categories. This could be used e.g. to derive C+T scores using two different GWAS summary statistics, or to include other information such as functional annotations. Default just makes one group with all variants. |
exclude |
Vector of SNP indices to exclude anyway. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
all_keep |
Output of |
betas |
Numeric vector of weights (effect sizes from GWAS) associated
with each variant (column of |
n_thr_lpS |
Length for default |
grid.lpS.thr |
Sequence of thresholds to apply on |
backingfile |
Prefix for backingfiles where to store scores of C+T. As we typically use a large grid, this can result in a large matrix so that we store it on disk. Default uses a temporary file. |
type |
Type of backingfile values. Either |
multi_PRS |
Output of |
y.train |
Vector of phenotypes. If there are two levels (binary 0/1),
it uses |
alphas |
Vector of values for grid-search. See |
... |
Other parameters to be passed to |
Value
snp_grid_PRS()
: An FBM
(matrix on disk) that stores the C+T scores
for all parameters of the grid (and for each chromosome separately).
It also stores as attributes the input parameters all_keep
, betas
,
lpS
and grid.lpS.thr
that are also needed in snp_grid_stacking()
.
Accessor methods for class bed
.
Description
Accessor methods for class bed
.
Usage
## S4 method for signature 'bed,ANY,ANY,ANY'
x[i, j, ..., drop = TRUE]
## S4 method for signature 'bed_light,ANY,ANY,ANY'
x[i, j, ..., drop = TRUE]
Arguments
x |
A bed object. |
i |
A vector of indices (or nothing). You can use positive and negative indices, and also logical indices (that are recycled). |
j |
A vector of indices (or nothing). You can use positive and negative indices, and also logical indices (that are recycled). |
... |
Not used. Just to make nargs work. |
drop |
Whether to drop dimensions (to a vector) when a dimension is 1.
Default is |
Examples
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")
(obj.bed <- bed(bedfile))
obj.bed[1:5, 1]
obj.bed[1:5, 1:2]
typeof(obj.bed[1, 1])
Class bed
Description
A reference class for storing a pointer to a mapped version of a bed file.
Usage
bed(bedfile)
Arguments
bedfile |
Path to file with extension ".bed" to read. You need the corresponding ".bim" and ".fam" in the same directory. |
Details
A bed
object has many field:
-
$address
: address of the external pointer containing the underlying C++ object, to be used internally as aXPtr<bed>
in C++ code -
$extptr
: use$address
instead -
$bedfile
: path to the bed file -
$bimfile
: path to the corresponding bim file -
$famfile
: path to the corresponding fam file -
$prefix
: path without extension -
$nrow
: number of samples in the bed file -
$ncol
: number of variants in the bed file -
$map
: data frame read from$bimfile
-
$fam
: data frame read from$famfile
-
$.map
: use$map
instead -
$.fam
: use$fam
instead -
$light
: get a lighter version of this object for parallel algorithms to not have to transfer e.g.$.map
.
Examples
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")
(obj.bed <- bed(bedfile))
Methods for the bed class
Description
Methods for the bed class
Dimension methods for class bed
.
Methods nrow()
and ncol()
are automatically defined with dim()
.
Usage
## S4 method for signature 'bed'
dim(x)
## S4 method for signature 'bed'
length(x)
## S4 method for signature 'bed_light'
dim(x)
## S4 method for signature 'bed_light'
length(x)
Arguments
x |
Object of type |
Value
Dimensions of x
.
Allele frequencies
Description
Allele frequencies of a bed object.
Usage
bed_MAF(
obj.bed,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
ncores = 1
)
Arguments
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
A data.frame with
-
$ac
: allele counts, -
$mac
: minor allele counts, -
$af
: allele frequencies, -
$maf
: minor allele frequencies, -
$N
: numbers of non-missing values.
Examples
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)
bed_MAF(obj.bed, ind.col = 1:5)
LD clumping
Description
For a bigSNP
:
-
snp_pruning()
: LD pruning. Similar to "--indep-pairwise (size+1) 1 thr.r2
" in PLINK. This function is deprecated (see this article). -
snp_clumping()
(andbed_clumping()
): LD clumping. If you do not provide any statistic to rank SNPs, it would use minor allele frequencies (MAFs), making clumping similar to pruning. -
snp_indLRLDR()
: Get SNP indices of long-range LD regions for the human genome.
Usage
bed_clumping(
obj.bed,
ind.row = rows_along(obj.bed),
S = NULL,
thr.r2 = 0.2,
size = 100/thr.r2,
exclude = NULL,
ncores = 1
)
snp_clumping(
G,
infos.chr,
ind.row = rows_along(G),
S = NULL,
thr.r2 = 0.2,
size = 100/thr.r2,
infos.pos = NULL,
is.size.in.bp = NULL,
exclude = NULL,
ncores = 1
)
snp_pruning(
G,
infos.chr,
ind.row = rows_along(G),
size = 49,
is.size.in.bp = FALSE,
infos.pos = NULL,
thr.r2 = 0.2,
exclude = NULL,
nploidy = 2,
ncores = 1
)
snp_indLRLDR(infos.chr, infos.pos, LD.regions = LD.wiki34)
Arguments
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
S |
A vector of column statistics which express the importance
of each SNP (the more important is the SNP, the greater should be
the corresponding statistic). |
thr.r2 |
Threshold over the squared correlation between two SNPs.
Default is |
size |
For one SNP, window size around this SNP to compute correlations.
Default is |
exclude |
Vector of SNP indices to exclude anyway. For example,
can be used to exclude long-range LD regions (see Price2008). Another use
can be for thresholding with respect to p-values associated with |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
G |
A FBM.code256
(typically |
infos.chr |
Vector of integers specifying each SNP's chromosome. |
infos.pos |
Vector of integers specifying the physical position
on a chromosome (in base pairs) of each SNP. |
is.size.in.bp |
Deprecated. |
nploidy |
Number of trials, parameter of the binomial distribution.
Default is |
LD.regions |
A |
Value
-
snp_clumping()
(andbed_clumping()
): SNP indices that are kept. -
snp_indLRLDR()
: SNP indices to be used as (part of) the 'exclude
' parameter ofsnp_clumping()
.
References
Price AL, Weale ME, Patterson N, et al. Long-Range LD Can Confound Genome Scans in Admixed Populations. Am J Hum Genet. 2008;83(1):132-135. doi:10.1016/j.ajhg.2008.06.005
Examples
test <- snp_attachExtdata()
G <- test$genotypes
# clumping (prioritizing higher MAF)
ind.keep <- snp_clumping(G, infos.chr = test$map$chromosome,
infos.pos = test$map$physical.pos,
thr.r2 = 0.1)
# keep most of them -> not much LD in this simulated dataset
length(ind.keep) / ncol(G)
Counts
Description
Counts the number of 0s, 1s, 2s and NAs by variants in the bed file.
Usage
bed_counts(
obj.bed,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
byrow = FALSE,
ncores = 1
)
Arguments
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
byrow |
Whether to count by individual rather than by variant?
Default is |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
A matrix of with 4 rows and length(ind.col)
columns.
Examples
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)
bed_counts(obj.bed, ind.col = 1:5)
bed_counts(obj.bed, ind.row = 1:5, byrow = TRUE)
Cross-product with a vector
Description
Cross-product between a "bed" object and a vector.
Missing values are replaced by 0 (after centering), as if they
had been imputed using parameter center
.
Usage
bed_cprodVec(
obj.bed,
y.row,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
center = rep(0, length(ind.col)),
scale = rep(1, length(ind.col)),
ncores = 1
)
Arguments
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
y.row |
A vector of same size as |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
center |
Vector of same length of |
scale |
Vector of same length of |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
X^T \cdot y
.
Examples
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)
y.row <- rep(1, nrow(obj.bed))
str(bed_cprodVec(obj.bed, y.row))
Product with a vector
Description
Product between a "bed" object and a vector.
Missing values are replaced by 0 (after centering), as if they
had been imputed using parameter center
.
Usage
bed_prodVec(
obj.bed,
y.col,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
center = rep(0, length(ind.col)),
scale = rep(1, length(ind.col)),
ncores = 1
)
Arguments
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
y.col |
A vector of same size as |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
center |
Vector of same length of |
scale |
Vector of same length of |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
X \cdot y
.
Examples
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)
y.col <- rep(1, ncol(obj.bed))
str(bed_prodVec(obj.bed, y.col))
Projecting PCA
Description
Computing and projecting PCA of reference dataset to a target dataset.
Usage
bed_projectPCA(
obj.bed.ref,
obj.bed.new,
k = 10,
ind.row.new = rows_along(obj.bed.new),
ind.row.ref = rows_along(obj.bed.ref),
ind.col.ref = cols_along(obj.bed.ref),
strand_flip = TRUE,
join_by_pos = TRUE,
match.min.prop = 0.5,
build.new = "hg19",
build.ref = "hg19",
liftOver = NULL,
...,
verbose = TRUE,
ncores = 1
)
Arguments
obj.bed.ref |
Object of type bed, which is the mapping of the bed file of
the reference data. Use |
obj.bed.new |
Object of type bed, which is the mapping of the bed file of
the target data. Use |
k |
Number of principal components to compute and project. |
ind.row.new |
Rows to be used in the target data. Default uses them all. |
ind.row.ref |
Rows to be used in the reference data. Default uses them all. |
ind.col.ref |
Columns to be potentially used in the reference data. Default uses all the ones in common with target data. |
strand_flip |
Whether to try to flip strand? (default is |
join_by_pos |
Whether to join by chromosome and position (default), or instead by rsid. |
match.min.prop |
Minimum proportion of variants in the smallest data
to be matched, otherwise stops with an error. Default is |
build.new |
Genome build of the target data. Default is |
build.ref |
Genome build of the reference data. Default is |
liftOver |
Path to liftOver executable. Binaries can be downloaded at https://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/liftOver for Mac and at https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver for Linux. |
... |
Arguments passed on to
|
verbose |
Output some information on the iterations? Default is |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
A list of 3 elements:
-
$obj.svd.ref
: big_SVD object computed from reference data. -
$simple_proj
: simple projection of new data into space of reference PCA. -
$OADP_proj
: Online Augmentation, Decomposition, and Procrustes (OADP) projection of new data into space of reference PCA.
Projecting PCA
Description
Projecting PCA using individuals from one dataset to other individuals from the same dataset.
Usage
bed_projectSelfPCA(
obj.svd,
obj.bed,
ind.row,
ind.col = attr(obj.svd, "subset"),
ncores = 1
)
snp_projectSelfPCA(
obj.svd,
G,
ind.row,
ind.col = attr(obj.svd, "subset"),
ncores = 1
)
Arguments
obj.svd |
List with |
obj.bed |
Object of type bed, which is the mapping of the bed file of the data containing both the individuals that were used to compute the PCA and the other individuals to be projected. |
ind.row |
Rows (individuals) to be projected. |
ind.col |
Columns that were used for computing PCA. If bed_autoSVD was
used, then |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
G |
The FBM.code256 that was used to
compute |
Value
A list of 3 elements:
-
$obj.svd.ref
: big_SVD object computed from reference data. -
$simple_proj
: simple projection of new data into space of reference PCA. -
$OADP_proj
: Online Augmentation, Decomposition, and Procrustes (OADP) projection of new data into space of reference PCA.
Randomized partial SVD
Description
Partial SVD (or PCA) of a genotype matrix stored as a PLINK (.bed) file.#'
Usage
bed_randomSVD(
obj.bed,
fun.scaling = bed_scaleBinom,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
k = 10,
tol = 1e-04,
verbose = FALSE,
ncores = 1
)
Arguments
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
fun.scaling |
A function with parameters
Default doesn't use any scaling.
You can also provide your own |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
k |
Number of singular vectors/values to compute. Default is |
tol |
Precision parameter of svds. Default is |
verbose |
Should some progress be printed? Default is |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
A named list (an S3 class "big_SVD") of
-
d
, the singular values, -
u
, the left singular vectors, -
v
, the right singular vectors, -
niter
, the number of the iteration of the algorithm, -
nops
, number of Matrix-Vector multiplications used, -
center
, the centering vector, -
scale
, the scaling vector.
Note that to obtain the Principal Components, you must use predict on the result. See examples.
Examples
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)
str(bed_randomSVD(obj.bed))
Binomial(2, p) scaling
Description
Binomial(2, p) scaling where p
is estimated.
Usage
bed_scaleBinom(
obj.bed,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
ncores = 1
)
Arguments
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Details
You will probably not use this function as is but as parameter
fun.scaling
of other functions (e.g. bed_autoSVD
and bed_randomSVD
).
Value
A data frame with $center
and $scale
.
References
This scaling is widely used for SNP arrays. Patterson N, Price AL, Reich D (2006). Population Structure and Eigenanalysis. PLoS Genet 2(12): e190. doi:10.1371/journal.pgen.0020190.
Examples
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)
str(bed_scaleBinom(obj.bed))
str(bed_randomSVD(obj.bed, bed_scaleBinom))
tcrossprod / GRM
Description
Compute G G^T
from a bed object, with possible filtering and scaling
of G
. For example, this can be used to compute GRMs.
Usage
bed_tcrossprodSelf(
obj.bed,
fun.scaling = bed_scaleBinom,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
block.size = block_size(length(ind.row))
)
Arguments
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
fun.scaling |
A function with parameters
Default uses binomial scaling.
You can also provide your own |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
block.size |
Maximum number of columns read at once. Default uses block_size. |
Value
A temporary FBM, with the following two attributes:
a numeric vector
center
of column scaling,a numeric vector
scale
of column scaling.
Matrix parallelization
Large matrix computations are made block-wise and won't be parallelized
in order to not have to reduce the size of these blocks. Instead, you can use
the MKL
or OpenBLAS in order to accelerate these block matrix computations.
You can control the number of cores used by these optimized matrix libraries
with bigparallelr::set_blas_ncores()
.
Examples
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)
K <- bed_tcrossprodSelf(obj.bed)
K[1:4, 1:6] / ncol(obj.bed)
Class bigSNP
Description
An S3 class for representing information on massive SNP arrays.
Value
A named list with at least 3 slots:
- genotypes
A FBM.code256 which is a special Filebacked Big Matrix encoded with type
raw
(one byte unsigned integer), representing genotype calls and possibly imputed allele dosages. Rows are individuals and columns are SNPs.- fam
A
data.frame
containing some information on the individuals (read from a ".fam" file).- map
A
data.frame
giving some information on the variants (read from a ".bim" file).
See Also
Liability scale
Description
Coefficient to convert to the liability scale. E.g. h2_liab = coef * h2_obs.
Usage
coef_to_liab(K_pop, K_gwas = 0.5)
Arguments
K_pop |
Prevalence in the population. |
K_gwas |
Prevalence in the GWAS. You should provide this if you used
( |
Value
Scaling coefficient to convert e.g. heritability to the liability scale.
Examples
h2 <- 0.2
h2 * coef_to_liab(0.02)
Download 1000G
Description
Download 1000 genomes project (phase 3) data in PLINK bed/bim/fam format, including 2490 (mostly unrelated) individuals and ~1.7M SNPs in common with either HapMap3 or the UK Biobank.
Usage
download_1000G(dir, overwrite = FALSE, delete_zip = TRUE)
Arguments
dir |
The directory where to put the downloaded files. |
overwrite |
Whether to overwrite files when downloading and unzipping?
Default is |
delete_zip |
Whether to delete zip after decompressing the file in it?
Default is |
Value
The path of the downloaded bed file.
Download Beagle 4.1
Description
Download Beagle 4.1 from https://faculty.washington.edu/browning/beagle/beagle.html
Usage
download_beagle(dir = tempdir())
Arguments
dir |
The directory where to put the Beagle Java Archive. Default is a temporary directory. |
Value
The path of the downloaded Beagle Java Archive.
Download a genetic map
Description
This function uses linear interpolation, whereas snp_asGeneticPos()
uses
nearest neighbors.
Usage
download_genetic_map(
type = c("hg19_OMNI", "hg19_hapmap", "hg38_price"),
dir,
ncores = 1
)
snp_asGeneticPos2(infos.chr, infos.pos, genetic_map)
Arguments
type |
Which genetic map to download. |
dir |
Directory where to download and decompress files. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
infos.chr |
Vector of integers specifying each SNP's chromosome. |
infos.pos |
Vector of integers specifying the physical position
on a chromosome (in base pairs) of each SNP. |
genetic_map |
A data frame with 3 columns: |
Details
The hg19 genetic maps are downloaded from
https://github.com/joepickrell/1000-genomes-genetic-maps/
while the hg38 one is downloaded from
https://alkesgroup.broadinstitute.org/Eagle/downloads/tables/
.
Value
A data frame with 3 columns: chr
, pos
, and pos_cM
.
The new vector of genetic positions.
Download PLINK
Description
Download PLINK 1.9 from https://www.cog-genomics.org/plink2.
Download PLINK 2.0 from https://www.cog-genomics.org/plink/2.0/.
Usage
download_plink(dir = tempdir(), overwrite = FALSE, verbose = TRUE)
download_plink2(
dir = tempdir(),
AVX2 = TRUE,
ARM = FALSE,
AMD = FALSE,
overwrite = FALSE,
verbose = TRUE
)
Arguments
dir |
The directory where to put the PLINK executable. Default is a temporary directory. |
overwrite |
Whether to overwrite file? Default is |
verbose |
Whether to output details of downloading. Default is |
AVX2 |
Whether to download the AVX2 version? This is only available for
64 bits architectures. Default is |
ARM |
Whether to download an ARM version. Default is |
AMD |
Whether to download an AMD version. Default is |
Value
The path of the downloaded PLINK executable.
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- bigsparser
Determine reference divergence
Description
Determine reference divergence while accounting for strand flips. This does not remove ambiguous alleles.
Usage
same_ref(ref1, alt1, ref2, alt2)
Arguments
ref1 |
The reference alleles of the first dataset. |
alt1 |
The alternative alleles of the first dataset. |
ref2 |
The reference alleles of the second dataset. |
alt2 |
The alternative alleles of the second dataset. |
Value
A logical vector whether the references alleles are the same. Missing values can result from missing values in the inputs or from ambiguous matching (e.g. matching A/C and A/G).
See Also
Examples
same_ref(ref1 = c("A", "C", "T", "G", NA),
alt1 = c("C", "T", "C", "A", "A"),
ref2 = c("A", "C", "A", "A", "C"),
alt2 = c("C", "G", "G", "G", "A"))
Sequence, evenly spaced on a logarithmic scale
Description
Sequence, evenly spaced on a logarithmic scale
Usage
seq_log(from, to, length.out)
Arguments
from , to |
the starting and (maximal) end values of the
sequence. Of length |
length.out |
desired length of the sequence. A
non-negative number, which for |
Value
A sequence of length length.out
, evenly spaced on a logarithmic scale
between from
and to
.
Examples
seq_log(1, 1000, 4)
seq_log(1, 100, 5)
MAF
Description
Minor Allele Frequency.
Usage
snp_MAF(
G,
ind.row = rows_along(G),
ind.col = cols_along(G),
nploidy = 2,
ncores = 1
)
Arguments
G |
A FBM.code256
(typically |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
nploidy |
Number of trials, parameter of the binomial distribution.
Default is |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
A vector of MAFs, corresponding to ind.col
.
Examples
obj.bigsnp <- snp_attachExtdata()
str(maf <- snp_MAF(obj.bigsnp$genotypes))
MAX3 statistic
Description
Compute the MAX3 statistic, which tests for three genetic models (additive, recessive and dominant).
Usage
snp_MAX3(Gna, y01.train, ind.train = rows_along(Gna), val = c(0, 0.5, 1))
Arguments
Gna |
A FBM.code256
(typically |
y01.train |
Vector of responses, corresponding to |
ind.train |
An optional vector of the row indices that are used, for the training part. If not specified, all rows are used. Don't use negative indices. |
val |
Computing
|
Details
P-values associated with returned scores are in fact the minimum of the p-values of each test separately. Thus, they are biased downward.
Value
An object of classes mhtest
and data.frame
returning one
score by SNP. See methods(class = "mhtest")
.
References
Zheng, G., Yang, Y., Zhu, X., & Elston, R. (2012). Robust Procedures. Analysis Of Genetic Association Studies, 151-206. doi:10.1007/978-1-4614-2245-7_6.
Examples
set.seed(1)
# constructing a fake genotype big.matrix
N <- 50; M <- 1200
fake <- snp_fake(N, M)
G <- fake$genotypes
G[] <- sample(as.raw(0:3), size = length(G), replace = TRUE)
G[1:8, 1:10]
# Specify case/control phenotypes
fake$fam$affection <- rep(1:2, each = N / 2)
# Get MAX3 statistics
y01 <- fake$fam$affection - 1
str(test <- snp_MAX3(fake$genotypes, y01.train = y01))
# p-values are not well calibrated
snp_qq(test)
# genomic control is not of much help
snp_qq(snp_gc(test))
# Armitage trend test (well calibrated because only one test)
test2 <- snp_MAX3(fake$genotypes, y01.train = y01, val = 0.5)
snp_qq(test2)
PRS
Description
Polygenic Risk Scores with possible clumping and thresholding.
Usage
snp_PRS(
G,
betas.keep,
ind.test = rows_along(G),
ind.keep = cols_along(G),
same.keep = rep(TRUE, length(ind.keep)),
lpS.keep = NULL,
thr.list = 0
)
Arguments
G |
A FBM.code256
(typically |
betas.keep |
Numeric vector of weights associated with each SNP
corresponding to |
ind.test |
The individuals on whom to project the scores. Default uses all. |
ind.keep |
Column (SNP) indices to use (if using clumping, the output of snp_clumping). Default doesn't clump. |
same.keep |
A logical vector associated with |
lpS.keep |
Numeric vector of |
thr.list |
Threshold vector on |
Value
A matrix of scores, where rows correspond to ind.test
and
columns correspond to thr.list
.
Examples
test <- snp_attachExtdata()
G <- big_copy(test$genotypes, ind.col = 1:1000)
CHR <- test$map$chromosome[1:1000]
POS <- test$map$physical.position[1:1000]
y01 <- test$fam$affection - 1
# PCA -> covariables
obj.svd <- snp_autoSVD(G, infos.chr = CHR, infos.pos = POS)
# train and test set
ind.train <- sort(sample(nrow(G), 400))
ind.test <- setdiff(rows_along(G), ind.train) # 117
# GWAS
gwas.train <- big_univLogReg(G, y01.train = y01[ind.train],
ind.train = ind.train,
covar.train = obj.svd$u[ind.train, ])
# clumping
ind.keep <- snp_clumping(G, infos.chr = CHR,
ind.row = ind.train,
S = abs(gwas.train$score))
# -log10(p-values) and thresolding
summary(lpS.keep <- -predict(gwas.train)[ind.keep])
thrs <- seq(0, 4, by = 0.5)
nb.pred <- sapply(thrs, function(thr) sum(lpS.keep > thr))
# PRS
prs <- snp_PRS(G, betas.keep = gwas.train$estim[ind.keep],
ind.test = ind.test,
ind.keep = ind.keep,
lpS.keep = lpS.keep,
thr.list = thrs)
# AUC as a function of the number of predictors
aucs <- apply(prs, 2, AUC, target = y01[ind.test])
library(ggplot2)
qplot(nb.pred, aucs) +
geom_line() +
scale_x_log10(breaks = nb.pred) +
labs(x = "Number of predictors", y = "AUC") +
theme_bigstatsr()
Estimation of ancestry proportions
Description
Estimation of ancestry proportions. Make sure to match summary statistics
using snp_match()
(and to reverse frequencies correspondingly).
Usage
snp_ancestry_summary(
freq,
info_freq_ref,
projection,
correction,
min_cor = 0.4,
sum_to_one = TRUE
)
Arguments
freq |
Vector of frequencies from which to estimate ancestry proportions. |
info_freq_ref |
A data frame (or matrix) with the set of frequencies to be used as reference (one population per column). |
projection |
Matrix of "loadings" for each variant/PC to be used to project allele frequencies. |
correction |
Coefficients to correct for shrinkage when projecting. |
min_cor |
Minimum correlation between observed and predicted frequencies. Default is 0.4. When correlation is lower, an error is returned. For individual genotypes, this should be larger than 0.6. For allele frequencies, this should be larger than 0.9. |
sum_to_one |
Whether to force ancestry coefficients to sum to 1?
Default is |
Value
Vector of coefficients representing the ancestry proportions.
Also (as attributes) cor_each
, the correlation between input
frequencies and each reference frequencies, and cor_pred
, the correlation
between input and predicted frequencies.
Examples
## Not run:
# GWAS summary statistics for Epilepsy (supposedly in EUR+EAS+AFR)
gz <- runonce::download_file(
"http://www.epigad.org/gwas_ilae2018_16loci/all_epilepsy_METAL.gz",
dir = "tmp-data")
readLines(gz, n = 3)
library(dplyr)
sumstats <- bigreadr::fread2(
gz, select = c("CHR", "BP", "Allele2", "Allele1", "Freq1"),
col.names = c("chr", "pos", "a0", "a1", "freq")
) %>%
mutate_at(3:4, toupper)
# It is a good idea to filter for similar per-variant N (when available..)
all_freq <- bigreadr::fread2(
runonce::download_file("https://figshare.com/ndownloader/files/31620968",
dir = "tmp-data", fname = "ref_freqs.csv.gz"))
projection <- bigreadr::fread2(
runonce::download_file("https://figshare.com/ndownloader/files/31620953",
dir = "tmp-data", fname = "projection.csv.gz"))
matched <- snp_match(
mutate(sumstats, chr = as.integer(chr), beta = 1),
all_freq[1:5],
return_flip_and_rev = TRUE
) %>%
mutate(freq = ifelse(`_REV_`, 1 - freq, freq))
res <- snp_ancestry_summary(
freq = matched$freq,
info_freq_ref = all_freq[matched$`_NUM_ID_`, -(1:5)],
projection = projection[matched$`_NUM_ID_`, -(1:5)],
correction = c(1, 1, 1, 1.008, 1.021, 1.034, 1.052, 1.074, 1.099,
1.123, 1.15, 1.195, 1.256, 1.321, 1.382, 1.443)
)
# Some ancestry groups are very close to each other, and should be merged
group <- colnames(all_freq)[-(1:5)]
group[group %in% c("Scandinavia", "United Kingdom", "Ireland")] <- "Europe (North West)"
group[group %in% c("Europe (South East)", "Europe (North East)")] <- "Europe (East)"
tapply(res, factor(group, unique(group)), sum)
## End(Not run)
Interpolate to genetic positions
Description
Use genetic maps available at https://github.com/joepickrell/1000-genomes-genetic-maps/ to interpolate physical positions (in bp) to genetic positions (in cM).
Usage
snp_asGeneticPos(
infos.chr,
infos.pos,
dir = tempdir(),
ncores = 1,
rsid = NULL,
type = c("OMNI", "hapmap")
)
Arguments
infos.chr |
Vector of integers specifying each SNP's chromosome. |
infos.pos |
Vector of integers specifying the physical position
on a chromosome (in base pairs) of each SNP. |
dir |
Directory where to download and decompress files.
Default is |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
rsid |
If providing rsIDs, the matching is performed using those (instead of positions) and variants not matched are interpolated using spline interpolation of variants that have been matched. |
type |
Whether to use the genetic maps interpolated from "OMNI" (the default), or from "hapmap". |
Value
The new vector of genetic positions.
Attach a "bigSNP" from backing files
Description
Load a bigSNP from backing files into R.
Usage
snp_attach(rdsfile)
Arguments
rdsfile |
The path of the ".rds" which stores the |
Details
This is often just a call to readRDS. But it also checks if you have moved the two (".bk" and ".rds") backing files to another directory.
Value
The bigSNP
object.
Examples
(bedfile <- system.file("extdata", "example.bed", package = "bigsnpr"))
# Reading the bedfile and storing the data in temporary directory
rds <- snp_readBed(bedfile, backingfile = tempfile())
# Loading the data from backing files
test <- snp_attach(rds)
str(test)
dim(G <- test$genotypes)
G[1:8, 1:8]
Attach a "bigSNP" for examples and tests
Description
Attach a "bigSNP" for examples and tests
Usage
snp_attachExtdata(bedfile = c("example.bed", "example-missing.bed"))
Arguments
bedfile |
Name of one example bed file. Either
|
Value
The example "bigSNP", filebacked in the "/tmp/" directory.
Truncated SVD while limiting LD
Description
Fast truncated SVD with initial pruning and that iteratively removes
long-range LD regions. Some variants are removing due to the initial clumping,
then more and more variants are removed at each iteration. You can access the
indices of the remaining variants with attr(*, "subset")
. If some of the
variants removed are contiguous, the regions are reported in attr(*, "lrldr")
.
Usage
snp_autoSVD(
G,
infos.chr,
infos.pos = NULL,
ind.row = rows_along(G),
ind.col = cols_along(G),
fun.scaling = snp_scaleBinom(),
thr.r2 = 0.2,
size = 100/thr.r2,
k = 10,
roll.size = 50,
int.min.size = 20,
alpha.tukey = 0.05,
min.mac = 10,
min.maf = 0.02,
max.iter = 5,
is.size.in.bp = NULL,
ncores = 1,
verbose = TRUE
)
bed_autoSVD(
obj.bed,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
fun.scaling = bed_scaleBinom,
thr.r2 = 0.2,
size = 100/thr.r2,
k = 10,
roll.size = 50,
int.min.size = 20,
alpha.tukey = 0.05,
min.mac = 10,
min.maf = 0.02,
max.iter = 5,
ncores = 1,
verbose = TRUE
)
Arguments
G |
A FBM.code256
(typically |
infos.chr |
Vector of integers specifying each SNP's chromosome. |
infos.pos |
Vector of integers specifying the physical position
on a chromosome (in base pairs) of each SNP. |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
fun.scaling |
A function with parameters
Default uses binomial scaling.
You can also provide your own |
thr.r2 |
Threshold over the squared correlation between two variants.
Default is |
size |
For one SNP, window size around this SNP to compute correlations.
Default is |
k |
Number of singular vectors/values to compute. Default is |
roll.size |
Radius of rolling windows to smooth log-p-values.
Default is |
int.min.size |
Minimum number of consecutive outlier variants
in order to be reported as long-range LD region. Default is |
alpha.tukey |
Default is |
min.mac |
Minimum minor allele count (MAC) for variants to be included.
Default is |
min.maf |
Minimum minor allele frequency (MAF) for variants to be included.
Default is |
max.iter |
Maximum number of iterations of outlier detection.
Default is |
is.size.in.bp |
Deprecated. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
verbose |
Output some information on the iterations? Default is |
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
Details
If you don't have any information about variants, you can try using
-
infos.chr = rep(1, ncol(G))
, -
size = ncol(G)
(if variants are not sorted), -
roll.size = 0
(if variants are not sorted).
Value
A named list (an S3 class "big_SVD") of
-
d
, the singular values, -
u
, the left singular vectors, -
v
, the right singular vectors, -
niter
, the number of the iteration of the algorithm, -
nops
, number of Matrix-Vector multiplications used, -
center
, the centering vector, -
scale
, the scaling vector.
Note that to obtain the Principal Components, you must use predict on the result. See examples.
Examples
ex <- snp_attachExtdata()
G <- ex$genotypes
obj.svd <- snp_autoSVD(G,
infos.chr = ex$map$chromosome,
infos.pos = ex$map$physical.position)
str(obj.svd)
Imputation
Description
Imputation using Beagle version 4.
Usage
snp_beagleImpute(
beagle.path,
plink.path,
bedfile.in,
bedfile.out = NULL,
memory.max = 3,
ncores = 1,
extra.options = "",
plink.options = "",
verbose = TRUE
)
Arguments
beagle.path |
Path to the executable of Beagle v4+. |
plink.path |
Path to the executable of PLINK 1.9. |
bedfile.in |
Path to the input bedfile. |
bedfile.out |
Path to the output bedfile. Default is created by
appending |
memory.max |
Max memory (in GB) to be used. It is internally rounded
to be an integer. Default is |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
extra.options |
Other options to be passed to Beagle as a string. More options can be found at Beagle's website. |
plink.options |
Other options to be passed to PLINK as a string. More options can be found at https://www.cog-genomics.org/plink2/filter. |
verbose |
Whether to show PLINK log? Default is |
Details
Downloads and more information can be found at the following websites
Value
The path of the new bedfile.
References
Browning, Brian L., and Sharon R. Browning. "Genotype imputation with millions of reference samples." The American Journal of Human Genetics 98.1 (2016): 116-126.
See Also
download_plink download_beagle
Correlation matrix
Description
Get significant (Pearson) correlations between nearby SNPs of the same chromosome (p-values are computed using a two-sided t-test).
Usage
snp_cor(
Gna,
ind.row = rows_along(Gna),
ind.col = cols_along(Gna),
size = 500,
alpha = 1,
thr_r2 = 0,
fill.diag = TRUE,
infos.pos = NULL,
ncores = 1
)
bed_cor(
obj.bed,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
size = 500,
alpha = 1,
thr_r2 = 0,
fill.diag = TRUE,
infos.pos = NULL,
ncores = 1
)
Arguments
Gna |
A FBM.code256
(typically |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
size |
For one SNP, window size around this SNP to compute correlations.
Default is |
alpha |
Type-I error for testing correlations.
Default is |
thr_r2 |
Threshold to apply on squared correlations. Default is |
fill.diag |
Whether to fill the diagonal with 1s (the default) or to keep it as 0s. |
infos.pos |
Vector of integers specifying the physical position
on a chromosome (in base pairs) of each SNP. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
Value
The (Pearson) correlation matrix. This is a sparse symmetric matrix.
Examples
test <- snp_attachExtdata()
G <- test$genotypes
corr <- snp_cor(G, ind.col = 1:1000)
corr[1:10, 1:10]
# Sparsity
length(corr@x) / length(corr)
Fake a "bigSNP"
Description
Fake a "bigSNP"
Usage
snp_fake(n, m)
Arguments
n |
Number of individuals. |
m |
Number of SNPs. |
Value
A new temporary bigSNP
object representing n
individuals
and m
SNPs. The genotype Filebacked Big Matrix is initialized
with missing values.
Examples
(test <- snp_fake(5, 12))
# The genotype Filebackeg Big Matrix is initialized with missing values
G <- test$genotypes
G[]
# Modify the genotype `big.matrix`
G[] <- sample(as.raw(0:3), size = length(G), replace = TRUE)
G[]
Fast imputation
Description
Fast imputation algorithm based on local XGBoost models.
Usage
snp_fastImpute(
Gna,
infos.chr,
alpha = 1e-04,
size = 200,
p.train = 0.8,
n.cor = nrow(Gna),
seed = NA,
ncores = 1
)
Arguments
Gna |
A FBM.code256
(typically |
infos.chr |
Vector of integers specifying each SNP's chromosome. |
alpha |
Type-I error for testing correlations. Default is |
size |
Number of neighbor SNPs to be possibly included in the model
imputing this particular SNP. Default is |
p.train |
Proportion of non missing genotypes that are used for training
the imputation model while the rest is used to assess the accuracy of
this imputation model. Default is |
n.cor |
Number of rows that are used to estimate correlations. Default uses them all. |
seed |
An integer, for reproducibility. Default doesn't use seeds. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
An FBM with
the proportion of missing values by SNP (first row),
the estimated proportion of imputation errors by SNP (second row).
See Also
Examples
## Not run:
fake <- snp_attachExtdata("example-missing.bed")
G <- fake$genotypes
CHR <- fake$map$chromosome
infos <- snp_fastImpute(G, CHR)
infos[, 1:5]
# Still missing values
big_counts(G, ind.col = 1:10)
# You need to change the code of G
# To make this permanent, you need to save (modify) the file on disk
fake$genotypes$code256 <- CODE_IMPUTE_PRED
fake <- snp_save(fake)
big_counts(fake$genotypes, ind.col = 1:10)
# Plot for post-checking
## Here there is no SNP with more than 1% error (estimated)
pvals <- c(0.01, 0.005, 0.002, 0.001); colvals <- 2:5
df <- data.frame(pNA = infos[1, ], pError = infos[2, ])
# base R
plot(subset(df, pNA > 0.001), pch = 20)
idc <- lapply(seq_along(pvals), function(i) {
curve(pvals[i] / x, from = 0, lwd = 2,
col = colvals[i], add = TRUE)
})
legend("topright", legend = pvals, title = "p(NA & Error)",
col = colvals, lty = 1, lwd = 2)
# ggplot2
library(ggplot2)
Reduce(function(p, i) {
p + stat_function(fun = function(x) pvals[i] / x, color = colvals[i])
}, x = seq_along(pvals), init = ggplot(df, aes(pNA, pError))) +
geom_point() +
coord_cartesian(ylim = range(df$pError, na.rm = TRUE)) +
theme_bigstatsr()
## End(Not run)
Fast imputation
Description
Fast imputation via mode, mean, sampling according to allele frequencies, or 0.
Usage
snp_fastImputeSimple(
Gna,
method = c("mode", "mean0", "mean2", "random"),
ncores = 1
)
Arguments
Gna |
A FBM.code256
(typically |
method |
Either |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
A new FBM.code256
object (same file, but different code).
See Also
Examples
bigsnp <- snp_attachExtdata("example-missing.bed")
G <- bigsnp$genotypes
G[, 2] # some missing values
G2 <- snp_fastImputeSimple(G)
G2[, 2] # no missing values anymore
G[, 2] # imputed, but still returning missing values
G$copy(code = CODE_IMPUTE_PRED)[, 2] # need to decode imputed values
G$copy(code = c(0, 1, 2, rep(0, 253)))[, 2] # "imputation" by 0
Fixation index (Fst)
Description
Fixation index (Fst), either per variant, or genome-wide
Usage
snp_fst(list_df_af, min_maf = 0, overall = FALSE)
Arguments
list_df_af |
List of data frames with |
min_maf |
Minimum MAF threshold (for the average of populations) to be
included in the final results. Default is |
overall |
Whether to compute Fst genome-wide ( |
Value
If overall
, then one value, otherwise a value for each variant
with missing values for the variants not passing min_maf
.
This should be equivalent to using '--fst --within
' in PLINK.
References
Weir, B. S., & Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. Evolution, 1358-1370.
Examples
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)
pop <- rep(1:3, c(143, 167, 207))
ind_pop <- split(seq_along(pop), pop)
list_df_af <- lapply(ind_pop, function(ind) bed_MAF(obj.bed, ind.row = ind))
snp_fst(list_df_af)
snp_fst(list_df_af[c(1, 2)], overall = TRUE)
snp_fst(list_df_af[c(1, 3)], overall = TRUE)
snp_fst(list_df_af[c(3, 2)], overall = TRUE)
Genomic Control
Description
Genomic Control
Usage
snp_gc(gwas)
Arguments
gwas |
A |
Value
A ggplot2
object. You can plot it using the print
method.
You can modify it as you wish by adding layers. You might want to read
this chapter
to get more familiar with the package ggplot2.
References
Devlin, B., & Roeder, K. (1999). Genomic control for association studies. Biometrics, 55(4), 997-1004.
Examples
set.seed(9)
test <- snp_attachExtdata()
G <- test$genotypes
y <- rnorm(nrow(G))
gwas <- big_univLinReg(G, y)
snp_qq(gwas)
gwas_gc <- snp_gc(gwas) # this modifies `attr(gwas_gc, "transfo")`
snp_qq(gwas_gc)
# The next plot should be prettier with a real dataset
snp_manhattan(gwas_gc,
infos.chr = test$map$chromosome,
infos.pos = test$map$physical.pos) +
ggplot2::geom_hline(yintercept = -log10(5e-8), linetype = 2, color = "red")
p <- snp_qq(gwas_gc) +
ggplot2::aes(text = asPlotlyText(test$map)) +
ggplot2::labs(subtitle = NULL, x = "Expected -log10(p)", y = "Observed -log10(p)")
## Not run: plotly::ggplotly(p, tooltip = "text")
Get sample information
Description
Get information of individuals by matching from an external file.
Usage
snp_getSampleInfos(
x,
df.or.files,
col.family.ID = 1,
col.sample.ID = 2,
col.infos = -c(1, 2),
pair.sep = "-_-",
...
)
Arguments
x |
A bigSNP. |
df.or.files |
Either
|
col.family.ID |
Index of the column containing the family IDs to match with those of the study. Default is the first one. |
col.sample.ID |
Index of the column containing the sample IDs to match with those of the study. Default is the second one. |
col.infos |
Indices of the column containing the information you want. Default is all but the first and the second columns. |
pair.sep |
Separator used for concatenation of family and sample IDs to
make unique IDs for matching between the two datasets. Default is |
... |
Any additional parameter to pass to |
Value
The requested information as a data.frame
.
See Also
Examples
test <- snp_attachExtdata()
table(test$fam$family.ID)
# Get populations clusters from external files
files <- system.file("extdata", paste0("cluster", 1:3), package = "bigsnpr")
bigreadr::fread2(files[1])
bigreadr::fread2(files[1], header = FALSE) # need header option here
infos <- snp_getSampleInfos(test, files, header = FALSE)
table(infos[[1]])
lassosum2
Description
lassosum2
Usage
snp_lassosum2(
corr,
df_beta,
delta = c(0.001, 0.01, 0.1, 1),
nlambda = 30,
lambda.min.ratio = 0.01,
dfmax = 2e+05,
maxiter = 1000,
tol = 1e-05,
ind.corr = cols_along(corr),
ncores = 1
)
Arguments
corr |
Sparse correlation matrix as an SFBM.
If |
df_beta |
A data frame with 3 columns:
|
delta |
Vector of shrinkage parameters to try (L2-regularization).
Default is |
nlambda |
Number of different lambdas to try (L1-regularization).
Default is |
lambda.min.ratio |
Ratio between last and first lambdas to try.
Default is |
dfmax |
Maximum number of non-zero effects in the model.
Default is |
maxiter |
Maximum number of iterations before convergence.
Default is |
tol |
Tolerance parameter for assessing convergence.
Default is |
ind.corr |
Indices to "subset" |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
A matrix of effect sizes, one vector (column) for each row in
attr(<res>, "grid_param")
. Missing values are returned when strong
divergence is detected.
LD scores
Description
LD scores
Usage
snp_ld_scores(
Gna,
ind.row = rows_along(Gna),
ind.col = cols_along(Gna),
size = 500,
infos.pos = NULL,
ncores = 1
)
bed_ld_scores(
obj.bed,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
size = 500,
infos.pos = NULL,
ncores = 1
)
Arguments
Gna |
A FBM.code256
(typically |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
size |
For one SNP, window size around this SNP to compute correlations.
Default is |
infos.pos |
Vector of integers specifying the physical position
on a chromosome (in base pairs) of each SNP. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
Value
A vector of LD scores. For each variant, this is the sum of squared correlations with the neighboring variants (including itself).
Examples
test <- snp_attachExtdata()
G <- test$genotypes
(ld <- snp_ld_scores(G, ind.col = 1:1000))
LDpred2
Description
LDpred2. Tutorial at https://privefl.github.io/bigsnpr/articles/LDpred2.html.
Usage
snp_ldpred2_inf(corr, df_beta, h2)
snp_ldpred2_grid(
corr,
df_beta,
grid_param,
burn_in = 50,
num_iter = 100,
ncores = 1,
return_sampling_betas = FALSE,
ind.corr = cols_along(corr)
)
snp_ldpred2_auto(
corr,
df_beta,
h2_init,
vec_p_init = 0.1,
burn_in = 500,
num_iter = 200,
sparse = FALSE,
verbose = FALSE,
report_step = num_iter + 1L,
allow_jump_sign = TRUE,
shrink_corr = 1,
use_MLE = TRUE,
p_bounds = c(1e-05, 1),
alpha_bounds = c(-1.5, 0.5),
ind.corr = cols_along(corr),
ncores = 1
)
Arguments
corr |
Sparse correlation matrix as an SFBM.
If |
df_beta |
A data frame with 3 columns:
|
h2 |
Heritability estimate. |
grid_param |
A data frame with 3 columns as a grid of hyper-parameters:
|
burn_in |
Number of burn-in iterations. |
num_iter |
Number of iterations after burn-in. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
return_sampling_betas |
Whether to return all sampling betas (after
burn-in)? This is useful for assessing the uncertainty of the PRS at the
individual level (see doi:10.1101/2020.11.30.403188).
Default is |
ind.corr |
Indices to "subset" |
h2_init |
Heritability estimate for initialization. |
vec_p_init |
Vector of initial values for p. Default is |
sparse |
In LDpred2-auto, whether to also report a sparse solution by
running LDpred2-grid with the estimates of p and h2 from LDpred2-auto, and
sparsity enabled. Default is |
verbose |
Whether to print "p // h2" estimates at each iteration. Disabled when parallelism is used. |
report_step |
Step to report sampling betas (after burn-in and before
unscaling). Nothing is reported by default. If using |
allow_jump_sign |
Whether to allow for effects sizes to change sign in
consecutive iterations? Default is |
shrink_corr |
Shrinkage multiplicative coefficient to apply to off-diagonal
elements of the correlation matrix. Default is |
use_MLE |
Whether to use maximum likelihood estimation (MLE) to estimate
alpha and the variance component (since v1.11.4), or assume that alpha is
-1 and estimate the variance of (scaled) effects as h2/(m*p), as it was
done in earlier versions of LDpred2-auto (e.g. in v1.10.8). Default is |
p_bounds |
Boundaries for the estimates of p (the polygenicity).
Default is |
alpha_bounds |
Boundaries for the estimates of |
Details
For reproducibility, set.seed()
can be used to ensure that two runs of
LDpred2 give the exact same results (since v1.10).
Value
snp_ldpred2_inf
: A vector of effects, assuming an infinitesimal model.
snp_ldpred2_grid
: A matrix of effect sizes, one vector (column)
for each row of grid_param
. Missing values are returned when strong
divergence is detected. If using return_sampling_betas
, each column
corresponds to one iteration instead (after burn-in).
snp_ldpred2_auto
: A list (over vec_p_init
) of lists with
-
$beta_est
: vector of effect sizes (on the allele scale); note that missing values are returned when strong divergence is detected -
$beta_est_sparse
(only whensparse = TRUE
): sparse vector of effect sizes -
$postp_est
: vector of posterior probabilities of being causal -
$corr_est
, the "imputed" correlations between variants and phenotypes, which can be used for post-QCing variants by comparing those towith(df_beta, beta / sqrt(n_eff * beta_se^2 + beta^2))
-
$sample_beta
: sparse matrix of sampling betas (see parameterreport_step
), not on the allele scale, for which you need to multiply bywith(df_beta, sqrt(n_eff * beta_se^2 + beta^2))
-
$path_p_est
: full path of p estimates (including burn-in); useful to check convergence of the iterative algorithm -
$path_h2_est
: full path of h2 estimates (including burn-in); useful to check convergence of the iterative algorithm -
$path_alpha_est
: full path of alpha estimates (including burn-in); useful to check convergence of the iterative algorithm -
$h2_est
: estimate of the (SNP) heritability (also see coef_to_liab) -
$p_est
: estimate of p, the proportion of causal variants -
$alpha_est
: estimate of alpha, the parameter controlling the relationship between allele frequencies and expected effect sizes -
$h2_init
and$p_init
: input parameters, for convenience
LD score regression
Description
LD score regression
Usage
snp_ldsc(
ld_score,
ld_size,
chi2,
sample_size,
blocks = 200,
intercept = NULL,
chi2_thr1 = 30,
chi2_thr2 = Inf,
ncores = 1
)
snp_ldsc2(
corr,
df_beta,
blocks = NULL,
intercept = 1,
ncores = 1,
ind.beta = cols_along(corr),
chi2_thr1 = 30,
chi2_thr2 = Inf
)
Arguments
ld_score |
Vector of LD scores. |
ld_size |
Number of variants used to compute |
chi2 |
Vector of chi-squared statistics. |
sample_size |
Sample size of GWAS corresponding to chi-squared statistics. Possibly a vector, or just a single value. |
blocks |
Either a single number specifying the number of blocks,
or a vector of integers specifying the block number of each |
intercept |
You can constrain the intercept to some value (e.g. 1).
Default is |
chi2_thr1 |
Threshold on |
chi2_thr2 |
Threshold on |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
corr |
Sparse correlation matrix. Can also be an SFBM. |
df_beta |
A data frame with 3 columns:
|
ind.beta |
Indices in |
Value
Vector of 4 values (or only the first 2 if blocks = NULL
):
-
[["int"]]
: LDSC regression intercept, -
[["int_se"]]
: SE of this intercept, -
[["h2"]]
: LDSC regression estimate of (SNP) heritability (also see coef_to_liab), -
[["h2_se"]]
: SE of this heritability estimate.
Examples
bigsnp <- snp_attachExtdata()
G <- bigsnp$genotypes
y <- bigsnp$fam$affection - 1
corr <- snp_cor(G, ind.col = 1:1000)
gwas <- big_univLogReg(G, y, ind.col = 1:1000)
df_beta <- data.frame(beta = gwas$estim, beta_se = gwas$std.err,
n_eff = 4 / (1 / sum(y == 0) + 1 / sum(y == 1)))
snp_ldsc2(corr, df_beta)
snp_ldsc2(corr, df_beta, blocks = 20, intercept = NULL)
Independent LD blocks
Description
Split a correlation matrix in blocks as independent as possible. This finds the splitting in blocks that minimizes the sum of squared correlation between these blocks (i.e. everything outside these blocks). In case of equivalent splits, it then minimizes the sum of squared sizes of the blocks.
Usage
snp_ldsplit(
corr,
thr_r2,
min_size,
max_size,
max_K = 500,
max_r2 = 0.3,
max_cost = ncol(corr)/200,
pos_scaled = rep(0, ncol(corr))
)
Arguments
corr |
Sparse correlation matrix. Usually, the output of |
thr_r2 |
Threshold under which squared correlations are ignored.
This is useful to avoid counting noise, which should give clearer patterns
of costs vs. number of blocks. It is therefore possible to have a splitting
cost of 0. If this parameter is used, then |
min_size |
Minimum number of variants in each block. This is used not to have a disproportionate number of small blocks. |
max_size |
Maximum number of variants in each block. This is used not to have blocks that are too large, e.g. to limit computational and memory requirements of applications that would use these blocks. For some long-range LD regions, it may be needed to allow for large blocks. You can now provide a vector of values to try. |
max_K |
Maximum number of blocks to consider. All optimal solutions for K
from 1 to |
max_r2 |
Maximum squared correlation allowed for one pair of variants in
two different blocks. This is used to make sure that strong correlations are
not discarded and also to speed up the algorithm. Default is |
max_cost |
Maximum cost reported. Default is |
pos_scaled |
Vector of positions. The positions should be scaled so that limits of a block must be separated by a distance of 1 at the maximum. E.g. if the positions are in base pairs (bp), and you want a maximum distance of 10 Mbp, you need to provide the vector of positions divided by 10e6. |
Value
Either NULL
when no block splitting satisfies the conditions,
or a tibble with seven columns:
-
$max_size
: Input parameter, useful when providing a vector of values to try. -
$n_block
: Number of blocks. -
$cost
: The sum of squared correlations outside the blocks. -
$cost2
: The sum of squared sizes of the blocks. -
$perc_kept
: Percentage of initial non-zero values kept within the blocks defined. -
$all_last
: Last index of each block. -
$all_size
: Sizes of the blocks. -
$block_num
: Resulting block numbers for each variant. This is not reported anymore, but can be computed withrep(seq_along(all_size), all_size)
.
Examples
## Not run:
corr <- readRDS(url("https://www.dropbox.com/s/65u96jf7y32j2mj/spMat.rds?raw=1"))
# adjust `THR_R2` depending on sample size used to compute corr
# use e.g. 0.05 for small sample sizes, and 0.01 for large sample sizes
THR_R2 <- 0.02
m <- ncol(corr)
(SEQ <- round(seq_log(m / 30, m / 5, length.out = 10)))
# replace `min_size` by e.g. 100 for larger data
(res <- snp_ldsplit(corr, thr_r2 = THR_R2, min_size = 10, max_size = SEQ))
# add the variant block IDs corresponding to each split
res$block_num <- lapply(res$all_size, function(.) rep(seq_along(.), .))
library(ggplot2)
# trade-off cost / number of blocks
qplot(n_block, cost, color = factor(max_size, SEQ), data = res) +
theme_bw(14) +
scale_y_log10() +
theme(legend.position = "top") +
labs(x = "Number of blocks", color = "Maximum block size",
y = "Sum of squared correlations outside blocks")
# trade-off cost / number of non-zero values
qplot(perc_kept, cost, color = factor(max_size, SEQ), data = res) +
theme_bw(14) +
# scale_y_log10() +
theme(legend.position = "top") +
labs(x = "Percentage of non-zero values kept", color = "Maximum block size",
y = "Sum of squared correlations outside blocks")
# trade-off cost / sum of squared sizes
qplot(cost2, cost, color = factor(max_size, SEQ), data = res) +
theme_bw(14) +
scale_y_log10() +
geom_vline(xintercept = 0)+
theme(legend.position = "top") +
labs(x = "Sum of squared blocks", color = "Maximum block size",
y = "Sum of squared correlations outside blocks")
## Pick one solution and visualize blocks
library(dplyr)
all_ind <- res %>%
arrange(cost2 * sqrt(5 + cost)) %>%
print() %>%
slice(1) %>%
pull(all_last)
## Transform sparse representation into (i,j,x) triplets
corrT <- as(corr, "dgTMatrix")
upper <- (corrT@i <= corrT@j & corrT@x^2 >= THR_R2)
df <- data.frame(
i = corrT@i[upper] + 1L,
j = corrT@j[upper] + 1L,
r2 = corrT@x[upper]^2
)
df$y <- (df$j - df$i) / 2
ggplot(df) +
geom_point(aes(i + y, y, alpha = r2)) +
theme_minimal() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(),
strip.background = element_blank(), strip.text.x = element_blank()) +
scale_alpha_continuous(range = 0:1) +
scale_x_continuous(expand = c(0.02, 0.02), minor_breaks = NULL,
breaks = head(all_ind[[1]], -1) + 0.5) +
facet_wrap(~ cut(i + y, 4), scales = "free", ncol = 1) +
labs(x = "Position", y = NULL)
## End(Not run)
Manhattan plot
Description
Creates a manhattan plot.
Usage
snp_manhattan(
gwas,
infos.chr,
infos.pos,
colors = c("black", "grey60"),
dist.sep.chrs = 1e+07,
ind.highlight = integer(0),
col.highlight = "red",
labels = NULL,
npoints = NULL,
coeff = 1
)
Arguments
gwas |
A |
infos.chr |
Vector of integers specifying each SNP's chromosome. |
infos.pos |
Vector of integers specifying the physical position
on a chromosome (in base pairs) of each SNP. |
colors |
Colors used for each chromosome (they are recycled). Default is an alternation of black and gray. |
dist.sep.chrs |
"Physical" distance that separates two chromosomes. Default is 10 Mbp. |
ind.highlight |
Indices of SNPs you want to highlight (of interest). Default doesn't highlight any SNPs. |
col.highlight |
Color used for highlighting SNPs. Default uses red. |
labels |
Labels of the x axis. Default uses the number of the
chromosome there are in |
npoints |
Number of points to keep (ranked by p-value) in order to get
a lighter object (and plot). Default doesn't cut anything.
If used, the resulting object will have an attribute called |
coeff |
Relative size of text. Default is |
Details
If you don't have information of chromosome and position, you should simply
use plot
instead.
Value
A ggplot2
object. You can plot it using the print
method.
You can modify it as you wish by adding layers. You might want to read
this chapter
to get more familiar with the package ggplot2.
Examples
set.seed(9)
test <- snp_attachExtdata()
G <- test$genotypes
y <- rnorm(nrow(G))
gwas <- big_univLinReg(G, y)
snp_qq(gwas)
gwas_gc <- snp_gc(gwas) # this modifies `attr(gwas_gc, "transfo")`
snp_qq(gwas_gc)
# The next plot should be prettier with a real dataset
snp_manhattan(gwas_gc,
infos.chr = test$map$chromosome,
infos.pos = test$map$physical.pos) +
ggplot2::geom_hline(yintercept = -log10(5e-8), linetype = 2, color = "red")
p <- snp_qq(gwas_gc) +
ggplot2::aes(text = asPlotlyText(test$map)) +
ggplot2::labs(subtitle = NULL, x = "Expected -log10(p)", y = "Observed -log10(p)")
## Not run: plotly::ggplotly(p, tooltip = "text")
Match alleles
Description
Match alleles between summary statistics and SNP information. Match by ("chr", "a0", "a1") and ("pos" or "rsid"), accounting for possible strand flips and reverse reference alleles (opposite effects).
Usage
snp_match(
sumstats,
info_snp,
strand_flip = TRUE,
join_by_pos = TRUE,
remove_dups = TRUE,
match.min.prop = 0.2,
return_flip_and_rev = FALSE
)
Arguments
sumstats |
A data frame with columns "chr", "pos", "a0", "a1" and "beta". |
info_snp |
A data frame with columns "chr", "pos", "a0" and "a1". |
strand_flip |
Whether to try to flip strand? (default is |
join_by_pos |
Whether to join by chromosome and position (default), or instead by rsid. |
remove_dups |
Whether to remove duplicates (same physical position)?
Default is |
match.min.prop |
Minimum proportion of variants in the smallest data
to be matched, otherwise stops with an error. Default is |
return_flip_and_rev |
Whether to return internal boolean variables
|
Value
A single data frame with matched variants. Values in column $beta
are multiplied by -1 for variants with alleles reversed (i.e. swapped).
New variable "_NUM_ID_.ss"
returns the corresponding row indices of the
input sumstats
(first argument of this function), and "_NUM_ID_"
corresponding to the input info_snp
(second argument).
See Also
Examples
sumstats <- data.frame(
chr = 1,
pos = c(86303, 86331, 162463, 752566, 755890, 758144),
a0 = c("T", "G", "C", "A", "T", "G"),
a1 = c("G", "A", "T", "G", "A", "A"),
beta = c(-1.868, 0.250, -0.671, 2.112, 0.239, 1.272),
p = c(0.860, 0.346, 0.900, 0.456, 0.776, 0.383)
)
info_snp <- data.frame(
id = c("rs2949417", "rs115209712", "rs143399298", "rs3094315", "rs3115858"),
chr = 1,
pos = c(86303, 86331, 162463, 752566, 755890),
a0 = c("T", "A", "G", "A", "T"),
a1 = c("G", "G", "A", "G", "A")
)
snp_match(sumstats, info_snp)
snp_match(sumstats, info_snp, strand_flip = FALSE)
Modify genome build
Description
Modify the physical position information of a data frame when converting genome build using executable liftOver.
Usage
snp_modifyBuild(
info_snp,
liftOver,
from = "hg18",
to = "hg19",
check_reverse = TRUE,
local_chain = NULL,
base_url = "https://hgdownload.soe.ucsc.edu/goldenPath/"
)
Arguments
info_snp |
A data frame with columns "chr" and "pos". |
liftOver |
Path to liftOver executable. Binaries can be downloaded at https://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/liftOver for Mac and at https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver for Linux. |
from |
Genome build to convert from. Default is |
to |
Genome build to convert to. Default is |
check_reverse |
Whether to discard positions for which we cannot go back
to initial values by doing 'from -> to -> from'. Default is |
local_chain |
Local chain file (e.g. |
base_url |
From where to download the chain files. Default is
|
Value
Input data frame info_snp
with column "pos" in the new build.
References
Hinrichs, Angela S., et al. "The UCSC genome browser database: update 2006." Nucleic acids research 34.suppl_1 (2006): D590-D598.
Outlier detection
Description
Method to detect genetic markers involved in biological adaptation. This provides a statistical tool for outlier detection based on Principal Component Analysis. This corresponds to the statistic based on mahalanobis distance, as implemented in package pcadapt.
Usage
snp_pcadapt(
G,
U.row,
ind.row = rows_along(G),
ind.col = cols_along(G),
ncores = 1
)
bed_pcadapt(
obj.bed,
U.row,
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
ncores = 1
)
Arguments
G |
A FBM.code256
(typically |
U.row |
Left singular vectors (not scores, |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
obj.bed |
Object of type bed, which is the mapping of some bed file.
Use |
Value
An object of classes mhtest
and data.frame
returning one
score by SNP. See methods(class = "mhtest")
.
References
Luu, K., Bazin, E., & Blum, M. G. (2017). pcadapt: an R package to perform genome scans for selection based on principal component analysis. Molecular ecology resources, 17(1), 67-77.
See Also
snp_manhattan, snp_qq and snp_gc.
Examples
test <- snp_attachExtdata()
G <- test$genotypes
obj.svd <- big_SVD(G, fun.scaling = snp_scaleBinom(), k = 10)
plot(obj.svd) # there seems to be 3 "significant" components
pcadapt <- snp_pcadapt(G, obj.svd$u[, 1:3])
snp_qq(pcadapt)
Identity-by-descent
Description
Quality Control based on Identity-by-descent (IBD) computed by PLINK 1.9 using its method-of-moments.
Usage
snp_plinkIBDQC(
plink.path,
bedfile.in,
bedfile.out = NULL,
pi.hat = 0.08,
ncores = 1,
pruning.args = c(100, 0.2),
do.blind.QC = TRUE,
extra.options = "",
verbose = TRUE
)
Arguments
plink.path |
Path to the executable of PLINK 1.9. |
bedfile.in |
Path to the input bedfile. |
bedfile.out |
Path to the output bedfile. Default is created by
appending |
pi.hat |
PI_HAT value threshold for individuals (first by pairs)
to be excluded. Default is |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
pruning.args |
A vector of 2 pruning parameters, respectively
the window size (in variant count) and the pairwise $r^2$ threshold
(the step size is fixed to 1). Default is |
do.blind.QC |
Whether to do QC with |
extra.options |
Other options to be passed to PLINK as a string (for the IBD part). More options can be found at https://www.cog-genomics.org/plink/1.9/ibd. |
verbose |
Whether to show PLINK log? Default is |
Value
The path of the new bedfile. If no sample is filter, no new bed/bim/fam files are created and then the path of the input bedfile is returned.
References
Chang, Christopher C, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee. 2015. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4 (1): 7. doi:10.1186/s13742-015-0047-8.
See Also
download_plink snp_plinkQC snp_plinkKINGQC
Examples
## Not run:
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
plink <- download_plink()
bedfile <- snp_plinkIBDQC(plink, bedfile,
bedfile.out = tempfile(fileext = ".bed"),
ncores = 2)
df_rel <- snp_plinkIBDQC(plink, bedfile, do.blind.QC = FALSE, ncores = 2)
str(df_rel)
library(ggplot2)
qplot(Z0, Z1, data = df_rel, col = RT)
qplot(y = PI_HAT, data = df_rel) +
geom_hline(yintercept = 0.2, color = "blue", linetype = 2)
snp_plinkRmSamples(plink, bedfile,
bedfile.out = tempfile(fileext = ".bed"),
df.or.files = subset(df_rel, PI_HAT > 0.2))
## End(Not run)
Relationship-based pruning
Description
Quality Control based on KING-robust kinship estimator. More information can be found at https://www.cog-genomics.org/plink/2.0/distance#king_cutoff.
Usage
snp_plinkKINGQC(
plink2.path,
bedfile.in,
bedfile.out = NULL,
thr.king = 2^-3.5,
make.bed = TRUE,
ncores = 1,
extra.options = "",
verbose = TRUE
)
Arguments
plink2.path |
Path to the executable of PLINK 2. |
bedfile.in |
Path to the input bedfile. |
bedfile.out |
Path to the output bedfile. Default is created by
appending |
thr.king |
Note that KING kinship coefficients are scaled such that duplicate samples have kinship 0.5, not 1. First-degree relations (parent-child, full siblings) correspond to ~0.25, second-degree relations correspond to ~0.125, etc. It is conventional to use a cutoff of ~0.354 (2^-1.5, the geometric mean of 0.5 and 0.25) to screen for monozygotic twins and duplicate samples, ~0.177 (2^-2.5) to remove first-degree relations as well, and ~0.0884 (2^-3.5, default) to remove second-degree relations as well, etc. |
make.bed |
Whether to create new bed/bim/fam files (default). Otherwise, returns a table with coefficients of related pairs. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
extra.options |
Other options to be passed to PLINK2 as a string. |
verbose |
Whether to show PLINK log? Default is |
Value
See parameter make-bed
.
References
Manichaikul, Ani, Josyf C. Mychaleckyj, Stephen S. Rich, Kathy Daly, Michele Sale, and Wei-Min Chen. "Robust relationship inference in genome-wide association studies." Bioinformatics 26, no. 22 (2010): 2867-2873.
See Also
Examples
## Not run:
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
plink2 <- download_plink2(AVX2 = FALSE)
bedfile2 <- snp_plinkKINGQC(plink2, bedfile,
bedfile.out = tempfile(fileext = ".bed"),
ncores = 2)
df_rel <- snp_plinkKINGQC(plink2, bedfile, make.bed = FALSE, ncores = 2)
str(df_rel)
## End(Not run)
Quality Control
Description
Quality Control (QC) and possible conversion to bed/bim/fam files using PLINK 1.9.
Usage
snp_plinkQC(
plink.path,
prefix.in,
file.type = "--bfile",
prefix.out = paste0(prefix.in, "_QC"),
maf = 0.01,
geno = 0.1,
mind = 0.1,
hwe = 1e-50,
autosome.only = FALSE,
extra.options = "",
verbose = TRUE
)
Arguments
plink.path |
Path to the executable of PLINK 1.9. |
prefix.in |
Prefix (path without extension) of the dataset to be QCed. |
file.type |
Type of the dataset to be QCed. Default is |
prefix.out |
Prefix (path without extension) of the bed/bim/fam dataset
to be created. Default is created by appending |
maf |
Minimum Minor Allele Frequency (MAF) for a SNP to be kept.
Default is |
geno |
Maximum proportion of missing values for a SNP to be kept.
Default is |
mind |
Maximum proportion of missing values for a sample to be kept.
Default is |
hwe |
Filters out all variants which have Hardy-Weinberg equilibrium
exact test p-value below the provided threshold. Default is |
autosome.only |
Whether to exclude all unplaced and non-autosomal
variants? Default is |
extra.options |
Other options to be passed to PLINK as a string. More
options can be found at https://www.cog-genomics.org/plink2/filter.
If using PLINK 2.0, you could e.g. use |
verbose |
Whether to show PLINK log? Default is |
Value
The path of the newly created bedfile.
References
Chang, Christopher C, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee. 2015. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4 (1): 7. doi:10.1186/s13742-015-0047-8.
See Also
Examples
## Not run:
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
prefix <- sub_bed(bedfile)
plink <- download_plink()
test <- snp_plinkQC(plink.path = plink,
prefix.in = prefix,
prefix.out = tempfile(),
file.type = "--bfile", # the default (for ".bed")
maf = 0.05,
geno = 0.05,
mind = 0.05,
hwe = 1e-10,
autosome.only = TRUE)
test
## End(Not run)
Remove samples
Description
Create new bed/bim/fam files by removing samples with PLINK.
Usage
snp_plinkRmSamples(
plink.path,
bedfile.in,
bedfile.out,
df.or.files,
col.family.ID = 1,
col.sample.ID = 2,
...,
verbose = TRUE
)
Arguments
plink.path |
Path to the executable of PLINK 1.9. |
bedfile.in |
Path to the input bedfile. |
bedfile.out |
Path to the output bedfile. |
df.or.files |
Either
|
col.family.ID |
Index of the column containing the family IDs to match with those of the study. Default is the first one. |
col.sample.ID |
Index of the column containing the sample IDs to match with those of the study. Default is the second one. |
... |
Any additional parameter to pass to |
verbose |
Whether to show PLINK log? Default is |
Value
The path of the new bedfile.
See Also
BGEN matrix product
Description
Compute a matrix product between BGEN files and a matrix. This removes the
need to read an intermediate FBM object with snp_readBGEN()
to compute the
product. Moreover, when using dosages, they are not rounded to two decimal
places anymore.
Usage
snp_prodBGEN(
bgenfiles,
beta,
list_snp_id,
ind_row = NULL,
bgi_dir = dirname(bgenfiles),
read_as = c("dosage", "random"),
block_size = 1000,
ncores = 1
)
Arguments
bgenfiles |
Character vector of paths to files with extension ".bgen". The corresponding ".bgen.bgi" index files must exist. |
beta |
A matrix (or a vector), with rows corresponding to |
list_snp_id |
List of character vectors of SNP IDs to read, with one
vector per BGEN file. Each SNP ID should be in the form
|
ind_row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. Don't use negative indices.
You can access the sample IDs corresponding to the genotypes from the .sample
file, and use e.g. |
bgi_dir |
Directory of index files. Default is the same as |
read_as |
How to read BGEN probabilities? Currently implemented:
|
block_size |
Maximum size of temporary blocks (in number of variants).
Default is |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
The product bgen_data[ind_row, 'list_snp_id'] %*% beta
.
See Also
Q-Q plot
Description
Creates a quantile-quantile plot from p-values from a GWAS study.
Usage
snp_qq(gwas, lambdaGC = TRUE, coeff = 1)
Arguments
gwas |
A |
lambdaGC |
Add the Genomic Control coefficient as subtitle to the plot? |
coeff |
Relative size of text. Default is |
Value
A ggplot2
object. You can plot it using the print
method.
You can modify it as you wish by adding layers. You might want to read
this chapter
to get more familiar with the package ggplot2.
Examples
set.seed(9)
test <- snp_attachExtdata()
G <- test$genotypes
y <- rnorm(nrow(G))
gwas <- big_univLinReg(G, y)
snp_qq(gwas)
gwas_gc <- snp_gc(gwas) # this modifies `attr(gwas_gc, "transfo")`
snp_qq(gwas_gc)
# The next plot should be prettier with a real dataset
snp_manhattan(gwas_gc,
infos.chr = test$map$chromosome,
infos.pos = test$map$physical.pos) +
ggplot2::geom_hline(yintercept = -log10(5e-8), linetype = 2, color = "red")
p <- snp_qq(gwas_gc) +
ggplot2::aes(text = asPlotlyText(test$map)) +
ggplot2::labs(subtitle = NULL, x = "Expected -log10(p)", y = "Observed -log10(p)")
## Not run: plotly::ggplotly(p, tooltip = "text")
Read BGEN files into a "bigSNP"
Description
Function to read the UK Biobank BGEN files into a bigSNP.
Usage
snp_readBGEN(
bgenfiles,
backingfile,
list_snp_id,
ind_row = NULL,
bgi_dir = dirname(bgenfiles),
read_as = c("dosage", "random"),
ncores = 1
)
Arguments
bgenfiles |
Character vector of paths to files with extension ".bgen". The corresponding ".bgen.bgi" index files must exist. |
backingfile |
The path (without extension) for the backing files (".bk" and ".rds") that are created by this function for storing the bigSNP object. |
list_snp_id |
List of character vectors of SNP IDs to read, with one
vector per BGEN file. Each SNP ID should be in the form
|
ind_row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. Don't use negative indices.
You can access the sample IDs corresponding to the genotypes from the .sample
file, and use e.g. |
bgi_dir |
Directory of index files. Default is the same as |
read_as |
How to read BGEN probabilities? Currently implemented:
|
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Details
For more information on this format, please visit BGEN webpage.
This function is designed to read UK Biobank imputation files. This assumes
that variants have been compressed with zlib, that there are only 2 possible
alleles, and that each probability is stored on 8 bits. For example, if you
use qctool to generate your own BGEN files, please make sure you are using
options '-ofiletype bgen_v1.2 -bgen-bits 8 -assume-chromosome
'.
If the format is not the expected one, this will result in an error or even
a crash of your R session. Another common source of error is due to corrupted
files; e.g. if using UK Biobank files, compare the result of tools::md5sum()
with the ones at https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=998.
You can look at some example code from my papers on how to use this function:
-
https://github.com/privefl/paper-infer/blob/main/code/prepare-geno-simu.R
-
https://github.com/privefl/paper-misspec/blob/main/code/prepare-genotypes.R
Value
The path to the RDS file <backingfile>.rds
that stores the bigSNP
object created by this function.
Note that this function creates another file (.bk) which stores the values
of the FBM ($genotypes
). The rows corresponds to the order of ind_row
;
the columns to the order of list_snp_id
. The $map
component of the
bigSNP
object stores some information on the variants (including allele
frequencies and INFO scores computed from the imputation probabilities).
However, it does not have a $fam
component; you should use the individual
IDs in the .sample file (filtered with ind_row
) to add external
information on the individuals.
You shouldn't read from BGEN files more than once. Instead, use
snp_attach to load the "bigSNP" object in any R session from backing files.
Read variant info from one BGI file
Description
Read variant info from one BGI file
Usage
snp_readBGI(bgifile, snp_id = NULL)
Arguments
bgifile |
Path to one file with extension ".bgi". |
snp_id |
Character vector of SNP IDs. These should be in the form
|
Value
A data frame containing variant information.
Read PLINK files into a "bigSNP"
Description
Functions to read bed/bim/fam files into a bigSNP.
Usage
snp_readBed(bedfile, backingfile = sub_bed(bedfile))
snp_readBed2(
bedfile,
backingfile = sub_bed(bedfile),
ind.row = rows_along(obj.bed),
ind.col = cols_along(obj.bed),
ncores = 1
)
Arguments
bedfile |
Path to file with extension ".bed" to read. You need the corresponding ".bim" and ".fam" in the same directory. |
backingfile |
The path (without extension) for the backing files for the cache of the bigSNP object. Default takes the bedfile without the ".bed" extension. |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Details
For more information on these formats, please visit
PLINK webpage.
For other formats, please use PLINK to convert them in bedfiles,
which require minimal space to store and are faster to read. For example,
to convert from a VCF file, use the --vcf
option. See snp_plinkQC.
Value
The path to the RDS file that stores the bigSNP
object.
Note that this function creates one other file which stores the values of
the Filebacked Big Matrix.
You shouldn't read from PLINK files more than once. Instead, use
snp_attach to load the "bigSNP" object in any R session from backing files.
Examples
(bedfile <- system.file("extdata", "example.bed", package = "bigsnpr"))
# Reading the bedfile and storing the data in temporary directory
rds <- snp_readBed(bedfile, backingfile = tempfile())
# Loading the data from backing files
test <- snp_attach(rds)
str(test)
dim(G <- test$genotypes)
G[1:8, 1:8]
Save modifications
Description
Save a bigSNP
after having made some modifications to it.
As bigSNP
is an S3 class, you can add any slot you want
to an object of this class, then use snp_save
to
save these modifications in the corresponding ".rds" backing file.
Usage
snp_save(x, version = NULL)
Arguments
x |
A bigSNP. |
version |
the workspace format version to use. |
Value
The (saved) bigSNP
.
Examples
set.seed(1)
# Reading example
test <- snp_attachExtdata()
# I can add whatever I want to an S3 class
test$map$`p-values` <- runif(nrow(test$map))
str(test$map)
# Reading again
rds <- test$genotypes$rds
test2 <- snp_attach(rds)
str(test2$map) # new slot wasn't saved
# Save it
snp_save(test)
# Reading again
test3 <- snp_attach(rds)
str(test3$map) # it is saved now
# The complicated code of this function
snp_save
Binomial(n, p) scaling
Description
Binomial(n, p) scaling where n
is fixed and p
is estimated.
Usage
snp_scaleAlpha(alpha = -1)
snp_scaleBinom(nploidy = 2)
Arguments
alpha |
Assumes that the average contribution (e.g. heritability)
of a SNP of frequency |
nploidy |
Number of trials, parameter of the binomial distribution.
Default is |
Details
You will probably not use this function as is but as the
fun.scaling
parameter of other functions of package bigstatsr
.
Value
A new function that returns a data.frame of two vectors
"center" and "scale" which are of the length of ind.col
.
References
This scaling is widely used for SNP arrays. Patterson N, Price AL, Reich D (2006). Population Structure and Eigenanalysis. PLoS Genet 2(12): e190. doi:10.1371/journal.pgen.0020190.
Examples
set.seed(1)
a <- matrix(0, 93, 170)
p <- 0.2
a[] <- rbinom(length(a), 2, p)
X <- add_code256(big_copy(a, type = "raw"), code = c(0, 1, 2, rep(NA, 253)))
X.svd <- big_SVD(X, fun.scaling = snp_scaleBinom())
str(X.svd)
plot(X.svd$center)
abline(h = 2 * p, col = "red")
plot(X.svd$scale)
abline(h = sqrt(2 * p * (1 - p)), col = "red")
Simulate phenotypes
Description
Simulate phenotypes using a linear model. When a prevalence is given, the liability threshold is used to convert liabilities to a binary outcome. The genetic and environmental liabilities are scaled such that the variance of the genetic liability is exactly equal to the requested heritability, and the variance of the total liability is equal to 1.
Usage
snp_simuPheno(
G,
h2,
M,
K = NULL,
alpha = -1,
ind.row = rows_along(G),
ind.possible = cols_along(G),
prob = NULL,
effects.dist = c("gaussian", "laplace"),
ncores = 1
)
Arguments
G |
A FBM.code256
(typically |
h2 |
Heritability. |
M |
Number of causal variants. |
K |
Prevalence. Default is |
alpha |
Assumes that the average contribution (e.g. heritability)
of a SNP of frequency |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.possible |
Indices of possible causal variants. |
prob |
Vector of probability weights for sampling causal indices.
It can have 0s (discarded) and is automatically scaled to sum to 1.
Default is |
effects.dist |
Distribution of effects.
Either |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
Value
A list with 3 elements:
-
$pheno
: vector of phenotypes, -
$set
: indices of causal variants, -
$effects
: effect sizes (of scaled genotypes) corresponding toset
. -
$allelic_effects
: effect sizes, but on the allele scale (0|1|2).
Split-parApply-Combine
Description
A Split-Apply-Combine strategy to parallelize the evaluation of a function on each SNP, independently.
Usage
snp_split(infos.chr, FUN, combine, ncores = 1, ...)
Arguments
infos.chr |
Vector of integers specifying each SNP's chromosome. |
FUN |
The function to be applied. It must take a
FBM.code256 as first argument and |
combine |
function that is used by foreach::foreach to process the tasks results as they generated. This can be specified as either a function or a non-empty character string naming the function. Specifying 'c' is useful for concatenating the results into a vector, for example. The values 'cbind' and 'rbind' can combine vectors into a matrix. The values '+' and '*' can be used to process numeric data. By default, the results are returned in a list. |
ncores |
Number of cores used. Default doesn't use parallelism.
You may use |
... |
Extra arguments to be passed to |
Details
This function splits indices for each chromosome, then apply a given function to each part (chromosome) and finally combine the results.
Value
The result of foreach::foreach.
Examples
# parallelize over chromosomes made easy
# examples of functions from this package
snp_pruning
snp_clumping
snp_fastImpute
Subset a bigSNP
Description
Subset (copy) of a bigSNP
, also stored on disk.
Usage
snp_subset(
x,
ind.row = rows_along(x$genotypes),
ind.col = cols_along(x$genotypes),
backingfile = NULL
)
## S3 method for class 'bigSNP'
subset(
x,
ind.row = rows_along(x$fam),
ind.col = rows_along(x$map),
backingfile = NULL,
...
)
Arguments
x |
A bigSNP. |
ind.row |
Indices of the rows (individuals) to keep. Negative indices can be used to exclude row indices. Default: keep them all. |
ind.col |
Indices of the columns (SNPs) to keep. Negative indices can be used to exclude column indices. Default: keep them all. |
backingfile |
Prefix of the two new files created (".bk" and ".rds"). By default, it is automatically determined by appending "_sub" and a number to the prefix of the input bigSNP backing files. |
... |
Not used. |
Value
The path to the RDS file that stores the bigSNP
object.
See Also
Examples
str(test <- snp_attachExtdata())
# keep only first 50 samples and SNPs
rdsfile <- snp_subset(test, ind.row = 1:50, ind.col = 1:50)
str(snp_attach(rdsfile))
# remove only first 50 samples and SNPs
rdsfile2 <- snp_subset(test, ind.row = -(1:50), ind.col = -(1:50))
str(snp_attach(rdsfile2))
Thresholding and correction
Description
P-value thresholding and correction of summary statistics for winner's curse.
Usage
snp_thr_correct(beta, beta_se, lpS, thr_lpS)
Arguments
beta |
Vector of effect sizes. |
beta_se |
Vector of standard errors for |
lpS |
Vector of -log10(p-value) associated with |
thr_lpS |
Threshold on |
Value
beta
after p-value thresholding and shrinkage.
References
Zhong, H., & Prentice, R. L. (2008). Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics, 9(4), 621-634.
Examples
beta <- rnorm(1000)
beta_se <- runif(1000, min = 0.3, max = 0.5)
new_beta <- snp_thr_correct(beta, beta_se = beta_se, thr_lpS = 1)
plot(beta / beta_se, new_beta / beta_se, pch = 20); abline(0, 1, col = "red")
plot(beta, new_beta, pch = 20); abline(0, 1, col = "red")
# Can provide -log10(p-values) instead of standard errors
lpval <- -log10(pchisq((beta / beta_se)^2, df = 1, lower.tail = FALSE))
new_beta2 <- snp_thr_correct(beta, lpS = lpval, thr_lpS = 1)
all.equal(new_beta2, new_beta)
Write PLINK files from a "bigSNP"
Description
Function to write bed/bim/fam files from a bigSNP.
This will use the slot code
rounded to write 0s, 1s, 2s or NAs.
Usage
snp_writeBed(x, bedfile, ind.row = rows_along(G), ind.col = cols_along(G))
Arguments
x |
A bigSNP. |
bedfile |
Path to file with extension ".bed" to create. |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
Value
The input bedfile
path.
Examples
N <- 17
M <- 911
fake <- snp_fake(N, M)
G <- fake$genotypes
G[] <- sample(as.raw(0:3), size = length(G), replace = TRUE)
# Write the object as a bed/bim/fam object
tmp <- tempfile(fileext = ".bed")
bed <- snp_writeBed(fake, tmp)
# Read this new file for the first time
rds <- snp_readBed(bed, backingfile = tempfile())
# Attach object in R session
fake2 <- snp_attach(rds)
# Same content
all.equal(fake$genotypes[], fake2$genotypes[])
all.equal(fake$fam, fake2$fam)
all.equal(fake$map, fake2$map)
# Two different backingfiles
fake$genotypes$backingfile
fake2$genotypes$backingfile
Replace extension '.bed'
Description
Replace extension '.bed'
Usage
sub_bed(path, replacement = "", stop_if_not_ext = TRUE)
Arguments
path |
String with extension '.bed'. |
replacement |
Replacement of '.bed'. Default replaces by nothing. Can be useful to replace e.g. by '.bim' or '.fam'. |
stop_if_not_ext |
If |
Value
String with extension '.bed' replaced by replacement
.
Examples
path <- "toto.bed"
sub_bed(path)
sub_bed(path, ".bim")
sub_bed(path, ".fam")
sub_bed(path, "_QC", stop_if_not_ext = FALSE)