The BinaryDosage package provides three functions for reading individual SNPs from Format 5 binary dosage files. They return the same data but differ in how they manage memory and file connections, which matters when reading many SNPs in a loop.
| Function | Allocates output? | Opens file per call? |
|---|---|---|
getbd5snp |
Yes — returns a new list | Yes |
getbd5snp_buf |
No — writes into caller vectors | Yes |
getbd5snp_con |
No — writes into caller vectors | No — reuses open connection |
All three functions require the "genetic-info" list
returned by getbdinfo.
The examples below use the small bgzipped VCF file included with the package. First, convert it to a Format 5 file pair and load the metadata.
bdose_file <- file.path(tempdir(), "set1a.bdose")
if (requireNamespace("vcfppR", quietly = TRUE)) {
vcftobd(vcffile = system.file("extdata", "set1a.vcf.gz", package = "BinaryDosage"),
bdose_file = bdose_file)
} else {
updatebd(bdfiles = system.file("extdata", "vcf1a.bdose", package = "BinaryDosage"),
bdose_file = bdose_file)
}
bd5 <- getbdinfo(bdose_file)
n_snps <- nrow(bd5$snps)
n_samp <- nrow(bd5$samples)
cat("SNPs:", n_snps, " Samples:", n_samp, "\n")
#> SNPs: 10 Samples: 60getbd5snp is the simplest interface. It opens the
.bdose file, seeks to the requested SNP’s compressed block,
decompresses it, decodes the values, and returns them as a new list. The
file is opened and closed on every call.
bd5info — object returned by
getbdinfosnp — 1-based integer index or character SNP IDA list with four numeric vectors of length n_samples:
dosage — DS values in \[0,
2\]; NA = missingp0 — Pr(g=0) in \[0, 1\]; NA = missingp1 — Pr(g=1) in \[0, 1\]; NA = missingp2 — Pr(g=2) in \[0, 1\]; NA = missingsnp1 <- getbd5snp(bd5info = bd5, snp = 1L)
knitr::kable(
data.frame(
SampleID = head(bd5$samples$sid, 10),
Dosage = round(head(snp1$dosage, 10), 4),
P_00 = round(head(snp1$p0, 10), 4),
P_01 = round(head(snp1$p1, 10), 4),
P_11 = round(head(snp1$p2, 10), 4)
),
caption = paste("SNP", bd5$snps$snpid[1], "— first 10 subjects")
)| SampleID | Dosage | P_00 | P_01 | P_11 |
|---|---|---|---|---|
| I1 | 2.000 | 0.000 | 0.000 | 1.000 |
| I2 | 1.002 | 0.000 | 0.998 | 0.002 |
| I3 | 1.000 | 0.000 | 1.000 | 0.000 |
| I4 | 0.122 | 0.878 | 0.122 | 0.000 |
| I5 | 1.000 | 0.000 | 1.000 | 0.000 |
| I6 | 0.000 | 1.000 | 0.000 | 0.000 |
| I7 | 1.004 | 0.000 | 0.996 | 0.004 |
| I8 | 2.000 | 0.000 | 0.000 | 1.000 |
| I9 | 0.000 | 1.000 | 0.000 | 0.000 |
| I10 | 0.918 | 0.082 | 0.918 | 0.000 |
snp3 <- getbd5snp(bd5info = bd5, snp = "1:12000:T:C")
knitr::kable(
data.frame(
SampleID = head(bd5$samples$sid, 10),
Dosage = round(head(snp3$dosage, 10), 4),
P_00 = round(head(snp3$p0, 10), 4),
P_01 = round(head(snp3$p1, 10), 4),
P_11 = round(head(snp3$p2, 10), 4)
),
caption = "SNP 1:12000:T:C — first 10 subjects"
)| SampleID | Dosage | P_00 | P_01 | P_11 |
|---|---|---|---|---|
| I1 | 1.000 | 0.000 | 1.000 | 0.000 |
| I2 | 0.000 | 1.000 | 0.000 | 0.000 |
| I3 | 0.000 | 1.000 | 0.000 | 0.000 |
| I4 | 0.000 | 1.000 | 0.000 | 0.000 |
| I5 | 0.000 | 1.000 | 0.000 | 0.000 |
| I6 | 1.016 | 0.012 | 0.959 | 0.029 |
| I7 | 0.000 | 1.000 | 0.000 | 0.000 |
| I8 | 0.000 | 1.000 | 0.000 | 0.000 |
| I9 | 0.000 | 1.000 | 0.000 | 0.000 |
| I10 | 0.000 | 1.000 | 0.000 | 0.000 |
getbd5snp is convenient but allocates a new list and
opens the file on every call. For a small number of SNPs this is fine.
The following calculates the alternate allele frequency for every
SNP.
aaf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
aaf[i] <- mean(getbd5snp(bd5, i)$dosage, na.rm = TRUE) / 2
}
knitr::kable(
data.frame(snpid = bd5$snps$snpid, aaf = round(aaf, 4)),
caption = "Alternate allele frequencies via getbd5snp"
)| snpid | aaf |
|---|---|
| 1:10000:C:A | 0.3468 |
| 1:11000:T:C | 0.0159 |
| 1:12000:T:C | 0.2651 |
| 1:13000:T:C | 0.2966 |
| 1:14000:G:C | 0.2040 |
| 1:15000:A:C | 0.5236 |
| 1:16000:G:A | 0.4246 |
| 1:17000:C:A | 0.4434 |
| 1:18000:C:G | 0.2655 |
| 1:19000:T:G | 0.2446 |
getbd5snp_buf eliminates the per-call list allocation by
writing results directly into four numeric vectors that the caller
pre-allocates once before the loop. This avoids repeated garbage
collection pressure when reading many SNPs. The file is still opened and
closed on every call.
bd5info — object returned by
getbdinfosnp — 1-based integer index or character SNP IDdosage, p0, p1,
p2 — pre-allocated numeric(n_samples)
vectorsNULL invisibly. The four output vectors are updated in
place.
The output vectors must not have more than one R binding at the call site. If another variable also points to the same vector, R’s copy-on-modify rule will copy the vector before writing, so the caller’s variable will not be updated. Always use plain, freshly allocated vectors:
dosage <- numeric(n_samp)
p0 <- numeric(n_samp)
p1 <- numeric(n_samp)
p2 <- numeric(n_samp)
aaf_buf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
getbd5snp_buf(bd5info = bd5, snp = i, dosage = dosage,
p0 = p0, p1 = p1, p2 = p2)
aaf_buf[i] <- mean(dosage, na.rm = TRUE) / 2
}
knitr::kable(
data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_buf, 4)),
caption = "Alternate allele frequencies via getbd5snp_buf"
)| snpid | aaf |
|---|---|
| 1:10000:C:A | 0.3468 |
| 1:11000:T:C | 0.0159 |
| 1:12000:T:C | 0.2651 |
| 1:13000:T:C | 0.2966 |
| 1:14000:G:C | 0.2040 |
| 1:15000:A:C | 0.5236 |
| 1:16000:G:A | 0.4246 |
| 1:17000:C:A | 0.4434 |
| 1:18000:C:G | 0.2655 |
| 1:19000:T:G | 0.2446 |
getbd5snp_con is the highest-performance option. It
combines the pre-allocated vector approach of getbd5snp_buf
with a persistent open file connection, so the .bdose file
is opened once before the loop and kept open for all reads. The C-level
read uses zlib directly rather than R’s memDecompress.
Use this when reading a large number of SNPs sequentially and minimizing elapsed time matters.
openbd5con(bd5info) to open the file and get a
"bd5con" object.getbd5snp_con inside the loop, passing the
"bd5con" object.closebd5con when finished to release the file
handle promptly. (The connection is also closed automatically when the
"bd5con" object is garbage-collected or when R exits, so
the explicit close is optional but recommended.)Opens a persistent connection to the .bdose file.
Parameters
bd5info — object returned by
getbdinfoReturn value
An object of class "bd5con" to be passed to
getbd5snp_con and closebd5con.
Explicitly closes the connection.
Parameters
bd5con — object returned by
openbd5conReturn value
NULL invisibly.
Parameters
bd5info — object returned by
getbdinfosnp — 1-based integer index or character SNP IDdosage, p0, p1,
p2 — pre-allocated numeric(n_samples)
vectorsbd5con — object returned by
openbd5conReturn value
NULL invisibly. The four output vectors are updated in
place.
The same copy-on-modify constraint as getbd5snp_buf
applies.
dosage <- numeric(n_samp)
p0 <- numeric(n_samp)
p1 <- numeric(n_samp)
p2 <- numeric(n_samp)
con <- openbd5con(bd5)
aaf_con <- numeric(n_snps)
for (i in seq_len(n_snps)) {
getbd5snp_con(bd5info = bd5, snp = i, dosage = dosage,
p0 = p0, p1 = p1, p2 = p2, bd5con = con)
aaf_con[i] <- mean(dosage, na.rm = TRUE) / 2
}
closebd5con(con)
knitr::kable(
data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_con, 4)),
caption = "Alternate allele frequencies via getbd5snp_con"
)| snpid | aaf |
|---|---|
| 1:10000:C:A | 0.3468 |
| 1:11000:T:C | 0.0159 |
| 1:12000:T:C | 0.2651 |
| 1:13000:T:C | 0.2966 |
| 1:14000:G:C | 0.2040 |
| 1:15000:A:C | 0.5236 |
| 1:16000:G:A | 0.4246 |
| 1:17000:C:A | 0.4434 |
| 1:18000:C:G | 0.2655 |
| 1:19000:T:G | 0.2446 |
getbd5snp — use when reading a small
number of SNPs, or when simplicity matters more than speed.getbd5snp_buf — use in loops where
allocation overhead is measurable but the file seek cost is
acceptable.getbd5snp_con — use when reading all
or most SNPs sequentially and elapsed time is important. The persistent
connection avoids both the allocation and the file open/close on every
iteration.All three functions produce identical results, as confirmed below.