Reading SNPs from Format 5 Files

Introduction

The BinaryDosage package provides three functions for reading individual SNPs from Format 5 binary dosage files. They return the same data but differ in how they manage memory and file connections, which matters when reading many SNPs in a loop.

Function Allocates output? Opens file per call?
getbd5snp Yes — returns a new list Yes
getbd5snp_buf No — writes into caller vectors Yes
getbd5snp_con No — writes into caller vectors No — reuses open connection

All three functions require the "genetic-info" list returned by getbdinfo.

Setup

The examples below use the small bgzipped VCF file included with the package. First, convert it to a Format 5 file pair and load the metadata.

bdose_file <- file.path(tempdir(), "set1a.bdose")

if (requireNamespace("vcfppR", quietly = TRUE)) {
  vcftobd(vcffile    = system.file("extdata", "set1a.vcf.gz", package = "BinaryDosage"),
          bdose_file = bdose_file)
} else {
  updatebd(bdfiles    = system.file("extdata", "vcf1a.bdose", package = "BinaryDosage"),
           bdose_file = bdose_file)
}
bd5 <- getbdinfo(bdose_file)

n_snps <- nrow(bd5$snps)
n_samp <- nrow(bd5$samples)
cat("SNPs:", n_snps, " Samples:", n_samp, "\n")
#> SNPs: 10  Samples: 60

getbd5snp

getbd5snp is the simplest interface. It opens the .bdose file, seeks to the requested SNP’s compressed block, decompresses it, decodes the values, and returns them as a new list. The file is opened and closed on every call.

Parameters

Return value

A list with four numeric vectors of length n_samples:

Reading a SNP by index

snp1 <- getbd5snp(bd5info = bd5, snp = 1L)

knitr::kable(
  data.frame(
    SampleID = head(bd5$samples$sid, 10),
    Dosage   = round(head(snp1$dosage, 10), 4),
    P_00     = round(head(snp1$p0,     10), 4),
    P_01     = round(head(snp1$p1,     10), 4),
    P_11     = round(head(snp1$p2,     10), 4)
  ),
  caption = paste("SNP", bd5$snps$snpid[1], "— first 10 subjects")
)
SNP 1:10000:C:A — first 10 subjects
SampleID Dosage P_00 P_01 P_11
I1 2.000 0.000 0.000 1.000
I2 1.002 0.000 0.998 0.002
I3 1.000 0.000 1.000 0.000
I4 0.122 0.878 0.122 0.000
I5 1.000 0.000 1.000 0.000
I6 0.000 1.000 0.000 0.000
I7 1.004 0.000 0.996 0.004
I8 2.000 0.000 0.000 1.000
I9 0.000 1.000 0.000 0.000
I10 0.918 0.082 0.918 0.000

Reading a SNP by ID

snp3 <- getbd5snp(bd5info = bd5, snp = "1:12000:T:C")

knitr::kable(
  data.frame(
    SampleID = head(bd5$samples$sid, 10),
    Dosage   = round(head(snp3$dosage, 10), 4),
    P_00     = round(head(snp3$p0,     10), 4),
    P_01     = round(head(snp3$p1,     10), 4),
    P_11     = round(head(snp3$p2,     10), 4)
  ),
  caption = "SNP 1:12000:T:C — first 10 subjects"
)
SNP 1:12000:T:C — first 10 subjects
SampleID Dosage P_00 P_01 P_11
I1 1.000 0.000 1.000 0.000
I2 0.000 1.000 0.000 0.000
I3 0.000 1.000 0.000 0.000
I4 0.000 1.000 0.000 0.000
I5 0.000 1.000 0.000 0.000
I6 1.016 0.012 0.959 0.029
I7 0.000 1.000 0.000 0.000
I8 0.000 1.000 0.000 0.000
I9 0.000 1.000 0.000 0.000
I10 0.000 1.000 0.000 0.000

Using getbd5snp in a loop

getbd5snp is convenient but allocates a new list and opens the file on every call. For a small number of SNPs this is fine. The following calculates the alternate allele frequency for every SNP.

aaf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
  aaf[i] <- mean(getbd5snp(bd5, i)$dosage, na.rm = TRUE) / 2
}

knitr::kable(
  data.frame(snpid = bd5$snps$snpid, aaf = round(aaf, 4)),
  caption = "Alternate allele frequencies via getbd5snp"
)
Alternate allele frequencies via getbd5snp
snpid aaf
1:10000:C:A 0.3468
1:11000:T:C 0.0159
1:12000:T:C 0.2651
1:13000:T:C 0.2966
1:14000:G:C 0.2040
1:15000:A:C 0.5236
1:16000:G:A 0.4246
1:17000:C:A 0.4434
1:18000:C:G 0.2655
1:19000:T:G 0.2446

getbd5snp_buf

getbd5snp_buf eliminates the per-call list allocation by writing results directly into four numeric vectors that the caller pre-allocates once before the loop. This avoids repeated garbage collection pressure when reading many SNPs. The file is still opened and closed on every call.

Parameters

Return value

NULL invisibly. The four output vectors are updated in place.

Important: R copy-on-modify semantics

The output vectors must not have more than one R binding at the call site. If another variable also points to the same vector, R’s copy-on-modify rule will copy the vector before writing, so the caller’s variable will not be updated. Always use plain, freshly allocated vectors:

# Correct
dosage <- numeric(n_samp)
getbd5snp_buf(bd5, 1L, dosage, p0, p1, p2)

# Wrong — dosage2 creates a second binding; the update may not be visible
dosage2 <- dosage
getbd5snp_buf(bd5, 1L, dosage, p0, p1, p2)

Example

dosage <- numeric(n_samp)
p0     <- numeric(n_samp)
p1     <- numeric(n_samp)
p2     <- numeric(n_samp)

aaf_buf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
  getbd5snp_buf(bd5info = bd5, snp = i, dosage = dosage,
                p0 = p0, p1 = p1, p2 = p2)
  aaf_buf[i] <- mean(dosage, na.rm = TRUE) / 2
}

knitr::kable(
  data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_buf, 4)),
  caption = "Alternate allele frequencies via getbd5snp_buf"
)
Alternate allele frequencies via getbd5snp_buf
snpid aaf
1:10000:C:A 0.3468
1:11000:T:C 0.0159
1:12000:T:C 0.2651
1:13000:T:C 0.2966
1:14000:G:C 0.2040
1:15000:A:C 0.5236
1:16000:G:A 0.4246
1:17000:C:A 0.4434
1:18000:C:G 0.2655
1:19000:T:G 0.2446

getbd5snp_con

getbd5snp_con is the highest-performance option. It combines the pre-allocated vector approach of getbd5snp_buf with a persistent open file connection, so the .bdose file is opened once before the loop and kept open for all reads. The C-level read uses zlib directly rather than R’s memDecompress.

Use this when reading a large number of SNPs sequentially and minimizing elapsed time matters.

Workflow

  1. Call openbd5con(bd5info) to open the file and get a "bd5con" object.
  2. Call getbd5snp_con inside the loop, passing the "bd5con" object.
  3. Call closebd5con when finished to release the file handle promptly. (The connection is also closed automatically when the "bd5con" object is garbage-collected or when R exits, so the explicit close is optional but recommended.)

openbd5con

Opens a persistent connection to the .bdose file.

Parameters

Return value

An object of class "bd5con" to be passed to getbd5snp_con and closebd5con.

closebd5con

Explicitly closes the connection.

Parameters

Return value

NULL invisibly.

getbd5snp_con

Parameters

Return value

NULL invisibly. The four output vectors are updated in place.

The same copy-on-modify constraint as getbd5snp_buf applies.

Example

dosage <- numeric(n_samp)
p0     <- numeric(n_samp)
p1     <- numeric(n_samp)
p2     <- numeric(n_samp)

con <- openbd5con(bd5)

aaf_con <- numeric(n_snps)
for (i in seq_len(n_snps)) {
  getbd5snp_con(bd5info = bd5, snp = i, dosage = dosage,
                p0 = p0, p1 = p1, p2 = p2, bd5con = con)
  aaf_con[i] <- mean(dosage, na.rm = TRUE) / 2
}

closebd5con(con)

knitr::kable(
  data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_con, 4)),
  caption = "Alternate allele frequencies via getbd5snp_con"
)
Alternate allele frequencies via getbd5snp_con
snpid aaf
1:10000:C:A 0.3468
1:11000:T:C 0.0159
1:12000:T:C 0.2651
1:13000:T:C 0.2966
1:14000:G:C 0.2040
1:15000:A:C 0.5236
1:16000:G:A 0.4246
1:17000:C:A 0.4434
1:18000:C:G 0.2655
1:19000:T:G 0.2446

Choosing a reader

All three functions produce identical results, as confirmed below.

all(aaf == aaf_buf)
#> [1] TRUE
all(aaf == aaf_con)
#> [1] TRUE