Using Format 5 Binary Dosage Files

Introduction

Format 5 is a binary dosage file format designed for reading imputed genotype data from bgzipped, tabix-indexed VCF files — such as those returned by the Michigan Imputation Server using minimac.

Format 5 differs from earlier formats (1–4) in two important ways.

First, it uses per-SNP gzip compression. Each SNP’s dosage and genotype probability values are compressed independently and written as a contiguous block in the data file. This means any SNP can be decompressed in isolation without reading surrounding data.

Second, the metadata — sample IDs, the SNP table, byte offsets, and file information — is stored in a companion RDS file rather than embedded in a binary header. This makes the metadata immediately accessible from R without reading the data file.

A Format 5 dataset consists of two files.

Example files

The BinaryDosage package includes a small bgzipped VCF file, set1a.vcf.gz, which contains dosage (DS) and genotype probability (GP) data for 60 subjects and 10 SNPs on chromosome 1. This file is used in all examples below.

The output files in each example are written to a temporary directory so that nothing is left behind after the vignette runs.

vcf_file   <- system.file("extdata", "set1a.vcf.gz", package = "BinaryDosage")
bd4_file   <- system.file("extdata", "vcf1a.bdose",  package = "BinaryDosage")
bdose_file <- file.path(tempdir(), "set1a.bdose")

Converting a VCF file to Format 5

The vcftobd function converts a bgzipped VCF file to a Format 5 file pair. The VCF file must contain the FORMAT fields DS (dosage) and GP (genotype probabilities), as produced by minimac and the Michigan Imputation Server.

The function takes the following parameters.

The following code converts set1a.vcf.gz to a Format 5 file pair.

if (requireNamespace("vcfppR", quietly = TRUE)) {
  vcftobd(vcffile = vcf_file, bdose_file = bdose_file)
} else {
  updatebd(bdfiles = bd4_file, bdose_file = bdose_file)
}

To convert only a specific region of the file, supply the region parameter. The VCF file must have a tabix index (a .tbi file alongside it).

vcftobd(vcffile    = vcf_file,
         bdose_file = bdose_file,
         region     = "1:10000-15000")

Loading Format 5 file information

The getbdinfo function reads and validates a Format 5 file pair, returning a list of class "genetic-info" that is used by getbd5snp to retrieve individual SNPs. It automatically detects the Format 5 magic header and delegates to the internal Format 5 reader.

The function takes the following parameters.

bd5 <- getbdinfo(bdose_file)

The object returned has the same structure as the list returned by getbdinfo for older formats. For a full description of all fields see Genetic File Information. The key fields are described below.

File and format metadata

cat("Class:          ", class(bd5), "\n")
#> Class:           genetic-info
cat("Uses family ID: ", bd5$usesfid, "\n")
#> Uses family ID:  FALSE
cat("One chromosome: ", bd5$onechr, "\n")
#> One chromosome:  TRUE
cat("SNP ID format:  ", bd5$snpidformat, "\n")
#> SNP ID format:   2
cat("Format:         ", bd5$additionalinfo$format, "\n")
#> Format:          5
cat("Header size:    ", bd5$additionalinfo$headersize, "bytes\n")
#> Header size:     4 bytes

Sample information

The samples element is a data frame with one row per subject. The fid column holds the family ID (empty for VCF files, which carry no family information) and the sid column holds the subject ID.

cat("Number of samples:", nrow(bd5$samples), "\n")
#> Number of samples: 60
knitr::kable(head(bd5$samples, 5), caption = "First 5 samples")
First 5 samples
fid sid
I1
I2
I3
I4
I5

SNP information

The snps element is a data frame with one row per SNP.

cat("Number of SNPs:", nrow(bd5$snps), "\n")
#> Number of SNPs: 10
knitr::kable(bd5$snps, caption = "SNP table")
SNP table
chromosome location snpid reference alternate
1 10000 1:10000:C:A C A
1 11000 1:11000:T:C T C
1 12000 1:12000:T:C T C
1 13000 1:13000:T:C T C
1 14000 1:14000:G:C G C
1 15000 1:15000:A:C A C
1 16000 1:16000:G:A G A
1 17000 1:17000:C:A C A
1 18000 1:18000:C:G C G
1 19000 1:19000:T:G T G

Byte offsets

The indices element is a numeric vector of byte offsets into the .bdose file, one per SNP, giving the start of that SNP’s compressed block. These are used internally by getbd5snp.

cat("Indices (bytes):", bd5$indices, "\n")
#> Indices (bytes): 4 231 380 595 822 1025 1243 1486 1715 1966

Reading SNP data

The getbd5snp function decompresses and returns the dosage and genotype probability data for a single SNP. The general-purpose getsnp function also works with Format 5 files and can be used as an alternative.

For details on all three Format 5 SNP reader functions — including getbd5snp_buf (pre-allocated output vectors) and getbd5snp_con (persistent file connection for high-throughput loops) — see the Reading SNPs from Format 5 Files vignette.

The function takes the following parameters.

The function returns a list with four numeric vectors, each of length equal to the number of samples.

Reading a SNP by index

The following code retrieves the first SNP by its 1-based index.

snp1 <- getbd5snp(bd5info = bd5, snp = 1L)

snp1_df <- data.frame(
  SampleID = bd5$samples$sid,
  Dosage   = round(snp1$dosage, 4),
  P_00     = round(snp1$p0,     4),
  P_01     = round(snp1$p1,     4),
  P_11     = round(snp1$p2,     4)
)

knitr::kable(head(snp1_df, 10),
             caption = paste("SNP", bd5$snps$snpid[1], "— first 10 subjects"))
SNP 1:10000:C:A — first 10 subjects
SampleID Dosage P_00 P_01 P_11
I1 2.000 0.000 0.000 1.000
I2 1.002 0.000 0.998 0.002
I3 1.000 0.000 1.000 0.000
I4 0.122 0.878 0.122 0.000
I5 1.000 0.000 1.000 0.000
I6 0.000 1.000 0.000 0.000
I7 1.004 0.000 0.996 0.004
I8 2.000 0.000 0.000 1.000
I9 0.000 1.000 0.000 0.000
I10 0.918 0.082 0.918 0.000

Reading a SNP by ID

A SNP can also be retrieved by passing its SNP ID as a character string.

snp_id <- "1:12000:T:C"
snp3   <- getbd5snp(bd5info = bd5, snp = snp_id)

snp3_df <- data.frame(
  SampleID = bd5$samples$sid,
  Dosage   = round(snp3$dosage, 4),
  P_00     = round(snp3$p0,     4),
  P_01     = round(snp3$p1,     4),
  P_11     = round(snp3$p2,     4)
)

knitr::kable(head(snp3_df, 10),
             caption = paste("SNP", snp_id, "— first 10 subjects"))
SNP 1:12000:T:C — first 10 subjects
SampleID Dosage P_00 P_01 P_11
I1 1.000 0.000 1.000 0.000
I2 0.000 1.000 0.000 0.000
I3 0.000 1.000 0.000 0.000
I4 0.000 1.000 0.000 0.000
I5 0.000 1.000 0.000 0.000
I6 1.016 0.012 0.959 0.029
I7 0.000 1.000 0.000 0.000
I8 0.000 1.000 0.000 0.000
I9 0.000 1.000 0.000 0.000
I10 0.000 1.000 0.000 0.000

Applying a function to all SNPs

The simplest option is bdapply, which handles the loop internally and uses getbd5snp_con automatically for Format 5 files.

getaaf <- function(dosage, p0, p1, p2) mean(dosage, na.rm = TRUE) / 2

aaf_list  <- bdapply(bdinfo = bd5, func = getaaf)
aaf_table <- data.frame(snpid = bd5$snps$snpid,
                        aaf   = round(unlist(aaf_list), 4))

knitr::kable(aaf_table, caption = "Alternate allele frequencies via bdapply")
Alternate allele frequencies via bdapply
snpid aaf
1:10000:C:A 0.3468
1:11000:T:C 0.0159
1:12000:T:C 0.2651
1:13000:T:C 0.2966
1:14000:G:C 0.2040
1:15000:A:C 0.5236
1:16000:G:A 0.4246
1:17000:C:A 0.4434
1:18000:C:G 0.2655
1:19000:T:G 0.2446

For manual loops, getbd5snp_buf and getbd5snp_con are faster than getbd5snp because they avoid allocating a new list on every call. getbd5snp_con is the fastest option as it also keeps the file open across all iterations. See the Reading SNPs from Format 5 Files vignette for details and a performance comparison.

n_snps <- nrow(bd5$snps)
n_samp <- nrow(bd5$samples)
dosage <- numeric(n_samp)
p0     <- numeric(n_samp)
p1     <- numeric(n_samp)
p2     <- numeric(n_samp)

con <- openbd5con(bd5)
aaf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
  getbd5snp_con(bd5info = bd5, snp = i,
                dosage = dosage, p0 = p0, p1 = p1, p2 = p2,
                bd5con = con)
  aaf[i] <- mean(dosage, na.rm = TRUE) / 2
}
closebd5con(con)

knitr::kable(data.frame(snpid = bd5$snps$snpid, aaf = round(aaf, 4)),
             caption = "Alternate allele frequencies via getbd5snp_con")
Alternate allele frequencies via getbd5snp_con
snpid aaf
1:10000:C:A 0.3468
1:11000:T:C 0.0159
1:12000:T:C 0.2651
1:13000:T:C 0.2966
1:14000:G:C 0.2040
1:15000:A:C 0.5236
1:16000:G:A 0.4246
1:17000:C:A 0.4434
1:18000:C:G 0.2655
1:19000:T:G 0.2446