The BinaryDosage package was originally developed to convert VCF and GEN files produced by imputation servers into a compact binary format suitable for genome-wide association and gene-environment interaction analyses. This document covers the original binary dosage formats (formats 1–4), the functions used to create and read them, and the workflow for applying analyses across all SNPs in a file.
Current recommendation: New work should use Format 5
via vcftobd. Format 5 uses per-SNP gzip compression and
stores all metadata in a companion .bdinfo file, making it
better suited for large modern datasets. See the vignette Using
Format 5 Binary Dosage Files for details.
The legacy formats remain fully supported and all files created with earlier versions of BinaryDosage are compatible with the current package.
Each binary dosage data set stores the following information:
Formats 1, 2, and 3 split data across three files:
saveRDS)saveRDS)Format 1 has an 8-byte header consisting of a magic word followed by a subformat indicator.
Format 2 uses the same 8-byte header as Format 1 but multiplies values by 20,000 (0x4e20) rather than 0x7ffe. This was adopted when it was found that imputation programs report values to only 3–4 decimal places and it was desirable to reproduce those values exactly. Missing = 0xffff.
Format 3 extends the Format 2 header to include validation information.
Format 4 embeds the family and map data directly in the binary dosage file header, eliminating the need for separate files. The header is followed by the family data, then the map data, and then the genotype data. Format 4 is the recommended legacy format and is the default for all conversion functions.
Format 4 can also store optional per-SNP statistics in the header:
Knowing that Pr(g=0) + Pr(g=1) + Pr(g=2) = 1 and d = Pr(g=1) + 2·Pr(g=2), only two values are needed to recover all four. For most subjects, either Pr(g=0) or Pr(g=2) is zero, so the dosage alone is sufficient:
\[ \Pr(g=1) = \begin{cases} d & \Pr(g=2)=0,\ d \le 1 \\ 2-d & \Pr(g=0)=0,\ d > 1 \end{cases} \]
The 16th bit of each stored dosage value is used as a flag: 0 means the above formula applies; 1 means an explicit Pr(g=1) value follows. If both Pr(g=0) and Pr(g=2) are non-zero, three values (dosage, Pr(g=0), Pr(g=2)) are written. This approach typically requires 2.2–2.4 bytes per subject per SNP.
| Function | Purpose |
|---|---|
vcftobdlegacy |
Convert a VCF file to a legacy binary dosage file |
gentobd |
Convert a GEN (Impute2) file to a legacy binary dosage file |
bdmerge |
Merge multiple legacy binary dosage files by subject |
getbdinfo |
Load metadata for a legacy binary dosage file |
getvcfinfo |
Load metadata for a VCF file |
getgeninfo |
Load metadata for a GEN file |
bdapply |
Apply a function to every SNP in a legacy binary dosage file |
vcfapply |
Apply a function to every SNP in a VCF file |
genapply |
Apply a function to every SNP in a GEN file |
getsnp |
Extract dosage and genotype probabilities for one SNP |
The vcftobdlegacy function converts VCF files to the
legacy binary dosage format. The default output format is 4 (single
file, all metadata embedded).
The package includes representative VCF files modelled on output from the Michigan Imputation Server (minimac). The following code locates them.
vcf1afile <- system.file("extdata", "set1a.vcf", package = "BinaryDosage")
vcf1ainfo <- system.file("extdata", "set1a.info", package = "BinaryDosage")
vcf1bfile <- system.file("extdata", "set1b.vcf", package = "BinaryDosage")
vcf1binfo <- system.file("extdata", "set1b.info", package = "BinaryDosage")These sets contain:
| Set | Subjects | SNPs |
|---|---|---|
| 1a | 60 | 10 |
| 1b | 40 | 10 |
Pass the VCF file name to vcftobdlegacy. An optional
imputation information file (e.g., from minimac) can be appended to the
vcffiles vector to embed per-SNP statistics.
Add gz = TRUE for gzip-compressed VCF files. The
information file, if provided, must remain uncompressed.
vcftobdlegacy reads the FORMAT field of each SNP and
looks for the DS (dosage) and GP (genotype probabilities) entries. Files
that contain only one of these are handled automatically.
The snpidformat parameter controls how SNP IDs are
stored.
| Value | Stored ID |
|---|---|
| 0 (default) | ID from the VCF file |
| 1 | chr:pos |
| 2 | chr:pos:ref:alt |
| -1 | Not stored; reconstructed as chr:pos:ref:alt on
read |
Setting snpidformat = -1 reduces file size when SNP IDs
are not needed.
vcf1brsfile <- system.file("extdata", "set1b_rssnp.vcf", package = "BinaryDosage")
bdfile_id0 <- tempfile()
bdfile_id1 <- tempfile()
bdfile_id2 <- tempfile()
bdfile_idm1 <- tempfile()
vcftobdlegacy(vcffiles = vcf1brsfile, bdfiles = bdfile_id0)
vcftobdlegacy(vcffiles = vcf1brsfile, bdfiles = bdfile_id1, snpidformat = 1L)
vcftobdlegacy(vcffiles = vcf1brsfile, bdfiles = bdfile_id2, snpidformat = 2L)
vcftobdlegacy(vcffiles = vcf1brsfile, bdfiles = bdfile_idm1, snpidformat = -1L)
snpnames <- data.frame(
format0 = getbdinfo(bdfile_id0)$snps$snpid,
format1 = getbdinfo(bdfile_id1)$snps$snpid,
format2 = getbdinfo(bdfile_id2)$snps$snpid,
formatm1 = getbdinfo(bdfile_idm1)$snps$snpid
)
knitr::kable(snpnames, caption = "SNP IDs by snpidformat")| format0 | format1 | format2 | formatm1 |
|---|---|---|---|
| 1:10000:C:A | 1:10000 | 1:10000:C:A | 1:10000:C:A |
| 1:11000:T:C | 1:11000 | 1:11000:T:C | 1:11000:T:C |
| 1:12000:T:C | 1:12000 | 1:12000:T:C | 1:12000:T:C |
| 1:13000:T:C | 1:13000 | 1:13000:T:C | 1:13000:T:C |
| 1:14000:G:C | 1:14000 | 1:14000:G:C | 1:14000:G:C |
| 1:15000:A:C | 1:15000 | 1:15000:A:C | 1:15000:A:C |
| 1:16000:G:A | 1:16000 | 1:16000:G:A | 1:16000:G:A |
| 1:17000:C:A | 1:17000 | 1:17000:C:A | 1:17000:C:A |
| 1:18000:C:G | 1:18000 | 1:18000:C:G | 1:18000:C:G |
| 1:19000:T:G | 1:19000 | 1:19000:T:G | 1:19000:T:G |
| rs12345 | 1:20000 | 1:20000:A:G | 1:20000:A:G |
When converting without an information file, bdoptions
can trigger on-the-fly calculation of alternate allele frequency
("aaf"), minor allele frequency ("maf"), and
an estimated imputation r-squared ("rsq").
bdfile1a_calc <- tempfile()
vcftobdlegacy(vcffiles = vcf1afile, bdfiles = bdfile1a_calc,
bdoptions = c("aaf", "maf", "rsq"))
#> [1] 0
bdcalcinfo <- getbdinfo(bdfile1a_calc)
knitr::kable(data.frame(bdcalcinfo$snpinfo),
caption = "Calculated per-SNP statistics", digits = 3)| aaf | maf | rsq |
|---|---|---|
| 0.347 | 0.347 | 0.956 |
| 0.016 | 0.016 | 0.434 |
| 0.265 | 0.265 | 0.933 |
| 0.297 | 0.297 | 0.952 |
| 0.204 | 0.204 | 0.970 |
| 0.524 | 0.476 | 0.955 |
| 0.425 | 0.425 | 0.951 |
| 0.443 | 0.443 | 0.948 |
| 0.265 | 0.265 | 0.927 |
| 0.245 | 0.245 | 0.964 |
GEN files are produced by the Impute2 imputation software. They are
text files with more variable formatting than VCF, so
gentobd exposes several parameters to handle different
layouts.
| Parameter | Description |
|---|---|
genfiles |
GEN file name and optional sample file name |
snpcolumns |
Column positions for chromosome, SNP ID, location, reference, alternate |
startcolumn |
Column where genotype probabilities begin (default 6) |
impformat |
Number of values per subject: 1=dosage only, 2=Pr(g=0)+Pr(g=1), 3=all three |
chromosome |
Chromosome value when not present in the file |
header |
Logical vector indicating whether GEN and sample files have headers |
gz |
TRUE if the GEN file is gzip-compressed |
sep |
Column separator |
bdfiles |
Output binary dosage file name(s) |
format |
Output binary dosage format (default 4) |
subformat |
Output subformat |
snpidformat |
How to store the SNP ID |
bdoptions |
Additional per-SNP statistics to calculate |
gen3afile <- system.file("extdata", "set3a.imp", package = "BinaryDosage")
gen3asample <- system.file("extdata", "set3a.sample", package = "BinaryDosage")
gen3bfile <- system.file("extdata", "set3b.imp", package = "BinaryDosage")
gen3bsample <- system.file("extdata", "set3b.sample", package = "BinaryDosage")The example GEN files have no chromosome column; the first column is
-- and the SNP ID encodes the chromosome as
chr:pos:ref:alt. Setting
snpcolumns = c(0L, 2L:5L) tells gentobd to
extract the chromosome from the SNP ID.
The five values in snpcolumns give the column indices
for chromosome, SNP ID, location, reference, and alternate allele
respectively.
0L for chromosome to extract it from the SNP
ID.-1L for chromosome and supply
chromosome = "1" to assign a fixed value.Use impformat when the file stores fewer than three
probability values per subject.
Add gz = TRUE for compressed GEN files. The sample file
must not be compressed.
gen4bgzfile <- system.file("extdata", "set4b.imp.gz", package = "BinaryDosage")
sample4bfile <- system.file("extdata", "set4b.sample", package = "BinaryDosage")
bdfile4b_gz <- tempfile()
gentobd(genfiles = c(gen4bgzfile, sample4bfile),
snpcolumns = c(1L, 2L, 4L, 5L, 6L),
startcolumn = 7L,
impformat = 2L,
gz = TRUE,
bdfiles = bdfile4b_gz)bdmerge combines multiple legacy binary dosage files
that share the same SNP set but have different subjects. Matching is
done by SNP ID.
| Parameter | Description |
|---|---|
mergefiles |
Output file name(s) |
bdfiles |
Character vector of input binary dosage file names |
format |
Output binary dosage format |
subformat |
Output subformat |
onegroup |
If FALSE, per-source-file SNP statistics are retained in the header |
bdoptions |
Per-SNP statistics to calculate for the merged file |
snpjoin |
"inner" or "outer" join on SNPs (default
inner) |
bd1afile <- system.file("extdata", "vcf1a.bdose", package = "BinaryDosage")
bd1bfile <- system.file("extdata", "vcf1b.bdose", package = "BinaryDosage")
bd1merged <- tempfile()
bdmerge(mergefiles = bd1merged, bdfiles = c(bd1afile, bd1bfile))
bd1ainfo <- getbdinfo(bd1afile)
bd1binfo <- getbdinfo(bd1bfile)
bd1merinfo <- getbdinfo(bd1merged)
cat("Set 1a subjects:", nrow(bd1ainfo$samples), "\n")
#> Set 1a subjects: 60
cat("Set 1b subjects:", nrow(bd1binfo$samples), "\n")
#> Set 1b subjects: 40
cat("Merged subjects:", nrow(bd1merinfo$samples), "\n")
#> Merged subjects: 100getbdinfo reads the header of a legacy binary dosage
file and returns a list of class "genetic-info" that is
required by bdapply and getsnp.
For Format 4 files, pass the single file name. For Formats 1–3, pass a character vector of length 3: binary dosage file, family file, and map file.
The returned list contains:
| Element | Description |
|---|---|
filename |
Full path to the binary dosage file |
usesfid |
Whether family IDs are present |
samples |
Data frame with columns fid and sid |
onechr |
TRUE if all SNPs are on one chromosome |
snpidformat |
Integer encoding the SNP ID format (0–3) |
snps |
Data frame: chromosome, location,
snpid, reference, alternate |
snpinfo |
Named list: aaf, maf,
avgcall, rsq (may be empty) |
datasize |
Byte sizes of per-SNP data blocks |
indices |
Byte offsets of per-SNP data blocks |
additionalinfo |
Format-specific metadata (class "bdose-info") |
The additionalinfo element contains:
| Element | Description |
|---|---|
format |
Binary dosage format (1–4) |
subformat |
Binary dosage subformat |
headersize |
Header size in bytes |
numgroups |
Number of merged source files |
groups |
Subject count per source file |
bdapply applies a user-supplied function to every SNP in
a legacy binary dosage file. It returns a list of length equal to the
number of SNPs.
The user function must accept dosage, p0,
p1, and p2 as its first four parameters.
Additional parameters can be passed through ....
calculateaaf <- function(dosage, p0, p1, p2) {
mean(dosage, na.rm = TRUE) / 2
}
bd1ainfo <- getbdinfo(bd1afile)
aaf_vals <- unlist(bdapply(bdinfo = bd1ainfo, func = calculateaaf))
aaf_table <- data.frame(snpid = bd1ainfo$snps$snpid, aaf = aaf_vals)
knitr::kable(aaf_table, caption = "Alternate allele frequencies", digits = 4)| snpid | aaf |
|---|---|
| 1:10000:C:A | 0.3468 |
| 1:11000:T:C | 0.0159 |
| 1:12000:T:C | 0.2651 |
| 1:13000:T:C | 0.2966 |
| 1:14000:G:C | 0.2040 |
| 1:15000:A:C | 0.5236 |
| 1:16000:G:A | 0.4246 |
| 1:17000:C:A | 0.4434 |
| 1:18000:C:G | 0.2655 |
| 1:19000:T:G | 0.2446 |
The built-in function getaaf provides this calculation
in the form expected by bdapply.
getsnp returns the dosage and genotype probabilities for
a single SNP, identified by 1-based index or SNP ID string.
snp_data <- data.frame(getsnp(bdinfo = bd1ainfo, snp = "1:12000:T:C", dosageonly = FALSE))
snp_table <- cbind(sid = bd1ainfo$samples$sid, snp_data)
knitr::kable(head(snp_table, 10), caption = "SNP 1:12000:T:C (first 10 subjects)", digits = 4)| sid | dosage | p0 | p1 | p2 |
|---|---|---|---|---|
| I1 | 1.000 | 0.000 | 1.000 | 0.000 |
| I2 | 0.000 | 1.000 | 0.000 | 0.000 |
| I3 | 0.000 | 1.000 | 0.000 | 0.000 |
| I4 | 0.000 | 1.000 | 0.000 | 0.000 |
| I5 | 0.000 | 1.000 | 0.000 | 0.000 |
| I6 | 1.016 | 0.012 | 0.959 | 0.029 |
| I7 | 0.000 | 1.000 | 0.000 | 0.000 |
| I8 | 0.000 | 1.000 | 0.000 | 0.000 |
| I9 | 0.000 | 1.000 | 0.000 | 0.000 |
| I10 | 0.000 | 1.000 | 0.000 | 0.000 |
Set dosageonly = TRUE (the default) to return only the
dosage vector.
For cases where conversion to binary dosage is not needed,
getvcfinfo / vcfapply and
getgeninfo / genapply allow functions to be
applied directly to source files.
getvcfinfo returns a "genetic-info" list
for a VCF file. The index parameter pre-indexes the file
for faster sequential reading (not available for compressed files).
vcfinfo <- getvcfinfo(vcffiles = vcf1afile, index = TRUE)
aaf_vcf <- unlist(vcfapply(vcfinfo = vcfinfo, func = getaaf))
aaf_compare <- data.frame(snpid = vcfinfo$snps$snpid, aaf = aaf_vcf)
knitr::kable(aaf_compare, caption = "AAF from VCF via vcfapply", digits = 4)| snpid | aaf |
|---|---|
| 1:10000:C:A | 0.3468 |
| 1:11000:T:C | 0.0159 |
| 1:12000:T:C | 0.2651 |
| 1:13000:T:C | 0.2966 |
| 1:14000:G:C | 0.2040 |
| 1:15000:A:C | 0.5236 |
| 1:16000:G:A | 0.4246 |
| 1:17000:C:A | 0.4434 |
| 1:18000:C:G | 0.2655 |
| 1:19000:T:G | 0.2446 |
The additional information in a "vcf-info" list
includes:
| Element | Description |
|---|---|
gzipped |
Whether the file is gzip-compressed |
headerlines |
Number of header lines |
headersize |
Header size in bytes |
quality, filter, info,
format |
Per-SNP VCF metadata columns |
datacolumns |
Data frame describing FORMAT field structure |
getgeninfo and genapply work analogously
for GEN files. The parameters mirror those of gentobd.
geninfo <- getgeninfo(genfiles = c(gen3bchrfile, gen3bsample), index = TRUE)
aaf_gen <- unlist(genapply(geninfo = geninfo, func = getaaf))
aaf_gen_table <- data.frame(snpid = geninfo$snps$snpid, aaf = aaf_gen)
knitr::kable(aaf_gen_table, caption = "AAF from GEN via genapply", digits = 4)| snpid | aaf |
|---|---|
| 1:10000_C_A | 0.3616 |
| 1:11000_T_C | 0.0098 |
| 1:12000_T_C | 0.2025 |
| 1:13000_T_C | 0.3988 |
| 1:14000_G_C | 0.1691 |
| 1:15000_A_C | 0.6212 |
| 1:16000_G_A | 0.5054 |
| 1:17000_C_A | 0.4793 |
| 1:18000_C_G | 0.2496 |
| 1:19000_T_G | 0.2408 |
The additional information in a "gen-info" list
includes:
| Element | Description |
|---|---|
gzipped |
Whether the file is gzip-compressed |
headersize |
Header size in bytes |
format |
Number of genotype probabilities per subject (1, 2, or 3) |
startcolumn |
Column where genotype data begins |
sep |
Column separator |
The following example reproduces the complete legacy workflow: convert two VCF files and two GEN files to binary dosage, merge each pair, apply a function to both merged files, extract one SNP, and verify that the VCF and GEN results agree.
library(BinaryDosage)
# --- Input files ---
vcf1afile <- system.file("extdata", "set1a.vcf", package = "BinaryDosage")
vcf1ainfo <- system.file("extdata", "set1a.info", package = "BinaryDosage")
vcf1bfile <- system.file("extdata", "set1b.vcf", package = "BinaryDosage")
vcf1binfo <- system.file("extdata", "set1b.info", package = "BinaryDosage")
gen3afile <- system.file("extdata", "set3a.imp", package = "BinaryDosage")
gen3asample <- system.file("extdata", "set3a.sample", package = "BinaryDosage")
gen3bfile <- system.file("extdata", "set3b.imp", package = "BinaryDosage")
gen3bsample <- system.file("extdata", "set3b.sample", package = "BinaryDosage")
# --- Convert to binary dosage ---
bdfile1a <- tempfile(); bdfile1b <- tempfile()
bdfile3a <- tempfile(); bdfile3b <- tempfile()
vcftobdlegacy(vcffiles = c(vcf1afile, vcf1ainfo), bdfiles = bdfile1a)
vcftobdlegacy(vcffiles = c(vcf1bfile, vcf1binfo), bdfiles = bdfile1b)
gentobd(genfiles = c(gen3afile, gen3asample), snpcolumns = c(0L, 2L:5L), bdfiles = bdfile3a)
gentobd(genfiles = c(gen3bfile, gen3bsample), snpcolumns = c(0L, 2L:5L), bdfiles = bdfile3b)
# --- Merge ---
mergebd1 <- tempfile(); mergebd3 <- tempfile()
bdmerge(mergefiles = mergebd1, bdfiles = c(bdfile1a, bdfile1b))
bdmerge(mergefiles = mergebd3, bdfiles = c(bdfile3a, bdfile3b))
# --- Apply function ---
mergebd1info <- getbdinfo(mergebd1)
mergebd3info <- getbdinfo(mergebd3)
aaf1 <- unlist(bdapply(mergebd1info, getaaf))
aaf3 <- unlist(bdapply(mergebd3info, getaaf))
aaf <- cbind(mergebd1info$snps[, c("chromosome", "location", "snpid")],
aaf_vcf = aaf1, aaf_gen = aaf3)
knitr::kable(aaf, caption = "AAF: VCF vs GEN sources", digits = 4, row.names = FALSE)| chromosome | location | snpid | aaf_vcf | aaf_gen |
|---|---|---|---|---|
| 1 | 10000 | 1:10000:C:A | 0.3527 | 0.3527 |
| 1 | 11000 | 1:11000:T:C | 0.0135 | 0.0135 |
| 1 | 12000 | 1:12000:T:C | 0.2400 | 0.2400 |
| 1 | 13000 | 1:13000:T:C | 0.3375 | 0.3375 |
| 1 | 14000 | 1:14000:G:C | 0.1901 | 0.1901 |
| 1 | 15000 | 1:15000:A:C | 0.5627 | 0.5627 |
| 1 | 16000 | 1:16000:G:A | 0.4569 | 0.4569 |
| 1 | 17000 | 1:17000:C:A | 0.4578 | 0.4578 |
| 1 | 18000 | 1:18000:C:G | 0.2591 | 0.2591 |
| 1 | 19000 | 1:19000:T:G | 0.2431 | 0.2431 |
# --- Extract SNP 6 ---
set1snp6 <- getsnp(mergebd1info, 6L)
set3snp6 <- getsnp(mergebd3info, 6L)
snpdf <- data.frame(
subjectid = mergebd1info$samples$sid,
dosage_vcf = unlist(set1snp6),
dosage_gen = unlist(set3snp6)
)
knitr::kable(head(snpdf, 10),
caption = "SNP 6 dosage: VCF vs GEN (first 10 subjects)",
digits = 4, row.names = FALSE)| subjectid | dosage_vcf | dosage_gen |
|---|---|---|
| I1 | 1.000 | 1.000 |
| I2 | 1.849 | 1.849 |
| I3 | 1.000 | 1.000 |
| I4 | 2.000 | 2.000 |
| I5 | 1.046 | 1.046 |
| I6 | 1.915 | 1.915 |
| I7 | 2.000 | 2.000 |
| I8 | 2.000 | 2.000 |
| I9 | 1.000 | 1.000 |
| I10 | 1.000 | 1.000 |
The alternate allele frequencies and dosage values are identical between the VCF-derived and GEN-derived files, confirming the two sources contain the same underlying data.