Type: | Package |
Title: | Bayesian Inference from the Conditional Genetic Stock Identification Model |
Version: | 0.3.4 |
Maintainer: | Eric C. Anderson <eric.anderson@noaa.gov> |
Description: | Implements Bayesian inference for the conditional genetic stock identification model. It allows inference of mixed fisheries and also simulation of mixtures to predict accuracy. A full description of the underlying methods is available in a recently published article in the Canadian Journal of Fisheries and Aquatic Sciences: <doi:10.1139/cjfas-2018-0016>. |
License: | CC0 |
LazyData: | TRUE |
Depends: | R (≥ 3.3.0) |
Imports: | dplyr, gtools, magrittr, Rcpp (≥ 0.12.5), readr, rlang, stringr, tibble, tidyr, RcppParallel |
LinkingTo: | Rcpp, RcppParallel |
SystemRequirements: | GNU make |
RoxygenNote: | 7.2.3 |
Suggests: | knitr, rmarkdown, ggplot2 |
VignetteBuilder: | knitr |
Encoding: | UTF-8 |
NeedsCompilation: | yes |
Packaged: | 2024-01-23 21:49:33 UTC; eriq |
Author: | Eric C. Anderson [aut, cre], Ben Moran [aut] |
Repository: | CRAN |
Date/Publication: | 2024-01-24 14:40:05 UTC |
Pipe operator
Description
Pipe operator
Usage
lhs %>% rhs
Generate a random population structure and mixture sample, as in Hasselman et al. 2015
Description
Creates random reporting unit (rho) and collection (omega) proportions, and a
sim_colls
vector for simulation of individual genotypes, based on the methods
used in Hasselman et al. (2015)
Usage
Hasselman_sim_colls(RU_starts, RU_vec, size = 100)
Arguments
RU_starts |
a vector delineating the reporting units in |
RU_vec |
a vector of collection indices, grouped by reporting unit;
generated by |
Details
This function is designed specifically to recreate the simulations in Hasselman et al. (2015), to check for the bias that was observed therein. Rho (reporting unit proportions) is chosen with alphas of 1.5, and omega (collection proportions) chosen with the same alpha, then scaled by the corresponding rho.
Value
Hasselman_sim_colls
returns a list with three elements.
The first two are a rho vector and an omega vector, respectively,
both with alpha parameters = 1.5. The third is a vector of origins for
simulated individuals, sampled from the collections with probabilities = omega
Convert data frame of allele frequencies to nested lists
Description
List-izes the output of reference_allele_counts
into a usable format for allelic_list
Usage
a_freq_list(D, pop_level = "collection")
Arguments
D |
the long-format dataframe of counts by collection, locus, and allele,
output by |
Value
a_freq_list
returns a list named by loci, each element of which is a matrix
containing that locus's allele count data. Rows in the matrix mark alleles, and columns collections
Examples
# Generate a list of individual genotypes by allele from
# the alewife data's reference allele count tables
example(reference_allele_counts)
ale_ac <- a_freq_list(ale_rac)
Microsat data from alewife herring reference populations
Description
Standard two-column genetic data with lots of other columns preceding it. Can be fed directly into rubias because it has at least the columns sample_type, collection, repunit and indiv.
Format
A tibble.
Source
https://datadryad.org/stash/dataset/doi:10.5061/dryad.80f4f
Create genotype lists for each locus
Description
Uses the allele counts from a_freq_list
and the cleaned short-format output of
tcf2long
to generate a nested list of individual genotypes for each locus
Usage
allelic_list(cs, ac, samp_type = "both")
Arguments
cs |
a clean short genetic data matrix; the second element of the
output from |
ac |
allele counts from a_freq_list |
samp_type |
choose which sample types of individuals to include in output: "mixture", "both", or "reference" |
Value
allelic_list
returns a two-component nested list, with data stored as character
names of alleles ($chr) or as integer indices for the alleles ($int). Both forms contain lists
representing to loci, with two component vectors corresponding to gene copies a and b.
Examples
example(a_freq_list)
ale_cs <- ale_long$clean_short
# Get the vectors of gene copies a and b for all loci in integer index form
ale_alle_list <- allelic_list(ale_cs, ale_ac)$int
Test the effects of the parametric bootstrap bias correction on a reference dataset through cross-validation
Description
This is a rewrite of bias_comparison(). Eric didn't want the plotting to be wrapped up in a function, and wanted to return a more informative data frame.
Usage
assess_pb_bias_correction(
reference,
gen_start_col,
seed = 5,
nreps = 50,
mixsize = 100,
alle_freq_prior = list(const_scaled = 1)
)
Arguments
reference |
a two-column format genetic dataset, with a "repunit" column specifying each individual's reporting unit of origin, a "collection" column specifying the collection (population or time of sampling) and "indiv" providing a unique name |
gen_start_col |
the first column containing genetic data in |
seed |
the random seed for simulations |
nreps |
The number of reps to do. |
mixsize |
The size of each simulated mixture sample. |
alle_freq_prior |
a one-element named list specifying the prior to be used when
generating Dirichlet parameters for genotype likelihood calculations. Valid methods include
|
Details
Takes a reference two-column genetic dataset, pulls a series of random "mixture" datasets with varying reporting unit proportions from this reference, and compares the results of GSI through standard MCMC vs. parametric-bootstrap MCMC bias correction
The amount of bias in reporting unit proportion calculations increases with the rate of misassignment between reporting units (decreases with genetic differentiation), and increases as the number of collections within reporting units becomes more uneven.
Output from the standard Bayesian MCMC method demonstrates the level of bias to be expected for the input data set, and parametric bootstrapping is an empirical method for the removal of any existing bias.
Value
bias_comparison
returns a list; the first element is
a list of the relevant rho values generated on each iteration of the random "mixture"
creation. This includes the true rho value, the standard result rho_mcmc
,
and the parametric bootstrapped rho_pb
.
The second element is a dataframe listing summary statistics for each
reporting unit and estimation method. mse
, the mean squared error, summarizes
the deviation of the rho estimates from their true value, including both bias and other variance.
mean_prop_bias
is the average ratio of residual to true value, which gives greater
weight to deviations at smaller values. mean_bias
is simply the average residual;
unlike mse
, this demonstrates the direction of the bias.
Examples
## Not run:
## This takes too long to run in R CMD CHECK
ale_bias <- assess_pb_bias_correction(alewife, 17)
## End(Not run)
Simulate mixtures and estimate reporting group and collection proportions.
Description
From a reference dataset, this creates a genotype-logL matrix based on simulation-by-individual with randomly drawn population proportions, then uses this in two different estimates of population mixture proportions: maximum likelihood via EM-algorithm and posterior mean from MCMC.
Usage
assess_reference_loo(
reference,
gen_start_col,
reps = 50,
mixsize = 100,
seed = 5,
alpha_repunit = 1.5,
alpha_collection = 1.5,
resampling_unit = "individual",
alle_freq_prior = list(const_scaled = 1),
printSummary = FALSE,
return_indiv_posteriors = FALSE
)
Arguments
reference |
a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has some "reference" entries |
gen_start_col |
the first column of genetic data in |
reps |
number of reps of mixture simulation and MCMC to do |
mixsize |
the number of individuals in each simulated mixture |
seed |
a random seed for simulations |
alpha_repunit |
If a vector, this is the dirichlet parameter for simulating
the proportions of reporting units. Gets recycled to the number of reporting units. Default is 1.5.
Otherwise, this could be a two-column data frame. The first column must be named "repunit" and the
second one must be one of "dirichlet", "ppn", or "cnt", according to whether you wish to
specify dirichlet parameters, or proportions, or exact counts, respectively, for each population.
If you want to make multiple simulations, pass in a list of data frames or of individual dirichlet parameters.
For examples, see |
alpha_collection |
The dirichlet parameter for simulating proportions of collections within reporting units. Default = 1.5.
If this is a data frame then the first column must be "collection" and the second must be one of
"dirichlet", "ppn", "cnt", "sub_dirichlet", "sub_ppn". If you want to provide multiple different
scenarios. You can pass them in as a list. If alpha_repunit or alpha_collection is a list with length
greater than 1, the shorter will be recycled.
For examples, see |
resampling_unit |
what unit should be resampled. Currently the choices are "individuals" (the default) and "gene_copies". Using "individuals" preserves missing data patterns available in the reference data set. We also have "gene_copies_with_missing" capability, but it is not yet linked into this function. |
alle_freq_prior |
a one-element named list specifying the prior to be used when
generating Dirichlet parameters for genotype likelihood calculations. Valid methods include
|
printSummary |
if TRUE a summary of the reference samples will be printed to stdout. |
return_indiv_posteriors |
if TRUE, output is a list of 2. The first entry, |
Examples
# very small number of reps so it is quick enough for example
ale_dev <- assess_reference_loo(alewife, 17, reps = 5)
Partition a reference dataset and estimate reporting group and collection proportions
Description
From a reference dataset, this draws (without replacement) a simulated mixture dataset with randomly drawn population proportions, then uses this in two different estimates of population mixture proportions: maximum likelihood via EM-algorithm and posterior mean from MCMC.
Usage
assess_reference_mc(
reference,
gen_start_col,
reps = 50,
mixsize = 100,
seed = 5,
alpha_repunit = 1.5,
alpha_collection = 1.5,
min_remaining = 5,
alle_freq_prior = list(const_scaled = 1)
)
Arguments
reference |
a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has some "reference" entries. |
gen_start_col |
the first column of genetic data in reference |
reps |
number of reps to do |
mixsize |
the number of individuals in each simulated mixture. |
seed |
a random seed for simulations |
alpha_repunit |
The dirichlet parameter for simulating the proportions of reporting units. Default = 1.5 |
alpha_collection |
The dirichlet parameter for simulating proportions of collections within reporting units. Default = 1.5 |
min_remaining |
the minimum number of individuals which should be conserved in each reference collection during sampling without replacement to form the simulated mixture |
alle_freq_prior |
a one-element named list specifying the prior to be used when
generating Dirichlet parameters for genotype likelihood calculations. Valid methods include
|
Details
This method is referred to as "Monte Carlo cross-validation".
The input parameters for assess_reference_mc
are more restrictive than those of
assess_reference_loo
. Rather than allowing a data.frame to specify Dirichlet
parameters, proportions, or counts for specific reporting units and collections,
assess_reference_mc
only allows vector input (default = 1.5) for alpha_repunit
and alpha_collection
. These inputs specify the uniform Dirichlet parameters for
all reporting units and collections, respectively.
For mixture proportion generation, the rho values are first drawn using a stick-breaking
model of the Dirichlet distribution, but with proportions capped by min_remaining
.
Stick-breaking is then used to subdivide each reporting unit into collections. In
addition to the constraint that mixture sampling without replacement cannot deplete
the number of individuals in each collection below min_remaining
, a similar
constraint is placed upon the number of individuals left in reporting units,
determined as min_remaining
* (# collections in reporting unit).
Note that this implies that the data are only truly Dirichlet distributed when no
rejections based on min_remaining
occur. This is a reasonable certainty with
sufficient reference collection sizes relative to the desired mixture size.
Examples
# only 5 reps, so it doesn't take too long. Typically you would
# do many more
ale_dev <- assess_reference_mc(alewife, 17, 5)
Get the average within-RU assignment rate for each collection
Description
This function takes a matrix of scaled genotype likelihoods for a group of individuals of known origin, and calculates the average rate at which individuals in a particular collection are assigned to the correct reporting unit.
Usage
avg_coll2correctRU(SL, coll, RU_starts, RU_vec)
Arguments
SL |
a scaled likelihood matrix; each column should sum to one, and represent the probability of assignments to each collection (row) for a particular individual |
coll |
a vector of the collection of origin indices of the individuals (length = |
RU_starts |
a vector delineating starting indices of different reporting units in RU_vec |
RU_vec |
a vector of collection indices, organized by reporting unit |
Details
The average rate of correct within-reporting unit assignment is proportional to
reporting-unit-level bias in the posterior probability for this collection;
if the correct assignment rate is high relative to other collections,
it will be upwardly biased, and vice versa. The inverse of this vector is used
to scale Dirichlet draws of omega
during misassignment-scaled MCMC.
Value
avg_coll2correctRU
returns a vector of length nrow(SL)
, where
each element represents the average proportion of fish from the corresponding collection
which are correctly assigned to the proper collection, or misassigned to another
collection within the same reporting unit. This is distinct from the rate of
correct assignment at the collection level, which is too low and variable to serve
as a stable metric for omega
scaling.
Examples
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames
params <- tcf2param_list(alewife, 17, ploidies = ploidies)
SL <- geno_logL(params) %>% exp() %>% apply(2, function(x) x/sum(x))
correct <- avg_coll2correctRU(SL, params$coll, params$RU_starts, params$RU_vec)
Microsat data from blueback herring reference populations
Description
Standard two-column genetic data with lots of other columns preceding it. Can be fed directly into rubias because it has at least the columns sample_type, collection, repunit and indiv.
Format
A tibble.
Source
https://datadryad.org/stash/dataset/doi:10.5061/dryad.80f4f
Perform a parametric bootstrapping correction on an estimated rho vector
Description
Takes an estimate of rho, and a two-column format genetic data frame containing both reference and mixture data. Returns a new rho corrected by parametric bootstrapping
Usage
bootstrap_rho(
rho_est,
pi_est,
D,
gen_start_col,
niter = 100,
reps = 2000,
burn_in = 100,
pi_prior = NA,
pi_prior_sum = 1
)
Arguments
rho_est |
the rho value previously estimated from MCMC GSI from the provided reference and mixture data |
pi_est |
the pi value previously estimated from MCMC GSI from the provided reference and mixture data |
D |
a two-column genetic dataframe containing the reference and mixture
data from which |
gen_start_col |
the first column of genetic data in D. All columns after
|
pi_prior |
The prior to be added to the collection allocations, in order
to generate pseudo-count Dirichlet parameters for the simulation of a new pi vector.
Non-default values should be a vector of length equal to the number of populations
in the reference dataset. Default value of NA leads to the
calculation of a symmetrical prior based on |
pi_prior_sum |
total weight on default symmetrical prior for pi. In parametric bootstrapping, |
Value
bootstrap_rho
returns a new rho value, corrected by parametric
bootstrapping.
check a baseline and mixture file together to ensure the known_collections are valid if they exist
Description
Simple function that checks for known_collections columns in a reference and mixture and makes sure that they are compliant. If there is a non-NA entry in the Mixture frame's known_collection column this function returns TRUE. Otherwise it returns FALSE.
Usage
check_known_collections(R, M)
Arguments
R |
reference data frame |
M |
mixture data frame |
A helper function to check that the input data frame is OK
Description
Just checks to make sure that column types are correct.
Usage
check_refmix(D, gen_start_col, type = "reference")
Arguments
D |
the data frame |
gen_start_col |
the column in which the genetic data starts |
type |
For writing errors, supply "mixture" or "reference" as appropriate. |
Details
It also checks the patterns of missing data, and from that infers whether markers are haploid or diploid.
SNP data from chinook reference populations
Description
Chinook salmon baseline data similar to that which can be downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.574sv. This data set includes 91 SNPs and 7301 fish and is what the Dryad data became after we converted from TaqMan to SNPtype assays (being forced to toss some loci) and tossed out a bunch of lousy historical samples from Trinity River.
Format
A tbl_df-ed (from dplyr) data frame with 7,301 rows and 185 variables. The first three columns are
- repunit (chr)
the reporting unit that the individual is in
- pop (chr)
the population from which the individual was sampled
- ID (chr)
Unique identifier of the individual fish
The remaining columns are two columns for each locus. These columns are named like, "Locus.1" and "Locus.2" for the first and second gene copies at that locus. For example, "Ots_104569-86.1" and "Ots_104569-86.2". The locus columns are ints and missing data is denoted by NA.
Source
https://datadryad.org/stash/dataset/doi:10.5061/dryad.574sv
a vector that gives a desired sort order of the chinook collections
Description
This is just an example of what one would use as levels in order to
get the chinook
collections in a desired sort order after
analysis. The issue here is collection in the input data frame to most
functions must be a character vector, not a factor. But, after analysis
you can always make them a factor again and use a vector like this
one to specify the levels.
Source
Made it up!
SNP data from Chinook salmon taken in May/August 2015 from California fisheries
Description
This has data from 91 SNP markers (a subset of the 95 markers in the chinook
baseline
data set).
Format
A tbl_df-ed (from dplyr) data frame with 2256 rows and 193 variables. The first four columns are meta data. The remaining columns are two columns for each locus. These columns are named like, "Locus.1" and "Locus.2" for the first and second gene copies at that locus. For example, "Ots_104569-86.1" and "Ots_104569-86.2". The locus columns are ints and missing data is denoted by NA.
Source
Southwest Fisheries Science Center, Santa Cruz, CA
a vector that gives a desired sort order of the chinook repunits
Description
This is just an example of what one would use as levels in order to
get the chinook
repunits in a desired sort order after
analysis. The issue here is that repunit in the input data frame to most
functions must be a character vector, not a factor. But, after analysis
you can always make them a factor again and use a vector like this
one to specify the levels.
Source
Made it up!
check for matching (or close to matching) genotypes in a data frame
Description
Super simple function that looks at all pairs of fish from the data frame and returns a tibble that includes those which shared a fraction >= than min_frac_non_miss of the genotypes not missing in either fish, and which were matching at a fraction >= min_frac_matching of those non-missing pairs of genotypes.
Usage
close_matching_samples(
D,
gen_start_col,
min_frac_non_miss = 0.7,
min_frac_matching = 0.9
)
Arguments
D |
a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has entried either of "reference" or "mixture" or both. |
gen_start_col |
the first column of genetic data in |
min_frac_non_miss |
the fraction of loci that the pair must share non missing in order to be reported |
min_frac_matching |
the fraction of shared non-missing loci that must be shared between the indivdiuals to be reported as a matching pair. |
Value
a tibble ...
Examples
# one pair found in the interal alewife data set:
close_matching_samples(alewife, 17)
for each individual count the number or loci missing and non_missing
Description
Takes a rubias genetic data frame that must have column "indiv". Haploids have the second column at each locus totally missing. Diploids with missing data will have both gene copies missing.
Usage
count_missing_data(D, gen_start_col)
Arguments
D |
the data frame |
gen_start_col |
the column in which the genetic data starts |
Value
returns a data frame with indiv (as characters), n_non_miss_loci, n_miss_loci (as numeric) and missing_loci (as a list-column of named integer vectors)
Create a vector of pi Dirichlet priors with specified values for one or more collections
Description
This handles a case in which the user provides a data frame for pi_prior
. The
data frame lists desired Dirichlet parameter priors for at least one reference collection,
and/or a default value for all unspecified collections.
Usage
custom_pi_prior(P, C)
Arguments
P |
A data frame of one or more desired pi prior parameters. One column, "collection", is a character vector, with valid values including the names of any reference collections, or the special value "DEFAULT_PI". The second column, "pi_param" is the prior value to be used for each collection. |
C |
a tibble with a column "collection" collection names |
Details
Input checking is currently done in the early stages of infer_mixture
in order to
throw errors before long processing times, and avoid re-checking during bootstrap_rho
.
Given a vector of different categories in 1...n and a prior, simulate a Dirichlet random vector
Description
Takes a vector of collection indices to which individuals (vector elements) were assigned, and returns a Dirichlet random variable generated by adding the prior to the sum of each collection's occurrences, and simulating an alpha from a gamma distribution with this shape parameter.
Usage
dirch_from_allocations(C, lambda)
Arguments
C |
a vector giving different categories of individual (not counts of categories - untabulated) |
lambda |
priors for the categories |
Details
The categories are labeled in C from 1 up to n. n is the length of lambda
,
which is a vector of priors. Note that all elements of lambda
must be strictly greater than 0.
Given a vector of counts for different categories in 1...n and a prior, simulate a Dirichlet random vector
Description
Takes a vector of counts for 1:n collections, and returns a Dirichlet random variable generated by adding the prior to each collection value, and simulating an alpha from a gamma distribution with this shape parameter.
Usage
dirch_from_counts(C, lambda)
Arguments
C |
a vector giving counts of categories |
lambda |
priors for the categories |
Details
The categories are labeled in C from 1 up to n. n is the length of lambda
,
which is a vector of priors. Note that all elements of lambda
must be strictly greater than 0.
Calculate a matrix of genotype log-likelihoods for a genetic dataset
Description
Takes a list of key parameters from a genetic dataset, and calculates the log-likelihood of each individual's genotype, given the allele counts in each collection
Usage
geno_logL(par_list)
Arguments
par_list |
genetic data converted to the param_list format by |
Details
Leave-One-Out cross-validation is used to avoid bias in log-likelihood for an individual's known collection of origin
Value
geno_logL
returns a matrix with C rows and I columns. Each column
represents an individual, and each row the log-likelihood of that individual's
genotype, given the allele counts in that collection
Examples
example(tcf2param_list)
ale_glL <- geno_logL(ale_par_list)
Calculate a matrix of sum-of-squares of genotype log-likelihoods for a genetic dataset
Description
Takes a list of key parameters from a genetic dataset, and calculates the sum of squared log-likelihood of each individual's genotype, given the allele counts in each collection. This is used for the quick-and-dirty Z-score calculations.
Usage
geno_logL_ssq(par_list)
Arguments
par_list |
genetic data converted to the param_list format by |
Details
Leave-One-Out cross-validation is used to avoid bias in log-likelihood for an individual's known collection of origin
Value
geno_logL
returns a matrix with C rows and I columns. Each column
represents an individual, and each row the log-likelihood of that individual's
genotype, given the allele counts in that collection
Examples
example(tcf2param_list)
ale_glL <- geno_logL(ale_par_list)
infer the ploidy from the pattern of NAs in the columns of data
Description
This is strictly internal
Usage
get_ploidy_from_frame(tmp, type)
Arguments
tmp |
a data frame with 2 * L columns (two for each locus) |
Simulate genotype log-likelihoods from a population by gene copy
Description
Takes a list of parameters from a genetic dataset, and returns a genotype log-likelihood matrix for individuals simulated by gene copy from the specified collections
Usage
gprob_sim_gc(par_list, sim_colls)
Arguments
par_list |
genetic data converted to the param_list format by |
sim_colls |
a vector of indices for the collections desired for simulation; each element of the list corresponds to an individual |
Details
In simulation by gene copy, the genotype at a locus for any individual is the result of two random draws from the allele count matrix of that locus. Draws within an individual are performed without replacement, but allele counts are replaced between individuals.
Value
gprob_sim
returns a matrix of the summed log-likelihoods
for all loci of a simulated population mixture; columns represent individuals,
with each row containing their log-likelihood of belonging to the collection
of the same index, given the selection of two independent gene copies from the
desired collection of origin's reference allele frequencies
Examples
example(tcf2param_list)
sim_colls <- sample(ale_par_list$C, 1070, replace = TRUE)
ale_sim_gprobs_gc <- gprob_sim_gc(ale_par_list, sim_colls)
Simulate genotypes by gene copy, with missing data from chosen individuals
Description
Takes a list of parameters from a genetic dataset, and returns a genotype log-likelihood matrix for individuals simulated by gene copy from the specified collections, with genotypes masked by missing data patterns from reference individuals
Usage
gprob_sim_gc_missing(par_list, sim_colls, sim_missing)
Arguments
par_list |
genetic data converted to the param_list format by |
sim_colls |
a vector; element i specifies the collection from which to sample the genotypes for individual i |
sim_missing |
a vector; element i specifies the index for the individual in params$I whose missing data should be copied for individual i |
Details
In simulation by gene copy, the genotype at a locus for any individual is the result
of two random draws from the allele count matrix of that locus. Draws within an individual
are performed without replacement, but allele counts are replaced between individuals.
If the data at a particular locus is missing for individual i in sim_missing
,
this data will also be missing in simulated individual i for the
log-likelihood calculation.
Examples
# If one wanted to simulate the missing data patterns
# of a troublesome mixture dataset, one would run tcf2param_list,
# selecting samp_type = "mixture", then draw sim_miss from
# the mixture individual genotype list
# make a fake mixture data set to demonstrate...
drawn <- mixture_draw(alewife, rhos = c(1/3, 1/3, 1/3),N = 100)
ref <- drawn$reference
mix <- drawn$mix
# then run it...
# we have to get the ploidies to pass to tcf2param_list
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames
params <- tcf2param_list(rbind(ref,mix), 17, samp_type = "mixture", ploidies = ploidies)
sim_colls <- sample(params$C, 1070, replace = TRUE)
sim_miss <- sample(length(params$coll), 1070, replace = TRUE)
ale_sim_gprobs_miss <- gprob_sim_gc_missing(params, sim_colls, sim_miss)
Simulate genotype log-likelihoods from a population by individual
Description
Takes a list of parameters from a genetic dataset, and returns a genotype log-likelihood matrix for individuals simulated by individual from the specified collections
Usage
gprob_sim_ind(par_list, sim_colls)
Arguments
par_list |
genetic data converted to the param_list format by |
sim_colls |
a vector of indices for the collections desired for simulation; each element of the list corresponds to an individual |
Details
In simulation by individual, the genotype for any simulated individual is the result of a single random draw from the genotypes of all individuals in the collection. Gene copies and loci are therefore not independent.
Value
gprob_sim
returns a matrix of the summed log-likelihoods
for all loci of a simulated population mixture; columns represent individuals,
with each row containing their log-likelihood of belonging to the collection
of the same index, given the selection of an individual's genotype from the
reference collection of interest. Selection at the locus and gene copy level
are not independent, and missing data is included in selection.
Examples
example(tcf2param_list)
sim_colls <- sample(ale_par_list$C, 1070, replace = TRUE)
ale_sim_gprobs_ind <- gprob_sim_ind(ale_par_list, sim_colls)
EM algorithm from the simplest GSI model for pi and the individual posterior probabilities
Description
Using a matrix of scaled likelihoods, this function does an EM algorithm to climb the likelihood surface for pi, and computes the plug-in estimate of the posteriors for all the individuals. It returns the output in a list.
Usage
gsi_em_1(SL, Pi_init, max_iterations, tolerance, return_progression)
Arguments
SL |
a matrix of the scaled likelihoods. This is should have values for each individual in a column (going down in the rows are values for different collections). |
Pi_init |
Starting value for the pi (collection mixture proportion) vector. |
max_iterations |
the maximum total number of reps iterations to do. |
tolerance |
the EM-algorithm will be considered converged when the sum over the elements of pi of the absolute value of the difference between the previous and the current estimate is less than tolerance. |
return_progression |
If true, then the pi_trace component of the output shows the value of pi visited en route to the end. |
Value
gsi_em_1
returns a final Maximum-Likelihood estimate for pi and PofZ,
as well as the number of iterations needed to reach convergence ("iterations_performed"),
and traces of the pi values and change in pi in each iteration
Examples
# this is shown with a scaled likelihood matrix from self-assignment
# of the reference individuals
# we have to get the ploidies to pass to tcf2param_list
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames
params <- tcf2param_list(alewife, 17, ploidies = ploidies)
logl <- geno_logL(params)
SL <- apply(exp(logl), 2, function(x) x/sum(x))
test_em <- gsi_em_1(SL,
rep(1/params$C, params$C),
max_iterations = 10^6,
tolerance = 10^-7,
return_progression = TRUE)
MCMC from the simplest GSI model for pi and the individual posterior probabilities
Description
Using a matrix of scaled likelihoods, this function samples values of pi and the posteriors for all the individuals. It returns the output in a list.
Usage
gsi_mcmc_1(SL, Pi_init, lambda, reps, burn_in, sample_int_Pi, sample_int_PofZ)
Arguments
SL |
matrix of the scaled likelihoods. This is should have values for each individual in a column (going down in the rows are values for different populations). |
Pi_init |
Starting value for the pi (collection mixture proportion) vector. |
lambda |
the prior to be added to the collection allocations, in order to generate pseudo-count Dirichlet parameters for the simulation of a new pi vector |
reps |
total number of reps (sweeps) to do. |
burn_in |
how many reps to discard in the beginning when doing the mean calculation. They will still be returned in the traces if desired |
sample_int_Pi |
the number of reps between samples being taken for Pi traces. If 0 no trace samples are taken |
sample_int_PofZ |
the number of reps between samples being taken for the traces of posterior of each individual's origin. If 0 no trace samples are taken. |
Value
gsi_mcmc_1
returns a list of three. $mean
lists the posterior
means for collection proportions pi
and for the individual posterior
probabilities of assignment PofZ
. $sd
returns the posterior standard
deviations for the same values.
If the corresponding sample_int
variables are not 0, $trace
contains
samples taken from the Markov chain at intervals of sample_int_
(variable) steps.
Examples
# this demonstrates it with scaled likelihoods computed from
# assignment of the reference samples
# we have to get the ploidies to pass to tcf2param_list
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames
params <- tcf2param_list(alewife, 17, ploidies = ploidies)
logl <- geno_logL(params)
SL <- apply(exp(logl), 2, function(x) x/sum(x))
lambda <- rep(1/params$C, params$C)
mcmc <- gsi_mcmc_1(SL, lambda, lambda, 200, 50, 5, 5)
MCMC from the fully Bayesian GSI model for pi and the individual posterior probabilities
Description
Given a list of key parameters from a genetic dataset, this function samples values of pi and the posteriors for all the individuals. Each MCMC iteration includes a recalculation of the scaled genotype likelihood matrix, with baseline allele frequencies updated based on the previous iteration's allocations. It returns the output in a list.
Usage
gsi_mcmc_fb(
par_list,
Pi_init,
lambda,
reps,
burn_in,
sample_int_Pi,
sample_int_PofZ
)
Arguments
par_list |
genetic data converted to the param_list format by |
Pi_init |
Starting value for the pi (collection mixture proportion) vector. |
lambda |
the prior to be added to the collection allocations, in order to generate pseudo-count Dirichlet parameters for the simulation of a new pi vector |
reps |
total number of reps (sweeps) to do. |
burn_in |
how many reps to discard in the beginning when doing the mean calculation. They will still be returned in the traces if desired |
sample_int_Pi |
the number of reps between samples being taken for Pi traces. If 0 no trace samples are taken |
sample_int_PofZ |
the number of reps between samples being taken for the traces of posterior of each individual's origin. If 0 no trace samples are taken. |
Value
gsi_mcmc_fb
returns a list of three. $mean
lists the posterior
means for collection proportions pi
, for the individual posterior
probabilities of assignment PofZ
, and for the allele frequencies theta
.
$sd
returns the posterior standard deviations for the same values.
If the corresponding sample_int
variables are not 0, $trace
contains
samples taken from the Markov chain at intervals of sample_int_
(variable) steps.
Examples
# this demonstrates it with scaled likelihoods computed from
# assignment of the reference samples
# we have to get the ploidies to pass to tcf2param_list
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames
params <- tcf2param_list(alewife, 17, ploidies = ploidies)
lambda <- rep(1/params$C, params$C)
# use very short run and burn in so it doesn't take too long
# when checking on CRAN
mcmc <- gsi_mcmc_fb(params, lambda, lambda, 20, 5, 4, 4)
Estimate mixing proportions and origin probabilities from one or several mixtures
Description
Takes a mixture and reference dataframe of two-column genetic data, and a desired method of estimation for the population mixture proportions (MCMC, PB, BR). Returns the output of the chosen estimation method
Usage
infer_mixture(
reference,
mixture,
gen_start_col,
method = "MCMC",
alle_freq_prior = list(const_scaled = 1),
pi_prior = NA,
pi_init = NULL,
reps = 2000,
burn_in = 100,
pb_iter = 100,
prelim_reps = NULL,
prelim_burn_in = NULL,
sample_int_Pi = 1,
sample_theta = TRUE,
pi_prior_sum = 1
)
Arguments
reference |
a dataframe of two-column genetic format data, proceeded by "repunit", "collection", and "indiv" columns. Does not need "sample_type" column, and will be overwritten if provided |
mixture |
a dataframe of two-column genetic format data. Must have the same structure as
|
gen_start_col |
the first column of genetic data in both data frames |
method |
a choice between "MCMC", "PB", "BR" methods for estimating mixture proportions |
alle_freq_prior |
a one-element named list specifying the prior to be used when
generating Dirichlet parameters for genotype likelihood calculations. Valid methods include
|
pi_prior |
The prior to be added to the collection allocations, in order to generate pseudo-count
Dirichlet parameters for the simulation of new pi vectors in MCMC. Default value of NA leads to the
calculation of a symmetrical prior based on |
pi_init |
The initial value to use for the mixing proportion of collections. This lets the user start the chain from a specific value of the mixing proportion vector. If pi_init is NULL (the default) then the mixing proportions are all initialized to be equal. Otherwise, you pass in a data frame with one column named "collection" and the other named "pi_init". Every value in the pi_init column must be strictly positive (> 0), and a value must be given for every collection. If they sum to more than one the values will be normalized to sum to one. |
reps |
the number of iterations to be performed in MCMC |
burn_in |
how many reps to discard in the beginning of MCMC when doing the mean calculation. They will still be returned in the traces if desired. |
pb_iter |
how many bootstrapped data sets to do for bootstrap correction using method PB. Default is 100. |
prelim_reps |
for method "BR", the number of reps of conditional MCMC (as in method "MCMC")
to perform prior to MCMC with baseline resampling. The posterior mean of mixing proportions
from this conditional MCMC is then used as |
prelim_burn_in |
for method "BR", this sets the number of sweeps out of |
sample_int_Pi |
how many iterations between storing the mixing proportions trace. Default is 1. Can't be 0. Can't be so large that fewer than 10 samples are taken from the burn in and the sweeps. |
sample_theta |
for method "BR", whether or not the function should store the posterior mean of the updated allele frequences. Default is TRUE |
pi_prior_sum |
For |
Details
"MCMC" estimates mixing proportions and individual posterior probabilities of assignment through Markov-chain Monte Carlo conditional on the reference allele frequencies, while "PB" does the same with a parametric bootstrapping correction, and "BR" runs MCMC sweeps while simulating reference allele frequencies using the genotypes of mixture individuals and allocations from the previous sweep. All methods default to a uniform 1/(# collections or RUs) prior for the mixing proportions.
Value
Tidy data frames in a list with the following components: mixing_proportions: the estimated mixing proportions of the different collections. indiv_posteriors: the posterior probs of fish being from each of the collections. mix_prop_traces: the traces of the mixing proportions. Useful for computing credible intervals. bootstrapped_proportions: If using method "PB" this returns the bootstrap corrected reporting unit proportions.
Examples
mcmc <- infer_mixture(reference = small_chinook_ref,
mixture = small_chinook_mix,
gen_start_col = 5,
method = "MCMC",
reps = 200)
Collect essential data values before mixture proportion estimation
Description
Takes all relevant information created in previous steps of data conversion pipeline, and combines into a single list which serves as input for further calculations
Usage
list_diploid_params(
AC_list,
I_list,
PO,
coll_N,
RU_vec,
RU_starts,
alle_freq_prior = list(const_scaled = 1)
)
Arguments
AC_list |
a list of allele count matrices; output from |
I_list |
a list of genotype vectors; output from |
PO |
a vector of collection (population of origin) indices
for every individual in the sample, in order identical to |
coll_N |
a vector of the total number of individuals in each collection, in order of appearance in the dataset |
RU_vec |
a vector of collection indices, sorted by reporting unit |
RU_starts |
a vector of indices, designating the first collection for each reporting unit in RU_vec |
alle_freq_prior |
a one-element named list specifying the prior to be used when
generating Dirichlet parameters for genotype likelihood calculations. The name of the
list item determines the type of prior used, with options |
Details
Genotypes represented in I_list
are converted into a single long vector,
ordered by locus, individual, and gene copy, with NA
values represented as 0s.
Similarly, AC_list
is unlisted to AC
, ordered by locus, collection,
and allele. DP
is a list of Dirichlet priors for likelihood calculations, created
by adding the values calculated from alle_freq_prior
to each allele
sum_AC
and sum_DP
are the summed allele values for each locus
of their parent vectors, ordered by locus and collection.
Value
list_diploid_params
returns a list of the information necessary
for the calculation of genotype likelihoods in MCMC:
L
, N
, and C
represent the number of loci, individual genotypes,
and collections, respectively. A
is a vector of the number of alleles at each
locus, and CA
is the cumulative sum of A
. coll
, coll_N
,
RU_vec
, and RU_starts
are copied directly from input.
I
, AC
, sum_AC
, DP
, and sum_DP
are vectorized
versions of data previously represented as lists and matrices; indexing macros
use L
, N
, C
, A
, and CA
to access these vectors
in later Rcpp-based calculations.
Examples
example(allelic_list)
PO <- as.integer(factor(ale_long$clean_short$collection))
coll_N <- as.vector(table(PO))
Colls_by_RU <- dplyr::count(ale_long$clean_short, repunit, collection) %>%
dplyr::filter(n > 0) %>%
dplyr::select(-n)
PC <- rep(0, length(unique((Colls_by_RU$repunit))))
for(i in 1:nrow(Colls_by_RU)) {
PC[Colls_by_RU$repunit[i]] <- PC[Colls_by_RU$repunit[i]] + 1
}
RU_starts <- c(0, cumsum(PC))
RU_vec <- as.integer(Colls_by_RU$collection)
param_list <- list_diploid_params(ale_ac, ale_alle_list, PO, coll_N, RU_vec, RU_starts)
Separate a chosen proportion of a reference dataset into a mixture with known population proportions
Description
Takes a reference dataset and a set of population proportions, either at the collection or reporting unit level. Randomly samples individuals to satisfy these desired proportions, and splits them into a new "mixture" dataframe.
Usage
mixture_draw(D, rhos = NULL, omegas = NULL, N, min_remaining = 0)
Arguments
D |
a two-column genetic dataframe with "indiv", "repunit", and "collection" columns |
rhos |
a vector of the desired reporting unit proportions in the mixture set; if not named, will be assumed to be ordered by order of appearance in the dataset |
omegas |
the desired collection proportions in the mixture set |
N |
the total size of the mixture set |
min_remaining |
the fraction of any collection in the reference dataset which must remain at the end of the draw |
Value
mixture_draw
returns a list of two data frames,
"mixture" being the random sample taken, and "reference" being the remaining samples
Examples
rhos <- as.vector(gtools::rdirichlet(1, table(alewife$repunit)))
cross_val <- mixture_draw(D = alewife, rhos = rhos, N = 100, min_remaining = .005)
for individuals of known origin in the mixture, put all their weight on their known collection
Description
This is used internally.
Usage
modify_scaled_likelihoods_for_known_mixture_fish(SL, KC, CFL)
Arguments
SL |
the matrix of scaled likelihoods. |
KC |
a character vector of collections that the individuals belong to (or NA if you don't know where they come from). If this is NULL, then SL just gets returned untouched. |
CFL |
the levels of the collections factor (which is within clean$short) |
Compute the mean and variance of the single-locus genotype likelihoods for each collection
Description
This assumes that you have compiled params for a reference data set and then it just calls rcpp_per_locus and then summarizes the results.
Usage
per_locus_means_and_vars(par_list)
Arguments
par_list |
genetic data converted to the param_list format by |
Value
Returns a list with two components, mean and var, each one a matrix that has C (number of collections) rows and L (number of loci) columns, giving the mean (or variance) of the genotype likelihoods in the individuals in that collection at that locus.
perfect-assignment genetic data for chinook.
Description
This is just like the chinook
data, but only has 7 loci and all loci are
fixed in fortuitous patterns so that every single collection is easily resolved. This is
primarily useful for testing purposes.
Source
Made it up!
perfect-assignment mixture genetic data for chinook.
Description
This is similar to the chinook_mix
data, but only has 7 loci and all loci are
fixed in fortuitous patterns so that every single collection is easily resolved. This is
primarily useful for testing purposes. The name of the individual has its collection
inside the colons.
Source
Made it up!
Find all pairs that have close matches
Description
Find all pairs that have close matches
Usage
rcpp_close_matchers(par_list, non_miss_fract, match_fract)
Arguments
par_list |
genetic data converted to the param_list format by |
Value
Gotta say more
Examples
# gotta do something here too
From the pattern of missing data at each individual, compute the expected mean and variance of the logl
Description
This takes a param_list so that it has access to individual genotypes (and hence can cycle through them
and know which are missing and which are not.) It also takes a matrix of per-locus logl means
and variances like what is computed by per_locus_means_and_vars
.
Usage
rcpp_indiv_specific_logl_means_and_vars(par_list, MV)
Arguments
par_list |
genetic data converted to the param_list format by |
MV |
a list of two matrices, one of means and the other of variances, which are C x L
matrices. This is basically the list that is returned by |
Details
This function doesn't do any checking to assure that the par_list and the per-locus logl means matrix are made for one another. (i.e. use the same collections in the same order.)
Value
a matrix with C rows and I columns. Each row represents a collection, and each column an individual.
Return a matrix of locus-specific self-assignment logls
Description
Takes a list of key parameters from a genetic dataset, and calculates the log-likelihood of each individual's single-locus genotype, given the allele counts in the individual's collection.
Usage
rcpp_per_locus_logls(par_list)
Arguments
par_list |
genetic data converted to the param_list format by |
Details
This uses Leave-One-Out cross-validation is used to avoid bias in log-likelihood for an individual's known collection of origin
Value
rcpp_per_locus_logls
returns a matrix with I rows and L columns. Each row
represents an individual, and each column a locus. Note that missing data at a locus
returns a zero. That should be changed to NA later.
read a gsi_sim formatted input file into a tibble that rubias can use
Description
Note that this relies on a system call to awk. It probably won't work on Windows.
Usage
read_gsi_sim(path, sample_type, repunits = NULL)
Arguments
path |
path to the gsi_sim file |
sample_type |
should be "reference" or "mixture" depending on what kind of file it is |
repunits |
the gsi_sim reporting units file. Has no effect if sample_type is "mixture". If sample_type is "reference" and this is left as NULL, then all collections will be put in the the "default_repu". |
Estimate mixing proportions from reference and mixture datasets
Description
Takes a mixture and reference dataframe of two-column genetic data, and a desired method of estimation for the population mixture proportions (MCMC, PB, or BH MCMC) Returns the output of the chosen estimation method
Usage
ref_and_mix_pipeline(
reference,
mixture,
gen_start_col,
method = "MCMC",
reps = 2000,
burn_in = 100,
sample_int_Pi = 0,
sample_int_PofZ = 0,
sample_int_omega = 0,
sample_int_rho = 0,
sample_int_PofR = 0
)
Arguments
reference |
a dataframe of two-column genetic format data, proceeded by "repunit", "collection", and "indiv" columns. Does not need "sample_type" column, and will be overwritten if provided |
mixture |
a dataframe of two-column genetic format data. Must have the same structure as
|
gen_start_col |
the first column of genetic data in both data frames |
method |
this must be "MCMC". "PB" and "BH" are no longer supported in this function. |
reps |
the number of iterations to be performed in MCMC |
burn_in |
how many reps to discard in the beginning of MCMC when doing the mean calculation. They will still be returned in the traces if desired. |
sample_int_Pi |
the number of reps between samples being taken for pi traces. If 0 no traces are taken. Only used in methods "MCMC" and "PB". |
sample_int_PofZ |
the number of reps between samples being taken for the posterior traces of each individual's collection of origin. If 0 no trace samples are taken. Used in all methods |
sample_int_omega |
the number of reps between samples being taken for collection proportion traces. If 0 no traces are taken. Only used in method "BH" |
sample_int_rho |
the number of reps between samples being taken for reporting unit proportion traces. If 0 no traces are taken. Only used in method "BH" |
sample_int_PofR |
the number of reps between samples being taken for the posterior traces of each individual's reporting unit of origin. If 0 no trace samples are taken. Only used in method "BH". |
Details
"MCMC" estimates mixing proportions and individual posterior probabilities of assignment through Markov-chain Monte Carlo, while "PB" does the same with a parametric bootstrapping correction, and "BH" uses the misassignment-scaled, hierarchical MCMC. All methods use a uniform 1/(# collections or RUs) prior for pi/omega and rho.
Value
mix_proportion_pipeline
returns the standard output of the chosen
mixing proportion estimation method (always a list). For method "PB",
returns the standard MCMC results, as well as the bootstrap-corrected
collection proportions under $mean$bootstrap
Examples
reference <- small_chinook_ref
mixture <- small_chinook_mix
gen_start_col <- 5
# this function expects things as factors. This function is old and needs
# to be replaced and deprecated.
reference$repunit <- factor(reference$repunit, levels = unique(reference$repunit))
reference$collection <- factor(reference$collection, levels = unique(reference$collection))
mixture$repunit <- factor(mixture$repunit, levels = unique(mixture$repunit))
mixture$collection <- factor(mixture$collection, levels = unique(mixture$collection))
mcmc <- ref_and_mix_pipeline(reference, mixture, gen_start_col, method = "MCMC")
Tabulate occurrences of all observed alleles in reference genetic data
Description
Takes the first output of tcf2long
, along with two columns named "collection" and "sample_type",
and returns a data frame of allele counts for each locus within each reference population.
Alleles to be counted are identified from both reference and mixture populations.
Usage
reference_allele_counts(D, pop_level = "collection")
Arguments
D |
A data frame containing, at minimum, a column of sample group identifiers named "collection", a column designating each row as "reference" or "mixture", named "sample_type", and (from tcf2long output) locus, gene copy, and observed alleles. If higher-level reporting unit counts are desired, must have a column of reporting unit identifiers named "repunit" |
pop_level |
a character vector expressing the population level for which allele counts should be tabulated. Set to "collection" for collection/underlying sample group (default), or "repunit" for reporting unit/overlying sample groups |
Details
The "collection" column should be a key assigning samples to the desired groups, e.g. collection site, run time, year. The "sample_type" column must contain either "reference" or "mixture" for each sample.
Value
reference_allele_counts
returns a long-format dataframe, with count data for
each collection, locus, and allele. Counts are only drawn from "reference" samples; alleles
unique to the "mixture" samples will still appear in the list, but will have 0s for all groups.
Examples
## count alleles in alewife reference populations
example(tcf2long) # gets variable ale_long
ale_rac <- reference_allele_counts(ale_long$long)
Round a given number, with 5 always rounded up
Description
Given a number and a digit to round to, returns the rounded number, with 5 always rounded upwards.
Usage
round2(x, n)
Arguments
x |
the data to be rounded |
n |
the number of digits to round to |
Details
This function differs from round
, which rounds 5 "towards the
even number". Rounding 5s up leads to bias when positive and negative numbers
are expected, but can be desired in some cases.
rubias: Bayesian inference from the conditional genetic stock identification model
Description
Read the "rubias-overview" vignette for information on data input formats and how to use the package
the rubias
main high-level functions
The following functions are wrappers, designed for user-friendly input and useful output:
infer_mixture
is used to perform genetic stock identification.
Options include standard MCMC and the parametric bootstrap bias correction.
self_assign
does simple self-assignment of individuals in a reference data set
to the various collections in the reference data set.
assess_reference_loo
does leave-one-out based simulations to predict how
accurately GSI can be done.
assess_reference_mc
uses Monte-Carlo cross-validation based simulations
to predict how accurately GSI can be done.
assess_pb_bias_correction
attempts to demonstrate how much (or little)
improvement can be expected from the parametric bootstrap correction given a particular
reference data set.
genetic data format
See the vignette.
example data
alewife
, blueback
, and chinook
are
genetic data sets that are useful for playing around with rubias and testing it
out.
Sample 1 observation from cell probabilities that are columns of a matrix
Description
Takes a matrix in which columns sum to one. For each column, performs a single multinomial draw from the rows, weighted by their values in that column
Usage
samp_from_mat(M)
Arguments
M |
a matrix whose columns are reals summing to one |
Value
a vector length = ncol(M)
of indices, with each element being
the row that was chosen in that column's sampling
Do leave-one-out self-assignment of individuals in a reference baseline
Description
Returns a tidy data frame
Usage
self_assign(
reference,
gen_start_col,
preCompiledParams = NULL,
alle_freq_prior = list(const_scaled = 1)
)
Arguments
reference |
a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has some "reference" entries |
gen_start_col |
the first column of genetic data in |
preCompiledParams |
Users should never use this option. It is here only so that this function can be called on a precompiled set of parameters with infer_mixture. Don't use this, unless you are one of the package developers... |
alle_freq_prior |
a one-element named list specifying the prior to be used when
generating Dirichlet parameters for genotype likelihood calculations. Valid methods include
|
Value
a tibble ...
Examples
ale_sa <- self_assign(alewife, 17)
List of example ways of specifying repunit and collection quantities in simulations
Description
This is just a list of tibbles that can be passed to the alpha_repunit or the
alpha_collection parameters in, for example, assess_reference_loo
.
Source
Made it up!
Generate a samples for a mixture.
Description
Creates random reporting unit (rho) and collection (omega) proportions, and a
sim_colls
vector for simulation of individual genotypes, based on the methods
used in Hasselman et al. (2015)
Usage
simulate_random_samples(
RU_starts,
RU_vec,
size = 100,
alpha_repunit = 1.5,
alpha_collection = 1.5,
coll_sub_dirichlet_default = 1.5
)
Arguments
RU_starts |
a vector delineating the reporting units in |
RU_vec |
a vector of collection indices, grouped by reporting unit;
generated by |
size |
The number of individuals desired in the mixture sample. Default = 100. This is ignored without a warning if alpha_repunit is specified with counts (a cnt column) or if alpha_collection is specified with counts (a cnt column). |
alpha_repunit |
If a vector, this is the dirichlet parameter for simulating the proportions of reporting units. Gets recycled to the number of reporting units. Default is 1.5. Otherwise, this could be a two-column data frame. The first column must be named "repunit" and the second one must be one of "dirichlet", "ppn", or "cnt", according to whether you wish to specify dirichlet parameters, or proportions, or exact counts, respectively, for each population. |
alpha_collection |
The dirichlet parameter for simulating proportions of collections within reporting units. Default = 1.5 |
coll_sub_dirichlet_default |
If you are providing a data frame with requested sub_dirichlet pars for collections, and you don't specifically list one, this is the value it gets. |
Value
a list with three elements. The first two are a rho vector and an omega vector, respectively. The third is a vector of origins for simulated individuals, sampled from the collections with probabilities = omega
Small sample of SNP data from Chinook salmon taken in May/August 2015 from California fisheries
Description
This is simply a sample of 100 fish from chinook
.
Source
Southwest Fisheries Science Center, Santa Cruz, CA
SNP data from selected chinook reference populations
Description
A small number of poulations from the Chinook salmon baseline data similar to that which can be downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.574sv. This data set includes 91 SNPs and 909 fish.
Format
A tbl_df-ed (from dplyr) data frame with 909 rows and 185 variables. The first three columns are
- repunit (chr)
the reporting unit that the individual is in
- pop (chr)
the population from which the individual was sampled
- ID (chr)
Unique identifier of the individual fish
The remaining columns are two columns for each locus. These columns are named like, "Locus.1" and "Locus.2" for the first and second gene copies at that locus. For example, "Ots_104569-86.1" and "Ots_104569-86.2". The locus columns are ints and missing data is denoted by NA.
Source
https://datadryad.org/stash/dataset/doi:10.5061/dryad.574sv
Convert Two-Column Genetic Data to Long Format
Description
Takes a data frame consisting of metadata followed by paired columns of genetic data, with each column in the pair representing a gene copy at a locus. Returns a list of two data frames: one with genetic data condensed into one column, and the other with two-column structure intact, but with cleaned allele names.
Usage
tcf2long(D, gen_start_col)
Arguments
D |
A data frame containing two-column genetic data, optionally preceded by metadata. The header of the first genetic data column in each pair lists the locus name, the second is ignored. Locus names must not have spaces in them! |
gen_start_col |
The index (number) of the column in which genetic data starts. Columns must be only genetic data after genetic data starts. |
Value
tcf2long
returns a list of two data frames: in the first, "long", the rightmost
column is the genetic data. Two new columns, "locus" and "gene copy", duplicate the original
column name provided in the first of each pair, and designate copies "a" and "b", respectively.
Metadata is duplicated as necessary for each locus. The second, "clean_short", replicates the
input dataset, but with column names replaced by "(locus name) a" and "(locus name) b" in each pair.
In other words the locus name has an "a" or a "b" added to it after a space.
Examples
## Convert the alewife dataset for further processing
# the data frame passed into this function must have had
# character collections and repunits converted to factors
reference <- alewife
reference$repunit <- factor(reference$repunit, levels = unique(reference$repunit))
reference$collection <- factor(reference$collection, levels = unique(reference$collection))
ale_long <- tcf2long(reference, 17)
Generate MCMC parameter list from two-column genetic data & print summary
Description
This function is a wrapper for all steps to create the parameter list necessary for genotype log-likelihood calculation from the starting two-column genetic data
Usage
tcf2param_list(
D,
gen_start_col,
samp_type = "both",
alle_freq_prior = list(const_scaled = 1),
summ = T,
ploidies
)
Arguments
D |
A data frame containing two-column genetic data, preceded by metadata. The header of the first genetic data column in each pair lists the locus name, the second is ignored. Locus names must not have spaces in them! Required metadata includes a column of unique individual identifiers named "indiv", a column named "collection" designating the sample groups, a column "repunit" designating the reporting unit of origin of each fish, and a "sample_type" column denoting each individual as a "reference" or "mixture" sample. No NAs should be present in metadata |
gen_start_col |
The index (number) of the column in which genetic data starts. Columns must be only genetic data after genetic data starts. |
samp_type |
the sample groups to be include in the individual genotype list, whose likelihoods will be used in MCMC. Options "reference", "mixture", and "both" |
alle_freq_prior |
a one-element named list specifying the prior to be used when
generating Dirichlet parameters for genotype likelihood calculations. Valid methods
include |
summ |
logical indicating whether summary descriptions of the formatted data be provided |
ploidies |
a named vector of ploidies (1 or 2) for each locus. The names must the the locus names. |
Details
In order for all steps in conversion to be carried out successfully, the dataset
must have "repunit", "collection", "indiv", and "sample_type" columns preceding
two-column genetic data. If summ == TRUE
, the function prints summary statistics
describing the structure of the dataset, as well as the presence of missing data,
enabling verification of proper data conversion.
Value
tcf2param_list
returns the output of list_diploid_params
,
after the original dataset is converted to a usable format and all relevant values
are extracted. See ?list_diploid_params
for details
Examples
# after adding support for haploid markers we need to pass
# in the ploidies vector. These markers are all diploid...
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames
ale_par_list <- tcf2param_list(alewife, 17, ploidies = ploidies)
A helper function to tidy up the output from the gsi_mcmc functions
Description
This makes a tidy data frame of stuff, and also changes things back to factors, if the levels are provided.
Usage
tidy_mcmc_coll_rep_stuff(field, p, pname, car_tib)
Arguments
field |
The output to tidy (i.e.. out$mean) |
p |
the name of the parameter whose values you want to extract (like "pi") |
pname |
the name that you want the parameter to be called in the output |
car_tib |
a tibble with repunit and collection in the order they appear in the output |
A helper function to tidy up the PofZ-like output from the gsi_mcmc functions
Description
This makes a tidy data frame of stuff, and also changes things back to factors, if the levels are provided.
Usage
tidy_mcmc_pofz(input, pname, car_tib, mix_indiv_tib)
Arguments
input |
The output to tidy (i.e.. out$mean$PofZ) |
pname |
the name that you want the parameter to be called in the output |
car_tib |
a tibble with repunit and collection in the order they appear in the output |
mix_indiv_tib |
a tibble with the individuals in the order they appear in the output |
a helper function to tidy up the pi-traces that come out of the mcmc functions
Description
This makes a tidy data frame of stuff, and also changes things back to factors, if the levels are provided.
Usage
tidy_pi_traces(input, pname, car_tib, interval)
Arguments
input |
The output to tidy (i.e.. out$trace$pi) |
pname |
the name that you want the parameter to be called in the output |
car_tib |
a tibble with repunit and collection in the order they appear in the output |
interval |
the thinning interval that was used |
Write a mixture data frame to gsi_sim format baseline and repunits file
Description
Note, this is only intended to work with integer-valued alleles, at the moment. It was just written for testing and verifying that things are working correctly.
Usage
write_gsi_sim_mixture(mix, gen_start_col, mixprefix)
Arguments
mix |
mixture data frame |
gen_start_col |
column in which the genetic data start |
mixprefix |
path to write the mixture file to. The mixture collection name + .txt will be appended to this. This path can include directories if they exist. An example would be "./my_gsi_data/mixture". This is a required argument. |
Examples
# this writes to file prefix "mixfile" in a temporary directory
dd <- tempdir()
prefix <- file.path(dd, "mixfile")
# print that
prefix
# note that in practice you will probably want to specify
# your own directory...
# run the function
write_gsi_sim_mixture(chinook_mix, 5, prefix)
# see where those files live:
dir(dd, pattern = "mixfile*", full.names = TRUE)
Write a reference data frame to gsi_sim format baseline and repunits file
Description
Note, this is only intended to work with integer-valued alleles, at the moment. It was just written for testing and verifying that things are working correctly.
Usage
write_gsi_sim_reference(
ref,
gen_start_col,
baseout = "baseline.txt",
repout = "repunits.txt"
)
Arguments
ref |
reference data frame |
gen_start_col |
column in which the genetic data start |
baseout |
path to write the baseline file to. Required. |
repout |
path to write the repunits file to. Required. |
Examples
# create a temp directory to put example outputs
dd <- tempdir()
basefile <- file.path(dd, "baseline.txt")
repunitsfile <- file.path(dd, "repunits.txt")
# print those
basefile
repunitsfile
# note that in practice you will probably want to specify
# your own filepaths...
# run the function
write_gsi_sim_reference(alewife, 17, basefile, repunitsfile)