---
title: "AdmixPoly-vignette"
author: "Simon Rio"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{AdmixPoly-vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

**Table of Contents**

  - [Simulate admixed population](#simulate-admixed-population)
  - [Unsupervised global admixture inference](#unsupervised-global-admixture-inference)
  - [Semi-supervised global admixture inference](#semi-supervised-global-admixture-inference)
  - [Local admixture inference](#local-admixture-inference)
  - [Import genotyping data](#import-genotyping-data)

## Simulate admixed population 

First, an admixed dataset is simulated using the `SimulatePop` function, 
considering `N=50` individuals of ploidy `P=8`, `M=200` markers with `L=10` 
alleles distributed on `C=5` chromosomes, and `K=4` ancestral groups.

```{r}
DataSim <- AdmixPoly::SimulatePop(K=3L, N=50L, P=8L, M=200L, C=2L, L=10L, NbThreads = 1)
```

The resulting `DataSim` object includes:

(i) `Geno`: list of genotyping matrices over markers

```{r}
head(DataSim$Geno[[1]])
```

(ii) `Prop`: matrix of admixture proportions 

```{r}
head(DataSim$Prop)
```

(iii) `Freq`: list of matrices of allele frequencies in ancestral groups 

```{r}
DataSim$Freq[[1]]
```

(iv) `GeneticMap`: genetic map dataframe

```{r}
head(DataSim$GeneticMap)
```


## Unsupervised global admixture inference

From simulated data, admixture proportions can be estimated using the 
`AdmixGlobal` function:

```{r}
ResGlobalAdmix <- AdmixPoly::AdmixGlobal(Geno=DataSim$Geno, K=3L, MaxIter=100, Verbose=F, NbThreads = 1)
```

The estimated admixture proportions are available from:

```{r}
head(ResGlobalAdmix$Prop)
```

They can be represented using the [ggplot2](https://github.com/tidyverse/ggplot2)-based 
`GlobalPlot` function:

```{r,fig.width=7, fig.height=3}
AdmixPoly::GlobalPlot(ResGlobalAdmix$Prop)
```

Estimated proportions are very close to simulated ones (up to an arbitrary labeling of groups), 
as indicated by the root mean square error (RMSE):

```{r}
group_corresp <- apply(cor(DataSim$Prop, ResGlobalAdmix$Prop), 2, which.max)
sqrt(mean((DataSim$Prop-ResGlobalAdmix$Prop[,group_corresp])^2))
```

Estimated allele frequencies are available from:

```{r}
ResGlobalAdmix$Freq[[1]]
```

The convergence can be checked from the log-likelihood:

```{r,fig.width=7, fig.height=4}
plot(ResGlobalAdmix$LogLik, xlab="Iterations", ylab="LogLik", type="o")
```

## Semi-supervised global admixture inference

Let us consider a second admixed population from the same ancestral groups with 
a higher admixture level by specifying a higher 'AlphaProp' value.

```{r}
DataSim2 <- AdmixPoly::SimulatePop(K=3L, N=50L, P=8L, M=200L, C=2L, L=10L,
                                   AlphaProp=10, Freq=DataSim$Freq, NbThreads = 1)
```

The higher admixture level can be illustrated using a barplot of simulated amdixture proportions.

```{r,fig.width=7, fig.height=3}
AdmixPoly::GlobalPlot(DataSim2$Prop)
```


In this case, it can be beneficial to initialize the algorithm with ancestral allele 
frequencies estimated using the first population 'FreqInit=ResGlobalAdmix$Freq' 
and only update admixture proportions 'ParamToUpdate="Prop"'.

```{r}
ResGlobalAdmix2 <- AdmixPoly::AdmixGlobal(Geno=DataSim2$Geno, K=3L, MaxIter=100, Verbose=F, 
                                          FreqInit=ResGlobalAdmix$Freq, ParamToUpdate="Prop",
                                          NbThreads=1)
```

Again, estimated proportions are very close to simulated ones, 
as indicated by the RMSE:

```{r}
sqrt(mean((DataSim2$Prop-ResGlobalAdmix2$Prop[,group_corresp])^2))
```

## Local admixture inference
Based on global admixture results, local admixture can be estimated for a given 
individual using the `AdmixLocal` function:

```{r}
Ind <- "Ind1"
ResLocalAdmix <- AdmixPoly::AdmixLocal(Geno=DataSim$Geno, ResAdmixGlobal=ResGlobalAdmix,
                                       GeneticMap=DataSim$GeneticMap, Ind=Ind, P = 8L,
                                       SmoothParam=1, Verbose=F, NbThreads = 1)
```

Local ancestry dosages based on posterior probabilities are available from:

```{r}
head(ResLocalAdmix$Posterior$Chr1)
```
Alternatively, local ancestry dosages based on the Viterbi algorithm are available from:

```{r}
head(ResLocalAdmix$Viterbi$Chr1)
```

The RMSE of estimated local ancestry proportions (ancestry dosages divided by the ploidy) is:

```{r}
sqrt(mean((DataSim$Ancestry[[Ind]]$Chr1/8-ResLocalAdmix$Posterior$Chr1[,group_corresp]/8)^2))
sqrt(mean((DataSim$Ancestry[[Ind]]$Chr1/8-ResLocalAdmix$Viterbi$Chr1[,group_corresp]/8)^2))
```

Local ancestry dosages can be represented using the [ggplot2](https://github.com/tidyverse/ggplot2)-based 
`LocalPlot` function:

```{r,fig.width=7, fig.height=4}
AdmixPoly::LocalPlot(Ancestry = ResLocalAdmix$Posterior, GeneticMap=DataSim$GeneticMap)
```

Local admixture inference over the five first individuals can be run using:

```{r}
ResLocalAdmix_list <- lapply(paste0("Ind",1:5),function(i){
  res_i <- AdmixPoly::AdmixLocal(Geno=DataSim$Geno, ResAdmixGlobal=ResGlobalAdmix,
                               GeneticMap=DataSim$GeneticMap, Ind=i, P=8L,
                               SmoothParam=1, Verbose=F, NbThreads=1)
  return(res_i)
})
```

When computing time is large, each individual can be run in parallel on a 
high-performance computing cluster.

## Import genotyping data

Genotyping data formatted as a variant call format (VCF) can be imported as:

```{r, results=FALSE}
vcf_path <- system.file("extdata", "Test.vcf", package="AdmixPoly")
DataVCF <- AdmixPoly::ReadVCF(File=vcf_path, NbThreads=1)
```

The resulting `DataVCF` object includes:

(i) `MarkerInfo`: first eight columns of the VCF

```{r}
head(DataVCF$MarkerInfo)
```

(ii) `Geno`: list of genotyping matrices

```{r}
head(DataVCF$Geno[[1]])
```

(iii) `GeneticMap`: genetic map dataframe in which physical between first and last marker of each chromosome are scaled to 100 cM

```{r}
head(DataVCF$GeneticMap)
```

Instead of estimated dosages, read depth ratios can be imported by specifying the 
allele read depths field of the FORMAT column:

```{r,results=FALSE}
DataVCF2 <- AdmixPoly::ReadVCF(File = vcf_path,AlleleDepthField = "AD",NbThreads = 1)
```
```{r}
head(DataVCF2$Geno[[1]])
```
Please note that read depth ratios are not scaled to the ploidy level, 
which must then be informed in the inference functions.

Another function can be used to read the haplotype presence-absence (HPA) format 
obtained from  [HaploCharmer](https://gitlab.cirad.fr/agap/seg/haplocharmer):

```{r,results=FALSE}
hpa_path <- system.file("extdata", "Test.hpa", package="AdmixPoly")
DataHPA <- AdmixPoly::ReadHPA(File=hpa_path, NbThreads=1)
```
```{r}
head(DataHPA$Geno[[1]])
```