Type: | Package |
Title: | Visualization of Viral Protein Sequence Diversity Dynamics |
Version: | 2.0.1 |
Description: | To ease the visualization of outputs from Diversity Motif Analyser ('DiMA'; https://github.com/BVU-BILSAB/DiMA). 'vDiveR' allows visualization of the diversity motifs (index and its variants – major, minor and unique) for elucidation of the underlying inherent dynamics. Please refer https://vdiver-manual.readthedocs.io/en/latest/ for more information. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
LazyData: | true |
Imports: | dplyr, gghalves, ggplot2, ggpubr, grid, ggtext, magrittr, tidyr, stringr, rlang, rentrez, scales, utils, maps, cowplot, stringdist |
RoxygenNote: | 7.3.2 |
Depends: | R (≥ 2.10) |
Suggests: | testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
NeedsCompilation: | no |
Packaged: | 2024-11-22 08:01:25 UTC; Kelvin |
Author: | Pendy Tok [aut, cre], Li Chuin Chong [aut], Evgenia Chikina [aut], Yin Cheng Chen [aut], Mohammad Asif Khan [aut] |
Maintainer: | Pendy Tok <pendytok0518@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-11-22 08:20:02 UTC |
vDiveR: Visualization of Viral Protein Sequence Diversity Dynamics
Description
To ease the visualization of outputs from Diversity Motif Analyser ('DiMA'; https://github.com/BVU-BILSAB/DiMA). 'vDiveR' allows visualization of the diversity motifs (index and its variants – major, minor and unique) for elucidation of the underlying inherent dynamics. Please refer https://vdiver-manual.readthedocs.io/en/latest/ for more information.
Author(s)
Maintainer: Pendy Tok pendytok0518@gmail.com
Authors:
Li Chuin Chong
Evgenia Chikina
Yin Cheng Chen
Mohammad Asif Khan
DiMA (v5.0.9) JSON Output File
Description
A sample DiMA JSON Output File which acts as the input for JSON2CSV()
Usage
JSON_sample
Format
A Diversity Motif Analyzer (DiMA) tool JSON file
k-mer sequences concatenation
Description
This function concatenates completely (index incidence = 100 index incidence < 100 k-mer position or are adjacent to each other and generate the CCS/HCS sequence in either CSv or FASTA format
Usage
concat_conserved_kmer(
data,
conservation_level = "HCS",
kmer = 9,
threshold_pct = NULL
)
Arguments
data |
DiMA JSON converted csv file data |
conservation_level |
CCS (completely conserved) / HCS (highly conserved) |
kmer |
size of the k-mer window |
threshold_pct |
manually set threshold of index.incidence for HCS |
Value
A list wit csv and fasta dataframes
Examples
csv<-concat_conserved_kmer(proteins_1host)$csv
csv_2hosts<-concat_conserved_kmer(protein_2hosts, conservation_level = "CCS")$csv
fasta <- concat_conserved_kmer(protein_2hosts, conservation_level = "HCS")$fasta
Extract metadata via fasta file from GISAID
Description
This function get the metadata from each header of GISAID fasta file
Usage
extract_from_GISAID(file_path)
Arguments
file_path |
path of fasta file |
Extract metadata via fasta file from ncbi
Description
This function get the metadata from each head of fasta file
Usage
extract_from_NCBI(file_path)
Arguments
file_path |
path of fasta file |
JSON2CSV
Description
This function converts DiMA (v5.0.9) JSON output file to a dataframe with 17 predefined columns which further acts as the input for other functions provided in this vDiveR package.
Usage
json2csv(
json_data,
host_name = "unknown host",
protein_name = "unknown protein"
)
Arguments
json_data |
DiMA JSON output dataframe |
host_name |
name of the host species |
protein_name |
name of the protein |
Value
A dataframe which acts as input for the other functions in vDiveR package
Examples
inputdf<-json2csv(JSON_sample)
Metadata Input Sample
Description
A dummy dataset that acts as an input for plot_world_map() and plot_time()
Usage
metadata
Format
A data frame with 1000 rows and 3 variables:
- ID
unique identifier of the sequence
- region
geographical region of the sequence collection
- date
collection date of the sequence
Metadata Extraction from NCBI/GISAID (EpiFlu/EpiCoV/EpiPox/EpiArbo) FASTA file
Description
This function retrieves metadata (ID, region, date) from the input FASTA file, with the source of, either NCBI (with default FASTA header) or GISAID (with default FASTA header). The function will return a dataframe that has three columns consisting ID, collected region and collected date. Records that do not have region or date information will be excluded from the output dataframe.
Usage
metadata_extraction(file_path, source)
Arguments
file_path |
path of fasta file |
source |
the source of fasta file, either "NCBI" or "GISAID" |
Value
A dataframe that has three columns consisting ID, collected region and collected date
Examples
filepath <- system.file('extdata','GISAID_EpiCoV.faa', package = 'vDiveR')
meta_gisaid <- metadata_extraction(filepath, 'GISAID')
Conservation Levels Distribution Plot
Description
This function plots conservation levels distribution of k-mer positions, which consists of completely conserved (black) (index incidence = 100%), highly conserved (blue) (90% <= index incidence < 100%), mixed variable (green) (20% < index incidence <= 90%), highly diverse (purple) (10% < index incidence <= 20%) and extremely diverse (pink) (index incidence <= 10%).
Usage
plot_conservation_level(
df,
protein_order = NULL,
conservation_label = 1,
host = 1,
base_size = 11,
line_dot_size = 2,
label_size = 2.6,
alpha = 0.6
)
Arguments
df |
DiMA JSON converted csv file data |
protein_order |
order of proteins displayed in plot |
conservation_label |
0 (partial; show present conservation labels only) or 1 (full; show ALL conservation labels) in plot |
host |
number of host (1/2) |
base_size |
base font size in plot |
line_dot_size |
lines and dots size |
label_size |
conservation labels font size |
alpha |
any number from 0 (transparent) to 1 (opaque) |
Value
A plot
Examples
plot_conservation_level(proteins_1host, conservation_label = 1,alpha=0.8, base_size = 15)
plot_conservation_level(protein_2hosts, conservation_label = 0, host=2)
Entropy and total variant incidence correlation plot
Description
This function plots the correlation between entropy and total variant incidence of all the provided protein(s).
Usage
plot_correlation(
df,
host = 1,
alpha = 1/3,
line_dot_size = 3,
base_size = 11,
ylabel = "k-mer entropy (bits)\n",
xlabel = "\nTotal variants (%)",
ymax = ceiling(max(df$entropy)),
ybreak = 0.5
)
Arguments
df |
DiMA JSON converted csv file data |
host |
number of host (1/2) |
alpha |
any number from 0 (transparent) to 1 (opaque) |
line_dot_size |
dot size in scatter plot |
base_size |
base font size in plot |
ylabel |
y-axis label |
xlabel |
x-axis label |
ymax |
maximum y-axis |
ybreak |
y-axis breaks |
Value
A scatter plot
Examples
plot_correlation(proteins_1host)
plot_correlation(protein_2hosts, base_size = 2, ybreak=1, ymax=10, host = 2)
Dynamics of Diversity Motifs (Protein) Plot
Description
This function compactly display the dynamics of diversity motifs (index and its variants: major, minor and unique) in the form of dot plot(s) as well as violin plots for all the provided individual protein(s).
Usage
plot_dynamics_protein(
df,
host = 1,
protein_order = NULL,
base_size = 8,
alpha = 1/3,
line_dot_size = 3,
bw = "nrd0",
adjust = 1
)
Arguments
df |
DiMA JSON converted csv file data |
host |
number of host (1/2) |
protein_order |
order of proteins displayed in plot |
base_size |
base font size in plot |
alpha |
any number from 0 (transparent) to 1 (opaque) |
line_dot_size |
dot size in scatter plot |
bw |
smoothing bandwidth of violin plot (default: nrd0) |
adjust |
adjust the width of violin plot (default: 1) |
Value
A plot
Examples
plot_dynamics_protein(proteins_1host)
Dynamics of Diversity Motifs (Proteome) Plot
Description
This function compactly display the dynamics of diversity motifs (index and its variants: major, minor and unique) in the form of dot plot as well as violin plot for all the provided proteins at proteome level.
Usage
plot_dynamics_proteome(
df,
host = 1,
line_dot_size = 2,
base_size = 10,
alpha = 1/3,
bw = "nrd0",
adjust = 1
)
Arguments
df |
DiMA JSON converted csv file data |
host |
number of host (1/2) |
line_dot_size |
size of dot in plot |
base_size |
word size in plot |
alpha |
any number from 0 (transparent) to 1 (opaque) |
bw |
smoothing bandwidth of violin plot (default: nrd0) |
adjust |
adjust the width of violin plot (default: 1) |
Value
A plot
Examples
plot_dynamics_proteome(proteins_1host)
Entropy plot
Description
This function plot entropy (black) and total variant (red) incidence of each k-mer position across the studied proteins and highlight region(s) with zero entropy in yellow. k-mer position with low support is marked with a red triangle underneath the x-axis line.
Usage
plot_entropy(
df,
host = 1,
protein_order = "",
kmer_size = 9,
ymax = 10,
line_size = 2,
base_size = 8,
all = TRUE,
highlight_zero_entropy = TRUE
)
Arguments
df |
DiMA JSON converted csv file data |
host |
number of host (1/2) |
protein_order |
order of proteins displayed in plot |
kmer_size |
size of the k-mer window |
ymax |
maximum y-axis |
line_size |
size of the horizontal (reference) line in plot |
base_size |
word size in plot |
all |
plot both the entropy and total variants (pass FALSE in to plot only the entropy) |
highlight_zero_entropy |
highlight region with zero entropy (default: TRUE) |
Value
A plot
Examples
plot_entropy(proteins_1host)
plot_entropy(protein_2hosts, host = 2)
Time Distribution of Sequences Plot
Description
This function plots the time distribution of provided sequences in the form of bar plot with 'Month' as x-axis and 'Number of Sequences' as y-axis. Aside from the plot, this function also returns a dataframe with 2 columns: 'Date' and 'Number of sequences'. The input dataframe of this function is obtainable from metadata_extraction(), with NCBI Protein / GISAID (EpiFlu/EpiCoV/EpiPox/EpiArbo) FASTA file as input.
Usage
plot_time(
metadata,
date_format = "%Y-%m-%d",
base_size = 8,
date_break = "2 month",
scale = "count"
)
Arguments
metadata |
a dataframe with 3 columns, 'ID', 'region', and 'date' |
date_format |
date format of the input dataframe |
base_size |
word size in plot |
date_break |
date break for the scale_x_date |
scale |
plot counts or log scale the data |
Value
A single plot or a list with 2 elements (a plot followed by a dataframe, default)
Examples
time_plot <- plot_time(metadata, date_format="%d/%m/%Y")$plot
time_df <- plot_time(metadata, date_format="%d/%m/%Y")$df
Geographical Distribution of Sequences Plot
Description
This function plots a world map and color the affected geographical region(s) from light (lower) to dark (higher), depends on the cumulative number of sequences. Aside from the plot, this function also returns a dataframe with 2 columns: 'Region' and 'Number of Sequences'. The input dataframe of this function is obtainable from metadata_extraction(), with NCBI Protein / GISAID (EpiFlu/EpiCoV/EpiPox/EpiArbo) FASTA file as input.
Usage
plot_world_map(metadata, base_size = 8)
Arguments
metadata |
a dataframe with 3 columns, 'ID', 'region', and 'date' |
base_size |
word size in plot |
Value
A list with 2 elements (a plot followed by a dataframe)
Examples
geographical_plot <- plot_world_map(metadata)$plot
geographical_df <- plot_world_map(metadata)$df
DiMA (v5.0.9) JSON converted-CSV Output Sample 2
Description
A dummy dataset with 1 protein (Core) from two hosts, human and bat
Usage
protein_2hosts
Format
A data frame with 200 rows and 17 variables:
- proteinName
name of the protein
- position
starting position of the aligned, overlapping k-mer window
- count
number of k-mer sequences at the given position
- lowSupport
k-mer position with sequences lesser than the minimum support threshold (TRUE) are considered of low support, in terms of sample size
- entropy
level of variability at the k-mer position, with zero representing completely conserved
- indexSequence
the predominant sequence (index motif) at the given k-mer position
- index.incidence
the fraction (in percentage) of the index sequences at the k-mer position
- major.incidence
the fraction (in percentage) of the major sequence (the predominant variant to the index) at the k-mer position
- minor.incidence
the fraction (in percentage) of minor sequences (of frequency lesser than the major variant, but not singletons) at the k-mer position
- unique.incidence
the fraction (in percentage) of unique sequences (singletons, observed only once) at the k-mer position
- totalVariants.incidence
the fraction (in percentage) of sequences at the k-mer position that are variants to the index (includes: major, minor and unique variants)
- distinctVariant.incidence
incidence of the distinct k-mer peptides at the k-mer position
- multiIndex
presence of more than one index sequence of equal incidence
- host
species name of the organism host to the virus
- highestEntropy.position
k-mer position that has the highest entropy value
- highestEntropy
highest entropy values observed in the studied protein
- averageEntropy
average entropy values across all the k-mer positions
DiMA (v5.0.9) JSON converted-CSV Output Sample 1
Description
A dummy dataset with two proteins (A and B) from one host, human
Usage
proteins_1host
Format
A data frame with 806 rows and 17 variables:
- proteinName
name of the protein
- position
starting position of the aligned, overlapping k-mer window
- count
number of k-mer sequences at the given position
- lowSupport
k-mer position with sequences lesser than the minimum support threshold (TRUE) are considered of low support, in terms of sample size
- entropy
level of variability at the k-mer position, with zero representing completely conserved
- indexSequence
the predominant sequence (index motif) at the given k-mer position
- index.incidence
the fraction (in percentage) of the index sequences at the k-mer position
- major.incidence
the fraction (in percentage) of the major sequence (the predominant variant to the index) at the k-mer position
- minor.incidence
the fraction (in percentage) of minor sequences (of frequency lesser than the major variant, but not singletons) at the k-mer position
- unique.incidence
the fraction (in percentage) of unique sequences (singletons, observed only once) at the k-mer position
- totalVariants.incidence
the fraction (in percentage) of sequences at the k-mer position that are variants to the index (includes: major, minor and unique variants)
- distinctVariant.incidence
incidence of the distinct k-mer peptides at the k-mer position
- multiIndex
presence of more than one index sequence of equal incidence
- host
species name of the organism host to the virus
- highestEntropy.position
k-mer position that has the highest entropy value
- highestEntropy
highest entropy values observed in the studied protein
- averageEntropy
average entropy values across all the k-mer positions