% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Main.codes.R
\name{Data.rf.classifier}
\alias{Data.rf.classifier}
\title{Random Forest classification for OTU/ASV Data}
\usage{
Data.rf.classifier(
  raw_data,
  metadata,
  train_p,
  Group,
  OTU_counts_filter_value = NA,
  reps = 5,
  cv_fold = 10,
  title_size = 10,
  axis_title_size = 8,
  legend_title_size = 8,
  legend_text_size = 6,
  seed = 123
)
}
\arguments{
\item{raw_data}{A numeric matrix or data frame of counts data with OTUs/ASVs as rows and samples as columns.}

\item{metadata}{A data frame containing information about all samples, including at least the group and individual
identifiers (\code{Group} and \code{ID}), the sampling \code{Time} point of each sample, and any other relevant information.}

\item{train_p}{A number between 0 and 1 giving the proportion of samples assigned to the training set. For example, with
\code{train_p = 0.7}, 70\% of the samples are randomly selected as the training dataset. For more information, see
\code{\link[randomForest]{rfcv}}.}

\item{Group}{A string naming the column in \code{metadata} that is used to group the time-series samples.}

\item{OTU_counts_filter_value}{An integer giving the minimum total abundance that an OTU/ASV must reach across all samples.
OTUs/ASVs whose summed abundance is below this positive threshold are excluded; all others are retained.
The default is \code{NA}, in which case no threshold is provided and no filtering is applied.}

\item{reps}{An integer. The number of replications for cross-validation. By default, \code{reps = 5}. For more details, see
\code{\link[randomForest]{rfcv}}.}

\item{cv_fold}{An integer. The number of folds in the cross-validation. By default, \code{cv_fold = 10}. See \code{\link[randomForest]{rfcv}}.}

\item{title_size}{Numeric value for the font size of plot titles. Defaults to 10.}

\item{axis_title_size}{Numeric value for the font size of axis titles. Defaults to 8.}

\item{legend_title_size}{Numeric value for the font size of legend titles. Defaults to 8.}

\item{legend_text_size}{Numeric value for the font size of legend text. Defaults to 6.}

\item{seed}{An integer used as the random seed for reproducibility. Defaults to 123.}
}
\value{
An object of class \code{DataRFClassifier} with the following elements:
\describe{
\item{Input_data}{The transposed and (optionally) filtered OTU table.}
\item{Predicted_results_on_train_set}{A vector of predicted group labels for the training set.}
\item{Predicted_results_on_test_set}{A vector of predicted group labels for the test set.}
\item{Traindata_confusion_matrix}{A confusion matrix comparing actual vs. predicted group labels for the training set.}
\item{Testdata_confusion_matrix}{A confusion matrix comparing actual vs. predicted group labels for the test set.}
\item{Margin_scores_train}{A ggplot object displaying the margin scores of the training set samples.}
\item{OTU_importance}{A data frame of OTU importance metrics, sorted by Mean Decrease Accuracy.}
\item{Classifier}{A random forest classifier object trained on the training set.}
\item{cross_validation}{A ggplot object showing the cross-validation error curve as a function of the number of features.}
}
}
\description{
This function implements a random forest classification model tailored for OTU/ASV datasets.
It performs data filtering, model training, performance evaluation, cross-validation, and
biomarker (important microbial features) selection based on Mean Decrease Accuracy.
}
\details{
The function processes the input OTU count data and corresponding metadata in several steps:
\enumerate{
\item \strong{Data Filtering and Preparation:} If a minimum count threshold (\code{OTU_counts_filter_value}) is provided,
OTUs with total counts below this value are removed. The OTU table is then transposed and merged with the metadata,
where a specific column (specified by \code{Group}) indicates the group labels.
\item \strong{Data Partitioning:} The combined dataset is split into training and testing subsets based on the proportion
specified by \code{train_p}.
\item \strong{Model Training:} A random forest classifier is trained on the training data. The function computes the
margin scores for the training samples, which are plotted to visualize the model's confidence.
\item \strong{Performance Evaluation:} Predictions are made on both training and testing datasets. Confusion matrices
are generated to compare the actual versus predicted classes.
\item \strong{Feature Importance and Cross-Validation:} OTU importance is assessed using Mean Decrease Accuracy.
Repeated k-fold cross-validation (\code{cv_fold} folds repeated \code{reps} times; 10 folds and 5 repetitions by default)
is performed to determine the optimal number of OTUs (biomarkers). A cross-validation error curve is plotted, and the
user is then prompted to enter the best number of OTUs based on that plot.
}
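
The steps above can be sketched directly with the \pkg{randomForest} package. This is a minimal illustration under
stated assumptions, not the internal implementation of this function: the toy data, the object names
(\code{otu}, \code{group}, \code{train_idx}, and so on), the threshold of 50, and the 70/30 split are all hypothetical.
\preformatted{
library(randomForest)

set.seed(123)
# Hypothetical toy counts: 20 OTUs (rows) x 40 samples (columns)
otu <- matrix(rpois(800, lambda = 20), nrow = 20,
              dimnames = list(paste0("OTU", 1:20), paste0("S", 1:40)))
group <- factor(rep(c("Control", "Treatment"), each = 20))

# Step 1: drop OTUs whose total count is below the threshold, then transpose
otu <- otu[rowSums(otu) >= 50, , drop = FALSE]
x <- as.data.frame(t(otu))

# Step 2: random split, 70 percent of the samples for training
train_idx <- sample(nrow(x), size = floor(0.7 * nrow(x)))

# Step 3: train the classifier and compute margins on the training samples
fit <- randomForest(x = x[train_idx, ], y = group[train_idx], importance = TRUE)
train_margin <- margin(fit)

# Step 4: confusion matrix on the held-out samples
pred <- predict(fit, x[-train_idx, ])
conf_test <- table(Actual = group[-train_idx], Predicted = pred)

# Step 5: rank OTUs by Mean Decrease Accuracy, then cross-validate
# the error against the number of features retained
imp <- importance(fit, type = 1)
imp <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
cv <- rfcv(x[train_idx, ], group[train_idx], cv.fold = 10)
cv$error.cv
}
Here \code{rfcv} is run once; repeating it \code{reps} times and averaging \code{error.cv} would give a curve
comparable to the one described in the last step.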
}
\examples{
\donttest{
# Example OTU count data (20 OTUs x 10 samples)
set.seed(123)
otu_data <- matrix(sample(0:100, 200, replace = TRUE), nrow = 20)
colnames(otu_data) <- paste0("Sample", 1:10)
rownames(otu_data) <- paste0("OTU", 1:20)

# Example metadata with group labels
metadata <- data.frame(Group = rep(c("Control", "Treatment"), each = 5))

# Run the classifier
result <- Data.rf.classifier(raw_data = otu_data,
                             metadata = metadata,
                             train_p = 0.7,
                             Group = "Group",
                             OTU_counts_filter_value = 50)
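
# Inspect the returned object. This is a hedged sketch: the element names are
# taken from the Value section, and it assumes the result is list-like so that
# `$` access works. Output depends on the random split and the fitted model.
result$Testdata_confusion_matrix   # actual vs. predicted labels on the test set
head(result$OTU_importance)        # OTUs sorted by Mean Decrease Accuracy
print(result$cross_validation)     # cross-validation error curve (ggplot)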
}
}
\author{
Shijia Li
}
