% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tax_check.R
\name{tax_check}
\alias{tax_check}
\title{Taxonomic spell check}
\usage{
tax_check(
  taxdf,
  name = "genus",
  group = NULL,
  dis = 0.05,
  start = 1,
  verbose = TRUE
)
}
\arguments{
\item{taxdf}{\code{data.frame}. A data.frame with named columns containing
taxon names (e.g. "species", "genus"). An optional column
containing the groups (e.g. "family", "order") to which the taxon names
belong may also be provided (see \code{group} for details).
\code{NA} values and empty strings (i.e. "" and " ") in the name and group
columns are ignored.}

\item{name}{\code{character}. The column name of the taxon names you wish
to check (e.g. "genus").}

\item{group}{\code{character}. The column name of the higher taxonomic
assignments in \code{taxdf} you wish to group by. If \code{NULL} (default), name
comparison will be conducted within alphabetical groups.}

\item{dis}{\code{numeric}. The dissimilarity threshold: a value greater than
0 (identical) and less than 1 (completely dissimilar).
Potential synonyms with a dissimilarity above this threshold are not returned.
The default is 0.05, but users may wish to experiment with this value for
their specific data.}

\item{start}{\code{numeric}. The number of characters at the beginning of
potential synonyms that must match. Potential synonyms sharing fewer leading
characters than this value are not returned. The default is 1 (i.e. the
first letter of potential synonyms must match).}

\item{verbose}{\code{logical}. Should the results of the non-letter
character check be reported to the user? If \code{TRUE}, the result will only be
reported if such characters are detected in the taxon names.}
}
\value{
If \code{verbose = TRUE} (default), a \code{list} with three elements. The
first element (\code{synonyms}) is a \code{data.frame} in which each row
reports a pair of potential synonyms. The first column (\code{group}) contains
the higher group in which they occur (alphabetical groupings if \code{group} is
not provided). The second column (\code{greater}) contains the more common
synonym in each pair. The third column (\code{lesser}) contains the less common
synonym in each pair. The fourth and fifth columns (\code{count_greater},
\code{count_lesser}) contain the respective counts of each synonym in a pair.
If no matches were found for the filtering arguments, this element is
\code{NULL} instead. The second element (\code{non_letter_name}) is a vector of
taxon names which contain non-letter characters, or \code{NULL} if none were
detected. The third element (\code{non_letter_group}) is a vector of taxon
groups which contain non-letter characters, or \code{NULL} if none were
detected. If \code{verbose = FALSE}, only the \code{data.frame} described above
is returned, or \code{NULL} if no matches were found.
}
\description{
A function to check for and count potential spelling variations of the same
taxon. Spelling variations are checked within alphabetical groups (default),
or within higher taxonomic groups if provided.
}
\details{
When higher taxonomy is provided but missing for some entries, the taxa
lacking higher taxonomic affiliations are still compared within alphabetical
groups. The function also checks for non-letter characters, which are not
expected to be present in correctly-formatted taxon names. The results of
this check are returned to the user when \code{verbose = TRUE}. Comparisons
are performed using the Jaro dissimilarity metric via
\code{\link[stringdist:stringdistmatrix]{stringdist::stringdistmatrix()}}.

As all string distance metrics rely on approximate string matching,
different metrics can produce different results. This function uses Jaro
distance as it was designed with short, typed strings in mind, but good
practice should include comparisons using multiple metrics, and ultimately
specific taxonomic vetting where possible. A more complete implementation
and workflow for cleaning taxonomic occurrence data is available in the
\code{fossilbrush} R package on CRAN.
}
\section{Reference}{

van der Loo, M. P. J. (2014). The stringdist package for approximate string
matching. The R Journal 6, 111-122.
}

\section{Developer(s)}{

Joseph T. Flannery-Sutherland & Lewis A. Jones
}

\section{Reviewer(s)}{

Lewis A. Jones, Kilian Eichenseer & Christopher D. Dean
}

\examples{
# load occurrence data
data("tetrapods")
# Check taxon names alphabetically
ex1 <- tax_check(taxdf = tetrapods, name = "genus", dis = 0.1)
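# A brief sketch of inspecting the output described under Value. With
# verbose = TRUE (default) a list is returned; any element may be NULL
# if nothing was detected.
names(ex1)
head(ex1$synonyms)
# Require the first two letters to match and return only the data.frame
# of potential synonyms (ex3 is an illustrative object name)
ex3 <- tax_check(taxdf = tetrapods, name = "genus", dis = 0.1,
                 start = 2, verbose = FALSE)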
# Check taxon names by group
ex2 <- tax_check(taxdf = tetrapods, name = "genus",
                 group = "family", dis = 0.1)
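# Illustration of the dissimilarity scale used by dis. This is a sketch
# that calls stringdist directly and assumes method = "jw" with its
# default p = 0 (plain Jaro distance); the names are illustrative only.
# Values near 0 indicate near-identical strings; values near 1 indicate
# very dissimilar strings.
stringdist::stringdist("Tyrannosaurus", "Tyranosaurus", method = "jw")
stringdist::stringdist("Tyrannosaurus", "Triceratops", method = "jw")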

}
