ontologySimilarity is part of the ‘ontologyX’ family of
packages (see the ‘Introduction to ontologyX’ vignette supplied with the
ontologyIndex package). It contains various functions for
calculating semantic similarity between ontological objects. The
functions operate on various kinds of object. It’s useful to look out
for particular parameter names, as each kind of object tends to be
given the same name across functions. To make full use of the features
in ontologySimilarity, the user is encouraged to become
familiar with the functions in ontologyIndex.
ontology - Objects of class ontology_index, described in the package ontologyIndex.
terms - A character vector of term IDs, either representing terms individually or terms which together annotate a particular thing, e.g. term IDs from the Gene Ontology (GO) representing the functional annotations of a gene.
term_sets - A list of character vectors of term IDs.
information_content - A numeric vector of information content values for individual terms, named by term IDs. Typically this would be used in an evaluation of either Resnik’s or Lin’s between-term similarity expression.
pop_sim - An object which stores information about similarities of a population of (ontological) objects, either to one another or to some foreign object. Used to increase performance when many look-ups of similarity are required.
Various kinds of similarity can be calculated, including:
Some key functions are:
get_term_sim_mat for pairwise term similarities, which returns a matrix,
get_sim_grid for pairwise similarities between sets of terms, which returns a matrix,
get_sim for group similarity,
get_sim_p for computing a p-value for group similarity.
To use the package, first load ontologyIndex and an
ontology_index object. Here we demonstrate using the Human
Phenotype Ontology, hpo.
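The loading step described above might look like the following. This is a minimal sketch, assuming both packages are installed and that the hpo data set bundled with ontologyIndex is used; the seed value is arbitrary and only there to make later sampling reproducible.

```r
# Load the ontology tools and the similarity functions
library(ontologyIndex)
library(ontologySimilarity)

# Load the Human Phenotype Ontology shipped with ontologyIndex
data(hpo)

# Arbitrary seed (an assumption, not taken from the text) so that
# random sampling of term sets can be repeated
set.seed(1)
```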
Next, we’ll set the information content for the terms. This is typically based on some kind of ‘population frequency’, for example the frequency with which the term is used, explicitly or implicitly, to annotate objects in a database. Such frequency information is not always available, but it can still be useful to define the information content with respect to the frequency with which the term is an ancestor of other terms in the ontology, as this still captures the structure of the ontology.
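As a minimal sketch of the structure-based option, the descendants_IC function from ontologySimilarity can be used to derive information content from the ontology alone. When annotation frequencies are available, get_term_info_content from ontologyIndex would be the alternative; that case is not shown here.

```r
library(ontologyIndex)
library(ontologySimilarity)
data(hpo)

# Information content derived from the ontology structure only:
# one numeric value per term, named by term ID
information_content <- descendants_IC(hpo)

# Inspect the first few values
head(information_content)
```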
Now we’ll generate some random sets of terms. We’ll sample 5 random
term sets (which could for example represent the phenotypes of patients)
of 8 terms. Note that here, we call the minimal_set
function from the ontologyIndex package on each sample set
to remove redundant terms. Typically, ontological annotations would be
stored as such minimal sets; however, if you are unsure, it is best to
call minimal_set on each term set to guarantee that the
similarity expressions are faithfully evaluated (for speed, the package
does not map to minimal sets by default).
term_sets <- replicate(simplify=FALSE, n=5, expr=minimal_set(hpo, sample(hpo$id, size=8)))
term_sets
## [[1]]
## [1] "HP:0430093" "HP:0007165" "HP:0032845" "HP:0030058" "HP:0005876"
## [6] "HP:0033126" "HP:0031139"
##
## [[2]]
## [1] "HP:0031850" "HP:3000041" "HP:0033889" "HP:5200422" "HP:0033530"
## [6] "HP:0025158" "HP:0011571" "HP:0500133"
##
## [[3]]
## [1] "HP:0001100" "HP:0009792" "HP:0031795" "HP:0030578" "HP:0010377"
## [6] "HP:0034289" "HP:0020133" "HP:0004921"
##
## [[4]]
## [1] "HP:0002622" "HP:0002034" "HP:0034854" "HP:0032247" "HP:0100078"
## [6] "HP:0006887" "HP:0009785" "HP:0001635"
##
## [[5]]
## [1] "HP:0030138" "HP:0025103" "HP:0200035" "HP:0500049" "HP:0008390"
## [6] "HP:5200413" "HP:0033555" "HP:0100927"
Then one can calculate a similarity matrix, containing pairwise term-set similarities:
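A call along the following lines would produce such a matrix. This is a sketch that rebuilds its own inputs with an arbitrary seed, so the exact values will differ from the output shown; passing information_content explicitly is a choice made here for clarity.

```r
library(ontologyIndex)
library(ontologySimilarity)
data(hpo)
set.seed(1)  # arbitrary seed, an assumption for reproducibility

# Structure-based information content and five random minimal term sets
information_content <- descendants_IC(hpo)
term_sets <- replicate(simplify=FALSE, n=5,
  expr=minimal_set(hpo, sample(hpo$id, size=8)))

# Pairwise similarities between the term sets, returned as a 5 x 5 matrix
sim_mat <- get_sim_grid(ontology=hpo,
  information_content=information_content, term_sets=term_sets)
sim_mat
```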
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 0.1242382 0.07890359 0.1361949 0.2450250
## [2,] 0.12423823 1.0000000 0.22358934 0.1615010 0.1728086
## [3,] 0.07890359 0.2235893 1.00000000 0.1095122 0.1405417
## [4,] 0.13619488 0.1615010 0.10951216 1.0000000 0.1488531
## [5,] 0.24502498 0.1728086 0.14054170 0.1488531 1.0000000
Group similarity of phenotypes 1-3, based on
sim_mat:
## [1] 0.1422437
p-value for significance of similarity of phenotypes 1-3:
## [1] 0.7132867
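The group similarity and p-value above could be obtained with calls like the ones below. This is a sketch under the same assumptions as before (arbitrary seed, structure-based information content), so the numbers it prints will not match those shown; the p-value is also estimated by random sampling and can vary between runs.

```r
library(ontologyIndex)
library(ontologySimilarity)
data(hpo)
set.seed(1)  # arbitrary seed, an assumption for reproducibility

# Rebuild the inputs: information content, term sets, similarity matrix
information_content <- descendants_IC(hpo)
term_sets <- replicate(simplify=FALSE, n=5,
  expr=minimal_set(hpo, sample(hpo$id, size=8)))
sim_mat <- get_sim_grid(ontology=hpo,
  information_content=information_content, term_sets=term_sets)

# Group similarity of term sets 1 to 3
group_sim <- get_sim(sim_mat, group=1:3)
group_sim

# p-value for the significance of that group similarity
p_value <- get_sim_p(sim_mat, group=1:3)
p_value
```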