Type: | Package |
Title: | Directed Dependence Coefficient |
Version: | 0.1.0 |
Maintainer: | Yuping Wang <yuping.wang@plus.ac.at> |
Description: | Directed Dependence Coefficient (didec) is a measure of directed dependence. Multivariate Feature Ordering by Conditional Independence (MFOCI) is a variable selection algorithm based on didec. Hierarchical Variable Clustering (VarClustPartition) is a variable clustering method based on didec. For more information, see the paper by Ansari and Fuchs (2024, <doi:10.48550/arXiv.2212.01621>), and the paper by Fuchs and Wang (2024, <doi:10.1016/j.ijar.2024.109185>). |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.0 |
Imports: | copBasic (≥ 2.2.3), cowplot (≥ 1.1.2), dendextend (≥ 1.17.1), factoextra (≥ 1.0.7), FOCI (≥ 0.1.3), ggplot2 (≥ 3.4.4), graphics (≥ 4.3.0), grDevices (≥ 0.5-1), gtools (≥ 3.9.5), phylogram (≥ 2.1.0), rlang (≥ 1.1.4), stats (≥ 4.3.0) |
Depends: | R (≥ 2.10) |
NeedsCompilation: | no |
Packaged: | 2024-08-20 17:40:01 UTC; yupin |
Author: | Yuping Wang [aut, cre], Sebastian Fuchs [aut], Jonathan Ansari [aut] |
Repository: | CRAN |
Date/Publication: | 2024-08-26 15:40:05 UTC |
Average diameter & Maximum split of every partition of a given dendrogram
Description
Average diameter & Maximum split of every partition of a given dendrogram
Usage
Adiam.Msplit(X, dend = dend, dist.func = "PD")
Arguments
X |
a data frame for a set of variables X |
dend |
a dendrogram |
dist.func |
the distance function: one of "PD", "MPD", "kendall", or "footrule" |
Value
a data frame
Estimate for T^q(Y|X) based on function Codec
Description
Estimate for T^q(Y|X) based on function Codec
Usage
Codec.Tq(X, Y)
Arguments
X |
a data frame for input vector X |
Y |
a data frame for output vector Y |
Value
the value of T^q(Y|X)
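The way univariate estimates are combined into T^q can be sketched as follows. This is a minimal sketch, assuming the definition given in the cited paper by Ansari and Fuchs: T^q(Y|X) = 1 - (q - sum_{i=1..q} T(Y_i | (X, Y_{i-1}, ..., Y_1))) / (q - sum_{i=2..q} T(Y_i | (Y_{i-1}, ..., Y_1))). The helper names tq_sketch and r2_stand_in are hypothetical, and the R-squared stand-in only replaces the rank-based estimator codec so that the block runs with base R; it is not what Codec.Tq computes.

```r
# Stand-in for a univariate dependence estimator T(y | Z) (assumption:
# the package relies on the rank-based codec from FOCI, not on R-squared).
r2_stand_in <- function(y, Z) {
  summary(lm(y ~ ., data = as.data.frame(Z)))$r.squared
}

# Hypothetical helper: combine univariate estimates into T^q(Y | X).
tq_sketch <- function(X, Y, T_hat = r2_stand_in) {
  X <- as.data.frame(X)
  Y <- as.data.frame(Y)
  q <- ncol(Y)
  num <- 0  # sum over i of T(Y_i | (X, Y_{i-1}, ..., Y_1))
  den <- 0  # sum over i >= 2 of T(Y_i | (Y_{i-1}, ..., Y_1))
  for (i in seq_len(q)) {
    if (i == 1) {
      num <- num + T_hat(Y[[1]], X)
    } else {
      prev <- Y[, seq_len(i - 1), drop = FALSE]
      num <- num + T_hat(Y[[i]], cbind(X, prev))
      den <- den + T_hat(Y[[i]], prev)
    }
  }
  1 - (q - num) / (q - den)
}

set.seed(1)
n <- 200
X <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
Y <- data.frame(Y1 = X$X1 + 0.05 * rnorm(n),
                Y2 = X$X2 + 0.05 * rnorm(n))
tq_value <- tq_sketch(X, Y)
tq_value
```

Because Y is almost a function of X in this toy data, the value is close to 1; with Y independent of X it would be close to 0.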
Estimate for T^q_bar(Y|X) based on function Codec & a sample of all / all increasing / all decreasing permutations
Description
Estimate for T^q_bar(Y|X) based on function Codec & a sample of all / all increasing / all decreasing permutations
Usage
Codec.Tq.Perm(X, Y, method = c("sample"))
Arguments
X |
a data frame for input vector X |
Y |
a data frame for output vector Y |
method |
permutation method: sample / increasing / decreasing / full |
Value
the value of T^q_bar(Y|X)
Silhouette value for the i-th variable given a variable partition
Description
Silhouette value for the i-th variable given a variable partition
Usage
Silhouette(i, df, partition, dist.func = "PD")
Arguments
i |
the index of the variable |
df |
a data frame for all variables |
partition |
a partition |
dist.func |
the distance function: one of "PD", "MPD", "kendall", or "footrule" |
Value
a value for Silhouette
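For orientation, the classical silhouette value (Kaufman, 1990, cited in the References of VarClustPartition) can be sketched on a precomputed dissimilarity matrix. This is an illustration only: the helper silhouette_value, the matrix D and the list-based partition are hypothetical, and the package function works on variables with the distance selected by dist.func, so its internals may differ.

```r
# Classical silhouette value of item i given a symmetric dissimilarity
# matrix D (with dimnames) and a partition stored as a list of clusters.
silhouette_value <- function(i, D, partition) {
  own <- which(vapply(partition, function(cl) i %in% cl, logical(1)))
  mates <- setdiff(partition[[own]], i)
  if (length(mates) == 0) return(0)  # common convention for singletons
  # a: mean dissimilarity to the other members of the own cluster
  a <- mean(D[i, mates])
  # b: mean dissimilarity to the nearest other cluster
  b <- min(vapply(partition[-own],
                  function(cl) mean(D[i, cl]), numeric(1)))
  (b - a) / max(a, b)
}

items <- c("a", "b", "c", "d")
D <- matrix(c(0.0, 0.1, 0.8, 0.9,
              0.1, 0.0, 0.7, 0.6,
              0.8, 0.7, 0.0, 0.2,
              0.9, 0.6, 0.2, 0.0),
            nrow = 4, byrow = TRUE, dimnames = list(items, items))
partition <- list(c("a", "b"), c("c", "d"))
s_a <- silhouette_value("a", D, partition)
s_a
```

Values near 1 indicate that the item sits well inside its cluster; values near -1 indicate that it would fit better in a neighbouring cluster.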
Silhouette coefficients given a dendrogram
Description
Silhouette coefficients given a dendrogram
Usage
Silhouette.coefficient(X, dend, dist.func = "PD")
Arguments
X |
a data frame for a set of variables X |
dend |
a dendrogram |
dist.func |
the distance function: one of "PD", "MPD", "kendall", or "footrule" |
Value
a data frame
Hierarchical variable clustering.
Description
VarClustPartition is a hierarchical variable clustering algorithm based on the directed dependence coefficient (didec) or a concordance measure (Kendall's tau or Spearman's footrule), according to a pre-selected number of clusters or an optimality criterion (Adiam&Msplit or Silhouette coefficient).
Usage
VarClustPartition(
X,
dist.method = c("PD"),
linkage = FALSE,
link.method = c("complete"),
part.method = c("optimal"),
criterion = c("Adiam&Msplit"),
num.cluster = NULL,
plot = FALSE
)
Arguments
X |
A numeric matrix or data.frame/data.table. Contains the variables to be clustered. |
dist.method |
An optional character string computing a distance function for clustering. This must be one of the strings |
linkage |
A logical. If |
link.method |
An optional character string selecting a linkage method. This must be one of the strings |
part.method |
An optional character string selecting a partitioning method. This must be one of the strings |
criterion |
An optional character string selecting a criterion for the optimal partition, if |
num.cluster |
An integer value for the selected number of clusters, if |
plot |
A logical. If |
Details
VarClustPartition performs a hierarchical variable clustering based on the directed dependence coefficient (didec) and provides a partition of the set of variables.
If dist.method == "PD" or dist.method == "MPD", the clustering is performed using didec either as a directed ("PD") or as a symmetric ("MPD") dependence coefficient.
If dist.method == "kendall" or dist.method == "footrule", clustering is performed using either multivariate Kendall's tau ("kendall") or multivariate Spearman's footrule ("footrule").
Instead of using one of the above-mentioned four multivariate measures for the clustering, the option linkage == TRUE enables the use of bivariate linkage methods, including complete linkage (link.method == "complete"), average linkage (link.method == "average") and single linkage (link.method == "single").
Note that the multivariate distance methods are computationally demanding because higher-dimensional dependencies are included in the calculation, in contrast to linkage methods, which only incorporate pairwise dependencies.
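The pairwise nature of the linkage methods can be illustrated with base R alone. This is a minimal sketch, assuming 1 - |Kendall's tau| as a stand-in pairwise dissimilarity; it is not the distance the package computes, and it only shows that linkage clustering needs nothing beyond a matrix of pairwise values.

```r
set.seed(1)
n <- 50
X1 <- rnorm(n); X2 <- X1; X3 <- rnorm(n); X4 <- X3 + X2
X <- data.frame(X1 = X1, X2 = X2, X3 = X3, X4 = X4)

# Pairwise dissimilarity (assumption: 1 - |Kendall's tau| as a stand-in).
d <- as.dist(1 - abs(cor(X, method = "kendall")))

# One dendrogram per linkage method, built from pairwise values only.
methods <- c(complete = "complete", average = "average", single = "single")
trees <- lapply(methods, function(m) hclust(d, method = m))

# Cut each tree into three clusters and compare the assignments.
cuts <- lapply(trees, cutree, k = 3)
cuts$complete
```

Since X2 is a copy of X1, their dissimilarity is zero and every linkage method merges them first.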
A pre-selected number of clusters num.cluster can be realized with the option part.method == "selected".
Otherwise (part.method == "optimal"), the number of clusters is determined by maximizing the intra-cluster similarity (similarity within the same cluster) and minimizing the inter-cluster similarity (similarity among the clusters). Two optimality criteria are available:
"Adiam&Msplit": Adiam measures the intra-cluster similarity and Msplit measures the inter-cluster similarity.
"Silhouette": A mixed coefficient incorporating the intra-cluster similarity and the inter-cluster similarity. The optimal number of clusters corresponds to the maximum Silhouette coefficient.
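The idea of scanning the partitions of a dendrogram for a trade-off between within-cluster and between-cluster quantities can be sketched in base R. The sketch uses the classical notions of diameter (largest dissimilarity inside a cluster) and split (smallest dissimilarity between two clusters) from Hansen and Jaumard (1997). The dissimilarity 1 - |Kendall's tau|, the helper names, and the way the values are aggregated per partition (mean of diameters, minimum of splits) are assumptions made for illustration; the package's Adiam and Msplit may be defined differently.

```r
set.seed(1)
n <- 50
X1 <- rnorm(n); X2 <- X1; X3 <- rnorm(n); X4 <- X3 + X2
X <- data.frame(X1 = X1, X2 = X2, X3 = X3, X4 = X4)

# Stand-in dissimilarity (assumption): 1 - |Kendall's tau|.
D <- 1 - abs(cor(X, method = "kendall"))
hc <- hclust(as.dist(D), method = "complete")

# Classical diameter: largest dissimilarity inside a cluster.
diameter <- function(members) max(D[members, members])
# Classical split: smallest dissimilarity between two clusters.
split_between <- function(A, B) min(D[A, B])

# For each number of clusters k, cut the tree and summarise the partition.
scan_partitions <- function(ks) {
  rows <- lapply(ks, function(k) {
    cl <- cutree(hc, k = k)
    groups <- lapply(sort(unique(cl)), function(g) names(cl)[cl == g])
    pairs <- combn(length(groups), 2, simplify = FALSE)
    data.frame(
      k = k,
      avg_diameter = mean(vapply(groups, diameter, numeric(1))),
      min_split = min(vapply(pairs, function(p)
        split_between(groups[[p[1]]], groups[[p[2]]]), numeric(1)))
    )
  })
  do.call(rbind, rows)
}

tradeoff <- scan_partitions(2:3)
tradeoff
```

A partition is attractive when its clusters are tight (small diameters) and well separated (large splits), which is why the two columns are read together.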
Value
A list containing a dendrogram without colored branches (dendrogram), an integer value determining the number of clusters after partitioning (num.cluster), and a list containing the clusters after partitioning (clusters).
Author(s)
Yuping Wang, Sebastian Fuchs
References
S. Fuchs, Y. Wang, Hierarchical variable clustering based on the predictive strength between random vectors, Int. J. Approx. Reason. 170, Article ID 109185, 2024.
P. Hansen, B. Jaumard, Cluster analysis and mathematical programming, Math. Program. 79 (1) 191–215, 1997.
L. Kaufman, Finding Groups in Data, John Wiley & Sons, 1990.
Examples
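library(didec)
# Simulated data: X2 duplicates X1, and X4 depends on both X3 and X2.
n <- 50
X1 <- rnorm(n, 0, 1)
X2 <- X1
X3 <- rnorm(n, 0, 1)
X4 <- X3 + X2
X <- data.frame(X1 = X1, X2 = X2, X3 = X3, X4 = X4)
# Multivariate clustering with "PD"; partition chosen by the Silhouette coefficient.
vcp <- VarClustPartition(X,
                         dist.method = c("PD"),
                         part.method = c("optimal"),
                         criterion = c("Silhouette"),
                         plot = TRUE)
vcp$clusters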
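# Real data: four columns of the bioclimatic data set.
data("bioclimatic")
X <- bioclimatic[c(2:4, 9)]
# Complete linkage with "PD"; partition chosen by the Silhouette coefficient.
vcp1 <- VarClustPartition(X,
                          linkage = TRUE,
                          link.method = c("complete"),
                          dist.method = "PD",
                          part.method = "optimal",
                          criterion = "Silhouette",
                          plot = TRUE)
vcp1$clusters
# Complete linkage with Spearman's footrule; partition chosen by Adiam&Msplit.
vcp2 <- VarClustPartition(X,
                          linkage = TRUE,
                          link.method = c("complete"),
                          dist.method = "footrule",
                          part.method = "optimal",
                          criterion = "Adiam&Msplit",
                          plot = TRUE)
vcp2$clusters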
Bioclimatic variables
Description
A data set of bioclimatic variables for n = 1,862 locations homogeneously distributed over the global landmass, taken from CHELSA ("Climatologies at high resolution for the earth's land surface areas").
Usage
bioclimatic
Format
An object of class data.frame with 1862 rows and 14 columns.
References
D.N. Karger, O. Conrad, J. Böhner, T. Kawohl, H. Kreft, R.W. Soria-Auza, N.E. Zimmermann, H.P. Linder, M. Kessler, Climatologies at high resolution for the Earth's land surface areas, Sci. Data 4(1), 2017.
Examples
head(bioclimatic)
Cluster a set of variables using a distance function based on a predictive measure
Description
Cluster a set of variables using a distance function based on a predictive measure
Usage
clust.Tq(X, perm = TRUE, perm.method = c("decreasing"), mutual = FALSE)
Arguments
X |
a data frame for a set of variables X |
perm |
whether to use the permuted version T^q_bar instead of T^q |
perm.method |
permutation methods: sample / increasing / decreasing |
mutual |
whether to use the type B function (mutual perfect dependence) or not |
Value
a list for hierarchical clustering result
Cluster a set of variables using a distance function based on multivariate concordance measures
Description
Cluster a set of variables using a distance function based on multivariate concordance measures
Usage
clust.concor.M(X, method = c("footrule"))
Arguments
X |
a data frame for vector X |
method |
kendall / footrule |
Value
a list for hierarchical clustering result
Estimation for multivariate concordance measures
Description
Estimation for multivariate concordance measures
Usage
concor.M(X, method = c("footrule"))
Arguments
X |
a data frame for vector X |
method |
kendall / footrule |
Value
a value of the estimator for the multivariate concordance measures
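As background, the bivariate versions of the two concordance measures can be computed in a few lines of base R. This is a sketch of the classical bivariate coefficients only; the helper names are hypothetical, and the multivariate estimators used by concor.M are not reproduced here.

```r
# Bivariate Spearman's footrule: 1 - 3 * sum(|r_i - s_i|) / (n^2 - 1),
# where r and s are the ranks of x and y.
footrule <- function(x, y) {
  n <- length(x)
  1 - 3 * sum(abs(rank(x) - rank(y))) / (n^2 - 1)
}

# Bivariate Kendall's tau via base R.
kendall <- function(x, y) cor(x, y, method = "kendall")

x <- 1:5
footrule(x, x)       # identical rankings
footrule(x, rev(x))  # reversed rankings
kendall(x, rev(x))   # reversed rankings
```

Identical rankings give a footrule of 1. For reversed rankings Kendall's tau reaches -1, whereas the footrule is bounded below by -0.5 for odd n, so the two coefficients are not on the same scale.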
Read a dendrogram from a list for hierarchical clustering result
Description
Read a dendrogram from a list for hierarchical clustering result
Usage
dendrogram(clust, step = TRUE)
Arguments
clust |
a list for hierarchical clustering result |
step |
whether to use the clustering step as the y-axis or not |
Value
an object of class "dendrogram"
Diameter of a class of variables based on different distance functions
Description
Diameter of a class of variables based on different distance functions
Usage
diam(X, dist.func = "PD")
Arguments
X |
a data frame for a set of variables X |
dist.func |
the distance function: one of "PD", "MPD", "kendall", or "footrule" |
Value
a value
Computes the directed dependence coefficient.
Description
The directed dependence coefficient (didec) estimates the degree of directed dependence of a random vector Y on a random vector X, based on an i.i.d. sample of (X,Y).
Usage
didec(X, Y, perm = FALSE, perm.method = c("decreasing"))
Arguments
X |
A numeric matrix or data.frame/data.table. Contains the predictor vector X. |
Y |
A numeric matrix or data.frame/data.table. Contains the response vector Y. |
perm |
A logical. If |
perm.method |
An optional character string specifying a method for permuting the response variables. This must be one of the strings |
Details
The directed dependence coefficient (didec) is an extension of Azadkia & Chatterjee's measure of directed dependence (Azadkia & Chatterjee, 2021) to a vector of response variables, as introduced by Ansari & Fuchs (2023).
Its calculation is based on the function codec, which estimates Azadkia & Chatterjee's measure of directed dependence and is provided in the R package FOCI.
By definition, didec is invariant with respect to permutations of the variables within the predictor vector X. Invariance with respect to permutations within the response vector Y is achieved by computing the arithmetic mean over all possible (or chosen) permutations.
In addition to the option "full", which runs all q! permutations of (1, ..., q), less computationally intensive options are also available (here, q denotes the number of response variables): a random selection of q permutations ("sample"), or cyclic permutations such as (1,2,...,q), (2,...,q,1), either "increasing" or "decreasing".
Note that when the number of response variables q is large, choosing "full" may result in long computation times.
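The difference in cost between the cyclic options and "full" can be made concrete with a small base R sketch. The helpers cyclic_shifts and average_over are hypothetical and only illustrate the idea of averaging a statistic over a set of column permutations of Y; which shift direction the package calls "increasing" and which "decreasing" is not stated here and is left open.

```r
# All q cyclic shifts of (1, ..., q), one per row.
cyclic_shifts <- function(q) {
  t(vapply(0:(q - 1),
           function(s) ((seq_len(q) - 1 + s) %% q) + 1,
           numeric(q)))
}

q <- 4
shifts <- cyclic_shifts(q)
shifts
nrow(shifts)   # number of permutations used by a cyclic option
factorial(q)   # number of permutations used by "full"

# Average an order-dependent statistic over the chosen permutations
# of the columns of Y.
average_over <- function(Y, perms, stat) {
  mean(apply(perms, 1, function(p) stat(Y[, p, drop = FALSE])))
}

# Toy statistic: the first entry of the first column after permuting.
Y <- matrix(1:8, ncol = 4)
avg <- average_over(Y, shifts, function(Yp) Yp[1, 1])
avg
```

The cyclic options grow linearly in q while "full" grows factorially, which is the reason for the warning about long computation times.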
Value
The degree of directed dependence of the random vector Y on the random vector X.
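A usage sketch, based only on the Usage signature above and on the bioclimatic data set shipped with the package. The choice of columns mirrors the mfoci example and is otherwise arbitrary; the requireNamespace guard is an addition so that the block does nothing when the package is not installed.

```r
has_didec <- requireNamespace("didec", quietly = TRUE)
if (has_didec) {
  bioclimatic <- didec::bioclimatic
  X <- bioclimatic[, 9:12]     # predictors (columns as in the mfoci example)
  Y <- bioclimatic[, c(1, 8)]  # responses
  # Default call: no averaging over permutations of the response variables.
  print(didec::didec(X, Y))
  # Averaging over cyclic permutations of the response variables.
  print(didec::didec(X, Y, perm = TRUE, perm.method = "decreasing"))
}
```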
Author(s)
Yuping Wang, Sebastian Fuchs, Jonathan Ansari
References
M. Azadkia, S. Chatterjee, A simple measure of conditional dependence, Ann. Stat. 49 (6), 2021.
J. Ansari, S. Fuchs, A simple extension of Azadkia & Chatterjee's rank correlation to multi-response vectors, Available at https://arxiv.org/abs/2212.01621, 2024.
Distance function based on T^q
Description
Distance function based on T^q
Usage
dist.Tq(X, Y, perm = TRUE, perm.method = c("decreasing"), mutual = FALSE)
Arguments
X |
a data frame for vector X |
Y |
a data frame for vector Y |
perm |
permuted version or not |
perm.method |
permutation methods: sample / increasing / decreasing / full |
mutual |
use mutual perfect dependence or not |
Value
a value for distance between two vectors
Distance function based on multivariate concordance measures
Description
Distance function based on multivariate concordance measures
Usage
dist.concor.M(X, Y, method = c("footrule"))
Arguments
X |
a data frame for vector X |
Y |
a data frame for vector Y |
method |
kendall / footrule |
Value
a value for distance between two vectors
Distance Matrix Computation using distance function based on T^q
Description
Distance Matrix Computation using distance function based on T^q
Usage
dist.mat.T(X, mutual = FALSE)
Arguments
X |
a data frame for vector X |
mutual |
use type B function (mutual perfect dependence) or not |
Value
an object of class "dist"
Distance Matrix Computation using distance function based on multivariate concordance measures
Description
Distance Matrix Computation using distance function based on multivariate concordance measures
Usage
dist.mat.concor(X, method = c("footrule"))
Arguments
X |
a data frame for vector X |
method |
kendall / footrule |
Value
an object of class "dist"
Multivariate feature ordering by conditional independence.
Description
A variable selection algorithm based on the directed dependence coefficient (didec).
Usage
mfoci(
X,
Y,
pre.selected = NULL,
perm = FALSE,
perm.method = c("decreasing"),
autostop = TRUE
)
Arguments
X |
A numeric matrix or data.frame/data.table. Contains the predictor vector X. |
Y |
A numeric matrix or data.frame/data.table. Contains the response vector Y. |
pre.selected |
An integer vector for indexing pre-selected predictor variables from X. |
perm |
A logical. If |
perm.method |
An optional character string specifying a method in |
autostop |
A logical. If |
Details
mfoci is a forward feature selection algorithm for multiple-outcome data that employs the directed dependence coefficient (didec) at each step.
mfoci has been proved to be consistent in the sense that the subset of predictor variables selected via mfoci is sufficient with high probability.
If autostop == TRUE, the algorithm stops at the first non-increasing value of didec, thereby selecting a subset of variables.
Otherwise, all predictor variables are ordered according to their predictive strength measured by didec.
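The forward-selection logic with and without automatic stopping can be sketched generically. This is not the implementation of mfoci: the helper greedy_forward is hypothetical, it handles a single response for brevity, and adjusted R-squared is used as a stand-in score in place of didec so that the block runs with base R.

```r
# Generic greedy forward selection. score(Xsub, y) returns the predictive
# strength of the columns in Xsub for y (a stand-in for didec here).
greedy_forward <- function(X, y, score, autostop = TRUE) {
  remaining <- names(X)
  selected <- character(0)
  best <- -Inf
  while (length(remaining) > 0) {
    # Score each candidate when added to the variables selected so far.
    cand <- vapply(remaining,
                   function(v) score(X[, c(selected, v), drop = FALSE], y),
                   numeric(1))
    top <- which.max(cand)
    # With autostop, stop at the first step that does not increase the score.
    if (autostop && cand[[top]] <= best) break
    selected <- c(selected, remaining[[top]])
    best <- cand[[top]]
    remaining <- remaining[-top]
  }
  selected
}

# Stand-in score (assumption): adjusted R-squared of a linear model.
adj_r2 <- function(Xsub, y) summary(lm(y ~ ., data = Xsub))$adj.r.squared

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X$x1 + 0.1 * rnorm(n)

sel <- greedy_forward(X, y, adj_r2)                    # selected subset
ord <- greedy_forward(X, y, adj_r2, autostop = FALSE)  # full ordering
sel
ord
```

With autostop = TRUE the result is a subset that begins with the most predictive variable; with autostop = FALSE every predictor appears once, ordered by the step at which it was added.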
Value
A data.frame listing the selected variables.
Author(s)
Sebastian Fuchs, Jonathan Ansari, Yuping Wang
References
J. Ansari, S. Fuchs, A simple extension of Azadkia & Chatterjee's rank correlation to multi-response vectors, Available at https://arxiv.org/abs/2212.01621, 2024.
Examples
library(didec)
data("bioclimatic")
X <- bioclimatic[, c(9:12)]
Y <- bioclimatic[, c(1,8)]
mfoci(X, Y, pre.selected = c(1, 3))
Plot the trade-off of Adiam and Msplit
Description
Plot the trade-off of Adiam and Msplit
Usage
plot_Adiam.Msplit(tradeoff, main = NULL, sub = NULL)
Arguments
tradeoff |
a data frame |
main |
main title |
sub |
sub title |
Value
ggplot
Plot the Silhouette coefficient
Description
Plot the Silhouette coefficient
Usage
plot_Silhouette.coefficient(Silhouette_Index, main = NULL, sub = NULL)
Arguments
Silhouette_Index |
a data frame of Silhouette coefficient |
main |
main title |
sub |
sub title |
Value
ggplot
Plot a dendrogram with colored branches
Description
Plot a dendrogram with colored branches
Usage
plot_dendrogram(
dend,
num.cluster = num.cluster,
linkage = FALSE,
ylab = ylab,
cex.lab = 0.6,
cex.axis = 0.6
)
Arguments
dend |
a dendrogram |
num.cluster |
the number of colored branches |
linkage |
logical; if TRUE, the linkage method is used |
ylab |
a string |
cex.lab |
a value |
cex.axis |
a value |
Value
plot
Split of two classes of variables based on different distance functions
Description
Split of two classes of variables based on different distance functions
Usage
split(X, Y, dist.func = "PD")
Arguments
X |
a data frame for a set of variables X |
Y |
a data frame for a set of variables Y |
dist.func |
the distance function: one of "PD", "MPD", "kendall", or "footrule" |
Value
a value