This walkthrough sets up a benchmark with the
compare_methods() function, then unpacks what the results
mean.
We use the bundled steel_industry dataset: one full year
of 15-minute energy measurements from a Korean steel plant, including
reactive power, power factor, CO2 emissions, and time-of-day
indicators.
Three preprocessing steps:
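The steps themselves are not reproduced here. As a rough sketch only, three steps consistent with the object names used later (data_thin, data_scaled) and with the 3,504 rows in the label table could be: keep the numeric columns, keep every 10th row, and standardize. The stand-in data frame steel_demo, its column names other than Usage_kWh, and the thinning factor of 10 are all assumptions made for illustration, not taken from the package.

```r
# Synthetic stand-in for steel_industry (hypothetical columns, random values).
set.seed(1)
n_full <- 365 * 96  # one year of 15-minute readings = 35,040 rows
steel_demo <- data.frame(
  Usage_kWh      = rexp(n_full, rate = 1 / 30),
  reactive_power = rexp(n_full, rate = 1 / 10),
  power_factor   = runif(n_full, min = 40, max = 100),
  co2            = rexp(n_full, rate = 50),
  day_type       = sample(c("Weekday", "Weekend"), n_full, replace = TRUE)
)

# Step 1 (assumed): keep only the numeric columns.
numeric_cols <- vapply(steel_demo, is.numeric, logical(1))
data_num <- steel_demo[, numeric_cols, drop = FALSE]

# Step 2 (assumed): thin to every 10th row, 35,040 -> 3,504 observations.
data_thin <- data_num[seq(1, nrow(data_num), by = 10), , drop = FALSE]

# Step 3 (assumed): center and scale each column to mean 0, sd 1.
data_scaled <- scale(data_thin)

dim(data_scaled)
```

Whether Usage_kWh stays in the scaled feature matrix is not stated in this section; the sketch simply leaves it in.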
steel_industry doesn’t ship with explicit class labels,
but Usage_kWh gives us natural ones: low, medium, and high
consumption regimes, defined by tertile cutoffs. We use these as a
yardstick for evaluating how meaningful each clustering result is.
true_labels <- label_by_quantile(data_thin$Usage_kWh,
                                 probs = c(1/3, 2/3))
table(true_labels)
#> true_labels
#>    1    2    3
#> 1184 1152 1168

Each class has roughly N/3 observations.
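To make the tertile labeling concrete, here is a minimal base-R sketch of the idea. The helper label_by_quantile_sketch() is a hypothetical stand-in written for this illustration, not the package's label_by_quantile(), and the usage vector is synthetic.

```r
# Hypothetical stand-in: cut a numeric vector into classes at the given quantiles.
label_by_quantile_sketch <- function(x, probs = c(1/3, 2/3)) {
  cuts <- quantile(x, probs = probs, names = FALSE)
  # findInterval() returns 0 below the first cut, so add 1 to start at class 1.
  findInterval(x, cuts) + 1L
}

set.seed(42)
usage <- rexp(3504, rate = 1 / 30)  # synthetic consumption values
labels <- label_by_quantile_sketch(usage)
table(labels)
```

With continuous synthetic values the three classes come out almost exactly equal in size. The slightly uneven counts in the real table are what one would expect if several observations share the same Usage_kWh value at a cutoff.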
compare_methods() takes a named list of feature
extractors. Each is just a function that takes the raw data and returns
a numeric feature matrix.
feature_methods <- list(
  pca_only = function(d) {
    # Input is already scaled (data_scaled), so prcomp() does not re-center or re-scale.
    pca <- prcomp(d, center = FALSE, scale. = FALSE)
    pca$x[, 1:3]  # scores on the first three principal components
  },
  pca_circular = function(d) {
    pca <- prcomp(d, center = FALSE, scale. = FALSE)
    phase <- compute_phase(d, axis = "feature")
    circ <- extract_circular_features(phase)
    cbind(pca$x[, 1:3], circ)  # PCA scores plus the circular features
  }
)

We try DBSCAN with two different parameter settings: one with a larger neighborhood radius (loose) and one with a smaller one (tight). This is a parameter sweep disguised as a method comparison.
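The definition of cluster_methods is not shown in this section. A plausible shape, assuming clusterers are passed the same way as feature extractors (a named list of functions, each taking a feature matrix and returning integer labels), might look like the sketch below. The eps and minPts values are illustrative placeholders, not the settings behind the results table, and the function-list interface is an assumption about compare_methods(). The calls use dbscan() from the dbscan package, which marks noise points with label 0.

```r
# Assumed interface: function(features) -> integer cluster labels (0 = noise).
# eps / minPts values are hypothetical placeholders chosen for illustration.
cluster_methods <- list(
  dbscan_loose = function(features) {
    dbscan::dbscan(features, eps = 0.8, minPts = 5)$cluster
  },
  dbscan_tight = function(features) {
    dbscan::dbscan(features, eps = 0.4, minPts = 5)$cluster
  }
)

names(cluster_methods)
```

The list names match the cluster_method labels that appear in the comparison table.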
compare_methods() runs every combination, evaluates each
with the requested metrics, and returns a single comparison table.
comparison <- compare_methods(
  data = data_scaled,
  feature_methods = feature_methods,
  cluster_methods = cluster_methods,
  metrics = c("dbi", "accuracy", "n_clusters", "n_noise"),
  true_labels = true_labels,
  normalize = NULL,
  verbose = FALSE
)
print(comparison)
#>   feature_method cluster_method       dbi  accuracy n_clusters n_noise
#> 1       pca_only   dbscan_loose 0.9056772 0.5907534          4      15
#> 2       pca_only   dbscan_tight 0.6312147 0.7285959         13      56
#> 3   pca_circular   dbscan_loose 0.5957794 0.7619863         14     126
#> 4   pca_circular   dbscan_tight 0.8596812 0.7796804         38     254

Four rows, one per combination, four metrics each. Now to read it.
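As a reading aid for the dbi column: the Davies-Bouldin index compares within-cluster scatter to the distance between cluster centroids, so lower values mean tighter, better separated clusters. The base-R sketch below follows the textbook definition. It is not the package's implementation, and how compare_methods() treats DBSCAN noise points when computing the index is not stated here.

```r
# Textbook Davies-Bouldin index; hypothetical helper, not the package's code.
davies_bouldin_sketch <- function(x, labels) {
  x <- as.matrix(x)
  ids <- sort(unique(labels))
  groups <- lapply(ids, function(id) x[labels == id, , drop = FALSE])
  centroids <- do.call(rbind, lapply(groups, colMeans))
  # Scatter: mean Euclidean distance from each point to its cluster centroid.
  scatter <- vapply(seq_along(groups), function(i) {
    mean(sqrt(rowSums(sweep(groups[[i]], 2, centroids[i, ])^2)))
  }, numeric(1))
  # Separation: pairwise Euclidean distances between centroids.
  separation <- as.matrix(dist(centroids))
  ratio <- outer(scatter, scatter, "+") / separation
  diag(ratio) <- -Inf  # exclude each cluster's comparison with itself
  # For each cluster take its worst (largest) ratio, then average over clusters.
  mean(apply(ratio, 1, max))
}

# Two tight clusters far apart: scatter 1 + 1 over centroid distance 10.
x <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
labels <- c(1, 1, 2, 2)
davies_bouldin_sketch(x, labels)  # 0.2
```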