Comparing Feature Engineering Approaches

All combinations

This vignette runs every combination of feature extractor and clustering method through a single compare_methods() call, then unpacks what the results mean.

We use the bundled steel_industry dataset: one full year of 15-minute energy measurements from a Korean steel plant, including reactive power, power factor, CO2 emissions, and time-of-day indicators.

library(cyclicwave)
data(steel_industry)

Preparing the data

Three preprocessing steps:

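Thinning: keep every 10th row, cutting the sample count by a factor of ten (to 3,504 rows). Clustering runs faster, and there is less redundancy between consecutive 15-minute readings.
Select numeric columns: discard the date and categorical columns, which the distance computations cannot use directly.
Z-score normalization: put every feature on a common scale so that no single variable dominates the distances. Distance-based methods such as DBSCAN are sensitive to this.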

data_thin    <- thin_data(steel_industry, step = 10)
numeric_data <- select_numeric_columns(data_thin)
data_scaled  <- normalize_features(numeric_data, method = "zscore")

dim(data_scaled)
#> [1] 3504    7
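For readers who want to see what the three steps amount to, here is a minimal base-R sketch on a synthetic table. It is not the package's implementation: the toy data frame, the choice to start thinning at row 1, and the use of scale() are assumptions made for illustration, and thin_data(), select_numeric_columns(), and normalize_features() may differ in detail.

```r
# Synthetic stand-in for a 15-minute energy table (NOT the real steel_industry data).
set.seed(1)
n <- 200
toy <- data.frame(
  date      = seq(as.POSIXct("2018-01-01 00:15", tz = "UTC"),
                  by = "15 min", length.out = n),
  Usage_kWh = rexp(n, rate = 1 / 30),
  CO2       = runif(n, 0, 0.05),
  Load_Type = sample(c("Light", "Medium", "Maximum"), n, replace = TRUE)
)

# 1. Thinning: keep rows 1, 11, 21, ... (assumed meaning of step = 10).
toy_thin <- toy[seq(1, nrow(toy), by = 10), ]

# 2. Keep only numeric columns; the date-time and character columns are dropped.
is_num  <- vapply(toy_thin, is.numeric, logical(1))
toy_num <- toy_thin[, is_num, drop = FALSE]

# 3. Z-score: subtract each column's mean, divide by its standard deviation.
toy_scaled <- scale(toy_num)

dim(toy_scaled)                 # rows kept after thinning, numeric columns only
round(colMeans(toy_scaled), 10) # column means are zero up to rounding
apply(toy_scaled, 2, sd)        # column standard deviations are one
```

After the last step every column has mean 0 and standard deviation 1, which is the property the distance-based clustering relies on.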

Ground-truth labels

steel_industry doesn’t ship with explicit class labels, but Usage_kWh gives us natural ones: low, medium, and high consumption regimes, defined by tertile cutoffs. We use these as a yardstick for evaluating how meaningful each clustering result is.

true_labels <- label_by_quantile(data_thin$Usage_kWh,
                                 probs = c(1/3, 2/3))
table(true_labels)
#> true_labels
#>    1    2    3 
#> 1184 1152 1168

Each class has roughly N/3 observations.
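The tertile labelling can be sketched in base R as follows. This is an illustration on synthetic values, assuming the labels are simply the index of the interval each value falls into; label_by_quantile() may handle ties and boundaries differently.

```r
set.seed(42)
usage <- rexp(300, rate = 1 / 30)  # synthetic consumption values, NOT the real data

# Tertile cutoffs: the 1/3 and 2/3 empirical quantiles.
cuts <- quantile(usage, probs = c(1/3, 2/3))

# findInterval() counts how many cutoffs each value has reached (0, 1, or 2);
# adding 1 turns that into labels 1 (low), 2 (medium), 3 (high).
labels <- findInterval(usage, cuts) + 1L

table(labels)
```

With continuous values the three groups come out nearly equal in size. In real data, repeated values at a cutoff all land in the same group, which can make the counts slightly uneven.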

Defining the feature methods

compare_methods() takes a named list of feature extractors. Each is just a function that takes the data matrix passed to compare_methods() (here the z-scored data_scaled) and returns a numeric feature matrix with one row per observation.

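feature_methods <- list(
  # Baseline: scores on the first three principal components.
  pca_only = function(d) {
    # d is already z-scored, so prcomp() needs no further centering or scaling
    pca <- prcomp(d, center = FALSE, scale. = FALSE)
    pca$x[, 1:3]
  },
  # Same PCA scores, with circular features appended as extra columns.
  pca_circular = function(d) {
    pca <- prcomp(d, center = FALSE, scale. = FALSE)
    phase <- compute_phase(d, axis = "feature")
    circ <- extract_circular_features(phase)
    cbind(pca$x[, 1:3], circ)
  }
)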
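Both extractors call prcomp() with center = FALSE and scale. = FALSE. That is safe only because the input has already been z-scored. The following sketch, on synthetic data with deliberately non-zero column means, illustrates the point; the toy matrix and variable names are made up for this example.

```r
set.seed(7)
# Correlated toy data with very different column means (NOT the steel data).
raw <- matrix(rnorm(500 * 4), ncol = 4) %*% matrix(runif(16), 4, 4)
raw <- sweep(raw, 2, c(10, -5, 3, 100), "+")

z <- scale(raw)  # z-score, as in the preprocessing step

pca_pre   <- prcomp(z,   center = FALSE, scale. = FALSE) # vignette setting, scaled input
pca_full  <- prcomp(raw, center = TRUE,  scale. = TRUE)  # let prcomp() standardize
pca_wrong <- prcomp(raw, center = FALSE, scale. = FALSE) # neither centered nor scaled

# Pre-scaled input with center = FALSE matches prcomp()'s own standardization.
all.equal(pca_pre$sdev, pca_full$sdev)

# Without centering, the first component mostly captures the column means.
share_wrong <- pca_wrong$sdev[1]^2 / sum(pca_wrong$sdev^2)
share_wrong

# The feature matrix an extractor would return: one row per observation, 3 columns.
dim(pca_pre$x[, 1:3])
```

If an extractor of this form were ever reused on unscaled data, switching to center = TRUE and scale. = TRUE would be the safer default.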

Defining the clustering methods

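We try DBSCAN with two different parameter settings: one with a larger neighborhood radius (loose, eps = 0.5) and one with a smaller one (tight, eps = 0.3). Note that min_pts differs between the two as well, so the settings are not a pure comparison of radius. This is a parameter sweep disguised as a method comparison.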

cluster_methods <- list(
  dbscan_loose = list(fn = run_dbscan, params = list(eps = 0.5, min_pts = 8)),
  dbscan_tight = list(fn = run_dbscan, params = list(eps = 0.3, min_pts = 5))
)
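Two feature extractors crossed with two clustering settings give four combinations. The sketch below shows one way to enumerate that grid and to invoke a list(fn = ..., params = ...) entry. It is an assumption about the general pattern, not a description of compare_methods() internals, and fake_cluster is a hypothetical stand-in that only reports its arguments.

```r
feature_names <- c("pca_only", "pca_circular")
cluster_names <- c("dbscan_loose", "dbscan_tight")

# expand.grid() varies its first argument fastest, so listing the clustering
# methods first gives rows grouped by feature method.
grid <- expand.grid(
  cluster_method   = cluster_names,
  feature_method   = feature_names,
  stringsAsFactors = FALSE
)[, c("feature_method", "cluster_method")]

grid
nrow(grid)

# Hypothetical stand-in for a clustering function (NOT run_dbscan).
fake_cluster <- function(x, eps, min_pts) {
  sprintf("n=%d eps=%.1f min_pts=%d", nrow(x), eps, min_pts)
}

# A fn/params pair can be applied with do.call(): the feature matrix goes
# first, and the named parameters are appended.
spec <- list(fn = fake_cluster, params = list(eps = 0.5, min_pts = 8))
out  <- do.call(spec$fn, c(list(matrix(0, 10, 3)), spec$params))
out
```

Keeping the function and its parameters in one list entry means a new setting can be added by appending a single named element.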

One call to rule them all

compare_methods() runs every combination, evaluates each with the requested metrics, and returns a single comparison table.

comparison <- compare_methods(
  data            = data_scaled,
  feature_methods = feature_methods,
  cluster_methods = cluster_methods,
  metrics         = c("dbi", "accuracy", "n_clusters", "n_noise"),
  true_labels     = true_labels,
  normalize       = NULL,
  verbose         = FALSE
)

print(comparison)
#>   feature_method cluster_method       dbi  accuracy n_clusters n_noise
#> 1       pca_only   dbscan_loose 0.9056772 0.5907534          4      15
#> 2       pca_only   dbscan_tight 0.6312147 0.7285959         13      56
#> 3   pca_circular   dbscan_loose 0.5957794 0.7619863         14     126
#> 4   pca_circular   dbscan_tight 0.8596812 0.7796804         38     254

Four rows, one per combination, four metrics each. Now to read it.