This vignette collects the advanced cat2cat workflows in
one place: ML weights, multi-period chaining, panel data with
identifiers, aggregated data, and regression on replicated data. It
assumes you’ve read Get Started.
Use this vignette by module. All modules share the setup below:
library(cat2cat)
library(dplyr)
library(tidyr)
library(fixest)
data(occup, package = "cat2cat")
data(occup_panel, package = "cat2cat")
data(trans, package = "cat2cat")
data(verticals, package = "cat2cat")
data(verticals2, package = "cat2cat")
occup_2006 <- occup[occup$year == 2006, ]
occup_2008 <- occup[occup$year == 2008, ]
occup_2010 <- occup[occup$year == 2010, ]
occup_2012 <- occup[occup$year == 2012, ]

Machine-learning weights are useful when the mapping table alone is too coarse and observed features help distinguish which target category is most plausible for a replicated observation.
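This section shows how to configure the ml argument in cat2cat(), how to validate it with cat2cat_ml_run(), and how to handle failed predictions with on_fail and fail_warn.

In practice, ML is most helpful when replication is substantial and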
the ambiguous categories differ systematically by observed
characteristics such as age, education, experience, or salary. If ML
does not improve on the frequency baseline in
cat2cat_ml_run(), the simpler wei_freq_c2c
weights are usually the better choice.
ML features must be numeric, logical, or factor columns. Factor
columns are one-hot encoded automatically; character columns are not, so
convert character categories to factors before listing them in
features.
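A minimal base R sketch of the conversion step described above, run on a toy data frame. The column name region is hypothetical and stands in for any character-valued feature; the type check mirrors the rule stated in the text rather than the package's internal validation.

```r
# Toy data; `region` is a hypothetical character feature
toy <- data.frame(
  age = c(25, 40, 31),
  parttime = c(TRUE, FALSE, FALSE),
  region = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor
chr_cols <- vapply(toy, is.character, logical(1))
toy[chr_cols] <- lapply(toy[chr_cols], factor)

# Check that all candidate features now have a supported type
supported <- vapply(
  toy,
  function(x) is.numeric(x) || is.logical(x) || is.factor(x),
  logical(1)
)
stopifnot(all(supported))
```

Running the check before building the ml list makes an unsupported column fail early, before any model is fitted.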
ml_setup <- list(
data = bind_rows(occup_2010, occup_2012),
cat_var = "code",
method = c("knn", "rf", "lda"),
features = c("age", "sex", "edu", "exp", "parttime", "salary"),
args = list(k = 10, ntree = 50),
on_fail = "freq",
fail_warn = TRUE
)
result_ml <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward"),
ml = ml_setup
)

Validate whether ML adds value before using it in production:
cv <- cat2cat_ml_run(
mappings = list(trans = trans, direction = "backward"),
ml = ml_setup
)
print(cv)

Baseline-only diagnostics are also available:
ml_baseline <- list(
data = bind_rows(occup_2010, occup_2012),
cat_var = "code",
method = character(0),
features = character(0)
)
cv_baseline <- cat2cat_ml_run(
mappings = list(trans = trans, direction = "backward"),
ml = ml_baseline
)
print(cv_baseline)

Use the same direction in diagnostics as in the mapping
workflow you want to evaluate, because the mapping groups and
base-period frequencies differ by direction.
If ML probabilities cannot be produced for some replicated rows, use
on_fail and fail_warn:

- on_fail = "freq" (default): replace failed ML rows with wei_freq_c2c
- on_fail = "naive": replace failed ML rows with wei_naive_c2c
- on_fail = "na": keep failed ML rows as NA
- on_fail = "error": stop immediately
- fail_warn = TRUE (default): warn with affected rows/observations per method
- fail_warn = FALSE: silence the warning

Repeated cross-sections or longitudinal datasets often contain more
than two waves. To bring all waves onto the same encoding, apply
cat2cat() iteratively, feeding the output of one step into
the next.
Use this module when you need one harmonised categorical variable across 3 or more periods. The main design choice is whether to chain backward or forward, and whether truncating hierarchical codes can reduce replication before chaining.
In the examples below, occup is a repeated cross-section
dataset, not a panel. The chained outputs therefore combine harmonised
cross-sections from multiple years.
mappings$freqs_df

In most cases you do not need to pass
mappings$freqs_df. If it is omitted, cat2cat()
computes base-period frequencies internally from the provided data.
Pass mappings$freqs_df only when you need explicit
control over base frequencies.
The object should be a two-column data frame: category name in the first column and counts in the second.
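As a sketch of the expected shape, the following builds such a two-column data frame from a toy vector of base-period codes. The code values and the column names code and counts are illustrative assumptions; only the column order (category first, counts second) comes from the description above.

```r
# Toy base-period codes (hypothetical values)
base_codes <- c("1111", "1111", "2222", "3333", "3333", "3333")

# Two columns: category name first, counts second
freqs_df <- as.data.frame(table(base_codes), stringsAsFactors = FALSE)
names(freqs_df) <- c("code", "counts")  # column names are illustrative
freqs_df
```

The resulting object would then be supplied as the freqs_df element of the mappings list, alongside trans and direction.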
Before chaining, inspect how disruptive the transition table is:
max_digits <- max(nchar(as.character(trans[[1]])), nchar(as.character(trans[[2]])))
stability <- sapply(1:max_digits, function(d) {
old_trunc <- substr(as.character(trans[[1]]), 1, d)
new_trunc <- substr(as.character(trans[[2]]), 1, d)
mean(old_trunc != new_trunc) * 100
})
data.frame(
digits = 1:max_digits,
pct_changed = round(stability, 1)
)
#> digits pct_changed
#> 1 1 6.5
#> 2 2 40.4
#> 3 3 77.5
#> 4 4 85.7
#> 5 5 100.0
#> 6 6 100.0

Truncating codes to fewer digits can reduce replication. This is often true for backward mapping, but for forward mapping truncation can move in the opposite direction: collapsing detailed codes into broader prefixes can create additional many-to-one and one-to-many ties on the old-code side, which may increase replication in the mapped period.
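The asymmetry can be seen on a toy transition table before running it on real data. The sketch below uses hypothetical codes and counts distinct candidate codes per source code in the mapping table only; it ignores how often each code occurs in the data, so it is a rough proxy for rep_c2c, not a reproduction of it.

```r
# Toy transition table (hypothetical codes): one old code splits into three
toy_trans <- data.frame(
  old = c("1111", "1111", "1111", "1112"),
  new = c("222201", "222202", "222203", "222204"),
  stringsAsFactors = FALSE
)

# Mean number of distinct target codes per source code
mean_candidates <- function(source, target) {
  mean(tapply(target, source, function(x) length(unique(x))))
}

# Same table with both sides truncated to 4 digits
toy_trunc <- unique(data.frame(
  old = substr(toy_trans$old, 1, 4),
  new = substr(toy_trans$new, 1, 4),
  stringsAsFactors = FALSE
))

# Backward: old codes are the source; forward: new codes are the source
candidates <- c(
  backward_full  = mean_candidates(toy_trans$old, toy_trans$new),
  backward_trunc = mean_candidates(toy_trunc$old, toy_trunc$new),
  forward_full   = mean_candidates(toy_trans$new, toy_trans$old),
  forward_trunc  = mean_candidates(toy_trunc$new, toy_trunc$old)
)
candidates
```

In this toy table truncation halves the backward candidate count (2 to 1) and doubles the forward one (1 to 2), because the four 6-digit codes collapse into a single prefix that now points at two old codes.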
occup_2008_trunc <- occup_2008
occup_2008_trunc$code <- substr(occup_2008_trunc$code, 1, 4)
occup_2010_trunc <- occup_2010
occup_2010_trunc$code <- substr(occup_2010_trunc$code, 1, 4)
trans_trunc <- unique(data.frame(
old = substr(trans$old, 1, 4),
new = substr(trans$new, 1, 4)
))
back_full <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward")
)
back_trunc <- cat2cat(
data = list(old = occup_2008_trunc, new = occup_2010_trunc,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans_trunc, direction = "backward")
)
fwd_full <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "forward")
)
fwd_trunc <- cat2cat(
data = list(old = occup_2008_trunc, new = occup_2010_trunc,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans_trunc, direction = "forward")
)
data.frame(
mapping = c("backward full (4->6)", "backward trunc (4->4)",
"forward full (6->4)", "forward trunc (4->4)"),
mean_rep = c(mean(back_full$old$rep_c2c), mean(back_trunc$old$rep_c2c),
mean(fwd_full$new$rep_c2c), mean(fwd_trunc$new$rep_c2c))
)
#> mapping mean_rep
#> 1 backward full (4->6) 23.479114
#> 2 backward trunc (4->4) 4.697948
#> 3 forward full (6->4) 1.363676
#> 4 forward trunc (4->4) 3.268608

step1 <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward")
)
step2 <- cat2cat(
data = list(old = occup_2006, new = step1$old,
cat_var_old = "code", cat_var_new = "g_new_c2c",
time_var = "year"),
mappings = list(trans = trans, direction = "backward")
)
harmonised_back <- bind_rows(
step2$old,
step1$old,
step1$new,
dummy_c2c(occup_2012, "code")
)

Validation: weighted counts should match the original counts within each year.
harmonised_back %>%
group_by(year) %>%
summarise(weighted_n = round(sum(wei_freq_c2c)), .groups = "drop") %>%
left_join(count(occup, year), by = "year")
#> # A tibble: 4 × 3
#> year weighted_n n
#> <int> <dbl> <int>
#> 1 2006 16540 16540
#> 2 2008 17223 17223
#> 3 2010 17323 17323
#> 4 2012 18040 18040

trans_fwd <- rbind(
trans,
data.frame(old = "no_cat",
new = setdiff(c(occup_2010$code, occup_2012$code), trans$new))
)
fwd1 <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans_fwd, direction = "forward")
)
fwd2 <- cat2cat(
data = list(old = fwd1$new, new = occup_2012,
cat_var_old = "g_new_c2c", cat_var_new = "code",
time_var = "year"),
mappings = list(trans = trans_fwd, direction = "forward")
)
harmonised_fwd <- bind_rows(
dummy_c2c(occup_2006, "code"),
fwd1$old,
fwd1$new,
fwd2$new
)

step1_ml <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward"),
ml = ml_setup
)
step2_ml <- cat2cat(
data = list(old = occup_2006, new = step1_ml$old,
cat_var_old = "code", cat_var_new = "g_new_c2c",
time_var = "year"),
mappings = list(trans = trans, direction = "backward"),
ml = ml_setup
)

If subjects have stable identifiers across waves, id_var
can reduce unnecessary replication by directly matching returning
subjects.
Use this module only when the identifier truly tracks the same subject across adjacent waves and short-run category changes are unlikely to represent genuine transitions rather than coding changes.
If you have a complete panel with every subject observed in both
periods and no missing category values, you may not need probabilistic
harmonisation at all. In that case, the target-period category can often
be joined back to the earlier record by id_var, and the
task is mostly a deterministic join. cat2cat() is more
useful when the panel is incomplete, rotational, or mixed with new
entrants and leavers, so some observations still need the mapping-table
replication path.
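The deterministic part of that reasoning can be sketched in base R. The toy panels and their identifiers below are hypothetical; the point is only to separate subjects whose target-period code can be joined directly from those who still need the mapping table.

```r
# Toy panels (hypothetical ids and codes); subject 1 leaves, subject 4 enters
toy_old <- data.frame(
  panel_id = c(1, 2, 3),
  code = c("1111", "1112", "1113"),
  stringsAsFactors = FALSE
)
toy_new <- data.frame(
  panel_id = c(2, 3, 4),
  code = c("222201", "222202", "222203"),
  stringsAsFactors = FALSE
)

# Returning subjects: the new-period code is joined back by identifier
matched <- merge(toy_old, toy_new, by = "panel_id",
                 suffixes = c("_old", "_new"))

# Subjects seen only in the old period still need the mapping table
needs_mapping <- toy_old[!toy_old$panel_id %in% toy_new$panel_id, ]
```

With a complete panel, needs_mapping would be empty and the join alone would harmonise the old period.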
panel_old <- occup_panel[occup_panel$quarter == "2009Q4", ]
panel_new <- occup_panel[occup_panel$quarter == "2010Q1", ]
shared_ids <- intersect(panel_old$panel_id, panel_new$panel_id)
length(shared_ids)
#> [1] 450

result_id <- cat2cat(
data = list(
old = panel_old,
new = panel_new,
id_var = "panel_id",
cat_var = "code",
time_var = "quarter"
),
mappings = list(trans = trans, direction = "backward")
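)

How id_var works: subjects matched across both periods by id_var take their observed target-period category directly, with rep_c2c = 1 and weight 1; all other subjects go through the usual mapping-table replication.

table(result_id$old$rep_c2c)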
#>
#> 1 2 3 4 5 6 7 8 9 10 11 13 16 18 19 21 22 23 24 25
#> 457 6 12 56 75 54 84 88 81 30 165 26 64 180 190 42 44 46 48 25
#> 33 34 46 70
#> 33 170 276 70
sum(result_id$old$wei_freq_c2c)
#> [1] 600
nrow(panel_old)
#> [1] 600

Compare with and without identifiers:
result_no_id <- cat2cat(
data = list(
old = panel_old,
new = panel_new,
cat_var = "code",
time_var = "quarter"
),
mappings = list(trans = trans, direction = "backward")
)
cat("WITH id_var average replication:", round(mean(result_id$old$rep_c2c), 2), "\n")
#> WITH id_var average replication: 18.49
cat("WITHOUT id_var average replication:", round(mean(result_no_id$old$rep_c2c), 2), "\n")
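#> WITHOUT id_var average replication: 23.49

Use id_var when identifiers are stable across adjacent waves and the panel is incomplete, rotational, or mixed with new entrants and leavers.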
When only aggregate counts are available, use
cat2cat_agg() with mapping equations rather than micro-data
replication.
Use this module when you do not have person-level data, or when the classification itself has a hierarchical code structure that can be exploited to build coarser mappings.
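The arithmetic behind a one-to-many mapping equation can be sketched as a proportional split. This assumes the aggregate is divided in proportion to the counts observed for the finer categories in the other period; that reading is an assumption made for illustration, not a statement of the exact rule used by cat2cat_agg(), and all numbers are hypothetical.

```r
# Hypothetical counts in the period that already uses the finer categories
fine_counts <- c(Automotive1 = 120, Automotive2 = 80)

# Aggregate count observed under the single coarse category
coarse_total <- 135

# Share of each fine category, then the proportional redistribution
prop <- fine_counts / sum(fine_counts)
redistributed <- coarse_total * prop
redistributed
```

The shares sum to one, so the redistributed counts add back up to the original aggregate.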
agg_old <- verticals[verticals$v_date == "2020-04-01", ]
agg_new <- verticals[verticals$v_date == "2020-05-01", ]

agg <- cat2cat_agg(
data = list(
old = agg_old,
new = agg_new,
cat_var = "vertical",
time_var = "v_date",
freq_var = "counts"
),
Automotive %<% c(Automotive1, Automotive2),
c(Kids1, Kids2) %>% c(Kids),
Home %>% c(Home, Supermarket)
)

Inspect how categories were proportionally redistributed:
agg$old[agg$old$vertical %in% c("Automotive1", "Automotive2"), ]
#> vertical sales counts v_date prop_c2c
#> 4 Automotive1 76.54302 135 2020-04-01 0.6452772
#> 4.1 Automotive2 76.54302 135 2020-04-01 0.3547228
agg$new[agg$new$vertical %in% c("Kids1", "Kids2"), ]
#> vertical sales counts v_date prop_c2c
#> 13 Kids1 105.4317 874 2020-05-01 0.3534726
#> 13.1 Kids2 105.4317 874 2020-05-01 0.6465274

Hierarchical codes can also be used to build coarser mapping tables when an official transition table is unavailable:
trans_2digit <- data.frame(
old = substr(trans$old, 1, 2),
new = substr(trans$new, 1, 2)
)
trans_2digit <- unique(trans_2digit)
cat("2-digit mapping rows:", nrow(trans_2digit),
"vs full mapping rows:", nrow(trans))
#> 2-digit mapping rows: 122 vs full mapping rows: 2666

This works for classifications with stable prefix hierarchies such as ISCO, ICD, NACE, CPC, or HS codes.
The replication is neutral for regressions on non-mapped covariates because per-subject weights sum to one. Standard errors, however, must be corrected because replication inflates the row count.
Use this module when your end goal is estimation rather than descriptive harmonisation. The key issue is not coefficient bias for non-mapped regressors, but valid inference after replication.
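# Assumption: cat2cat_data_back is the backward-chained dataset, built earlier as harmonised_back
cat2cat_data_back <- harmonised_back
lms_orig <- lm(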
I(log(salary)) ~ age + sex + factor(edu) + parttime + exp,
data = occup,
weights = multiplier
)
lms_harmonised <- lm(
I(log(salary)) ~ age + sex + factor(edu) + parttime + exp,
data = cat2cat_data_back,
weights = multiplier * wei_freq_c2c
)
summary_c2c(lms_harmonised, df_old = nrow(occup))
#> Estimate Std. Error t value Pr(>|t|) correct
#> (Intercept) 8.567134022 0.0055759904 1536.43270 0.000000e+00 2.246708
#> age -0.001669601 0.0001429794 -11.67722 1.689342e-31 2.246708
#> sexTRUE 0.254854849 0.0015855706 160.73384 0.000000e+00 2.246708
#> factor(edu)2 -0.123208217 0.0031187751 -39.50532 0.000000e+00 2.246708
#> factor(edu)3 -0.390643025 0.0037961258 -102.90571 0.000000e+00 2.246708
#> factor(edu)4 -0.465471793 0.0022196689 -209.70325 0.000000e+00 2.246708
#> factor(edu)5 -0.443598202 0.0031191398 -142.21812 0.000000e+00 2.246708
#> factor(edu)6 -0.678797186 0.0022404213 -302.97747 0.000000e+00 2.246708
#> factor(edu)7 -0.617843013 0.0192393288 -32.11354 6.112599e-226 2.246708
#> factor(edu)8 -0.717563371 0.0035222572 -203.72259 0.000000e+00 2.246708
#> parttime 1.999007607 0.0037223872 537.02301 0.000000e+00 2.246708
#> exp 0.011337142 0.0001368157 82.86435 0.000000e+00 2.246708
#> std.error_c statistic_c p.value_c reference_dist
#> (Intercept) 0.0125276207 683.859629 0.000000e+00 t
#> age 0.0003212329 -5.197479 2.025825e-07 t
#> sexTRUE 0.0035623137 71.541945 0.000000e+00 t
#> factor(edu)2 0.0070069760 -17.583650 4.650384e-69 t
#> factor(edu)3 0.0085287850 -45.802893 0.000000e+00 t
#> factor(edu)4 0.0049869472 -93.338022 0.000000e+00 t
#> factor(edu)5 0.0070077954 -63.300678 0.000000e+00 t
#> factor(edu)6 0.0050335718 -134.853979 0.000000e+00 t
#> factor(edu)7 0.0432251482 -14.293601 2.792952e-46 t
#> factor(edu)8 0.0079134824 -90.676056 0.000000e+00 t
#> parttime 0.0083631160 239.026649 0.000000e+00 t
#> exp 0.0003073848 36.882568 6.567139e-295 t

summary_c2c() scales naive standard errors by the
replication factor:
\[\text{SE}_{\text{corrected}} = \text{SE}_{\text{naive}} \times \sqrt{\frac{n_{\text{rep}}}{n_{\text{orig}}}}\]
Report coefficient estimates with corrected standard errors and
p-values from summary_c2c(). Ordinary \(R^2\) is preserved in this neutral setup
because the response and covariates do not vary across replicated copies
and the weights for each source observation sum to the original weight.
Adjusted \(R^2\), AIC, and BIC depend
on sample-size and degrees-of-freedom conventions, so do not report
their replicated-model values unless you recompute them on the intended
original-observation scale. If the harmonised category itself enters the
model, fit statistics are conditional on the chosen harmonisation
weights.
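Both claims, unchanged coefficients and inflated precision, can be checked on simulated data with base R alone. The sketch below is not the summary_c2c() implementation: it replicates each toy observation a random number of times with equal weights summing to one, then applies the scaling formula given above by hand. All variable names are hypothetical.

```r
set.seed(1)

# Original toy data: n observations, one covariate
n <- 200
orig <- data.frame(x = rnorm(n))
orig$y <- 1 + 2 * orig$x + rnorm(n)

# Replicate each row 1 to 4 times; weights within a source row sum to one
reps <- sample(1:4, n, replace = TRUE)
rep_df <- orig[rep(seq_len(n), times = reps), ]
rep_df$w <- 1 / rep(reps, times = reps)

fit_orig <- lm(y ~ x, data = orig)
fit_rep <- lm(y ~ x, data = rep_df, weights = w)

# Coefficients are identical; naive standard errors are too small
se_orig <- sqrt(diag(vcov(fit_orig)))
se_naive <- sqrt(diag(vcov(fit_rep)))

# Correction from the formula above: sqrt(n_rep / n_orig)
se_corrected <- se_naive * sqrt(nrow(rep_df) / nrow(orig))

rbind(se_orig, se_naive, se_corrected)
```

The corrected values land close to, but not exactly on, the original standard errors, because the residual degrees of freedom also differ slightly between the two fits.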
harmonised_fe <- cat2cat_data_back %>%
prune_c2c(method = "nonzero") %>%
mutate(orig_obs_id = interaction(year, index_c2c, drop = TRUE, lex.order = TRUE)) %>%
filter(!is.na(g_new_c2c), !is.na(salary), salary > 0)
fe_model_cluster <- feols(
log(salary) ~ age + sex + factor(edu) + parttime + exp | g_new_c2c + year,
data = harmonised_fe,
weights = ~multiplier * wei_freq_c2c,
cluster = ~orig_obs_id
)
summary(fe_model_cluster)
#> OLS estimation, Dep. Var.: log(salary)
#> Observations: 348,744
#> Weights: multiplier * wei_freq_c2c
#> Fixed-effects: g_new_c2c: 1,561, year: 4
#> Standard-errors: Clustered (orig_obs_id)
#> Estimate Std. Error t value Pr(>|t|)
#> age -0.000589 0.000353 -1.67055 0.094814 .
#> sexTRUE 0.127013 0.005018 25.30940 < 2.2e-16 ***
#> factor(edu)2 -0.122144 0.009111 -13.40615 < 2.2e-16 ***
#> factor(edu)3 -0.220057 0.009472 -23.23191 < 2.2e-16 ***
#> factor(edu)4 -0.254008 0.007823 -32.46831 < 2.2e-16 ***
#> factor(edu)5 -0.236453 0.009999 -23.64660 < 2.2e-16 ***
#> factor(edu)6 -0.329728 0.008764 -37.62222 < 2.2e-16 ***
#> factor(edu)7 -0.299024 0.027005 -11.07289 < 2.2e-16 ***
#> factor(edu)8 -0.335418 0.010119 -33.14595 < 2.2e-16 ***
#> parttime 1.871975 0.011668 160.44112 < 2.2e-16 ***
#> exp 0.008152 0.000332 24.51817 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.367611 Adj. R2: 0.713868
#> Within R2: 0.555095

Do not cluster directly on index_c2c after binding
multiple waves, because index_c2c is created separately
inside each cat2cat() call and can repeat across years.
Instead, build a wave-specific original-observation identifier such as
interaction(year, index_c2c) and cluster on that. This
treats all replications of the same source row as one cluster without
incorrectly merging different people from different waves.
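A toy illustration of the collision, using two hypothetical waves in which index_c2c restarts at 1. Each wave holds two source rows, so the stacked data should contain four clusters.

```r
# Two toy waves; index_c2c restarts at 1 in each (hypothetical values)
wave_2008 <- data.frame(year = 2008, index_c2c = c(1, 1, 2))
wave_2010 <- data.frame(year = 2010, index_c2c = c(1, 2, 2))
stacked <- rbind(wave_2008, wave_2010)

# Clustering on index_c2c alone merges rows from different waves
n_wrong <- length(unique(stacked$index_c2c))

# A wave-specific identifier keeps the source rows separate
stacked$orig_obs_id <- interaction(
  stacked$year, stacked$index_c2c,
  drop = TRUE, lex.order = TRUE
)
n_right <- nlevels(stacked$orig_obs_id)

c(index_only = n_wrong, year_by_index = n_right)
```

Clustering on index_c2c alone would see two clusters here, while the combined identifier recovers all four.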
| Problem | Recommended tool |
|---|---|
| Need feature-informed weights | cat2cat(..., ml = ...) + cat2cat_ml_run() |
| Need 3+ wave harmonisation | iterative cat2cat() chaining |
| Stable subject identifiers across waves | id_var |
| Only aggregated counts available | cat2cat_agg() |
| Regression on replicated data | summary_c2c() or clustered inference |