Handling Semantic Ambiguity with prelabelled Vectors

Introduction

The dataset package is part of a family of R packages that extend tidy data workflows with richer semantic and provenance-aware capabilities. The work began from practical experience building tidyverse-based data pipelines and repeatedly encountering the same limitation: tidy datasets are efficient and semantically clear within a given workflow, but much of their meaning remains implicit and depends on the contextual knowledge of their creator. Once a dataset is exported, serialized, or transferred across environments, this contextual information is often lost.

library(dataset)

The dataset package introduces semantically enriched vectors and data frames that preserve explicit metadata throughout the workflow lifecycle. However, fully formal semantic annotation is verbose and cognitively demanding. Constructing semantically complete RDF-compatible objects is appropriate only for mature stages of a workflow.

In practice, semantic stabilisation is usually incremental. Observational data often arrive with partially inconsistent, incomplete, or ambiguous labels. Before a variable can mature into a formally defined vector created with labelled::labelled() or dataset::defined(), analysts typically perform several rounds of semantic harmonisation.

The prelabelled class supports this intermediate stage.

Unlike formally defined semantic vectors, prelabelled vectors tolerate inconsistent, incomplete, or ambiguous labels, including values that have no explicit mapping yet.

This vignette demonstrates how provisional semantic assertions can be incrementally stabilised while preserving the original observational evidence.

A small ambiguous dataset

We begin with a small dataset containing country observations. The dataset is intentionally inconsistent: some observations use full country names, while others already use ISO 3166 alpha-2 country codes.

Such ambiguity is extremely common in operational analytical workflows, particularly when datasets are merged from multiple sources or manually curated over time.

country_data_1 <- data.frame(
  country = c("Andorra", "LI", "San Marino", "AD", "Liechtenstein"),
  time = c(2020, 2020, 2020, 2021, 2021),
  value = c(1.2, 2.4, 3.1, 1.3, 2.5)
)
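Before any mapping is defined, it can help to see how the two representations are mixed. The following is a minimal base R sketch and not part of the dataset package. The helper name looks_like_alpha2 is hypothetical, and the check is purely syntactic (two upper-case letters), so it cannot confirm that a value is a valid ISO 3166 code.

```r
# Hypothetical helper: TRUE when a value has the shape of an alpha-2 code.
# This tests the form of the string only, not membership in ISO 3166.
looks_like_alpha2 <- function(x) grepl("^[A-Z]{2}$", x)

country <- c("Andorra", "LI", "San Marino", "AD", "Liechtenstein")

# Group the observed values by their apparent representation.
shape <- ifelse(looks_like_alpha2(country), "code-like", "name-like")
by_shape <- split(country, shape)
by_shape
```

A split like this only describes the problem; deciding which name corresponds to which code is still the analyst's assertion.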

Creating provisional semantic assertions

We now create a lightweight semantic mapping.

The goal is not yet to create a formally closed semantic vocabulary. Instead, we begin stabilising the semantics incrementally by mapping some observational values to candidate semantic assertions.

Values that are not explicitly mapped remain self-describing.

country_map <- c(
  "Andorra" = "AD",
  "Liechtenstein" = "LI",
  "San Marino" = "SM"
)

country_data_1$country <-
  prelabel(
    country_data_1$country,
    labels = country_map
  )

Inspecting the prelabelled vector

The resulting vector preserves the original observational values while attaching a provisional semantic vocabulary in the "prelabel" attribute.

print(country_data_1$country)
#> [1] "Andorra"       "LI"            "San Marino"    "AD"           
#> [5] "Liechtenstein"
#> attr(,"prelabel")
#>       Andorra Liechtenstein    San Marino            LI            AD 
#>          "AD"          "LI"          "SM"          "LI"          "AD" 
#> attr(,"class")
#> [1] "prelabelled" "character"

This separation between the observational values and the provisional semantic assertions attached to them is a central design principle of the prelabelled class.

The observational values remain unchanged, while semantic operationalisation may evolve iteratively over time.

Semantic operationalisation

Using as.character() operationalises the semantic assertions into a semantically stabilised character vector.

country_data_2 <- data.frame(
  country = as.character(country_data_1$country),
  time = country_data_1$time,
  value = country_data_1$value
)

country_data_2
#>   country time value
#> 1      AD 2020   1.2
#> 2      LI 2020   2.4
#> 3      SM 2020   3.1
#> 4      AD 2021   1.3
#> 5      LI 2021   2.5

Mapped observations are converted into their candidate semantic assertions, while unmatched values remain self-describing.
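The conversion rule described above can be approximated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: resolve_sketch is a hypothetical helper, and it assumes that a value is replaced when it matches a name in the map and is kept as it is otherwise.

```r
# Hypothetical helper, not the method used by the dataset package:
# replace values that have a mapping and leave all other values unchanged.
resolve_sketch <- function(x, map) {
  idx <- match(x, names(map))  # position of each value among the map names
  ifelse(is.na(idx), x, unname(map[idx]))
}

country_map <- c("Andorra" = "AD", "Liechtenstein" = "LI", "San Marino" = "SM")
observed <- c("Andorra", "LI", "San Marino", "AD", "Liechtenstein")

resolved <- resolve_sketch(observed, country_map)
resolved
#> [1] "AD" "LI" "SM" "AD" "LI"
```

Because match() returns the first match, a label that appears more than once in the map resolves to its first candidate in this sketch.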

This allows analysts to gradually reduce semantic ambiguity without destroying the original observational evidence.

A more ambiguous dataset

The next dataset contains a more difficult form of semantic ambiguity.

Some observations use ISO 3166 alpha-2 country codes, while others use ISO 3166 alpha-3 codes or full country names. Although the observations are semantically related, they do not yet form a stable closed vocabulary.

country_data_3 <- data.frame(
  country = c(
    "AD", "AND", "LI", "LIE", "SMR", "San Marino"
  ),
  time = c(2020, 2020, 2020, 2021, 2021, 2021),
  value = c(1, 2, 3, 4, 5, 6)
)

Incremental semantic stabilisation

The prelabelled workflow does not require complete semantic resolution from the outset.

Instead, semantic stabilisation can proceed incrementally:

country_map_3 <- c(
  "Andorra" = "AD",
  "Andorra" = "AND",
  "Liechtenstein" = "LI",
  "San Marino" = "SM",
  "San Marino" = "SMR"
)

prelabelled_country <- prelabel(
  country_data_3$country,
  labels = country_map_3
)

This approach is particularly useful in exploratory analytical workflows, archival reconstruction, metadata harmonisation, and cross-dataset integration tasks.

prelabelled_country
#> [1] "AD"         "AND"        "LI"         "LIE"        "SMR"       
#> [6] "San Marino"
#> attr(,"prelabel")
#>       Andorra       Andorra Liechtenstein    San Marino    San Marino 
#>          "AD"         "AND"          "LI"          "SM"         "SMR" 
#>            AD           AND            LI           LIE           SMR 
#>          "AD"         "AND"          "LI"         "LIE"         "SMR" 
#> attr(,"class")
#> [1] "prelabelled" "character"
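The printed vocabulary contains two kinds of open questions: labels that point to more than one candidate code, and observed values that no mapping covers. A short base R sketch can list both. It works on plain character vectors rather than on the prelabelled object, and the variable names are illustrative only.

```r
country_map_3 <- c(
  "Andorra" = "AD", "Andorra" = "AND", "Liechtenstein" = "LI",
  "San Marino" = "SM", "San Marino" = "SMR"
)
observed <- c("AD", "AND", "LI", "LIE", "SMR", "San Marino")

# Labels that point to more than one candidate code.
map_names <- names(country_map_3)
ambiguous_labels <- unique(map_names[duplicated(map_names)])

# Observed values that appear neither as a label nor as a candidate code.
unresolved <- setdiff(observed, c(map_names, country_map_3))

ambiguous_labels
#> [1] "Andorra"    "San Marino"
unresolved
#> [1] "LIE"
```

Both lists are natural targets for the next round of harmonisation: choose between the competing codes, and decide what "LIE" should map to.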

Semantic workspaces

The as.character() method provides lightweight semantic coercion, which may be more useful after semantic stabilisation.

as.character(prelabelled_country)
#> [1] "AD"  "AND" "LI"  "LIE" "SMR" "SM"

The as_character() method creates a provenance-preserving semantic workspace.

as_character(prelabelled_country)
#> [1] "AD"  "AND" "LI"  "LIE" "SMR" "SM" 
#> attr(,"prelabel")
#>       Andorra       Andorra Liechtenstein    San Marino    San Marino 
#>          "AD"         "AND"          "LI"          "SM"         "SMR" 
#>            AD           AND            LI           LIE           SMR 
#>          "AD"         "AND"          "LI"         "LIE"         "SMR" 
#> attr(,"original_values")
#> [1] "AD"         "AND"        "LI"         "LIE"        "SMR"       
#> [6] "San Marino"
#> attr(,"original_values")attr(,"prelabel")
#>       Andorra       Andorra Liechtenstein    San Marino    San Marino 
#>          "AD"         "AND"          "LI"          "SM"         "SMR" 
#>            AD           AND            LI           LIE           SMR 
#>          "AD"         "AND"          "LI"         "LIE"         "SMR"

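The resulting vector retains the operationalised values, the provisional vocabulary in the "prelabel" attribute, and the original observations in the "original_values" attribute.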

This allows analysts to continue semantic refinement workflows while preserving reversibility and provenance awareness.
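Reversibility can be illustrated with the "original_values" attribute shown in the output above. The sketch below builds a stand-in object by hand with structure(), so it runs without the package; it is an assumption that the attribute can be read with attr() in the same way on a real as_character() result.

```r
# Stand-in for the result of as_character(), built by hand for illustration.
workspace <- structure(
  c("AD", "AND", "LI", "LIE", "SMR", "SM"),
  original_values = c("AD", "AND", "LI", "LIE", "SMR", "San Marino")
)

original <- attr(workspace, "original_values")
current <- as.vector(workspace)  # drop attributes before comparing

# Positions where operationalisation changed the observed value.
changed <- which(current != original)

audit <- data.frame(
  position = changed,
  original = original[changed],
  operationalised = current[changed]
)
audit
```

A table like this shows which observations were rewritten and what they were rewritten from, which is the information needed to undo or revise a mapping.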

From provisional semantics to formally defined semantics

The goal of prelabelled vectors is not to replace formally defined semantic vectors.

Instead, they provide a lightweight preparatory stage for incremental semantic stabilisation.

Once semantic ambiguity has been sufficiently reduced, prelabelled vectors may mature into formally defined semantic vectors created with labelled::labelled() or dataset::defined(). For further information, see the vignette "Working with semantic vectors: Semantic vectors with defined()", available with vignette("defined", package = "dataset").
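The text does not define when ambiguity is "sufficiently reduced". One possible reading, offered here only as an assumption, is that the vocabulary is closed: every value is a known code, each label points to one code, and each code has one label. The helper is_closed_vocabulary below is hypothetical and written in base R.

```r
# Hypothetical readiness check: one possible reading of "sufficiently reduced".
is_closed_vocabulary <- function(values, map) {
  all(values %in% map) &&          # every value is already a known code
    !anyDuplicated(names(map)) &&  # each label points to exactly one code
    !anyDuplicated(map)            # each code carries exactly one label
}

stable_map <- c("Andorra" = "AD", "Liechtenstein" = "LI", "San Marino" = "SM")

ready <- is_closed_vocabulary(c("AD", "LI", "SM", "AD", "LI"), stable_map)
not_ready <- is_closed_vocabulary(
  c("AD", "AND", "LI", "LIE", "SMR", "SM"),
  stable_map
)

ready
#> [1] TRUE
not_ready
#> [1] FALSE
```

A check of this kind could gate the step from a prelabelled vector to a formally defined one, although the appropriate criteria depend on the workflow.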

In this sense, semantic enrichment becomes an iterative analytical workflow rather than a single terminal annotation step.