When AI systems score essays, short-answer responses, or structured tasks, a critical fairness question arises: does the AI scoring engine shift item difficulties differently for different demographic groups?
Classical DIF methods test whether an item performs differently across groups within a single scoring condition. aiDIF extends this to a paired design: the same items are calibrated under both human and AI scoring, so DIF can be compared across the two conditions and differential AI scoring bias (DASB) can be tested directly.
make_aidif_eg() returns a built-in example with item
parameter MLEs for 6 items in two groups under both scoring conditions.
The planted structure is:

- items 1, 3, and 5 carry DIF under both scoring conditions;
- item 3 additionally carries a group-dependent AI scoring shift (the DASB signal).
fit_aidif() runs the robust IRLS engine under each
scoring condition and performs the DASB test.
eg <- make_aidif_eg()

mod <- fit_aidif(
  human_mle = eg$human,
  ai_mle = eg$ai,
  alpha = 0.05
)
print(mod)
#> AI-DIF Analysis
#> ----------------------------------------
#> Human scoring — robust scale est: -0.5776 (SE: 0.0747)
#> — DIF items flagged: 3 / 6
#> AI scoring — robust scale est: -0.5921 (SE: 0.0748)
#> — DIF items flagged: 3 / 6
#> DASB test — items with differential AI bias: 1 / 6

summary(mod)
#> =============================================================
#> AI Differential Item Functioning Analysis (aiDIF)
#> =============================================================
#>
#> --- Human Scoring DIF ----------------------------------------
#> Robust scale estimate: -0.5776 (SE: 0.0747)
#> Wald DIF tests:
#> delta se z p_val
#> item1_d1 0.5693 0.0759 7.4995 0.0000
#> item2_d1 0.0366 0.1060 0.3448 0.7303
#> item3_d1 0.2302 0.0623 3.6953 0.0002
#> item4_d1 0.0163 0.0931 0.1756 0.8606
#> item5_d1 0.2700 0.0693 3.8947 0.0001
#> item6_d1 -0.1181 0.1232 -0.9584 0.3379
#>
#> --- AI Scoring DIF -------------------------------------------
#> Robust scale estimate: -0.5921 (SE: 0.0748)
#> Wald DIF tests:
#> delta se z p_val
#> item1_d1 0.5756 0.0761 7.5596 0.0000
#> item2_d1 0.0466 0.1046 0.4458 0.6557
#> item3_d1 0.5499 0.0619 8.8820 0.0000
#> item4_d1 0.0046 0.0926 0.0495 0.9605
#> item5_d1 0.3308 0.0695 4.7559 0.0000
#> item6_d1 -0.1455 0.1240 -1.1737 0.2405
#>
#> --- Differential AI Scoring Bias (DASB) ---------------------
#> H0: AI scoring shift does not differ across groups
#> (Positive DASB => AI scoring disadvantages focal group)
#>
#> shift_g1 shift_g2 DASB se z p_val
#> item1 0.13 0.12 -0.01 0.14 -0.071 0.9431
#> item2 0.08 0.07 -0.01 0.14 -0.071 0.9431
#> item3 0.11 0.54 0.43 0.14 3.071 0.0021
#> item4 0.12 0.09 -0.03 0.14 -0.214 0.8303
#> item5 0.07 0.13 0.06 0.14 0.429 0.6682
#> item6 0.11 0.08 -0.03 0.14 -0.214 0.8303
#>
#> --- AI-Effect Classification ---------------------------------
#> stable_clean : not flagged in either condition
#> stable_dif : flagged in both (same direction)
#> introduced : flagged only under AI scoring
#> masked : flagged only under human scoring
#> new_direction : flagged in both, opposite direction
#>
#> human_delta ai_delta human_flag ai_flag status
#> item1_d1 0.5693 0.5756 TRUE TRUE stable_dif
#> item2_d1 0.0366 0.0466 FALSE FALSE stable_clean
#> item3_d1 0.2302 0.5499 TRUE TRUE stable_dif
#> item4_d1 0.0163 0.0046 FALSE FALSE stable_clean
#> item5_d1 0.2700 0.3308 TRUE TRUE stable_dif
#> item6_d1 -0.1181 -0.1455 FALSE FALSE stable_clean
#>
#> Status counts:
#>
#> stable_clean stable_dif
#> 3 3

scoring_bias_test() can also be called directly.
sb <- scoring_bias_test(eg$human, eg$ai)
print(sb)
#> shift_g1 shift_g2 DASB se z p_val
#> item1 0.13 0.12 -0.01 0.14 -0.071 0.9431
#> item2 0.08 0.07 -0.01 0.14 -0.071 0.9431
#> item3 0.11 0.54 0.43 0.14 3.071 0.0021
#> item4 0.12 0.09 -0.03 0.14 -0.214 0.8303
#> item5 0.07 0.13 0.06 0.14 0.429 0.6682
#> item6 0.11 0.08 -0.03 0.14 -0.214 0.8303

Item 3 should be significant, reflecting the planted group-dependent AI scoring bias.
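For intuition, the DASB column is the difference of AI-minus-human difficulty shifts between the two groups, which can be verified by hand from the table. A minimal sketch using hypothetical per-group difficulty MLEs (illustration only; the actual structure of aiDIF objects may differ):

```r
# Hypothetical difficulty MLEs for one item in two groups (illustration
# only; not objects returned by the package)
human_d <- c(g1 = 0.50, g2 = 0.60)
ai_d    <- c(g1 = 0.61, g2 = 1.14)

shift <- ai_d - human_d                     # AI-minus-human shift per group
dasb  <- unname(shift["g2"] - shift["g1"])  # group difference of shifts

round(shift, 2)  # g1 = 0.11, g2 = 0.54, matching the item3 row above
round(dasb, 2)   # 0.43
```

This reproduces the item3 pattern: similar difficulty shifts in both groups leave DASB near zero, while a shift concentrated in one group produces a large DASB.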
eff <- ai_effect_summary(mod$dif_human, mod$dif_ai)
print(eff)
#> human_delta ai_delta human_flag ai_flag status
#> item1_d1 0.5693 0.5756 TRUE TRUE stable_dif
#> item2_d1 0.0366 0.0466 FALSE FALSE stable_clean
#> item3_d1 0.2302 0.5499 TRUE TRUE stable_dif
#> item4_d1 0.0163 0.0046 FALSE FALSE stable_clean
#> item5_d1 0.2700 0.3308 TRUE TRUE stable_dif
#> item6_d1 -0.1181 -0.1455 FALSE FALSE stable_clean

| Status | Meaning |
|---|---|
| introduced | AI scoring creates DIF not present under human scoring |
| masked | AI scoring hides DIF that existed under human scoring |
| stable_dif | DIF detected in both conditions |
| stable_clean | No DIF in either condition |
| new_direction | DIF detected in both conditions, but in opposite directions |
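The classification above can be sketched as a simple decision rule over the per-condition flags and deltas (a hypothetical helper for illustration; aiDIF's internal implementation may differ):

```r
# Sketch of the AI-effect classification rule (hypothetical helper,
# not the package's internal code)
classify_item <- function(human_flag, ai_flag, human_delta, ai_delta) {
  if (!human_flag && !ai_flag) return("stable_clean")
  if (human_flag && !ai_flag)  return("masked")       # DIF only under human scoring
  if (!human_flag && ai_flag)  return("introduced")   # DIF only under AI scoring
  # flagged in both: direction decides
  if (sign(human_delta) == sign(ai_delta)) "stable_dif" else "new_direction"
}

classify_item(TRUE,  TRUE,  0.2302, 0.5499)  # "stable_dif"   (item3_d1)
classify_item(FALSE, FALSE, 0.0366, 0.0466)  # "stable_clean" (item2_d1)
```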
dat <- simulate_aidif_data(
  n_items = 8,          # total number of items
  n_obs = 600,          # number of observations
  dif_items = c(1, 2),  # items with planted DIF
  dif_mag = 0.5,        # magnitude of the planted DIF
  dasb_items = 5,       # item with planted differential AI scoring bias
  dasb_mag = 0.4,       # magnitude of the planted DASB
  seed = 123
)
sim_mod <- fit_aidif(dat$human, dat$ai)
print(sim_mod)
#> AI-DIF Analysis
#> ----------------------------------------
#> Human scoring — robust scale est: -0.2670 (SE: 0.0322)
#> — DIF items flagged: 4 / 8
#> AI scoring — robust scale est: 0.0536 (SE: 0.0363)
#> — DIF items flagged: 5 / 8
#> DASB test — items with differential AI bias: 1 / 8
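The per-item DIF flags in these summaries correspond to Wald z-tests of each delta against zero. A minimal sketch of that rule, using the item1_d1 numbers from the human-scoring table earlier (the package's internals may differ, e.g. in any multiplicity handling):

```r
# Wald flag rule, illustrated with the item1_d1 human-scoring estimates
delta <- 0.5693
se    <- 0.0759
z     <- delta / se           # Wald statistic, approx. 7.5 (table: 7.4995)
p     <- 2 * pnorm(-abs(z))   # two-sided p-value
flag  <- p < 0.05             # alpha = 0.05, as passed to fit_aidif()
flag
#> [1] TRUE
```

The small discrepancy between 7.50 here and 7.4995 in the table comes from rounding delta and se to four decimals.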