| Type: | Package |
| Title: | Differential Item Functioning for AI-Scored Assessments |
| Version: | 0.1.0 |
| Description: | Detects and quantifies differential item functioning (DIF) in AI-scored educational and psychological assessments. Provides a fully self-contained robust DIF engine (M-estimation via iteratively re-weighted least squares with the bi-square loss) alongside the novel Differential AI Scoring Bias (DASB) test, which detects item-level scoring shifts that differ across subgroups when comparing human and AI scoring conditions. Includes simulation utilities, anchor weight diagnostics, and an AI-effect classification framework. |
| License: | GPL (≥ 3) |
| Encoding: | UTF-8 |
| Depends: | R (≥ 3.5.0) |
| Imports: | Matrix, stats, graphics |
| Suggests: | mirt, testthat (≥ 3.0.0), knitr, rmarkdown |
| Config/testthat/edition: | 3 |
| RoxygenNote: | 7.3.3 |
| VignetteBuilder: | knitr |
| URL: | https://github.com/causalfragility-lab/aiDIF |
| BugReports: | https://github.com/causalfragility-lab/aiDIF/issues |
| NeedsCompilation: | no |
| Packaged: | 2026-04-20 20:43:11 UTC; Subir |
| Author: | Subir Hait |
| Maintainer: | Subir Hait <haitsubi@msu.edu> |
| Repository: | CRAN |
| Date/Publication: | 2026-04-21 20:52:36 UTC |
Summarise the effect of AI scoring on DIF flagging.
Description
Compares the DIF flagging patterns from human and AI scoring conditions
and classifies each item as: "stable_clean" (not flagged in
either), "stable_dif" (flagged in both), "introduced"
(flagged only under AI), "masked" (flagged only under human), or
"new_direction" (flagged in both but bias reverses sign).
Usage
ai_effect_summary(dif_human, dif_ai, alpha = 0.05)
Arguments
dif_human |
A |
dif_ai |
A |
alpha |
Significance threshold for flagging. Default: |
Value
A data.frame with one row per item/threshold and columns:
human_delta: Estimated DIF effect under human scoring.
ai_delta: Estimated DIF effect under AI scoring.
human_flag: Logical: flagged under human scoring?
ai_flag: Logical: flagged under AI scoring?
status: Classification (see Description).
Examples
eg <- make_aidif_eg()
mod <- fit_aidif(eg$human, eg$ai)
ai_effect_summary(mod$dif_human, mod$dif_ai)
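The five status labels listed in the Description can be illustrated with a small base-R sketch. The helper name classify_ai_effect and the input vectors are hypothetical and only mirror the documented columns; this is not the package's internal code.

```r
# Hypothetical helper: derive the status label from flags and effect signs.
classify_ai_effect <- function(human_delta, ai_delta, human_flag, ai_flag) {
  status <- rep("stable_clean", length(human_delta))   # flagged in neither
  both <- human_flag & ai_flag
  status[both] <- "stable_dif"                          # flagged in both
  status[both & sign(human_delta) != sign(ai_delta)] <- "new_direction"
  status[!human_flag & ai_flag] <- "introduced"         # flagged only under AI
  status[human_flag & !ai_flag] <- "masked"             # flagged only under human
  status
}

# Made-up values, one item per status category.
human_delta <- c(0.02, 0.50, -0.03, 0.45,  0.40)
ai_delta    <- c(0.01, 0.48,  0.42, 0.05, -0.38)
human_flag  <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
ai_flag     <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

status <- classify_ai_effect(human_delta, ai_delta, human_flag, ai_flag)
data.frame(human_delta, ai_delta, human_flag, ai_flag, status)
```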
Anchor item weights from the robust AI-DIF procedure.
Description
Returns the bi-square weights assigned to each item under each scoring condition. Items with weight near zero are effectively excluded from the robust scaling estimate, indicating likely DIF contamination.
Usage
anchor_weights(object)
Arguments
object |
An |
Value
A data.frame with columns human_weight and
(if AI data were provided) ai_weight. Higher weight means
the item is contributing more to the robust scale estimate.
Examples
eg <- make_aidif_eg()
mod <- fit_aidif(eg$human, eg$ai)
anchor_weights(mod)
Bi-square psi (influence) function
Description
Bi-square psi (influence) function
Usage
bisq_psi(u, k = 1.96)
Derivative of bi-square psi
Description
Derivative of bi-square psi
Usage
bisq_psi_prime(u, k = 1.96)
Bi-square rho (objective) function
Description
Bi-square rho (objective) function
Usage
bisq_rho(u, k = 1.96)
Bi-square weight function
Description
Bi-square weight function
Usage
bisq_weight(u, k = 1.96)
Arguments
u |
Numeric vector of standardised residuals. |
k |
Tuning parameter (default 1.96). |
Value
Numeric vector of weights in [0, 1].
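For orientation, the four bi-square helpers can be sketched with the textbook Tukey bi-square formulas. The tukey_* names are hypothetical, and whether the package uses exactly this normalisation of rho is an assumption.

```r
# Textbook Tukey bi-square functions (assumed form, not taken from the package).
tukey_weight <- function(u, k = 1.96) {
  ifelse(abs(u) <= k, (1 - (u / k)^2)^2, 0)    # weight in [0, 1], zero beyond k
}
tukey_psi <- function(u, k = 1.96) {
  u * tukey_weight(u, k)                        # influence function: psi(u) = u * w(u)
}
tukey_rho <- function(u, k = 1.96) {
  # objective function; constant (k^2 / 6) once |u| exceeds k
  ifelse(abs(u) <= k, (k^2 / 6) * (1 - (1 - (u / k)^2)^3), k^2 / 6)
}
tukey_psi_prime <- function(u, k = 1.96) {
  r2 <- (u / k)^2
  ifelse(abs(u) <= k, (1 - r2) * (1 - 5 * r2), 0)  # derivative of psi
}

u <- c(-3, -1, 0, 0.5, 2.5)
cbind(u, weight = tukey_weight(u), psi = tukey_psi(u), rho = tukey_rho(u))
```

Residuals with |u| > k get weight zero, which is what removes strongly deviating items from the location estimate.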
Block-diagonal joint covariance matrix for both groups
Description
Block-diagonal joint covariance matrix for both groups
Usage
build_joint_vcov(mle)
Arguments
mle |
A validated mle list. |
Value
A Matrix::bdiag sparse block-diagonal matrix.
Compute item-level IRT scaling functions
Description
For each item, computes a standardised difference between
group parameter estimates. The result is a vector y
whose robust location is estimated by
estimate_robust_scale.
Usage
compute_scaling_fn(mle, type = "intercept", scale_by = "pooled")
Arguments
mle |
A validated |
type |
One of |
scale_by |
One of |
Value
A named numeric vector of scaling-function values, one entry per item threshold (or per item for slopes).
Robust DIF scale estimation via IRLS
Description
Estimates a robust location parameter for the vector of IRT scaling functions using iteratively re-weighted least squares (IRLS) with the bi-square loss. This is the core estimation engine of aiDIF.
Usage
estimate_robust_scale(
mle,
alpha = 0.05,
scale_by = "pooled",
tol = 1e-07,
maxit = 100L
)
Arguments
mle |
A validated mle list. |
alpha |
Significance level controlling the bi-square
tuning parameter |
scale_by |
Scaling denominator; passed to
|
tol |
Convergence tolerance. Default |
maxit |
Maximum IRLS iterations. Default |
Value
A list of class rdif_fit with elements:
est: Estimated robust scale parameter.
weights: Bi-square item weights.
rho_value: Value of objective at solution.
n_iter: Number of iterations used.
k: Tuning parameter used.
y: Raw scaling function values.
vcov_est: Covariance matrix of y at solution.
dif_test: Wald item-level DIF test (data.frame).
dtf_test: Wald test of differential test functioning.
Examples
dat <- simulate_aidif_data(n_items = 5, seed = 1)
fit <- estimate_robust_scale(dat$human)
print(fit$est)
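The IRLS idea behind this function can be sketched in base R for a plain numeric vector. The helper name irls_bisquare_location, the median starting value, and the use of fixed standard errors se are assumptions made for illustration; the package works on mle lists and may differ in its starting value and variance handling.

```r
# Minimal IRLS sketch for a bi-square location estimate (assumed simplification).
irls_bisquare_location <- function(y, se = rep(1, length(y)), k = 1.96,
                                   tol = 1e-7, maxit = 100L) {
  bisq_w <- function(u) ifelse(abs(u) <= k, (1 - (u / k)^2)^2, 0)
  theta <- median(y)                       # robust starting value (assumption)
  n_iter <- 0L
  for (iter in seq_len(maxit)) {
    n_iter <- iter
    w <- bisq_w((y - theta) / se)          # weights from standardised residuals
    if (sum(w) == 0) stop("all weights are zero; try a larger k")
    theta_new <- sum(w * y / se^2) / sum(w / se^2)   # weighted least-squares step
    converged <- abs(theta_new - theta) < tol
    theta <- theta_new
    if (converged) break
  }
  list(est = theta, weights = bisq_w((y - theta) / se), n_iter = n_iter)
}

# Five values near 0.5 plus one contaminated value.
y   <- c(0.48, 0.52, 0.50, 0.47, 0.53, 1.60)
fit <- irls_bisquare_location(y, se = rep(0.1, 6))
fit$est
round(fit$weights, 3)
```

The contaminated value ends up with weight zero, so the estimate stays close to the bulk of the data rather than being pulled towards the plain mean.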
Fit the AI-DIF model
Description
The primary estimation function of aiDIF. Runs the robust DIF
procedure under both human and AI scoring using the built-in IRLS engine
(estimate_robust_scale), then tests for Differential AI
Scoring Bias (DASB).
Usage
fit_aidif(
human_mle,
ai_mle = NULL,
alpha = 0.05,
scale_by = "pooled",
tol = 1e-07,
maxit = 100L
)
Arguments
human_mle |
A validated mle list for human-scored data. |
ai_mle |
A validated mle list for AI-scored data, or |
alpha |
Significance level. Default |
scale_by |
Denominator for standardising intercept differences:
|
tol |
IRLS convergence tolerance. Default |
maxit |
Maximum IRLS iterations. Default |
Value
An object of class "aidif".
See Also
estimate_robust_scale,
scoring_bias_test, simulate_aidif_data
Examples
dat <- simulate_aidif_data(n_items = 6, seed = 1)
mod <- fit_aidif(dat$human, dat$ai)
print(mod)
summary(mod)
Gradient of the intercept scaling function
Description
Computes the Jacobian of compute_scaling_fn
with respect to all item parameters in both groups,
organised to be conformable with the block-diagonal
covariance matrix built by build_joint_vcov.
Usage
grad_intercept_fn(mle, theta = NULL, scale_by = "pooled")
Arguments
mle |
A validated mle list. |
theta |
Optional scalar: if supplied, item-specific
scaling-function values are replaced by |
scale_by |
Passed from |
Value
A matrix with n_items * n_thresholds columns,
each being the gradient vector of one scaling-function
entry with respect to the full parameter vector.
Grid search for bi-square objective minimum (starting value)
Description
Grid search for bi-square objective minimum (starting value)
Usage
grid_rho_search(y, var_fn, k, width = 0.01)
Least trimmed squares estimate of location (starting value)
Description
Least trimmed squares estimate of location (starting value)
Usage
lts_location(y, trim = 0.5)
Arguments
y |
Numeric vector. |
trim |
Proportion to trim (default 0.5). |
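A least trimmed squares location estimate for a single variable can be sketched as follows: sort the data, slide a window of the retained size over it, and take the mean of the window with the smallest sum of squares. The helper name lts_location_sketch and the choice h = n - floor(n * trim) for the window size are assumptions; the package may use a different convention.

```r
# Univariate LTS location sketch (window-size convention is an assumption).
lts_location_sketch <- function(y, trim = 0.5) {
  y <- sort(y)
  n <- length(y)
  h <- n - floor(n * trim)                 # number of observations retained
  starts <- seq_len(n - h + 1)
  ss <- vapply(starts, function(s) {
    win <- y[s:(s + h - 1)]
    sum((win - mean(win))^2)               # within-window sum of squares
  }, numeric(1))
  best <- which.min(ss)
  mean(y[best:(best + h - 1)])             # mean of the tightest window
}

y <- c(0.10, 0.20, 0.15, 0.12, 0.19, 3.0, 2.5)
lts_est <- lts_location_sketch(y)
c(lts = lts_est, mean = mean(y))
```

Because the two large values never enter the selected window, the estimate is a usable starting point for the iterative bi-square fit.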
Built-in example dataset for aiDIF
Description
Constructs and returns the built-in example dataset: paired human and AI item parameter estimates for 6 items in two groups, with known DIF and DASB planted at specific items.
Usage
make_aidif_eg()
Details
The data-generating model includes:
- Item 1: DIF under human scoring (intercept +0.5 in focal group).
- Item 3: Differential AI Scoring Bias (DASB); AI scoring adds +0.4 to the focal-group intercept only.
- Impact: 0.5 SD (focal group higher on latent trait).
- AI drift: uniform +0.1 calibration offset on all items in both groups.
Value
A list with elements human and ai, each a
validated mle list (see simulate_aidif_data for
format details).
See Also
simulate_aidif_data, fit_aidif
Examples
eg <- make_aidif_eg()
mod <- fit_aidif(eg$human, eg$ai)
summary(mod)
S3 plot method for class "aidif".
Description
Produces one of several diagnostic plots depending on type.
Usage
## S3 method for class 'aidif'
plot(x, type = "dif_forest", ...)
Arguments
x |
An object of class |
type |
Character. One of:
|
... |
Additional graphical parameters passed to low-level plot functions. |
Value
x, invisibly.
S3 print method for class "aidif".
Description
Prints a compact summary of the estimated robust scaling parameters and, when available, the number of items flagged for DIF and DASB.
Usage
## S3 method for class 'aidif'
print(x, ...)
Arguments
x |
An object of class |
... |
Further arguments (currently ignored). |
Value
x, invisibly.
Validate and bundle paired human/AI parameter estimates
Description
Takes two mle lists (one per scoring condition) and returns a validated
aidif_data object for use in fit_aidif.
Usage
read_ai_scored(human_mle, ai_mle)
Arguments
human_mle |
An mle list for human-scored data. Must contain
|
ai_mle |
An mle list for AI-scored data in the same format. |
Value
A list of class "aidif_data" with elements
human and ai.
See Also
fit_aidif, make_aidif_eg,
simulate_aidif_data
Differential AI Scoring Bias (DASB) test.
Description
For each item, computes the change in item intercept from human to AI scoring within each group, then tests whether this scoring shift differs significantly across groups. A significant result indicates that the AI scoring engine introduces a group-dependent parameter distortion: the AI does not merely shift all items uniformly but disfavours (or favours) one group at specific items.
Usage
scoring_bias_test(human_mle, ai_mle, fun = "d_fun3")
Arguments
human_mle |
Output of |
ai_mle |
Output of |
fun |
Scaling function (passed to
the internal scaling function) to use when normalising shifts.
Default: |
Details
Estimand. Define the scoring shift in group \(g\) for item \(i\), threshold \(j\), as:
\[ \delta_{igj} = d_{igj}^{\text{AI}} - d_{igj}^{\text{H}} \]
where the superscripts AI and H denote the AI and human scoring conditions. The DASB is \(\text{DASB}_{ij} = \delta_{i2j} - \delta_{i1j}\). Under \(H_0: \text{DASB}_{ij} = 0\) and independence across scoring conditions and groups,
\[ \widehat{\mathrm{Var}}(\text{DASB}_{ij}) = (\sigma_{i1j}^{\text{H}})^2 + (\sigma_{i2j}^{\text{H}})^2 + (\sigma_{i1j}^{\text{AI}})^2 + (\sigma_{i2j}^{\text{AI}})^2 \]
where each \(\sigma^2\) is the diagonal element of the corresponding group-specific covariance matrix.
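The arithmetic of the estimand and its variance can be traced with a small base-R example. All numbers are made up for illustration, and the variable names are hypothetical; the sketch uses raw intercepts and does not apply the scaling function selected by fun.

```r
# Intercept estimates for 3 items: group 1 / group 2, human / AI scoring.
d_h_g1  <- c(0.0, 0.5, -0.3)
d_h_g2  <- c(0.0, 0.5, -0.3)
d_ai_g1 <- d_h_g1 + 0.1                    # uniform AI drift only
d_ai_g2 <- d_h_g2 + 0.1 + c(0, 0, 0.4)     # drift plus group-specific shift on item 3

# Sampling variances (diagonal elements of the four covariance matrices).
v_h_g1 <- v_h_g2 <- v_ai_g1 <- v_ai_g2 <- rep(0.01, 3)

shift_g1 <- d_ai_g1 - d_h_g1               # delta_{i1}
shift_g2 <- d_ai_g2 - d_h_g2               # delta_{i2}
dasb     <- shift_g2 - shift_g1            # DASB_i
se       <- sqrt(v_h_g1 + v_h_g2 + v_ai_g1 + v_ai_g2)
z        <- dasb / se                      # Wald z-statistic
p_val    <- 2 * pnorm(-abs(z))             # two-tailed p-value

data.frame(shift_g1, shift_g2, DASB = dasb, se, z, p_val)
```

The uniform drift of 0.1 cancels in the difference of shifts, so only the item with a group-specific shift produces a non-zero DASB.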
Value
A data.frame with one row per item (per threshold for
polytomous items) and columns:
shift_g1: Scoring shift \(\delta_{i1} = d_{i1}^{\text{AI}} - d_{i1}^{\text{H}}\).
shift_g2: Scoring shift \(\delta_{i2} = d_{i2}^{\text{AI}} - d_{i2}^{\text{H}}\).
DASB: Differential AI Scoring Bias: \(\delta_{i2} - \delta_{i1}\).
se: Standard error of DASB under the delta method.
z: Wald z-statistic.
p_val: Two-tailed p-value.
Examples
eg <- make_aidif_eg()
scoring_bias_test(eg$human, eg$ai)
Simulate item parameter estimates for the AI-DIF model.
Description
Generates a synthetic aidif_data-compatible list suitable for
benchmarking and method evaluation. The data-generating model contains:
classical DIF in the human scoring condition (controlled via
dif_items and dif_mag), differential AI scoring bias
(controlled via dasb_items and dasb_mag), and a latent
group mean difference (impact).
Usage
simulate_aidif_data(
n_items = 10L,
n_obs = 500L,
impact = 0.5,
dif_items = 1L,
dif_mag = 0.5,
dasb_items = 3L,
dasb_mag = 0.4,
ai_drift = 0.1,
seed = 42L
)
Arguments
n_items |
Integer. Number of items. Default: |
n_obs |
Integer. Approximate number of observations per group,
used to scale the covariance matrices. Default: |
impact |
Numeric. Latent mean difference (group 2 minus group 1)
in SD units. Default: |
dif_items |
Integer vector. Indices of items with DIF in the human
scoring condition (intercept shift added to group 2). Default:
|
dif_mag |
Numeric. Magnitude of the intercept DIF effect (in
IRT metric). Default: |
dasb_items |
Integer vector. Indices of items where AI scoring
introduces differential bias (intercept shift added to group 2 in the
AI condition only). Default: |
dasb_mag |
Numeric. Magnitude of the DASB effect. Default:
|
ai_drift |
Numeric. Uniform intercept shift applied to ALL items in
BOTH groups under AI scoring (simulates AI calibration offset).
Default: |
seed |
Integer seed for reproducibility, or |
Details
Rather than simulating item responses and refitting IRT models (which
requires additional dependencies), this function directly simulates
maximum-likelihood estimates and their asymptotic covariance matrices,
consistent with a 2PL model fitted to n_obs observations per
group.
Value
A list with elements human and ai, each formatted
identically to the output of
read_ai_scored. Can be passed directly to
fit_aidif.
Examples
dat <- simulate_aidif_data(
n_items = 8,
n_obs = 600,
dif_items = c(1, 2),
dasb_items = 5
)
mod <- fit_aidif(dat$human, dat$ai)
summary(mod)
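The approach described in Details, simulating estimates and their covariance directly instead of simulating responses, can be sketched roughly as below. The helper name simulate_mle_sketch, the list layout, the diagonal covariance, and the variance 4 / n_obs are all assumptions chosen for illustration; the sketch covers intercepts for one scoring condition only and leaves out impact, DASB, and AI drift.

```r
# Rough sketch: draw "estimates" around true intercepts with variance ~ 1 / n_obs.
simulate_mle_sketch <- function(n_items = 6L, n_obs = 500L, dif_items = 1L,
                                dif_mag = 0.5, seed = 42L) {
  set.seed(seed)
  true_g1 <- seq(-1, 1, length.out = n_items)          # reference-group intercepts
  true_g2 <- true_g1
  true_g2[dif_items] <- true_g2[dif_items] + dif_mag   # plant DIF in group 2
  vcov_g <- diag(4 / n_obs, n_items)                   # ASSUMED sampling covariance
  se <- sqrt(diag(vcov_g))
  list(
    group1 = list(est = true_g1 + rnorm(n_items, sd = se), vcov = vcov_g),
    group2 = list(est = true_g2 + rnorm(n_items, sd = se), vcov = vcov_g)
  )
}

sim <- simulate_mle_sketch(n_obs = 500)
big <- simulate_mle_sketch(n_obs = 5000)
round(sim$group2$est - sim$group1$est, 2)   # item 1 carries the planted DIF
```

Increasing n_obs shrinks the covariance matrices, which is the role the Arguments section assigns to that parameter.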
S3 summary method for class "aidif".
Description
Prints a detailed report including DIF test tables for each scoring condition, the DASB table, and the AI-effect classification.
Usage
## S3 method for class 'aidif'
summary(object, ...)
Arguments
object |
An object of class |
... |
Further arguments (currently ignored). |
Value
NULL, invisibly.
Delta-method covariance matrix of the scaling functions
Description
Delta-method covariance matrix of the scaling functions
Usage
vcov_scaling_fn(mle, theta = NULL, scale_by = "pooled")
Arguments
mle |
A validated mle list. |
theta |
Optional scalar evaluated under H0. |
scale_by |
Passed to gradient function. |
Value
A square matrix of size
n_items * n_thresholds.
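The delta-method step can be illustrated for the simplest possible scaling function, the unscaled difference y_i = d_{i2} - d_{i1}. This choice, and the variable names below, are assumptions made to keep the Jacobian trivial; the package's standardised scaling functions have a more involved gradient.

```r
# Group-specific covariance matrices for two item intercepts (made-up values).
V1 <- matrix(c(0.010, 0.002, 0.002, 0.012), nrow = 2)
V2 <- matrix(c(0.015, 0.001, 0.001, 0.011), nrow = 2)

# Block-diagonal joint covariance of (group 1 parameters, group 2 parameters).
zero  <- matrix(0, 2, 2)
Sigma <- rbind(cbind(V1, zero), cbind(zero, V2))

# Jacobian: one column per scaling-function entry, one row per parameter.
# For y_i = d_i2 - d_i1 the gradient is -1 for d_i1 and +1 for d_i2.
J <- rbind(-diag(2), diag(2))

# Delta method: Var(y) is approximately J' Sigma J.
V_y <- t(J) %*% Sigma %*% J
V_y
```

With independent groups the result reduces to V1 + V2, which gives a quick check on the matrix layout.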
Wald item-level DIF test
Description
Tests H0: y_i = theta for each item, using the projected variance that accounts for estimation of theta itself.
Usage
wald_dif_test(y, theta, Vcov)
Arguments
y |
Scaling function values. |
theta |
Estimated robust scale parameter. |
Vcov |
Covariance matrix of y (at theta under H0). |
Value
A data.frame with columns delta, se,
z, p_val.
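A simplified version of the item-level Wald test is sketched below. It treats theta as fixed and uses only the diagonal of Vcov, so it omits the projection that accounts for estimating theta; the helper name wald_dif_sketch and the input values are hypothetical.

```r
# Simplified Wald test of H0: y_i = theta (ignores uncertainty in theta).
wald_dif_sketch <- function(y, theta, Vcov) {
  delta <- y - theta                    # estimated item-level DIF effect
  se    <- sqrt(diag(Vcov))             # marginal standard errors only
  z     <- delta / se
  data.frame(delta = delta, se = se, z = z, p_val = 2 * pnorm(-abs(z)))
}

y     <- c(0.05, -0.02, 0.60)           # scaling-function values
theta <- 0.01                           # robust location estimate
Vcov  <- diag(0.04, 3)
res   <- wald_dif_sketch(y, theta, Vcov)
res
```

Only the third item deviates from theta by several standard errors, so it is the only one with a small p-value.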
Wald test of differential test functioning (DTF)
Description
Tests H0: mean(y) - theta = 0, i.e. whether the robust scale estimate differs significantly from the naive mean. A significant result indicates meaningful DTF.
Usage
wald_dtf_test(y, theta, weights, Vcov_H0, Vcov_raw, k)
Arguments
y |
Scaling function values. |
theta |
Robust scale estimate. |
weights |
Bi-square weights from IRLS. |
Vcov_H0 |
Covariance of y under H0 (theta plugged in). |
Vcov_raw |
Covariance of y (no theta substitution). |
k |
Bi-square tuning parameter. |
Value
A one-row data.frame.