Help for package aiDIF

Type:

Package

Title:

Differential Item Functioning for AI-Scored Assessments

Version:

0.1.0

Description:

Detects and quantifies differential item functioning (DIF) in AI-scored educational and psychological assessments. Provides a fully self-contained robust DIF engine (M-estimation via iteratively re-weighted least squares with the bi-square loss) alongside the novel Differential AI Scoring Bias (DASB) test, which detects item-level scoring shifts that differ across subgroups when comparing human and AI scoring conditions. Includes simulation utilities, anchor weight diagnostics, and an AI-effect classification framework.

License:

GPL (≥ 3)

Encoding:

UTF-8

Depends:

R (≥ 3.5.0)

Imports:

Matrix, stats, graphics

Suggests:

mirt, testthat (≥ 3.0.0), knitr, rmarkdown

Config/testthat/edition:

RoxygenNote:

7.3.3

VignetteBuilder:

knitr

URL:

https://github.com/causalfragility-lab/aiDIF

BugReports:

https://github.com/causalfragility-lab/aiDIF/issues

NeedsCompilation:

Packaged:

2026-04-20 20:43:11 UTC; Subir

Author:

Subir Hait

[aut, cre]

Maintainer:

Subir Hait <haitsubi@msu.edu>

Repository:

CRAN

Date/Publication:

2026-04-21 20:52:36 UTC

Summarise the effect of AI scoring on DIF flagging.

Description

Compares the DIF flagging patterns from the human and AI scoring conditions and classifies each item as "stable_clean" (not flagged in either), "stable_dif" (flagged in both, with the same sign), "introduced" (flagged only under AI), "masked" (flagged only under human), or "new_direction" (flagged in both, but the bias reverses sign).

Usage

ai_effect_summary(dif_human, dif_ai, alpha = 0.05)

Arguments

dif_human

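The DIF test table (a data.frame) for the human scoring condition, taken from a fit_aidif result, for example mod$dif_human.

dif_ai

The DIF test table (a data.frame) for the AI scoring condition, taken from a fit_aidif result, for example mod$dif_ai.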

alpha

Significance threshold for flagging. Default: 0.05.

Value

A data.frame with one row per item/threshold and columns:

human_delta: Estimated DIF effect under human scoring.
ai_delta: Estimated DIF effect under AI scoring.
human_flag: Logical: flagged under human scoring?
ai_flag: Logical: flagged under AI scoring?
status: Classification (see Description).

Examples

eg <- make_aidif_eg()
mod <- fit_aidif(eg$human, eg$ai)
ai_effect_summary(mod$dif_human, mod$dif_ai)
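
The classification rule described above can be sketched in base R. This is only an illustration: classify_ai_effect is a hypothetical helper, not a function of aiDIF, and it assumes each input table carries delta and p_val columns (as wald_dif_test returns) and that flagging means p_val < alpha. The toy numbers are made up.

```r
# Hypothetical helper, not part of aiDIF: classify items from two DIF tables.
# Assumes each table has columns `delta` and `p_val`.
classify_ai_effect <- function(dif_human, dif_ai, alpha = 0.05) {
  human_flag <- dif_human$p_val < alpha
  ai_flag    <- dif_ai$p_val < alpha
  same_sign  <- sign(dif_human$delta) == sign(dif_ai$delta)

  status <- rep("stable_clean", length(human_flag))
  status[human_flag & ai_flag & same_sign]  <- "stable_dif"
  status[human_flag & ai_flag & !same_sign] <- "new_direction"
  status[!human_flag & ai_flag]             <- "introduced"
  status[human_flag & !ai_flag]             <- "masked"

  data.frame(
    human_delta = dif_human$delta,
    ai_delta    = dif_ai$delta,
    human_flag  = human_flag,
    ai_flag     = ai_flag,
    status      = status
  )
}

# Toy inputs (made-up numbers, for illustration only)
toy_human <- data.frame(delta = c(0.02, 0.60, 0.05, 0.55, 0.50),
                        p_val = c(0.80, 0.001, 0.70, 0.002, 0.01))
toy_ai    <- data.frame(delta = c(0.03, 0.58, 0.45, 0.04, -0.52),
                        p_val = c(0.75, 0.002, 0.004, 0.65, 0.01))
toy_status <- classify_ai_effect(toy_human, toy_ai)
print(toy_status)
```

Each of the five toy items falls into a different category, which makes the mapping from flag pattern to status easy to check by eye.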

Anchor item weights from the robust AI-DIF procedure.

Description

Returns the bi-square weights assigned to each item under each scoring condition. Items with weight near zero are effectively excluded from the robust scaling estimate, indicating likely DIF contamination.

Usage

anchor_weights(object)

Arguments

object

An aidif object from fit_aidif.

Value

A data.frame with columns human_weight and (if AI data were provided) ai_weight. Higher weight means the item is contributing more to the robust scale estimate.

Examples

eg <- make_aidif_eg()
mod <- fit_aidif(eg$human, eg$ai)
anchor_weights(mod)

Bi-square psi (influence) function

Description

Bi-square psi (influence) function

Usage

bisq_psi(u, k = 1.96)

Derivative of bi-square psi

Description

Derivative of bi-square psi

Usage

bisq_psi_prime(u, k = 1.96)

Bi-square rho (objective) function

Description

Bi-square rho (objective) function

Usage

bisq_rho(u, k = 1.96)

Bi-square weight function

Description

Bi-square weight function

Usage

bisq_weight(u, k = 1.96)

Arguments

u

Numeric vector of standardised residuals.

k

Tuning parameter (default 1.96).

Value

Numeric vector of weights in [0, 1].
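
For orientation, the textbook Tukey bi-square family can be written as below. The sketch_ prefix marks these as hypothetical stand-ins: the package's own bisq_rho, bisq_psi, bisq_psi_prime and bisq_weight may use a different normalisation, so treat this as a sketch of the standard forms rather than the package source.

```r
# Textbook Tukey bi-square functions (assumed forms, not the package source).
sketch_bisq_rho <- function(u, k = 1.96) {
  r <- pmin((u / k)^2, 1)              # cap at 1 so rho is constant beyond k
  (k^2 / 6) * (1 - (1 - r)^3)
}

sketch_bisq_psi <- function(u, k = 1.96) {
  ifelse(abs(u) <= k, u * (1 - (u / k)^2)^2, 0)
}

sketch_bisq_psi_prime <- function(u, k = 1.96) {
  r <- (u / k)^2
  ifelse(abs(u) <= k, (1 - r) * (1 - 5 * r), 0)
}

sketch_bisq_weight <- function(u, k = 1.96) {
  ifelse(abs(u) <= k, (1 - (u / k)^2)^2, 0)
}

u <- c(-3, -1, 0, 1, 3)                # standardised residuals
w <- sketch_bisq_weight(u)
print(w)
```

Residuals beyond k receive weight zero, a residual of zero receives weight one, and psi(u) equals u times the weight, which is what makes the IRLS update a weighted mean.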

Block-diagonal joint covariance matrix for both groups

Description

Block-diagonal joint covariance matrix for both groups

Usage

build_joint_vcov(mle)

Arguments

mle

A validated mle list.

Value

A Matrix::bdiag sparse block-diagonal matrix.

Compute item-level IRT scaling functions

Description

For each item, computes a standardised difference between group parameter estimates. The result is a vector y whose robust location is estimated by estimate_robust_scale.

Usage

compute_scaling_fn(mle, type = "intercept", scale_by = "pooled")

Arguments

mle

A validated mle list (output of read_ai_scored or constructed manually). Must contain est$group.1, est$group.2 and matching var.cov matrices.

type

One of "intercept" (default) or "slope". Determines which parameters are compared.

scale_by

One of "pooled" (default), "ref", or "focal". Controls the denominator used to standardise intercept differences: "pooled" uses \sqrt{(a_1^2+a_2^2)/2}; "ref" uses a_1; "focal" uses a_2. Ignored when type = "slope".

Value

A named numeric vector of scaling-function values, one entry per item threshold (or per item for slopes).
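
The three scale_by denominators can be illustrated with a small sketch. sketch_scaling_fn is hypothetical, and it assumes the numerator is the group 2 minus group 1 intercept difference, which this page does not state explicitly. All parameter values are made up.

```r
# Hypothetical sketch; the numerator (d2 - d1) is an assumption.
sketch_scaling_fn <- function(a1, d1, a2, d2,
                              scale_by = c("pooled", "ref", "focal")) {
  scale_by <- match.arg(scale_by)
  denom <- switch(scale_by,
    pooled = sqrt((a1^2 + a2^2) / 2),  # root mean square of the two slopes
    ref    = a1,                       # reference-group slope
    focal  = a2                        # focal-group slope
  )
  (d2 - d1) / denom
}

# Made-up slopes and intercepts for three items
a_ref <- c(1.0, 1.2, 0.8); d_ref <- c(0.0, -0.5, 0.3)
a_foc <- c(1.0, 1.2, 0.8); d_foc <- c(0.5,  0.1, 1.1)
y <- sketch_scaling_fn(a_ref, d_ref, a_foc, d_foc)
print(y)
```

In this toy input the first two items share one value and the third stands apart, which is the kind of pattern a robust location estimate is meant to tolerate.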

Robust DIF scale estimation via IRLS

Description

Estimates a robust location parameter for the vector of IRT scaling functions using iteratively re-weighted least squares (IRLS) with the bi-square loss. This is the core estimation engine of aiDIF.

Usage

estimate_robust_scale(
  mle,
  alpha = 0.05,
  scale_by = "pooled",
  tol = 1e-07,
  maxit = 100L
)

Arguments

mle

A validated mle list.

alpha

Significance level controlling the bi-square tuning parameter k = z_{1-\alpha/2}. Default 0.05.

scale_by

Scaling denominator; passed to compute_scaling_fn. Default "pooled".

tol

Convergence tolerance. Default 1e-7.

maxit

Maximum IRLS iterations. Default 100.

Value

A list of class rdif_fit with elements:

est: Estimated robust scale parameter.
weights: Bi-square item weights.
rho_value: Value of objective at solution.
n_iter: Number of iterations used.
k: Tuning parameter used.
y: Raw scaling function values.
vcov_est: Covariance matrix of y at solution.
dif_test: Wald item-level DIF test (data.frame).
dtf_test: Wald test of differential test functioning.

Examples

dat <- simulate_aidif_data(n_items = 5, seed = 1)
fit <- estimate_robust_scale(dat$human)
print(fit$est)
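
The idea behind the IRLS step can be sketched for the simplified case of independent scaling-function values with known variances. sketch_irls_location is a hypothetical function; the package works with a full covariance matrix and its own starting values, so this is an illustration of the technique under stated assumptions, not the package algorithm.

```r
# Hypothetical IRLS sketch for a bi-square location estimate.
# Assumes independent values y with known variances v (a simplification).
sketch_irls_location <- function(y, v, alpha = 0.05, tol = 1e-7, maxit = 100L) {
  k <- qnorm(1 - alpha / 2)            # k = z_{1 - alpha/2}, as documented
  theta <- median(y)                   # simple robust start (assumption)
  for (iter in seq_len(maxit)) {
    u <- (y - theta) / sqrt(v)         # standardised residuals
    w <- ifelse(abs(u) <= k, (1 - (u / k)^2)^2, 0)
    if (sum(w) == 0) stop("all weights are zero; try another starting value")
    theta_new <- sum(w * y / v) / sum(w / v)   # weighted least squares update
    converged <- abs(theta_new - theta) < tol
    theta <- theta_new
    if (converged) break
  }
  list(est = theta, weights = w, n_iter = iter, k = k)
}

y_toy <- c(0.48, 0.52, 0.50, 0.49, 1.40)   # last value mimics a DIF item
v_toy <- rep(0.01, 5)
fit_toy <- sketch_irls_location(y_toy, v_toy)
print(fit_toy$est)
print(round(fit_toy$weights, 3))
```

The outlying fifth value ends with weight zero, so the estimate is driven by the four values near 0.5.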

Fit the AI-DIF model

Description

The primary estimation function of aiDIF. Runs the robust DIF procedure under both human and AI scoring using the built-in IRLS engine (estimate_robust_scale), then tests for Differential AI Scoring Bias (DASB).

Usage

fit_aidif(
  human_mle,
  ai_mle = NULL,
  alpha = 0.05,
  scale_by = "pooled",
  tol = 1e-07,
  maxit = 100L
)

Arguments

human_mle

A validated mle list for human-scored data.

ai_mle

A validated mle list for AI-scored data, or NULL.

alpha

Significance level. Default 0.05.

scale_by

Denominator for standardising intercept differences: "pooled" (default), "ref", or "focal".

tol

IRLS convergence tolerance. Default 1e-7.

maxit

Maximum IRLS iterations. Default 100.

Value

An object of class "aidif".

Examples

dat <- simulate_aidif_data(n_items = 6, seed = 1)
mod <- fit_aidif(dat$human, dat$ai)
print(mod)
summary(mod)

Gradient of the intercept scaling function

Description

Computes the Jacobian of compute_scaling_fn with respect to all item parameters in both groups, organised to be conformable with the block-diagonal covariance matrix built by build_joint_vcov.

Usage

grad_intercept_fn(mle, theta = NULL, scale_by = "pooled")

Arguments

mle

A validated mle list.

theta

Optional scalar: if supplied, item-specific scaling-function values are replaced by theta in the gradient (used when evaluating under H0).

scale_by

Passed to compute_scaling_fn.

Value

A matrix with n_items * n_thresholds columns, each being the gradient vector of one scaling-function entry with respect to the full parameter vector.

Grid search for bi-square objective minimum (starting value)

Description

Grid search for bi-square objective minimum (starting value)

Usage

grid_rho_search(y, var_fn, k, width = 0.01)

Least trimmed squares estimate of location (starting value)

Description

Least trimmed squares estimate of location (starting value)

Usage

lts_location(y, trim = 0.5)

Arguments

y

Numeric vector.

trim

Proportion to trim (default 0.5).
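
A least trimmed squares location estimate can be sketched as follows. sketch_lts_location is hypothetical, and the retained subset size h = floor(n * (1 - trim)) + 1 is an assumption; the package may define it differently.

```r
# Hypothetical LTS location sketch; the subset size h is an assumption.
sketch_lts_location <- function(y, trim = 0.5) {
  y <- sort(y)
  n <- length(y)
  h <- min(n, floor(n * (1 - trim)) + 1)
  starts <- seq_len(n - h + 1)
  # For univariate data the best subset is contiguous in sorted order,
  # so it is enough to scan the n - h + 1 sliding windows.
  ss <- vapply(starts, function(s) {
    block <- y[s:(s + h - 1)]
    sum((block - mean(block))^2)
  }, numeric(1))
  best <- which.min(ss)
  mean(y[best:(best + h - 1)])
}

y_lts <- c(0.48, 0.52, 0.50, 0.49, 1.40, 1.55)   # made-up values
lts_est <- sketch_lts_location(y_lts)
print(lts_est)
```

With six values and trim = 0.5 the sketch keeps four points, picks the tight cluster near 0.5, and ignores the two large values.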

Built-in example dataset for aiDIF

Description

Constructs and returns the built-in example dataset: paired human and AI item parameter estimates for 6 items in two groups, with known DIF and DASB planted at specific items.

Usage

make_aidif_eg()

Details

The data-generating model includes:

Item 1: DIF under human scoring (intercept +0.5 in focal group).
Item 3: Differential AI Scoring Bias (DASB) — AI scoring adds +0.4 to the focal-group intercept only.
Impact: 0.5 SD (focal group higher on latent trait).
AI drift: uniform +0.1 calibration offset on all items in both groups.

Value

A list with elements human and ai, each a validated mle list (see simulate_aidif_data for format details).

Examples

eg  <- make_aidif_eg()
mod <- fit_aidif(eg$human, eg$ai)
summary(mod)

S3 plot method for class "aidif".

Description

Produces one of several diagnostic plots depending on type.

Usage

## S3 method for class 'aidif'
plot(x, type = "dif_forest", ...)

Arguments

x

An object of class "aidif".

type

Character. One of:

"dif_forest": Forest plot of DIF estimates with 95% confidence intervals for both scoring conditions (default).
"dasb": Bar chart of DASB estimates with error bars.
"weights": Dot plot of bi-square anchor weights.
"rho": Bi-square objective function for human scoring.

...

Additional graphical parameters passed to low-level plot functions.

Value

x, invisibly.

S3 print method for class "aidif".

Description

Prints a compact summary of the estimated robust scaling parameters and, when available, the number of items flagged for DIF and DASB.

Usage

## S3 method for class 'aidif'
print(x, ...)

Arguments

x

An object of class "aidif".

...

Further arguments (currently ignored).

Value

x, invisibly.

Validate and bundle paired human/AI parameter estimates

Description

Takes two mle lists (one per scoring condition) and returns a validated aidif_data object for use in fit_aidif.

Usage

read_ai_scored(human_mle, ai_mle)

Arguments

human_mle

An mle list for human-scored data. Must contain est (a named list group.1, group.2 of data.frames with columns a1, d1) and var.cov (matching list of covariance matrices).

ai_mle

An mle list for AI-scored data in the same format.

Value

A list of class "aidif_data" with elements human and ai.
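
A hand-built mle list with the documented components might look like the sketch below. The helper make_group is hypothetical, the numbers are made up, and the ordering of rows and columns in var.cov (assumed here to be item by item as a1, d1) is not stated on this page, so check it against the package before relying on it.

```r
# Hypothetical sketch of a hand-built mle list for two items.
# Assumes var.cov is ordered item by item as (a1, d1); this is not confirmed.
make_group <- function(a, d) data.frame(a1 = a, d1 = d)

human_mle_toy <- list(
  est = list(
    group.1 = make_group(a = c(1.0, 1.2), d = c(0.0, -0.5)),
    group.2 = make_group(a = c(1.0, 1.2), d = c(0.5,  0.0))
  ),
  var.cov = list(
    group.1 = diag(0.01, 4),           # 2 items x 2 parameters per item
    group.2 = diag(0.01, 4)
  )
)
str(human_mle_toy, max.level = 2)
```

A list of this shape is the kind of input read_ai_scored is documented to validate; whether this exact layout passes validation is not confirmed here.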

Differential AI Scoring Bias (DASB) test.

Description

For each item, computes the change in item intercept from human to AI scoring within each group, then tests whether this scoring shift differs significantly across groups. A significant result indicates the AI scoring engine introduces a group-dependent parameter distortion — i.e., the AI does not merely re-scale all items uniformly but disfavours (or favours) one group at specific items.

Usage

scoring_bias_test(human_mle, ai_mle, fun = "d_fun3")

Arguments

human_mle

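A validated mle list for human-scored data, for example the human element of the simulate_aidif_data output.

ai_mle

A validated mle list for AI-scored data, for example the ai element of the simulate_aidif_data output. Must have the same item/group structure as human_mle.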

fun

Name of the scaling function used to normalise the shifts; it is passed on to the internal scaling routine. Default: "d_fun3".

Details

Estimand. Define the scoring shift in group g for item i threshold j as:

\delta_{igj} = d_{igj}^{\text{AI}} - d_{igj}^{\text{Human}}

The DASB is \delta_{i2j} - \delta_{i1j}. Under H_0: \text{DASB}_{ij} = 0 and independence across scoring conditions and groups,

\widehat{\mathrm{Var}}(\text{DASB}_{ij}) = (\sigma_{i1j}^{H})^2 + (\sigma_{i2j}^{H})^2 + (\sigma_{i1j}^{AI})^2 + (\sigma_{i2j}^{AI})^2

where each \sigma^2 is the diagonal element of the corresponding group-specific covariance matrix.
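
The estimand and its variance can be worked through by hand for a single item. The numbers below are made up, and the sketch ignores any normalisation applied through fun, so it shows the arithmetic of the formulas above rather than the exact output of scoring_bias_test.

```r
# Made-up intercept estimates and standard errors for one item
d_h   <- c(g1 = 0.00, g2 = 0.10)   # human scoring, groups 1 and 2
d_ai  <- c(g1 = 0.10, g2 = 0.60)   # AI scoring, groups 1 and 2
se_h  <- c(g1 = 0.08, g2 = 0.08)
se_ai <- c(g1 = 0.08, g2 = 0.08)

shift   <- d_ai - d_h                            # delta_{i1}, delta_{i2}
dasb    <- unname(shift["g2"] - shift["g1"])     # delta_{i2} - delta_{i1}
se_dasb <- sqrt(sum(se_h^2) + sum(se_ai^2))      # four variances added
z_dasb  <- dasb / se_dasb                        # Wald z-statistic
p_dasb  <- 2 * pnorm(-abs(z_dasb))               # two-tailed p-value

print(c(DASB = dasb, se = se_dasb, z = z_dasb, p_val = p_dasb))
```

Here the AI shift is 0.10 in group 1 and 0.50 in group 2, so the DASB is 0.40 with a standard error of 0.16 and a z-statistic of 2.5.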

Value

A data.frame with one row per item (per threshold for polytomous items) and columns:

shift_g1: Scoring shift \delta_{i1} = d_{i1}^{AI} - d_{i1}^{H}.
shift_g2: Scoring shift \delta_{i2} = d_{i2}^{AI} - d_{i2}^{H}.
DASB: Differential AI Scoring Bias: \delta_{i2} - \delta_{i1}.
se: Standard error of DASB under the delta method.
z: Wald z-statistic.
p_val: Two-tailed p-value.

Examples

eg <- make_aidif_eg()
scoring_bias_test(eg$human, eg$ai)

Simulate item parameter estimates for the AI-DIF model.

Description

Generates a synthetic aidif_data-compatible list suitable for benchmarking and method evaluation. The data-generating model contains: classical DIF in the human scoring condition (controlled via dif_items and dif_mag), differential AI scoring bias (controlled via dasb_items and dasb_mag), and a latent group mean difference (impact).

Usage

simulate_aidif_data(
  n_items = 10L,
  n_obs = 500L,
  impact = 0.5,
  dif_items = 1L,
  dif_mag = 0.5,
  dasb_items = 3L,
  dasb_mag = 0.4,
  ai_drift = 0.1,
  seed = 42L
)

Arguments

n_items

Integer. Number of items. Default: 10.

n_obs

Integer. Approximate number of observations per group, used to scale the covariance matrices. Default: 500.

impact

Numeric. Latent mean difference (group 2 minus group 1) in SD units. Default: 0.5.

dif_items

Integer vector. Indices of items with DIF in the human scoring condition (intercept shift added to group 2). Default: 1.

dif_mag

Numeric. Magnitude of the intercept DIF effect (in IRT metric). Default: 0.5.

dasb_items

Integer vector. Indices of items where AI scoring introduces differential bias (intercept shift added to group 2 in the AI condition only). Default: 3.

dasb_mag

Numeric. Magnitude of the DASB effect. Default: 0.4.

ai_drift

Numeric. Uniform intercept shift applied to ALL items in BOTH groups under AI scoring (simulates AI calibration offset). Default: 0.1.

seed

Integer seed for reproducibility, or NULL. Default: 42.

Details

Rather than simulating item responses and refitting IRT models (which requires additional dependencies), this function directly simulates maximum-likelihood estimates and their asymptotic covariance matrices, consistent with a 2PL model fitted to n_obs observations per group.

Value

A list with elements human and ai, each a validated mle list in the format described under read_ai_scored. The two elements can be passed directly to fit_aidif.

Examples

dat <- simulate_aidif_data(
  n_items   = 8,
  n_obs     = 600,
  dif_items = c(1, 2),
  dasb_items = 5
)
mod <- fit_aidif(dat$human, dat$ai)
summary(mod)

S3 summary method for class "aidif".

Description

Prints a detailed report including DIF test tables for each scoring condition, the DASB table, and the AI-effect classification.

Usage

## S3 method for class 'aidif'
summary(object, ...)

Arguments

object

An object of class "aidif".

...

Further arguments (currently ignored).

Value

NULL, invisibly.

Delta-method covariance matrix of the scaling functions

Description

Delta-method covariance matrix of the scaling functions

Usage

vcov_scaling_fn(mle, theta = NULL, scale_by = "pooled")

Arguments

mle

A validated mle list.

theta

Optional scalar evaluated under H0.

scale_by

Passed to gradient function.

Value

A square matrix of size n_items * n_thresholds.
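
The delta-method step can be sketched for one item under the pooled scaling. The functions scaling_y and analytic_grad are hypothetical, the form y = (d2 - d1) / sqrt((a1^2 + a2^2) / 2) and the parameter order (a1, d1, a2, d2) are assumptions, and the covariance values are made up. With a gradient g and a block-diagonal joint covariance Sigma, the approximate variance is t(g) %*% Sigma %*% g.

```r
# Hypothetical one-item delta-method sketch; parameter order is (a1, d1, a2, d2).
scaling_y <- function(p) (p[4] - p[2]) / sqrt((p[1]^2 + p[3]^2) / 2)

analytic_grad <- function(p) {
  s <- sqrt((p[1]^2 + p[3]^2) / 2)
  y <- (p[4] - p[2]) / s
  c(-y * p[1] / (2 * s^2),   # dy / da1
    -1 / s,                  # dy / dd1
    -y * p[3] / (2 * s^2),   # dy / da2
     1 / s)                  # dy / dd2
}

p_hat <- c(1.0, 0.0, 1.2, 0.5)                     # made-up estimates
V1 <- matrix(c(0.010, 0.002, 0.002, 0.008), 2, 2)  # group 1 block
V2 <- matrix(c(0.012, 0.001, 0.001, 0.009), 2, 2)  # group 2 block
Sigma <- rbind(cbind(V1, matrix(0, 2, 2)),
               cbind(matrix(0, 2, 2), V2))         # block-diagonal joint covariance

g <- analytic_grad(p_hat)
var_y <- drop(t(g) %*% Sigma %*% g)                # delta-method variance

# Central-difference check of the analytic gradient
num_grad <- vapply(1:4, function(j) {
  h <- 1e-6
  e <- replace(numeric(4), j, h)
  (scaling_y(p_hat + e) - scaling_y(p_hat - e)) / (2 * h)
}, numeric(1))

print(var_y)
print(max(abs(g - num_grad)))
```

The numerical check is there only to confirm that the analytic gradient in the sketch is consistent with the assumed scaling function.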

Wald item-level DIF test

Description

Tests H0: y_i = theta for each item, using the projected variance that accounts for estimation of theta itself.

Usage

wald_dif_test(y, theta, Vcov)

Arguments

y

Scaling function values.

theta

Estimated robust scale parameter.

Vcov

Covariance matrix of y (at theta under H0).

Value

A data.frame with columns delta, se, z, p_val.

Wald test of differential test functioning (DTF)

Description

Tests H0: mean(y) - theta = 0, i.e. whether the robust scale estimate differs significantly from the naive mean. A significant result indicates meaningful DTF.

Usage

wald_dtf_test(y, theta, weights, Vcov_H0, Vcov_raw, k)

Arguments

y

Scaling function values.

theta

Robust scale estimate.

weights

Bi-square weights from IRLS.

Vcov_H0

Covariance of y under H0 (theta plugged in).

Vcov_raw

Covariance of y (no theta substitution).

k

Bi-square tuning parameter.

Value

A one-row data.frame.
