Type: Package
Title: Differential Item Functioning for AI-Scored Assessments
Version: 0.1.0
Description: Detects and quantifies differential item functioning (DIF) in AI-scored educational and psychological assessments. Provides a fully self-contained robust DIF engine (M-estimation via iteratively re-weighted least squares with the bi-square loss) alongside the novel Differential AI Scoring Bias (DASB) test, which detects item-level scoring shifts that differ across subgroups when comparing human and AI scoring conditions. Includes simulation utilities, anchor weight diagnostics, and an AI-effect classification framework.
License: GPL (≥ 3)
Encoding: UTF-8
Depends: R (≥ 3.5.0)
Imports: Matrix, stats, graphics
Suggests: mirt, testthat (≥ 3.0.0), knitr, rmarkdown
Config/testthat/edition: 3
RoxygenNote: 7.3.3
VignetteBuilder: knitr
URL: https://github.com/causalfragility-lab/aiDIF
BugReports: https://github.com/causalfragility-lab/aiDIF/issues
NeedsCompilation: no
Packaged: 2026-04-20 20:43:11 UTC; Subir
Author: Subir Hait [aut, cre]
Maintainer: Subir Hait <haitsubi@msu.edu>
Repository: CRAN
Date/Publication: 2026-04-21 20:52:36 UTC

Summarise the effect of AI scoring on DIF flagging.

Description

Compares the DIF flagging patterns from human and AI scoring conditions and classifies each item as: "stable_clean" (not flagged in either), "stable_dif" (flagged in both), "introduced" (flagged only under AI), "masked" (flagged only under human), or "new_direction" (flagged in both but bias reverses sign).

Usage

ai_effect_summary(dif_human, dif_ai, alpha = 0.05)

Arguments

dif_human

A data.frame returned by fit_aidif for the human scoring condition.

dif_ai

A data.frame returned by fit_aidif for the AI scoring condition.

alpha

Significance threshold for flagging. Default: 0.05.

Value

A data.frame with one row per item/threshold and columns:

human_delta

Estimated DIF effect under human scoring.

ai_delta

Estimated DIF effect under AI scoring.

human_flag

Logical: flagged under human scoring?

ai_flag

Logical: flagged under AI scoring?

status

Classification (see Description).
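The five status labels can be illustrated with a small base-R sketch. This is not the package's implementation: `classify_ai_effect` is a hypothetical helper, the input values are made up, and flagging with `p < alpha` (rather than `<=`) is an assumption.

```r
# Hypothetical helper (not part of aiDIF): classify items from DIF estimates
# and p-values obtained under human and AI scoring.
classify_ai_effect <- function(human_delta, human_p, ai_delta, ai_p,
                               alpha = 0.05) {
  human_flag <- human_p < alpha   # assumed flagging rule
  ai_flag    <- ai_p < alpha
  status <- ifelse(!human_flag & !ai_flag, "stable_clean",
            ifelse(!human_flag &  ai_flag, "introduced",
            ifelse( human_flag & !ai_flag, "masked",
            # flagged in both: check whether the bias changes sign
            ifelse(sign(human_delta) != sign(ai_delta),
                   "new_direction", "stable_dif"))))
  data.frame(human_delta, ai_delta, human_flag, ai_flag, status,
             stringsAsFactors = FALSE)
}

# Made-up estimates for five items, one per status
res <- classify_ai_effect(
  human_delta = c(0.02, 0.60, 0.05, 0.55,  0.50),
  human_p     = c(0.80, 0.001, 0.70, 0.002, 0.004),
  ai_delta    = c(0.01, 0.58, 0.45, 0.04, -0.52),
  ai_p        = c(0.90, 0.001, 0.01, 0.75,  0.003)
)
print(res)
```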

See Also

scoring_bias_test, fit_aidif

Examples

eg <- make_aidif_eg()
mod <- fit_aidif(eg$human, eg$ai)
ai_effect_summary(mod$dif_human, mod$dif_ai)


Anchor item weights from the robust AI-DIF procedure.

Description

Returns the bi-square weights assigned to each item under each scoring condition. Items with weight near zero are effectively excluded from the robust scaling estimate, indicating likely DIF contamination.

Usage

anchor_weights(object)

Arguments

object

An aidif object from fit_aidif.

Value

A data.frame with columns human_weight and (if AI data were provided) ai_weight. Higher weight means the item is contributing more to the robust scale estimate.

Examples

eg <- make_aidif_eg()
mod <- fit_aidif(eg$human, eg$ai)
anchor_weights(mod)


Bi-square psi (influence) function

Description

Bi-square psi (influence) function

Usage

bisq_psi(u, k = 1.96)

Derivative of bi-square psi

Description

Derivative of bi-square psi

Usage

bisq_psi_prime(u, k = 1.96)

Bi-square rho (objective) function

Description

Bi-square rho (objective) function

Usage

bisq_rho(u, k = 1.96)

Bi-square weight function

Description

Bi-square weight function

Usage

bisq_weight(u, k = 1.96)

Arguments

u

Numeric vector of standardised residuals.

k

Tuning parameter (default 1.96).

Value

Numeric vector of weights in [0, 1].
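For orientation, the textbook Tukey bi-square family can be written in a few lines of base R. The `*_sketch` names are hypothetical, and the normalisation used inside aiDIF may differ from the textbook form shown here.

```r
# Textbook Tukey bi-square functions (sketch; the package's internal
# normalisation is not confirmed by this manual).
bisq_weight_sketch <- function(u, k = 1.96) {
  # weight is (1 - (u/k)^2)^2 inside [-k, k] and 0 outside
  ifelse(abs(u) <= k, (1 - (u / k)^2)^2, 0)
}
bisq_psi_sketch <- function(u, k = 1.96) {
  # influence function: psi(u) = u * w(u)
  u * bisq_weight_sketch(u, k)
}
bisq_rho_sketch <- function(u, k = 1.96) {
  # objective: bounded at k^2 / 6 for |u| > k
  ifelse(abs(u) <= k, (k^2 / 6) * (1 - (1 - (u / k)^2)^3), k^2 / 6)
}

u <- c(0, 1, 1.96, 3)
w <- bisq_weight_sketch(u)
print(data.frame(u = u, weight = w,
                 psi = bisq_psi_sketch(u), rho = bisq_rho_sketch(u)))
```

In this form psi is the derivative of rho, and residuals larger than k in absolute value receive weight zero.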


Block-diagonal joint covariance matrix for both groups

Description

Block-diagonal joint covariance matrix for both groups

Usage

build_joint_vcov(mle)

Arguments

mle

A validated mle list.

Value

A Matrix::bdiag sparse block-diagonal matrix.
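As a reminder of what a block-diagonal joint covariance looks like, the following sketch stacks two made-up group covariance matrices with `Matrix::bdiag`. The matrices `V1` and `V2` are illustrative and not taken from the package.

```r
library(Matrix)

# Made-up 2 x 2 covariance matrices for two groups
V1 <- diag(c(0.010, 0.020))
V2 <- diag(c(0.015, 0.025))

# Off-diagonal blocks are zero, i.e. the groups are treated as independent
V_joint <- bdiag(V1, V2)
print(V_joint)
print(dim(V_joint))
```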


Compute item-level IRT scaling functions

Description

For each item, computes a standardised difference between group parameter estimates. The result is a vector y whose robust location is estimated by estimate_robust_scale.

Usage

compute_scaling_fn(mle, type = "intercept", scale_by = "pooled")

Arguments

mle

A validated mle list (output of read_ai_scored or constructed manually). Must contain est$group.1, est$group.2 and matching var.cov matrices.

type

One of "intercept" (default) or "slope". Determines which parameters are compared.

scale_by

One of "pooled" (default), "ref", or "focal". Controls the denominator used to standardise intercept differences: "pooled" uses \sqrt{(a_1^2+a_2^2)/2}; "ref" uses a_1; "focal" uses a_2. Ignored when type = "slope".

Value

A named numeric vector of scaling-function values, one entry per item threshold (or per item for slopes).
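The three `scale_by` denominators can be sketched as follows. The parameter values are made up, `scale_denominator` is a hypothetical helper, and the focal-minus-reference sign convention for the intercept difference is an assumption that this manual does not confirm.

```r
# Made-up slope (a) and intercept (d) estimates for three items
a1 <- c(1.0, 1.2, 0.8); d1 <- c(0.0, -0.5, 0.3)   # reference group
a2 <- c(1.1, 1.0, 0.9); d2 <- c(0.4, -0.1, 1.2)   # focal group

# Hypothetical helper returning the standardising denominator
scale_denominator <- function(a1, a2,
                              scale_by = c("pooled", "ref", "focal")) {
  scale_by <- match.arg(scale_by)
  switch(scale_by,
         pooled = sqrt((a1^2 + a2^2) / 2),
         ref    = a1,
         focal  = a2)
}

# Assumed sign convention: focal minus reference
y_pooled <- (d2 - d1) / scale_denominator(a1, a2, "pooled")
y_ref    <- (d2 - d1) / scale_denominator(a1, a2, "ref")
print(round(cbind(y_pooled, y_ref), 3))
```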


Robust DIF scale estimation via IRLS

Description

Estimates a robust location parameter for the vector of IRT scaling functions using iteratively re-weighted least squares (IRLS) with the bi-square loss. This is the core estimation engine of aiDIF.

Usage

estimate_robust_scale(
  mle,
  alpha = 0.05,
  scale_by = "pooled",
  tol = 1e-07,
  maxit = 100L
)

Arguments

mle

A validated mle list.

alpha

Significance level controlling the bi-square tuning parameter k = z_{1-\alpha/2}. Default 0.05.

scale_by

Scaling denominator; passed to compute_scaling_fn. Default "pooled".

tol

Convergence tolerance. Default 1e-7.

maxit

Maximum IRLS iterations. Default 100.

Value

A list of class rdif_fit with elements:

est

Estimated robust scale parameter.

weights

Bi-square item weights.

rho_value

Value of objective at solution.

n_iter

Number of iterations used.

k

Tuning parameter used.

y

Raw scaling function values.

vcov_est

Covariance matrix of y at solution.

dif_test

Wald item-level DIF test (data.frame).

dtf_test

Wald test of differential test functioning.
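The idea behind IRLS with the bi-square loss can be shown with a minimal base-R sketch for a location parameter. This is an illustration under simplifying assumptions, not the package's engine: `irls_location` is a hypothetical function, the standard errors are treated as known and fixed, and the median is used as the starting value.

```r
# Minimal IRLS sketch for a robust location estimate (hypothetical helper).
irls_location <- function(y, se, alpha = 0.05, tol = 1e-7, maxit = 100L) {
  k <- qnorm(1 - alpha / 2)          # tuning parameter k = z_{1 - alpha/2}
  theta <- median(y)                 # assumed starting value
  for (iter in seq_len(maxit)) {
    u <- (y - theta) / se            # standardised residuals
    w <- ifelse(abs(u) <= k, (1 - (u / k)^2)^2, 0)
    if (sum(w) == 0) stop("all weights are zero; try another start")
    # weighted least-squares update with precision times bi-square weight
    theta_new <- sum(w * y / se^2) / sum(w / se^2)
    converged <- abs(theta_new - theta) < tol
    theta <- theta_new
    if (converged) break
  }
  u <- (y - theta) / se
  w <- ifelse(abs(u) <= k, (1 - (u / k)^2)^2, 0)
  list(est = theta, weights = w, n_iter = iter, k = k)
}

# Five well-behaved items and one outlying item (made-up values)
y  <- c(0.48, 0.52, 0.50, 0.47, 0.53, 1.40)
se <- rep(0.10, 6)
fit <- irls_location(y, se)
print(fit$est)
print(round(fit$weights, 3))
```

The outlying sixth value receives weight zero, so it does not pull the estimate away from the bulk of the items.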

Examples

dat <- simulate_aidif_data(n_items = 5, seed = 1)
fit <- estimate_robust_scale(dat$human)
print(fit$est)


Fit the AI-DIF model

Description

The primary estimation function of aiDIF. Runs the robust DIF procedure under both human and AI scoring using the built-in IRLS engine (estimate_robust_scale), then tests for Differential AI Scoring Bias (DASB).

Usage

fit_aidif(
  human_mle,
  ai_mle = NULL,
  alpha = 0.05,
  scale_by = "pooled",
  tol = 1e-07,
  maxit = 100L
)

Arguments

human_mle

A validated mle list for human-scored data.

ai_mle

A validated mle list for AI-scored data, or NULL.

alpha

Significance level. Default 0.05.

scale_by

Denominator for standardising intercept differences: "pooled" (default), "ref", or "focal".

tol

IRLS convergence tolerance. Default 1e-7.

maxit

Maximum IRLS iterations. Default 100.

Value

An object of class "aidif".

See Also

estimate_robust_scale, scoring_bias_test, simulate_aidif_data

Examples

dat <- simulate_aidif_data(n_items = 6, seed = 1)
mod <- fit_aidif(dat$human, dat$ai)
print(mod)
summary(mod)


Gradient of the intercept scaling function

Description

Computes the Jacobian of compute_scaling_fn with respect to all item parameters in both groups, organised to be conformable with the block-diagonal covariance matrix built by build_joint_vcov.

Usage

grad_intercept_fn(mle, theta = NULL, scale_by = "pooled")

Arguments

mle

A validated mle list.

theta

Optional scalar: if supplied, item-specific scaling-function values are replaced by theta in the gradient (used when evaluating under H0).

scale_by

Scaling denominator; takes the same values as scale_by in compute_scaling_fn.

Value

A matrix with n_items * n_thresholds columns, each being the gradient vector of one scaling-function entry with respect to the full parameter vector.


Grid search for bi-square objective minimum (starting value)

Description

Grid search for bi-square objective minimum (starting value)

Usage

grid_rho_search(y, var_fn, k, width = 0.01)

Least trimmed squares estimate of location (starting value)

Description

Least trimmed squares estimate of location (starting value)

Usage

lts_location(y, trim = 0.5)

Arguments

y

Numeric vector.

trim

Proportion to trim (default 0.5).
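A least trimmed squares location estimate can be sketched in base R as below. `lts_location_sketch` is a hypothetical function, and reading `trim` as the proportion of observations discarded is an assumption. For univariate data the best subset is contiguous in the sorted values, so a sliding window is sufficient.

```r
# Hypothetical LTS location sketch: mean of the h-point subset with the
# smallest within-subset sum of squares.
lts_location_sketch <- function(y, trim = 0.5) {
  y <- sort(y)
  n <- length(y)
  h <- n - floor(trim * n)           # number of points kept (assumed rule)
  starts <- seq_len(n - h + 1)
  ss <- vapply(starts, function(s) {
    window <- y[s:(s + h - 1)]
    sum((window - mean(window))^2)
  }, numeric(1))
  best <- which.min(ss)
  mean(y[best:(best + h - 1)])
}

# Made-up values with two outliers
y <- c(0.48, 0.52, 0.50, 0.47, 0.55, 1.40, 1.50)
lts_est <- lts_location_sketch(y)
print(lts_est)
```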


Built-in example dataset for aiDIF

Description

Constructs and returns the built-in example dataset: paired human and AI item parameter estimates for 6 items in two groups, with known DIF and DASB planted at specific items.

Usage

make_aidif_eg()

Details

The data-generating model includes:

Value

A list with elements human and ai, each a validated mle list (see simulate_aidif_data for format details).

See Also

simulate_aidif_data, fit_aidif

Examples

eg  <- make_aidif_eg()
mod <- fit_aidif(eg$human, eg$ai)
summary(mod)


S3 plot method for class "aidif".

Description

Produces one of several diagnostic plots depending on type.

Usage

## S3 method for class 'aidif'
plot(x, type = "dif_forest", ...)

Arguments

x

An object of class "aidif".

type

Character. One of:

"dif_forest"

Forest plot of DIF estimates with 95% confidence intervals for both scoring conditions (default).

"dasb"

Bar chart of DASB estimates with error bars.

"weights"

Dot plot of bi-square anchor weights.

"rho"

Bi-square objective function for human scoring.

...

Additional graphical parameters passed to low-level plot functions.

Value

x, invisibly.


S3 print method for class "aidif".

Description

Prints a compact summary of the estimated robust scaling parameters and, when available, the number of items flagged for DIF and DASB.

Usage

## S3 method for class 'aidif'
print(x, ...)

Arguments

x

An object of class "aidif".

...

Further arguments (currently ignored).

Value

x, invisibly.


Validate and bundle paired human/AI parameter estimates

Description

Takes two mle lists (one per scoring condition) and returns a validated aidif_data object for use in fit_aidif.

Usage

read_ai_scored(human_mle, ai_mle)

Arguments

human_mle

An mle list for human-scored data. Must contain est (a named list group.1, group.2 of data.frames with columns a1, d1) and var.cov (matching list of covariance matrices).

ai_mle

An mle list for AI-scored data in the same format.

Value

A list of class "aidif_data" with elements human and ai.
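An mle list in the format described above could be assembled by hand roughly as follows. The values are made up, `make_group` is a hypothetical helper, and both the dimension of each var.cov matrix (2 * n_items) and the ordering of parameters within it are assumptions that this manual does not specify.

```r
# Hypothetical helper: item estimates plus a placeholder covariance matrix
make_group <- function(a, d, n_obs = 500) {
  est <- data.frame(a1 = a, d1 = d,
                    row.names = paste0("item", seq_along(a)))
  # Placeholder: independent parameters, each with variance 1 / n_obs
  # (assumed size 2 * n_items; parameter ordering is not confirmed)
  vc <- diag(1 / n_obs, 2 * length(a))
  list(est = est, var.cov = vc)
}

g1 <- make_group(a = c(1.0, 1.2, 0.8), d = c(0.0, -0.5, 0.3))
g2 <- make_group(a = c(1.1, 1.0, 0.9), d = c(0.4, -0.1, 0.3))

human_mle <- list(
  est     = list(group.1 = g1$est,     group.2 = g2$est),
  var.cov = list(group.1 = g1$var.cov, group.2 = g2$var.cov)
)
str(human_mle, max.level = 2)
```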

See Also

fit_aidif, make_aidif_eg, simulate_aidif_data


Differential AI Scoring Bias (DASB) test.

Description

For each item, computes the change in item intercept from human to AI scoring within each group, then tests whether this scoring shift differs significantly across groups. A significant result indicates the AI scoring engine introduces a group-dependent parameter distortion — i.e., the AI does not merely re-scale all items uniformly but disfavours (or favours) one group at specific items.

Usage

scoring_bias_test(human_mle, ai_mle, fun = "d_fun3")

Arguments

human_mle

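A validated mle list for human-scored data, for example the human element returned by simulate_aidif_data.

ai_mle

A validated mle list for AI-scored data, for example the ai element returned by simulate_aidif_data. Must have the same item/group structure as human_mle.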

fun

Name of the scaling function used to normalise the shifts; passed on to the internal scaling routine. Default: "d_fun3".

Details

Estimand. Define the scoring shift in group g for item i threshold j as:

\delta_{igj} = d_{igj}^{\text{AI}} - d_{igj}^{\text{Human}}

The DASB is \delta_{i2j} - \delta_{i1j}. Under H_0: \text{DASB}_{ij} = 0 and independence across scoring conditions and groups,

\widehat{\mathrm{Var}}(\text{DASB}_{ij}) = (\sigma_{i1j}^{H})^2 + (\sigma_{i2j}^{H})^2 + (\sigma_{i1j}^{AI})^2 + (\sigma_{i2j}^{AI})^2

where each \sigma^2 is the diagonal element of the corresponding group-specific covariance matrix.
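The formulas above can be traced with a short worked example in base R. All intercepts and standard errors are made-up numbers for three dichotomous items, and no normalisation by a scaling function is applied.

```r
# Made-up intercept estimates (d) per group under each scoring condition
d_h  <- list(g1 = c(0.00, -0.50, 0.30), g2 = c(0.02, -0.48, 0.31))
d_ai <- list(g1 = c(0.10, -0.40, 0.40), g2 = c(0.12, -0.38, 0.85))

# Made-up standard errors (square roots of covariance diagonals)
se_h  <- list(g1 = rep(0.08, 3), g2 = rep(0.08, 3))
se_ai <- list(g1 = rep(0.08, 3), g2 = rep(0.08, 3))

shift_g1 <- d_ai$g1 - d_h$g1          # delta_{i1}
shift_g2 <- d_ai$g2 - d_h$g2          # delta_{i2}
dasb     <- shift_g2 - shift_g1       # DASB_i

# Variance is the sum of the four variances under independence
se    <- sqrt(se_h$g1^2 + se_h$g2^2 + se_ai$g1^2 + se_ai$g2^2)
z     <- dasb / se
p_val <- 2 * pnorm(-abs(z))

print(data.frame(shift_g1, shift_g2, DASB = dasb, se, z, p_val))
```

Items 1 and 2 shift by the same amount in both groups (a uniform AI offset), so their DASB is zero. Only item 3 shows a group-dependent shift.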

Value

A data.frame with one row per item (per threshold for polytomous items) and columns:

shift_g1

Scoring shift \delta_{i1} = d_{i1}^{AI} - d_{i1}^{H}.

shift_g2

Scoring shift \delta_{i2} = d_{i2}^{AI} - d_{i2}^{H}.

DASB

Differential AI Scoring Bias: \delta_{i2} - \delta_{i1}.

se

Standard error of DASB under the delta method.

z

Wald z-statistic.

p_val

Two-tailed p-value.

See Also

fit_aidif, ai_effect_summary

Examples

eg <- make_aidif_eg()
scoring_bias_test(eg$human, eg$ai)


Simulate item parameter estimates for the AI-DIF model.

Description

Generates a synthetic aidif_data-compatible list suitable for benchmarking and method evaluation. The data-generating model contains: classical DIF in the human scoring condition (controlled via dif_items and dif_mag), differential AI scoring bias (controlled via dasb_items and dasb_mag), and a latent group mean difference (impact).

Usage

simulate_aidif_data(
  n_items = 10L,
  n_obs = 500L,
  impact = 0.5,
  dif_items = 1L,
  dif_mag = 0.5,
  dasb_items = 3L,
  dasb_mag = 0.4,
  ai_drift = 0.1,
  seed = 42L
)

Arguments

n_items

Integer. Number of items. Default: 10.

n_obs

Integer. Approximate number of observations per group, used to scale the covariance matrices. Default: 500.

impact

Numeric. Latent mean difference (group 2 minus group 1) in SD units. Default: 0.5.

dif_items

Integer vector. Indices of items with DIF in the human scoring condition (intercept shift added to group 2). Default: 1.

dif_mag

Numeric. Magnitude of the intercept DIF effect (in IRT metric). Default: 0.5.

dasb_items

Integer vector. Indices of items where AI scoring introduces differential bias (intercept shift added to group 2 in the AI condition only). Default: 3.

dasb_mag

Numeric. Magnitude of the DASB effect. Default: 0.4.

ai_drift

Numeric. Uniform intercept shift applied to ALL items in BOTH groups under AI scoring (simulates AI calibration offset). Default: 0.1.

seed

Integer seed for reproducibility, or NULL. Default: 42.

Details

Rather than simulating item responses and refitting IRT models (which requires additional dependencies), this function directly simulates maximum-likelihood estimates and their asymptotic covariance matrices, consistent with a 2PL model fitted to n_obs observations per group.
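The general idea of drawing estimates directly from their sampling distribution can be sketched as follows. This is not the package's generator: the true intercepts are made up, and the diagonal covariance with variance 4 / n_obs is an arbitrary assumption chosen only to show how n_obs scales the noise.

```r
set.seed(1)
n_items <- 4
n_obs   <- 500
true_d  <- c(0.0, -0.5, 0.3, 0.8)     # made-up true intercepts

# Assumed asymptotic covariance: independent estimates, variance 4 / n_obs
V <- diag(4 / n_obs, n_items)

# Draw estimates as truth plus multivariate normal noise with covariance V
d_hat <- true_d + as.vector(t(chol(V)) %*% rnorm(n_items))

print(round(cbind(true_d, d_hat, se = sqrt(diag(V))), 3))
```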

Value

A list with elements human and ai, each formatted identically to the output of read_ai_scored. Can be passed directly to fit_aidif.

Examples

dat <- simulate_aidif_data(
  n_items   = 8,
  n_obs     = 600,
  dif_items = c(1, 2),
  dasb_items = 5
)
mod <- fit_aidif(dat$human, dat$ai)
summary(mod)


S3 summary method for class "aidif".

Description

Prints a detailed report including DIF test tables for each scoring condition, the DASB table, and the AI-effect classification.

Usage

## S3 method for class 'aidif'
summary(object, ...)

Arguments

object

An object of class "aidif".

...

Further arguments (currently ignored).

Value

NULL, invisibly.


Delta-method covariance matrix of the scaling functions

Description

Delta-method covariance matrix of the scaling functions

Usage

vcov_scaling_fn(mle, theta = NULL, scale_by = "pooled")

Arguments

mle

A validated mle list.

theta

Optional scalar evaluated under H0.

scale_by

Passed to gradient function.

Value

A square matrix of size n_items * n_thresholds.
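The delta method behind this function can be illustrated for a single item with a numerical gradient. The sketch assumes a pooled-denominator scaling function with a focal-minus-reference sign convention. `scaling_fn` and `num_grad` are hypothetical helpers, and the covariance values are made up.

```r
# Assumed scaling function for one item; par = c(a1, d1, a2, d2)
scaling_fn <- function(par) {
  unname((par[4] - par[2]) / sqrt((par[1]^2 + par[3]^2) / 2))
}

# Central-difference gradient (hypothetical helper)
num_grad <- function(f, x, h = 1e-6) {
  vapply(seq_along(x), function(i) {
    e <- replace(numeric(length(x)), i, h)
    (f(x + e) - f(x - e)) / (2 * h)
  }, numeric(1))
}

par <- c(a1 = 1.0, d1 = 0.0, a2 = 1.1, d2 = 0.4)
V   <- diag(c(0.010, 0.008, 0.012, 0.009))   # made-up joint covariance

g     <- num_grad(scaling_fn, par)
var_y <- as.numeric(t(g) %*% V %*% g)        # delta method: g' V g
print(c(y = scaling_fn(par), se = sqrt(var_y)))
```

With several items, stacking the gradients as columns of a matrix G gives the full covariance as t(G) %*% V %*% G.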


Wald item-level DIF test

Description

Tests H0: y_i = theta for each item, using the projected variance that accounts for estimation of theta itself.

Usage

wald_dif_test(y, theta, Vcov)

Arguments

y

Scaling function values.

theta

Estimated robust scale parameter.

Vcov

Covariance matrix of y (at theta under H0).

Value

A data.frame with columns delta, se, z, p_val.
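One way to account for the estimation of theta is to write each difference y_i - theta as a linear contrast of y and propagate the covariance through that contrast. The sketch below does this for a weighted-mean estimate with the weights treated as fixed. That is a simplifying assumption, and the projection used inside the package may differ. `wald_sketch` is a hypothetical function and the inputs are made up.

```r
# Hypothetical Wald test sketch with theta = weighted mean of y
wald_sketch <- function(y, w, V) {
  n <- length(y)
  # Row i of C is e_i - w / sum(w), so that C %*% y equals y - theta
  C <- diag(n) - matrix(w / sum(w), n, n, byrow = TRUE)
  delta <- as.vector(C %*% y)
  se    <- sqrt(diag(C %*% V %*% t(C)))   # projected standard errors
  z     <- delta / se
  data.frame(delta, se, z, p_val = 2 * pnorm(-abs(z)))
}

y <- c(0.48, 0.52, 0.50, 0.47, 0.53, 1.40)   # made-up scaling values
w <- c(1, 1, 1, 1, 1, 0)                     # last item fully down-weighted
V <- diag(0.01, 6)                           # made-up covariance of y

res <- wald_sketch(y, w, V)
print(round(res, 4))
```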


Wald test of differential test functioning (DTF)

Description

Tests H0: mean(y) - theta = 0, i.e. whether the robust scale estimate differs significantly from the naive mean. A significant result indicates meaningful DTF.

Usage

wald_dtf_test(y, theta, weights, Vcov_H0, Vcov_raw, k)

Arguments

y

Scaling function values.

theta

Robust scale estimate.

weights

Bi-square weights from IRLS.

Vcov_H0

Covariance of y under H0 (theta plugged in).

Vcov_raw

Covariance of y (no theta substitution).

k

Bi-square tuning parameter.

Value

A one-row data.frame.