API reference

Auto-generated from the package docstrings.

Top level

fairscope.FairnessAudit(model, domain, **kwargs)

Route to a domain-specific fairness audit.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | | The fitted classifier (or None with precomputed scores via the domain API). | required |
| domain | str | The audit domain. Implemented: "healthcare" (uses model) and "nlp" (the CPFE protocol; operates on precomputed platform_data, model ignored). | required |
| **kwargs | | Passed through to the domain audit class. | {} |

Examples:

>>> import fairscope
>>> callable(fairscope.FairnessAudit)
True

fairscope.core

Core statistical primitives for fairscope.

Public API for subgroup-stratified, calibration-aware fairness auditing: DeLong AUC confidence intervals and tests, a stratified bootstrap AUC test, calibration error and recalibration, multiple-comparison corrections, and subgroup fairness metrics.

bootstrap_auc_test(y_true, score_a, score_b, n_boot=2000, random_state=None)

Stratified bootstrap test of (AUC_a - AUC_b) on the same samples.

Resamples positives and negatives separately (preserving class balance). Returns a dict: auc_a, auc_b, delta, se, z, p_value, ci_lower, ci_upper, n_boot.

Examples:

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> y = rng.integers(0, 2, 400)
>>> a = rng.random(400) + 0.8 * y
>>> b = rng.random(400) + 0.05 * y
>>> bootstrap_auc_test(y, a, b, n_boot=200, random_state=0)["delta"] > 0
True
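
The resampling scheme described above can be sketched as follows. This is a minimal illustration, not fairscope's implementation: the names `auc_mann_whitney` and `stratified_bootstrap_delta_sketch` are hypothetical, the percentile confidence interval is an assumption (the source does not say how ci_lower and ci_upper are computed), and the z and p_value fields are omitted.

```python
import numpy as np


def auc_mann_whitney(y, s):
    """AUC as the Mann-Whitney statistic; tied pairs count one half."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def stratified_bootstrap_delta_sketch(y_true, score_a, score_b,
                                      n_boot=200, random_state=None):
    """Hypothetical sketch: bootstrap AUC_a - AUC_b on the same samples."""
    y = np.asarray(y_true)
    a = np.asarray(score_a, dtype=float)
    b = np.asarray(score_b, dtype=float)
    rng = np.random.default_rng(random_state)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)

    deltas = np.empty(n_boot)
    for i in range(n_boot):
        # Resample positives and negatives separately so every replicate
        # keeps the original class balance.
        idx = np.concatenate([
            rng.choice(pos_idx, size=pos_idx.size, replace=True),
            rng.choice(neg_idx, size=neg_idx.size, replace=True),
        ])
        # Both scores are evaluated on the same resampled rows (paired).
        deltas[i] = auc_mann_whitney(y[idx], a[idx]) - auc_mann_whitney(y[idx], b[idx])

    return {
        "delta": auc_mann_whitney(y, a) - auc_mann_whitney(y, b),
        "se": float(deltas.std(ddof=1)),
        # Assumption: a 95% percentile interval.
        "ci_lower": float(np.percentile(deltas, 2.5)),
        "ci_upper": float(np.percentile(deltas, 97.5)),
        "n_boot": n_boot,
    }
```

Using the same resampled indices for both scores is what makes the comparison paired: the two AUCs share their sampling noise, so the spread of the differences reflects only the disagreement between the scores.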

ece_by_group(y_true, y_prob, groups, n_bins=10, strategy='uniform')

Per-subgroup Expected Calibration Error. Returns {group: ece}.

expected_calibration_error(y_true, y_prob, n_bins=10, strategy='uniform')

Weighted mean |observed frequency - mean predicted probability| across bins.

Returns 0.0 if there is no data. strategy is 'uniform' (equal-width) or 'quantile' (equal-frequency) bins.

Examples:

>>> import numpy as np
>>> y = np.array([0, 0, 1, 1])
>>> p = np.array([0.0, 0.0, 1.0, 1.0])
>>> expected_calibration_error(y, p)
0.0
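
To make the definition concrete, here is a minimal sketch of the binned computation. The name `ece_sketch` is hypothetical and the binning details (interior edges, values equal to 1.0 placed in the last bin) are assumptions; the library may handle edge cases differently.

```python
import numpy as np


def ece_sketch(y_true, y_prob, n_bins=10, strategy="uniform"):
    """Hypothetical sketch of a binned Expected Calibration Error."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_prob, dtype=float)
    if p.size == 0:
        return 0.0  # no data, matching the documented behaviour

    if strategy == "uniform":
        edges = np.linspace(0.0, 1.0, n_bins + 1)  # equal-width bins
    elif strategy == "quantile":
        edges = np.quantile(p, np.linspace(0.0, 1.0, n_bins + 1))  # equal-frequency
    else:
        raise ValueError("strategy must be 'uniform' or 'quantile'")

    # Digitizing against interior edges yields bin ids 0..n_bins-1,
    # so p == 1.0 lands in the last bin instead of overflowing.
    bin_ids = np.digitize(p, edges[1:-1])

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Bin weight times |observed frequency - mean predicted probability|.
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(ece)
```

Taking the maximum of the per-bin gaps instead of their weighted sum gives the quantity that maximum_calibration_error describes.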

isotonic_recalibrate(probs, y_true)

Fit isotonic regression mapping predicted prob -> calibrated prob (Zadrozny & Elkan, 2002). Returns (fitted_model, calibrated_probabilities).

maximum_calibration_error(y_true, y_prob, n_bins=10, strategy='uniform')

Maximum |observed frequency - mean predicted probability| over bins.

reliability_diagram(y_true, y_prob, groups=None, n_bins=10)

Reliability diagram (mean predicted probability vs observed frequency).

Returns a matplotlib Figure. If groups is given, draws one curve per subgroup.

temperature_scale(logits, y_true, max_iter=200)

Fit a single temperature T>0 by minimizing NLL (Guo et al., 2017).

logits may be 1-D (binary positive-class logit) or 2-D (n, n_classes). Returns (T, calibrated_probabilities); calibrated probabilities are 1-D for 1-D input.
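
The idea can be sketched for the 1-D (binary) case as below. This is an illustration under stated assumptions, not the library's optimizer: it swaps in a plain grid search over T on a log scale in place of an iterative fit, and the names `binary_nll` and `temperature_scale_sketch` are hypothetical.

```python
import numpy as np


def binary_nll(logits, y, T):
    """Mean negative log-likelihood of sigmoid(logits / T)."""
    z = logits / T
    # -[y*log(sig(z)) + (1-y)*log(1-sig(z))] == log(1 + exp(z)) - y*z
    return float(np.mean(np.logaddexp(0.0, z) - y * z))


def temperature_scale_sketch(logits, y_true, grid=None):
    """Hypothetical sketch: choose T > 0 minimizing NLL by grid search."""
    logits = np.asarray(logits, dtype=float)
    y = np.asarray(y_true, dtype=float)
    if grid is None:
        grid = np.logspace(-2, 2, 401)  # assumption: T between 0.01 and 100
    losses = [binary_nll(logits, y, T) for T in grid]
    T = float(grid[int(np.argmin(losses))])
    calibrated = 1.0 / (1.0 + np.exp(-logits / T))
    return T, calibrated
```

Dividing by a single T leaves the ranking of the scores unchanged, so AUC is unaffected; only the confidence of the probabilities is softened (T > 1) or sharpened (T < 1).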

benjamini_hochberg(p_values, alpha=0.05)

Benjamini-Hochberg FDR correction (step-up). Returns adjusted, reject.
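
The step-up procedure can be sketched in pure Python as follows. The name `benjamini_hochberg_sketch` is hypothetical, and returning a dict with the keys adjusted and reject is an assumption modelled on the bonferroni example; the library's exact return type is not shown in this reference.

```python
def benjamini_hochberg_sketch(p_values, alpha=0.05):
    """Hypothetical sketch of Benjamini-Hochberg adjusted p-values."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min

    reject = [q <= alpha for q in adjusted]
    return {"adjusted": adjusted, "reject": reject}
```

For example, the p-values [0.01, 0.04, 0.03, 0.5] adjust to roughly [0.04, 0.0533, 0.0533, 0.5], so only the first hypothesis is rejected at alpha = 0.05.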

bonferroni(p_values, alpha=0.05)

Bonferroni correction. Returns a dict with keys adjusted, reject, threshold.

Examples:

>>> bonferroni([0.01, 0.5], alpha=0.05)["adjusted"].tolist()
[0.02, 1.0]

delong_auc_ci(y_true, y_score, alpha=0.05)

AUC with a DeLong (1 - alpha) normal-approximation confidence interval.

Returns a dict: auc, ci_lower, ci_upper, se, n_pos, n_neg.

Examples:

>>> import numpy as np
>>> y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
>>> s = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2])
>>> round(delong_auc_ci(y, s)["auc"], 3)
0.875
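
For readers who want to see where the standard error comes from, the following sketch computes the AUC and its DeLong variance from the structural components. The name `delong_auc_ci_sketch` is hypothetical, clipping the interval to [0, 1] is an assumption, and the quadratic-time pairwise comparison is chosen for clarity, not speed.

```python
from math import sqrt
from statistics import NormalDist

import numpy as np


def delong_auc_ci_sketch(y_true, y_score, alpha=0.05):
    """Hypothetical sketch of a DeLong normal-approximation CI for AUC."""
    y = np.asarray(y_true)
    s = np.asarray(y_score, dtype=float)
    pos, neg = s[y == 1], s[y == 0]  # needs at least 2 of each class

    # psi[i, j] = 1 if positive i outranks negative j, 0.5 on ties, else 0.
    diff = pos[:, None] - neg[None, :]
    psi = (diff > 0) + 0.5 * (diff == 0)
    auc = float(psi.mean())

    # Structural components: one value per positive and per negative.
    v10 = psi.mean(axis=1)
    v01 = psi.mean(axis=0)
    var = v10.var(ddof=1) / pos.size + v01.var(ddof=1) / neg.size
    se = sqrt(var)

    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return {
        "auc": auc,
        # Assumption: the interval is clipped to the valid AUC range.
        "ci_lower": max(0.0, auc - z * se),
        "ci_upper": min(1.0, auc + z * se),
        "se": se,
        "n_pos": int(pos.size),
        "n_neg": int(neg.size),
    }
```

On the eight-sample example above this gives an AUC of 0.875 with a standard error of about 0.144, which shows how wide the interval is for very small subgroups.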

delong_by_group(y_true, y_score, groups, alpha=0.05)

Per-subgroup DeLong AUC CIs. Returns {group: ci_dict}.

delong_paired_test(y_true, score_a, score_b)

Covariance-aware paired DeLong test for two scores on the SAME samples.

Returns: auc_a, auc_b, delta, z, p_value.

delong_unpaired_test(y_true_a, score_a, y_true_b, score_b)

Unpaired DeLong test for two INDEPENDENT samples (the cross-platform case).

disparate_impact(y_pred, groups, group_a, group_b)

Symmetric disparate impact between two groups: min(rate_a/rate_b, rate_b/rate_a).

rate = P(y_pred == 1 | group) over hard 0/1 predictions. The result lies in [0, 1]: values below 0.80 violate the four-fifths rule and values below 0.50 indicate a severe disparity (interpretation per the CPFE paper). If both rates are 0 the groups are treated as equal (returns 1.0); if exactly one rate is 0 the disparity is maximal (returns 0.0).

Examples:

>>> import numpy as np
>>> y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])
>>> g = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
>>> round(disparate_impact(y_pred, g, "A", "B"), 3)
0.5
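
The rule, including both zero-rate edge cases, can be sketched in a few lines of pure Python. The name `disparate_impact_sketch` is hypothetical, and raising on an empty group is an assumption; the source does not state what happens in that case.

```python
def disparate_impact_sketch(y_pred, groups, group_a, group_b):
    """Hypothetical sketch of symmetric disparate impact."""

    def positive_rate(group):
        preds = [p for p, g in zip(y_pred, groups) if g == group]
        if not preds:
            # Assumption: an empty group is treated as an error.
            raise ValueError(f"group {group!r} has no samples")
        return sum(1 for p in preds if p == 1) / len(preds)

    rate_a = positive_rate(group_a)
    rate_b = positive_rate(group_b)

    if rate_a == 0 and rate_b == 0:
        return 1.0  # neither group receives positives: treated as equal
    if rate_a == 0 or rate_b == 0:
        return 0.0  # exactly one group receives no positives: maximal disparity
    # Taking the min of both ratios makes the result independent of group order.
    return min(rate_a / rate_b, rate_b / rate_a)
```

Because the smaller ratio is always returned, swapping group_a and group_b gives the same value, and a result of 0.5 (as in the example above) sits exactly on the severe-disparity boundary.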

equalized_odds_difference(y_true, y_pred, groups, group_a, group_b)

|TPR_a - TPR_b|, the equalized odds difference as defined in the CPFE paper.

TPR = P(y_pred == 1 | y_true == 1, group). Raises if either group has no positive labels (TPR undefined).
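
A minimal sketch of this definition, written in pure Python. The name `equalized_odds_difference_sketch` is hypothetical and ValueError is an assumed exception type; the source only says that the function raises.

```python
def equalized_odds_difference_sketch(y_true, y_pred, groups, group_a, group_b):
    """Hypothetical sketch: absolute gap in true positive rate."""

    def true_positive_rate(group):
        # Predictions for this group's samples whose true label is positive.
        preds = [p for t, p, g in zip(y_true, y_pred, groups)
                 if g == group and t == 1]
        if not preds:
            # Assumption: ValueError; TPR is undefined without positives.
            raise ValueError(f"group {group!r} has no positive labels")
        return sum(1 for p in preds if p == 1) / len(preds)

    return abs(true_positive_rate(group_a) - true_positive_rate(group_b))
```

Note that this compares true positive rates only; false positive rates do not enter the definition used here.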

subgroup_metrics(y_true, y_score, groups, threshold=0.5)

Per-subgroup discrimination metrics: {group: {auc, brier, f1, n}}.

y_score are predicted probabilities in [0, 1] (Brier assumes a probability); threshold binarizes them for F1. Raises if any subgroup is single-class (AUC undefined) or if y_score contains NaN.

fairscope.healthcare

fairscope.healthcare.audit

One-call clinical fairness audit. Composes fairscope.core; invents no statistics.

Pipeline per protected attribute: per-subgroup DeLong AUC CIs -> per-subgroup ECE -> per-subgroup Brier/F1 -> Bonferroni-corrected pairwise (unpaired) DeLong tests across the attribute's subgroups. Mirrors the analysis in the diabetes paper (IEEE CIPHER 2026).

HealthcareFairnessAudit

Audit a fitted classifier (or precomputed scores) for subgroup fairness.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | | Object with predict_proba (positive-class probability in column 1), or None when using from_scores. | required |
| X_test | | Test features. | required |
| y_test | | Binary test labels. | required |
| protected_attr | dict | {attribute_name: 1-D array of subgroup labels per sample}. | required |

from_scores(y_true, y_score, protected_attr, *, n_bins=10, alpha=0.05) classmethod

Build an audit from precomputed positive-class probabilities (no model).

run()

Run the audit and return a HealthcareReport.

Raises a clear, attribute-named ValueError if a subgroup is single-class (AUC undefined) rather than letting a low-level error surface.

shap_summary(max_samples=200)

Mean absolute SHAP value per feature (optional). Requires pip install fairscope[shap] and a model (not from_scores). Returns a dict {feature_index: mean_abs_shap}.

HealthcareReport

Holds the audit results and the raw (y, score, groups) needed to render reliability curves. Provides tables, plus plots and PDF export through the plotting methods.

summary()

Return a human-readable summary string (no print side effect).

plot_auc_forest(attribute=None)

Forest plot of per-subgroup AUC with DeLong 95% CIs. Returns a Figure.

plot_calibration(attribute=None)

Reliability curves per subgroup for one attribute, drawn with core.reliability_diagram. Returns a Figure.

to_pdf(path)

Write a multi-page PDF: summary, AUC forest, and one calibration page per attribute. Uses matplotlib only (no extra dependency).

fairscope.nlp

Cross-Platform Fairness Evaluation (CPFE) for NLP, built on fairscope.core.

Public API: the five-axis CPFEProtocol/CPFEReport, the multiclass metric and significance primitives (axes 1-4), and the attribution-stability functions (axis 5).

CPFEProtocol

Run the CPFE five-axis evaluation over precomputed per-platform outputs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| platform_data | dict | {name: {"y_true": array, "probs": (n, n_classes) array}}. | required |
| reference | | The within-platform name (e.g. the training platform). | required |
| n_classes | | Number of classes. | required |
| delta_auc_pct_max | | ILLUSTRATIVE macro-AUC-drop screening limit (percent) used by CPFEReport.deployment_readiness. NOT a published cutoff: P4 Section 6.6 declines to set one (observed drops were 28.6-39.5%); the default echoes that ">30%" magnitude and is labelled illustrative everywhere it surfaces. | 30.0 |

CPFEReport

Holds the five-axis results and renders tables and a deployment-readiness diagnostic.

deployment_readiness()

Structured per-axis, per-platform screening DIAGNOSTIC -- NOT a deployment decision. Following the CPFE paper (Sections 6.5-6.6), cross-platform degradation is an informative diagnostic, not definitive evidence of bias.

Thresholds: calibration uses P4's stated ECE bands (Suppl. Fig. S2); equity uses P4's four-fifths rule; discrimination uses an ILLUSTRATIVE delta_auc_pct_max (P4 Section 6.6 declines to set a published cutoff). Returns {platform: {"ready": bool, "axes": {axis: {pass, value, threshold, source, reason}}}}.

jaccard_topk(saliency_a, saliency_b, k)

Jaccard overlap of the top-k tokens by saliency: |topK(A) ∩ topK(B)| / |topK(A) ∪ topK(B)|.

saliency_* map token -> saliency score. Returns 0.0 if both are empty.

Examples:

>>> jaccard_topk({"a": 0.9, "b": 0.8}, {"a": 0.7, "c": 0.6}, k=2)
0.3333333333333333
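
The computation behind this example can be sketched as follows. The name `jaccard_topk_sketch` is hypothetical, and the tie-breaking (equal saliencies keep dictionary insertion order) is an assumption; the source does not say how ties are resolved.

```python
def jaccard_topk_sketch(saliency_a, saliency_b, k):
    """Hypothetical sketch: Jaccard overlap of the top-k salient tokens."""

    def top_k(saliency):
        # Sort tokens by descending saliency and keep the first k.
        ranked = sorted(saliency, key=saliency.get, reverse=True)
        return set(ranked[:k])

    top_a, top_b = top_k(saliency_a), top_k(saliency_b)
    union = top_a | top_b
    if not union:
        return 0.0  # both inputs empty, matching the documented behaviour
    return len(top_a & top_b) / len(union)
```

In the example above the top-2 sets are {a, b} and {a, c}: one shared token out of three distinct tokens gives 1/3.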

token_saliency(model, tokenizer, text, target=None)

Per-token gradient saliency s_i = ‖∂P(y|x)/∂E_i‖₂ via Captum (optional). Requires pip install fairscope[nlp]. Returns {token: saliency}.

macro_auc(y_true, probs)

Macro one-vs-rest AUC. Requires every class present in y_true.

macro_f1(y_true, probs)

Macro F1 of the argmax predictions.

multiclass_ece(y_true, probs, n_bins=10)

Confidence-accuracy Expected Calibration Error (Guo et al. 2017; CPFE paper): ECE = sum_m (|B_m|/n) * |acc(B_m) - conf(B_m)| with conf = max prob and acc = top-1 correct.
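
The formula can be sketched directly in NumPy. The name `multiclass_ece_sketch` is hypothetical, and equal-width confidence bins are an assumption, since this function's signature exposes no binning strategy.

```python
import numpy as np


def multiclass_ece_sketch(y_true, probs, n_bins=10):
    """Hypothetical sketch of confidence-accuracy ECE for top-1 predictions."""
    y = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    if y.size == 0:
        return 0.0

    conf = probs.max(axis=1)                              # conf = max prob
    correct = (probs.argmax(axis=1) == y).astype(float)   # top-1 correct

    # Assumption: equal-width bins on [0, 1]; interior edges give ids 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(conf, edges[1:-1])

    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_ids == m
        if in_bin.any():
            # (|B_m| / n) * |acc(B_m) - conf(B_m)|
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)
```

Unlike the binary ECE in fairscope.core, which compares observed frequency with predicted probability, this variant only looks at the winning class, so it measures over- or under-confidence of the top-1 prediction.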

per_class_disparate_impact(probs_a, probs_b, n_classes)

Symmetric DI per class between two platforms, reusing core.disparate_impact with the class binarized (pred == c) and platform as the two-group label.

per_class_equalized_odds(y_a, probs_a, y_b, probs_b, n_classes)

EOD per class between two platforms (|TPR_c(A) - TPR_c(B)|), reusing core.equalized_odds_difference. A class with no positive labels in a platform is returned as None (TPR undefined).

bootstrap_macro_auc_test(y_a, probs_a, y_b, probs_b, n_boot=2000, random_state=None)

Compare macro AUC across two platforms (independent test sets).

Each platform's macro-AUC standard error is estimated by a class-stratified bootstrap; the errors are combined for an unpaired z-test. Returns a dict: auc_a, auc_b, delta, se, z, p_value, n_boot.
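
The final combination step can be sketched as below. Two assumptions are made that the source does not confirm: the two standard errors are combined in quadrature (the usual choice for independent samples) and the p-value is two-sided. The name `unpaired_z_test_sketch` is hypothetical, and the bootstrap that produces each standard error is not shown.

```python
from math import sqrt
from statistics import NormalDist


def unpaired_z_test_sketch(auc_a, se_a, auc_b, se_b):
    """Hypothetical sketch: z-test for two independent AUC estimates."""
    delta = auc_a - auc_b
    # Assumption: independent platforms, so the variances add.
    se = sqrt(se_a ** 2 + se_b ** 2)
    z = delta / se  # undefined when both standard errors are zero
    # Assumption: two-sided p-value under the standard normal.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return {"auc_a": auc_a, "auc_b": auc_b, "delta": delta,
            "se": se, "z": z, "p_value": p_value}
```

For instance, macro AUCs of 0.90 (SE 0.03) and 0.80 (SE 0.04) give a combined standard error of 0.05, z of about 2.0 and a two-sided p-value of about 0.046.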