API reference¶

Auto-generated from the package docstrings.

Top level¶

`fairscope.FairnessAudit(model, domain, **kwargs)` ¶

Route to a domain-specific fairness audit.

Parameters:

Name	Type	Description	Default
`model`		The fitted classifier (or `None` with precomputed scores via the domain API).	required
`domain`	`str`	The audit domain. Implemented: `"healthcare"` (uses `model`) and `"nlp"` (the CPFE protocol; operates on precomputed `platform_data`, `model` ignored).	required
`**kwargs`		Passed through to the domain audit class.	`{}`

Examples:

>>> import fairscope
>>> callable(fairscope.FairnessAudit)
True

`fairscope.core`¶


Core statistical primitives for fairscope.

Public API for subgroup-stratified, calibration-aware fairness auditing: DeLong AUC confidence intervals and tests, a stratified bootstrap AUC test, calibration error and recalibration, multiple-comparison corrections, and subgroup fairness metrics.

`bootstrap_auc_test(y_true, score_a, score_b, n_boot=2000, random_state=None)` ¶

Stratified bootstrap test of (AUC_a - AUC_b) on the same samples.

Resamples positives and negatives separately (preserving class balance). Returns a dict: auc_a, auc_b, delta, se, z, p_value, ci_lower, ci_upper, n_boot.

Examples:

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> y = rng.integers(0, 2, 400)
>>> a = rng.random(400) + 0.8 * y
>>> b = rng.random(400) + 0.05 * y
>>> bootstrap_auc_test(y, a, b, n_boot=200, random_state=0)["delta"] > 0
True
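The stratified resampling step described above can be sketched as follows. This is a minimal illustration using only the standard library, not the package's implementation; `stratified_resample` is a hypothetical helper name.

```python
import random

def stratified_resample(y_true, rng):
    """Draw one bootstrap index sample, resampling positives and negatives separately."""
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    # Sampling with replacement inside each class keeps the class counts fixed.
    return rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))

y = [1, 0, 0, 1, 0]
idx = stratified_resample(y, random.Random(0))
n_pos_resampled = sum(y[i] for i in idx)  # always equals the original number of positives
```

Because both scores are evaluated on the same resampled indices, the difference AUC_a - AUC_b in each replicate stays paired.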

`ece_by_group(y_true, y_prob, groups, n_bins=10, strategy='uniform')` ¶

Per-subgroup Expected Calibration Error. Returns {group: ece}.

`expected_calibration_error(y_true, y_prob, n_bins=10, strategy='uniform')` ¶

Weighted mean |observed frequency - mean predicted probability| across bins.

Returns 0.0 if there is no data. strategy is 'uniform' (equal-width) or 'quantile' (equal-frequency) bins.

Examples:

>>> import numpy as np
>>> y = np.array([0, 0, 1, 1])
>>> p = np.array([0.0, 0.0, 1.0, 1.0])
>>> expected_calibration_error(y, p)
0.0
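For intuition, the equal-width case can be sketched in plain Python. This is an illustrative sketch only: `ece_uniform` is a hypothetical name, and the bin-edge convention (half-open bins, with 1.0 placed in the last bin) is an assumption rather than the package's documented behaviour.

```python
def ece_uniform(y_true, y_prob, n_bins=10):
    """Weighted mean |observed frequency - mean predicted probability| over equal-width bins."""
    n = len(y_true)
    if n == 0:
        return 0.0  # documented behaviour when there is no data
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # Assumed convention: bin i covers [i/n_bins, (i+1)/n_bins); p == 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed - predicted)
    return ece
```

Replacing the weighted sum with a maximum over the non-empty bins gives the corresponding maximum calibration error.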

`isotonic_recalibrate(probs, y_true)` ¶

Fit isotonic regression mapping predicted prob -> calibrated prob (Zadrozny & Elkan, 2002). Returns (fitted_model, calibrated_probabilities).
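Isotonic regression is commonly fitted with the pool-adjacent-violators algorithm. The sketch below shows that idea with the standard library only; it is not necessarily how the package fits its model, the names `pava` and `isotonic_sketch` are hypothetical, and tied probabilities are not pooled here.

```python
def pava(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def isotonic_sketch(probs, y_true):
    """Return calibrated probabilities in the original sample order."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    fitted_sorted = pava([y_true[i] for i in order])
    calibrated = [0.0] * len(probs)
    for rank, i in enumerate(order):
        calibrated[i] = fitted_sorted[rank]
    return calibrated
```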

`maximum_calibration_error(y_true, y_prob, n_bins=10, strategy='uniform')` ¶

Maximum |observed frequency - mean predicted probability| over bins.

`reliability_diagram(y_true, y_prob, groups=None, n_bins=10)` ¶

Reliability diagram (mean predicted probability vs observed frequency).

Returns a matplotlib Figure. If groups is given, draws one curve per subgroup.

`temperature_scale(logits, y_true, max_iter=200)` ¶

Fit a single temperature T>0 by minimizing NLL (Guo et al., 2017).

logits may be 1-D (binary positive-class logit) or 2-D (n, n_classes). Returns (T, calibrated_probabilities); calibrated probabilities are 1-D for 1-D input.
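The 1-D (binary) case can be illustrated with a small standard-library sketch. It assumes a golden-section search over log T on a fixed range [0.01, 100]; the optimizer the package actually uses is not stated here, and `fit_temperature` is a hypothetical name.

```python
import math

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def nll(logits, y_true, T):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, y_true):
        s = z / T
        softplus = max(s, 0.0) + math.log1p(math.exp(-abs(s)))  # log(1 + exp(s)), stable
        total += softplus - y * s
    return total / len(logits)

def fit_temperature(logits, y_true, max_iter=200):
    """Golden-section search over log T (assumed search range: T in [0.01, 100])."""
    lo, hi = math.log(1e-2), math.log(1e2)
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(max_iter):
        a = hi - phi * (hi - lo)
        b = lo + phi * (hi - lo)
        if nll(logits, y_true, math.exp(a)) < nll(logits, y_true, math.exp(b)):
            hi = b
        else:
            lo = a
    T = math.exp((lo + hi) / 2)
    return T, [sigmoid(z / T) for z in logits]

# Overconfident toy logits: 6 of 8 predictions are right, so the fitted confidence should be 0.75.
logits = [4, 4, 4, -4, -4, -4, 4, -4]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
T, probs = fit_temperature(logits, labels)
```

A fitted T above 1 softens overconfident probabilities, while T below 1 sharpens underconfident ones.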

`benjamini_hochberg(p_values, alpha=0.05)` ¶

Benjamini-Hochberg FDR correction (step-up). Returns adjusted, reject.
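The step-up adjustment can be sketched as below. This is a plain-Python illustration of the standard procedure, with a hypothetical name `bh_adjust` and a tuple return chosen for brevity; the package's return container may differ.

```python
def bh_adjust(p_values, alpha=0.05):
    """Benjamini-Hochberg adjusted p-values and reject decisions, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the minimum
    # of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    reject = [a <= alpha for a in adjusted]
    return adjusted, reject

adjusted, reject = bh_adjust([0.01, 0.04, 0.03, 0.5])
```

In this example only the smallest p-value survives at alpha = 0.05: its adjusted value is 0.04, while the next two are both adjusted to about 0.053.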

`bonferroni(p_values, alpha=0.05)` ¶

Bonferroni correction. Returns adjusted, reject, threshold.

Examples:

>>> bonferroni([0.01, 0.5], alpha=0.05)["adjusted"].tolist()
[0.02, 1.0]

`delong_auc_ci(y_true, y_score, alpha=0.05)` ¶

AUC with a DeLong (1 - alpha) normal-approximation confidence interval.

Returns a dict: auc, ci_lower, ci_upper, se, n_pos, n_neg.

Examples:

>>> import numpy as np
>>> y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
>>> s = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2])
>>> round(delong_auc_ci(y, s)["auc"], 3)
0.875

`delong_by_group(y_true, y_score, groups, alpha=0.05)` ¶

Per-subgroup DeLong AUC CIs. Returns {group: ci_dict}.

`delong_paired_test(y_true, score_a, score_b)` ¶

Covariance-aware paired DeLong test for two scores on the SAME samples.

Returns: auc_a, auc_b, delta, z, p_value.

`delong_unpaired_test(y_true_a, score_a, y_true_b, score_b)` ¶

Unpaired DeLong test for two INDEPENDENT samples (the cross-platform case).

`disparate_impact(y_pred, groups, group_a, group_b)` ¶

Symmetric disparate impact between two groups: min(rate_a/rate_b, rate_b/rate_a).

rate = P(y_pred == 1 | group) over hard 0/1 predictions. Result is in (0, 1]; < 0.80 violates the four-fifths rule and < 0.50 is a severe disparity (interpretation per the CPFE paper). If both rates are 0 the groups are treated as equal (returns 1.0); if exactly one rate is 0 the disparity is maximal (returns 0.0).

Examples:

>>> import numpy as np
>>> y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])
>>> g = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
>>> round(disparate_impact(y_pred, g, "A", "B"), 3)
0.5

`equalized_odds_difference(y_true, y_pred, groups, group_a, group_b)` ¶

|TPR_a - TPR_b|, the equalized odds difference as defined in the CPFE paper.

TPR = P(y_pred == 1 | y_true == 1, group). Raises if either group has no positive labels (TPR undefined).
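A minimal sketch of this definition, using plain lists and hypothetical helper names (`tpr`, `eod`); the exception type raised by the package is not stated above, so `ValueError` here is an assumption.

```python
def tpr(y_true, y_pred, groups, group):
    """True positive rate P(y_pred == 1 | y_true == 1, group)."""
    preds_on_positives = [p for t, p, g in zip(y_true, y_pred, groups) if g == group and t == 1]
    if not preds_on_positives:
        # Assumed exception type; the TPR is undefined without positive labels.
        raise ValueError(f"group {group!r} has no positive labels; TPR is undefined")
    return sum(1 for p in preds_on_positives if p == 1) / len(preds_on_positives)

def eod(y_true, y_pred, groups, group_a, group_b):
    """Equalized odds difference |TPR_a - TPR_b|."""
    return abs(tpr(y_true, y_pred, groups, group_a) - tpr(y_true, y_pred, groups, group_b))

y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1]
groups = ["A", "A", "A", "B", "B", "B"]
gap = eod(y_true, y_pred, groups, "A", "B")  # TPR_A = 1/2, TPR_B = 2/2
```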

`subgroup_metrics(y_true, y_score, groups, threshold=0.5)` ¶

Per-subgroup discrimination metrics: {group: {auc, brier, f1, n}}.

y_score are predicted probabilities in [0, 1] (Brier assumes a probability); threshold binarizes them for F1. Raises if any subgroup is single-class (AUC undefined) or if y_score contains NaN.

`fairscope.healthcare`¶

`fairscope.healthcare.audit` ¶

One-call clinical fairness audit. Composes fairscope.core; invents no statistics.

Pipeline per protected attribute: per-subgroup DeLong AUC CIs -> per-subgroup ECE -> per-subgroup Brier/F1 -> Bonferroni-corrected pairwise (unpaired) DeLong tests across the attribute's subgroups. Mirrors the analysis in the diabetes paper (IEEE CIPHER 2026).
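The last pipeline step can be pictured with the following sketch. It assumes the Bonferroni family is the set of all subgroup pairs within one attribute, and it takes the pairwise p-values from a caller-supplied function; `bonferroni_pairs` and `p_value_for_pair` are hypothetical names, and the p-values in the usage lines are made up for illustration.

```python
from itertools import combinations

def bonferroni_pairs(subgroups, p_value_for_pair, alpha=0.05):
    """Bonferroni-adjust one p-value per unordered subgroup pair."""
    pairs = list(combinations(subgroups, 2))
    m = len(pairs)  # assumed family size: every pair within the attribute
    results = {}
    for a, b in pairs:
        p = p_value_for_pair(a, b)  # stand-in for an unpaired DeLong p-value
        adjusted = min(p * m, 1.0)
        results[(a, b)] = {"p_value": p, "adjusted": adjusted, "reject": adjusted <= alpha}
    return results

# Made-up p-values for three subgroups of one attribute.
fake_p = {("A", "B"): 0.01, ("A", "C"): 0.03, ("B", "C"): 0.5}
table = bonferroni_pairs(["A", "B", "C"], lambda a, b: fake_p[(a, b)])
```

With three subgroups there are three comparisons, so each raw p-value is multiplied by 3 and capped at 1.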

`HealthcareFairnessAudit` ¶

Audit a fitted classifier (or precomputed scores) for subgroup fairness.

Parameters:

Name	Type	Description	Default
`model`		Object with `predict_proba` (positive-class probability in column 1), or `None` when using `from_scores`.	required
`X_test`		Test features.	required
`y_test`		Binary test labels.	required
`protected_attr`	`dict`	`{attribute_name: 1-D array of subgroup labels per sample}`.	required

`from_scores(y_true, y_score, protected_attr, *, n_bins=10, alpha=0.05)` `classmethod` ¶

Build an audit from precomputed positive-class probabilities (no model).

`run()` ¶

Run the audit and return a `HealthcareReport`.

Raises a clear, attribute-named ValueError if a subgroup is single-class (AUC undefined) rather than letting a low-level error surface.

`shap_summary(max_samples=200)` ¶

Mean absolute SHAP value per feature (optional). Requires `pip install fairscope[shap]` and a model (not `from_scores`). Returns a dict `{feature_index: mean_abs_shap}`.

`HealthcareReport` ¶

Holds the audit results and the raw (y, score, groups) arrays needed to render reliability curves. Provides a text summary, plots, and PDF export through the methods below.

`summary()` ¶

Return a human-readable summary string (no print side effect).

`plot_auc_forest(attribute=None)` ¶

Forest plot of per-subgroup AUC with DeLong 95% CIs. Returns a Figure.

`plot_calibration(attribute=None)` ¶

Reliability curves per subgroup for one attribute, drawn with core.reliability_diagram. Returns a Figure.

`to_pdf(path)` ¶

Write a multi-page PDF: summary, AUC forest, and one calibration page per attribute. Uses matplotlib only (no extra dependency).

`fairscope.nlp`¶


Cross-Platform Fairness Evaluation (CPFE) for NLP, built on fairscope.core.

Public API: the five-axis CPFEProtocol/CPFEReport, the multiclass metric and significance primitives (axes 1-4), and the attribution-stability functions (axis 5).

`CPFEProtocol` ¶

Run the CPFE five-axis evaluation over precomputed per-platform outputs.

Parameters:

Name	Type	Description	Default
`platform_data`	`dict`	`{name: {"y_true": array, "probs": (n, n_classes) array}}`.	required
`reference`		The within-platform name (e.g. the training platform).	required
`n_classes`		Number of classes.	required
`delta_auc_pct_max`		ILLUSTRATIVE macro-AUC-drop screening limit (percent) used by `CPFEReport.deployment_readiness`. NOT a published cutoff: P4 Section 6.6 declines to set one (observed drops were 28.6-39.5%); the default echoes that ">30%" magnitude and is labelled illustrative everywhere it surfaces.	`30.0`

`CPFEReport` ¶

Holds the five-axis results and renders tables and a deployment-readiness diagnostic.

`deployment_readiness()` ¶

Structured per-axis, per-platform screening DIAGNOSTIC -- NOT a deployment decision. Following the CPFE paper (Sections 6.5-6.6), cross-platform degradation is an informative diagnostic, not definitive evidence of bias.

Thresholds: calibration uses P4's stated ECE bands (Suppl. Fig. S2); equity uses P4's four-fifths rule; discrimination uses an ILLUSTRATIVE delta_auc_pct_max (P4 Section 6.6 declines to set a published cutoff). Returns {platform: {"ready": bool, "axes": {axis: {pass, value, threshold, source, reason}}}}.

`jaccard_topk(saliency_a, saliency_b, k)` ¶

Jaccard overlap of the top-k tokens by saliency: |topK(A) ∩ topK(B)| / |topK(A) ∪ topK(B)|.

saliency_* map token -> saliency score. Returns 0.0 if both are empty.

Examples:

>>> jaccard_topk({"a": 0.9, "b": 0.8}, {"a": 0.7, "c": 0.6}, k=2)
0.3333333333333333

`token_saliency(model, tokenizer, text, target=None)` ¶

Per-token gradient saliency s_i = ‖∂P(y|x)/∂E_i‖₂ via Captum (optional). Requires `pip install fairscope[nlp]`. Returns `{token: saliency}`.

`macro_auc(y_true, probs)` ¶

Macro one-vs-rest AUC. Requires every class present in y_true.
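As an illustration of the one-vs-rest average, the sketch below computes each class's AUC by direct pairwise comparison (quadratic in the sample count, so suitable only for small inputs). The function names are hypothetical and this is not the package's implementation.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when a class is absent")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc_sketch(y_true, probs):
    """Unweighted mean of the one-vs-rest AUC of every class."""
    n_classes = len(probs[0])
    aucs = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        aucs.append(binary_auc(labels, [row[c] for row in probs]))
    return sum(aucs) / n_classes
```

The `ValueError` mirrors the stated requirement that every class must be present in `y_true`; without it, at least one per-class AUC would have no positives to rank.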

`macro_f1(y_true, probs)` ¶

Macro F1 of the argmax predictions.

`multiclass_ece(y_true, probs, n_bins=10)` ¶

Confidence-accuracy Expected Calibration Error (Guo et al. 2017; CPFE paper): ECE = sum_m (|B_m|/n) * |acc(B_m) - conf(B_m)| with conf = max prob and acc = top-1 correct.
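The formula above can be sketched directly. The equal-width binning of confidences and the handling of empty input are assumptions made for this illustration, and `multiclass_ece_sketch` is a hypothetical name.

```python
def multiclass_ece_sketch(y_true, probs, n_bins=10):
    """ECE = sum_m (|B_m| / n) * |acc(B_m) - conf(B_m)| over confidence bins."""
    n = len(y_true)
    if n == 0:
        return 0.0  # assumption: treat empty input like the binary case
    bins = [[] for _ in range(n_bins)]
    for y, row in zip(y_true, probs):
        conf = max(row)        # confidence = highest class probability
        pred = row.index(conf)  # top-1 prediction (first index on ties)
        # Assumed convention: equal-width bins, with conf == 1.0 placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((1.0 if pred == y else 0.0, conf))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        acc = sum(hit for hit, _ in members) / len(members)
        mean_conf = sum(c for _, c in members) / len(members)
        ece += (len(members) / n) * abs(acc - mean_conf)
    return ece

# Two samples predicted as class 0 with confidence 0.8, only one of them correct.
value = multiclass_ece_sketch([0, 1], [[0.8, 0.2], [0.8, 0.2]])
```

Here accuracy in the single occupied bin is 0.5 against a mean confidence of 0.8, so the ECE is about 0.3.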

`per_class_disparate_impact(probs_a, probs_b, n_classes)` ¶

Symmetric DI per class between two platforms, reusing core.disparate_impact with the class binarized (pred == c) and platform as the two-group label.

`per_class_equalized_odds(y_a, probs_a, y_b, probs_b, n_classes)` ¶

EOD per class between two platforms (|TPR_c(A) - TPR_c(B)|), reusing core.equalized_odds_difference. A class with no positive labels in a platform is returned as None (TPR undefined).

`bootstrap_macro_auc_test(y_a, probs_a, y_b, probs_b, n_boot=2000, random_state=None)` ¶

Compare macro AUC across two platforms (independent test sets).

Each platform's macro-AUC standard error is estimated by a class-stratified bootstrap; the errors are combined for an unpaired z-test. Returns a dict: auc_a, auc_b, delta, se, z, p_value, n_boot.
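Once each platform has a bootstrap standard error, the combination step might look like the sketch below. It assumes the two errors are combined as a root sum of squares (appropriate for independent test sets) and that the p-value is two-sided under a normal approximation; `unpaired_z_test` is a hypothetical name and the inputs are made-up numbers.

```python
import math

def unpaired_z_test(auc_a, se_a, auc_b, se_b):
    """z-test for the difference of two independent AUC estimates."""
    delta = auc_a - auc_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)        # independent samples: variances add
    z = delta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
    return {"delta": delta, "se": se, "z": z, "p_value": p_value}

result = unpaired_z_test(auc_a=0.90, se_a=0.03, auc_b=0.80, se_b=0.04)
```

With these made-up inputs the combined standard error is 0.05, giving z of about 2.0 and a two-sided p-value of roughly 0.046.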
