API reference
Auto-generated from the package docstrings.
Top level
fairscope.FairnessAudit(model, domain, **kwargs)
Route to a domain-specific fairness audit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | | The fitted classifier (or `None` with precomputed scores via the domain API). | required |
| `domain` | `str` | The audit domain. Implemented: | required |
| `**kwargs` | | Passed through to the domain audit class. | `{}` |
Examples:
>>> import fairscope
>>> callable(fairscope.FairnessAudit)
True
fairscope.core
fairscope.core
Core statistical primitives for fairscope.
Public API for subgroup-stratified, calibration-aware fairness auditing: DeLong AUC confidence intervals and tests, a stratified bootstrap AUC test, calibration error and recalibration, multiple-comparison corrections, and subgroup fairness metrics.
bootstrap_auc_test(y_true, score_a, score_b, n_boot=2000, random_state=None)
Stratified bootstrap test of (AUC_a - AUC_b) on the same samples.
Resamples positives and negatives separately (preserving class balance). Returns a dict: auc_a, auc_b, delta, se, z, p_value, ci_lower, ci_upper, n_boot.
Examples:
>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> y = rng.integers(0, 2, 400)
>>> a = rng.random(400) + 0.8 * y
>>> b = rng.random(400) + 0.05 * y
>>> bootstrap_auc_test(y, a, b, n_boot=200, random_state=0)["delta"] > 0
True
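The resampling scheme described above can be sketched with the standard library only. This is a minimal illustration, not the package's implementation: the function names `auc` and `stratified_bootstrap_delta` are hypothetical, and deriving the p-value from a normal approximation on the bootstrap standard error is an assumption. The confidence-interval fields are omitted.

```python
import math
import random
from statistics import NormalDist, stdev


def auc(y_true, score):
    """Mann-Whitney AUC; a tie between a positive and a negative counts 0.5."""
    pos = [s for s, y in zip(score, y_true) if y == 1]
    neg = [s for s, y in zip(score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_delta(y_true, score_a, score_b, n_boot=200, seed=0):
    """Hypothetical sketch: bootstrap the AUC difference on the same samples."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(y_true) if y == 1]
    neg_idx = [i for i, y in enumerate(y_true) if y == 0]
    deltas = []
    for _ in range(n_boot):
        # Resample positives and negatives separately so class balance is preserved.
        idx = rng.choices(pos_idx, k=len(pos_idx)) + rng.choices(neg_idx, k=len(neg_idx))
        y = [y_true[i] for i in idx]
        a = [score_a[i] for i in idx]
        b = [score_b[i] for i in idx]
        deltas.append(auc(y, a) - auc(y, b))
    delta = auc(y_true, score_a) - auc(y_true, score_b)
    se = stdev(deltas)
    if se > 0:
        z = delta / se
    else:
        z = 0.0 if delta == 0 else math.copysign(math.inf, delta)
    # Assumption: two-sided p-value from a normal approximation.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"delta": delta, "se": se, "z": z, "p_value": p_value}
```

Because both scores are evaluated on the same resampled indices, the correlation between the two AUC estimates is carried into the bootstrap distribution of the difference.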
ece_by_group(y_true, y_prob, groups, n_bins=10, strategy='uniform')
Per-subgroup Expected Calibration Error. Returns {group: ece}.
expected_calibration_error(y_true, y_prob, n_bins=10, strategy='uniform')
Weighted mean |observed frequency - mean predicted probability| across bins.
Returns 0.0 if there is no data. strategy is 'uniform' (equal-width) or 'quantile'
(equal-frequency) bins.
Examples:
>>> import numpy as np
>>> y = np.array([0, 0, 1, 1])
>>> p = np.array([0.0, 0.0, 1.0, 1.0])
>>> expected_calibration_error(y, p)
0.0
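For intuition, the equal-width ('uniform') case can be sketched in plain Python. The name `ece_uniform` is hypothetical, and the bin-edge convention (left-inclusive bins, with 1.0 placed in the last bin) is an assumption that may differ from the package.

```python
def ece_uniform(y_true, y_prob, n_bins=10):
    """Hypothetical sketch of ECE with equal-width bins on [0, 1]."""
    n = len(y_true)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # Assumption: left-inclusive bins; p == 1.0 goes into the last bin.
        k = min(int(p * n_bins), n_bins - 1)
        bins[k].append((y, p))
    total = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        # Weight each bin's gap by the fraction of samples it holds.
        total += len(members) / n * abs(observed - predicted)
    return total
```

With two samples both predicted at 0.8 and only one of them positive, the single occupied bin has observed frequency 0.5 against mean prediction 0.8, so the sketch returns approximately 0.3.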
isotonic_recalibrate(probs, y_true)
Fit isotonic regression mapping predicted prob -> calibrated prob
(Zadrozny & Elkan, 2002). Returns (fitted_model, calibrated_probabilities).
maximum_calibration_error(y_true, y_prob, n_bins=10, strategy='uniform')
Maximum |observed frequency - mean predicted probability| over bins.
reliability_diagram(y_true, y_prob, groups=None, n_bins=10)
Reliability diagram (mean predicted probability vs observed frequency).
Returns a matplotlib Figure. If groups is given, draws one curve per subgroup.
temperature_scale(logits, y_true, max_iter=200)
Fit a single temperature T>0 by minimizing NLL (Guo et al., 2017).
logits may be 1-D (binary positive-class logit) or 2-D (n, n_classes). Returns
(T, calibrated_probabilities); calibrated probabilities are 1-D for 1-D input.
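The 1-D (binary) case can be sketched as follows. This is illustrative only: the names `nll` and `fit_temperature` are hypothetical, and the golden-section search over log T is a stand-in optimizer, since this page does not state which optimizer the package uses.

```python
import math


def nll(logits, y_true, T):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, y_true):
        s = z / T if y == 1 else -z / T
        # -log(sigmoid(s)) = log(1 + exp(-s)), computed in a stable form.
        total += max(-s, 0.0) + math.log1p(math.exp(-abs(s)))
    return total / len(logits)


def fit_temperature(logits, y_true, lo=0.05, hi=100.0, iters=200):
    """Golden-section search over log T (assumes the NLL is unimodal in T)."""
    a, b = math.log(lo), math.log(hi)
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if nll(logits, y_true, math.exp(c)) < nll(logits, y_true, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2)


# Overconfident logits: magnitude 4, but only 3 of every 4 predictions are right.
logits = [4, 4, 4, 4, -4, -4, -4, -4]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
T = fit_temperature(logits, labels)
calibrated = [1 / (1 + math.exp(-z / T)) for z in logits]
```

In this toy case the optimum satisfies sigmoid(4 / T) = 0.75, so T comes out close to 4 / ln 3 (about 3.64) and the calibrated probabilities move to roughly 0.75 and 0.25.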
benjamini_hochberg(p_values, alpha=0.05)
Benjamini-Hochberg FDR correction (step-up). Returns adjusted, reject.
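A minimal sketch of the step-up procedure, for readers who want to check adjusted values by hand. The name `benjamini_hochberg_sketch` is hypothetical, and returning a dict with keys `adjusted` and `reject` is an assumption about the container type.

```python
def benjamini_hochberg_sketch(p_values, alpha=0.05):
    """Hypothetical sketch of Benjamini-Hochberg adjusted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    reject = [q <= alpha for q in adjusted]
    return {"adjusted": adjusted, "reject": reject}


result = benjamini_hochberg_sketch([0.01, 0.04, 0.03, 0.5])
```

For these four p-values the adjusted values are approximately 0.04, 0.0533, 0.0533 and 0.5, so only the first hypothesis is rejected at alpha = 0.05.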
bonferroni(p_values, alpha=0.05)
Bonferroni correction. Returns a dict with keys adjusted, reject, threshold.
Examples:
>>> bonferroni([0.01, 0.5], alpha=0.05)["adjusted"].tolist()
[0.02, 1.0]
delong_auc_ci(y_true, y_score, alpha=0.05)
AUC with a DeLong (1 - alpha) normal-approximation confidence interval.
Returns a dict: auc, ci_lower, ci_upper, se, n_pos, n_neg.
Examples:
>>> import numpy as np
>>> y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
>>> s = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2])
>>> round(delong_auc_ci(y, s)["auc"], 3)
0.875
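The DeLong variance behind this interval can be sketched from its structural components. This is a minimal illustration under stated assumptions: the name `delong_ci_sketch` is hypothetical, and clipping the interval to [0, 1] is an assumption not confirmed by this page.

```python
from statistics import NormalDist, mean, variance


def delong_ci_sketch(y_true, y_score, alpha=0.05):
    """Hypothetical sketch: AUC with a DeLong normal-approximation CI."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]

    def psi(p, n):
        return 1.0 if p > n else 0.5 if p == n else 0.0

    # Structural components: one value per positive and one per negative.
    v10 = [mean(psi(p, n) for n in neg) for p in pos]
    v01 = [mean(psi(p, n) for p in pos) for n in neg]
    auc = mean(v10)
    # Sample variances (n - 1 denominator) need at least two samples per class.
    var = variance(v10) / len(pos) + variance(v01) / len(neg)
    se = var ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "auc": auc,
        "se": se,
        "ci_lower": max(0.0, auc - z * se),  # assumption: clip to [0, 1]
        "ci_upper": min(1.0, auc + z * se),
        "n_pos": len(pos),
        "n_neg": len(neg),
    }


y = [1, 1, 1, 1, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2]
ci = delong_ci_sketch(y, s)
```

On the eight samples from the example above, the sketch gives an AUC of 0.875 with a standard error of about 0.144, which shows how wide the interval is at this sample size.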
delong_by_group(y_true, y_score, groups, alpha=0.05)
Per-subgroup DeLong AUC CIs. Returns {group: ci_dict}.
delong_paired_test(y_true, score_a, score_b)
Covariance-aware paired DeLong test for two scores on the SAME samples.
Returns: auc_a, auc_b, delta, z, p_value.
delong_unpaired_test(y_true_a, score_a, y_true_b, score_b)
Unpaired DeLong test for two INDEPENDENT samples (the cross-platform case).
disparate_impact(y_pred, groups, group_a, group_b)
Symmetric disparate impact between two groups: min(rate_a/rate_b, rate_b/rate_a).
rate = P(y_pred == 1 | group) over hard 0/1 predictions. Result is in [0, 1];
< 0.80 violates the four-fifths rule and < 0.50 is a severe disparity (interpretation
per the CPFE paper). If both rates are 0 the groups are treated as equal (returns 1.0);
if exactly one rate is 0 the disparity is maximal (returns 0.0).
Examples:
>>> import numpy as np
>>> y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])
>>> g = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
>>> round(disparate_impact(y_pred, g, "A", "B"), 3)
0.5
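The zero-rate conventions described above can be made concrete with a short sketch. The name `disparate_impact_sketch` is hypothetical, and the sketch does not guard against an empty group.

```python
def disparate_impact_sketch(y_pred, groups, group_a, group_b):
    """Hypothetical sketch of symmetric disparate impact with zero-rate rules."""

    def positive_rate(g):
        members = [p for p, grp in zip(y_pred, groups) if grp == g]
        return sum(1 for p in members if p == 1) / len(members)

    rate_a, rate_b = positive_rate(group_a), positive_rate(group_b)
    if rate_a == 0 and rate_b == 0:
        return 1.0  # neither group receives positives: treated as equal
    if rate_a == 0 or rate_b == 0:
        return 0.0  # exactly one group receives no positives: maximal disparity
    return min(rate_a / rate_b, rate_b / rate_a)
```

Taking the minimum of the two ratios makes the result independent of which group is passed as `group_a`.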
equalized_odds_difference(y_true, y_pred, groups, group_a, group_b)
|TPR_a - TPR_b|, the equalized odds difference as defined in the CPFE paper.
TPR = P(y_pred == 1 | y_true == 1, group). Raises if either group has no
positive labels (TPR undefined).
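A minimal sketch of the TPR-gap definition above. The name `equalized_odds_difference_sketch` is hypothetical, and raising `ValueError` is an assumption, since the exception type is not stated here.

```python
def equalized_odds_difference_sketch(y_true, y_pred, groups, group_a, group_b):
    """Hypothetical sketch: absolute true-positive-rate gap between two groups."""

    def tpr(g):
        # Predictions for the samples in group g whose true label is positive.
        preds = [p for t, p, grp in zip(y_true, y_pred, groups) if grp == g and t == 1]
        if not preds:
            # Assumption: the exception type is illustrative.
            raise ValueError(f"group {g!r} has no positive labels; TPR is undefined")
        return sum(1 for p in preds if p == 1) / len(preds)

    return abs(tpr(group_a) - tpr(group_b))


y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1]
groups = ["A", "A", "A", "B", "B", "B"]
gap = equalized_odds_difference_sketch(y_true, y_pred, groups, "A", "B")
```

Here group A recovers one of its two positives and group B recovers both, so the gap is 0.5. The false positive in group B does not affect the result because only TPR enters this definition.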
subgroup_metrics(y_true, y_score, groups, threshold=0.5)
Per-subgroup discrimination metrics: {group: {auc, brier, f1, n}}.
y_score are predicted probabilities in [0, 1] (Brier assumes a probability);
threshold binarizes them for F1. Raises if any subgroup is single-class (AUC
undefined) or if y_score contains NaN.
fairscope.healthcare
fairscope.healthcare.audit
One-call clinical fairness audit. Composes fairscope.core; invents no statistics.
Pipeline per protected attribute: per-subgroup DeLong AUC CIs -> per-subgroup ECE -> per-subgroup Brier/F1 -> Bonferroni-corrected pairwise (unpaired) DeLong tests across the attribute's subgroups. Mirrors the analysis in the diabetes paper (IEEE CIPHER 2026).
HealthcareFairnessAudit
Audit a fitted classifier (or precomputed scores) for subgroup fairness.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | | Object with `predict_proba` (positive-class probability in column 1), or | required |
| `X_test` | | Test features and binary labels. | required |
| `y_test` | | Test features and binary labels. | required |
| `protected_attr` | | dict `{attribute_name: 1-D array of subgroup labels per sample}`. | required |
from_scores(y_true, y_score, protected_attr, *, n_bins=10, alpha=0.05)
classmethod
Build an audit from precomputed positive-class probabilities (no model).
run()
Run the audit and return a HealthcareReport.
Raises a clear, attribute-named ValueError if a subgroup is single-class
(AUC undefined) rather than letting a low-level error surface.
shap_summary(max_samples=200)
Mean absolute SHAP value per feature (optional). Requires
pip install fairscope[shap] and a model (not from_scores). Returns a dict
{feature_index: mean_abs_shap}.
HealthcareReport
Holds audit results and the raw (y, score, groups) needed to render reliability curves; provides tables as well as plots and PDF export (see the plotting methods).
summary()
Return a human-readable summary string (no print side effect).
plot_auc_forest(attribute=None)
Forest plot of per-subgroup AUC with DeLong 95% CIs. Returns a Figure.
plot_calibration(attribute=None)
Reliability curves per subgroup for one attribute, drawn with
core.reliability_diagram. Returns a Figure.
to_pdf(path)
Write a multi-page PDF: summary, AUC forest, and one calibration page per attribute. Uses matplotlib only (no extra dependency).
fairscope.nlp
fairscope.nlp
Cross-Platform Fairness Evaluation (CPFE) for NLP, built on fairscope.core.
Public API: the five-axis CPFEProtocol/CPFEReport, the multiclass metric and
significance primitives (axes 1-4), and the attribution-stability functions (axis 5).
CPFEProtocol
Run the CPFE five-axis evaluation over precomputed per-platform outputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `platform_data` | | dict `{name: {"y_true": array, "probs": (n, n_classes) array}}`. | required |
| `reference` | | The within-platform name (e.g. the training platform). | required |
| `n_classes` | | Number of classes. | required |
| `delta_auc_pct_max` | | ILLUSTRATIVE macro-AUC-drop screening limit (percent) used by | `30.0` |
CPFEReport
Holds the five-axis results and renders tables and a deployment-readiness diagnostic.
deployment_readiness()
Structured per-axis, per-platform screening DIAGNOSTIC -- NOT a deployment decision. Following the CPFE paper (Sections 6.5-6.6), cross-platform degradation is an informative diagnostic, not definitive evidence of bias.
Thresholds: calibration uses P4's stated ECE bands (Suppl. Fig. S2); equity uses
P4's four-fifths rule; discrimination uses an ILLUSTRATIVE delta_auc_pct_max
(P4 Section 6.6 declines to set a published cutoff). Returns
{platform: {"ready": bool, "axes": {axis: {pass, value, threshold, source, reason}}}}.
jaccard_topk(saliency_a, saliency_b, k)
Jaccard overlap of the top-k tokens by saliency: |topK(A) ∩ topK(B)| / |union|.
saliency_* map token -> saliency score. Returns 0.0 if both are empty.
Examples:
>>> jaccard_topk({"a": 0.9, "b": 0.8}, {"a": 0.7, "c": 0.6}, k=2)
0.3333333333333333
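The top-k overlap can be sketched in a few lines. The name `jaccard_topk_sketch` is hypothetical, and tie-breaking among equal saliency scores (here, insertion order) is an assumption.

```python
def jaccard_topk_sketch(saliency_a, saliency_b, k):
    """Hypothetical sketch: Jaccard overlap of the k most salient tokens."""

    def top_k(saliency):
        # Assumption: ties keep dict insertion order (sorted is stable).
        ranked = sorted(saliency, key=saliency.get, reverse=True)
        return set(ranked[:k])

    top_a, top_b = top_k(saliency_a), top_k(saliency_b)
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)
```

In the example above the two top-2 sets share only the token "a" out of three distinct tokens, which gives 1/3.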
token_saliency(model, tokenizer, text, target=None)
Per-token gradient saliency s_i = ‖∂P(y|x)/∂E_i‖₂ via Captum (optional).
Requires pip install fairscope[nlp]. Returns {token: saliency}.
macro_auc(y_true, probs)
Macro one-vs-rest AUC. Requires every class present in y_true.
macro_f1(y_true, probs)
Macro F1 of the argmax predictions.
multiclass_ece(y_true, probs, n_bins=10)
Confidence-accuracy Expected Calibration Error (Guo et al. 2017; CPFE paper):
ECE = sum_m (|B_m|/n) * |acc(B_m) - conf(B_m)| with conf = max prob and
acc = top-1 correct.
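The formula above translates directly into a short sketch. The name `multiclass_ece_sketch` is hypothetical, and the equal-width, left-inclusive binning of the confidence is an assumption.

```python
def multiclass_ece_sketch(y_true, probs, n_bins=10):
    """Hypothetical sketch of confidence-accuracy ECE for multiclass outputs."""
    n = len(y_true)
    bins = [[] for _ in range(n_bins)]
    for y, row in zip(y_true, probs):
        conf = max(row)        # confidence = max predicted probability
        pred = row.index(conf)  # top-1 class (first index on ties)
        # Assumption: equal-width bins; conf == 1.0 goes into the last bin.
        k = min(int(conf * n_bins), n_bins - 1)
        bins[k].append((conf, pred == y))
    ece = 0.0
    for members in bins:
        if members:
            acc = sum(ok for _, ok in members) / len(members)
            mean_conf = sum(c for c, _ in members) / len(members)
            ece += len(members) / n * abs(acc - mean_conf)
    return ece


probs = [[0.9, 0.05, 0.05], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
y_true = [0, 0, 1, 0]
ece = multiclass_ece_sketch(y_true, probs)
```

All four predictions carry confidence 0.9 but only three are correct, so the single occupied bin contributes |0.75 - 0.9|, giving an ECE of about 0.15.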
per_class_disparate_impact(probs_a, probs_b, n_classes)
Symmetric DI per class between two platforms, reusing core.disparate_impact with
the class binarized (pred == c) and platform as the two-group label.
per_class_equalized_odds(y_a, probs_a, y_b, probs_b, n_classes)
EOD per class between two platforms (|TPR_c(A) - TPR_c(B)|), reusing
core.equalized_odds_difference. A class with no positive labels in a platform is
returned as None (TPR undefined).
bootstrap_macro_auc_test(y_a, probs_a, y_b, probs_b, n_boot=2000, random_state=None)
Compare macro AUC across two platforms (independent test sets).
Each platform's macro-AUC standard error is estimated by a class-stratified bootstrap; the errors are combined for an unpaired z-test. Returns a dict: auc_a, auc_b, delta, se, z, p_value, n_boot.
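The final combination step can be sketched as below. The name `unpaired_z_test` is hypothetical, the input values are illustrative rather than real results, and combining the two standard errors in quadrature is an assumption based on the test sets being independent.

```python
from statistics import NormalDist


def unpaired_z_test(auc_a, se_a, auc_b, se_b):
    """Hypothetical sketch: z-test on the difference of two independent AUCs."""
    delta = auc_a - auc_b
    # Assumption: independent samples, so the variances add.
    se = (se_a ** 2 + se_b ** 2) ** 0.5
    z = delta / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"delta": delta, "se": se, "z": z, "p_value": p_value}


# Illustrative numbers only.
result = unpaired_z_test(auc_a=0.90, se_a=0.03, auc_b=0.80, se_b=0.04)
```

With these inputs the combined standard error is 0.05, so z is 2.0 and the two-sided p-value is about 0.0455.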