Cross-Platform Fairness Evaluation (CPFE)
CPFE is fairscope's novel contribution: a five-axis protocol for evaluating what happens
when a model trained on one data source is deployed on another. A single AUC number cannot
distinguish the ways a model degrades across platforms; CPFE evaluates five orthogonal axes,
comparing each non-training platform against a reference.
The five axes
- Discriminative performance — macro one-vs-rest AUC and macro F1, with the relative change ΔAUC%.
- Calibration — confidence–accuracy Expected Calibration Error per platform.
- Statistical significance — a stratified bootstrap standard error on the macro-AUC difference between platforms (independent test sets), with Bonferroni correction.
- Prediction equity — symmetric disparate impact and equalized-odds difference per class, treating platform membership as the group.
- Attribution stability — Jaccard overlap of the top-K gradient-saliency token sets across platforms (requires pip install fairscope[nlp]).
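To make three of these axes concrete, here is a minimal pure-Python sketch of a confidence-accuracy ECE, a symmetric disparate-impact ratio, and a Jaccard overlap. The function names, the ten equal-width bins, and the min-of-both-ratios definition of "symmetric" are assumptions for illustration; fairscope's own implementations may differ.

```python
def expected_calibration_error(y_true, probs, n_bins=10):
    """Confidence-accuracy ECE: bin samples by top-class confidence and
    average |accuracy - mean confidence| weighted by bin size.
    Equal-width binning with n_bins=10 is an assumption."""
    n = len(y_true)
    # Each bin holds [count, summed confidence, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for label, row in zip(y_true, probs):
        conf = max(row)
        pred = row.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(pred == label)
    ece = 0.0
    for count, conf_sum, correct in bins:
        if count:
            ece += (count / n) * abs(correct / count - conf_sum / count)
    return ece


def symmetric_disparate_impact(rate_a, rate_b):
    """Ratio of two positive-prediction rates, taken in whichever direction
    is <= 1, so the result does not depend on which platform is the
    reference. This definition of 'symmetric' is an assumption."""
    if rate_a == 0 and rate_b == 0:
        return 1.0
    if rate_a == 0 or rate_b == 0:
        return 0.0
    return min(rate_a / rate_b, rate_b / rate_a)


def jaccard(tokens_a, tokens_b):
    """Jaccard overlap |A & B| / |A | B| of two top-K token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy inputs, invented for illustration only.
toy_y = [0, 1, 0, 1]
toy_probs = [[0.95, 0.05], [0.15, 0.85], [0.65, 0.35], [0.75, 0.25]]
print(expected_calibration_error(toy_y, toy_probs))
print(symmetric_disparate_impact(0.4, 0.5))
print(jaccard(["loan", "credit", "late"], ["credit", "late", "fee"]))
```

Treating platform membership as the group means `rate_a` and `rate_b` would be the per-class positive-prediction rates on two platforms rather than on two demographic groups.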
Usage
The protocol runs on precomputed per-platform outputs (axes 1–4 need no deep-learning dependencies):
from fairscope.nlp import CPFEProtocol

platform_data = {
    "reference": {"y_true": y_ref, "probs": probs_ref},  # probs shape: (n, n_classes)
    "deployment": {"y_true": y_dep, "probs": probs_dep},
}
report = CPFEProtocol(platform_data, reference="reference", n_classes=4).run()

report.to_dataframe()          # macro AUC/F1, ECE, ΔAUC% per platform
report.significance            # bootstrap macro-AUC test + Bonferroni-adjusted p
report.equity                  # per-class disparate impact and equalized odds
report.deployment_readiness()  # structured per-axis diagnostic
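For trying the input format without a trained model, the sketch below builds synthetic per-platform outputs with NumPy. `make_platform` and its `sharpness` parameter are hypothetical helpers, not part of fairscope, and it is an assumption that `CPFEProtocol` accepts NumPy arrays for `y_true` and `probs`.

```python
import numpy as np


def make_platform(n, n_classes, sharpness, seed):
    """Hypothetical helper: synthetic labels plus softmax probabilities.
    A larger `sharpness` puts more probability mass on the true class."""
    rng = np.random.default_rng(seed)
    y_true = rng.integers(0, n_classes, size=n)
    logits = rng.normal(size=(n, n_classes))
    logits[np.arange(n), y_true] += sharpness   # boost the true class
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return {"y_true": y_true, "probs": probs}


# Same dictionary layout as the usage snippet; the deployment platform is
# deliberately given a weaker signal than the reference.
platform_data = {
    "reference": make_platform(500, 4, sharpness=2.0, seed=0),
    "deployment": make_platform(300, 4, sharpness=1.0, seed=1),
}

for name, data in platform_data.items():
    print(name, data["y_true"].shape, data["probs"].shape)
```

Each `probs` array has one row per sample and one column per class, with rows summing to 1, which matches the `(n, n_classes)` shape noted in the usage snippet.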
deployment_readiness(): a diagnostic, not a decision
deployment_readiness() returns a structured per-axis verdict
({platform: {ready, axes: {axis: {pass, value, threshold, source, reason}}}}). It is
explicitly a screening diagnostic, not a deployment decision or compliance verdict.
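A result of this shape can be walked with plain dictionary access. The sketch below uses a hand-written dictionary that follows the documented structure; the axis key names (`"calibration"`, `"equity"`), the `source` strings, and every value are invented for illustration, and `failing_axes` is a hypothetical helper rather than a fairscope function.

```python
def failing_axes(readiness):
    """Return {platform: [(axis, reason), ...]} for axes whose 'pass' is False."""
    out = {}
    for platform, verdict in readiness.items():
        failed = [
            (axis, info.get("reason"))
            for axis, info in verdict["axes"].items()
            if not info["pass"]
        ]
        if failed:
            out[platform] = failed
    return out


# Hand-written example in the documented shape; all values are hypothetical.
example = {
    "deployment": {
        "ready": False,
        "axes": {
            "calibration": {
                "pass": True, "value": 0.06, "threshold": 0.10,
                "source": "stated reference band",
                "reason": "ECE below 0.10",
            },
            "equity": {
                "pass": False, "value": 0.72, "threshold": 0.80,
                "source": "four-fifths rule",
                "reason": "disparate impact below 0.80",
            },
        },
    }
}

print(failing_axes(example))
```

Because every axis carries its own `threshold` and `source`, a reviewer can see which reference value produced a failure before deciding whether it matters for their deployment.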
Each axis reports the threshold it used and that threshold's source:
- Equity uses the four-fifths rule (disparate impact < 0.80 is a violation, < 0.50 severe) — a stated reference value.
- Calibration uses stated reference bands (ECE < 0.10 well-calibrated, > 0.20 moderate miscalibration).
- Discrimination uses an illustrative, user-configurable ΔAUC limit (delta_auc_pct_max). This is not a published decision threshold; the underlying study explicitly declines to set one.
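The thresholds above can be sketched as simple banding functions. The band labels come from the list; the formula for ΔAUC% (relative change against the reference AUC), the choice to flag only drops, and the handling of ECE values between 0.10 and 0.20 are assumptions, since the text does not specify them.

```python
def equity_band(disparate_impact):
    """Four-fifths rule: < 0.50 severe, < 0.80 violation, otherwise pass."""
    if disparate_impact < 0.50:
        return "severe"
    if disparate_impact < 0.80:
        return "violation"
    return "pass"


def calibration_band(ece):
    """Stated reference bands for ECE. The 0.10 to 0.20 range is not
    labeled in the text, so this sketch returns None there."""
    if ece < 0.10:
        return "well-calibrated"
    if ece > 0.20:
        return "moderate miscalibration"
    return None


def delta_auc_pct(auc_reference, auc_platform):
    """Relative AUC change in percent (assumed formula)."""
    return 100.0 * (auc_platform - auc_reference) / auc_reference


def discrimination_ok(auc_reference, auc_platform, delta_auc_pct_max):
    """Pass unless AUC drops by more than delta_auc_pct_max percent.
    Flagging only drops, not gains, is an assumption of this sketch."""
    return delta_auc_pct(auc_reference, auc_platform) >= -delta_auc_pct_max


print(equity_band(0.72), calibration_band(0.05))
print(delta_auc_pct(0.80, 0.76), discrimination_ok(0.80, 0.76, 10.0))
```

The `delta_auc_pct_max` value of 10.0 in the usage line is an arbitrary illustration, in keeping with the limit being user-configurable rather than a published threshold.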
Cross-platform degradation is an informative signal that warrants platform-specific validation; it is not, by itself, evidence of algorithmic bias.