A Consequentialist Critique of Binary Classification Evaluation: Theory, Practice, and Tools

Dataemia



arXiv:2504.04528v3 Announce Type: replace-cross
Abstract: Machine learning-supported decisions, such as ordering diagnostic tests or determining preventive custody, often require converting probabilistic forecasts into binary classifications. We adopt a consequentialist perspective from decision theory to argue that evaluation methods should prioritize forecast quality across thresholds and base rates. This motivates the use of proper scoring rules such as the Brier score and log loss. However, our empirical review of practices at major ML venues (ICML, FAccT, CHIL) reveals a dominant reliance on top-K metrics or fixed-threshold evaluations. To bridge this disconnect, we introduce a decision-theoretic framework that maps evaluation metrics to their appropriate use cases, accompanied by a practical Python package, briertools, which lowers the barrier to applying proper scoring rules in practice. Methodologically, we derive and implement a clipped Brier score variant that avoids full integration and better reflects bounded, interpretable threshold ranges. Theoretically, we reconcile the Brier score with decision curve analysis, directly addressing the critique by Assel et al. (2017) regarding the clinical utility of proper scoring rules.
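
To make the distinction concrete, the sketch below computes a standard Brier score alongside one plausible reading of a "clipped" variant: clipping forecasts to a bounded threshold interval before scoring, which restricts the implicit integral over decision thresholds to an interpretable range. This is an illustrative assumption, not the exact definition or API of the briertools package; the function names and the interval [t_min, t_max] are hypothetical.

import numpy as np


def brier_score(y_true, y_prob):
    """Standard Brier score: mean squared error between probabilistic
    forecasts and binary outcomes (a proper scoring rule)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return np.mean((y_prob - y_true) ** 2)


def clipped_brier_score(y_true, y_prob, t_min=0.1, t_max=0.9):
    """Illustrative clipped variant (assumption): clip forecasts to the
    threshold interval [t_min, t_max] before scoring, so forecast errors
    outside the clinically or operationally relevant threshold range do
    not dominate the score."""
    y_prob = np.clip(np.asarray(y_prob, dtype=float), t_min, t_max)
    return brier_score(y_true, y_prob)


# Usage example with synthetic, well-calibrated forecasts
rng = np.random.default_rng(0)
y_prob = rng.uniform(size=200)
y_true = rng.binomial(1, y_prob)
print("Brier score:        ", round(brier_score(y_true, y_prob), 4))
print("Clipped Brier score:", round(clipped_brier_score(y_true, y_prob, 0.05, 0.5), 4))

In this toy run, the clipped score differs from the full Brier score only through forecasts outside [0.05, 0.5], mirroring the paper's point that evaluation can be confined to decision thresholds that are actually plausible in the application.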



