Urner, Ruth
Torabian, Alireza
2023-12-08
https://hdl.handle.net/10315/41698

Title: Investigating Calibrated Classification Scores through the Lens of Interpretability
Type: Electronic Thesis or Dissertation

Abstract: Calibration is a frequently invoked concept when useful label probability estimates are required on top of classification accuracy. A calibrated model is a scoring function whose scores correctly reflect the underlying label probabilities. Calibration in itself, however, does not imply classification accuracy or human-interpretable estimates, nor is it straightforward to verify calibration from finite data. There is a plethora of evaluation metrics (and loss functions), each of which assesses a specific aspect of a calibration model. In this work, we initiate an axiomatic study of the notion of calibration and of evaluation measures for calibration. We catalogue desirable properties of calibration models as well as of evaluation metrics, and analyze their feasibility and correspondences. We complement this analysis with an empirical evaluation, comparing two metrics and comparing common calibration methods to employing a simple, interpretable decision tree.

Rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.

Subjects: Computer science; Machine learning; Calibration; Classification; Reliability; Interpretability; Theoretical machine learning
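
As a brief formal sketch of the calibration property described in the abstract (this is the standard textbook formulation, not necessarily the exact formalization used in the thesis): for a binary label $Y \in \{0,1\}$, a scoring function $f$ is calibrated if
$$
\mathbb{P}\big(Y = 1 \,\big|\, f(X) = s\big) = s \qquad \text{for every score } s \text{ in the range of } f,
$$
that is, among all instances that receive score $s$, the fraction carrying the positive label is exactly $s$.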