Publication: Optimizing the Infidelity, Sensitivity, and Complexity of Feature Importance Explanations for Machine Learning Models
Abstract
Interpretable machine learning aims to bridge the gap between complex model predictions and human understanding. Because the field is open-ended, there are many methods for achieving interpretability and many metrics for evaluating the quality of explanations. In this thesis, we survey existing work and focus on three main types of metrics, which we refer to as infidelity, sensitivity, and complexity. We explore a novel interpretability framework that balances these three objectives through a single metric of explanation quality, so that explanations accurately capture the model's behavior, remain stable across similar inputs, and avoid being needlessly hard to interpret. Specifically, we consider explanations that assign an importance score to each feature of the input. We compute the optimal explanation by minimizing this metric with an algorithm we introduce, based on coordinate descent and the Adam optimizer. We implement this algorithm and evaluate our approach on a neural network trained on the MNIST dataset.
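To make the idea of minimizing a combined metric concrete, the following is a minimal illustrative sketch and not the thesis's algorithm. Everything in it is an assumption: the toy `model` stands in for the trained network, the three terms are simple proxies (a squared-error infidelity over sampled perturbations, a squared distance between the explanations of an input and its neighbors for sensitivity, and an L1 norm for complexity), and the weights `lam_s` and `lam_c` are hypothetical. It uses a plain Adam update written in NumPy; the coordinate-descent component mentioned in the abstract is not reproduced because the abstract does not describe it.

```python
import numpy as np


def model(X, w):
    """Hypothetical stand-in for a trained network: a smooth scalar function."""
    return np.tanh(X @ w)


def loss_and_grad(Phi, X, P, w, lam_s, lam_c):
    """Weighted sum of three proxy terms and its (sub)gradient in Phi.

    Phi: (n, d) importance scores; row 0 is the input, rows 1.. its neighbors.
    X:   (n, d) the input and its neighbors.
    P:   (m, d) sampled perturbations.
    """
    n, m = X.shape[0], P.shape[0]

    # Infidelity proxy: how well P @ phi predicts the change f(x) - f(x - P).
    targets = model(X, w)[:, None] - model(X[:, None, :] - P[None, :, :], w)
    resid = Phi @ P.T - targets                      # shape (n, m)
    infidelity = np.mean(resid ** 2)
    grad = 2.0 * resid @ P / (n * m)                 # shape (n, d)

    # Sensitivity proxy: explanations of nearby inputs should stay close.
    diff = Phi[1:] - Phi[0]                          # shape (n - 1, d)
    sensitivity = np.mean(np.sum(diff ** 2, axis=1))
    grad[1:] += lam_s * 2.0 * diff / (n - 1)
    grad[0] -= lam_s * 2.0 * diff.sum(axis=0) / (n - 1)

    # Complexity proxy: L1 norm, handled with a subgradient.
    complexity = np.mean(np.sum(np.abs(Phi), axis=1))
    grad += lam_c * np.sign(Phi) / n

    return infidelity + lam_s * sensitivity + lam_c * complexity, grad


def fit_explanations(X, P, w, lam_s=1.0, lam_c=0.01, lr=0.05, steps=400,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize the combined objective with a hand-written Adam update."""
    Phi = np.zeros_like(X)
    m1 = np.zeros_like(Phi)   # running mean of gradients
    m2 = np.zeros_like(Phi)   # running mean of squared gradients
    history = []
    for t in range(1, steps + 1):
        loss, g = loss_and_grad(Phi, X, P, w, lam_s, lam_c)
        history.append(loss)
        m1 = beta1 * m1 + (1 - beta1) * g
        m2 = beta2 * m2 + (1 - beta2) * g ** 2
        m1_hat = m1 / (1 - beta1 ** t)               # bias correction
        m2_hat = m2 / (1 - beta2 ** t)
        Phi = Phi - lr * m1_hat / (np.sqrt(m2_hat) + eps)
    return Phi, history


# Toy data (all values are illustrative assumptions).
rng = np.random.default_rng(0)
w = np.array([1.0, -0.5, 0.25, 0.0, 0.75])
x = np.array([0.2, -0.1, 0.4, 0.3, -0.2])
X = np.vstack([x, x + rng.normal(scale=0.05, size=(4, 5))])
P = rng.normal(scale=0.5, size=(64, 5))

Phi, history = fit_explanations(X, P, w)
print("importance scores for x:", Phi[0])
print("objective before/after:", history[0], history[-1])
```

In this sketch, raising `lam_c` pushes scores toward zero at the cost of higher infidelity, and raising `lam_s` pulls the explanations of neighboring inputs together; how the thesis actually defines and weights these terms is not stated in the abstract.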