Explaining Explanations and Perturbing Perturbations
Citation: Jia, Emily. 2020. Explaining Explanations and Perturbing Perturbations. Bachelor's thesis, Harvard College.
Abstract: The impressive performance of machine learning models on prediction and classification tasks has sparked interest in deploying such models to make high-stakes decisions in domains such as health care and criminal justice. Due to the complexity and opacity of the models, as well as the sensitive nature of the tasks, additional explanation methods are needed to help identify bias and undesired behavior in models. LIME (Local Interpretable Model-agnostic Explanations) and SHAP (Shapley Additive exPlanations) are two widely used explanation methods that can explain the local behavior of any black-box model on a single data instance.
The first part of the thesis establishes the theoretical relationship between LIME and Kernel SHAP. We prove that Kernel SHAP is an instance of LIME that uniquely satisfies three desirable properties for feature attributions: local accuracy, missingness, and consistency. As suggested by the name of Kernel SHAP, the proof relies on a solution concept in game theory called the "Shapley value." The second part of the thesis illustrates significant vulnerabilities in LIME and Kernel SHAP that make them unreliable for bias detection. We present a scaffolded model that can adversarially attack LIME or Kernel SHAP to produce any explanation. The consistency of this attack across three datasets (COMPAS, Communities and Crime, German Credit) demonstrates that highly biased classifiers can fool perturbation-based explanations such as LIME and Kernel SHAP into producing innocuous explanations.
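To make the perturbation-based mechanism concrete, the following is a minimal sketch of a LIME-style local explanation, not the thesis code: it samples binary masks around an instance, queries a hypothetical black-box model `f` on the masked copies, weights samples by an exponential proximity kernel, and fits a weighted linear surrogate whose coefficients serve as feature attributions. The model `f`, the zero baseline for masked features, and the kernel width are illustrative assumptions.

```python
# Minimal LIME-style local surrogate (a sketch, not the thesis implementation).
# Assumptions: hypothetical black-box f, 0/1 perturbation masks, zero baseline
# for masked features, and an exponential proximity kernel.
import numpy as np

def lime_explain(f, x, n_samples=1000, kernel_width=0.75, seed=0):
    """Fit a weighted linear surrogate to f around instance x.

    Returns one attribution weight per feature of x."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Sample binary masks: 1 keeps a feature, 0 replaces it with the baseline 0.
    masks = rng.integers(0, 2, size=(n_samples, d))
    perturbed = masks * x                    # masked copies of x
    y = np.array([f(z) for z in perturbed])  # black-box predictions
    # Proximity kernel: samples closer to x (more features kept) weigh more.
    dist = 1.0 - masks.mean(axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Weighted least squares via the normal equations (X^T W X) b = X^T W y.
    X = np.hstack([np.ones((n_samples, 1)), masks])  # intercept + mask columns
    WX = X * w[:, None]
    beta = np.linalg.lstsq(WX.T @ X, WX.T @ y, rcond=None)[0]
    return beta[1:]  # drop the intercept; per-feature attributions

# Usage: a toy model that depends mostly on feature 0 should be attributed so.
f = lambda z: 3.0 * z[0] + 0.1 * z[2]
attr = lime_explain(f, np.array([1.0, 1.0, 1.0]))
```

The scaffolding attack in the thesis exploits exactly this sampling step: perturbed points are off the data manifold, so a scaffolded model can behave innocuously on them while remaining biased on real inputs.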
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364690
Collection: FAS Theses and Dissertations