Publication: TRIM: Text Replacement for Interpreting Models, A Novel Approach for Interpreting Text Classifiers
Abstract
Current local model-agnostic interpretation algorithms perturb text data by deleting random words and measuring how much the deletion changes the classifier's output. We show that random deletion breaks the grammar and structure of the text and produces data that is out of distribution for the classifier. We propose TRIM (Text Replacement for Interpreting Models) as an alternative: rather than deleting words, TRIM replaces them with "neutral" words that fit into the same text, and uses the resulting change in output to measure the contribution of the original words.
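The contrast between deletion-based and replacement-based perturbation can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_score` is a hand-written lexicon scorer standing in for a trained classifier, and `pick_neutral` returns a fixed hypothetical neutral word. The abstract does not specify how TRIM selects neutral words, so this is not the authors' implementation.

```python
from typing import Callable, Dict, List

# Hypothetical toy lexicon; a stand-in for a trained classifier.
LEXICON: Dict[str, float] = {"great": 2.0, "terrible": -2.0}


def toy_score(tokens: List[str]) -> float:
    """Toy classifier: sum of lexicon weights (assumption, not a real model)."""
    return sum(LEXICON.get(t, 0.0) for t in tokens)


def pick_neutral(tokens: List[str], i: int) -> str:
    """Hypothetical neutral-word chooser; always returns a fixed word."""
    return "okay"


def deletion_contributions(
    tokens: List[str], score: Callable[[List[str]], float]
) -> List[float]:
    """Contribution of each word = score drop when that word is deleted."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def replacement_contributions(
    tokens: List[str],
    score: Callable[[List[str]], float],
    neutral: Callable[[List[str], int], str],
) -> List[float]:
    """Contribution of each word = score drop when it is swapped for a neutral word."""
    base = score(tokens)
    out = []
    for i in range(len(tokens)):
        # The perturbed text keeps its length and word order.
        perturbed = tokens[:i] + [neutral(tokens, i)] + tokens[i + 1:]
        out.append(base - score(perturbed))
    return out


tokens = "the movie was great".split()
print(deletion_contributions(tokens, toy_score))
print(replacement_contributions(tokens, toy_score, pick_neutral))
```

On this order-insensitive toy scorer the two methods return the same numbers, so the sketch shows only the mechanics. The difference the abstract describes would show up with a classifier that is sensitive to grammar and sentence structure, where deleted-word inputs fall outside the training distribution.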
We train a classifier on two different classification tasks, Sentiment Analysis and Question-Answering Classification, and interpret it using both current algorithms and TRIM. We show that TRIM is better at estimating word contribution in complex contexts and in the presence of multiple important words. In addition, we use QUACKIE to evaluate the two interpretation algorithms and find that TRIM outperforms the baseline in most settings, improving accuracy by up to 4.5%.