Publication: Two-Stream Transformer Architecture With Discrete Attention for Better Interpretability and Separation of Model Concerns
Abstract
The transformer has become a central model for natural language processing tasks ranging from translation to classification to representation learning. Its success demonstrates the effectiveness of stacked attention as a replacement for recurrence in many tasks. Attention is broadly interpreted as selectively attending to different parts of an input. In theory, attention therefore offers more insight into the model’s internal decisions; in practice, however, stacked attention quickly becomes nearly fully connected, making it hard to disentangle the dependencies behind a final decision. In this work, we propose an alternative transformer architecture, the discrete transformer, with the goal of improving model interpretability. We use discrete latent variable attention to ensure that each decision step depends only on a limited context. We separate attention decisions from representation modeling by using a separate stream for each. Empirically, on both classification and translation tasks, this approach performs comparably to the standard transformer on several datasets, while obtaining quantitatively better attention interpretability and separating out syntactic features in the learned representations. Finally, our two-stream formulation can be used to transfer knowledge in a multiview arithmetic evaluation task.
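To make the contrast in the abstract concrete, the following is a minimal sketch of soft attention versus a generic form of hard (discrete) attention. It is an illustration only, not the architecture proposed in this work: the abstract does not say how the discrete latent variable is sampled or trained, so the sampling scheme and the names `soft_attention` and `discrete_attention` are assumptions made for this example.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(scores, values):
    """Standard attention: the output mixes every value vector."""
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


def discrete_attention(scores, values, rng):
    """Hard attention (assumed scheme): sample one position from the
    softmax distribution and read only that value vector."""
    weights = softmax(scores)
    idx = rng.choices(range(len(values)), weights=weights, k=1)[0]
    return list(values[idx]), idx


scores = [2.0, 0.5, -1.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

soft_out, weights = soft_attention(scores, values)
hard_out, idx = discrete_attention(scores, values, random.Random(0))
```

In the soft case every input position contributes to `soft_out`, so stacking layers makes each output depend on nearly all inputs. In the hard case `hard_out` is exactly one input vector and `idx` records which position was read, which is the sense in which a decision step depends on a limited context.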