
Multidimensional Attention Masks: Towards Performant and Provable Order Independence in LLMs


Date

2025-05-22


The Harvard community has made this article openly available.


Citation

Brown, Katrina. 2025. Multidimensional Attention Masks: Towards Performant and Provable Order Independence in LLMs. Bachelor's Thesis, Harvard University Engineering and Applied Sciences.


Abstract

The development of generative language models that can create long and coherent textual outputs via autoregression has led to a proliferation of uses and a corresponding sweep of analyses as researchers work to determine the limitations of this new paradigm. Unlike humans, these ‘Large Language Models’ (LLMs) are highly sensitive to small changes in their input, leading to unwanted inconsistency in their behavior. One problematic inconsistency when LLMs are used to answer multiple-choice questions or analyze multiple inputs is order dependency: the output of an LLM can (and often does) change significantly when sub-sequences are swapped, despite both orderings being semantically identical. In this paper we present Multidimensional Attention Mask, a technique that guarantees the output of an LLM will not have order dependence on a specified set of sub-sequences. We show that this method provably mitigates order dependency, and that it can be applied to any transformer-based LLM to enable text generation that is unaffected by re-orderings. Delving into the implications of the method, we show that, despite our inputs being out of distribution, the impact on expected accuracy is small, where the expectation is over the order of the candidate responses. Thus, Multidimensional Attention Mask can be used as a ‘drop-in’ method on fully trained models. Finally, we introduce a fine-tuning strategy that integrates SBP into the training process, “pulling” these set-formatted prompts closer to the model’s training manifold in order to address the performance degradation induced by the drop-in Multidimensional Attention Mask approach. Experiments demonstrate that SBP fine-tuning significantly improves accuracy and robustness to answer-order permutations, all while preserving broader language modeling capabilities. We discuss the broader implications of order-invariant modeling and outline future directions for building fairer, more consistent LLMs.
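The general idea of an order-independence attention mask can be illustrated with a short sketch. This is not the thesis's exact implementation; the layout convention (a shared prefix, a set of parallel option sub-sequences, and a suffix) and the function name are assumptions made for illustration. The key property is that tokens in different option sub-sequences cannot attend to one another, so the attention pattern each option sees is unchanged when the options are reordered.

```python
import numpy as np

def order_independent_mask(prefix_len, option_lens, suffix_len):
    """Return a boolean mask M where M[i, j] = True means token i may attend to token j.

    Assumed layout: [prefix | option_1 | ... | option_k | suffix].
    - Prefix tokens: ordinary causal attention.
    - Option tokens: attend causally to the prefix and within their own option only.
    - Suffix tokens: attend to the prefix, all options, and causally within the suffix.
    """
    total = prefix_len + sum(option_lens) + suffix_len
    mask = np.tril(np.ones((total, total), dtype=bool))  # start from a causal mask

    # Compute the [start, end) span of each option sub-sequence.
    spans, pos = [], prefix_len
    for n in option_lens:
        spans.append((pos, pos + n))
        pos += n

    # Remove all cross-option attention: option i never sees option j (i != j).
    for (s1, e1) in spans:
        for (s2, e2) in spans:
            if (s1, e1) != (s2, e2):
                mask[s1:e1, s2:e2] = False
    return mask
```

Because options of equal length produce byte-identical masks under any permutation of the option blocks (paired, in a full system, with shared position indices for the parallel segments), no option's representation can depend on where it appears in the list.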


Keywords

Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
