Publication: Mechanistic Control of Language Models
Abstract
This dissertation examines mechanistic control of large language models (LLMs). The intuition is that, to control large language models rigorously, it helps to first understand their internal mechanisms. Although a fully mechanistic understanding of language models is far from complete at the time of writing, I demonstrate several cases where such understanding provides a useful surface for control. I argue that connecting interpretability to real-world applications, such as alignment and safety, benefits the development of both fields.
I first position this methodological approach within a broader scientific context, discussing and systematizing the role of interpretability in artificial intelligence research. I then present four in-depth case studies, each leveraging specific interpretability insights to control the behavior of language models.
The interpretability insights I explore span emergent world representations, representations of internal knowledge, the phenomenon of attention decay, and representations of users. The demonstrated effects of control include targeted editing of the world model, reduced hallucination, improved stability of system prompts in long conversations, and better planning in multi-turn dialogue.