Publication:

Mechanistic Control of Language Models

Date

2025-05-12

The Harvard community has made this article openly available.

Citation

Li, Kenneth. 2025. Mechanistic Control of Language Models. Doctoral Dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

This dissertation examines mechanistic control of large language models (LLMs). The intuition is that, to control large language models rigorously, it helps to first understand their internal mechanisms. Although a fully mechanistic understanding of language models remains far from complete at the time of writing, I demonstrate several cases where such understanding provides a useful surface for control. I argue that connecting interpretability to real-life applications, such as alignment and safety, benefits the development of both fields.

I first position this methodological approach within a broader scientific context, discussing and systematizing the role of interpretability in artificial intelligence research. I then present four intensive case studies, each leveraging certain interpretability insights to control the behavior of language models.

The interpretability insights I explore span emergent world representations, the representation of internal knowledge, the phenomenon of attention decay, and the representation of users. The demonstrated effects of control include targeted editing of the world model, reduced hallucination, improved stability of system prompts in long conversations, and better planning in multi-turn dialogue.

Keywords

Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
