Publication:

Towards Principled AI Alignment: An Evaluation and Augmentation of Inverse Constitutional AI

Date

2025-05-22

The Harvard community has made this article openly available.

Citation

An, Esther. 2025. Towards Principled AI Alignment: An Evaluation and Augmentation of Inverse Constitutional AI. Bachelor's thesis, Harvard University Engineering and Applied Sciences.

Abstract

The accelerated pace of development of advanced AI systems motivates examining whether such systems are actually aligned with the intent of their human designers and users. The notion of human intent, however, remains highly ambiguous: there is ongoing debate over whether large language models (LLMs) should be aligned to demonstrated behavior, communicated through expressed human preferences, or to abstract normative principles defined by collective deliberation. Methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) exemplify alignment approaches that depend on communicating human preferences. The formalization of Constitutional AI (CAI), which leverages reinforcement learning from AI feedback (RLAIF) and communicates a set of normative principles to direct an LLM's behavior, motivates the development of scalable approaches that bridge the gap between standard alignment methods built on human preference data and interpretable, principled AI alignment. Inverse Constitutional AI (ICAI) is a recent framework that aims to learn, from human preference data, a constitution that can serve both as an interpretable compression of human preferences and as an instructive set of principles for aligning LLM behavior.

In this thesis, we present an expanded implementation of the ICAI framework that addresses the desideratum of representing varied human preferences by electing a principle committee with algorithms drawn from the theory of approval voting in social choice. We describe metrics for evaluating the efficacy of a constitution constructed through the framework and provide a method for systematically evaluating every component of the pipeline. We hope that this improved formalization of principled model alignment will contribute to the development and deployment of more interpretable and better-aligned AI systems.
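The abstract does not specify which approval-voting rule the thesis uses to elect the principle committee. As one illustrative possibility, the sketch below elects a committee with a greedy approximation to the Chamberlin-Courant rule, a standard multi-winner approval rule that favors broad representation over raw approval counts. The function name, the set-of-strings ballot representation, and the alphabetical tie-breaking are assumptions for illustration, not the thesis's actual implementation.

```python
from collections import Counter

def elect_principle_committee(ballots, k):
    """Elect a k-member committee of candidate principles from approval
    ballots (one set of approved principles per annotator), using a greedy
    approximation to the Chamberlin-Courant rule: at each step, add the
    principle approved by the most voters who do not yet approve of any
    committee member."""
    committee = []
    uncovered = list(range(len(ballots)))  # voters not yet represented
    candidates = sorted({c for ballot in ballots for c in ballot})
    for _ in range(k):
        remaining = [c for c in candidates if c not in committee]
        if not remaining or not uncovered:
            break
        # Count approvals among currently unrepresented voters only.
        counts = Counter(c for i in uncovered for c in ballots[i])
        best = max(remaining, key=lambda c: counts[c])  # ties break alphabetically
        committee.append(best)
        uncovered = [i for i in uncovered if best not in ballots[i]]
    return committee

# Four annotators approve subsets of candidate principles P1-P3.
ballots = [{"P1", "P2"}, {"P1"}, {"P3"}, {"P2", "P3"}]
print(elect_principle_committee(ballots, 2))  # ['P1', 'P3'] covers all four voters
```

Unlike plain approval voting (take the k most-approved principles), the coverage-driven greedy step keeps a minority viewpoint from being shut out of the committee whenever its approved principle represents otherwise unrepresented annotators.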

Keywords

Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
