Publication: Towards Principled AI Alignment: An Evaluation and Augmentation of Inverse Constitutional AI
Abstract
The accelerated pace of development of advanced AI systems motivates examining whether such systems are actually aligned with the intent of their human designers and users. The notion of human intent, however, remains highly ambiguous, with ongoing debate over whether large language models (LLMs) should be aligned to demonstrated behavior, communicated through expressed human preferences, or to abstract normative principles defined by collective deliberation. Methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) exemplify alignment approaches that depend on communicating human preferences. Constitutional AI (CAI), which leverages reinforcement learning from AI feedback (RLAIF) and directs an LLM's behavior through a communicated set of normative principles, motivates the development of scalable approaches that bridge the gap between standard alignment methods built on human preference data and interpretable, principled AI alignment. Inverse Constitutional AI (ICAI) is a recent framework that aims to learn a constitution from human preference data, serving both as an interpretable compression of human preferences and as an instructive set of principles for aligning LLM behavior.
In this thesis, we present an expanded implementation of the ICAI framework that addresses the desideratum of representing varied human preferences by electing a principle committee with algorithms drawn from the theory of approval voting in social choice. We describe metrics for evaluating the efficacy of a constitution constructed through the framework and provide a method for systematically evaluating every component of the pipeline. We hope that this improved formalization of principled model alignment will contribute to the development and deployment of more interpretable and better-aligned AI systems.
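The committee election mentioned above can be illustrated with a minimal sketch. The thesis does not specify which approval-voting rule it adopts, so the example below assumes the simplest multi-winner rule from social choice theory (utilitarian approval voting: pick the k principles with the most approvals); the function name, ballot format, and example principles are all hypothetical.

```python
# Hypothetical sketch of electing a k-member "principle committee" via
# multi-winner approval voting. This is NOT the thesis's actual algorithm;
# it illustrates the utilitarian (top-k approvals) rule, where each ballot
# is the set of candidate principles one annotator (or preference cluster)
# approves of.
from collections import Counter

def elect_committee(approval_ballots, k):
    """Return the k principles with the highest approval counts.

    approval_ballots: list of sets of principle strings.
    Ties are broken alphabetically for determinism.
    """
    counts = Counter()
    for ballot in approval_ballots:
        counts.update(ballot)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [principle for principle, _ in ranked[:k]]

# Toy ballots over three candidate principles.
ballots = [
    {"be honest", "be concise"},
    {"be honest", "avoid harm"},
    {"avoid harm", "be concise"},
    {"be honest"},
]
print(elect_committee(ballots, 2))  # → ['be honest', 'avoid harm']
```

Richer rules from the approval-voting literature (e.g., proportional approval voting) trade this simple popularity criterion for proportional representation of minority preference clusters, which is closer in spirit to the representation desideratum the thesis targets.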