Publication:

Deduction-Projection Estimators for Understanding Neural Networks

Date

2025-05-16

Citation

Wu, Gabriel. 2025. Deduction-Projection Estimators for Understanding Neural Networks. Bachelor's thesis, Harvard University Engineering and Applied Sciences.

Abstract

We introduce Deduction-Projection Estimators (DPEs): a family of methods for measuring properties of neural networks inspired by the notion of a "deductive heuristic estimator" introduced in Christiano et al. (2022). Unlike traditional techniques used in machine learning, a DPE produces its estimate by mechanistically tracking how activations are processed throughout a neural network. This allows us to understand how a model behaves over an entire input distribution without having to generalize from observed behavior on a finite number of sampled inputs.

The first half of this thesis deals with the philosophy and theory of deductive estimation: how it differs from inductive reasoning, examples of deductive estimators in mathematics and machine learning, and why we might expect concise deductive explanations to exist and be tractable to find in the first place. We also discuss how efficient deductive estimators might be used to scalably control the worst-case behaviors of AI systems.

In the second half, we empirically evaluate DPEs against traditional machine learning methods. First, we introduce the problem of low probability estimation: given a transformer language model and a formally specified input distribution, can we estimate the probability that it generates a particular output, even when this probability is too small to detect with naive sampling? We develop activation extrapolation methods based on simplified DPEs, which empirically outperform naive sampling. However, the highest-performing estimators leverage importance sampling, which can be thought of as a generalization of adversarial training.

Finally, we explore how our techniques can be used to optimize small neural network classifiers to achieve maximal accuracy on an algorithmic task. Our experimental results show that DPEs often outperform cross-entropy loss as an optimization target by providing a stronger gradient signal towards maximizing accuracy. We discuss the limitations of our empirical settings and list future lines of work.

Keywords

AI alignment, deductive estimation, low probability estimation, machine learning, mechanistic interpretability, Computer science, Mathematics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
