Publication:

Problems in High-Dimensional Estimation and Large Language Models

Date

2026-01-05

The Harvard community has made this article openly available.

Citation

Li, Xiaomin. 2026. Problems in High-Dimensional Estimation and Large Language Models. Doctoral Dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

This dissertation investigates critical problems at the intersection of high-dimensional statistics and the rapidly advancing field of large language models (LLMs), forging a narrative that bridges foundational theory with state-of-the-art applications. The work is presented in two interconnected parts, unified by the theme that principles of high-dimensional estimation provide a powerful framework for addressing key challenges in modern artificial intelligence.

The first part establishes a rigorous theoretical foundation for high-dimensional estimation. We present a sharp asymptotic analysis of a spectral method, inspired by Principal Hessian Directions, for learning multi-index models from nonlinear measurements. In a high-dimensional regime where data and signal dimensions grow proportionally, our analysis reveals a distinct phase transition phenomenon. We derive a set of deterministic fixed-point equations that precisely characterize the method's performance, offering an exact quantification of the alignment between the estimated and true subspaces. This theoretical contribution extends prior work from single-signal to multi-signal recovery, deepening our understanding of learning and signal processing in high-dimensional spaces.
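As a rough illustration of the kind of spectral estimator described above, the sketch below builds a response-weighted second-moment matrix in the spirit of Principal Hessian Directions and takes its leading eigenvectors as the estimated subspace. It is not the dissertation's method or analysis. The quadratic link function, the noise level, the dimensions, and the names `phd_subspace`, `subspace_alignment`, and `demo` are all assumptions chosen for illustration.

```python
import numpy as np

def phd_subspace(X, y, k):
    """Estimate a k-dimensional index subspace from M = (1/n) * sum_i (y_i - mean(y)) x_i x_i^T."""
    n = X.shape[0]
    yc = y - y.mean()                      # center the responses
    M = (X * yc[:, None]).T @ X / n        # response-weighted second-moment matrix
    eigvals, eigvecs = np.linalg.eigh(M)   # M is symmetric, so eigh applies
    top = np.argsort(-np.abs(eigvals))[:k] # keep the k eigenvalues largest in magnitude
    return eigvecs[:, top]

def subspace_alignment(U_hat, U):
    """Return ||U_hat^T U||_F^2 / k; equals 1 when the two subspaces coincide."""
    return np.linalg.norm(U_hat.T @ U, "fro") ** 2 / U.shape[1]

def demo(n=20000, d=20, k=2, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical ground truth: k orthonormal signal directions in R^d.
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))
    X = rng.standard_normal((n, d))        # Gaussian measurements (assumed)
    Z = X @ U                              # projections onto the signal directions
    # Assumed nonlinear link: a difference of squares plus small noise.
    y = Z[:, 0] ** 2 - Z[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
    U_hat = phd_subspace(X, y, k)
    return subspace_alignment(U_hat, U)

alignment = demo()
print(f"subspace alignment: {alignment:.3f}")
```

With many more samples than dimensions, as in this toy setting, the alignment should be close to 1. The proportional regime studied in the dissertation, where the sample size and dimension grow together, is where the phase transition would appear, and this sketch does not reproduce it.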

The second part of this dissertation transitions from theory to practice, demonstrating how the mathematical rigor developed in the first part can be leveraged to solve pressing challenges in the development and deployment of LLMs. We introduce three novel frameworks. First, we propose a principled method for the Selection of LLM Fine-Tuning Data based on Orthogonal Rules, which uses the Determinantal Point Process (DPP) to select a diverse and non-redundant set of data quality metrics. This approach, grounded in the concept of orthogonality, significantly improves the efficiency and performance of model fine-tuning across multiple domains. Second, we introduce RuleAdapter, a dynamic framework for training multi-attribute reward models in Reinforcement Learning from Human Feedback (RLHF). Motivated by information theory, RuleAdapter adaptively selects the most critical safety rules for each context, leading to state-of-the-art safety performance and demonstrably more trustworthy LLMs. Third, we propose Semantic Volume, a novel, unsupervised geometric measure for quantifying and detecting both internal (model-based) and external (query-based) uncertainty in LLMs. By linking this measure to differential entropy, we provide a robust and interpretable method to enhance model reliability and mitigate hallucinations.
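To make the determinant-based notion of diversity concrete: a Determinantal Point Process assigns a subset S a probability proportional to det(L_S), where L_S is the sub-block of a similarity kernel L indexed by S, and for a Gram kernel this determinant is the squared volume spanned by the selected vectors. The sketch below greedily picks items that maximize this log-determinant. It is a simple greedy stand-in under stated assumptions, not the dissertation's selection procedure. The cosine-similarity kernel, the `jitter` term, the toy feature vectors, and the name `greedy_logdet_selection` are hypothetical.

```python
import numpy as np

def greedy_logdet_selection(features, k, jitter=1e-8):
    """Greedily pick k rows whose kernel submatrix has the largest log-determinant."""
    # Assumed kernel: cosine similarity between row-normalized feature vectors.
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    L = F @ F.T + jitter * np.eye(len(F))  # jitter keeps submatrices positive definite
    selected = []
    for _ in range(k):
        best_i, best_val = None, -np.inf
        for i in range(len(F)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best_i, best_val = i, logdet
        selected.append(best_i)
    return selected

# Toy example: rows 0 and 1 are nearly parallel (redundant); rows 2 and 3 are distinct.
rules = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
chosen = greedy_logdet_selection(rules, k=3)
print(sorted(chosen))
```

In this toy example the selection keeps the two distinct rows and only one of the two nearly parallel rows, since adding the second would contribute almost no extra volume.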

Collectively, this dissertation demonstrates that a deep understanding of high-dimensional systems is not merely a theoretical pursuit but an essential tool for building more robust, trustworthy, and efficient large language models. The presented research offers new theoretical insights into high-dimensional learning and delivers practical, mathematically grounded methodologies that advance the state of the art in the responsible development of artificial intelligence.

Keywords

Data Selection, High-Dimensional Estimation, Large Language Models, Multi-Index Models, Spectral Methods, Uncertainty Quantification, Applied mathematics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the repository's Terms of Service.
