Publication: Large Scale Inference and Combinatorial Variable Selection for Complex Dataset
Date
2024-04-30
Citation
Liu, Yue. 2024. Large Scale Inference and Combinatorial Variable Selection for Complex Dataset. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
This dissertation advances modern statistical theory and methodology in two primary areas: first, quantifying uncertainty beyond mere estimation in combinatorial inference theory; and second, addressing the complexities and challenges inherent in electronic health records (EHR).
Chapter 1 introduces a novel combinatorial inference framework to conduct general uncertainty quantification in ranking problems. By considering the Bradley-Terry-Luce model, we aim to infer both local and global ranking properties, and generalize the method to multiple testing problems with false discovery rate (FDR) control.
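As a reminder of the model named above, here is a minimal sketch of the standard Bradley-Terry-Luce pairwise comparison probability. It does not reproduce the dissertation's inference procedure. The function name `btl_win_prob` and the latent scores are hypothetical, chosen only for illustration.

```python
import math


def btl_win_prob(theta_i: float, theta_j: float) -> float:
    """P(item i beats item j) under the Bradley-Terry-Luce model.

    Equals exp(theta_i) / (exp(theta_i) + exp(theta_j)), written in a
    numerically convenient logistic form.
    """
    return 1.0 / (1.0 + math.exp(theta_j - theta_i))


# Hypothetical latent scores for three items (illustrative values only).
scores = {"A": 1.2, "B": 0.4, "C": -0.3}

# Ranking induced by the latent scores, best item first.
ranking = sorted(scores, key=scores.get, reverse=True)

# Probability that A beats B under these assumed scores.
p_ab = btl_win_prob(scores["A"], scores["B"])
```

In practice the latent scores are unknown and must be estimated from observed comparisons, which is why statements about the ranking carry uncertainty that needs to be quantified.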
Chapter 2 focuses on the development of a semi-supervised approach that efficiently leverages sizable unlabeled samples with error-prone EHR surrogate outcomes from multiple local sites to improve the learning accuracy of the small gold-labeled data. We apply our method to develop a high-dimensional genetic risk model for type II diabetes using large-scale data sets from the UK and Mass General Brigham biobanks, where only a small fraction of subjects in one site have been labeled via chart review.
Chapter 3 presents a novel inferential framework for general graphical models to select graph features with false discovery rate control.
The proposed method is based on the maximum of $p$-values from the single edges that comprise the topological feature of interest, and is thus able to detect weak signals.
Moreover, we introduce the $K$-dimensional persistent Homology Adaptive selectioN (KHAN) algorithm to select all the homological features within $K$ dimensions with the uniform control of the false discovery rate over continuous filtration levels. The KHAN method applies a novel discrete Gram-Schmidt algorithm to select statistically significant generators from the homology group.
We apply the structural screening method to identify the important residues of the SARS-CoV-2 spike protein during the binding process to the ACE2 receptors. We score the residues for all domains in the spike protein by the $p$-value weighted filtration level in the network persistent homology for the closed, partially open, and open states, and identify the residues that are crucial for protein conformational changes and are thus potential targets for inhibition.
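The idea in Chapter 3 of combining single-edge $p$-values by taking their maximum can be illustrated with a minimal sketch. It assumes each feature (for example, a triangle) is declared present only if all of its edges are. The selection step uses the standard Benjamini-Hochberg step-up procedure as a stand-in; it is not the dissertation's KHAN algorithm or its FDR procedure. All names and $p$-values here are hypothetical.

```python
def feature_p_value(edge_p_values):
    """Combine the edge p-values of one feature by taking their maximum."""
    return max(edge_p_values)


def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the standard Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose sorted p-value falls under its threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            k = rank
    return sorted(order[:k])


# Hypothetical edge p-values for three candidate triangles.
features = {
    "triangle_1": [0.001, 0.004, 0.002],
    "triangle_2": [0.03, 0.20, 0.01],
    "triangle_3": [0.008, 0.011, 0.015],
}

names = list(features)
p_feat = [feature_p_value(features[n]) for n in names]
selected = [names[i] for i in benjamini_hochberg(p_feat, q=0.05)]
```

With these assumed inputs, the feature-level $p$-values are 0.004, 0.20, and 0.015, and the stand-in procedure selects `triangle_1` and `triangle_3` at level 0.05.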
Keywords
Statistics, Biostatistics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service