Publication: Large Scale Inference with Theoretical Guarantees under Various Distributed Settings
Abstract
Large-scale inference under distributed settings is an essential problem in machine learning and statistical inference. In many applications, the amount of data is too large to be processed by a single machine, requiring distributed computing to achieve scalable and efficient computation. Distributed inference involves splitting the data and computations across multiple machines, which brings challenges in terms of communication, computation, and synchronization.
One important area in which distributed data settings pose challenges is principal component analysis (PCA), a widely used technique for dimensionality reduction and feature extraction. Given the growing dimensionality and increasing sample sizes of modern big data, the traditional PCA algorithm is computationally expensive and does not scale well to large datasets. Moreover, in federated ecosystems, the traditional PCA method is often not applicable because of privacy protection considerations and the large communication burden. Algorithms have been proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size under the distributed setting. In Chapter 1, we propose the FAst DIstributed (FADI) PCA method for federated data when both the dimension d and the sample size n are ultra-large, by simultaneously performing parallel computing along d and distributed computing along n. Specifically, we utilize L parallel copies of p-dimensional fast sketches to divide the computing burden along d and aggregate the results distributively along the split samples. We present FADI under a general framework applicable to multiple statistical problems, and establish comprehensive theoretical results under the general framework. We show that FADI enjoys the same non-asymptotic error rate as the traditional PCA when Lp ≥ d. We also derive inferential results that characterize the asymptotic distribution of FADI, and show a phase-transition phenomenon as Lp increases. We perform extensive simulations to show that FADI substantially outperforms the existing methods in computational efficiency while preserving accuracy, and validate the distributional phase-transition phenomenon through numerical experiments. We apply FADI to the 1000 Genomes data to study the population structure.
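The sketch-and-aggregate idea described for Chapter 1 can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: the function name `fadi_style_pca`, the use of Gaussian sketching matrices, and the averaging of projection matrices as the aggregation step are all assumptions made for illustration. Each site holds a block of rows of the data, shares only d × p sketched summaries, and L independent sketches are combined at the end.

```python
import numpy as np

def fadi_style_pca(site_blocks, K, p, L, seed=0):
    """Illustrative sketch-and-aggregate PCA (hypothetical, not the thesis code).

    site_blocks: list of (n_i x d) arrays, one per site (data split along n).
    K: number of principal components; p: sketch dimension (p >= K);
    L: number of independent sketches (parallel along d).
    """
    rng = np.random.default_rng(seed)
    d = site_blocks[0].shape[1]
    n = sum(X.shape[0] for X in site_blocks)
    avg_proj = np.zeros((d, d))  # formed explicitly only for clarity at small d
    for _ in range(L):
        # Assumed Gaussian sketch shared by all sites for this copy.
        omega = rng.standard_normal((d, p))
        # Each site contributes X_i^T (X_i omega); only d x p matrices are shared.
        Y = sum(X.T @ (X @ omega) for X in site_blocks) / n
        # Top-K left singular vectors of the sketched covariance.
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        V = U[:, :K]
        avg_proj += (V @ V.T) / L
    # Aggregate: leading K eigenvectors of the averaged projection matrix.
    _, eigvecs = np.linalg.eigh(avg_proj)
    return eigvecs[:, -K:][:, ::-1]
```

Forming the d × d averaged projection is only practical for small d; at the scale the abstract targets, one would aggregate without materializing that matrix.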
Another important problem in large-scale inference is federated variable selection, where the goal is to select important features from distributed datasets without sharing the raw data. Federated variable selection has become increasingly popular in applications such as healthcare, where data privacy and security are of utmost importance. Recent advances in federated learning have led to the development of new distributed algorithms, such as federated Lasso and federated elastic net, which allow for efficient variable selection in a distributed setting. In Chapter 2, we propose a novel l0l1-regularized variable selection method via contrastive learning for high-dimensional federated data. Specifically, we consider the regime where the data dimension d is much larger than the sample size n, and assume the data to be vertically distributed along the dimension across multiple machines. By combining a supervised regression loss with an unsupervised contrastive loss, we simultaneously learn a low-dimensional mapping matrix that embeds the data (representation learning) and the regression coefficients on the embedded data (supervised learning). We present an alternating algorithm that solves for the optimizers distributively with high computational efficiency, and provide theoretical guarantees on both the accuracy of the representation and the statistical rates of the regression estimator. Our results are supported by numerical simulations, in which our method performs similarly to the traditional l0l1-regularized regression in prediction and variable selection at a much lower computational cost.
Finally, doubly robust inference is another important topic in large-scale inference, as regularized learning in high-dimensional data usually induces non-negligible bias. In Chapter 3, we propose a doubly robust method that modifies the sequence kernel association test (SKAT) under the high-dimensional setting. Specifically, we propose a sequence kernel log-likelihood ratio test (SKiRT) statistic by constructing a Neyman orthogonal score for the sequence kernel log-likelihood. Our method is rate doubly robust in the sense that it only requires the product of the convergence rates for two nuisance estimators to be of order o(n^{-1/2}). We provide theoretical guarantees that under the high-dimensional regime, our proposed SKiRT statistic still converges in distribution to a mixture of chi-square distributions. Our proof techniques differ from those of the previous doubly robust literature in that our statistic is not a summation of i.i.d. data due to the inserted kernel, and we do not need sample splitting of the data. We perform simulation studies to validate our theoretical results and evaluate the performance of our method under different settings.
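To make the limiting distribution concrete, the following is a minimal Python sketch of how a p-value could be obtained once a statistic is known to converge to a weighted mixture of independent chi-square variables with one degree of freedom. The function name `chi2_mixture_pvalue` and the `weights` argument are hypothetical; in SKAT-type tests the weights are typically eigenvalues of a kernel-dependent matrix, and the Monte Carlo approximation used here is a simple stand-in for exact numerical methods, not the procedure used in the dissertation.

```python
import numpy as np

def chi2_mixture_pvalue(stat, weights, n_draws=200_000, seed=0):
    """Monte Carlo tail probability P(sum_j w_j * chi2_1 >= stat).

    stat: observed test statistic (hypothetical input).
    weights: nonnegative mixture weights w_j (assumed known or estimated).
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    # Each row is one draw of independent chi-square(1) variables, one per weight.
    draws = rng.chisquare(df=1, size=(n_draws, weights.size)) @ weights
    return float(np.mean(draws >= stat))
```

With a single unit weight the mixture reduces to an ordinary chi-square distribution with one degree of freedom, which gives a quick sanity check on the approximation.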