Publication: On Statistical Learning for Structural Data: Data Fusion and Semi-supervised Learning
Open/View Files
Date
Authors
Published Version
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
Citation
Abstract
Nowadays, with the rise of the large data era, data tends to become more complex and structural. The data may incorporate various different sources, with distinct data quality, sample size, or set of covariates. For instance, in ecological inference and validation studies of epidemiology, it is common that the predictor and response of interest are gathered in different datasets. However, many statistical estimands of interest (e.g., in regression or causality) are functions of the joint distribution of multiple random variables. In this scenario, the only possible approach is one of data fusion, where multiple independent data sets, each measuring a subset of the random variables of interest, are combined for inference. In general, since all random variables are never observed jointly, their joint distribution, and hence also the estimand which is a function of it, is only partially identifiable. Unfortunately, the endpoints of the partially identifiable region depend in general on entire conditional distributions, rendering them hard both operationally and statistically to estimate. Inspired by this challenge, in the first chapter, we present a novel outer-bound on the region of partial identifiability (and establish conditions under which it is tight) that depends only on certain conditional first and second moments. This allows us to derive semiparametrically efficient estimators of our endpoint outer-bounds that only require the standard machine learning toolbox which learns conditional means. We prove asymptotic normality and semiparametric efficiency of our estimators and provide consistent estimators of their variances, enabling asymptotically valid confidence interval construction for our original partially identifiable estimand. We demonstrate the utility of our method in simulations and a data fusion problem from economics.
Beyond multi-source datasets, specialized formats-—such as network data-—are also a crucial component of structured data. With the large amount of networks and graphs in the modern data age, such as social networks (Facebook, LinkedIn, etc.), citation networks, and medical networks (e.g., propagation networks of treatment, infection networks of viruses), a proper theory for unveiling the underlying network sub-structures like communities becomes increasingly essential. Motivated by social network analysis and network-based recommendation systems, in the second chapter, we study a semi-supervised community detection problem in which the objective is to estimate the community label of a new node using the network topology and partially observed community labels of existing nodes. The network is modeled using a degree-corrected stochastic block model, which allows for severe degree heterogeneity and potentially non-assortative communities. We propose an algorithm that computes a `structural similarity metric' between the new node and each of the K communities by aggregating labeled and unlabeled data. The estimated label of the new node corresponds to the value of k that maximizes this similarity metric. Our method is fast and numerically outperforms existing semi-supervised algorithms. Theoretically, we derive explicit bounds for the misclassification error and show the efficiency of our method by comparing it with an ideal classifier. Our findings highlight, to the best of our knowledge, the first semi-supervised community detection algorithm that offers theoretical guarantees.
An extension of community detection is mixed membership estimation (MME), which is a classical problem in network data analysis. It extends community detection by allowing a node to have fractional memberships in multiple communities. One major challenge in mixed membership estimation is to discover the underlying simplex structure of the high-dimensional data, which corresponds to the membership of each individual. Previous literature mainly focuses on leveraging the data itself to identify the simplex in an unsupervised learning fashion. However, in many scenarios, prior information may be available. For instance, the past user history of certain individuals in a social network may provide a clue to their membership. To incorporate this class of knowledge, in the third chapter, we consider a semi-supervised setting where the membership vectors of a subset of nodes are given. This problem is significantly more challenging than semi-supervised community detection, and to the best of our knowledge, it has not been studied in most previous literature. We discover an insightful structural equation for utilizing the known labels, which inspires a delicate way of extending unsupervised mixed-membership estimation algorithms to the semi-supervised setting. Compared to unsupervised algorithms for identifying the vertex structure, our method does not require the existence of pure nodes and needs fewer regularity conditions on the vertices. Additionally, assuming a degree-corrected mixed membership model, we provide theoretical guarantees of our algorithm, and show that it is efficient in the sense that its error rate is the same as a least squares problem with more prior information on the vertices. To the best of our knowledge, this is the first semi-supervised vertex hunting algorithm with a theoretical guarantee. We also demonstrate the excellent performance of our algorithm in several empirical studies, illustrating that with only a tiny fraction of the label information, our method can dramatically outperform unsupervised algorithms.