Statistical Methods for Network Data
Citation: Larson, Jonathan. 2021. Statistical Methods for Network Data. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract: This dissertation consists of three chapters, each of which proposes and evaluates a new statistical method for network data.
Chapter 1: Black men who have sex with men (MSM) in the U.S. are more likely to be HIV-positive than White MSM. Intentional and unintentional segregation of Black from non-Black MSM in sex partner meeting places may perpetuate this disparity, a fact that is ignored by current HIV risk indices, which mainly focus on individual behaviors and not systemic factors. This paper capitalizes on recent studies in which the venues where MSM meet their sex partners are known. Connecting individuals and venues leads to so-called affiliation networks; we propose a model for how HIV might spread along these networks, and we formulate a new risk index based on this model. We test this new risk index on an affiliation network of 466 African-American MSM in Chicago, and in simulation. The new risk index works well when there are two groups of people, one with higher HIV prevalence than the other, with limited overlap in where they meet their sex partners.
Chapter 2: Minimum spanning trees (MSTs) are used in a variety of fields, from computer science to geography. Infectious disease researchers have used them to infer the transmission pathway of certain pathogens. However, these are often the MSTs of sample networks, not population networks, and surprisingly little is known about what can be inferred about a population MST from a sample MST. We prove that if n nodes (the sample) are selected uniformly at random from a complete graph with N nodes and unique edge weights (the population), the probability that an edge is in the population graph's MST given that it is in the sample graph's MST is n/N. We use simulation to investigate this conditional probability for G(N,p) graphs, Barabási–Albert (BA) graphs, graphs whose nodes are distributed in R^2 according to a bivariate standard normal distribution, and an empirical HIV genetic distance network. Broadly, results for the complete, G(N,p), and normal graphs are similar, and results for the BA and empirical HIV graphs are similar. We recommend that researchers use an edge-weighted random walk to sample nodes from the population so that they maximize the probability that an edge is in the population MST given that it is in the sample MST.
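The n/N result for complete graphs in Chapter 2 can be checked numerically. The following is a minimal Monte Carlo sketch, not the dissertation's own simulation code: the function names `kruskal_mst` and `estimate_conditional_probability`, the use of Kruskal's algorithm, and the choice of a random permutation of ranks as edge weights are all assumptions made for illustration. The sample graph is taken to be the subgraph induced by the sampled nodes.

```python
import random
from itertools import combinations


def kruskal_mst(nodes, weight):
    """Return the MST edge set of the complete graph on `nodes` (Kruskal)."""
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = set()
    # Edges are stored as (u, v) with u < v, matching the keys of `weight`.
    edges = sorted(combinations(sorted(nodes), 2), key=lambda e: weight[e])
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.add((u, v))
    return mst


def estimate_conditional_probability(N, n, trials, seed=0):
    """Estimate P(edge in population MST | edge in sample MST).

    Illustrative assumption: each trial draws a fresh complete graph on N
    nodes whose edge weights are a random permutation of ranks, so the
    weights are guaranteed to be unique.
    """
    rng = random.Random(seed)
    all_edges = list(combinations(range(N), 2))
    hits = total = 0
    for _ in range(trials):
        ranks = list(range(len(all_edges)))
        rng.shuffle(ranks)
        weight = dict(zip(all_edges, ranks))

        population_mst = kruskal_mst(range(N), weight)
        sample = rng.sample(range(N), n)  # uniform sample of n nodes
        sample_mst = kruskal_mst(sample, weight)

        hits += len(sample_mst & population_mst)
        total += len(sample_mst)
    return hits / total
```

With, say, N = 30 and n = 10, the estimate should land close to 10/30 once a few hundred trials are pooled. The sketch covers only the complete-graph case; the G(N,p), BA, spatial, and empirical networks mentioned above would need their own graph generators.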
Chapter 3: Mechanistic network models specify the mechanisms by which networks grow and change, allowing researchers to investigate complex systems using both simulation and analytical techniques. Unfortunately, it is difficult to write likelihoods for instances of graphs generated with mechanistic models because of a combinatorial explosion in outcomes of repeated applications of the mechanism. Thus, it is nearly impossible to estimate the parameters using maximum likelihood estimation. In this paper, we propose treating node sequence in a growing network model as an additional parameter, or as a missing random variable, and maximizing over the resulting likelihood. We develop this framework in the context of a simple mechanistic network model, used to study gene duplication and divergence, testing a variety of algorithms for maximizing the likelihood in simulated graphs. We also run the best-performing algorithm on a human protein-protein interaction network, finding a high rate of mutation. Although we focus on a specific mechanistic network model here, the proposed framework is more generally applicable to reversible models.
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368256