Hypothesis Testing and Model Selection for Complex Data
Abstract
In this dissertation, we propose methodology for hypothesis testing in statistical genetics and model selection in networks. In chapters 1 and 2, we introduce new methods to tackle difficulties in hypothesis testing for sequencing association studies brought on by advances in sequencing technology. In chapter 3, we introduce a flexible framework for mechanistic network model selection, an area of the networks literature with a dearth of work.
In chapter 1, we aim to test for association in case-control sequencing studies where the case-control status is completely confounded by the quality of the sequencing data. Such a situation can arise when one combines next generation sequencing data from cases with publicly available sequencing data (using an older platform) from controls. We propose a regression calibration-based method and also consider a maximum-likelihood method for conducting an association study with the aligned reads from cases and controls. The methods allow for adjusting for non-confounding covariates as well as confounders in some situations. Both methods control type I error and, with sufficiently high but different sequencing depths, have power comparable to an analysis conducted using the true genotypes. The regression calibration method allows for analysis with the naive variance estimate and standard software under certain circumstances.
In chapter 2, we present a method for sparse signal detection for association between a set of SNPs that contains rare variants and a binary phenotype. Such settings are common in the increasingly abundant whole genome sequencing analyses. Traditional single-SNP tests have poor power for rare variants. Thus, methods that test for association by aggregating the test statistics of multiple rare variants in a genetic region are popular. These existing methods for rare variant analysis, such as SKAT, have good power when the signals are dense in the set of SNPs tested, but can have poor power when the signals are sparse. In contrast, thresholding methods for signal detection, such as the higher criticism and Berk-Jones methods, have good power in the presence of sparse signals. However, they rely on the single-SNP test statistics being asymptotically normally distributed. This normality assumption does not hold for binary phenotypes in the presence of rare variants, and relying on it yields incorrect type I error rates. Our proposed rare variant higher criticism approach for sparse signal detection maintains the correct size, has higher power than the existing aggregation methods, and allows weighting of the SNPs.
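For readers unfamiliar with the thresholding idea, the following is a minimal sketch of the standard higher criticism statistic computed from a set of p-values. It is not the rare variant procedure proposed in the dissertation: it assumes the p-values are valid (uniform under the null), which is exactly the assumption that fails for rare variants with binary phenotypes. The function name `higher_criticism` and the clipping constant are illustrative choices, and the maximum is taken over all ordered p-values without the range restrictions that some formulations use.

```python
import numpy as np


def higher_criticism(pvalues):
    """Standard higher criticism statistic for a collection of p-values.

    HC = max_i sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    where p_(1) <= ... <= p_(n) are the ordered p-values.
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    # Clip to avoid division by zero when a p-value is exactly 0 or 1
    # (illustrative guard, not part of the statistic's definition).
    p = np.clip(p, 1e-12, 1 - 1e-12)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    return float(hc.max())


# Evenly spread p-values (null-like) versus the same set with one
# very small p-value standing in for a single sparse signal.
null_like = (np.arange(1, 101) - 0.5) / 100
sparse = null_like.copy()
sparse[0] = 1e-8
print(higher_criticism(null_like), higher_criticism(sparse))
```

A single very small p-value among many null-like ones drives the statistic up sharply, which is why this family of tests is suited to sparse signals.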
In chapter 3, we propose a procedure for mechanistic network model selection. Our proposal aims to address the dearth of work on model selection for mechanistic network models. Such models describe network growth and evolution over time starting from simple microscopic mechanisms. Together with statistical models, which are probabilistic models for the final observed network, they form two prominent paradigms for modeling network structure. In comparison to statistical models, mechanistic models make it easier to incorporate domain knowledge, to study the effects of interventions, and to draw samples, but they typically have intractable likelihoods. To handle this intractability, our procedure makes use of the flexible Super Learner framework and borrows aspects from Approximate Bayesian Computation.
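The general simulation-based idea can be illustrated with a toy sketch: simulate networks from each candidate mechanism, reduce each to summary statistics, and classify the observed network by comparing its summaries with the simulated ones. This is not the dissertation's procedure. A plain k-nearest-neighbour vote stands in for the Super Learner ensemble, and the two growth mechanisms, the summary statistics, and all names (`grow_network`, `summary`, `abc_model_probs`) are assumptions made for illustration.

```python
import random
from collections import Counter


def grow_network(n_nodes, preferential, rng):
    """Grow a tree one node at a time and return its degree list."""
    degree = [1, 1]     # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]  # each node appears once per unit of degree
    for new in range(2, n_nodes):
        if preferential:
            target = rng.choice(endpoints)  # probability proportional to degree
        else:
            target = rng.randrange(new)     # uniform over existing nodes
        degree[target] += 1
        degree.append(1)
        endpoints.extend([target, new])
    return degree


def summary(degree):
    """Summary statistics: maximum degree and degree variance."""
    n = len(degree)
    mean = sum(degree) / n
    var = sum((d - mean) ** 2 for d in degree) / n
    return (max(degree), var)


def abc_model_probs(observed_degree, n_sims=200, k=20, seed=0):
    """Share of the k nearest simulations that came from each model."""
    rng = random.Random(seed)
    n = len(observed_degree)
    obs = summary(observed_degree)
    sims = []
    for label, pref in (("preferential", True), ("uniform", False)):
        for _ in range(n_sims):
            sims.append((summary(grow_network(n, pref, rng)), label))
    # Unscaled squared Euclidean distance between summary vectors.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(s, obs)), label) for s, label in sims
    )
    votes = Counter(label for _, label in dists[:k])
    return {m: votes[m] / k for m in ("preferential", "uniform")}


observed = grow_network(200, preferential=True, rng=random.Random(42))
print(abc_model_probs(observed))
```

The likelihood of either mechanism is never evaluated; only the ability to sample from each model is used, which is the property that makes this style of approach attractive for mechanistic models.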
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:40046537
- FAS Theses and Dissertations