Detecting Meaningful Relationships in Large Data Sets
Abstract
As data sets grow and algorithms scale, two questions have become central to data-rich science. The first is the exploration question: how can we avoid only testing hypotheses consistent with current models and instead find new, unanticipated types of relationships that will extend our understanding? The second is the interpretation question: given a robust relationship that has been identified, how can we know whether it proves our hypothesis or whether there are other confounders that are responsible for what we see? In this thesis, we develop a set of tools and theory centered around these two questions.
We begin with the exploration question, considering a common scenario in which researchers compute some statistic on every pair of variables in a high-dimensional data set, rank the variable pairs by their scores, and then examine the top of the resulting list. We formulate a theoretical framework for codifying which properties the statistic in question should have in order for this approach to successfully identify new, interesting relationships. We then introduce a suite of tools aimed at achieving these properties, show through theoretical analysis and simulations that they do so, and demonstrate their practical utility by using them to discover robust, novel relationships in a data set of social, political, and economic indicators collected by the World Health Organization about every country in the world.
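The score-every-pair-then-rank workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `pearson` and `rank_pairs`, the toy data, and above all the choice of absolute Pearson correlation as the statistic. That statistic is only a placeholder and is not the one developed in the thesis; any function that scores two columns could be passed in its place.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_pairs(columns, statistic=lambda x, y: abs(pearson(x, y))):
    """Score every unordered pair of variables and sort by descending score.

    `columns` maps a variable name to its list of observations.
    `statistic` is a placeholder (absolute Pearson correlation by default),
    not the statistic proposed in the thesis.
    """
    scored = [
        (statistic(columns[a], columns[b]), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical toy data set: three variables, five observations each.
data = {
    "u": [1, 2, 3, 4, 5],
    "v": [2, 4, 6, 8, 10],  # exactly linear in u
    "w": [3, 1, 4, 1, 5],
}

ranking = rank_pairs(data)
for score, a, b in ranking:
    print(f"{a}-{b}: {score:.3f}")
```

With p variables this loop evaluates p(p-1)/2 pairs, so the top of the list reflects whatever the chosen statistic rewards, which is why the properties of that statistic matter for this approach.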
We then turn to the interpretation question, specifically in the context of genome-wide association study (GWAS) data. Interpretation of GWAS data is notoriously difficult because tight correlations between nearby genetic variants, along with the multiple biological functions of each individual variant, mean that identified associations are consistent with many different hypotheses about disease mechanism. We posit a new type of genome-wide pattern that, when present, points to a relatively specific set of biological explanations and is therefore highly scientifically informative. We develop a statistic for confidently identifying this type of pattern, show in simulations that it indeed does so, and apply it to GWAS data spanning tens of diseases and complex traits, identifying both known and novel disease genes across a range of human diseases and traits.
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:40049997
- FAS Theses and Dissertations