Publication:

Statistical Methods for Analysis of Genetic and Genomic Data in Population Science

Date

2017-05-02

The Harvard community has made this article openly available.

Citation

Barfield, Richard Thomas. 2017. Statistical Methods for Analysis of Genetic and Genomic Data in Population Science. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

In chapter 1, we develop a missing-mediator analysis using the EM algorithm for studies where the mediator is a genomic marker. Typically, measures such as DNA methylation or gene expression are collected on only a subset of participants from a larger study. Under the standard assumptions for mediation analysis and the additional assumption that the missing-data mechanism is ignorable, we can estimate the causal direct and indirect effects using all individuals with exposure and outcome data. We applied our method to Project Viva to assess whether cord blood DNA methylation mediates the effect of maternal pre-pregnancy BMI on childhood BMI.

In chapter 2, we develop a statistical method to estimate cell-specific associations in whole blood DNA methylation data, which is a mixture of several cell types, using observed cell composition when cell-specific methylation is not observed. We use Generalized Estimating Equations to estimate cell-specific exposure effects from observed whole blood methylation and cell type count data. We evaluated the performance of the proposed methods through simulation studies and by analyzing data from the Normative Aging Study to assess cell-specific smoking associations at 49 probes established to be associated with smoking on the aggregate scale.

In chapter 3, we introduce a novel approach to help differentiate among multiple eQTL genes that co-localize at disease loci (due to linkage disequilibrium, LD), to help identify the true susceptibility gene. We developed LD aware MR-Egger regression, an extension of MR-Egger regression to the setting where multiple SNPs in LD are associated with gene expression. This approach requires only summary GWAS and eQTL effects, along with LD from reference panels. Through simulations we show that when SNPs have direct (pleiotropic) effects, our approach provides adequate control of type I error, high power, and less bias than previously proposed methods under certain conditions. We analyzed summary data from a GWAS on the risk of breast cancer with eQTL data from breast tissue from GTEx to demonstrate the usefulness of this method.
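
To make the chapter 3 idea concrete, the following is a minimal sketch of an Egger-style regression fitted by generalized least squares with an LD-based covariance. It is an illustration only, not the dissertation's estimator: the function name `ld_aware_egger`, the covariance form diag(se) · LD · diag(se), and the toy numbers are all assumptions. It regresses summary GWAS effects on summary eQTL effects with an intercept, so the slope plays the role of the gene-expression effect and the intercept absorbs an average direct (pleiotropic) effect.

```python
import numpy as np

def ld_aware_egger(beta_gwas, se_gwas, beta_eqtl, ld):
    """Hypothetical helper: GLS regression of GWAS effects on eQTL effects.

    Returns (coef, se), where coef[0] is the intercept and coef[1] the slope.
    """
    y = np.asarray(beta_gwas, dtype=float)
    s = np.asarray(se_gwas, dtype=float)
    x = np.asarray(beta_eqtl, dtype=float)
    r = np.asarray(ld, dtype=float)

    # Design matrix: intercept column plus the eQTL effect sizes.
    X = np.column_stack([np.ones_like(x), x])

    # Assumed covariance of the GWAS estimates: diag(se) @ LD @ diag(se).
    cov = np.outer(s, s) * r

    # GLS normal equations, solved without forming cov^{-1} explicitly.
    cov_inv_X = np.linalg.solve(cov, X)
    xtx = X.T @ cov_inv_X
    coef = np.linalg.solve(xtx, cov_inv_X.T @ y)
    se = np.sqrt(np.diag(np.linalg.inv(xtx)))
    return coef, se

# Toy inputs (made up for illustration): 4 SNPs with AR(1)-style LD.
idx = np.arange(4)
ld = 0.5 ** np.abs(idx[:, None] - idx[None, :])
beta_eqtl = np.array([0.2, 0.4, -0.1, 0.3])
beta_gwas = 0.05 + 0.5 * beta_eqtl   # noise-free: intercept 0.05, slope 0.5
se_gwas = np.full(4, 0.02)

coef, se = ld_aware_egger(beta_gwas, se_gwas, beta_eqtl, ld)
print("intercept, slope:", coef)
print("standard errors:", se)
```

Because the toy GWAS effects are an exact linear function of the eQTL effects, the fit recovers the intercept and slope used to build them. With real summary statistics, the LD matrix would come from a reference panel, as the abstract describes.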

Keywords

Biostatistics, Statistical Genetics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
