Statistical Methods for Mobile Health and Genomics Data

Date

2022-05-12

Citation

Quinn, Matthew. 2022. Statistical Methods for Mobile Health and Genomics Data. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

A common goal in statistical analyses is to differentiate signal from noise. This problem is ubiquitous across many fields, including mobile health (mHealth) and genomics, both of which have garnered tremendous interest in recent years as advances in technology make them increasingly prominent in the study of human health. While this challenge of detecting signal is universal, the solutions to it are not. Different research applications introduce their own idiosyncrasies that can make existing approaches for signal detection insufficient for that specific context. In this dissertation, we present approaches for signal detection for three different problems in mHealth and genomics.

In Chapter 1, we study mHealth data, which are often collected through wearable devices, such as watches and other fitness trackers. The devices record and process data using algorithms that are subject to updates and glitches, which device manufacturers often do not publicize. As a result, devices can suddenly change how data are collected and reported over time. A researcher using mHealth data needs to be able to detect these changes in order to adjust for them. We propose Automated Selection of Changepoints using Empirical P-values and Trimming (ASCEPT) as an approach for objectively identifying where these changes occur. ASCEPT relies upon Monte Carlo simulations and regression models to accurately identify these algorithmic changes. We compare ASCEPT to an existing method on both simulated and real mHealth data.
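As a rough illustration of the general idea of attaching an empirical p-value to a candidate changepoint through Monte Carlo simulation, consider the sketch below. It is not ASCEPT itself: the abstract does not specify the test statistic, the regression models, or the trimming step, so the mean-shift statistic, the Gaussian null model, and the names `max_mean_shift` and `empirical_p_value` are all assumptions made purely for illustration.

```python
import math
import random
import statistics


def max_mean_shift(series, min_seg=10):
    """Return (statistic, index) for the split that maximizes the absolute
    difference between the mean before and the mean after the split."""
    best_stat, best_idx = 0.0, None
    for k in range(min_seg, len(series) - min_seg + 1):
        diff = abs(statistics.fmean(series[:k]) - statistics.fmean(series[k:]))
        if diff > best_stat:
            best_stat, best_idx = diff, k
    return best_stat, best_idx


def empirical_p_value(series, n_sim=200, min_seg=10, seed=0):
    """Monte Carlo p-value for the largest mean shift in `series`.

    Assumed null model (hypothetical): independent Gaussian noise with no
    changepoint. The noise scale is estimated from first differences so that
    a single level shift barely inflates it.
    """
    rng = random.Random(seed)
    observed, idx = max_mean_shift(series, min_seg)
    diffs = [b - a for a, b in zip(series, series[1:])]
    sigma = statistics.stdev(diffs) / math.sqrt(2)
    exceed = 0
    for _ in range(n_sim):
        # Simulate a series of the same length with no change in level.
        null_series = [rng.gauss(0.0, sigma) for _ in series]
        stat, _ = max_mean_shift(null_series, min_seg)
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_sim + 1), idx
```

Calling `empirical_p_value(series)` on a list of daily device readings returns the estimated p-value together with the index of the candidate change. A real analysis would also need to handle trends, autocorrelation, and multiple changepoints, none of which this sketch addresses.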

In Chapter 2, we look at chromatin immunoprecipitation sequencing (ChIP-seq) data, which reflect where proteins bind to a genome. Researchers often compare individuals from different experimental groups or biological conditions to detect regions of the genome in which there is differential binding (DB). DB in particular regions may then be associated with different health outcomes between the two groups, in turn helping the researcher understand risk factors or mechanisms contributing to a particular disease. However, popular methods for detecting DB often do not fully account for autocorrelation within samples, biological variability across samples, or selection procedures used to find regions of interest. As a result, they often report inappropriate inference regarding the significance of DB regions. We present a permutation test pipeline for finding DB sites on a genome while accounting for autocorrelation, biological variability, and the selection procedure in order to provide accurate inference. We compare this pipeline to two popular methods on both real and simulated data.

In Chapter 3, we continue studying genomics data, but this time focus on ribonucleic acid sequencing (RNA-seq) data, which reflect gene expression. Researchers commonly use RNA-seq data to study gene co-expression, or how the expression levels of different genes are correlated with one another. One can use the co-expression between genes to construct networks to better understand gene regulation or biological mechanisms, often with the hope of learning more about the drivers of certain health outcomes. However, not all co-expressions in RNA-seq are genuine. Technical issues with sequencing and normalization procedures that researchers perform may introduce spurious signals. We present evidence that this problem arises for different genes in real RNA-seq data and that the characteristics of these false signals can vary depending both on the normalization procedure used and the tissue in which the expression occurs. We present different metrics for characterizing the presence of these spurious correlations and permutation tests for assessing their statistical significance.
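For readers unfamiliar with permutation tests, a minimal generic sketch for the correlation between two expression vectors follows. It is not the dissertation's procedure: the abstract does not define its metrics, so the Pearson correlation statistic and the names `pearson` and `permutation_p_value` are assumptions chosen for illustration. Note also that this test only asks whether a correlation differs from zero; on its own it cannot separate genuine co-expression from correlation induced by sequencing or normalization.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y.

    Shuffling y breaks any pairing between samples, so the shuffled
    correlations approximate the distribution expected when the two
    vectors are unrelated.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Here `x` and `y` would hold the expression values of two genes across the same samples, in the same sample order.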

While we present three different research problems, they are all manifestations of the same core challenge. Whether we detect algorithmic changes in mHealth data over time, regions on the genome that contain DB, or spurious correlations among genes, the same underlying challenge of differentiating true signal from noise comes up. Additionally, while the solution to each instance is unique, we find that computational techniques, like Monte Carlo simulations and permutation tests, are particularly helpful tools in each scenario. Thus, while both the specific type of signal detection and its solution will depend on the underlying research context, there are commonalities among signal detection problems that can be helpful for understanding and addressing them.

Keywords

Biostatistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
