Publication:
Big Data in Biology and Medicine: Methodology and Computation

Date

2019-05-14

Published Version

Citation

Yang, Shihao. 2019. Big Data in Biology and Medicine: Methodology and Computation. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

Statistics is entering an exciting era. Huge volumes of electronic data accumulate every day as the activities of millions of individuals are recorded in nearly every aspect of life. These big data also raise unique challenges. This thesis attempts to address the big data challenges in the context of real-world research projects, and to harness the power of big data for solving real-life problems. The biggest challenge is that the size of a dataset does not guarantee the validity of the results; without rigorous methods, quick-and-dirty approaches typically give biased conclusions. This thesis thus attempts to develop novel and rigorous methodology for big-data analysis, focusing on two distinct big datasets. The first project proposes a method that optimally extracts information from online search data, such as Google searches, for accurate prediction of infectious diseases, such as flu in the United States or dengue fever in tropical countries. The second project performs causal inference on electronic health data, studying the causal relationship between treatment and side effects. In particular, I used a tailor-made matching method on a nationwide electronic health dataset to study the causal relationship between cancer immunotherapy treatment and side effects. Another challenge in big-data studies is that many traditional inference methods are not computationally feasible in the big data setting, so efficient computation and approximation tools must be developed. In this thesis, I tackled the computation issue from two perspectives: a general-purpose computation tool and a problem-specific approximate inference method. For general-purpose computation, I developed a new parallelizable Markov chain Monte Carlo method for Bayesian posterior inference. For problem-specific computation, I introduced a Gaussian process approximation method for inference in dynamic systems of ordinary differential equations.

Keywords

big data, statistical applications

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
