Publication: Big Data in Biology and Medicine: Methodology and Computation
Date
2019-05-14
Citation
Yang, Shihao. 2019. Big Data in Biology and Medicine: Methodology and Computation. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract
Statistics is entering an exciting era. Huge volumes of electronic data accumulate every day as the activities of millions of individuals are recorded in nearly every aspect of life. These big data also raise unique challenges. This thesis attempts to address those challenges in the context of real-world research projects, and to harness the power of big data for solving real-life problems.
The biggest challenge is that the size of a dataset doesn’t guarantee the validity of the results; without rigorous methods, quick-and-dirty approaches typically give biased conclusions. This thesis thus attempts to develop novel and rigorous methodology for big-data analysis, focusing on two distinct big datasets. The first project proposes a method that optimally extracts information from online search data, such as Google searches, for accurate prediction of infectious diseases, such as flu in the United States or dengue fever in tropical countries. The second project performs causal inference on electronic health data, studying the causal relationship between treatment and side effects. In particular, I used a tailor-made matching method on nationwide electronic health data to study the causal relationship between cancer immunotherapy treatment and side effects.
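To make the point about biased quick-and-dirty comparisons concrete, here is a minimal sketch of nearest-neighbor matching on a single covariate. It is not the tailor-made matching method of the thesis; the simulated data, the `age` confounder, the effect size, and the function names are all hypothetical and chosen only for illustration.

```python
import random

def matched_effect(treated, controls):
    """Average treated-minus-matched-control outcome difference.

    Each unit is a (covariate, outcome) pair. Every treated unit is matched,
    with replacement, to the control whose covariate is closest.
    """
    diffs = []
    for x_t, y_t in treated:
        # Nearest-neighbor match on the single covariate.
        _, y_c = min(controls, key=lambda unit: abs(unit[0] - x_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

def simulate(n=2000, true_effect=0.1, seed=0):
    """Hypothetical data: older patients are treated more often
    and also have higher side-effect scores (confounding by age)."""
    rng = random.Random(seed)
    treated, controls = [], []
    for _ in range(n):
        age = rng.uniform(30, 80)
        is_treated = rng.random() < (age - 20) / 70  # assignment depends on age
        outcome = 0.01 * age + (true_effect if is_treated else 0.0) + rng.gauss(0, 0.05)
        (treated if is_treated else controls).append((age, outcome))
    return treated, controls

treated, controls = simulate()
# Naive estimate: raw difference in mean outcome, ignoring the confounder.
naive = (sum(y for _, y in treated) / len(treated)
         - sum(y for _, y in controls) / len(controls))
matched = matched_effect(treated, controls)
print(f"naive difference: {naive:.3f}, matched estimate: {matched:.3f}")
```

In this simulation the naive difference in means absorbs the age imbalance between groups, while the matched estimate should land much closer to the simulated effect of 0.1. Real electronic health data involve many covariates and would call for a far more careful design than this one-dimensional example.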
Another challenge in big-data studies is that many traditional inference methods are computationally infeasible at this scale, so efficient computation and approximation tools must be developed.
In this thesis, I tackled the computation issue from two perspectives: a general computation tool and a problem-specific approximate inference method. For general-purpose computation, I developed a new parallelizable Markov chain Monte Carlo method for Bayesian posterior inference. For problem-specific computation, I introduced a Gaussian process approximation method for inference in dynamic systems of ordinary differential equations.
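As a rough illustration of what parallelizing Markov chain Monte Carlo can look like in its simplest form, the sketch below runs several independent random-walk Metropolis chains concurrently and pools their draws. This is a textbook baseline, not the new method developed in the thesis. The data, the known noise level `SIGMA`, the flat prior, and all function names are assumptions made for the example.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

DATA = [1.2, 0.8, 1.9, 1.4, 0.7, 1.1, 1.6, 1.3]  # hypothetical observations
SIGMA = 1.0  # assumed known noise standard deviation

def log_posterior(mu):
    """Log posterior of the mean under a flat prior and Gaussian likelihood."""
    return -sum((y - mu) ** 2 for y in DATA) / (2 * SIGMA ** 2)

def run_chain(seed, n_steps=5000, burn_in=1000, step=0.5):
    """One random-walk Metropolis chain; returns the post-burn-in draws."""
    rng = random.Random(seed)
    mu, logp = 0.0, log_posterior(0.0)
    draws = []
    for i in range(n_steps):
        proposal = mu + rng.gauss(0, step)
        logp_new = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            mu, logp = proposal, logp_new
        if i >= burn_in:
            draws.append(mu)
    return draws

def sample_in_parallel(n_chains=4):
    """Run independent chains concurrently (one seed each) and pool the draws."""
    with ThreadPoolExecutor(max_workers=n_chains) as pool:
        chains = list(pool.map(run_chain, range(n_chains)))
    return [draw for chain in chains for draw in chain]

draws = sample_in_parallel()
posterior_mean = sum(draws) / len(draws)
print(f"pooled posterior mean: {posterior_mean:.3f}")
```

With a flat prior the exact posterior mean equals the sample mean of `DATA` (1.25), so the pooled estimate can be checked against it. A thread pool is used here only for portability; in CPython a process pool or separate machines would be needed for a real speedup, since pure-Python chains in threads share one interpreter lock.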
Keywords
big data, statistical applications
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service