Big Data in Biology and Medicine: Methodology and Computation

Yang, Shihao

Open/View Files

Primary YANG-DISSERTATION-2019.pdf (18.28 MB)

source.zip (21.43 MB)

Date

2019-05-14

Authors

Yang, Shihao

Citation

Yang, Shihao. 2019. Big Data in Biology and Medicine: Methodology and Computation. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

Statistics is entering an exciting era. Huge volumes of electronic data accumulate every day as the activities of millions of individuals are recorded in nearly every aspect of life. These big data also raise unique challenges. This thesis attempts to address the challenges of big data in the context of real-world research projects, and to harness their power for solving real-life problems. The biggest challenge is that the size of a dataset doesn’t guarantee the validity of the results; without rigorous methods, quick-and-dirty approaches typically give biased conclusions. This thesis thus attempts to develop novel and rigorous methodology for big-data analysis, focusing on two distinct big datasets. The first project proposes a method that optimally extracts information from online search data, such as Google searches, for accurate prediction of infectious diseases, such as flu in the United States or dengue fever in tropical countries. The second performs causal inference on electronic health data, studying the causal relationship between treatment and side effects. In particular, I used a tailor-made matching method on a nationwide electronic health dataset to study the causal relationship between cancer immunotherapy treatment and side effects. Another challenge in big-data studies is that many traditional inference methods are not computationally feasible in the big-data setting, so efficient computation and approximation tools must be developed. In this thesis, I tackled the computation issue from two perspectives: a general-purpose computation tool and a problem-specific approximate inference method. For general-purpose computation, I developed a new parallelizable Markov chain Monte Carlo method for Bayesian posterior inference. For problem-specific computation, I introduced a Gaussian process approximation method for inference in dynamic systems of ordinary differential equations.

Keywords

big data, statistical applications

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service

URI

http://nrs.harvard.edu/urn-3:HUL.InstRepos:42029490

Collections

FAS Theses and Dissertations
