DNA Coverage Prediction Using Aggregated Poisson Approximation
Citation: Shakir, Khalid. 2018. DNA Coverage Prediction Using Aggregated Poisson Approximation. Master's thesis, Harvard Extension School.
Abstract: DNA whole genome sequence analysis is an important process, but it is time-consuming and expensive. In past studies, experts have often theorized that short reads of DNA sequence, once aligned to a reference genome, would cover that reference in a Poisson-based distribution. Under this theory, increasing the DNA sequencing depth across the genome should cover all bases with a minimum number of reads once the mean depth reaches a certain coverage threshold. However, since the completion of the human genome reference, examining the distributions of short reads has shown clear coverage disparities. When a sample is mapped to the reference, the coverage distribution does not fit a Poisson, for which, among other things, the variance in coverage should equal the mean.
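For reference, the single-Poisson model described above can be written out as follows. The symbols are introduced here for illustration and are not taken from the thesis: X is the read depth at a base, λ is the mean depth, and k is the minimum number of reads required at a base.

```latex
\[
P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \qquad
\mathbb{E}[X] = \operatorname{Var}(X) = \lambda, \qquad
P(X \ge k) = 1 - \sum_{x=0}^{k-1} \frac{\lambda^{x} e^{-\lambda}}{x!}
\]
```

Under this model the fraction of bases with at least k reads depends only on λ, so observed coverage whose variance exceeds its mean cannot be described by a single Poisson.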
Here a new tool is described that can better predict the coverage of a full sample from a fraction of the eventual reads. The degree of coverage for a full sample may be predicted from the coverage analysis of a sequencing run with just a subset of the sample. The read coverage of the subset is divided by metadata, for example read covariates and reference covariates. Each division of read coverage is put into a bin, and each bin has an average coverage. The desired coverage of the full sample is used to scale the average coverage of each bin, giving a Poisson approximation of coverage per bin for the desired full sample. Compared to a single Poisson, these approximations of stratified coverage, when aggregated, are better able to predict the distribution of coverage for regions of the genome with high or low coverage.
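The binning and scaling idea can be illustrated with a minimal Python sketch. It is not the thesis tool: the bin fractions, bin means, target depth, and all function names below are hypothetical, and the scaling rule (multiplying each bin mean by the ratio of target mean depth to subset mean depth) is an assumption made for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k reads at mean depth lam."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    # Work in log space so large k does not overflow the factorial.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def aggregated_pmf(k, bins, scale):
    """Weighted sum of per-bin Poissons, each bin mean scaled to the full sample.

    bins:  list of (fraction_of_genome, mean_coverage_in_subset) pairs.
    scale: assumed ratio of target mean depth to subset mean depth.
    """
    return sum(w * poisson_pmf(k, m * scale) for w, m in bins)

def fraction_at_least(min_depth, bins, scale):
    """Predicted fraction of bases covered by at least min_depth reads."""
    return 1.0 - sum(aggregated_pmf(k, bins, scale) for k in range(min_depth))

# Hypothetical strata from a low-depth subset run: (genome fraction, mean depth).
bins = [(0.10, 0.5), (0.80, 3.0), (0.10, 6.0)]
subset_mean = sum(w * m for w, m in bins)
target_mean = 30.0  # hypothetical desired full-sample mean depth
scale = target_mean / subset_mean

# Single-Poisson prediction at the same overall mean, for comparison.
single = 1.0 - sum(poisson_pmf(k, target_mean) for k in range(15))
stratified = fraction_at_least(15, bins, scale)

print(f"single Poisson, bases with >= 15 reads: {single:.4f}")
print(f"aggregated,     bases with >= 15 reads: {stratified:.4f}")
```

With these made-up bins, the low-coverage stratum stays mostly below 15 reads even at a 30x mean, so the aggregated prediction is lower than the single-Poisson one. That is the kind of disparity the stratified approach is meant to capture.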
The tool uses a number of stratification techniques. Disparate methods of stratification were evaluated during the development of the tool, including separating reads based on their characteristics, separating regions of the genome based on metrics generated from the reference used for aligning the short reads, and using various metadata files containing summary information about the reference and prior human genomic analyses. Used in a processing pipeline, this tool will enable data analysts and investigators to better estimate how additional sequencing would affect the coverage distributions in sequencing samples prepared with and without PCR amplification. This enhancement in workflow automation is expected to save both time and money for whole genome projects looking to reach a certain threshold of coverage for samples.
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:42004048