Assessing Population Based Differences of Average Total Depth of Coverage in Next Generation Sequencing
Citation: Landry, Latrice. 2015. Assessing Population Based Differences of Average Total Depth of Coverage in Next Generation Sequencing. Master's thesis, Harvard Medical School.
Purpose. Next generation sequencing (NGS) is increasingly important to the development and advancement of Precision Medicine. However, there are limited data to support the establishment of a technical validation process sensitive to the complexities of genomes across populations. This validation hinges on key metrics used in assessing the quality control (QC) of NGS-based tests. Total depth of coverage, a key QC parameter in several NGS analysis steps, is one such metric that has not been assessed for systematic bias across populations.
Objective. To assess differences in average total depth of coverage between an African population and a European population from the 1000 Genomes Project (1KGP) exon capture dataset and the low-coverage whole genome sequencing (WGS) dataset.
Methods. Using previously called variant call format files (VCFs) from the 1KGP, we compared average total depth of sequencing coverage from exon capture data in a Yoruban, Nigerian African population (YRI, N=119) and a Central European population (CEU, N=91). Additionally, we compared mean total depth of sequence coverage from a low-coverage WGS dataset (target depth of 2-4x) in the same populations. Comparisons were made using t-tests and confirmed using Kolmogorov-Smirnov tests where normality of the distributions was in question.
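The comparison described in the Methods can be illustrated with a minimal Python sketch. It is not the thesis pipeline: it assumes total depth is read from the standard `DP` key in the VCF INFO column, it uses Welch's unequal-variance form of the t-test (the abstract does not say which variant was used), and the function names `info_depth`, `welch_t`, and `ks_statistic` are hypothetical helpers introduced only for illustration.

```python
import math
from bisect import bisect_right
from statistics import mean, variance


def info_depth(vcf_line):
    """Return the INFO/DP value of one VCF data line, or None if absent."""
    info = vcf_line.rstrip("\n").split("\t")[7]  # INFO is the 8th column
    for field in info.split(";"):
        if field.startswith("DP="):
            return int(field[3:])
    return None


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom (assumed variant)."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect_right(a, x) / len(a)  # fraction of a at or below x
        fb = bisect_right(b, x) / len(b)  # fraction of b at or below x
        d = max(d, abs(fa - fb))
    return d
```

In use, one list of depth values would be built per population (for example, by applying `info_depth` to each record of a population-specific VCF) and the two lists passed to `welch_t` and `ks_statistic`. Converting the statistics to p-values requires the t and Kolmogorov distributions, which a statistics library would normally supply.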
Results. We found a higher average total depth of coverage in the exon capture dataset for CEU when compared to YRI. These data suggest on average there are eighteen more reads capturing a variant in the CEU exomes compared to YRI exomes. The low-coverage data showed no meaningful difference in total depth of coverage between the two groups.
Conclusion and Significance. Given the prominence of NGS technologies in the development of precision medicine, it is imperative to understand key population differences that may affect the ability to detect genomic variation precisely and accurately. The data used in this investigation were taken from publicly available repositories and represent a consensus of different approaches to sequencing and variant calling. Thus it is not clear whether these findings represent real differences or an artifact of the different approaches. Artifacts are a potential concern because ‘batch effects’ are a well-known issue in NGS analysis, and because the 1KGP study design includes many different approaches to sequencing and calling variants, followed by post-hoc filters that are not consistent between the exon capture and low-coverage whole genome sequencing projects. It is important to follow up with additional analyses in which variants are called through a single pipeline with all parameters known and controlled for. This work is a preliminary step toward the much-needed robust testing of NGS in preparation for technical validation and widespread clinical use.
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37366134