Publication:
A Review of VGP’s Current Techniques and Best Practices for the Generation of Vertebrate Chromosome-level Reference Genomes Using Multiple Sequencing Technologies.

Date

2023-05-01

Citation

Hachey, Julie Christine. 2023. A Review of VGP’s Current Techniques and Best Practices for the Generation of Vertebrate Chromosome-level Reference Genomes Using Multiple Sequencing Technologies. Master's thesis, Harvard University Division of Continuing Education.

Abstract

A reference genome is important for two main reasons: first, it provides a universal blueprint for the scientific community to report and communicate findings; second, it reduces the computational cost of genomic data processing (Mardis et al., 2002). The decade-long process of generating the human reference genome opened many doors, ranging from developing molecular medicine to uncovering the mystery of human evolution (Genome 10K Community of Scientists, 2009; Koepfli et al., 2015; Rhie et al., 2021). To increase our library of reference genomes, the international Genome 10K (G10K) consortium was created in 2009 to benchmark the methodologies, technologies, and algorithms required to generate reference genomes for vertebrates. In 2014, G10K reported its successes, lessons learned, and the limitations of the technology available at the time (Koepfli et al., 2015). In 2017, G10K formed the Vertebrate Genomes Project (VGP) to leverage new technologies, specifically long-read sequencing (LRS) and long-range chromosome scaffolding, with the ambitious goal of generating a reference genome for every organism with a backbone. The VGP published its flagship paper, Rhie et al. (2021), which generated 16 new references with a strong focus on genomic quality and on standardizing the tools required to produce the highest-quality vertebrate assemblies.

The VGP's quality standard defines the minimum threshold needed to generate a vertebrate genome. The requirements for a nearly error-free and gapless, chromosome-level, haplotype-phased, and annotated reference genome assembly are as follows: (1) a contig NG50 of 1 Mb, a scaffold NG50 of 10 Mb, and 90% of the sequence assigned to chromosomes, structurally validated by at least two independent lines of evidence; (2) Q40 average base quality; and (3) haplotypes assembled as completely and correctly as possible (Rhie et al., 2021). Although Rhie et al. (2021) reported the steps needed to produce the highest-quality dataset, to my knowledge no work has explored the updated VGP pipeline 2.0 tutorial available on Galaxy (Lariviere et al., 2023).

The main objective of this proposal is to review the current techniques and best practices used by the VGP for the generation of vertebrate chromosome-level reference genomes using multiple sequencing technologies. Specifically, this project aims to:

1. Review the use of Bionano, PacBio, and Hi-C sequencing technologies in the VGP pipeline 2.0 and develop an in-house bash script to run the pipeline.
2. Examine the advantages of performing a genome profile analysis to help improve the accuracy and efficiency of de novo assembly and alignment algorithms.
3. Identify the difference in outcome between the use of Bionano, PacBio, and Hi-C sequencing technologies in the VGP pipeline 2.0 by measuring the completeness of a genome: first with PacBio HiFi only; then with PacBio HiFi and Hi-C together; and lastly with all three of Bionano, PacBio, and Hi-C.

This study aims to validate the Rhie et al. (2021) Vertebrate Genomes Project (VGP) analysis toolset version 2.0 using a publicly available tutorial on Galaxy, and to create a modified pipeline that allows for easy data input and processing (Lariviere et al., 2023). The pipeline will be reconstructed using custom shell scripts documented on GitHub (Rhie et al., 2021), and the dependencies will be implemented in the workflow as Docker containers. The pipeline will be executed on a supercomputer meeting the CPU and RAM requirements outlined in the package documentation. Computational needs will be met through XSEDE (Extreme Science and Engineering Discovery Environment), a virtual organization funded by the National Science Foundation (NSF) that gives researchers from all fields of science and engineering access to advanced computing resources such as supercomputers, data storage, and visualization tools (Towns et al., 2014).

The main goal of this study is to provide researchers with a comprehensive understanding of emerging sequencing technologies and their associated cost and labor requirements for future genome assembly projects. By reconstructing the VGP pipeline v2.0 following the Galaxy VGP tutorial and analyzing different combinations of sequencing data types, the study will provide insights into the minimum number of datasets needed to meet the requirements for a complete vertebrate reference genome. Furthermore, by investigating whether different data types are needed across diverse vertebrate species, specifically birds, the study will contribute to a better understanding of how well genome assembly approaches generalize across sequencing technologies. Ultimately, this study aims to help researchers make more informed decisions about the sequencing technologies and assembly methods to employ in future genome assembly projects, thereby advancing our understanding of the genome biology of diverse species.
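
The numeric thresholds quoted in the abstract (1 Mb contig NG50, 10 Mb scaffold NG50, 90% of sequence assigned to chromosomes, Q40) can be illustrated with a minimal Python sketch. This is not code from the thesis or from the VGP pipeline: the function names (`ng50`, `qv_to_error_rate`, `meets_vgp_minimums`) and the input lists are hypothetical, and the sketch assumes that Q40 is a Phred-scaled value, so that it corresponds to roughly one error per 10,000 bases.

```python
def ng50(lengths, genome_size):
    """Length L such that sequences of length >= L sum to at least half of
    the estimated genome size. Returns 0 if the assembly never reaches it."""
    half = genome_size / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability
    (assumption: the Q40 in the abstract is Phred-scaled)."""
    return 10 ** (-qv / 10)


def meets_vgp_minimums(contig_lengths, scaffold_lengths, genome_size,
                       qv, fraction_assigned):
    """Check the numeric thresholds quoted in the abstract (hypothetical helper).
    Haplotype completeness and structural validation are not modeled here."""
    return (
        ng50(contig_lengths, genome_size) >= 1_000_000        # 1 Mb contig NG50
        and ng50(scaffold_lengths, genome_size) >= 10_000_000  # 10 Mb scaffold NG50
        and qv >= 40                                           # Q40 base quality
        and fraction_assigned >= 0.90                          # 90% on chromosomes
    )


if __name__ == "__main__":
    # Toy numbers for illustration only; not results from the thesis.
    contigs = [2_000_000] * 10
    scaffolds = [12_000_000, 8_000_000]
    print(meets_vgp_minimums(contigs, scaffolds, 20_000_000, 41, 0.95))
```

NG50 is computed against the estimated genome size rather than the assembly size, which is why the sketch takes `genome_size` as a separate argument; the same check could be applied to each of the three data combinations described in aim 3.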

Keywords

Bioinformatics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
