Publication: Genome-Wide Detection of Structural Variants and Signatures of Their Selection in Cancer
Abstract
Structural variants (SVs) are a heterogeneous class of genomic variation that can have profound effects on the structure and function of the cancer genome. SVs are challenging to detect in short-read sequencing data through standard alignment methods. Sequence assembly offers a powerful detection approach, but is difficult to apply genome-wide due to its computational complexity and the difficulty of extracting SVs from assemblies. I describe SvABA, an efficient and accurate method for detecting SVs using genome-wide local assembly. Evaluated on the NA12878 human genome and in simulated and real cancer genomes, SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs, and substantially improves detection performance over existing methods for variants in the 20-300 bp range. SvABA also identifies complex somatic rearrangements with chains of short (< 1,000 bp) templated-sequence insertions copied from distant genomic regions. I further describe the application of SvABA to several cancer sequencing projects to reveal both indels and rearrangements that drive cancer. I next analyze rearrangements in 2,693 cancer whole genomes from the International Cancer Genome Consortium (ICGC). To understand the mechanistic and selective pressures shaping these variants, I describe a two-part paradigm that analyzes rearrangement breakpoints separately from the fusions connecting disparate loci. I find that breakpoint rates exhibit substantial heterogeneity across the genome and among tumor types, and are enriched in open chromatin and at sites with high densities of repetitive elements. After accounting for these mechanistic factors, I find enrichment of breakpoints within 0.3% of the genome, including novel focal microdeletions at BRD4 in breast and ovarian cancers. For fusions, the major determinant of whether two loci will be fused is the genomic distance between them. Accounting for this distance distribution, I identify significantly recurrent fusion events, including a novel recurrent t(2;7) translocation between THADA and IGF2BP3 in thyroid cancer. I further find that chromatin structure and the relative homology between breakpoints in the context of repetitive elements significantly influence the distribution of somatic fusions. Finally, I describe a suite of open-access C++ tools, including VariantBam for extracting variant-containing reads from sequencing files, and the SeqLib toolkit for sequence alignment and sequence assembly.