Mapping copy number variation by population scale genome sequencing

Ryan E. Mills1,*, Klaudia Walter2,*, Chip Stewart3,*, Robert E. Handsaker4,*, Ken Chen5,*, Can Alkan6,7,*, Alexej Abyzov8,*, Seungtai Chris Yoon9,*, Kai Ye10,*, R. Keira Cheetham11, Asif Chinwalla5, Donald F. Conrad2, Yutao Fu12, Fabian Grubert13, Iman Hajirasouliha14, Fereydoun Hormozdiari14, Lilia M. Iakoucheva15, Zamin Iqbal16, Shuli Kang15, Jeffrey M. Kidd6, Miriam K. Konkel17, Joshua Korn4, Ekta Khurana8,18, Deniz Kural3, Hugo Y. K. Lam13, Jing Leng8, Ruiqiang Li19, Yingrui Li19, Chang-Yun Lin20, Ruibang Luo19, Xinmeng Jasmine Mu8, James Nemesh4, Heather E. Peckham12, Tobias Rausch21, Aylwyn Scally2, Xinghua Shi1, Michael P. Stromberg3, Adrian M. Stütz21, Alexander Eckehart Urban13, Jerilyn A. Walker17, Jiantao Wu3, Yujun Zhang2, Zhengdong D. Zhang8, Mark A. Batzer17, Li Ding5,22, Gabor T. Marth3, Gil McVean23, Jonathan Sebat15, Michael Snyder13, Jun Wang19,24, Kenny Ye20, Evan E. Eichler6,7,*, Mark B. Gerstein8,18,25,*, Matthew E. Hurles2,*, Charles Lee1,*, Steven A. McCarroll4,26,*, and Jan O. Korbel21,*,@ for the 1000 Genomes Project#

1. Department of Pathology, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA
2. The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA UK.
3. Department of Biology, Boston College, Boston, MA
4. Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA
5. The Genome Center at Washington University, St. Louis, MO
6. Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA
7. Howard Hughes Medical Institute, University of Washington, Seattle, Washington, USA.
8. Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT
9. Seaver Autism Center and Department of Psychiatry, Mount Sinai School of Medicine, New York, NY
10. Departments of Molecular Epidemiology, Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, the Netherlands
11. Illumina Cambridge Ltd, Chesterford Research Park, Little Chesterford, Essex CB10 1XL, UK
12. Life Technologies, Beverly, MA
13. Department of Genetics, Stanford University, Stanford, CA
14. School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada.
15. Department of Psychiatry, Department of Cellular and Molecular Medicine, Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA
16. Wellcome Trust Centre for Human Genetics, University of Oxford, OX3 7BN, UK
17. Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana
18. Molecular Biophysics and Biochemistry Department, Yale University, New Haven, CT
19. BGI-Shenzhen, Shenzhen 518083, China
20. Albert Einstein College of Medicine, Bronx, NY
21. Genome Biology Research Unit, European Molecular Biology Laboratory, Heidelberg, Germany
22. Department of Genetics, Washington University, St. Louis, MO
23. Department of Statistics, University of Oxford, OX3 7BN, UK
24. Department of Biology, University of Copenhagen, Copenhagen, Denmark
25. Department of Computer Science, Yale University, New Haven, CT
26. Department of Genetics, Harvard Medical School, Boston, MA

*These authors contributed equally to this work.
@ Correspondence should be addressed to J.O.K. (jan.korbel@embl.de).
#Lists of participants and affiliations appear in Supplementary Information.

Summary

Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown.
We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.

Introduction

Unbalanced structural variants (SVs), or copy number variants (CNVs), involving large-scale deletions, duplications, and insertions form one of the least well studied classes of genetic variation. The fraction of the genome affected by SVs is larger than that accounted for by single nucleotide polymorphisms1 (SNPs), implying significant consequences of SVs on phenotypic variation. SVs have already been associated with diverse diseases, including autism2,3, schizophrenia4,5 and Crohn’s disease6,7. Furthermore, locus-specific studies suggest that diverse mechanisms may form SVs de novo, with some mechanisms involving complex rearrangements resulting in multiple chromosomal breakpoints8,9.

Initial microarray-based SV surveys focused on large gains and losses10,11,12, with recent advances in array technology widening the accessible size spectrum towards smaller SVs1,13. Microarray-based surveys commonly mapped SVs to approximate genomic locations.
However, a detailed SV characterization, including analyses of SV  origin  and  impact,  requires  knowledge  of  precise  SV  sequences.  Advances  in  sequencing  technology  have  enabled  applying  sequence‐based  approaches  for  mapping SVs at fine‐scale14,15,16,17,18,19,20,21. These approaches include: (i) paired‐end  mapping  (or  read  pair  ‘RP’  analysis)  based  on  sequencing  and  analysis  of  abnormally  mapping  pairs  of  clone  ends14,22,23,24  or  high‐throughput  sequencing  fragments15,17,18; (ii) read‐depth (‘RD’) analysis, which detects SVs by analyzing the  read  depth‐of‐coverage16,21,25,26,27;  (iii) split‐read  (‘SR’)  analysis,  which  evaluates  gapped sequence alignments for SV detection28,29; and (iv) sequence assembly (‘AS’),  which  enables  the  fine‐scale  discovery  of  SVs,  including  novel  (non‐reference)  sequence  insertions30,31,32.  Sequence‐based  SV  discovery  approaches  have  thus  far  been  applied  to  a  limited  (<20)  number  of  genomes,  leaving  the  fine‐scale  architecture of most common SVs unknown.  Sequence  data  generated  by  the  1000  Genomes  Project  (1000GP)  provide  an  unprecedented  opportunity  to  generate  a  comprehensive  SV  map.  The  1000GP  recently generated 4.1 Terabases of raw sequence in pilot projects targeting whole  human  genomes33  (Supplementary  Table  1).  These  studies  comprise  a  population‐ scale  project,  termed  ‘low‐coverage  project’,  in  which  179  unrelated  individuals  were sequenced with an average coverage of 3.6X – including 59 Yoruba individuals  from Nigeria (YRI), 60 individuals of European ancestry from Utah (CEU), 30 of Han  ancestry  from  Beijing  (CHB),  and  30  of  Japanese  ancestry  from  Tokyo  (JPT;  the  latter two were jointly analyzed as JPT+CHB). 
In addition, a high-coverage project, termed the ‘trio project’, was carried out, with individuals of a CEU and a YRI parent-offspring trio sequenced to 42X coverage on average.

We report here the results of analyses undertaken by the Structural Variation Analysis Group of the 1000GP. The group’s objectives were to discover, assemble, genotype, and validate SVs of 50 bp and larger in size, and to assess and compare different sequence-based SV detection approaches. The focus of the group was initially on deletions, a variant class often associated with disease9, for which rich control datasets and diverse ascertainment approaches exist1,13,22,28. Less focus was placed on insertions and duplications34 and none on balanced SV forms (such as inversions). Specifically, we applied nineteen methods to generate an SV discovery set. We further generated reference genotypes for most deletions, assessed the SVs’ functional impact, and stratified SV formation mechanism with respect to variant size and genomic context.

Prediction of SV candidate loci and assessment of discovery methods

We incorporated the SV discovery methods into a pipeline (Fig. 1A,B), with the goal of ascertaining different SV types and assessing each method for its ability to discover SVs. The methods detected SVs by analyzing RD, RP, SR, and AS features, or by combining RP and RD features (abbreviated as ‘PD’). Altogether we generated thirty-six SV callsets by applying the methods to trio and low-coverage data, and by identifying SVs as genomic differences relative to a human reference, corresponding either to the reference genome or to a set of individuals (i.e., population reference; Supplementary Table 2).
We  initially  identified  SVs  as  deletions,  tandem  duplications,  novel  sequence  insertions,  and  mobile  element  insertions  (MEIs)  relative  to  the  human  reference.  Subsequent  comparative  analyses  involving  primate genomes enabled us to classify SVs as deletions, duplications, or insertions  relative  to  inferred  ancestral  genomic  loci,  reflecting  mechanisms  of  SV  formation  (see below). DNA reads analyzed by SV discovery methods were initially mapped to  the human reference genome using a variety of alignment algorithms. Most of these  algorithms mapped each read to a single genomic position, although one algorithm  (mrFAST16)  also  considered  alternative  mapping  positions  for  reads  aligning  onto  repetitive  regions  (see  Supplementary  Tables  2‐4  for  method‐specific  parameters  and  full  SV  callsets).  We  filtered  each  callset  by  excluding  SVs  <50bp,  which  are  reported  elsewhere33.  Many  SVs  exhibited  support  from  distinct  SV  discovery  methods,  as  exemplified  by  a  common  deletion,  previously  associated  with  body‐ mass  index35  (BMI),  that  we  identified  with  RP,  RD,  and  SR  methods  (Fig.  1C).  Nonetheless,  we  observed  notable  differences  between  methods  (Fig.  2ABC)  in  terms  of  genomic  regions  ascertained  (Supplementary  Fig.  1),  accessible  SV  size‐ range (Fig. 2A), and breakpoint precision (Fig. 2C, Supplementary Fig. 2).   To  estimate  callset  specificity,  we  carried  out  extensive  validations  (Methods),  including  PCRs  for  over  3,000  candidate  loci,  and  microarray  data  analyses  for  50,000  candidate  loci  (Supplementary  Tables  3,  4;  Supplementary  Fig.  3).  
We combined PCR and array-based analysis results to estimate false discovery rates (FDRs), and found that eight callsets (three deletion, four insertion, and one tandem duplication callset) met the pre-specified specificity threshold33 (FDR≤10%), whereas the other callsets yielded lower specificity (FDRs of 13%-89%).

We further assessed the sensitivity of deletion discovery methods by collating data from four earlier surveys1,13,22,28 into a gold standard (Methods, Supplementary Tables 5, 6, and Supplementary Fig. 4A), and specifically assessing the detection sensitivity for an individual sequenced at high-coverage (NA12878) as well as for an individual sequenced at low-coverage (NA12156). Unsurprisingly, given the typical trade-off between sensitivity and specificity, in the trios the highest sensitivities were achieved by RD and RP methods with FDR>10% (Fig. 2B). By comparison, in the low-coverage data, the individual method with the greatest accuracy (FDR=3.7%) was the second most sensitive based on our gold standard (Fig. 2B), and the most sensitive when expanding the gold standard to a larger set of individuals (Supplementary Fig. 4B). This method, Genome STRiP (to be described elsewhere36), integrated both RP and RD features (PD), implying that considering different evidence types can improve SV discovery.

Construction of a high-confidence SV discovery set

To construct our SV discovery set (“release set”), we joined calls from different discovery methods corresponding to the same SV with a merging approach that was aware of each callset’s precision in SV breakpoint detection (Supplementary Fig. 5 and Methods). Most SVs in the release set (61%) were contributed by individual methods meeting the pre-defined specificity threshold (FDR≤10%).
The remaining 39% of calls were contributed by lower specificity methods following experimental validation. Altogether, the release set comprised 22,025 deletions, 501 tandem duplications, 5,371 MEIs, and 128 non-reference insertions (Table 1, Supplementary Table 7). With our gold standard we estimated an overall sensitivity of deletion discovery of 82% in the trios, and 69% in low-coverage sequence (Fig. 2B) using a 1 bp overlap criterion. When instead applying a stringent 50% reciprocal overlap criterion for sensitivity assessment (which required SV sizes inferred on different experimental platforms to be in close agreement) our sensitivity estimates decreased by 12% and 18%, respectively, in trio and low-coverage sequence (Supplementary Table 8). We further examined an alternative approach that involved the pairwise integration of deletion discovery methods, and tested its ability to discover SVs without relying on the inclusion of lower specificity calls following experimental validation (“algorithm-centric set”; Fig. 1B). While this alternative approach resulted in an increased number (by ~13%) of high-specificity (FDR<10%) calls compared to the release set (Supplementary Text), it overall resulted in fewer SV calls owing to its decreased sensitivity at the lower (<200 bp) SV size range. In the following analyses we thus focused on the release set.

Extent and impact of our SV discovery set

We next assessed the extent and impact of our SV discovery (release) set. The median SV size was 729 bp (mean=8 kb), approximately four times smaller than in a recent tiling CGH based study1, reflecting the high resolution of DNA sequence based SV discovery.
We also compared our set to a recent survey of SVs in an individual genome37 based on capillary sequencing and array-based analyses24, and observed a similar size distribution for deletions, but differences in the size distributions of other SV classes, reflecting underlying differences in SV ascertainment (Supplementary Fig. 6). By comparing our SVs to databases of structural variation and to additional personal genome datasets, we classified 15,556 SVs in our set as novel, with an enrichment of low frequency SVs and small SVs amongst the novel variants (Methods and Supplementary Text).

A major advantage of sequence-based SV discovery is the nucleotide resolution mapping of SVs. We initially mapped the breakpoints of 7,066 deletions and 3,299 MEIs using SR and AS features. Using the TIGRA-targeted assembly approach38 we further identified the breakpoints of an additional 4,188 deletions and 160 tandem duplications, initially discovered by RD, RP, and PD methods (Methods, Supplementary Table 2). Altogether, we mapped ~15,000 SVs at nucleotide resolution, 48% of which were novel. Few deletion loci (4.4%) displayed different SV breakpoints in different samples, which is explainable by rare TIGRA mis-assemblies, or alternatively, by recurrently formed, multi-allelic SVs (Supplementary Text). TIGRA further enabled us to validate an additional 7,359 SVs discovered with RP or RD features by identifying the SVs’ breakpoints (Methods), and to evaluate the mapping precision of SV discovery methods (Fig. 2C, Supplementary Figure 2).

We further assessed the putative functional impact of SVs in our set by relating them to genomic annotation.
Seventeen hundred SVs affected coding sequences, resulting in full gene overlaps or exon disruptions (Table 2), many of which led to out-of-frame exons (Supplementary Table 9). We related gene disruptions to gene functions, and observed significant enrichments for several functional categories including cell defense and sensory perception (Supplementary Table 10). High levels of structural variation, including copy-number variation, were previously described for both processes15,22,39. These SVs might be maintained in the population by selection for the purpose of functional redundancy. While most SVs intersecting with genes were deletions, several validated tandem duplications and MEIs also intersected with coding sequences (Table 2).

Population genetic properties of deletions

We next sought to generate genotypes for deletions discovered in the 1000GP data, both to facilitate population genetics analyses and to make our SV set amenable to association studies in the form of a reference genotype set. To this end, the Genome STRiP36 genotyping method was developed, which combines information from RD, RP, SR and haplotype features of population-scale sequence data (Methods, Supplementary Text). Using this approach we generated genotypes for 13,826 autosomal deletions in 156 individuals. The genotypes displayed 99.1% concordance with CGH array1 based genotypes (available for 1,970 of the deletions), suggesting high genotyping accuracy.

Fig. 3 presents allele frequency analyses based on these genotypes. As expected, common polymorphisms (minor allele frequency (MAF) >5%) were generally shared across populations, while rare alleles were frequently observed in only one population (Figs. 3A-C).
We observed several candidates for monomorphic deletions (i.e., genomic segments putatively deleted in all individuals), explainable by rare insertions present in the reference genome or by remaining genotyping inaccuracies (Supplementary Text).

We next assessed the allele frequencies of gene deletions (Fig. 3D). Similar to a recent array-based study1, we observed a depletion of high frequency alleles among deletions intersecting with protein-coding sequence compared to other deletions (P=1.1e-11; KS test), consistent with purifying selection keeping most gene deletions at low frequency. Nonetheless, several coding sequence deletions were observed with high allele frequency (>80%). Most of these occurred in regions annotated as segmental duplications, consistent with lessened evolutionary constraint in functionally redundant gene categories22. Intriguingly, common gene deletions also affected many unique genes with no obvious paralogs. We further analyzed the abundance of gene deletions in different populations and observed highly differentiated loci, albeit with no statistically significant relationship between differentiation and particular categories of gene overlap, i.e., intronic vs. exonic (Supplementary Text).

By comparing deletion genotypes with genotypes of nearby SNPs, we found, consistent with earlier studies1,13,40, that deletions in genomic regions accessible to short read sequencing display extensive linkage disequilibrium (LD) with SNPs. 81% of common deletions had one or more SNPs with which they are strongly correlated (r²>0.8; Supplementary Fig. 7). This suggests that many deletions mapped in our study will be identifiable through tagging SNPs in future studies (Supplementary Text).
On the other hand, a fifth of the genotyped deletions were not tagged by HapMap SNPs (a figure similar to the fraction of SNPs that are not tagged by HapMap SNPs41), implying that these SVs should be genotyped directly in association studies. Furthermore, the LD properties of complex SVs (e.g., multiallelic SVs) have not yet been fully ascertained, as methods for genotyping such SVs with similar accuracy are still being developed.

SV formation mechanism analysis

Nucleotide resolution breakpoint information enables inference of SV formation mechanisms15,22. Recent studies broadly distinguished between several germline rearrangement classes, some of which may comprise more than one SV formation mechanism15,22,42,43: non-allelic homologous recombination (NAHR), associated with long sequence similarity stretches around the breakpoints; rearrangements in the absence of extended sequence similarity (abbreviated as “non-homologous” or NH), associated with DNA repair by non-homologous end-joining (NHEJ) or with microhomology-mediated break-induced replication (MMBIR); the shrinking or expansion of variable number of tandem repeats (VNTRs), frequently involving simple sequences, by slippage; and MEIs. We distinguished among the classes NAHR, NH, VNTR, and MEI by examining the breakpoint junction sequence of SVs initially discovered as deletions or tandem duplications relative to a human reference.

We first compared the SVs to orthologous primate genomic regions to distinguish deletions from insertions/duplications with respect to reconstructed ancestral loci using the BreakSeq classification approach43.
This  analysis  showed  that  of  the  11,254  nucleotide‐resolution  SVs  discovered  as  deletions  relative  to  a  human  reference,  21%  actually  represented  insertions  and  2%  represented  tandem  duplications  relative  to  the  putative  ancestral  genome.  Of  the  remaining  SVs,  60%  were  classified  as  deletions  relative  to  ancestral  sequence,  whereas  the  ancestral  state  of  17%  was  undetermined.  By  comparison,  out  of  160  nucleotide‐resolution  SVs identified as tandem duplications relative to the reference genome, 91.6% were  classified  as  duplications  relative  to  the  ancestral  genome,  whereas  the  ancestral  state  of  8.4%  remained  undetermined  (Supplementary  Text).  Our  breakpoint  analysis revealed that 70.8% of the deletions and 89.6% of the insertions exhibited  breakpoint  microhomology/homology  ranging  from  2‐376  bp  in  size,  with  distribution  modes  of  2  bp  (attributable  to  NH)  and  15  bp  (attributable  to  MEI),  respectively  (Fig.  4A,  Supplementary  Text).  As  expected42,  a  small  portion  of  the  deletions  (16.1%)  displayed  non‐template  inserted  sequences  at  their  breakpoint  junctions.  By  comparison,  the  tandem  duplications  showed  extensive  stretches  displaying ≥95% sequence identity at the breakpoints linearly correlating in length  with  SV  size  (Fig. 4A).  In  addition,  most  tandem  duplications  displayed  2‐17  bp  of  microhomology at the breakpoint junctions (Supplementary Text).  We  subsequently  applied  BreakSeq43  to  infer  formation  mechanisms  for  all  SVs  classified  with  regard  to  ancestral  state.  Using  BreakSeq,  we  inferred  NH  as  the  dominating  deletion  mechanism,  and  MEI  as  the  dominating  insertion  mechanism  (Fig. 4BC, Supplementary Table 11). 
Furthermore, an abundance of microhomology at tandem duplication breakpoints suggested frequent formation of this SV class by a rearrangement process acting in the absence of homology (NH). When relating SV formation to the variant size spectrum, we observed marked insertion peaks for MEIs at 300 bp, corresponding to Alu elements, and at 6 kb, corresponding to L1/LINEs (Fig. 4C). By comparison, NH and NAHR based mechanisms occurred across a wide size-range, whereas VNTR expansion/shrinkage, consistent with earlier findings1, led to relatively small SV sizes (Figs. 4C,D).

Furthermore, when displaying the genomic distribution of SVs (Fig. 5A), we observed a notable clustering of SVs into ‘SV hotspots’. We analyzed this clustering in detail by examining the distribution of non-overlapping, adjacent SVs, and observed a marked clustering of SVs formed by NAHR, VNTR, and NH, respectively, a signal extending to hundreds of kilobases (Fig. 5B). The clustering was influenced by an abundance of VNTR near the centromeres43 and NAHR near the telomeres (Fig. 5A). A significant enrichment of NAHR near recombination hotspots (P=1.3e-15) and segmental duplications (P=3.1e-17) further contributed to the clustering (Supplementary Table 13).

To further explore this clustering we devised a segmentation approach for predicting SV hotspots (Methods), which yielded a map of 51 putative SV hotspots (Supplementary Table 14). 80% of the hotspots mainly comprised SVs originating from a single formation mechanism (Fig. 5C). Most of these corresponded to NAHR hotspots, although hotspots dominated by NH and VNTR also were evident. These observations suggest that SV formation is frequently associated with the locus-specific propensity for genomic rearrangement.
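The idea of locating regions with an unusually high density of SVs can be illustrated with a small sketch. This is not the segmentation approach referred to in Methods, whose details are not given here; it is a deliberately simple stand-in that counts SV start positions in fixed-size windows on one chromosome and merges adjacent windows that reach a count threshold. The function names, the 100 kb window, and the threshold are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def window_counts(positions: Iterable[int], window: int = 100_000) -> Counter:
    """Count SV start positions per fixed-size window (hypothetical helper)."""
    return Counter(pos // window for pos in positions)


def candidate_hotspots(
    positions: Iterable[int],
    window: int = 100_000,   # assumed window size, not taken from the paper
    min_count: int = 5,      # assumed density threshold, not taken from the paper
) -> List[Tuple[int, int]]:
    """Return merged [start, end) regions whose windows hold >= min_count SVs."""
    counts = window_counts(positions, window)
    dense = sorted(w for w, n in counts.items() if n >= min_count)
    regions: List[Tuple[int, int]] = []
    for w in dense:
        start, end = w * window, (w + 1) * window
        if regions and regions[-1][1] == start:
            # Directly adjacent to the previous dense window: extend that region.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


# Toy SV start coordinates on a single chromosome.
sv_starts = [10, 20, 30, 150_000, 160_000, 170_000, 900_000]
print(candidate_hotspots(sv_starts, window=100_000, min_count=3))
```

A fixed-window count is sensitive to where window boundaries fall and ignores local callability, so it should be read only as a way to make the notion of SV clustering concrete, not as a substitute for a statistical segmentation.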
Conclusions and discussion

By generating an SV set of unprecedented size along with breakpoint assemblies and reference genotypes, we demonstrate the suitability of population-scale sequencing for SV analysis. Nucleotide resolution data allow the construction of reference datasets and make SVs readily assessable across different experimental platforms using genotyping approaches. Our fine-scale map enabled us to examine the functional impact of SVs, as exemplified by our analysis of gene disruption variants, which will be of value for genome and exome sequencing studies.

Our map further enabled us to examine size spectra of SV formation mechanisms and led us to identify genomic SV hotspots that are commonly dominated by a single formation mechanism. Recurrent rearrangements, implicated in genomic disorders, are hypothesized to be associated with local genome architecture44, e.g., with segmental duplications that facilitate NAHR. Also, DNA rearrangement in the absence of homology, i.e., MMBIR, has been implicated in recurrent SV formation8,45. In this regard, we noticed that out of the hotspots we report, six fall into critical regions of known genetic disorders associated with recurrent de novo deletions, including Miller-Dieker syndrome and Leri-Weill dyschondrosteosis (Supplementary Table 14). Irrespective of potential disease relevance, or inferred mechanism of formation, our analysis revealed a map of SV hotspots that may constitute local centers of de novo SV formation, consistent with the concept that local genome architecture contributes to genomic instability44.

Our study focused on characterizing deletions, which are often associated with disease9.
Facilitated by ancestral analyses of SV loci, we also characterized insertions and tandem duplications, albeit in less detail than deletions. Companion papers with more detailed analyses of MEIs, and copy-number variation within segmental duplications are published elsewhere34,46. Of note, most SV discovery methods depend on mapping reads onto their genomic locus of origin, i.e., the ‘accessible’ fraction of the genome, a fraction lessened in segmental duplications that are of high interest to SV analysis. Nonetheless, owing to the abilities of RP and RD methods in detecting SVs in these regions and in interpreting reads with multiple mapping positions, the ‘accessible’ fraction of the genome is higher for SVs than for SNPs16. In the future, sequencing technologies generating longer DNA reads will increase the accessible genome, and will enable the assessment of SVs embedded in long repeat structures, such as balanced inversions.

Our SV resource will enable the discovery, genotyping, and imputation of SVs in larger cohorts. Numerous genomes will be sequenced in the coming months to facilitate disease association studies. Systematic characterization of SVs in these genomes will benefit from the concepts and datasets presented here.

Methods Summary

Samples

Sequence data for 179 unrelated individuals and six individuals from parent-offspring trios were obtained as part of the 1000GP. These data were generated with Illumina/Solexa, Roche/454, and Life Technologies/SOLiD sequencing technology platforms.

SV discovery and breakpoint assembly

The SV discovery methods we applied comprised six RP, four RD, three SR, four AS, and two PD based methods. TIGRA38 was used for targeted breakpoint assembly.
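The two overlap criteria used for the sensitivity assessment of deletion calls (a 1 bp overlap and a 50% reciprocal overlap) can be made concrete with a short sketch. The code below is an illustration under stated assumptions, not the pipeline used in the study: the `Deletion` type, the half-open coordinate convention, and the helper names are hypothetical, and the toy intervals are invented for the example.

```python
from typing import Callable, List, NamedTuple


class Deletion(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)


def overlap_bp(a: Deletion, b: Deletion) -> int:
    """Base pairs shared by two intervals; 0 if on different chromosomes."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def any_overlap(a: Deletion, b: Deletion) -> bool:
    """1 bp overlap criterion."""
    return overlap_bp(a, b) >= 1


def reciprocal_overlap(a: Deletion, b: Deletion, min_fraction: float = 0.5) -> bool:
    """Shared span must cover at least min_fraction of BOTH intervals."""
    shared = overlap_bp(a, b)
    return (
        shared > 0
        and shared >= min_fraction * (a.end - a.start)
        and shared >= min_fraction * (b.end - b.start)
    )


def sensitivity(
    gold: List[Deletion],
    calls: List[Deletion],
    match: Callable[[Deletion, Deletion], bool],
) -> float:
    """Fraction of gold-standard deletions matched by at least one call."""
    if not gold:
        raise ValueError("gold standard is empty")
    recovered = sum(1 for g in gold if any(match(g, c) for c in calls))
    return recovered / len(gold)


# Toy data: three gold-standard deletions and two calls.
gold = [Deletion("chr1", 100, 200), Deletion("chr1", 1000, 2000), Deletion("chr2", 50, 150)]
calls = [Deletion("chr1", 120, 210), Deletion("chr1", 1900, 5000)]

print(sensitivity(gold, calls, any_overlap))         # 2 of 3 gold deletions matched
print(sensitivity(gold, calls, reciprocal_overlap))  # 1 of 3 gold deletions matched
```

The second call touches a gold-standard deletion but differs greatly in size, so it counts under the 1 bp criterion and not under the reciprocal one. This mirrors why a reciprocal criterion, which requires sizes to be in close agreement, gives lower sensitivity estimates.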
Experimental validation

We validated SV calls by PCR, array CGH and SNP microarrays, targeted assembly, and custom microarray-based sequence capture. PCR was performed in several laboratories33; CGH analysis was based on tiling array data provided by the Genome Structural Variation Consortium (ArrayExpress: E-MTAB-40); and SNP array analysis was based on data obtained from the International HapMap Consortium (http://hapmap.ncbi.nlm.nih.gov).

Genotyping

Genome STRiP36 was used for deletion genotyping in low coverage sequence data. Initial genotype likelihoods were derived with a Bayesian model, and imputation into a SNP genotype reference panel from the HapMap41 (Hapmap3r2) was achieved with Beagle (v3.1; http://faculty.washington.edu/browning/beagle/beagle.html).

SV formation mechanism analysis

SV breakpoints mapped at nucleotide resolution were analyzed with BreakSeq43 to classify SVs relative to putative ancestral loci and to infer SV formation mechanisms. SV hotspots were mapped with custom Perl and R scripts.

Display Items

Table 1. Summary of discovered structural variation

SV class                     Individual callsets <10% FDR   Validated experimentally*   Release set
Deletions                    11215                          10810                       22025
Tandem duplications          501                            -                           501
Mobile element insertions    5371                           -                           5371
Novel sequence insertions    -                              128                         128
Total                        17087                          10938                       28025

* Only tabulates validated calls which were not already present in the individual callsets with <10% FDR.

Table 2. Functional impact of our fine resolution SV set. Figures in parentheses indicate numbers of validated SVs per category. We inferred gene overlap with Gencode gene annotation47.
SV class                     Full gene overlap   Coding exon affected (partial overlap)   UTR overlap   Intron overlap   Total gene overlap   Total intergenic
Deletions                    654 (631)           1093 (1031)                              315 (290)     7319 (6481)      9381 (8433)          12644 (10386)
Tandem duplications          2 (2)               7 (6)                                    9 (5)         197 (62)         215 (75)             286 (76)
Mobile element insertions    -                   3 (-)                                    36 (-)        1304 (97)        1348 (112)           4023 (758)
Novel sequence insertions    -                   -                                        2 (2)         49 (49)          51 (51)              77 (77)
Sum                          656 (633)           1119 (1040)                              351 (309)     8869 (6689)      10995 (8671)         17030 (11280)

Figure Legends

Figure 1. SV discovery and genotyping in population scale sequence data. A. Schematic depicting the different modes (i.e., approaches) of sequence based SV detection we used. The RP approach assesses the orientation and spacing of the mapped reads of paired-end sequences14,15 (reads are denoted by arrows); the RD approach evaluates the read depth-of-coverage25,26; the SR approach maps the boundaries (breakpoints) of SVs by sequence alignment28,29; the AS approach assembles SVs30,31,32. B. Integrated pipeline for SV discovery, validation, and genotyping. Colored circles represent individual SV discovery methods (listed in Supplementary Table 1), with modes indicated by a color scheme: green=RP; yellow=RD; purple=SR; red=AS; green and yellow=methods evaluating RP and RD (abbreviated as ‘PD’). C. Example of a deletion, previously associated with BMI35, identified independently with RP (green), RD (yellow), and SR (red) methods. Targeted assembly confirmed the breakpoints detected by SR.

Figure 2. Comparative assessment of deletion discovery methods. A. Deletion size-range ascertained by different modes of SV discovery.
Three groups are visible, each ascertaining a similar size‐range: AS and SR; PD and RP; and RD and ‘RL’ (RP analysis involving relatively long‐range (≥1 kb) insert size libraries, which yields a different deletion detection size range than the predominantly used <500 bp insert size libraries). Pie charts display the contribution of different SV discovery modes to the release set. Outer pie = based on number of SV calls; inner pie = based on total number of variable nucleotides. Of note, not all approaches were applied across all individuals (see Supplementary Table 2). B. Sensitivity and FDR estimates for individual deletion discovery methods based on gold standard sets for individuals sequenced at high coverage (NA12878) and low coverage (NA12156), respectively. All depicted estimates are summarized in Supplementary Tables 3, 4, and 6. Vertical dotted lines correspond to the specificity threshold (FDR≤10%). C. Breakpoint mapping resolution of three deletion discovery methods (the respective method names are in Supplementary Table 2). The blue and red histograms are the breakpoint residuals for predicted deletion start and end coordinates, respectively, relative to assembled coordinates (here assessed in low‐coverage data). The horizontal lines at the top of each plot mark the 98% confidence intervals (labeled for each panel), with vertical notches indicating the positions of the most probable breakpoint (the distribution mode).

Figure 3. Analysis of deletion presence and absence in three populations. A–C. Deletion allele frequencies and observed sharing of alleles across populations, displayed as stacked bars for deletions discovered in the CEU, YRI, and JPT+CHB population samples. D. Allele frequency spectra for deletions intersecting with intergenic (blue), intronic (yellow), and protein‐coding sequences (red).
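The quantity plotted in Figure 3 (alternate allele frequency, tabulated per functional category in panel D) can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function names, the toy genotypes, the five‐bin layout, and the encoding of genotypes as 0, 1, or 2 copies of the deletion allele are all assumptions made for illustration.

```python
from collections import defaultdict


def alt_allele_frequency(genotypes):
    """Fraction of alternate (deletion) alleles among called diploid genotypes.

    genotypes: list of 0, 1, or 2 (copies of the deletion allele), or None
    for a missing call. Hypothetical encoding, chosen for this sketch only.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def frequency_spectrum(deletions, n_bins=5):
    """Count deletions per allele-frequency bin, separately for each category.

    deletions: iterable of (category, genotypes) pairs.
    Returns {category: [count_in_bin_0, ..., count_in_bin_(n_bins-1)]}.
    """
    spectrum = defaultdict(lambda: [0] * n_bins)
    for category, genotypes in deletions:
        freq = alt_allele_frequency(genotypes)
        if freq is None or freq == 0:
            continue  # uncalled, or deletion allele absent from this sample
        # A frequency of exactly 1.0 is placed in the last bin.
        idx = min(int(freq * n_bins), n_bins - 1)
        spectrum[category][idx] += 1
    return dict(spectrum)


# Toy input: four deletions genotyped in four individuals (made-up values).
toy_deletions = [
    ("intergenic", [0, 1, 2, 1]),     # 4 of 8 alleles
    ("intronic", [0, 0, 1, None]),    # 1 of 6 alleles, one missing call
    ("coding", [0, 0, 0, 1]),         # 1 of 8 alleles
    ("intergenic", [2, 2, 2, 2]),     # fixed in this toy sample
]

toy_spectrum = frequency_spectrum(toy_deletions)
```

Comparing the per‐category counts in the lowest‐frequency bins is one simple way to see whether deletions in a given category are skewed toward rare alleles.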
Figure 4. Contribution of SV formation mechanisms to the SV size spectrum. A. Breakpoint junction homology/microhomology length plotted as a function of SV size for SVs originally identified as deletions compared to a human reference. Dots are colored according to the SVs’ classification as deletions, insertions/duplications, or “undetermined” relative to inferred ancestral genomic loci. Gray lines mark groups of SVs likely formed by a common formation mechanism. The diagonal highlights tandem duplications (and a few reciprocal deletion events), in which the length of the duplicated sequence correlates linearly with the length of the longest breakpoint junction sequence identity stretch. The ellipses indicate MEIs, i.e., Alu (~300 bp) and L1 (~6 kb) insertions, associated with target site duplications of up to 28 bp in size at the breakpoints. The horizontal group corresponds mostly to NH‐associated deletions with <10 bp microhomology at the breakpoints. The remaining (ungrouped) SVs comprise truncated MEIs, VNTR expansion and shrinkage events, as well as NAHR‐associated deletions and duplications. B. Relative contributions of SV formation mechanisms in the genome. Numbers of SVs are displayed on the outer pie chart and affected base pairs on the inner. Left panel: SVs classified as deletions relative to ancestral loci. Right panel: SVs classified as insertions/duplications. C. Size spectra of deletions classified relative to ancestral loci. D. Size spectra of insertions/duplications.

Figure 5. Mapping hotspots of SV formation in the genome. A. Distribution of SVs on chromosome 10 (“chr10”). Above the ideogram, colored bars indicate SV formation mechanisms (same color scheme as in B and C); bar lengths relate to the logarithm of SV size. Below the ideogram, bar lengths are directly proportional to allele frequencies.
Arrows indicate an SV hotspot near the centromere involving mainly VNTR events, and several hotspots near the telomeres involving mainly NAHR events. B. Enrichment of SVs inferred to be formed by the same formation mechanism for different genomic window sizes. Displayed is the enrichment of nearby, non‐overlapping SVs formed by the same mechanism relative to an SV set in which mechanism assignments are shuffled randomly. C. SV hotspots are mostly dominated by a single formation mechanism. Colored bars depict numbers of SV hotspots in which at least 50% of the variants were inferred to be formed by a single formation mechanism. The average abundance of NAHR‐classified SVs in NAHR hotspots was 70% (compared with 77% for VNTR hotspots and 69% for NH hotspots). The gray bar (“mixed”) corresponds to SV hotspots with no single mechanism dominating.

References

1  Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704‐712 (2010).
2  Pinto, D. et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature 466, 368‐372 (2010).
3  Sebat, J. et al. Strong association of de novo copy number mutations with autism. Science 316, 445‐449 (2007).
4  Stefansson, H. et al. Large recurrent microdeletions associated with schizophrenia. Nature 455, 232‐236 (2008).
5  McCarthy, S. E. et al. Microduplications of 16p11.2 are associated with schizophrenia. Nat Genet 41, 1223‐1227 (2009).
6  Craddock, N. et al. Genome‐wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713‐720 (2010).
7  McCarroll, S. A. et al. Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn's disease. Nat Genet 40, 1107‐1112 (2008).
8  Hastings, P. J., Lupski, J.
R., Rosenberg, S. M. & Ira, G. Mechanisms of change in gene copy number. Nat Rev Genet 10, 551‐564 (2009).
9  Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annu Rev Med 61, 437‐455 (2010).
10  Sebat, J. et al. Large‐scale copy number polymorphism in the human genome. Science 305, 525‐528 (2004).
11  Iafrate, A. J. et al. Detection of large‐scale variation in the human genome. Nat Genet 36, 949‐951 (2004).
12  Sharp, A. J. et al. Segmental duplications and copy‐number variation in the human genome. Am J Hum Genet 77, 78‐88 (2005).
13  McCarroll, S. A. et al. Integrated detection and population‐genetic analysis of SNPs and copy number variation. Nat Genet 40, 1166‐1174 (2008).
14  Tuzun, E. et al. Fine‐scale structural variation of the human genome. Nat Genet 37, 727‐732 (2005).
15  Korbel, J. O. et al. Paired‐end mapping reveals extensive structural variation in the human genome. Science 318, 420‐426 (2007).
16  Alkan, C. et al. Personalized copy number and segmental duplication maps using next‐generation sequencing. Nat Genet 41, 1061‐1067 (2009).
17  Chen, K. et al. BreakDancer: an algorithm for high‐resolution mapping of genomic structural variation. Nat Methods 6, 677‐681 (2009).
18  Hormozdiari, F., Alkan, C., Eichler, E. E. & Sahinalp, S. C. Combinatorial algorithms for structural variation detection in high‐throughput sequenced genomes. Genome Res 19, 1270‐1278 (2009).
19  Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next‐generation sequencing. Nat Methods 6, S13‐20 (2009).
20  McKernan, K. J. et al. Sequence and structural variation in a human genome uncovered by short‐read, massively parallel ligation sequencing using two‐base encoding. Genome Res 19, 1527‐1541 (2009).
21  Chiang, D. Y. et al. High‐resolution mapping of copy‐number alterations with massively parallel sequencing. Nat Methods 6, 99‐103 (2009).
22  Kidd, J. M. et al. Mapping and sequencing of structural variation from eight human genomes. Nature 453, 56‐64 (2008).
23  Lee, S., Cheran, E. & Brudno, M. A robust framework for detecting structural variations in a genome. Bioinformatics 24, i59‐67 (2008).
24  Pang, A. W. et al. Towards a comprehensive structural variation map of an individual human genome. Genome Biol 11, R52 (2010).
25  Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003‐1007 (2002).
26  Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome‐wide massively parallel paired‐end sequencing. Nat Genet 40, 722‐729 (2008).
27  Yoon, S., Xuan, Z., Makarov, V., Ye, K. & Sebat, J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res 19, 1586‐1592 (2009).
28  Mills, R. E. et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res 16, 1182‐1190 (2006).
29  Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired‐end short reads. Bioinformatics 25, 2865‐2871 (2009).
30  Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res 19, 1117‐1123 (2009).
31  Hajirasouliha, I. et al. Detection and characterization of novel sequence insertions using paired‐end next‐generation sequencing. Bioinformatics 26, 1277‐1283 (2010).
32  Li, R. et al. The sequence and de novo assembly of the giant panda genome. Nature 463, 311‐317 (2010).
33  The‐1000‐Genomes‐Project‐Consortium. A map of human genome variation from population‐scale sequencing. Nature 467, 1061‐1073 (2010).
34  Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641‐646 (2010).
35  Willer, C. J. et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet 41, 25‐34 (2009).
36  Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. submitted.
37  Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol 5, e254 (2007).
38  Chen, L. et al. TIGRA local targeted assembly of structural variants. submitted (2010).
39  Hasin‐Brumshtein, Y., Lancet, D. & Olender, T. Human olfaction: from genomic variation to phenotypic diversity. Trends Genet 25, 178‐184 (2009).
40  Hinds, D. A., Kloek, A. P., Jen, M., Chen, X. & Frazer, K. A. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet 38, 82‐85 (2006).
41  Altshuler, D. M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52‐58 (2010).
42  Conrad, D. F. et al. Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat Genet 42, 385‐391 (2010).
43  Lam, H. Y. et al. Nucleotide‐resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat Biotechnol 28, 47‐55 (2010).
44  Lupski, J. R. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet 14, 417‐422 (1998).
45  Lee, J. A., Carvalho, C. M. & Lupski, J. R. A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell 131, 1235‐1247 (2007).
46  Stewart, C. et al. A comprehensive map of mobile element insertion polymorphisms in humans. in preparation.
47  Harrow, J. et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol 7 Suppl 1, S4 1‐9 (2006).

Acknowledgements: We would like to acknowledge Claire Hardy, Richard Smith, Anniek De Witte, and Shane Giles for their assistance with validation. M.A.B.’s group was supported by grants from the National Institutes of Health (R01 GM59290) and G.T.M.’s group by grants R01 HG004719 and RC2 HG005552, also from the NIH. J.O.K.’s group was supported by an Emmy Noether Fellowship of the German Research Foundation (Deutsche Forschungsgemeinschaft). J.W.’s group was supported by the National Basic Research Program of China (973 program no. 2011CB809200), the National Natural Science Foundation of China (30725008; 30890032; 30811130531; 30221004), the Chinese 863 program (2006AA02Z177; 2006AA02Z334; 2006AA02A302; 2009AA022707), the Shenzhen Municipal Government of China (grants JC200903190767A; JC200903190772A; ZYC200903240076A; CXB200903110066A; ZYC200903240077A; ZYC200903240076A and ZYC200903240080A), and the Ole Rømer grant from the Danish Natural Science Research Council. C.L.’s group was supported by grants from the National Institutes of Health (P41 HG004221, R01 GM081533, and U01 HG005209), and X.S. was supported by a T32 fellowship award from the NIH. We thank the Genome Structural Variation Consortium (http://www.sanger.ac.uk/humgen/cnv/42mio/) and the International HapMap Consortium for making available microarray data. The authors acknowledge the individuals participating in the 1000 Genomes Project by providing samples, including the Yoruba people of Ibadan, Nigeria, the community at Beijing Normal University, the people of Tokyo, Japan, and the people of the Utah CEPH community.
Furthermore, we thank Richard Durbin and Lars Steinmetz for comments on the manuscript.

Author Contributions: The authors contributed to this study at different levels, as described in the following. SV discovery: K.W., C.S., R.H., K.C., C.A., A.A., S.C.Y., R.K.C., A.C., Y.F., I.H., F.H., Z.I., D.K., R.L., Y.L., C.L., R.L., X.J.M., H.E.P., L.D., G.T.M., J.S., J.W., K.Y., K.Y., E.E.E., M.B.G., M.E.H., S.A.M., and J.O.K. SV validation: R.E.M., K.W., K.C., A.A., S.C.Y., F.G., M.K.K., J.K., J.N., A.E.U., X.S., A.M.S., J.A.W., Y.Z., Z.Z., M.A.B., J.S., M.S., M.E.H., C.L., and J.O.K. SV genotyping: K.W., R.H., M.E.H., and S.A.M. Data analysis: R.E.M., C.S., C.A., A.A., R.H., K.C., S.C.Y., R.K.C., A.C., D.C., Y.F., F.H., L.M.I., Z.I., J.M.K., M.K.K., S.K., J.K., E.K., D.K., H.Y.K.L., J.L., R.L., Y.L., C.L., R.L., X.J.M., J.N., H.E.P., T.R., A.S., X.S., M.P.S., J.A.W., J.W., Y.Z., Z.Z., M.A.B., L.D., G.T.M., G.M., J.S., M.S., J.W., K.Y., K.Y., E.E.E., M.B.G., M.E.H., C.L., S.A.M., and J.O.K. Preparation of manuscript display items: R.E.M., K.W., C.S., C.A., A.A., R.H., S.C.Y., L.M.I., S.K., E.K., M.K.K., X.J.M., X.S., J.A.W., M.B.G., S.A.M., and J.O.K. Co‐chairs of the Structural Variation Analysis group: E.E.E., M.E.H., and C.L. The following were leading contributors to the analysis described in this paper and therefore should be considered joint first authors: R.E.M., K.W., C.S., R.H., K.C., C.A., A.A., S.C.Y., and K.Y. The following contributed equally to directing the described analyses and participating in the design of the study and should be considered joint senior authors: E.E.E., M.B.G., M.E.H., C.L., S.A.M., and J.O.K. The manuscript was written by the following authors: R.E.M. and J.O.K.

Data retrieval: The data sets described here can be obtained from the 1000 Genomes Project website at www.1000genomes.org (July 2010 Data Release).
Individual SV discovery methods can be obtained from sources mentioned in Supplementary Table 1, or upon request from the authors. Abbreviations used in this paper are summarized in the Supplementary Text.

[Figure 1. Panel a: schematic of reference and sample genomes with reference‐supporting and SV‐supporting read evidence for deletions (Del), duplications (Dup), insertions (Ins), and mobile element insertions (MEI) under the RP, RD, SR, AS, and PD modes. Panel b: pipeline from application of diverse SV discovery methods, through validation of SVs and targeted SV breakpoint assembly (focused on deletions), to precision‐aware merging into the SV discovery set and genotyping (focused on deletions). Release set (algorithms & extensive validations): inclusion of SVs inferred with individual methods (criterion: FDR<10%), followed by validation‐aware SV inclusion. Algorithm‐centric set (algorithms & sparse validations): inclusion of SVs inferred with individual methods, and such with evidence from >2 methods (criterion: FDR<10%). Panel c: deletion near NEGR1 in NA19240 (YRI) and NA12878 (CEU); tracks: DNA read mapping quality scores and depth of coverage; x‐axis: chromosome 1 position (in Mb).]

[Figure 2. Panel a: density versus deletion size (bp) for AS, SR, PD, RP, RL, and RD, with pie charts for the release set. Panel b: sensitivity versus FDR. Panel c: frequency versus offset from breakpoint (bp) for SR (LN), n = 5375; RP (SI), n = 5229; and RD (YL), n = 501.]

[Figure 3. Panels a to c: number of SVs versus alternate allele frequency for CEU, YRI, and JPT+CHB, stacked by sharing across populations. Panel d: log10 (number of SVs) versus average alternate allele frequency for intergenic deletions and deletions intersecting with introns or CDS.]

[Figure 4. Panel a: length of longest sequence similarity stretch at the SV breakpoint junction (bp) versus SV length (bp). Panel b: pie charts of formation mechanisms (NAHR, NH, VNTR, MEI, unclassified). Panel c: number of deletions versus size of deletion. Panel d: number of insertions/duplications versus size of insertion/duplication.]

[Figure 5. Panel a: SVs along chr10. Panel b: enrichment or depletion of same‐mechanism clustering for window sizes from 200 bp to 1 Mb, including a control (NH vs. NAHR). Panel c: numbers of genomic SV hotspots dominated by NAHR, NH, VNTR, or MEI, or mixed.]
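The shuffling comparison described in the legend of Figure 5B can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the custom Perl and R scripts used in the study: the function names, the pair‐counting statistic, the use of start positions on a single chromosome, and the toy data are all hypothetical, and the sketch does not filter out overlapping SVs.

```python
import random


def same_mechanism_pairs(positions, mechanisms, window):
    """Count pairs of SVs at most `window` bp apart that share a mechanism label.

    positions: SV start coordinates on one chromosome, sorted ascending.
    mechanisms: mechanism label for each SV, in the same order.
    """
    count = 0
    n = len(positions)
    for i in range(n):
        j = i + 1
        while j < n and positions[j] - positions[i] <= window:
            if mechanisms[i] == mechanisms[j]:
                count += 1
            j += 1
    return count


def enrichment(positions, mechanisms, window, n_shuffles=1000, seed=0):
    """Ratio of observed same-mechanism pairs to the mean over label shuffles.

    Positions stay fixed; only the mechanism labels are permuted, so the
    null set keeps the same SV density and the same label proportions.
    """
    rng = random.Random(seed)
    observed = same_mechanism_pairs(positions, mechanisms, window)
    shuffled = list(mechanisms)
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += same_mechanism_pairs(positions, shuffled, window)
    expected = total / n_shuffles
    return observed / expected if expected > 0 else float("inf")


# Toy data (made up): two tight clusters, each formed by a single mechanism.
toy_positions = [100, 200, 300, 100000, 100100, 100200]
toy_mechanisms = ["NAHR", "NAHR", "NAHR", "NH", "NH", "NH"]

toy_observed = same_mechanism_pairs(toy_positions, toy_mechanisms, window=1000)
toy_ratio = enrichment(toy_positions, toy_mechanisms, window=1000)
```

A ratio above 1 indicates that SVs assigned to the same mechanism lie closer together than expected under random label assignment; repeating the call over a range of `window` values gives one point per window size.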