1   Supplementary Information: 2   3   Supplementary Text1. Recombination parameter estimates. 4   H. pylori. As expected, parameter estimates for H. pylori revealed an extremely high value of ρ=472 5   per kb but short tract lengths around ~50 bp that are approximately an order of magnitude smaller than 6   previous estimates from genomic data (1,2) yet in agreement with short lengths reported in carefully designed 7   in vitro experiments that used diverged donor and recipient strains (3). These short tract lengths also agree 8   with PC decaying within 50bp (Figure 1), as the PC vs. distance distribution is expected to asymptote near the 9   mean tract length because SNPs separated by larger distances are equally likely to be unlinked. Such short 10   recombination events have important implications for evolution (below). We note that while it is possible for 11   SNP pairs to occasionally mimic recombination via back mutation (a possibility in H. pylori which has an 12   extremely high mutation rate (4)), we account for this by fitting a finite-sites coalescent model with an 13   empirically parameterized base composition and transition:transversion ratio (Materials and Methods). The 14   differences between infinite- and finite-sites estimates of diversity illustrate the necessity of this model (Table 15   S2). 16   S. pneumoniae. In S. pneumoniae, recombination events are larger (~600 bp) but with a lower rate of 17   ρ=11.5 per kb. Other investigators have identified a bimodal distribution of tract lengths in S. pneumoniae 18   using different methods, and our estimate is similar to their smaller mode (5). However, while heterogeneous 19   recombination processes likely exist in this species, the overall fit of our model to data suggests smaller 20   replacements (previously termed “micro-recombination”) largely dictate genome-wide patterns of linkage 21   (Figure 1). 22   N. gonorrhoeae. To our knowledge, this analysis is the first to infer both a recombination rate (ρ=8.6 23   per kb) and mean tract length in N. gonorrhoeae, which appears to transfer long segments (~2.5 kb) since PC 24   decays slowly with distance (Figure 1). The PC data is notably noisy (in particular the change that occurs 25   among pairs separated by ~500-700bp), which may be due to rearrangements that have occurred since the 26   divergence of our sample and the reference sequence used to estimate inter-SNP distances. Results are similar 27   when we use a different reference sequence (Figure S6A), suggesting the rearrangement may be a derived 28   feature of our sample. While we do not know the exact effect this noise has on our parameter estimates, it may 29   have led to a slight overestimate of recombination rates, as mean PC from simulations parameterized with 30   MDE values listed in Table 1 is lower (~10%) than mean PC between randomly sampled SNP pairs, 31   irrespective of distance (Figure S6B). However, slight overestimates of recombination rates would make our 32   conclusions about the ability of selection to act on epistatic alleles conservative. 33   C. jejuni. Patterns of linkage in C. jejuni seemed relatively more heterogeneous across windows, 34   perhaps leading to model fits that underestimate long-range patterns of linkage (Figure 1) due to 35   recombination hotspots (6) or varied recombination processes involving multimodal tract length distributions. 36   Nonetheless, our parameter point estimates of ρ and mean tract lengths are similar to previous ones obtained 37   using seven housekeeping genes (7). 38   S. aureus. According to our analysis S. aureus frequently transfers (ρ=11.5 per kb) tracts around 39   ~70bp in length, which, like our tract length estimates from H. pylori, is an order of magnitude smaller than 40   previous estimates of ~650bp (8). While S. aureus also recombines via phage-mediated transduction (9) that 41   transfers large segments (~10kb), our results suggest this recombination process occurs less frequently and 42   does not significantly impact PC patterns. Nonetheless, S. aureus still exhibits high genome-wide linkage 43   (Figure 1), since these small transfers affect few SNPs. 44   45   Supplementary Text2. Distributions of pairwise LD between synonymous SNPs. 46   To assess whether epistasis has recently shaped the statistical associations between synonymous SNPs, 47   we polarized mutations using an outgroup species (Table S4) and quantified LD using D = pij-pipj, where pi 48   and pj are the derived allele frequencies at site i and j, and pij is the observed frequency in which both derived 49   alleles occur on the same haplotype. The expected value of D is zero under neutrality, but epistasis or 50   population structure can skew the distribution away from this expectation, even if epistatic fitness effects are 51   equally likely to have positive or negative values (10). For the species examined here, observed distributions   1   52   of D between synonymous SNP pairs largely resembled neutral distributions simulated with the MDE 53   parameters inferred in Table 1 (Table S5). Observed distributions of D had slightly different means (all near 54   zero) but similar variances compared to neutral simulations (Table S5), suggesting epistasis has not strongly 55   shaped genome-wide LD between synonymous SNPs. However, we cannot completely rule out. Incorporating 56   selection into simulation-based models for parameter inference requires knowledge of the distribution of 57   epistatic effects, which we are currently exploring. 58   59   References 60   1. Falush, D. et al. Recombination and mutation during long-term gastric colonization by Helicobacter 61   pylori: Estimates of clock rates, recombination size, and minimal age. Proc. Natl. Acad. Sci. U. S. A. 98, 62   15056–15061 (2001). 63   2. Kennemann, L. et al. Helicobacter pylori genome evolution during human infection. Proc. Natl. Acad. 64   Sci. U. S. A. 108, 5033–5038 (2011). 65   3. Bubendorfer, S. et al. Genome-wide analysis of chromosomal import patterns after natural 66   transformation of Helicobacter pylori. Nat. Commun. 7, 11995 (2016). 67   4. Didelot, X. et al. Genomic evolution and transmission of Helicobacter pylori in two South African 68   families. Proc. Natl. Acad. Sci. U. S. A. 110, 13880–5 (2013). 69   5. Mostowy, R. et al. Heterogeneity in the Frequency and Characteristics of Homologous Recombination 70   in Pneumococcal Evolution. PLoS Genet. 10, 1–15 (2014). 71   6. Yahara, K., Didelot, X., Ansari, M. A., Sheppard, S. K. & Falush, D. Efficient inference of 72   recombination hot regions in bacterial genomes. Mol. Biol. Evol. 31, 1593–1605 (2014). 73   7. Wilson, D. J. et al. Rapid evolution and the importance of recombination to the gastroenteric pathogen 74   Campylobacter jejuni. Mol. Biol. Evol. 26, 385–397 (2009). 75   8. Meric, G. et al. Ecological Overlap and Horizontal Gene Transfer in Staphylococcus aureus and 76   Staphylococcus epidermidis. Genome Biol. Evol. 7, 1313–1328 (2015). 77   9. Novick, R. P., Christie, G. E. & Penadés, J. R. The phage-related chromosomal islands of Gram78   positive bacteria. Nat. Rev. Microbiol. 8, 541–51 (2010). 79   10. Kouyos, R. D., Otto, S. P. & Bonhoeffer, S. Effect of varying epistasis on the evolution of 80   recombination. Genetics 173, 589–597 (2006). 81   11. Lewontin, R. C. The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. 82   Genetics 49, 49–67 (1964). 83   84   85   86   87   88   89   90   91   92   93   94   95   96   97   98   99   100   101   102     2   103   Supplementary Figures: 104   105   Sequences( Muta,on(index( 1( 2( 3( 4( 5( 6( 106   107   Figure S1. Pairwise compatibility. Shown here is an alignment of four sequences (black lines), which are 108   identical to one another except at positions with mutations (red circles). We index each mutation for reference 109   (numbers below). The first pair of mutations (1 and 2) has a PC score of zero: the presence of all four 110   haplotypes is not compatible with a single phylogeny under the infinite-sites model of mutation. Mutation 1 111   can only be present in two genetic backgrounds (presence or absence of mutation 2) if it arose twice at the 112   same position or, more likely, recombination transferred it to another background. Alternatively, the last pair 113   of mutations (5 and 6) has a PC score of 1 since only three haplotypes are observed. PC is analogous to the 114   four-gamete test. 115   116   117   118   119   120     3   Reference' genome' 5’' Core' COGS' Observed"data" window'1' window'2' …' 3’' window'k' Summary' staIsIcs' Distance'(bp)' Distance'(bp)' Distance'(bp)' Distance'(bp)' Simulated"data" Parameters' (θ,'ρ,'q)' Data'simulator:' coalescent'with' gene'conversion' Simulate'k#Imes' x"k# Distance'(bp)' Calculate'discrepancy'between'k'observed'and'k#simulated'windows' via'KolmogorovLSmirnov'staIsIc,'for'each'parameter'set'(θ,'ρ,'q)' 'Use'discrepancy'to'approximate'the'likelihood'funcIon'needed'for' ABC'inference'framework'' 121   122   Figure S2. Analysis pipeline. We use a reference genome to obtain the relative positions of gene alignments 123   (or COGs), which we extract from de novo assemblies and annotate with an amino acid file from the reference 124   individual (see Materials and Methods). Not every position in the reference genome is represented by a gene 125   alignment because (1) the reference may have unique genes and (2) we only consider reference genes that are 126   present in each individual in the sample (i.e. core gene). COGs are grouped into k windows based on their 127   physical location in the reference genome, and we calculate summary statistics for each window but only 128   show mean pairwise compatibility (PC) between synonymous SNP pairs here. Then, we simulate across 129   parameter space using the neutral coalescent with gene conversion by varying the three parameters of interest 130   (θ, ρ, q), simulating k windows for each parameter set, and calculating the same summary statistics for each 131   window. We compare these k simulated windows with the k observed windows using a Kolmogorov-Smirnov 132   statistic, which we use as a discrepancy measure to assess which parameter values give rise to simulated 133   summary statistics the most resemble observed summaries. 134   135   136   137   138   139   140   141   142   143     4   144   A 0.1$ ρ" 0.01$ 0.001$ 0.01$ 145   146   B 0.1$ θ$ 0.0005& 0.1$ 1/q"" (kb)" 1$ 0.3$ 100$ .01$ 0.1$ θ$ 0.1$ 1/q" (kb)" 1$ 10$ 0.3$ 0.001$ 0.01$ ρ" 0.1& 0.1& 4 3 2 1 0.1$ 5& 4& ρ" 0.005& 1/q" (kb)" 1& GRID" 1/q" (kb)" 1& Comparing$10$ Theta$=$0.1$$ Rho$=$0.01$ 0.05& 10& swiminudloawtesd$w$11i0t0h&k$b1$0$ TrLen$=$1000$ $ 147   148   149   FpaigraumreetSe3r .v(aAlu)eEs,vaanludθac&tailocnuloaftesuthmemdiasrcyresptaantcisytibcest.wWeeensiθtmh& iusl“atwsrotimeuibnes1ude0$1lroavg0wteekednsbd”,o$$$medaaitccahwse$itnadnodw1nnSs0ioo(m1ISws0uioitknleρlabda"st)ot=,oew1wrss=0i=tC0sh5io0m0k0a$nuS$oliawmten$d 150   across a range of parameter values. This discrepancy consists ionfd5eKpeonlmdoegnotrloyv$ -Smirnov statistics, one 3& 2& 1& 151   calculated for each of five summary statistics (Materials and Methods). These grids show that parameter 152   153   vvaalluueess,soimr liolawretrodthisocsreepuasnecdietos,gsehnoewraintegtohuer“cohbosiecreveodf”sudmatmaGsaRertIyDhsa"tvaetissitmicsilaernasibmleuslathteedinsfuemremncaeryosftoatuisrtic 154   155   156   157   158   159   160   161   162   pdSgmaeFstmaaeeioaomcrtnrtoeadhiouersmuemmtsallniaiecctmaiitthsnscineouior(swnn“slMefagositortnbaeceotfrtdsdocfeaieoonnrcpwriwmstvtaaeiielrlssrnbesadt.edienms”nWaodgtnaedwtftedothaireofostsiMras.nim5msvH.0aeeaTtgutesilhehnlureioanenedeptd,onseiawsvomdrw)wiabh.daitesieTceuce2cahhsrdc0avileoamsd0lestacikaeedulfu:aebflgldecsaθarfacrhtt=ireeeted.aa2nddnsg(sNtBte1mtswhµst0h)siee,iomRn2nwρwdt0dou=iekaobsl2trancbhweuNtrdaiseswgrottps,enwnipuaann,Cswtwaoseabinnedriitlsrlnommschudlliiraos&nneaemyechmewqqttuuddy&aaeo&b2,ueps2lloovdmeaaatte0aaa0ehewwtrntDws0wcreerekeptsoirsskli&bvneuidaogfyt&&mnbewhwarr&smge&tonld&&sop&iuθih1aipmtmlasevliih0=ticlsciwime&yd&&ohi10stteirnulci&.d0idym1locea&o,riiitdfrtnlrρeeaaidiidslrn=nneattgatdrpto0oieneeeb.dd1p0tnuoh01dehtfoTRT&nnSoin,eiob2iphrsoohsmnadLnse0ateISontoeeesrkuutipd&rfarnotnb=avrisamleceo1lae&&&eraw=e0=ssm/adetdqt=&&.motidb1e00etns2o=uroer0eus.d0=0nes0tt=mo0g15w;eme03t5we00r&smoe0&an00is&&senn0efa&0&nictorr&bnoayetpihtte.eer. 163   the “observed” dataset have similar simulated summary statistic values, or lower discrepancies. Consequently, 164   the comparison of independently simulated windows to windows that may be partially correlated should not 165   have any large effects on parameter estimation. Here the “observed” dataset consisted of 50 sequences 166   generated using θ = 0.03, ρ = 0.005, and 1/q = 1000bp. The grids show that estimated ρ may be slightly higher 167   (~0.006) than the input value, but this could be stochasticity in the coalescence process since a single 168   simulation of a 200kb fragment for 50 individuals was used as “observed” data. 169   170   171     5   log$ρ$ log$q$ log$q$ 172   173   A -4$ -5$ -6$ -7$ -8$ -9$ 174   175   B -5$ -4$ -3$ log$θ$ -4$ -6$ -8$ -10$ -5$ -4$ -3$ log$θ$ -4$ -6$ -8$ -10$ -8$ -6$ -4$ log$ρ$ 5$ 4.5$ 4$ 3.5$ 3$ 2.5$ Parameter' MDE' Mean' Lower/Upper'95%'CI' θ" 0.0191" 0.0185" 0.0058/0.0585" ρ" 0.0051" 0.0051" 0.0042/0.0061" 1/q" 5000" 5000" 3333/10000" 117767    Note: MDE represents the minimum discrepancy estimate from the GPR, or the parameter value set with the smallest discrepancy 178   between simulated and observed data. “Observed” data were simulated with known parameter values θ=0.02, ρ=0.005, 1/q=5000. 179   180   Figure S4. (A) Gaussian Process Regression (GPR) on a dataset simulated from known parameter 181   values. While Figure S1 shows our summary statistics enable inference of the 3 parameters of interest, we 182   constructed a GPR to further show that it accurately models the discrepancy between simulated and 183   “observed” data (simulated from known parameter values) and, more importantly, correctly identifies the 184   parameter values that produced this “observed” data. This GPR model is then used for inference in an 185   approximate Bayesian computation (ABC) framework. Here, we fit a GPR to discrepancies between 186   simulations and “observed” data simulated with θ = 0.02 per site, ρ = 0.005 per site, and 1/q = 5000 bp for 187   280 individuals and 20 30kb windows (the sample size and window number/size used for inference with S. 188   pneumoniae). Red X’s indicate the minimum discrepancy estimates (MDEs), and the black dots the parameter 189   values for which the coalescent model was simulated. (B) The GPR model of discrepancies for parameter 190   inference accurately captures true parameter values. 191   192   193   194   195     6   A" B" 0.52# LD or PC LD or PC Mean#PC# 0.35 0.40 0.45 0.50 0.55 0.60 FA1090# NCCP11945# 1.0 1.0 0.50# 0.8 PC# 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.48# 0.46# 0.0 1 0.0 10 100 1000 10000 1 10 100 1000 10000 0.44# Distance (bp) Distance (bp) 0.42# 196   Observed# Simulated# 197   Figure S5. A. PC between SNP pairs as a function of distance, using two different references (FA1090 or 198   NCCP11945) to estimate distances between genes. B. (Left) Mean PC calculated between 106 synonymous 199   SNP pairs, randomly sampled from the N. gonorrhoeae de novo assemblies irrespective of inter-SNP distance. 200   (Right) Mean PC calculated from 150kb segments simulated with MDE parameter estimates shown in Table 1. 201   Each boxplot contains 100 replicates. These data suggest that we may have slightly overestimated 202   recombination rates, as simulated datasets have mean PC values that are ~9% lower than observed mean PC 203   values. Mean PC from simulated datasets are highly variable because of stochasticity in the evolutionary 204   process (or coalescence). 205   206   207   208   209     7   A" 1.4# Nsa=1# Nsa=10# RelR 0.6 0.8 1.0 1.2 1.4 RelR 0.6 0.8 1.0 1.2 1.4 R Rasex 1.2# 1.0# 0.8# R 0.5 0.1 2.0 5.0 0.6# B" 0 0.1 0.3 0.6 1 3 6 10 30 5.0# Nsi 2.0# R 1.0# 0.5# 0.2# 0.1# R 0.5 0.1 2.0 5.0 0 0.1 0.3 0.6 1 3 Nsi 6 10 30 S. aureus C. jejuni S. pneumoniae N. gonorrhoeae H. pylori 0# 0.1# 0.3# 0.6# 1# 3# 6# 10# 30# 0# 0.1# 0.3# 0.6# 1# 3# 6# 10# 30# NNssii% NNssii% 210   211   Figure S6. Relative and absolute selection responses when multilocus epistatic traits are controlled by 212   many loci (L=10). Shown here are selection responses of simulations parameterized by bacterial 213   recombination rates, either relative to an asexual control (A; same as in Figure 2) or in absolute terms (B), for 214   increasing pairwise epistatic selection coefficients (Nsi) when additive effects per locus are weak (Nsa = 1; left) 215   or strong (Nsa = 10; right). Although relative selection responses do not change much for weaker epistasis 216   (N|si| < 3), the populations are still increasingly responding to selection (B). For each plot, the mean of 2000 217   simulations is shown for each parameter set, and selection responses were calculated after 0.2N generations of 218   selection. Simulations were parameterized with values for S. aureus (dark blue), C. jejuni (light blue), S. 219   pneumoniae (orange), N. gonorrhoeae (light red), H. pylori (dark red), and a eukaryote (black). 220   221   222     8   Wmax 1.00 1.02 1.04 1.06 1.08 Wmax 1.00 1.02 1.04 1.06 1.08 A" W max W maxasex 1.08" 1.06" 1.04" 1.02" S. aureus C. jejuni S. pneumoniae N. gonorrhoeae H. pylori Wmax 1.00 1.01 1.02 1.03 1.04 Wmax 1.00 1.01 1.02 1.03 1.04 1.00" B" 00" 00..11" 00..33" 00..66" 11" 33" 1.04" Nsi 66" 1100" 3300" W max W maxasex 1.03" 1.02" 1.01" 00" 00..11" 00..33" 00..66" 11" 33" 66" 1100" 3300" Nsi 1.00" 00" 11..55" 44..55" 99" 1155" 4455" 9900" 115500" 450" 00" 11..55" 44..55" 99" 1155" 4455" 9900" 115500" 450" NNssii$ NNssii$ 223   224   Figure S7. The fittest genotypes are found in populations with higher recombination. Shown is the mean 225   maximum fitness (Wmax) for each parameter set relative to the asexual (measured across all 2000 simulations) 226   for the 10-locus (A) and the 3-locus (B) trait. Here, maximum fitness was measured after 0.2N generations of 227   selection. Epistatic effects (Nsi) per locus pair are shown on the x-axis, and additive effects per locus increase ="12222089"L    ofcroim,"Bth"=e "l3ef"tLcoocluim,"bn ototthh"eartig"0ht.2coNlu"mgen,nweirthaN8soa n=s1,"orre1l0a8fovr e(A" ) or ANs$lae=(3$.3N3soar=313".3 for (BA).$right$Nsa=10" Wm22a33x01, "    where"Wmax"is"mean"Wmax"across"all"sims" B$le($Nsa=3.33" B$right$Nsa=33.33" 232   233   234   235   236   237     9   A" 1.8$ 1.8$ RelR 0.8 1.0 1.2 1.4 1.6 1.8 RelR 0.8 1.0 1.2 1.4 1.6 1.8 1.6$ 1.6$ R 1.4$ 1.4$ Rasex 1.2$ 1.0$ 1.2$ 1.0$ 0.8$ B" 4.0$ R 2.0$ Rasex 1.0$ 00$ 00..11$ 00..33$ 00..66$ 11$ S. aureus C. jejuni S. pneumoniae N. gonorrhoeae H. pylori Nsi 33$ 0.8$ 66$ 1100$ 3300$ 00$ 00..11$ 00..33$ 00..66$ 11$ 33$ Nsi 1.8$ 1.6$ 1.4$ 1.2$ 1.0$ 66$ 1100$ 3300$ RelR 0.8 1.0 1.2 1.6 RelR 0.5 1.0 2.0 4.0 0.5$ 0.8$ 00$ 11..55$ 44..55$ 99$ 1155$ 4455$ 9900$ 115500$ 450$ 00$ 11..55$ 44..55$ 99$ 1155$ 4455$ 9900$ 115500$ 450$ NNssii" NNssii" 238   239   240   Figure S8. Long-term selection responseAs"floer%m"Nultsialo=c1us$ traits. SAh"orwignhhte"rNe asrae=th1e0se$lection responses 241   242   relative to an asexual control generations of selection with v(Rar/Ryiansegx)sftorernaBg1t"hl0es-l%oofc"puNasistrrwaa=iits3e(A.e3p) 3iasnt$adsiaBs3("-rNliosgic)uhastnt"drNapister(aB=lo)3cmu3se.aa3sdu3dri$etidvaefeteffre1c0tsN(Nsa; 243   left vs. right column). Epistatic effects per locus pair are shown on the x-axis, and additive effects per locus 244   increase from the left column to the right column, with Nsa = 1 or 10 for (A) or Nsa = 3.33 or 33.3 for (B). For 245   each plot, the mean of 2000 simulations is shown for each parameter set. Simulations were parameterized with 246   values for S. aureus (dark blue), C. jejuni (light blue), S. pneumoniae (orange), N. gonorrhoeae (light red), H. 247   pylori (dark red), and a eukaryote (black). 248     10   3(Loci( A" 2.5( RelR 1.0 1.5 2.0 2.5 R Rasex 2.0( 1.5( 1.0( D" 2.0( 1.8( R Rasex 1.6( 1.4( 1.2( RelR 0.8 1.0 1.2 1.4 1.6 Posi%ve( S. aureus C. jejuni S. pneumoniae N. gonorrhoeae H. pylori B" 1.6( 1.4( 1.2( 1.0( 00( 00..11( 00..33( 00..66( 11( 33( Nsi 0.8( E"66( 1100( 3300( 1.4( 1.2 1.4 1.2( RelR 1.0 1.0( Nega%ve( C" 1.8( Sign( RelR 0.8 1.0 1.2 1.4 1.6 1.8 1.6( 1.4( 1.2( 1.0( 0.8( F"00( 00..11( 00..33( 00..66( 11( 33( 66( 1100( 3300( Nsi 00( 00..11( 00..33( 00..66( 11( 33( 66( 1100( 3300( Nsi 1.8( 1.6( 1.4( 1.2( 1.0( RelR 0.8 1.0 1.2 1.6 RelR 1.0 1.2 1.4 1.6 2.0 0.8 1.0( 0.8( 0.8( 00( 11..55( 44..55( 99( 1155( 4455( 9900( 115500( 450( 00( 11..55( 44..55( 99( 1155( 4455( 9900( 115500( 450( 00( 11..55( 44..55( 99( 1155( 4455( 9900( 115500( 450( NNssii$ NNssii$ NNssii$ 249   250   Figure S9. Long-term selection responses for multilocus traits with strictly positive and negative 251   epistasis. We studied additional simulations for 10-locus (top row; A,B,C) and 3-locus (bottom row; D,E,F) 252   traits with strictly positive (left column; A,D) or negative (middle column; B,E) epistasis between beneficial 253   alleles (Nsa = 10 for the 10-locus trait, Nsa = 33.3 for the 3-locus trait). On the right (C,F) are simulations of 254   sign epistasis taken from Figure S8 for comparison. We measured selection responses for each parameter set 255   relative to asexual simulations (R/Rasex) after 10N generations with varying strengths of pairwise epistasis 256   (Nsi). For each plot, the mean of 2000 simulations is shown. Simulations were parameterized with values for 257   S. aureus (dark blue), C. jejuni (light blue), S. pneumoniae (orange), N. gonorrhoeae (light red), H. pylori 258   (dark red), and a eukaryote (black). 259   260   261   262   263   264   265   266   267   268   269   270   271   272   273   274   275   276   277   278     11   279   280   281   282   283   284   285   286   287   Figure S10. Comparing commonly used measures of recombination. Pairwise compatibility (PC, green) 288   and two different measures of LD (D’ as blue, R2 as red) are plotted as a function of pairwise inter-SNP 289   distance. While D’ decays with distance between SNPs, PC generally decays faster, indicating that measures 290   of homoplasy are more sensitive to recombination than D’. R2, or the squared coefficient of correlation, 291   indicates that the correlation between synonymous SNPs in the more highly recombining bacteria decays to 292   ~10%. Here, D’ is calculated as D/Dmax (11), where D=pAB-pApB, or the frequency of haplotype AB (pAB) 293   minus its expected frequency with no LD (pApB). R2 is calculated as D2/(pA)(1-pA)(pB)(1-pB). We note that 294   these metrics are sensitive to sample size (n) and are thus properties of the sample, not necessarily the species 295   from which they came. For example, S. pneumoniae has a lower recombination rate than H. pylori yet PC and 296   D’ eventually decay to similar levels in both samples. However, S. pneumoniae has a considerably larger 297   sample size and thus increased power to detect homoplasy between any two SNPs. 298   299     12   300   301   Figure S11. Parameter inference with Sulfolobus islandicus genome. (A) GPR of the discrepancy between 302   simulated and observed data show that there is not sufficient information to accurately infer q, as much of 303   parameter space results in simulated data that looks similar to observed data (right panel). (B) Minimum 304   discrepancy estimates (MDEs), or the parameter value set with the smallest predicted discrepancy between 305   simulated and observed data. (C) Recombination rate (r) estimate using a previously published mutation rate 306   (µ) for S. islandicus, with r=(ρ/θ)*µ. (D) Summary statistics show S. islandicus has low diversity, and the 307   sample used to quantify recombination has a mutation frequency spectrum close to that expected under 308   equilibrium demography (Tajima’s D close to zero). Low diversity, coupled with small sample size, likely 309   made it difficult to accurately infer tract lengths in (A). 310     13   311   312   Figure S12. Gaussian Process regression (GPR) of the discrepancy between simulated and observed 313   data. These GPR models of the discrepancy were used in an ABC framework to approximate the intractable 314   likelihood function, which enabled us to compute the marginal posterior distribution for these three parameters 315   (θ, ρ, and q) using sequential Monte Carlo sampling (Materials and Methods). The marginal posterior 316   distribution of θ for N. gonorrhoeae included the lower boundary of the parameter range (GPR in fourth row 317   indicates small discrepancies, or θ values with higher approximate likelihoods, for much of the parameter 318   space), but this posterior distribution was more narrow when we reconstructed a 1-dimensional GPR 319   exclusively for this parameter, simulating data by drawing θ, ρ, and q from a uniform distribution but only 320   calculating a simpler discrepancy using the number of segregating sites. Estimates from this 1-D GPR are 321   shown in Table 1. Red X’s indicate the minimum discrepancy estimates (MDEs). 322   323   324   325   326     14   A 1.0 Relative Fitness (to max) 0.980 0.985 0.990 0.995 1.000 Mean 0.995 population fitness relative 0.99 to max 0.985 0.98 00 Relative variance in Fitness (to max) 0.0 0.2 0.4 0.6 0.8 1.0 Variance in fitness relative to max 1.0 0.8 0.6 0.4 0.2 0.0 00 Relative Fitness (to max) 0.9980 0.9985 0.9990 0.9995 1.0000 N=N1=01k00 N=N1=01k0000 1.0 0.9995 0.999 0.9985 550000 11000000 11550000 GeNnge=e1rnaktion 22000000 0.998 00 Relative variance in Fitness (to max) 0.0 0.2 0.4 0.6 0.8 1.0 N=1000 1.0 0.8 0.6 0.4 0.2 550000 11000000 11550000 22000000 Gengeernation 0.0 00 55000000 1100000000 1515000000 2200000000 GNe=nge1en0rkation N=10000 55000000 1100000000 1155000000 2200000000 Gengeenration B ρ=0$ ρ=0.005$ ρ=0.05$ N=1000$ N=10000$ N=1000$ N=10000$ N=1000$ N=10000$ T1/2%max%%for%Mean%popula2on%fitness%(Gen)% 230$ 2300$ 570$ 5900$ 630$ 6600$ 327   T1/2%max%%for%Variance%in%fitness%(Gen)% 320$ 3300$ 710$ 7300$ 730$ 7500$ 328   Figure S13. Rescaled simulations are approximately equivalent. (A) We ran simulations with L = 10 loci 329   for two populations of size N=1,000 or N=10,000, each with three recombination rates: ρ=0 (blue), ρ=0.005 330   (orange), and ρ=0.05 (red). For each population, the epistatic variance in fitness VI was defined such that the 331   mean epistatic effect per locus pair N|sI|=0.5. Consequently, VI was larger for the smaller population. There 332   were no additive effects. We tracked evolution by measuring the mean fitness of the population (upper) and 333   the variance in fitness (lower), both with respect to the maximum value over the course of the simulation. The 334   selective dynamics for both populations are remarkably similar with the exception of the time scale in 335   generations, as the x-axis shows selection happening ~10X faster in the smaller population (B). However, if 336   time is instead scaled by the decay of diversity or variance in fitness (which is how we measure when to stop 337   simulations in Figures 2 and 3), simulating different population sizes gives approximately equivalent results as 338   long as the starting variances in fitness are proportionally scaled such that N|sI| is the same. Rescaling 339   simulations with neutral or additive effects display similar properties as shown by Hoggart et al. 2007. 340   341   342   343   344   345   346   347   348   349   350   351   352     15   353   354   355   STuapbplele&m##e.$nGteanryomTiacb$dleasta: sets$and$subsamples$used$in$this$study.$ 356   357   358   Table S1. Genomic datasets and subsamples used in this study. Species S. aureus C. jejuni S. pneumoniae N. gonorrhoeae H. pylori Genomic dataset reference [1]$ [2],[3]$ [4]$ [5]$ [6]$ Sample strategy Geographic Location (subsample) Sample Year (subsample) invasive disease Europe: France 2006-2007 chicken (host) NA* 2008-2009 nasopharyngeal carriage USA: Massachussets 2007 CDC Gonococcal Isolate Surveilance Project USA: Birmingham (AL), New Orleans (LA), Atlanta (GA), Miami (FL) 2001-2003 asymptomatic and symptomatic patients USA: Cleveland, (Ohio) 1987-1989 359   360   361   *Due to host-specific ecology, samples were isolated with respect to host, not geographic location. [1] Aanensen D, Feil E, Holden M, Dordel J, Yeats C, Fedosejev A, … Grundmann H (2016). Whole-Genome Sequencing for Routine Pathogen Surveillance in Public Health: a Population Snapshot of Invasive Staphylococcus aureus in Europe. mBio, 7(3), e00444-16. [2] Sheppard$SK,$Didelot$X,$Meric$G,$Torralbo$A,$Jolley$K,$Kelly$D,$…$Falush$D.$(2013).$GenomeKwide$associaMon$study$idenMfies$vitamin$B$5$biosynthesis$as$a$host$speci$fi$city$ factor$in$Campylobacter.$Proceedings+of+the+Na1onal+Academy+of+Sciences+of+the+United+States+of+America,$110(May),$1–5.$ [3] Yahara$K,$Didelot$X,$Ansari$M,$Sheppard$SK,$and$Falush$D.$(2014).$Efficient$inference$of$recombinaMon$hot$regions$in$bacterial$genomes.$Molecular+Biology+and+Evolu1on,$ 31(6),$1593–1605.$ [4] Croucher N, Finkelstein J, Pelton S, Mitchell P, Lee G, Parkhill J, … Lipsitch M (2013). Population genomics of post-vaccine changes in pneumococcal epidemiology. Nature Genetics, 45(6): 656–63. [5] Grad Y, Harris S, Kirkcaldy R, Green A, Marks D, Bentley S, .. Lipsitch M (2016). Genomic epidemiology of gonococcal resistance to extended spectrum cephalosporins, macrolides, and fluoroquinolones in the US, 2000-2013. Journal of Infectious Diseases, in press. [6] Blanchard T, Czinn S, Correa P, Nakazawa T, Keelan M, Morningstar L, … Fricke W (2013). Genome sequences of 65 helicobacter pylori strains isolated from asymptomatic individuals and patients with gastric cancer, peptic ulcer disease, or gastritis. Pathogens and Disease, 68(2): 39–43. Table S2. Genomic data summary statistics. 362   363   364   365   366   367   368   369   370   371   372   373   374   375   376   377   378   379   Species Reference genome Core genes Total bp analyzed Number synonymous polymorphisms Number Fourfold Degenerate Sites Watterson θ* (infinite-sites) Tajima θ* (finite-sites) Tajima's D* Mean Pairwise LD** Sample Size S. aureus C. jejuni HO_5096_0412 1268 CjeNCTC11168 1046 965025 952428 27760 9163 121273 106783 0.0304 0.0105 0.0353 0.0109 0.107 -0.071 0.896 0.857 30 23 S. pneumoniae TCH8431/19A 888 737637 29296 108459 0.0225 0.0268 -0.486 0.529 280 N. gonorrhoeae FA1090 H. pylori HPY26695 1161 788 930225 617202 3988 58586 152625 73938 0.00442 0.0870 0.00454 0.1209 0.131 -0.106 0.611 0.495 19 21 *Calculated*from*fourfold*degenerate*sites** **Calculated*as*D`*with*synonymous*SNPs** Note:*We*es=mated*an*infinite?sites*version*of*θ*using*WaBerson’s*es=mator*(WaBerson*1975)*and*an*infinite?sites*version*using*the*minimum*number*of*muta=ons* (Tajima*1996).*Because*of*the*discrepancies*between*infinite?sites*and*finite?sites*es=mates*of*θ,*a*custom*finite?sites*coalescent*simulator*was*used*for*parameter* inference*for*S.#aureus,*S.#pneumoniae,*and*H.#pylori.*Here,*θ*is*in*units*per?site.* *   16   380   381   Table S3. Eukaryote recombination parameters used in simulations with selection. Eukaryote ρ θ mutation rate Population (per site) (per site) (per site per generation) size Recombination rate (per site per generation) Reference Bird (zebra finch) 0.0262 0.0137 7*10-10 9.79*1006 1.34*10-09 [1] Bird (long-tailed finch) 0.014 0.0073 7*10-10 5.21*1006 1.34*10-09 [1] Nematode (C. ramnei) 0.0346 0.057 9*10-09 3.17*1006 5.46*10-09 [2] Insect (D. melanogaster) 0.0124 0.008 2.8*10-09 1.43*1006 4.34*10-09 [3],[4] Fungus (S. paradoxus) 0.0019 0.0009 3.3*10-10 1.36*1006 6.97*10-10 [5],[6] 382   383   384   [1] Singhal S, Leffler E, Sannareddy K, Turner I, Venn O, Hooper D, .., Przeworski M (2015). Stable recombination hotspots in birds. Science 350(6263): 928-932. [2] Cutter A, Baird S, and Charlesworth D. (2006). High nucleotide polymorphism and rapid decay of linkage disequilibrium in wild populations of Caenorhabditis remanei. Genetics 174(2): 901-913. [3] Chan A, Jenkins P, and Song Y. (2012). Genome-Wide Fine-Scale Recombination Rate Variation in Drosophila melanogaster. PLoS Genetics 8(12): e1003090. [4] Keightley P, Ness R, Halligan D, Haddrill P. (2014). Estimation of the Spontaneous Mutation Rate per Nucleotide Site in a Drosophila melanogaster. Genetics 196: 313-320. [5] Tsai I, Burt A, and Koufopanou V. (2010). Conservation of recombination hotspots in yeast. PNAS 107(17): 7847-52. [6] Lynch M, Sung W, Morris K, Coffey N, Landry C, Dopman E, …, Thomas WK. (2008). A genome-wide view of the spectrum of spontaneous mutations in yeast. PNAS 105(27): 9272-77. Tabl e S4. Outgroup species used to polarize mutations. Species' Outgroup'species'used' (sequence'name)! Genbank/RefSeq'Assembly'ID' S.#aureus# #S.#epidermidis#(ATCC!12228)# GCF_000007645.1! C.#jejuni# !C.#coli#(76339)# GCA_000470055.1! S.#pneumoniae# S.#pseudopneumoniae# GCA_000221985.1! N.#gonorrhoeae# N.#Meningi7dis#(Alpha!14)# GCA_000083565.1! 338856    H.#pylori# H.#acinonychis# GCA_000009305.1! DistribuIon%of#D%between%synonymous%SNPs%compared%to%neutral%simulaIons% 387   Table S5. Distribution of D between synonymous SNPs compared to neutral simulations 338889    390   391   392   393   394   395   396   Species' Synonymous'' mean'[95%%CI]% Synonymous'' standard'devia2on' Simulated'' mean' Simulated'' standard'devia2on' S.#aureus# %0.00007%[)0.00006,%0.0002]% 0.098% 0.0258% 0.099% C.#jejuni# %0.0013%[0.0011,%0.0015]% 0.127% 0.0033% 0.100% S.#pneumoniae# 0.0029%[0.0027,%0.0031]% 0.063% 0.0056% 0.063% N.#gonorrhoeae# 0.0010%[0.0008,%0.0011]% 0.077% 0.0002% 0.057% H.#pylori# 0.0073%[0.0073,%0.0074]% 0.061% 0.0002% 0.054% Note:%95%%CIs%of%the%mean%were%calculated%using%bootstrap,%resampling%the%data%100%Imes.%Both%synonymous%SNP%data% and%simulaIons%were%analyzed%in%windows%according%to%the%sizes%reported%in%Table%S5,%simulaIng%as%many%windows%as% those%observed%in%the%genome.%D%was%only%calculated%between%SNPs%greater%than%10%%and%less%than%90%%in%the%sample,% that%is%the%same%SNPs%used%to%infer%recombinaIon%parameters.%   17   397   398   Table S6. Pairwise compatibility summary statistics, window sizes, and coalescent simulators used for each species. inter-SNP distance ranges (bp) Species 1 2 3 4 Number/Size+of+ simulated+segments+ Simulator Base composition (A,G,C,T)* Ti:Tv ratio** S. aureus 1-50 201-250 1001-1100 5001-5500 30/20%kb(windows( Coasim 0.360, 0.190, 0.145, 0.305 1.14 C. jejuni 51-100 501-600 10001-10500 20001-21000 11/150%kb(windows( ms NA NA S. pneumoniae 51-100 501-550 5001-5500 10001-10500 20/30%kb(windows( Coasim 0.305, 0.219, 0.187, 0.289 1.97 N. gonorrhoeae 51-100 501-700 5001-6000 20001-21000 15/150%kb(windows( ms NA NA H. pylori 1-10 40-60 501-550 2501-2550 20/4%kb(windows( Coasim 0.320, 0.217, 0.180, 0.283 6.14 399   400   401   402   403   404   405   *Estimated from the reference genomes in Table S2 **Transition (Ti) to Transversion (Tv) ratio, calculated using fourfold degenerate sites. We intentionally did not use all sites, since most transitions at twofold degenerate sites (which are very common) result in synonymous mutations that likely experience less selection than transversions that result in nonysnonymous mutations. This difference in selective pressures between transitions and transversions results in highly biased estimates of ratios. Note: Four summary statistics were used to quantify recombination parameters from genomic data. Each of these summary statistics involved calculating pairwise compatibility (PC) between SNPs, but they differed by the distance separating SNPs. For example, we calculated PC between SNPs that were 1-10bp apart in H. pylori because PC decayed extremely quickly, as a function of inter-SNP distance, in this species. In order to calculate long-range PC for some species, we used fewer but larger genomic windows.   18