Ab initio determination of coarse-grained interactions in double-stranded DNA

We derive the coarse-grained interactions between DNA nucleotides from ab initio total-energy calculations based on density functional theory (DFT). The interactions take into account base and sequence speciﬁcity, and are decomposed into physically distinct contributions that include hydrogen bonding, stacking interactions, backbone, and backbone-base interactions. The interaction energies of each contribution are calculated from DFT for a wide range of conﬁgurations and are ﬁtted by simple analytical expressions for use in the coarse-grained model, which reduces each nucleotide into two sites. This model is not derived from experimental data, yet it successfully reproduces the stable B-DNA structure and gives good predictions for the persistence length. It may be used to re-alistically probe dynamics of DNA strands in various environments at the µ s time scale and the µ m length scale. © 2012 American Institute of Physics . [http://dx.doi.org/10.1063/1.4748105]


I. INTRODUCTION
Biological systems exhibit high degrees of complexity that are essential to the functions they perform.The DNA double helix is one such example: the properties of this macromolecule are directly influenced by its conformational variability as well as by environmental factors that include counterions, impurities, and temperature, as it performs a wide variety of vital cellular functions such as transcription and replication. 1A full account of such biological functions must rely on a realistic description of the physical processes that underlie them.Fine-grained calculations of DNA at the atomic level 2,3 can provide this level of detailed description, but they are restricted to very small systems of order tens of base pairs and time scales of order ns, whereas most biological processes involve DNA behavior at the scale of more than a hundred base-pairs and take place at the µs time scale and beyond.To probe these biologically relevant processes, a realistic and efficient coarse-grained model of DNA is necessary.Examples of crucial functions under current investigation that could benefit from a coarse-grained model of DNA include the translocation of DNA through nanopores 4,5 in the context of ultra-fast electronic sequencing, DNA capture in and ejection from nanoscale capsules/wells, and the study of the interplay between histones and DNA during mitosis. 6he main theoretical challenge in biological systems is to bridge the scales between the atomistic and the macroscopic without wasting computational resources on uninteresting behaviors, such as the internal dynamics within a base which practically never changes shape, or the motion of solvent molecules far from the biomolecule.Computational methods a) Electronic mail: kaxiras@physics.harvard.edu. of this type have been used before, for example, in the study of protein dynamics. 7Multiscale simulations have also been successfully applied to the study of the electronic behavior and electron localization in stretched dry DNA. 8Our ultimate goal is to enable the study of variations in the DNA structure using multiscale approaches that do not sacrifice accuracy while achieving high efficiency.In the present work we focus on the first stage toward this goal, namely, the development of a coarse-grained model capable of accurately reproducing the structure of double-stranded DNA (ds-DNA) in solution, and simple enough to be efficiently combined with multiscale simulation techniques.
0][11][12][13][14][15][16][17][18][19][20][21] Most of them are constructed in a "top-down" fashion, [9][10][11][12][13][14][15][16][17] where the interaction potentials are chosen to reproduce certain sets of experimental data.Bead-string models 10 have been used to study diffusion and structural relaxation of single strands.Rigid basepair models 11 have been used to describe the elastic properties of DNA. 22,23 he three-site-per-nucleotide model by Knotts and coworkers 14 captures the salt-dependent melting of DNA, and has been extended to include solventinduced attraction between DNA strands 24 that helped to gain insights on hybridization [25][26][27] and on certain sequencedependent effects; 28 this model has also been adapted to describe the mechanical denaturation of long DNA 29 and to include explicit solvent molecules. 30Starr and coworkers proposed a simple model that captures the basics of hybridization, 13 and this model has been used to study Holliday junctions 31 and the self-assembly of DNA-linked nanoparticles. 32Ouldridge and coworkers proposed another model that is sequence independent, but can reproduce several structural, mechanical, and thermodynamic properties DNA. 16,33 9][20][21] The model by Savelyev and coworkers 18,34 was parametrized by matching moments of observables in the Hamiltonian.Other approaches to a bottomup construction include minimizing difference between the all-atom and coarse-grained potentials, 19 imposing molecular bonding geometry constraints, 20 and inverting the Boltzmann function to get the coarse-grained potentials. 21ach of these models has its regime of validity, depending on what experimental data were used for the model derivation.In this work, we develop a minimal model of ds-DNA that incorporates sequence specificity and has realistic mechanical robustness to bending and untwisting forces.We seek a model that is chemically accurate, yet not based on empirical observations.For these purposes, we take a bottom-up approach, and construct the potentials of the coarse-grained system directly from first-principles calculations.We divide the interaction potentials into independent parts that come from physically distinct contributions, and for each contribution we carry out ab initio calculations to find the functional forms of the potentials.We impose no a priori assumption on the functional forms of the potentials-these are determined based on results of the ab initio calculations.The present form of the model has its limitations that we will discuss, but it is flexible for future improvements.
This paper is organized as follows: in Sec.II we describe the methodology employed for the first-principles calculations.In Sec.III we present the ab initio data for each energy contribution and the analytical forms of the corresponding potentials.The implementation and performance of the coarsegrained model is described in Sec.IV.Finally, we present validations of the model in Sec.V, and conclude in Sec.VI.

II. METHODOLOGY OF AB INITIO CALCULATIONS
We carry out the first-principles calculations using density functional theory (DFT). 35In our DFT calculations, we do not deal with environmental factors such as solvent molecules and ions, and the calculations are carried out at the ground state (zero temperature).The environment and temperature factors are added a posteriori in the coarse-grained model through electric field screening and through Brownian dynamics.A more accurate approach will include temperature dependence in the coarse-graining procedure, but this is a challenging task and is not yet taken into account in the current work.Presence of the water molecules and ions may also affect the energetics, but their effects are not investigated in this work.
We use SIESTA, 36 an electronic structure code based on atomic-like orbitals, to carry out the DFT calculations.This approach has been previously applied in similar studies of gas-phase DNA bases 37 and successfully reproduces properties such as optical response in comparison to available experimental data. 38We use the Troullier-Martins scheme 39 to obtain pseudopotentials to eliminate the core electrons from the calculation and to produce a smoother valence charge density.We use a basis of double-ζ polarized atomic orbitals for all the atoms involved (13 orbitals for C, N, O, and P; 5 orbitals for H).An auxiliary real space grid equivalent to a plane-wave cutoff of 100 Ry is used for the calculation of the Hartree and exchange-correlation energies.For geometry optimization, a structure is considered fully relaxed when the magnitude of force on every atom is smaller than 0.04 eV/Å.
We use the generalized gradient approximation (GGA) with the PBE exchange-correlation functional 40 to describe the backbone-base and the inter-base-pair interactions, as this functional is known to describe covalent and hydrogen bonds well. 41,42 nteractions within the backbone are treated with the local-density approximation (LDA), 43 which is adequate for describing small radial and angular deviations of covalent bonds from their equilibrium values.The interaction between stacked base-pairs has a large contribution from long-ranged van der Waals (vdW) forces and exhibits an elaborate energy landscape that depends sensitively on the geometry. 44,45 ocal or semi-local exchange-correlation functionals cannot describe such long-range effects: 46 they do not reproduce the ∼r −6 behavior at large separation r that is characteristic of vdW interactions, and usually underestimate the stacking distance between two planar structures.There exists empirical corrections that add the vdW effects to the energies obtained from DFT calculations, 47 but we find that such correction still leads to underestimation of the stacking distance.Therefore, we employ a non-empirical long-range vdW density functional developed by Dion et al. 48to carry out calculations for the stacked base-pairs interactions.
The calculated interaction energy between two constituents may be susceptible to the basis set superposition error (BSSE), which results in unphysical lowering of the interaction energy when the two constituents come close to each other.To correct for BSSE, we take the full counterpoise approach: 49 at each separation r, we optimize the dimer geometry and carry out four additional calculations (constituent A along and constituent B along, with and without ghost orbitals), and obtain the BSSE-corrected interaction energy as where the subscript denotes the geometry: AB is the optimized dimer geometry, A and B are its constituent geometries, and A * and B * are the individually optimized constituent geometries; the prime indicates that ghost orbitals of the other constituent are used in the energy calculation.

III. CONSTRUCTION OF MODEL POTENTIALS
We coarse-grain each nucleotide into two interaction sites: one for the base and one for the sugar-phosphate backbone.The base site is identified with the position of the nitrogen atom (N1 for pyrimidines; N9 for purines) that connects the base to the sugar, and the backbone site is identified with the position of the sugar C1 ′ atom.This representation is illustrated in Fig. 1.This choice of coarse-grained site coordinates allows for unambiguous determination of bonding distance, bonding angles, etc., that enter the model variables.It also facilitates direct comparison to experimental structures during validation of the model.
To derive the effective interaction between these coarsegrained sites, we follow earlier work 50 and decompose the total interaction energy into contributions that have distinct physical meanings: the hydrogen bond between complementary bases E hb , the stacking energy between neighboring basepairs E st , contributions from the backbone E bb , and electrostatic interaction between the charged phosphate groups E el : These are physically distinct contributions, and in our model they are treated as independent and additive.Each contribution depends on several variables, and the effect of these variables are not taken as independent.
The choice of the coarse-grained sites and the decomposition of the interaction potentials has determined the structure of the model.The remaining construction of the model is to find explicit functional forms for each contribution, which we address in the following.

A. Hydrogen bonding
We consider first the interaction between two complementary bases: adenine-thymine (AT) or guanine-cytosine (GC), which comes from hydrogen bonds.We consider its dependence on the base-to-base separation and on the relative angles between the planes of the two bases.The angular dependence keeps the two bases coplanar and maintains the correct base-pair geometry.
We calculate the interaction energy as a function of the distance r hb between the pyrimidine N1 atom and the purine N9 atom (see Fig. 1), by starting from the energy-minimized  3) and ( 4) with parameters given in Table I.
geometry and varying r hb by translating the two bases parallel to the direction of the hydrogen bonds.At each r hb value, we optimize the geometry while fixing the four atoms that correspond to the coarse-grained sites (N1 and H1 of the pyrimidine, and N9 and H9 of the purine) to preserve r hb and the relative angles.At small r hb values, the atoms are also constrained on the base-pair plane to prevent the two bases from rotating out of plane.The effect of flipping angles and of nonplanarity will be examined separately.
Figure 2 shows the calculated interaction energy, which can be described by the universal binding energy relation (UBER) 51 E hb_r (r hb ) with its minimum at r hb = r 0 , E hb_r = E 0 .The parameters E 0 , r 0 , and l are given in Table I.The energy at the minimum is −0.64 eV for AT and −1.17 eV for GC, in agreement with the values −0.60 eV and −1.17 eV, respectively, obtained in previous calculations of the DNA hydrogen bonds. 42We do not parametrize the base-to-base interaction between mismatched base-pairs, but doing so will be a straightforward extension of the current work.
Next we examine the effect of non-planarity of the bases: when the two planes of the bases are not aligned, the hydrogen bond weakens.We measure non-planarity by the dihedral angle θ d between the planes of the two bases.Starting from the energy-minimized geometry (θ d = 0), we vary θ d by rotating the two bases in opposite directions around the line from the pyrimidine N1 atom to the purine N9 atom.To keep θ d fixed, we optimize the geometry while holding the 6-fold ring of the   6) and (7), and the backbone base-sugar-sugar angle potential in Eq. ( 18).k bss is in 10 −4 eV/deg 2 , all angles are in degrees.
Base φ two bases fixed.The resulting energies are shown in the inset of Fig. 2. We fit the interaction energy with the expression where E 0 is the same as in Eq. ( 3); values of the parameter k d are given in Table I.
The interaction between the two bases also depends on the in-plane angles of the two bases.In our coarse-grained model, this is described by the flip-angle φ (i) hb for the two bases (i = 1, 2), defined as the base1-sugar1-sugar2 angle for i = 1 and the base2-sugar2-sugar1 angle for i = 2 (see Fig. 1(b)).The flip-angles in the ground-state configuration, φ (i,0) hb , are listed in Table II.Starting from the energy-minimized geometry (φ (i) hb = φ (i,0) hb ), we consider either rotating the pyrimidine around the H1 atom, or rotating the purine around the H9 atom.The anchoring hydrogen atom mimics the backbone, which in physiological conditions remains relatively stationary when base flipping occurs.To keep φ (i) hb fixed at each rotated configuration, we optimize the geometry while fixing the pyrimidine atoms N1 and H1, and the purine atoms N9 and H9.For rotation inward, the atoms are constrained on the base-pair plane to prevent the two bases from rotating out of plane.
The resulting energies are plotted as a function of hb is positive, the energy is attractive and decays to zero.When φ (i) hb is negative, the two bases repel each other.We describe these two types of behavior separately, and fit the base-flipping interaction en-  5) with parameters given in Table II.
ergy to the expression ) where E (a,i) hb_f is described by an exponential and E (b,i) hb_f is described by a harmonic spring and is the step function.Again, E 0 is the same as in Eq. ( 3).The values of the parameter σ are listed in Table II.
Up to this point we have the interaction potential between the two complementary bases as a function of r hb , θ d , φ (1) hb , and φ (2) hb individually while keeping other variables fixed at the equilibrium values.Here we assume that the interaction depends only on these four variables; even in this case, the general dependence is not trivial, as the four variables are not independent.For example, when θ d or φ (i) hb is large, the hydrogen bond is basically broken, and there should no longer be a strong dependence on r hb .To capture the interdependences between the variables while keeping a reasonably simple form for the interaction, we define the following functions and take the final expression of the hydrogen bonding energy to be Note that the repulsive part of the flipping interaction E (b,i) hb_f is treated as additive since it does not serve to weaken the bond; this also ensures that the modulation functions τ (i) hb_f take values strictly between 0 and 1. Equation ( 9) is reduced to Eqs. (3)-( 5) for close-to-minimum geometries, i.e., Therefore, we expect that close to equilibrium, Eq. ( 9) serves as a good approximation to the interaction energy between the two bases.

B. Stacking interactions
We turn next to the interaction between two stacked basepairs.We consider the dependence on the stacking distance r st between the two base-pairs first.The two base-pairs are relaxed and stacked in parallel.Following usual conventions, 1 we define the axial point of each base-pair to be the 1:2 weighted center of the pyrimidine C4 atom and the purine C6 atom, and define the twist angle θ tw to be the angle made by the C4-C6 vector of the two base-pairs.This is illustrated in Fig. 4(a).The two base-pairs are stacked so that their axial points align at a distance r st apart in the direction normal to the base-pair plane, and so that the twist angle θ tw is 36 • .The precise definition of r st and θ tw in the coarse-grained model will be described in the Sec.IV.
There are ten different combinations of stacking, known as the ten Watson-Crick nearest-neighbors.For each of these ten combinations, we vary r st from 2.5 Å to 5.0 Å and calculate the interaction energy without further relaxation.The results are shown in Fig. 5.At this range of r st , the energy variation is well described by the expression Symbols are BSSE-corrected results from DFT calculations using vdW density functionals, and lines are fitting to Eq. ( 11) with parameters given in Table III.For clarity, curves are shifted upward by increments of 0.6 eV; the curve for AT-AT stacking is not shifted.In our notation, GC-AT indicates the stacking of GC base-pair and AT base-pair with GA and TC following the 3 ′ -5 ′ direction.
which has a minimum at r st = r m , E st_r = ϵ.The resulting values of the parameters from fitting the ab initio values with this expression are listed in Table III.It may seem surprising that the attractive part of the interaction behaves as r −5 st .This is no coincidence: the attraction between the two base-pairs is due to the vdW force, which behaves as r −6 for two point particles and as r −4 for two thin sheets (assuming additivity of the vdW interaction).The range of r st being considered here is comparable to the radius of the base-pairs (about 4 Å), so we expect a power in between the two, i.e., r −5 st .We find this is indeed the case; expressions with any other integer power of r st lead to poor fitting.Again, we do not parametrize the stacking interaction for mismatched base-pairs, which can be a straightforward extension to the current work.
Another dependence of the stacking interaction comes from the twist angle θ tw .To examine this dependence, we fix r st at 3.4 Å and vary θ tw from 0 • to 360 • .Again, the two base-pairs are parallel and have their axial points aligned.The resulting energies E tw are shown in Fig. 6.This energy term has a complex dependence on θ tw , defying a simple analytical expression.Given its periodicity, we fit E tw with a Fourier series: The resulting values of the coefficients in this expansion are given in Table IV.For AT-AT and GC-GC stacking, E tw (− θ tw ) = E tw (θ tw ) by symmetry, and so the coefficients b n are zero.We note that although Eq. ( 12) involves many trigonometric functions, the higher terms can be obtained from trigonometric addition rules or from the Chebyshev method.Therefore, its computational cost is similar to that of Eq. (11).
The stacking interaction also depends on the tilt-angle θ tl , which we define as the angle between the normal vectors of the two base-pair planes.The two base-pairs can tilt in a variety of ways, and so this dependence can be complex.We find that close to the optimal stacked configuration (at small θ tl , and r st = r m , θ tw = 36 • ), the interaction energy dependence can be approximately described by where ε is the same as in Eq. ( 11).This expression also has the physical meaning that the energy is at a minimum in the case of parallel (θ tl = 0) and anti-parallel (θ tl = π ) stacking, and is zero when the two base-pairs are perpendicular to each other (θ tl = ±π /2).The interaction between two stacked base-pairs depends on r st , θ tw , and θ tl .To capture all three dependences and their correlations with a reasonably simple form, we define the following functions: τ tw (θ tw ) = E tw (θ tw )/ϵ, τ tl (θ tl ) = E tl (θ tl )/ϵ (14)   and take the final expression of the stacking interaction to be E st (r st , θ tw , θ tl ) = E st_r (r st )τ tw (θ tw )τ tl (θ tl ), (15)   similar to our treatment of the hydrogen bond.Equation ( 15) involves much simplification, as the true interaction energy may depend on more than these three variables, and the dependence on r st , θ tw , and θ tl may not factorize.By the same argument as for the hydrogen bonds, though, we expect Eq. ( 15) to be a good approximation close to equilibrium.The stacking potential here is formulated as interaction between neighboring base-pairs, rather than between neighboring bases.This approach is sufficient when dealing with ds-DNA structures.However, we note that it may not be appropriate in processes like melting or hybridization that involve single-stranded DNA or broken base-pairs.In such situations, the stacking interaction should be reformulated as interaction between bases instead and be constructed in a similar fashion.

C. Backbone contribution
We next consider contributions from the backbone.First, to extract interactions within the sugar-phosphate backbone as distinct from any electrostatic or stacking interactions, we take the phosphate groups to be protonated and carry no charge, and replace the bases with terminating hydrogens.Starting from the energy-minimized geometry, we uniformly stretch or compress the backbone along the helical direction, as illustrated in Fig. 7(a), and allow the phosphate units to relax.The resulting energy E bb_r is shown in Fig. 8 as a function of r ss , the distance between the C1 ′ atoms of neighboring sugars.The interaction energy per sugar-phosphate unit can be described by the expression E bb_r (r ss ) = c 2 (r ss − r ss_0 ) 2 + c 4 (r ss − r ss_0 ) 4 (16 with c 2 = 18.773 eV/Å 2 , c 4 = 0.333 eV/Å 4 , and r ss_0 = 4.976 Å.The base is covalently bonded to the backbone.For this interaction, we consider only a base and a sugar, with the phosphate groups replaced by hydrogen atoms.Starting from the energy-minimized geometry, we translate the base and the sugar groups in opposite directions along the vector connecting the two.Then we hold the two atoms of the base-sugar covalent bond fixed, optimize the geometry, and calculate the interaction energy.We find results for the four bases to be nearly identical, so only results for guanine will be discussed.These are shown in Fig. 9, with a fit to the UBER expression that includes two additional terms where r CN is the distance between the sugar C1 ′ atom and the base N atoms (N9 of purines or N1 of pyrimidines).The values of the parameters are: E 0 = −3.542eV, r 0 = 1.455Å, l = 0.400 Å, f 2 = −0.132,and f 3 = 0.215.The base also interacts with neighboring backbone groups, and this gives rise to the 5 ′ /3 ′ asymmetry of the double-helix.We characterize this interaction using the an-   17).Inset shows the energy as a function of the sugar-sugar-base angle θ 5 ′ defined in Fig. 7(b), and lines are fits to Eq. ( 18) with parameters given in Table II.
gle between the backbone strand and the base.We consider a segment of single-stranded DNA, with three sugar groups connected by two protonated phosphate groups.The first and third bases are replaced with terminating hydrogen atoms, and the middle base is either A, C, G, or T (see Fig. 7(b)).
From the energy-minimized geometry, we rotate the middle base around the vector r CC5 ′ × r CN (both are shown in Fig. 7(b)), with the pivot point at C1 ′ of the middle sugar.For each rotated configuration, we relax the system while holding the connecting N atom of the base (N9 of purines or N1 of pyrimidines) and the C1 ′ atoms of the three sugars fixed.The resulting energy is shown in the inset of Fig. 9.We consider this energy to be a function of the two angles θ 3 ′ and θ 5 ′ , where θ 3 ′ is the angle between r CN and r CC3 ′ , and θ 5 ′ is the angle between r CN and r CC5 ′ (see Fig. 7(b)).We fit this energy term with the function where θ (0) 3 ′ and θ (0) 5 ′ are the angles in the initial configuration relaxed without constraint, and k bss is obtained from fitting.The resulting values for these parameters are listed in Table II.
We treat Eqs. ( 16)-( 18) as independent interactions, and so the total contribution of the backbone and its interaction with the base-pairs is given by

D. Electrostatics
All the potentials considered so far represent bonded or short-ranged interactions between components of DNA.The DFT calculations considered DNA as being composed of strictly neutral components in vacuum.However, under physiological conditions, DNA is solvated in an aqueous electrolyte, and the phosphate groups in DNA are deprotonated.Therefore, we place a charge of −e on each sugar-phosphate group of the coarse grained model, and include a Coulomb potential between these groups with a distance-dependent dielectric function ε(r), r being the distance between the charges and ε 0 the dielectric constant in vacuum.This ε(r) plays two roles: (i) it incorporates the effects of ionic screening, and (ii) it accounts for the fact that closely spaced interacting groups are only partially solvated by the surrounding electrolyte.We use the following expression, closely related to the formulation of Ref. 52, for the dielectric function: where ε ∞ = 78 is the dielectric constant of water, and ε int is the dielectric constant in the interior of the DNA helix, which we take to be 3. 52 The Debye length is κ −1 , related to the ionic strength I through where k B T is the thermal energy and N A is Avogadro's number.For example, at [Na + ] = 0.1 M, the Debye length is 9.6 Å.The values r 0 and r 1 determine the boundary between unscreened and screened electrostatic interactions.The characteristic sizes of the chemical groups represented by our coarse-grained units range from about 2 Å to about 5 Å.Consequently we set r 0 = 4 Å.Similarly, we choose r 1 = 13 Å, approximately five times the mean water oxygen-water oxygen distance in bulk water, 53 as the distance where the effective dielectric constant recovers the value predicted from the Debye-Hückel theory of screening in a bulk electrolyte.The value of α is then chosen such that ε(r) is continuous.
Between the two charged groups within the same base-pair, ε(r) = ε ∞ is used.Finally, since the electrostatic interaction decays exponentially at large distance, we truncate this interaction for distances above five times the Debye length.The simple electrostatics approach here is a first approximation, and is not meant to capture details such as the dependence of ion condensation on conformation, 54 which will require explicit solvent molecules.

IV. IMPLEMENTATION AND PERFORMANCE OF THE COARSE-GRAINED MODEL
In Sec.III we derived the interactions between the coarsegrained sites.The total interaction energy is given by Eq. ( 2), and the different contributions include summations over all relevant interacting units.Specifically, E hb includes a summation over all base-pairs, and E st includes a summation over all pairs of neighboring base-pairs.Of the three terms that comprise the E bb contribution, E bb_r includes a summation over all pairs of neighboring backbone sites, E bb_c includes a summation over all pairs of connected backbone and base sites, and E bb_b includes a summation over all bases.Finally, E el includes a summation over all pairs of backbone sites.
Next, we relate the geometrical variables to the positions of the coarse-grained sites.In the coarse-grained representation, the base site is identified with be the pyrimidine N1 atom or the purine N9 atom, and the backbone site is identified with the sugar C1 ′ atom.Then the distance and angle variables r hb , φ (1) hb , φ (2) hb , r ss , r CN , θ 3 ′ , and θ 5 ′ follow from the position of the coarse-grained sites.We define the remaining geometrical variables as follows.For each base-pair, we define the normal vector of base i to be n (i) = r CC × r (i) CN with i = 1, 2 corresponding to the two bases, and with r CC and r (i) CN defined in Fig. 1(b).The dihedral angle θ d of this base-pair is given by cos θ d = n(1) • n(2) , with the hat denoting unit vectors.We define the average normal of this base-pair to be n = n (1) + n (2) , and let the tilt-angle θ tl between base-pair j and base-pair j + 1 be given by cos θ tl = nj • nj+1 .As for the stacking distance r st , we take the average distance of the four sites in base-pair j + 1 to the plane of base-pair j.Specifically, we compute the four vectors d 1 , d ′ 1 , d 2 , and d ′ 2 shown in Fig. 4(b), project them onto the two normal vectors as Finally, the twist angle θ tw is defined as the angle between r CC, j and r CC,j +1 − (r CC,j +1 • n(1) j ) n(1) j , which is the projection of r CC, j + 1 onto the plane of base-pair j.
We implement this model for calculations in the microcanonical ensemble (constant number of particles N, volume V , and energy E) and the canonical ensemble (constant N, V , and temperature T) with implicit solvent Brownian dynamics.We choose the mass of each site to be the total mass of the group of atoms that it represents: 178 amu for the backbone site, 134 amu for base A, 125 amu for T, 150 amu for G, and 110 amu for C. Multiple-time-step integrators are used for time propagation: the RESPA algorithm 55 is used for the NV E ensemble, and the multiple-time-step stochastic integrator 56 is used for Brownian dynamics.The Coulomb interaction between different base-pairs is treated as the slow-varying component of the Hamiltonian, integrated with a time-step of 20 fs, while all the other forces are categorized into the fast-varying component of the Hamiltonian, integrated with a time-step of 6.67 fs (i.e., three divisions per large time-step).This choice of time-steps ensures high stability: when running in the NV E ensemble at 300 K, the total energy is conserved to within 5 × 10 −4 eV per base-pair for arbitrary sequences at several conditions tested.
As a measure of the performance of this coarse-grained model, a 20 fs step of the stochastic integrator for a DNA molecule consisting of 250 base-pairs at 0.1 M salt concentration takes 2.4 ms on a single Intel Xeon X5560 processor (2.80 GHz).A 100-ns simulation of such a strand takes 3 h 20 min.This performance exceeds the speed of all-atom simulation methodologies by many orders of magnitude, and is comparable to other coarse-grained models with similar complexity.The computation time scales linearly with the number of base pairs, since the electrostatic interactions are damped.This performance makes simulations of ds-DNA in the µm length scale and µs time scale feasible.

V. MODEL VALIDATION
The different contributions in the model potentials have been derived from calculations of isolated units.Here we test the validity of the combined potential (additivity of the individual contributions), and its performance in larger systems (transferability of the potentials).Also, our a posteriori inclusions of the ionic and temperature effects are crude, and their consequences will also be assessed.

A. Equilibrium structure
First, we test the structure prediction of the model with poly-AT DNA consisting of 250 base-pairs (bp) at temperature 300 K and salt concentration 0.1 M. When initialized with coordinates of the B-form DNA, 57 the B-DNA structure is maintained.To test the robustness of our model potential, we also carried out simulations starting with a highly off-equilibrium structure-an elongated and uncoiled double strand-and followed the evolution.Figure 10 shows that in such case, the system is able to coil back to the B-DNA structure.Signatures of the B-DNA structure such as the major groove, minor groove, and the 10 bps-per-turn period are well reproduced.This demonstrates the model's ability to predict the stable double helix structure, without resorting to Go-like potentials that are defined relative to a specific reference structure as in Ref. 14.
To be quantitative, we calculate the averages of many structural properties after the system equilibrates.These quantities are listed in Table V, with comparison to reference values derived from recent crystallographic data of a naturally occurring 16 bps oligomer. 58The average deviation of our model from experiment is about 2%.Given that no exper-

B. Persistence length
Next, we check the model's ability to predict mechanical properties at long lengths by examining the persistence length (l p ), which gives an estimate of the bending rigidity of DNA.The persistence length can be extracted from the decay of the orientational correlation function where ri is the unit tangent vector at base-pair i, and s ij is the arc length from base-pair i to base-pair j.The tangent vectors and the arc length are evaluated along the axis of the doublehelix of DNA.As the DNA axis is not an explicit interaction site in our coarse-grained model, we extrapolate its location by estimating the direction of the local helical axis and averaging the projection of the backbone sites onto the local axial direction.We consider 250-bp DNA with poly-AT sequence, poly-GC sequence, and a segment of the enterobacteria phage-λ genome (with GC content 0.47).Simulations at 300 K and 0.15 M salt concentration give l p = 53 nm, 47 nm, and 41 nm, respectively, for the poly-AT, poly-GC, and the phage oligomer strand.These values are in close agreement with the experimental value of l p ≈ 50 nm for double-stranded DNA. 59The fittings used to obtain these values are shown in the inset of Fig. 11.
We also test the validity of the implicit salt treatment by considering the salt dependence of the persistence length.This dependence has been shown experimentally 59 to agree well with the nonlinear Poisson-Boltzmann prediction for uniformly charged cylinders 60 where l 0 is the persistence length at infinite salt concentration, κ −1 is the Debye length given in Eq. ( 22), and l B is the Bjerrum length.The second term is the electrostatic contribution, and is inversely proportional to the salt concentration.24) with l 0 = 47 nm, 42 nm, and 39 nm, respectively, for poly-AT, poly-CG, and phage sequences.Inset shows the decay of the orientational correlation for a set of simulations at 0.15 M salt, where symbols are data and lines are exponential fits, Eq. ( 23).
Figure 11 shows that the salt dependence of l p predicted by our model also agrees with Eq. (24).

C. Overstretching
To test the mechanical properties of the model under more extreme conditions, we perform numerical stretching experiments, using a 100-bp DNA from a segment of phageλ, at temperature 300 K in 0.1 M salt.One end of the DNA is fixed to a surface with a harmonic potential, and a constant pulling force is applied onto the backbone site at the other end.It is known that at around 65 pN, the stretched DNA undergoes a sudden extension of about 70% known as superstretching. 61,62 igure 12 shows the extension curve from our model for pulling on either the 3 ′ end or the 5 ′ end.Inset of Fig. 12 provides a typical image of the DNA before and after the sudden stretching and unwinding occur.The simulations capture the superstretching transition, although the critical force in our model is higher, and the amount of the  sudden extension is less.The superstretching transition is a very stringent test for our computational model, as it involves strong departures from the equilibrium structure.Notwithstanding the neglect of solvent interactions and the athermal character, our coarse-grained potential still provides reasonable results under such critical conditions.Simulations also reveal that the superstretching is accompanied by unwinding of the double helix, as shown in Fig. 12. Above the critical force, our model shows that the DNA is twisted slightly in the left-handed direction.

D. Bubble formation
Although our potentials are formulated for ds-DNA, in some applications it is desirable to have a model that can account for broken hydrogen bonds.Here, we test the validity of our model when the double strand in a 50-bp poly-AT DNA is partially broken.The onset of melting can be observed in our model, but the melting temperature is overestimated.At temperature 550 K, the double-strand shows occasional unzipping; one instance is shown in Fig. 13(a).At 800 K, formation of bubbles can be observed; one instance is shown in Fig. 13(b).The overestimation of melting temperature may arise from several reasons.First, our stacking potential is not formulated for single strands or strands with broken hydrogen bonds.Second, the presence of water molecules may lower the energy of the broken-bond states, but this effect is not taken into account.Third, the coarse-grained potentials here are derived at zero temperature, which is stiffer than at finite temperature.However, the appearance of these intermediate molten states is still an indication that our model is stable beyond strictly ds-DNA structures.

VI. CONCLUSION
We have constructed the interaction potentials for a coarse-grained model of double-stranded DNA.The model potentials are derived from first-principles calculations for the DNA bases and base-pairs.Contributions to the potential include hydrogen bonding, base-stacking, backbone stretching, and interactions between bases and the backbone.Analytical functions are fitted to the computed energy, leading to a potential for DNA that is able to handle coarse-grained configurations of random DNA sequences and can be used to model related biophysical processes.Unavoidably, some simplifying assumptions were used in the model formulation and construction.In order to use a minimal number of empirical terms, only the effect of ionic screening was added to the microscopically derived potential.At a later stage, such empirical term can be replaced by a more microscopic approach. 63 series of tests were performed, and they verified that the coarse-grained potential can reproduce the stable B-DNA structure, and that the predicted structure matches the known crystallographic structure of DNA to a few % in key structural parameters.The model produces persistence lengths in close agreement with experimental data, and the response of the persistence length to varying salt concentrations also agrees with the prediction for a linked-cylinder model.These tests suggest that the mechanical properties of the coarse-grained model will show realistic trends when subject to the complex electrokinetic environment in the cell or in artificial systems such as the interior of an nanochannel during translocation experiments.
The tests on overstretching and on bubble formation show that the model may capture the qualitative features in such states, but also reveal some limitations of the model.To describe such states accurately, we may need a more realistic account of the solvent molecules and of the temperature dependence of the coarse-grained potentials.This will be the subject of future work.Other possible and more straightforward extensions of the current model will be to formulate the stacking interaction to work with ss-DNA, and to parametrize interactions for mismatched base-pairs and for uracil (to extend the model to RNA).Also, the current model does not account for inter-strand interactions except for the electrostatic repulsion, and so situations like long DNA strands under confinement may not be appropriate for this model.
The approach described in this work is attractive for describing ds-DNA using a minimal of empirically derived parameters and with a good compromise between accuracy and computational efficiency.We suggest that this model could be a useful tool for simulating the behavior of ds-DNA in a variety of biologically relevant or device-related scenarios with modest computational resources.

FIG. 1 .
FIG. 1. Schematic of an adenine-thymine (AT) base-pair showing the geometric variables in the hydrogen bond potential in (a) the atomic structure calculations and (b) the coarse-grained model.The distance r hb , the two flipangles φ (i) hb , and the dihedral angle θ d (angle between the two base planes, not shown) are used to describe interaction between the two bases.The vectors r CC and r (i) CN are used to define the normal vectors of the two bases and the dihedral angle.In this figure and Figs. 4 and 7, the colored spheres represent C (cyan), H (white), N (blue), O (red), and P (brown) atoms.

FIG. 2 .
FIG. 2. Hydrogen bonding energy versus distance r hb between two complimentary bases.Inset shows the dependence on the dihedral angle θ d between the two bases.Symbols are BSSE-corrected results from DFT calculations, and lines are fittings to Eqs. (3) and (4) with parameters given in TableI.

FIG. 3 .
FIG.3.Hydrogen bonding energy versus flip-angle φ hb of A and G (results for T and C are similar and are not shown).Symbols are BSSE-corrected results from DFT calculations, and lines are fitting to Eq. (5) with parameters given in TableII.

FIG. 4 .
FIG. 4. Schematic showing two stacked base-pairs in (a) the atomic structure calculation (top view), and (b) the coarse-grained model (side view).In (a), the bottom base-pair is shown in gray for clarity.The atoms used to define the twist angle θ tw are colored in green, and the axis of the double helix (shown as a dotted circle) is defined as the 1:2 weighted center of the green atoms.In (b), the planes of the two base-pairs are shown.The twist angle is reduced for clarity of illustration.The four vectors d 1 , d ′ 1 , d 2 , and d ′ 2 as indicated are used to define the stacking distance r st in the coarse grained representation.See main text for description of the other geometric variables involved in stacking.

FIG. 5 .
FIG.5.Stacking energy between two neighboring base-pairs (results for AT-GC, CG-GC, TA-GC, and TA-AT stacking are similar and are not shown).Symbols are BSSE-corrected results from DFT calculations using vdW density functionals, and lines are fitting to Eq. (11) with parameters given in TableIII.For clarity, curves are shifted upward by increments of 0.6 eV; the curve for AT-AT stacking is not shifted.In our notation, GC-AT indicates the stacking of GC base-pair and AT base-pair with GA and TC following the 3 ′ -5 ′ direction.

FIG. 7 .
FIG. 7. (a) A single backbone strand in its relaxed B-DNA form, used to evaluate the backbone interaction E bb_r .The variable r ss is the distance between the C1 ′ atoms of neighboring sugars, and is shown in the magnified part.(b) Structure used to evaluate the base-sugar-sugar interaction E bb_b .The vectors used to define the angles θ 3 ′ and θ 5 ′ are shown.

FIG. 8 .
FIG.8.Backbone energy per sugar-phosphate unit as a function of r ss , the distance between adjacent C1 ′ atoms of the sugars.Solid line is fitting to Eq. (16).

FIG. 9 .
FIG.9.Interaction energy between the backbone and the base.Main plot shows the covalent bond energy between the sugar and the base, as a function of the distance between the sugar C1 ′ atom and the base N1 or N9 atom to which it is bonded; solid line is fitting to Eq. (17).Inset shows the energy as a function of the sugar-sugar-base angle θ 5 ′ defined in Fig.7(b), and lines are fits to Eq. (18) with parameters given in TableII.

FIG. 10 .
FIG.10.Coiling of a 250-bp poly-AT strand in the coarse-grained model.The system starts in a completely uncoiled state and evolves at temperature 300 K. Insets show the middle part of the strand at time 0 ns, 0.5 ns, and 2.0 ns.Backbone sites are in yellow, A's in red, and T's in green.

FIG. 11 .
FIG.11.Salt dependence of the persistence length at 300 K.Each point represents an average over 6 independent simulations, with the error bar showing standard deviation of l p estimated from the individual runs.Lines show Eq.(24) with l 0 = 47 nm, 42 nm, and 39 nm, respectively, for poly-AT, poly-CG, and phage sequences.Inset shows the decay of the orientational correlation for a set of simulations at 0.15 M salt, where symbols are data and lines are exponential fits, Eq. (23).
FIG.12.Average of the rise per base-pair and the twist angle when a 100-bp phage-segment DNA is stretched.Inset shows snapshots before and after the sudden extension, for pulling with 640 pN on the 5 ′ end.

FIG. 13 .
FIG. 13.A 50-bp poly-AT DNA in partially molten states: (a) unzipping from one end, and (b) formation of a bubble.Insets show snapshots of the molecule; the color schemes are the same as in Fig. 10.

TABLE I .
Fitting parameters for the radial dependence of the hydrogen bond interaction in Eq. (3) and the dihedral angle dependence in Eq. (4).

TABLE II .
Fitting parameters for the flip-angle interaction in Eqs. (

TABLE III .
Fitting parameters of the stacking interaction in Eq.(11)for the ten Watson-Crick nearest-neighbors.In our notation, GC-AT indicates the stacking of GC base-pair and AT base-pair with GA and TC following the 3 ′ -5 ′ direction..6.Twisting energy between two neighboring base-pairs.Symbols are BSSE-corrected results from DFT calculations using vdW density functionals, and lines are fitting to Eq. (12) with parameters given in TableIV.For clarity, curves are shifted upward by increments of 0.3 eV; the curve for AT-AT stacking is not shifted.Notation is same as in Fig.5.Note that, although there are 10 unique stacking combinations, only 6 are necessary in the evaluation of E tw .For example, E tw (θ tw ) of AT-GC is given by E tw (360 • − θ tw ) of GC-AT. FIG

TABLE IV .
Fitting parameters of the twisting interaction in Eq.(12).All parameters in units of 0.01 eV.Notation is same as in TableIII.

TABLE V .
Average structural properties of a 250 base-pair poly-AT strand at 300 K and salt concentration 0.1 M. The reference experimental values are derived from the positions of the C1 ′ atoms of the sugars, the N1 atoms of the pyrimidines, and the N9 atoms of the purines in the crystal structure of Ref.58.