Diffusion Monte Carlo Study of Para-Diiodobenzene Polymorphism Revisited

We revisit our investigation of the diffusion Monte Carlo (DMC) simulation of p-DIB molecular crystal polymorphism. [J. Phys. Chem. Lett. 2010, 1, 1789-1794] We perform, for the first time, a rigorous study of finite-size effects and choice of nodal surface on the prediction of polymorph stability in molecular crystals using fixed-node DMC. Our calculations are the largest which are currently feasible using the resources of the K computer and provide insights into the formidable challenge of predicting such properties from first principles. In particular, we show that finite-size effects can influence the trial nodal surface of a small (1 × 1 × 1) simulation cell considerably. We therefore repeated our DMC simulations with a 1 × 3 × 3 simulation cell, which is the largest such calculation to date. We used a DFT nodal surface generated with the PBE functional, and we accumulated statistical samples over ∼6.4 × 10^5 core-hours for each polymorph. Our final results predict a polymorph stability consistent with experiment, but indicate that the results in our previous paper were somewhat fortuitous. We analyze the finite-size errors using model periodic Coulomb (MPC) interactions and kinetic energy corrections, according to the CCMH scheme of Chiesa, Ceperley, Martin, and Holzmann. We investigate the dependence of the finite-size errors on different aspect ratios of the simulation cell (k-mesh convergence) in order to understand how to choose an appropriate ratio for the DMC calculations. Even in the most expensive simulations currently possible, we show that the finite-size errors in the DMC total energies are far larger than the energy difference between the two polymorphs, although error cancellation means that the polymorph prediction is accurate. Finally, we found that the T-move scheme is essential for these massive DMC simulations in order to circumvent population explosions and large time-step biases.


Introduction
The prediction of molecular crystal polymorphism 1,2 is one of the most challenging problems for current ab initio electronic structure calculations, in both a theoretical and a computational sense. [23][24][25][26][27][28][29][30][31][32] Polymorphism is governed by very subtle interactions, such as weak noncovalent bonds. In order to address the problem satisfactorily, theoretical methods must possess sufficient accuracy to reproduce such interactions. In addition, molecular crystals generally have larger and more highly anisotropic unit cells, with many more atoms, than typical metals or semiconductors. The smallest approximately isotropic simulation cell of a molecular crystal is then much larger than that of a simple uniform crystal, making the theoretical and computational treatment of such periodic systems difficult. Methods are therefore needed which strike an appropriate balance between accuracy and computational cost if reliable predictions are to be made.
From the viewpoint of computability, density functional theory (DFT) [33][34][35] approaches are candidates for tackling the polymorphism problem. It is well known, however, that DFT with standard functionals frequently fails to describe noncovalent interactions accurately, especially dispersion. 35 Though many dispersion-corrected DFT methods 4,[36][37][38][39][40][41][42][43][44][45] have recently been developed and successfully applied to typical noncovalent systems, 46 it has been shown that their predictive power depends strongly on the target system. That is, a DFT functional which works well for one noncovalent system does not necessarily give good results for another. 47 On the other hand, from the viewpoint of accuracy, post-Hartree-Fock (post-HF) methods such as MP2 (second-order Møller-Plesset perturbation theory) and CCSD(T) (coupled cluster with single and double excitations and noniterative triples) work well for noncovalent molecular systems. Though post-HF methods have recently been extended to periodic systems, their computational costs are still too high to treat molecular crystals with larger unit cells, and the additional approximations required may weaken their advantage. Very recently, fragment-based schemes have been developed to treat periodic systems in combination with post-HF methods such as MP2 and CCSD(T). 3,5,13,14,20 Since their applications to molecular crystals have so far been limited to typical systems such as water and benzene crystals, more benchmarks will be necessary to assess their performance.
It has been widely recognized that quantum Monte Carlo (QMC) methods [48][49][50] can reproduce various types of molecular interaction to high accuracy. 47,[51][52][53][54] Modern massively parallel computers have expanded the applicability of QMC methods not only to larger molecular systems but also to periodic systems, because of their high parallel efficiency. 55,56 One of the most practical QMC methods, fixed-node diffusion Monte Carlo (FN-DMC), has been applied to noncovalent molecular systems and demonstrated to have an accuracy comparable to CCSD(T). 47,[51][52][53][54] Since the accuracy of FN-DMC depends critically on the choice of trial nodal surface, care is needed when generating this surface. If one uses DFT, experience shows that the nodal structure generally depends on the functional employed, and many choices are available in the literature. Although a number of QMC studies have reported that this dependence is not strong, some functionals will be better than others, and it is nontrivial to decide a priori which is best. In spite of this issue, FN-DMC is expected to be applicable to periodic systems with high accuracy in practice. Compared to FN-DMC, the "gold-standard" CCSD(T) method has heavier computational costs and is typically unable to treat such systems without additional approximations.
When applying QMC methods to periodic systems, one should ideally vary the simulation cell size and extrapolate the results to the infinite-cell limit. Such a full extrapolation in QMC is much more difficult than in DFT, even using cutting-edge supercomputers. Tractable simulation cell sizes are therefore quite limited in practice, leading to potentially significant finite-size errors (FSEs) in QMC results. A number of correction schemes have been devised to reduce the effects of FSEs. These schemes are formally classified into one- and two-body types. 57 The one-body schemes for metallic systems differ from those for insulating systems owing to the existence or nonexistence of a Fermi surface. The twist-averaging technique of Lin, Zong, and Ceperley 58 significantly reduces the FSEs for metals. The k-point shift (from Γ to L), similar to the "special k-point method" in DFT, frequently works well for insulators. 59,60 The MPC (model periodic Coulomb) interaction 61,62 and the CCMH (Chiesa, Ceperley, Martin, and Holzmann) scheme 63 are known to be effective two-body schemes.
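As a concrete illustration of the one-body schemes mentioned above, the following minimal Python sketch applies twist averaging to a toy tight-binding band. The dispersion, filling, and cell size are invented for illustration and are unrelated to p-DIB; a real QMC code averages the full many-body energy over twists instead.

```python
import numpy as np

# Toy illustration of twist averaging (one-body FSE reduction for metals).
# A finite n x n x n cell samples only k-points commensurate with the cell;
# averaging over random twists theta restores a dense Brillouin-zone sampling.
rng = np.random.default_rng(0)

def band(k):
    """Toy cosine band on a simple cubic lattice (arbitrary units)."""
    return -2.0 * np.cos(k).sum(axis=-1)

def energy_per_state(n, theta):
    """Mean occupied-band energy for an n^3 cell with twist theta (half filling)."""
    m = np.arange(n)
    grid = np.stack(np.meshgrid(m, m, m, indexing="ij"), axis=-1).reshape(-1, 3)
    e = np.sort(band(2.0 * np.pi * (grid + theta) / n))
    return e[: e.size // 2].mean()

n = 4
e_gamma = energy_per_state(n, np.zeros(3))                   # Gamma point only
twists = rng.uniform(-0.5, 0.5, size=(64, 3))
e_twist = np.mean([energy_per_state(n, t) for t in twists])  # twist averaged
print(f"Gamma-only: {e_gamma:.4f}   twist-averaged: {e_twist:.4f}")
```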
In addition, Kwee, Zhang, and Krakauer devised an a posteriori finite-size correction to the exchange-correlation (XC) potentials within the DFT framework (the KZK scheme). 64 The KZK scheme has recently been extended to magnetic systems. 65 A number of large-scale periodic QMC simulations have been performed. [66][67][68][69][70] They show that the above-mentioned schemes work well for isotropic (mostly cubic) systems having small unit cells with only one or two elements. QMC applications to molecular crystals have traditionally been restricted to relatively simple systems, e.g., the phase diagrams of ice 71 and solid molecular hydrogen. 72 It has generally been unfeasible to simulate more strongly anisotropic and complicated molecular crystals because of the limits imposed by available computational resources (mostly memory size). In our previous study, 73,74 we investigated for the first time using FN-DMC the polymorphism of the para-diiodobenzene (p-DIB) organic molecular crystal, a strongly anisotropic system. Standard DFT methods contradict experiment in that they predict the β phase to be more stable than the α phase at zero temperature. Our FN-DMC results were consistent with experiment, but they were obtained only with an LDA nodal surface and a small 1 × 1 × 1 simulation cell. Since these were the largest calculations we could perform at that time with the available supercomputer resources, we adopted the empirical KZK scheme 64 to estimate the FSEs. Very recently, more sophisticated DFT simulations were performed based on DFT-∆ 12 and DFT-D, 10 both of which agreed with our FN-DMC results. This does not imply, however, that the KZK scheme adopted in our previous study appropriately describes the FSEs in our FN-DMC simulations, because there is no a priori reason that the FSEs in isotropic systems should be similar to those in anisotropic systems. In this sense, our previous study had some limitations, and it is not clear how large the FSEs were, or whether the above-mentioned schemes can effectively correct them for anisotropic molecular crystals. Thus, the following two points should be carefully investigated in FN-DMC simulations of the p-DIB molecular crystal polymorphism: (1) the FSEs for anisotropic molecular crystals, i.e., the choice of simulation cell size and aspect ratio, and the performance of finite-size correction schemes; (2) the dependence on the trial nodal surface for an accurate description of noncovalent interactions.
We shall report here that our previous FN-DMC result (DMC/LDA/1×1×1) appears to have been fortuitously accurate. In the present study, we show that FN-DMC simulations of the polymorph stabilities with a 1×1×1 simulation cell depend quite strongly on the choice of nodal surface, e.g., LDA, GGA-PBE, or B3LYP. In particular, we found early on that the DMC/PBE/1×1×1 result contradicts experiment by predicting the β phase to be the more stable, in contrast to our previous paper using an LDA surface. This encouraged us to investigate carefully the results with different nodal surfaces and compare them. We therefore also investigated how the sizes and aspect ratios of simulation cells affect the FSEs within DFT. We confirmed that, for any two sufficiently large cells, extrapolation of the two energies converges to the same final energy regardless of the cells' aspect ratios.
However, when using smaller cell sizes with three different aspect ratios, 1×1, 1×2, and 1×3, we observed significantly different final energies after extrapolation. This implies that the choice of aspect ratio is especially significant for QMC, because QMC is applicable only to smaller simulation cells. In this work, we have performed the largest DMC/PBE simulations possible given our available computational resources on the K computer, 75 with an approximately isotropic 1×3×3 simulation cell. We found that the correct prediction is recovered when we increase the simulation cell from 1×1×1 to 1×3×3, implying significant finite-size effects in this system. Even with a 1×3×3 cell, the FSE schemes we used (MPC/CCMH) are found to give corrections far larger than the energy difference between the polymorphs. Hence further detailed investigation of finite-size errors is necessary for the definitive resolution of these questions in such anisotropic systems, which are often intriguing materials such as strongly correlated electron systems and molecular crystals.
The paper is organized as follows: Section "Computational Methods" specifies our target systems and methodologies. Section "Results" presents our numerical results for the relative stability energies obtained from QMC. Section "Discussions" gives a detailed analysis of several finite-size corrections as well as computational aspects of our DMC simulations.

Target Systems
We treated two polymorphs of the para-diiodobenzene (p-DIB) molecular crystal, known as the α and β phases. The transition from α to β occurs at 327 K, indicating that the α phase is slightly more stable than the β phase at low temperature. 76,77 To the best of our knowledge, the relative energy between the two phases at zero temperature is not available from any experiment. The lattices of the α and β polymorphs belong to the Pbca (D_{2h}^{15}) and Pccn (D_{2h}^{10}) space groups, respectively, but both have four p-DIB molecules in an orthorhombic unit cell (see Fig. 1). All the present calculations were performed using experimental geometries (lattice constants and unit cell atomic positions) published in the Cambridge Structural Database 78 (ZZZPRO03 and ZZZPRO04 for the α and β phases, respectively). Note that the unit cell of each phase has an aspect ratio of almost 3:1, indicating strong anisotropy.

Methods
For general descriptions of the QMC methods adopted in the present study, several recent review articles [48][49][50] are available. To investigate the FSEs in the DMC calculations of p-DIB, we consider the 1×1×1 and 1×3×3 simulation cells. Although the k-mesh convergence in DFT does not completely coincide with the one-body FSE in DMC, it is helpful for understanding how the one-body FSE decreases, depending on the aspect ratio, as the cell size increases. To see this, we considered 1×1×1, 1×2×2, 1×3×3, 2×4×4, and 2×6×6 Monkhorst-Pack k-point meshes. 79 We performed LDA (Perdew-Zunger 81; PZ81 80 ) and GGA (Perdew-Burke-Ernzerhof; PBE 81 ) calculations. In addition, we attempted to obtain results with the B3LYP functional, 82-84 but unfortunately they did not converge, except in the 1×1×1 case. All the DFT calculations were performed using the Quantum Espresso code. 85 The crystalline orbitals were expanded in a plane-wave basis set with a cutoff energy of 40 hartree, such that the energy difference between the two polymorphs converged to 0.01 kcal/mol/cell. The ionic cores of the carbon, iodine, and hydrogen atoms were replaced by Trail-Needs pseudopotentials (TN-PPs), 86,87 available in the CASINO pseudopotential library. 88 Note that the TN-PPs were developed for QMC, but can also be used for plane-wave based DFT calculations.
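To make the aspect-ratio argument concrete, the short sketch below shows how the reciprocal-space k-point spacing becomes nearly isotropic for a 1×3×3 mesh over a cell of roughly 3:1:1 shape. The lattice constants are placeholders, not the experimental p-DIB values.

```python
import numpy as np

# k-point spacing along each axis for the meshes considered in the text.
# The cell lengths are placeholders chosen to mimic a ~3:1:1 aspect ratio.
a = np.array([18.0, 6.0, 6.0])          # hypothetical lattice constants (bohr)
b = 2.0 * np.pi / a                     # reciprocal lattice constants

for mesh in [(1, 1, 1), (1, 2, 2), (1, 3, 3), (2, 4, 4), (2, 6, 6)]:
    dk = b / np.array(mesh)             # spacing of the Monkhorst-Pack grid
    print(mesh, "spacing:", np.round(dk, 3),
          "max/min:", round(dk.max() / dk.min(), 2))
```

For these placeholder lengths the 1×3×3 mesh gives a max/min spacing ratio of 1.0, i.e., an essentially isotropic sampling of reciprocal space, whereas the 1×1×1 mesh is three times denser along one axis than the others.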
For the DMC simulations, we adopted Slater-Jastrow trial wavefunctions to fix the nodes. 48 We took the form of Jastrow function 95 implemented in CASINO, 96 consisting of one-body and two-body terms (imposing the cusp conditions 97 ). The former has 24 adjustable parameters with a cutoff length fixed at 80% of the Wigner-Seitz radius of the simulation cell, while the latter has 12 parameters with a cutoff equal to the Wigner-Seitz radius. All the parameters appearing linearly in the Jastrow function were optimized. For handling tiny energy differences such as those between molecular crystal polymorphs, it would generally be more appropriate to minimize a mix of energy and variance. 98 In the present study, however, we limited ourselves to optimizing the linear parameters, with the nonlinear ones (e.g., cutoff lengths) fixed, using the variance-minimization technique. 99 This gives a unique minimum at much lower cost.
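As a schematic of this optimization step, the toy below minimizes the variance of a model "local energy" over a frozen sample, mimicking variance minimization of the linear Jastrow parameters. The model function and sample are invented; a real QMC code evaluates HΨ/Ψ on VMC-sampled configurations instead.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
sample = rng.normal(size=(2000, 3))      # frozen configuration sample (toy)

def local_energy(p, x):
    """Stand-in local energy; smooth in the 'linear Jastrow' parameters p."""
    feats = np.stack([np.exp(-x[:, 0] ** 2), np.cos(x[:, 1]), x[:, 2] ** 2], axis=1)
    lin = feats @ p
    return 1.0 + lin + 0.1 * lin ** 2

def variance(p):
    return local_energy(p, sample).var()

res = minimize(variance, x0=np.zeros(3), method="BFGS")
print("optimal parameters:", np.round(res.x, 3), " min variance:", round(res.fun, 5))
```

For parameters entering the Jastrow exponent linearly, this objective has a unique minimum over a fixed sample, which is what makes the scheme cheap compared to a full nonlinear optimization.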
We employed our 32-core PC clusters to run the DMC simulations for the 1×1×1 cell. Since the 1×3×3 simulations require approximately 729 (= 9^3) times greater computational cost, we used the K computer 75 with 1,024-node (2,048-core) parallelization. The simulation for each polymorph took about 5 × 10^5 core-hours. We evaluated the electron-electron interaction using both the Ewald 100,101 and MPC 61,62 schemes, but only the Ewald energy was used in the DMC propagation, because the MPC may artificially distort the exchange-correlation hole in some cases. 57 We set the target DMC populations (N_pop) to 1,280 and 20,480 for the 1×1×1 and 1×3×3 calculations, respectively. The T-move scheme 93 was used in evaluating the nonlocal pseudopotentials so that the bias associated with the locality approximation is reduced and the population control 102 is more stable. After equilibrating the random walkers over the first 1,500 steps, we accumulated statistics over the following 1 × 10^5 and 7 × 10^3 steps (N_step) for the 1×1×1 and 1×3×3 cells, respectively.

Results

Figure 2 shows the DMC estimates of the energy difference between the two phases.
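Error bars on DMC energies such as those in Fig. 2 are commonly obtained by reblocking the serially correlated samples (CASINO provides a utility for this purpose). The sketch below uses a synthetic correlated series in place of real local energies; the naive error estimate grows with block length until the correlation-corrected value plateaus.

```python
import numpy as np

# Reblocking on a synthetic AR(1) series standing in for DMC local energies.
rng = np.random.default_rng(2)
n = 2 ** 14
x = np.empty(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = 0.9 * x[i - 1] + rng.normal()   # serially correlated samples

data, block = x.copy(), 1
while data.size >= 32:
    err = data.std(ddof=1) / np.sqrt(data.size)
    print(f"block length {block:5d}   error estimate {err:.4f}")
    data = 0.5 * (data[0::2] + data[1::2]) # merge pairs: one reblocking step
    block *= 2
```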

Discussions

Finite-size errors
We analyze the FSEs found in Fig. 2 from two viewpoints: (1) the k-mesh convergence within DFT, and (2) a comparison of the Ewald/MPC and CCMH schemes within QMC/1×3×3. The former and the latter give useful insights into the one-body and two-body FSEs, respectively, although neither describes the effects completely. Though it is known that the MPC cannot completely capture the two-body FSE, 57 it may be used as an "alert indicator": when the difference δE = E_MPC − E_Ewald is remarkably large, the simulation cell is not large enough. The CCMH scheme provides a finite-size correction to the kinetic energy due to the long-ranged correlations described by the two-body Jastrow factor, evaluated from its asymptotic behavior as k → 0. We evaluated the correction using the implementation in CASINO (version 2.11), 48 given as ∆T_2 in Table 2. The corrections for the α and β phases are ca. 5 kcal/mol/cell, which amounts to half of the MPC corrections (ca. 10 kcal/mol/cell), indicating that the MPC alone is insufficient to capture the whole two-body FSE, as mentioned above. The reliability of the correction ∆T_2 can be checked by comparing the estimates obtained from two different asymptotic models. 57 The difference between the two estimates was found to be 3% for α and 11% for β, which seems reasonable. An additional 'p term' entering the two-body Jastrow factor is found to be indispensable for achieving such accuracy.
The p term 48 corrects the behavior of the Jastrow factor at the simulation cell boundaries via a plane-wave expansion and is important for the quantitative reliability of the correction. Note that omitting the p term entirely gives rise to differences of 75% and 65% between the two asymptotic estimates for α and β, respectively. The one-body correction ∆E_1 gives the smallest contribution, which is attributed to the fact that the system is an insulator. We note again that ∆E_1 does not describe the whole one-body FSE in QMC, as mentioned above. Compared with the relative stability of α with respect to β, 2.5 kcal/mol/cell (4 mHa/cell), the FSE corrections themselves (∼10 kcal/mol/cell at most) are far larger, and hence a careful consideration of FSEs is clearly essential. However, it is reasonable to expect that the corrections would not significantly change the final conclusion that DMC/1×3×3 predicts the correct stability of α relative to β, because of error cancellation between the two phases (∆(α − β) in Table 2). We cannot attach any quantitative significance to the value of ∆(α − β) in ∆T_2, −0.5 kcal/mol/cell, because of the ambiguity associated with the choice of asymptotic model in its evaluation (3% and 11% for the α and β phases, respectively). For highly anisotropic systems we cannot generally rely on any "handy" FSE correction scheme such as the MPC when using a small simulation cell. Instead, we need to attempt a simple extrapolation, enlarging the simulation cells at a fixed, approximately cubic aspect ratio. Further QMC investigations using larger simulation cells with various XC functionals would then be required to draw firmer predictions, but they are currently too expensive.
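A minimal sketch of the fixed-aspect-ratio extrapolation suggested here follows; the energies are invented placeholders, and real input would be DMC totals per molecule for each cell size.

```python
import numpy as np

# Leading-order 1/N extrapolation over cells of fixed (near-cubic) shape.
n_mol = np.array([4.0, 36.0, 288.0])     # molecules in 1x1x1, 1x3x3, 2x6x6 cells
e_mol = np.array([-105.20, -105.05, -105.01])  # hypothetical energies (kcal/mol)

slope, e_inf = np.polyfit(1.0 / n_mol, e_mol, 1)   # fit E(N) = E_inf + c/N
print("extrapolated bulk energy:", round(e_inf, 3), "kcal/mol per molecule")
```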
We discuss these computational issues in the following subsection.

Computational requirements
In the present study we could not perform 2×6×6 or larger simulations simply because of memory limitations on the K computer, which has 16 GB/node. The 1×3×3 cell requires 9.8 GB/node for storing the wavefunction data (in the blip basis 94 ), while a 2×6×6 cell would require 25 GB/node. Using plane-wave basis functions instead would reduce the storage required for the 2×6×6 wavefunction data from 25 GB/node to 12 GB/node, but the computational time would then become a few hundred times longer. The most recent cutting-edge facilities such as Tianhe-2 (88 GB/node) or Titan (34 GB/node) might be able to accommodate such calculations. But even with such resources, the larger 2×6×6 cell seems unfeasible in practice, because the cost of the equilibration steps in DMC cannot be reduced by current parallel implementations. We used a DMC implementation, CASINO, which achieves greater than 99% parallel efficiency even on 6 × 10^5 cores of the K computer. 105 However, the cost decreases linearly with increasing core count only for the statistical accumulation, not for the preceding equilibration of the sampling distributions. The equilibration must be carried out on each parallel core for some required number of steps, which is fixed regardless of the number of cores. When a very large number of cores is used for large-scale DMC simulations, this implies that the computational cost in core-hours is dominated by the equilibration rather than by the statistical accumulation. Table 3 lists the number of cores (N_core), the number of MC steps (N_step), the computational time (T, in hours), the computational cost (C, in core-hours), and the percentage of the equilibration cost relative to the total (W) for our DMC/1×3×3 simulation on the K computer with 2,048 cores. We also give an estimate for an idealized 512,000-core case, representing about 70% of the available cores on the K computer, 75 assuming 100% parallel efficiency and the same total number of sampling points in the statistical accumulation (or equivalently, the same C). Because the equilibration cost per core is fixed, the total equilibration cost increases linearly with the number of cores, exceeding 10 million core-hours at 512,000 cores. This results in an increase in computational cost (R) of about 48 times in exchange for a speedup (S) of only about 5 times. Since the QMC computation scales as N^3 with respect to the number of electrons N in the system, the 2×6×6 simulation takes 8 times longer computational time than the 1×3×3 one. This implies that the 2,048-core parallelization would require 6.2 × 10^6 core-hours over 128 days, while a theoretical 512,000-core parallelization would require 3.0 × 10^8 core-hours over 24 days. Both are too expensive to be done, and the bottleneck lies in the equilibration computation.

Table 3: The number of cores (N_core), the number of MC steps (N_step), the computational time (T, in hours), the computational cost (C, in core-hours), and the percentage of the equilibration cost relative to the total (W), listed for DMC/1×3×3 simulations with 2,048 and 512,000 cores. The values for the 512,000-core parallelization are estimated from those for the 2,048-core one. The rate of increase in computational cost (R) and the speedup (S) from 2,048 to 512,000 cores are also given.

A simple way to speed up the equilibration procedure is to reduce the number of walkers per core (N_w/c), i.e., the computational load on each core.
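The equilibration bottleneck can be captured by a two-term cost model; the sketch below uses assumed values for the per-run equilibration wall time and the total accumulation cost, chosen only to be of the same order as the numbers quoted above, not measured ones.

```python
# Cost model: accumulation parallelizes perfectly, equilibration does not.
T_EQ = 30.0        # equilibration wall time per run, hours (assumed)
C_ACC = 2.5e5      # total accumulation cost, core-hours (assumed)

for n_core in (2_048, 512_000):
    t_total = T_EQ + C_ACC / n_core      # wall time per run
    c_total = n_core * t_total           # total core-hours
    print(f"{n_core:>7} cores: T = {t_total:7.1f} h   C = {c_total:.2e} core-hours")

# With these assumed inputs, going from 2,048 to 512,000 cores speeds the run
# up only ~5x while the total cost grows ~50x, because the equilibration work
# is replicated on every core.
```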
Nevertheless, this is not a good strategy. In the current implementation, with annihilation/creation of walkers, too small an N_w/c may lead to walkers 'dying out' on some cores, which suspends the run. We can circumvent this difficulty by using the 'weighted-walker scheme', 48,106 where annihilation/creation is replaced by weight accumulation on each walker. A more pressing reason why we cannot simply reduce N_w/c, however, arises from load balancing and communication costs. 56 Annihilation/creation occurs independently on each core, producing unbalanced loads as a calculation evolves. To recover the balance, walkers must be redistributed between cores, which incurs communication costs that grow with the number of cores.
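A toy version of the weighted-walker idea is sketched below: branching by annihilation/creation is replaced by a multiplicative weight on each walker, so no core can run out of walkers. The "local energy" and the drift-free dynamics are placeholders, not a real DMC kernel.

```python
import numpy as np

rng = np.random.default_rng(3)
dtau, e_ref = 0.01, 1.5
walkers = rng.normal(size=(100, 3))       # toy configurations
weights = np.ones(len(walkers))

def e_loc(x):
    return 0.5 * (x ** 2).sum(axis=1)     # placeholder local energy

for _ in range(50):
    walkers += np.sqrt(dtau) * rng.normal(size=walkers.shape)  # diffusion move
    weights *= np.exp(-dtau * (e_loc(walkers) - e_ref))        # weight update
                                                               # replaces birth/death
print("weighted mixed estimator:", np.average(e_loc(walkers), weights=weights))
```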

T -move scheme and numerical stability
To perform the DMC/1×3×3 simulations efficiently, we selected the computational conditions very carefully. Our previous 1×1×1 simulations 73,74 forced us to accumulate an enormous number of steps (N_step = 1.2 × 10^7) to achieve the required statistical accuracy, even with a large target population (N_pop = 16,384). The computation for each phase took about 6 months using 128 cores at that time. In the present study we therefore chose a larger time step δτ_DMC for more efficient sampling, which proved to be possible only with the T-move scheme. 93 This scheme was devised to suppress divergences of the local energy when sampling occurs near the nodal surface, and is effective in controlling population explosions in DMC. 93 We adopted the CASINO implementation of the T-move scheme. 48 The scheme was found to be essential for completing such a large simulation with a reasonably small time-step bias. Table 4 compares the time-step bias between our previous work 73,74 and the present one. Without the T-move, population explosions actually occur, and N_step cannot exceed 30,000 (60,000), giving rise to the larger error bars. This is the reason for the choice of δτ_DMC = 0.001 and the sufficiently large population (N_pop = 16,384) in our previous work, 73,74 which avoided population explosions when using a QMC implementation without the T-move. 89 We also observe in the present study that simulations with δτ_DMC larger than 0.015 are impossible, owing to the instability of DMC/LDA/1×1×1 with respect to population explosions.
The strong time-step bias seen in Table 4 is reduced considerably when we apply the T-move scheme. Figure 5 shows the time-step dependence of ∆E_DMC obtained with the T-move scheme.
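Removing the residual bias would then amount to a simple extrapolation of ∆E_DMC to δτ → 0; a sketch with invented data points follows.

```python
import numpy as np

# Linear extrapolation of a time-step-biased energy difference to dtau -> 0.
dtau = np.array([0.005, 0.010, 0.015])   # time steps (a.u.), illustrative
dE = np.array([2.6, 2.4, 2.1])           # hypothetical dE_DMC (kcal/mol/cell)

slope, dE0 = np.polyfit(dtau, dE, 1)
print("dE extrapolated to zero time step:", round(dE0, 2), "kcal/mol/cell")
```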

Concluding Remarks
We have performed, for the first time, a rigorous study of finite-size errors (FSEs) and the choice of nodal surface in the prediction of polymorph stability in molecular crystals using fixed-node DMC. Our calculations are the largest currently feasible using the resources of the K computer. Our results show that our previous predictions in Refs. 73,74, using a small (1×1×1) simulation cell, were fortuitously accurate. Our new DMC simulations with a 1×3×3 simulation cell, using the PBE functional to generate the nodal surface, yield the same prediction for the polymorph stability and agree with experiment. However, our observations of the finite-size effects and of the choice of nodal surface provide insights into the formidable challenge of predicting such properties from first principles.
In particular, we applied the MPC and kinetic-energy finite-size correction schemes to the DMC/PBE/1×3×3 calculations, for which the simulation cell was approximately cubic and we accumulated statistical samples over ∼6.4 × 10^5 core-hours for each polymorph.
To the best of our knowledge, this is the largest such calculation to date. However, the MPC and kinetic-energy corrections to the energy difference between the polymorphs were found to be larger than the original difference itself. The two corrections also contribute significantly to the total energy of each polymorph, indicating that even larger simulation cells, with extrapolation to the infinite-cell limit, are needed. On the other hand, we have shown that a calculation with the next largest simulation cell, 2×6×6, is unfeasible even with the hundreds of thousands of cores and large memory capacities provided by massively parallel conventional supercomputers, because the current equilibration implementations cannot be accelerated by MPI parallelization. We therefore conclude that technical advances to accelerate the equilibration step are needed if a more complete understanding of FSEs in DMC simulations of systems with large anisotropic unit cells is to be achieved.
We also found a considerable effect of finite-size errors on the trial nodal surface in the DMC/1×1×1 calculations. This may be attributed to the fact that the MPC corrections are evaluated using DFT charge densities for a unit cell that is strongly anisotropic, with an aspect ratio of 3:1. At the DFT level, we found a significant dependence of the converged values on the choice of aspect ratio when using small cells, though extrapolations with larger cells converge to the same energy. This highlights another issue which must be carefully managed in future studies. To reach a decisive conclusion about the dependence of the trial nodes on the FSEs, we must at least carry out a DMC/LDA/1×3×3 calculation in addition. Unfortunately, our CPU allocation on the K computer is currently exhausted by our other simulations, so this must be left to a future study.
Finally, this work illustrates the technical importance of the T-move scheme in such large-scale DMC simulations. We note that our QMC calculations would have been impossible without this technique, owing to unstable population behavior and a large time-step bias.