Publication: Accelerating Discovery in Virtual Chemical Libraries
Date
2023-05-15
The Harvard community has made this article openly available. Please share how this access benefits you.
Citation
Graff, David Elliot. 2023. Accelerating Discovery in Virtual Chemical Libraries. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
High-throughput virtual screening (HTVS) is an essential workflow in molecular discovery programs that aids researchers in prioritizing compounds for experimental testing within large libraries of molecules. The application of a given computational technique using HTVS can rapidly estimate the target properties of each molecule in a library at a fraction of the experimental cost. Previous studies have used HTVS with molecular docking to screen ultra-large virtual libraries (containing 10⁷–10⁹ molecules) to identify novel and potent small-molecule protein binders. However, this practice comes at a cost; brute-force screening of these large libraries with even the cheapest of simulations is expensive, and these costs will continue to increase with the size of the libraries. To alleviate these growing costs, new approaches must be developed to efficiently explore chemical spaces. This dissertation describes the application of model-guided optimization algorithms to improve sample efficiency in HTVS as well as software tools to implement these approaches.
Chapter 2 studies Bayesian optimization applied to docking-based HTVS. It explores the impact of broad optimization hyperparameters, such as surrogate model architecture, acquisition function, and batch size, on final optimization performance. Across a variety of virtual libraries and protein targets, Bayesian optimization results in significant improvement in sample efficiency compared to the random search baseline. In one ultra-large library of 99.5M compounds docked against AmpC β-lactamase, Bayesian optimization identifies 94.8% of the top-50 000 ligands after sampling only 2.4% of the library using a message-passing neural network surrogate model and upper confidence bound acquisition strategy.
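The batched optimization loop described in Chapter 2 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation or from MolPAL: the surrogate is a bootstrap ensemble of ridge regressors (standing in for the message-passing neural network), the library is a synthetic feature matrix, the objective is a synthetic score to be maximized, and the names `fit_ensemble`, `ucb`, and `optimize` are hypothetical. Only the overall pattern (fit a surrogate, rank the unevaluated pool by upper confidence bound, evaluate the top batch, repeat) follows the abstract.

```python
import numpy as np


def fit_ensemble(X, y, n_models=8, ridge=1e-2, rng=None):
    """Fit a bootstrap ensemble of ridge regressors (hypothetical surrogate).

    Returns an array of shape (n_models, n_features) holding one weight
    vector per ensemble member.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_features = X.shape[1]
    weights = []
    for _ in range(n_models):
        # Resample the training set with replacement for each member.
        idx = rng.integers(0, len(X), size=len(X))
        Xb, yb = X[idx], y[idx]
        w = np.linalg.solve(Xb.T @ Xb + ridge * np.eye(n_features), Xb.T @ yb)
        weights.append(w)
    return np.stack(weights)


def ucb(weights, X, beta=2.0):
    """Upper confidence bound: ensemble mean plus beta times ensemble std."""
    preds = X @ weights.T  # shape (n_candidates, n_models)
    return preds.mean(axis=1) + beta * preds.std(axis=1)


def optimize(X, objective, init_size, batch_size, n_iters, beta=2.0, seed=0):
    """Batched model-guided search over a fixed candidate pool.

    `objective(i)` returns the score of candidate i (higher is better).
    Returns a dict mapping evaluated candidate index to its score.
    """
    rng = np.random.default_rng(seed)
    initial = rng.choice(len(X), size=init_size, replace=False)
    scores = {int(i): objective(int(i)) for i in initial}
    for _ in range(n_iters):
        idx = np.array(list(scores))
        y = np.array([scores[int(i)] for i in idx])
        weights = fit_ensemble(X[idx], y, rng=rng)
        utility = ucb(weights, X, beta)
        utility[idx] = -np.inf  # never re-acquire evaluated candidates
        batch = np.argsort(utility)[-batch_size:]
        for i in batch:
            scores[int(i)] = objective(int(i))
    return scores


# Synthetic stand-in for a virtual library and its docking-like objective.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
true_scores = X @ rng.normal(size=16) + 0.1 * rng.normal(size=2000)

scores = optimize(X, lambda i: float(true_scores[i]),
                  init_size=50, batch_size=50, n_iters=5)
top_k = set(np.argsort(true_scores)[-100:].tolist())
recovered = len(top_k & set(scores)) / len(top_k)
print(f"evaluated {len(scores) / len(X):.1%} of the library, "
      f"recovered {recovered:.1%} of the top-100")
```

For scale, sampling 2.4% of a 99.5M-compound library corresponds to roughly 2.4 million docking calculations, and 94.8% of the top 50,000 ligands is about 47,400 compounds.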
Chapter 3 describes pyscreener, a Python package to facilitate running molecular docking calculations with Python function calls. The package is a central component of the MolPAL software detailed in Chapter 2, allowing for the prospective application of Bayesian optimization to docking-based HTVS. More broadly, pyscreener allows users to dock molecules directly from their respective SMILES string and seamlessly scale these calculations to distributed hardware allocations with no changes in client code.
Chapter 4 presents design space pruning (DSP), an extension to model-guided optimization algorithms, which reduces surrogate model inference costs via iterative pruning of unpromising candidates from the design space at each iteration. Across a wide range of molecular docking tasks, DSP reduces inference costs by nearly 40% with no reduction in final performance compared to the baseline optimization. DSP is a promising extension to model-guided optimization in domains where objective costs do not overwhelm surrogate model overhead costs, such as in docking-based HTVS.
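The idea behind pruning can be sketched as a single filtering step applied to the candidate pool after each round of surrogate inference. The pruning rule below is an assumption chosen for illustration, not the criterion used in the dissertation: a candidate is kept only if its optimistic estimate (predicted mean plus `beta` times predicted standard deviation) could still reach a reference score `incumbent`. The function name `prune_pool` and all numeric values are hypothetical.

```python
import numpy as np


def prune_pool(pool, mean, std, incumbent, beta=2.0):
    """Drop candidates whose optimistic estimate falls below the incumbent.

    pool      : array of candidate indices still in the design space
    mean, std : surrogate predictions aligned with `pool`
    incumbent : reference score a candidate must plausibly reach to be kept
    beta      : how many standard deviations of optimism to allow
    """
    optimistic = mean + beta * std
    return pool[optimistic >= incumbent]


# Toy predictions for six candidates (illustrative values only).
pool = np.arange(6)
mean = np.array([0.1, 0.9, 0.4, 0.2, 0.8, 0.3])
std = np.array([0.05, 0.10, 0.30, 0.05, 0.05, 0.10])

kept = prune_pool(pool, mean, std, incumbent=0.7)
saved = 1 - len(kept) / len(pool)
print(kept, f"inference calls avoided next iteration: {saved:.0%}")
```

Under this assumed rule, candidate 2 survives despite a low predicted mean because its uncertainty is large, while confidently poor candidates are removed. The surrogate then only needs to run on the surviving pool in later iterations, which is where the inference savings would come from.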
Chapter 5 details ROGI-XD, a reformulation of the roughness index (ROGI) that enables quantitative comparison of quantitative structure-property relationship (QSPR) surface roughness across representations. For a variety of QSPR tasks, the ROGI-XD correlates strongly with cross-validated model error across representations (median r = 0.72–0.88). Comparison of ROGI-XD values for various pretrained chemical representations to those of fingerprints or descriptors reveals that pretrained representations are not smoother than fixed representations in most cases. This observation is consistent with empirical results showing that pretrained representations do not consistently outperform fingerprint or descriptor baselines. These results suggest that explicitly incorporating priors of smoothness with respect to molecular structure during model pretraining can aid in the downstream generation of smoother QSPR surfaces.
Altogether, the algorithms and software introduced herein provide theory and implementation to both accelerate and democratize the practice of HTVS in increasingly large virtual libraries.
Keywords
drug discovery, machine learning, molecular docking, optimization, structure-property relationships, virtual screening, Computational chemistry, Chemistry, Biophysics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service