Publication: Path to Protein Dynamics: Advancing Crystallographic Data Analysis via Deep Generative Models
Abstract
Artificial neural networks and machine learning methods have transformed numerous scientific research paradigms. A milestone in this transformation was the 2021 release of AlphaFold2, DeepMind's machine learning model for protein structure prediction. By delivering an accurate and accessible computational solution for predicting static protein structures from sequence and coevolutionary data, AlphaFold2 marked a new era for structural biology. This breakthrough, beyond its technical wizardry, owes much of its success to the availability of extensive protein structure databases built over decades of community effort, particularly the Protein Data Bank. This underscores a broader lesson: foundation models thrive on large, well-curated data. In structural biology, the next frontier is to understand protein dynamics, how proteins respond to perturbations such as binding and mutations, and how they interact with other biomolecules such as nucleic acids. While deep learning approaches hold promise, they are limited by the scarcity of large, high-quality datasets that capture this complexity. At the same time, advancements in hardware and techniques have greatly increased the throughput of raw data reporting on protein dynamics and conformational heterogeneity across modalities such as X-ray crystallography and cryo-EM. However, the pathologies and noise inherent in biophysical data, as well as the rugged nature of protein (free) energy landscapes, continue to make model building a major bottleneck, limiting our ability to interpret these datasets with confidence and efficiency. Significant human effort and intuition are still required. This dissertation presents new methods to expedite automated model construction from experimental data under Bayesian principles. First, I introduce SFCalculator, a GPU-accelerated, differentiable function for calculating atomic model agreement with crystallographic data, bridging crystallographic data analysis and machine learning techniques.
Second, I describe how this interface can aid in interpreting crystallographic data by incorporating priors from molecular mechanics or pretrained predictive models, while using experimental data to guide generative models for improved accuracy. I then introduce VAE-Assisted Ligand Discovery (VALDO), a method aimed at boosting the signal-to-noise ratio in crystallographic fragment screening for drug discovery. Collectively, these developments demonstrate how generative models can be applied to advance crystallographic data analysis in a principled manner. Finally, I offer insights into the importance of, and potential pathways for, inferring protein dynamics by integrating experimental data from diverse modalities.