Computational methods

Hit Identification

Method type (check all that apply)

Deep learning

High-throughput docking

Physics-based

Hybrid of the above

Description of your approach (min 200 and max 800 words)

Our hit identification technique involves three stages. The first aggregates a list of initial compounds. We use our in-house database of in-stock and on-demand compounds from various vendors (Mcule, Enamine, etc.) and aggregators (ZINC) (see citations). We also use a database of synthetically accessible compounds generated by computationally applying known synthetic reaction pathways (SAVI). Together, these sources yield over a billion compounds, each either in stock and available for purchase or reachable through a known reaction pathway from in-stock building blocks. We generally limit ourselves to in-stock compounds but will include simple reaction pathways if a compound scores extraordinarily well.
The second stage involves simulating the protein target and identifying a binding site. Using DeepDriveMD, we simulate available structures for microseconds and use an anharmonic conformational analysis-enabled autoencoder to sample the state space, producing a series of static conformations to dock against. We also use the transition information to elucidate a binding site. This produces an ensemble of protein structures.
We then run a state-of-the-art commercial docking protocol on HPC systems using our scalable workflow environment. After docking the initial seed set of in-stock, orderable compounds, we train a deep learning model to act as a surrogate that is 50,000x faster than docking. With this fast surrogate model, we screen the remaining billion compounds from a make-on-demand database such as Enamine REAL or SAVI. A short list is chosen from these two lists by sampling from clusters of high-scoring surrogate compounds and of high-quality poses of in-stock compounds. We then dock the compounds selected by the deep learning model to verify the model's predictions. This is performed across the ensemble of structures.
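The surrogate-then-verify pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline itself: a similarity-weighted nearest-neighbor predictor stands in for the deep learning surrogate, fingerprints are plain sets of bit indices, and `dock` is a hypothetical callable supplied by the caller that returns a docking score (lower is better).

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# Assumption: a fingerprint is the set of "on" bit indices of a hashed fingerprint.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def surrogate_score(query: Fingerprint,
                    seed: List[Tuple[Fingerprint, float]],
                    k: int = 3) -> float:
    """Predict a docking score from the k most similar docked seed compounds.

    Stand-in for the trained deep learning model: a similarity-weighted mean.
    """
    nearest = sorted(seed, key=lambda s: tanimoto(query, s[0]), reverse=True)[:k]
    weights = [tanimoto(query, fp) for fp, _ in nearest]
    total = sum(weights)
    if total == 0.0:
        # No overlap with any seed compound: fall back to the unweighted mean.
        return sum(score for _, score in nearest) / len(nearest)
    return sum(w * score for w, (_, score) in zip(weights, nearest)) / total


def screen_and_verify(library: Dict[str, Fingerprint],
                      seed: List[Tuple[Fingerprint, float]],
                      dock: Callable[[str], float],
                      top_n: int) -> List[Tuple[str, float, float]]:
    """Rank the whole library with the cheap surrogate, then dock only the top_n.

    Returns (compound_id, surrogate_score, docking_score) tuples.
    """
    predicted = {cid: surrogate_score(fp, seed) for cid, fp in library.items()}
    shortlist = sorted(predicted, key=predicted.get)[:top_n]
    return [(cid, predicted[cid], dock(cid)) for cid in shortlist]
```

Comparing the second and third element of each returned tuple gives a direct check of how well the surrogate agrees with docking on the compounds it ranked highest.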
Lastly, compounds from each cluster are resimulated and run through DeepDriveMD to determine whether any compound causes a significant change in protein dynamics or presents decoy-like features (for example, poor free-energy scores or the ligand leaving the binding site). This information is used to select compounds that induce interesting modifications to the protein state space, indicating that interaction is likely.
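One of the decoy-like features mentioned above, a ligand drifting out of the binding site, can be checked with a simple distance criterion. The sketch below is illustrative only: the function names, the 10-unit cutoff, and the rule of inspecting the final quarter of the trajectory are assumptions chosen for the example, not values taken from the workflow.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Unweighted geometric center of a set of atom coordinates."""
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
            sum(p[2] for p in points) / n)


def leaves_site(ligand_frames: Sequence[Sequence[Point]],
                site_center: Point,
                cutoff: float = 10.0,        # assumed threshold, same units as coordinates
                tail_fraction: float = 0.25  # assumed: inspect the last quarter of frames
                ) -> bool:
    """Return True if the ligand centroid is farther than `cutoff` from the
    site center in more than half of the final frames of the trajectory."""
    distances = [math.dist(centroid(frame), site_center) for frame in ligand_frames]
    tail_length = max(1, int(len(distances) * tail_fraction))
    tail = distances[-tail_length:]
    return sum(d > cutoff for d in tail) > len(tail) / 2
```

Looking only at the tail of the trajectory avoids flagging ligands that fluctuate early on but settle back into the site.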

Method Name

Ensemble-Based Docking

Commercial software packages used

OpenEye Toolkit

Free software packages used

OpenMM, RDKit, PyTorch

Relevant publications of previous uses by your group of this software/method

Computational Docking
- High Throughput Virtual Screening and Validation of a SARS-CoV-2 Main Protease Non-Covalent Inhibitor. Journal of Chemical Informatics, 2022.
- IMPECCABLE: Integrated Modeling PipelinE for COVID Cure by Assessing Better LEads. 50th International Conference on Parallel Processing (ICPP '21).
- Scalable HPC and AI Infrastructure for COVID-19 Therapeutics. Platform for Advanced Scientific Computing (PASC '21).
- Pandemic Drugs at Pandemic Speed: Accelerating COVID-19 Drug Discovery with Hybrid Machine Learning- and Physics-based Simulations on High Performance Computers. Interface Focus, 2021.
HPC Screening of Compounds
- Targeting SARS-CoV-2 with AI- and HPC-enabled lead generation: a first data release. PASC '21.
Deep Drive MD (State space sampling)
- AI-driven multiscale simulations illuminate mechanisms of SARS-CoV-2 spike dynamics. The International Journal of High-Performance Computing Applications, Gordon Bell Special Prize for HPC-Based COVID-19 Research '20.
- Stream-AI-MD: Streaming AI-driven Adaptive Molecular Simulations for Heterogeneous Computing Platforms. Platform for Advanced Scientific Computing (PASC '21).

Hit Optimization Methods

Method type (check all that apply)

Deep learning

Physics-based

Hybrid of the above

Description of your approach (min 200 and max 800 words)

We have two strategies for lead optimization. The first uses reinforcement learning for molecular modeling (RLMM). This technique simulates the docked pose of the active compound. The simulation is paused every 20-50 ns, and small perturbations are made to the ligand based on known reactions. These modifications are chosen algorithmically based on the favorability of the interactions during the preceding simulation interval. The system uses reinforcement learning to slowly optimize the compound in its local state space. The simulations are also analyzed with Amber MMPBSA calculations to estimate the overall quality of the perturbations as a series.
The second technique, which we call scaffold-induced molecular subgraphs, develops a broader range of compounds, though it is connected to RLMM for local exploration. Using our in-house database of compounds, we build a large graph structure that organizes chemical space by scaffold. This allows queries to be designed that sample the local region around the active compounds found during hit identification. After sampling that region and running finer free-energy calculations, we use graph propagation techniques to propagate the scores and find regions likely to yield improved compounds.
Together, these two techniques allow structure-activity-relationship series to be designed and computationally explored, using state-of-the-art graph propagation techniques to make free-energy calculations computationally economical and reinforcement learning constrained by reaction pathways (or known scaffolds) to optimize the compounds further.
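The idea of propagating a small number of free-energy scores across a scaffold graph can be illustrated with a basic label-propagation scheme. This is a sketch under stated assumptions rather than the SIM-SG method: the graph is a plain adjacency dictionary in which every neighbor also appears as a key, scored nodes are held fixed, and each unscored node repeatedly takes the mean of its neighbors. The function and parameter names are hypothetical.

```python
from typing import Dict, List


def propagate_scores(neighbors: Dict[str, List[str]],
                     known: Dict[str, float],
                     iterations: int = 50) -> Dict[str, float]:
    """Estimate scores for unscored nodes by repeated neighbor averaging.

    neighbors: adjacency lists; every neighbor must also be a key.
    known:     nodes with computed scores (e.g. free energies), kept fixed.
    """
    if not known:
        raise ValueError("at least one scored node is required")
    # Start unscored nodes at the mean of the known scores.
    default = sum(known.values()) / len(known)
    scores = {node: known.get(node, default) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbors.items():
            if node in known or not adjacent:
                # Scored or isolated nodes keep their current value.
                updated[node] = scores[node]
            else:
                updated[node] = sum(scores[n] for n in adjacent) / len(adjacent)
        scores = updated
    return scores
```

Nodes whose propagated score is favorable but which have not yet been evaluated are natural candidates for the next round of free-energy calculations, which is how a scheme like this keeps the number of expensive calculations small.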

Method Name

SIM-SG + graph methods

Commercial software packages used

OpenEye

Free software packages used

OpenMM, RDKit

Relevant publications of previous uses by your group of this software/method

SAR Optimization of a lead
- Structural, electronic and electrostatic determinants for inhibitor binding to subsites S1 and S2 in SARS-CoV-2 main protease. Journal of Medicinal Chemistry, 2021.
ScaffoldGraph
- Scaffold embeddings: Learning the structure spanned by chemical fragments, scaffolds and compounds. NeurIPS Workshop on Learning Meaningful Representation of Life '21.
- Scaffold-Induced Molecular Graph (SIMG): Effective Graph Sampling Methods for High-Throughput Computational Drug Discovery. BMC Bioinformatics, forthcoming. Sixth Computational Approaches to Cancer Workshop at Supercomputing '20.

Challenge #1