Challenge #2

Hit Identification

Method type (check all that apply)

De novo design

Deep learning

Free energy perturbation

High-throughput docking

Machine learning

Physics-based

Description of your approach (min 200 and max 800 words)

Our hit identification workflow combines physics-based cheminformatics methods with novel machine learning algorithms. First, we run a fragment-based virtual screen, which our novel pharmacophore matching algorithm speeds up significantly. Second, we enrich the pool of potential hits with de novo generated drug-like candidates. These candidates are then ranked and refined using a sequence of binding affinity estimation techniques of increasing accuracy. Initial rounds of selection use docking scores. Subsequently, the most promising candidates are scored using MD-based methods, including MM-PBSA and FEP with QM-enhanced force fields. Additionally, we employ a Bayesian optimization process in which a probabilistic machine learning model scores compounds. This model is iteratively trained on MD-derived scores and guides the selection for subsequent rounds of MD simulation.

We use a fragment combination approach to search a vast combinatorial space of reagent combinations while reducing the need for computationally expensive methods. We construct several pharmacophore models based on the target site and use them to generate complementary pharmacophore embeddings. Each pharmacophore is divided into non-overlapping sub-components, and fragment-sized compounds are matched against each sub-component. Hits from complementary pharmacophore sets are fed into a reaction predictor, which determines whether the fragments can be combined based on both chemical and geometrical constraints. If the combined compound remains compatible with the full pharmacophore, it is passed on for scoring.
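The pairing step described above can be sketched as follows. This is only an illustration under stated assumptions: the hit records, the `COMPATIBLE` reaction table, the `MAX_BOND_GAP` threshold and both function names are hypothetical, and the actual reaction predictor is not specified here.

```python
from itertools import product
from math import dist

# Hypothetical hits for two complementary pharmacophore sub-components.
# Each record holds the fragment's reactive handle and the position (in Å)
# of its attachment atom after placement in the binding site.
hits_a = [
    {"id": "frag_A1", "attach_xyz": (0.0, 0.0, 0.0), "handle": "amine"},
    {"id": "frag_A2", "attach_xyz": (5.0, 0.0, 0.0), "handle": "amine"},
]
hits_b = [
    {"id": "frag_B1", "attach_xyz": (1.4, 0.0, 0.0), "handle": "carboxylic_acid"},
    {"id": "frag_B2", "attach_xyz": (1.5, 0.0, 0.0), "handle": "alcohol"},
]

# Assumed reaction table: pairs of handles that can be coupled.
COMPATIBLE = {("amine", "carboxylic_acid")}
MAX_BOND_GAP = 2.0  # Å, illustrative geometric threshold


def can_combine(frag_a, frag_b):
    """Check the chemical and the geometrical constraint for one pair."""
    chem_ok = (frag_a["handle"], frag_b["handle"]) in COMPATIBLE
    geom_ok = dist(frag_a["attach_xyz"], frag_b["attach_xyz"]) <= MAX_BOND_GAP
    return chem_ok and geom_ok


def enumerate_candidates(hits_a, hits_b):
    """Return the ID pairs that pass both constraints."""
    return [(a["id"], b["id"]) for a, b in product(hits_a, hits_b) if can_combine(a, b)]


candidates = enumerate_candidates(hits_a, hits_b)
print(candidates)
```

In this toy data only `frag_A1` and `frag_B1` have both a compatible handle pair and attachment atoms close enough to bond. A real workflow would still need to re-check the combined compound against the full pharmacophore.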

We use the Enamine building-block fragment libraries, because Enamine combines these building blocks to derive the full Enamine REAL database. The compounds we propose are therefore available from Enamine. Furthermore, this approach lets us effectively explore the full 4.5bn compound space while running pharmacophore matching on only ~300,000 fragment compounds. The combinatorial reaction product space grows quadratically (for two-component reactions) or cubically (for three-component reactions) with the number of fragments, while the computational cost of matching grows only linearly.
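The scaling argument can be checked with a few lines of arithmetic. The sketch below counts unordered fragment combinations as an upper bound; it ignores reaction compatibility, so the real product space is smaller. The function name and the fragment counts are illustrative.

```python
from math import comb


def search_costs(n_fragments):
    """Compare the matching cost with upper bounds on the product space."""
    return {
        "pharmacophore_matches": n_fragments,                # linear
        "two_component_upper_bound": comb(n_fragments, 2),   # ~ n^2 / 2
        "three_component_upper_bound": comb(n_fragments, 3), # ~ n^3 / 6
    }


small = search_costs(1000)
large = search_costs(2000)

# Doubling the fragment count doubles the matching work, but roughly
# quadruples the pair space and multiplies the triple space by about eight.
for key in small:
    print(key, large[key] / small[key])
```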

Fragment combinations that pass the pharmacophore model are further scored by consensus docking [AutoDock Vina, PLANTS, Schrödinger Glide, Molecular Operating Environment]. For the most promising pool of candidates, we compute the absolute binding free energy using the molecular mechanics Poisson-Boltzmann surface area (MM-PBSA) method [AmberTools and GROMACS software]. Free energy perturbation calculations will be used to choose between top-ranking ligands with similar structures and to evaluate minor modifications to the final structures.
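The consensus scheme is not specified above. One common option is to average each ligand's rank across docking programs, which avoids comparing scores on different scales. The sketch below assumes that convention and that lower scores are better in every program; the scores and ligand names are made up for illustration.

```python
def consensus_rank(scores_by_program):
    """Average each ligand's rank across programs (rank 1 = best score)."""
    ligands = sorted(next(iter(scores_by_program.values())))
    totals = {lig: 0 for lig in ligands}
    for scores in scores_by_program.values():
        # Assumption: lower (more negative) scores are better in every program.
        ordered = sorted(ligands, key=lambda lig: scores[lig])
        for rank, lig in enumerate(ordered, start=1):
            totals[lig] += rank
    n_programs = len(scores_by_program)
    return sorted((totals[lig] / n_programs, lig) for lig in ligands)


# Hypothetical docking scores for three ligands from three programs.
scores = {
    "vina": {"lig_A": -9.1, "lig_B": -7.4, "lig_C": -8.2},
    "plants": {"lig_A": -80.0, "lig_B": -95.0, "lig_C": -85.0},
    "glide": {"lig_A": -7.5, "lig_B": -6.0, "lig_C": -8.0},
}
ranking = consensus_rank(scores)
print([lig for _, lig in ranking])
```

Here `lig_C` comes out on top even though it is the best ligand in only one program, because it is never ranked last.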

One source of molecules for the binding affinity estimation funnel is a variational autoencoder (VAE) trained on SMILES string representations from the Enamine Hit Locator Library. By training on this library of ~460k molecules, we can generate novel molecules with characteristics similar to those present in Enamine. Beyond novelty, the VAE also allows us to sample molecules in the neural network’s latent space. This can act as a molecular scaffold similarity search and also allows interpolation between molecular structures. Once we have estimated binding affinities for molecules from any of the pipelines, we will select the highest-scoring molecules and retrain the VAE on this subset. This shifts the VAE’s generative distribution towards molecular structures with higher predicted binding affinity, thereby increasing sampling efficiency.
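Interpolation in latent space can be illustrated without the network itself. The sketch below assumes simple linear interpolation between two latent vectors; the two-dimensional vectors are placeholders, and the encoder and decoder that would map SMILES strings to and from these points are not shown.

```python
def interpolate_latent(z_start, z_end, n_points):
    """Return n_points latent vectors evenly spaced from z_start to z_end."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    steps = [i / (n_points - 1) for i in range(n_points)]
    return [
        [(1 - t) * a + t * b for a, b in zip(z_start, z_end)]
        for t in steps
    ]


# Placeholder latent codes for two molecules (real codes would come from
# the trained encoder and have many more dimensions).
path = interpolate_latent([0.0, 2.0], [1.0, 0.0], n_points=5)
for z in path:
    print(z)
```

Each intermediate point would be passed to the decoder to propose a structure between the two end-point molecules.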

A key aspect of our strategy is ensuring efficient sampling of the 4.5bn compound space of the Enamine REAL database, while also ensuring that the most computationally intensive methods (MM-PBSA & FEP) are reserved for the most promising candidate molecules. Using cluster compute resources we are able to dock several million molecules: this includes carrying out ensemble docking against multiple conformations of the receptor.

The curated set of ~4 million molecules that we dock and score will comprise the following:

Pharmacophore matches generated from combining Enamine building blocks
Molecules generated by the VAE
~3m molecules from the Enamine high-throughput screening library
~460k molecules from the Enamine hit locator library
~200k molecules from the Enamine building blocks library

Using this combination of building-block combinatorial models, ML generative models, and targeted libraries, we probe most of the relevant molecular space of the 4.5bn compound Enamine REAL database while reducing docking computations by roughly three orders of magnitude (from 4.5bn to ~4 million compounds).

To maximise the information gained from the most computationally intensive MD-based scoring methods, we will follow the Bayesian optimization strategy outlined in [Hernández-Lobato et al. 2017]. We train a Bayesian graph neural network (BGNN) to estimate the MD-derived binding affinity, then use the BGNN’s predictions to select which molecules to score with MD next. The network is retrained on the new data and the cycle is repeated.
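The select, score and retrain cycle can be sketched with toy stand-ins. Everything below is an assumption made for illustration: `run_md_scoring` replaces the expensive MD score with a simple function, `sample_surrogate` replaces the BGNN with a nearest-neighbour guess whose noise grows with distance from the labelled data, and the batch is chosen from a single noisy draw, which is a simplification of Thompson sampling.

```python
import random


def run_md_scoring(x):
    """Hypothetical stand-in for an expensive MD-derived score (higher = better)."""
    return -(x - 0.7) ** 2


def sample_surrogate(x, labelled, rng):
    """Draw one plausible score for x: the nearest labelled value plus noise
    that grows with distance from the labelled data (crude uncertainty)."""
    nearest_x, nearest_y = min(labelled.items(), key=lambda kv: abs(kv[0] - x))
    return nearest_y + rng.gauss(0.0, abs(nearest_x - x))


def bayes_opt(pool, n_initial, n_rounds, batch_size, seed=0):
    rng = random.Random(seed)
    labelled = {x: run_md_scoring(x) for x in rng.sample(pool, n_initial)}
    for _ in range(n_rounds):
        unlabelled = [x for x in pool if x not in labelled]
        # Thompson-style selection: rank candidates by a sampled score.
        draws = {x: sample_surrogate(x, labelled, rng) for x in unlabelled}
        batch = sorted(unlabelled, key=draws.get, reverse=True)[:batch_size]
        for x in batch:
            # "Retraining" here simply means adding the new data points.
            labelled[x] = run_md_scoring(x)
    return labelled


pool = [i / 100 for i in range(101)]  # one-dimensional toy "molecules"
labelled = bayes_opt(pool, n_initial=5, n_rounds=4, batch_size=3)
best_x = max(labelled, key=labelled.get)
print(len(labelled), best_x)
```

Only 17 of the 101 candidates are ever scored with the expensive function, which is the point of the strategy: the surrogate decides where the costly evaluations are spent.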

To ensure a diversity of structures in the final submission, we will cluster compounds based on structure and submit one sample from each cluster. FEP will be used to decide between structurally similar molecules.
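One simple way to realise the diversity selection is a greedy pick over score-sorted candidates, skipping any compound too similar to one already chosen. The sketch below assumes Tanimoto similarity on fingerprint bit sets and an illustrative threshold of 0.6; the clustering method actually used is not specified above, and a plain score stands in for the FEP-based comparison.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def pick_diverse(candidates, threshold=0.6):
    """candidates: list of (name, score, fingerprint_set); higher score = better.
    Keep the best-scoring compound of each group of similar structures."""
    picked = []
    for name, score, fp in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(tanimoto(fp, other_fp) < threshold for _, _, other_fp in picked):
            picked.append((name, score, fp))
    return [name for name, _, _ in picked]


# Hypothetical compounds with made-up scores and fingerprint bits.
compounds = [
    ("mol_1", 9.0, {1, 2, 3, 4}),
    ("mol_2", 8.5, {1, 2, 3, 5}),   # close analogue of mol_1
    ("mol_3", 7.0, {7, 8, 9}),
    ("mol_4", 6.0, {7, 8, 9, 10}),  # close analogue of mol_3
]
selection = pick_diverse(compounds)
print(selection)
```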

What makes your approach stand out from the community? (<100 words)

First, partial-pharmacophore fragment matching combined with the fragment fuser allows us to reduce pharmacophore embedding and searching from billions of molecules to a few hundred thousand. Second, we use a VAE as an additional source of diverse candidate molecules. Finally, we optimise the use of computationally expensive MD-based scoring by iteratively training a BGNN to predict these scores and guide which molecules are evaluated next. Each of these three novel methods independently complements our core tried-and-tested virtual screening funnel:

1) Initial filter / Pharmacophore

2) Docking

3) MD-based scoring

4) Molecular optimization

Method Name

Hit Stream

Commercial software packages used

Molecular Operating Environment (MOE)
Glide (Schrödinger)
Q-Chem
Gaussian

Free software packages used

AutoDock Vina
Protein-Ligand ANT System (PLANTS)
GROMACS
Dock 3.7 (Kuntz Group UCSF)

Relevant publications of previous uses by your group of this software/method

Badaoui, Magd, Pedro J. Buigues, Dénes Berta, Gaurav M. Mandana, Hankang Gu, Tamás Földes, Callum J. Dickson, et al. ‘Combined Free-Energy Calculation and Machine Learning Methods for Understanding Ligand Unbinding Kinetics’. Journal of Chemical Theory and Computation 18, no. 4 (12 April 2022): 2543–55.

https://doi.org/10.1021/acs.jctc.1c00924

Berta, Dénes, Magd Badaoui, Sam Alexander Martino, Pedro J. Buigues, Andrei V. Pisliakov, Nadia Elghobashi-Meinhardt, Geoff Wells, Sarah A. Harris, Elisa Frezza, and Edina Rosta. ‘Modelling the Active SARS-CoV-2 Helicase Complex as a Basis for Structure-Based Inhibitor Design’. Chemical Science 12, no. 40 (2021): 13492–505. https://doi.org/10.1039/D1SC02775A

Bradshaw, John, Brooks Paige, Matt J. Kusner, Marwin H. S. Segler, and José Miguel Hernández-Lobato. ‘Barking up the Right Tree: An Approach to Search over Molecule Synthesis DAGs’. arXiv, 21 December 2020.

http://arxiv.org/abs/2012.11522

Cook, Nicola J., Wen Li, Dénes Berta, Magd Badaoui, Allison Ballandras-Colas, Andrea Nans, Abhay Kotecha, Edina Rosta, Alan N. Engelman, and Peter Cherepanov. ‘Structural Basis of Second-Generation HIV Integrase Inhibitor Action and Viral Resistance’. Science 367, no. 6479 (2020): 806–10.

https://www.science.org/doi/abs/10.1126/science.aay4919  

Hernández-Lobato, José Miguel, James Requeima, Edward O. Pyzer-Knapp, and Alan Aspuru-Guzik. ‘Parallel and Distributed Thompson Sampling for Large-Scale Accelerated Exploration of Chemical Space’. Proceedings of Machine Learning Research, 2017. https://dash.harvard.edu/handle/1/35164973

Karlova, Andrea, Wim Dehaen, and Andrei Penciu. ‘How to Reward Your Drug Agent?’, NeurIPS Workshop 2021.

https://openreview.net/forum?id=aJt1LUxUIqB

Karlova, Andrea, Wim Dehaen, Andrei Penciu, Richard Dallaway, and Suran Goonatilake. ‘PEPSI: Post-Docking Evaluation with Protein-Small Molecules Interaction’, ICML Workshop on Computational Biology 2022.

https://icml-compbio.github.io/2022/papers/WCBICML2022_paper_57.pdf

Nguyen, Minh N., Neeladri Sen, Meiyin Lin, Thomas Leonard Joseph, Candida Vaz, Vivek Tanavde, Luke Way, Ted Hupp, Chandra S. Verma, and M. S. Madhusudhan. ‘Discovering Putative Protein Targets of Small Molecules: A Study of the P53 Activator Nutlin’. Journal of Chemical Information and Modeling 59, no. 4 (22 April 2019): 1529–46. https://doi.org/10.1021/acs.jcim.8b00762.

Sen, Neeladri, Tejashree Rajaram Kanitkar, Ankit Animesh Roy, Neelesh Soni, Kaustubh Amritkar, Shreyas Supekar, Sanjana Nair, Gulzar Singh, and M. S. Madhusudhan. ‘Predicting and Designing Therapeutics against the Nipah Virus’. PLOS Neglected Tropical Diseases 13, no. 12 (12 December 2019): e0007419. https://doi.org/10.1371/journal.pntd.0007419

Virtual screening of merged selections

Method type (check all that apply)

Free energy perturbation

High-throughput docking

Physics-based

Description of your approach (min 200 and max 800 words)

Our approach to predicting actives from the pooled candidate molecules will use the physics-based scoring techniques described in the Hit Optimization stage, combined with additional MD analysis. We generate 3D coordinates for the ligands, use docking to obtain an initial binding location, and then apply MD-based methods (MM-PBSA and FEP) to estimate absolute and relative binding affinities. Finally, a further MD analysis determines whether probable binders are likely to disrupt RNA binding.

The helicase structure is particularly flexible at the domain level, especially in domain 1B, and its conformation depends on bound substrates and interacting partners. Evaluation of potential binders must take this flexibility into account, so we sample multiple members of the protein conformational ensemble in our screening.
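How scores from several receptor conformations are merged is not stated above. A minimal sketch, assuming the simplest rule of keeping each ligand's best score over the ensemble (lower = better), is shown below; the conformation labels and scores are hypothetical.

```python
def best_over_ensemble(scores_by_conformation):
    """Return (best_score, conformation) for one ligand; lower = better."""
    conformation = min(scores_by_conformation, key=scores_by_conformation.get)
    return scores_by_conformation[conformation], conformation


# Hypothetical docking scores of one ligand against three receptor states.
ligand_scores = {"apo": -6.2, "ATP_bound": -8.1, "RNA_bound": -7.0}
result = best_over_ensemble(ligand_scores)
print(result)
```

Other aggregation rules, such as an average or a Boltzmann-weighted score, would fit the same interface.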

The accuracy of binding prediction with MM-PBSA and FEP hinges on the molecular mechanics parameters of ligands. We support the parameter fitting with QM calculations on top of the generic all-atom force field approach to create an accurate description of non-bonding interactions.

What makes your approach stand out from the community? (<100 words)

Our team has extensive experience in running MD simulations on NSP13 as illustrated in Berta et al. 2021. This includes simulations with ATP, ADP or empty binding pockets, as well as models with and without RNA. Given that NSP13 is a particularly flexible receptor, we believe that sampling a variety of conformations is likely to be an important factor in assessing biological activity. Furthermore, we expect our MD analysis to yield further insights as to whether any known binders are likely to be biologically active.

Method Name

Not given

Commercial software packages used

See section 1

Free software packages used

See section 1

Relevant publications of previous uses by your group of this software/method

See section 1

Hit Optimization Methods

Method type (check all that apply)

De novo design

Deep learning

Free energy perturbation

High-throughput docking

Machine learning

Physics-based

Description of your approach (min 200 and max 800 words)

Our optimisation strategy consists of two components. First, we search for molecular optimisations of the experimental hits via a modify-and-test cycle. Proposed modifications to functional groups and atoms will be scored using free energy perturbation and accepted with a probability that depends on the change in binding free energy (subject to a simulated annealing regime). Second, experimental hits (and to some extent negatives) will be used to update the pharmacophore and fine-tune the output of the VAE. Updating these methods at the top of the Hit Identification pipeline will yield new candidates that will be scored in the same way as described in sections 1 & 2.

Additional candidates for the scoring pipeline will be generated from a structure similarity search of the experimental hits against the Enamine database. Further candidates will be generated by running the genetic-algorithm lead optimization software AutoGrow4, seeded with a population consisting of the experimental hits and some structural analogs.

Our BGNN binding affinity predictor allows us to rapidly score millions of structural analogs of the experimental hits, so we can efficiently select which compounds to evaluate with more computationally expensive methods such as MM-PBSA and FEP. Rapid prediction with the BGNN also means that we can enumerate a larger space of molecules, including multi-step modifications to the lead molecules, which would not be possible if we relied solely on MD-based binding affinity estimates.
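The acceptance step of the modify-and-test cycle can be sketched with a Metropolis-style rule. This is an assumption: the exact acceptance function and cooling schedule are not given above, and the temperature here is just a tuning parameter in the same energy units as the free energy change.

```python
import math
import random


def accept_modification(ddg, temperature, rng):
    """Metropolis-style rule (assumed): always accept a modification that
    lowers the binding free energy (ddg <= 0); accept a worse one with
    probability exp(-ddg / temperature)."""
    if ddg <= 0:
        return True
    return rng.random() < math.exp(-ddg / temperature)


def annealing_schedule(t_start, t_end, n_steps):
    """Geometric cooling from t_start to t_end over n_steps (n_steps >= 2)."""
    ratio = (t_end / t_start) ** (1 / (n_steps - 1))
    return [t_start * ratio ** i for i in range(n_steps)]


rng = random.Random(0)
schedule = annealing_schedule(t_start=2.0, t_end=0.1, n_steps=5)

# Same unfavourable change (+1.0) tested at each temperature: it becomes
# less likely to be accepted as the system cools.
for temperature in schedule:
    probability = math.exp(-1.0 / temperature)
    print(round(temperature, 3), round(probability, 4),
          accept_modification(1.0, temperature, rng))
```

Early in the schedule the search can accept slightly worse modifications to escape local optima; late in the schedule it behaves almost greedily.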

What makes your approach stand out from the community? (<100 words)

Our team includes researchers experienced in drug discovery from multiple disciplines including physics, pharmacy, structural molecular biology and computer science with additional industry experience in medicinal chemistry. Our approach reflects this diverse range of experience, combining expertise in state-of-the-art machine learning architectures, pharmacophore design, and advanced free energy calculations using classical and QM-based MD simulations.

Method Name

Not given

Commercial software packages used

See section 1

Free software packages used

See section 1

Relevant publications of previous uses by your group of this software/method

See section 1