Challenge #4

Hit Identification

Description of your approach (min 200 and max 800 words)

Our proposed hit-identification workflow extends pipelines developed by our team during CACHE Challenges 2 and 3. The core methodology consists of high-throughput docking followed by binding affinity estimation using Molecular Mechanics Poisson-Boltzmann Surface Area (MMPBSA) on multiple poses drawn from a molecular dynamics (MD) run of the protein-ligand complex. Given the computational cost of running MD and MMPBSA, we need to explore most of the molecular space of Enamine REAL while reserving the expensive evaluations for the most promising molecules. To do this, we use an ensemble of Bayesian graph neural networks to predict the binding affinity scores derived from MMPBSA, and a second model that distinguishes experimental hits with high binding affinity from decoys. These ML models are used to quickly score molecules from Enamine REAL.

Bayesian Graph Neural Network (BGNN) binding affinity predictor: we create two training datasets, which are used to train two separate models. The first dataset is generated by running MMPBSA on a few hundred molecules to obtain binding affinity estimates. The second dataset consists of the experimentally observed hits for CBLB together with decoy molecules that have similar chemical properties, and is used to train a model to distinguish hits from decoys. We train a BGNN on each of these datasets using the Chemprop library. These models are used to generate predictions on the Enamine REAL database, and the molecules with the highest upper confidence bound are selected to pass through the docking + MMPBSA pipeline. After more MMPBSA scores have been obtained, the first model can be re-trained and the cycle repeated. In CACHE 2 we were able to generate predictions for 1bn molecules per day and we found good concordance between the MMPBSA predicted score and the BGNN score.
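
As an illustration of the selection step, the following is a minimal sketch of upper-confidence-bound ranking over ensemble predictions. It is not our production code: the function names, the `kappa` exploration weight, and the convention that a higher score means stronger predicted binding (for example, a negated MMPBSA energy) are all assumptions made for the example, and the ensemble predictions are passed in as plain lists rather than produced by Chemprop.

```python
from statistics import fmean, pstdev


def upper_confidence_bounds(ensemble_preds, kappa=1.0):
    """Return mean + kappa * std for each molecule.

    ensemble_preds: list of lists, one inner list per ensemble member, each
    holding that member's predicted score for every molecule. Assumption:
    HIGHER score means stronger predicted binding.
    kappa: hypothetical exploration weight; 0 ranks by the mean alone.
    """
    per_molecule = zip(*ensemble_preds)  # transpose to one tuple per molecule
    return [fmean(p) + kappa * pstdev(p) for p in per_molecule]


def select_by_ucb(ensemble_preds, n_select, kappa=1.0):
    """Return indices of the n_select molecules with the highest UCB."""
    ucb = upper_confidence_bounds(ensemble_preds, kappa)
    ranked = sorted(range(len(ucb)), key=lambda i: ucb[i], reverse=True)
    return ranked[:n_select]


# Toy example: two ensemble members, three molecules.
# Molecule 2 has a middling mean but high disagreement, so UCB favours it.
toy_preds = [
    [1.0, 2.0, 0.0],
    [1.0, 2.0, 4.0],
]
selected = select_by_ucb(toy_preds, n_select=2, kappa=1.0)
```

With `kappa=0` the same call reduces to ranking by the ensemble mean, which is a convenient baseline for checking how much the uncertainty term changes the selection.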

A key aspect of our strategy is sampling the 5.5bn-compound space of the Enamine REAL database efficiently, while reserving the most computationally intensive methods (MMPBSA & FEP) for the most promising candidate molecules. Using cluster compute resources we are able to dock several million molecules; this includes ensemble docking against multiple conformations of the receptor.

The curated set of ~2.5 million molecules that we dock and score will be composed of the following:

Molecules from Enamine REAL with a high BGNN score
~2.1m molecules from the Enamine high-throughput screening library
~460k molecules from the Enamine hit locator library
~59k molecules from the Enamine premium collection

Ligands are initially filtered by molecular weight and solubility. Final selections are screened for PAINS (pan-assay interference compounds) [BadApples webserver] to exclude ligands likely to be toxic or promiscuous binders. To ensure structural diversity in the final submission, we will cluster compounds by structure and submit one representative from each cluster.
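
The diversity step can be sketched as a simple leader (sphere-exclusion) clustering on Tanimoto similarity, keeping the best-scoring member of each cluster. This is an illustrative sketch only: the clustering algorithm, the 0.6 similarity threshold, the representation of fingerprints as Python sets of on-bit indices, and the "lower score is better" convention are assumptions for the example rather than a description of the exact procedure used.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def pick_diverse(candidates, threshold=0.6):
    """Return one representative name per structural cluster.

    candidates: list of (name, score, fingerprint_set) tuples. Assumption:
    LOWER score is better (e.g. a binding energy in kcal/mol).
    threshold: hypothetical similarity cut-off; a molecule at or above this
    similarity to an existing representative is treated as the same cluster.
    """
    ranked = sorted(candidates, key=lambda c: c[1])  # best score first
    representatives = []
    for name, score, fp in ranked:
        # Keep the molecule only if it is dissimilar to every representative
        # chosen so far; otherwise it falls into an existing cluster.
        if all(tanimoto(fp, rep_fp) < threshold for _, _, rep_fp in representatives):
            representatives.append((name, score, fp))
    return [name for name, _, _ in representatives]


# Toy example with made-up fingerprints: "b" is a close analog of "a".
toy_candidates = [
    ("a", -10.0, {1, 2, 3, 4}),
    ("b", -9.0, {1, 2, 3, 4, 5}),
    ("c", -8.0, {7, 8, 9}),
]
chosen = pick_diverse(toy_candidates)
```

Because candidates are visited in score order, each cluster is represented by its best-scoring member, so the diversity filter does not displace the top compound of any chemotype.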

What makes your approach stand out from the community? (<100 words)

We optimise the use of computationally expensive MD-based scoring by iteratively training a BGNN to predict these scores and using a Bayesian optimisation strategy to guide which molecules are evaluated next. This allows us to quickly screen Enamine REAL. These methods complement our virtual screening funnel, which uses docking followed by binding affinity estimation using MMPBSA.

Free software packages used

AutoDock Vina

AutoDock4

rDock

GROMACS

DOCK v6 & v3 (Kuntz Group UCSF)

PyAutoFEP

OpenBabel

Chemprop

Virtual screening of merged selections

Method type (check all that apply)

Deep learning

Free energy perturbation

High-throughput docking

Machine learning

Physics-based

Description of your approach (min 200 and max 800 words)

Our approach to predicting actives from the pooled candidate molecules will use the physics-based scoring techniques described in the Hit Optimization stage. We generate 3D coordinates of the ligands [OpenBabel], use docking to obtain an initial binding location [AutoDock Vina], and then apply MD-based methods, MMPBSA [GROMACS, gmx-MMPBSA] and FEP [GROMACS, PyAutoFEP], to estimate absolute and relative binding affinity. Alongside the physics-based methods, we will leverage the relatively large number of experimentally observed hits for CBLB: we hope that the machine learning methods (described in section 1) for predicting binding affinity based on experimentally observed actives and decoys will provide a useful signal for ranking the merged selections.

What makes your approach stand out from the community? (<100 words)

We combine physics-based methods (docking + MMPBSA) with machine learning models trained on the experimental hits.

Commercial software packages used

See section 1

Free software packages used

See section 1

Relevant publications of previous uses by your group of this software/method

See section 1

Hit Optimization Methods

Method type (check all that apply)

Deep learning

Free energy perturbation

High-throughput docking

Machine learning

Physics-based

Description of your approach (min 200 and max 800 words)

Our optimisation strategy consists of two components. Firstly, we will search for molecular optimisations to the experimental hits via a modify-and-test cycle. Proposed modifications to functional groups and atoms will be scored using free energy perturbation and accepted with a probability in proportion to the change in binding free energy (subject to a simulated annealing regime). Secondly, experimental hits (and to some extent negatives) will be used to re-train the QSAR model. Updating these methods at the top of the Hit Identification pipeline will yield new candidates that will be scored in the same way as described in sections 1 & 2. Additional candidates to put through the scoring pipeline will be generated from a structure similarity search of the experimental hits against the Enamine database. Our BGNN binding affinity predictor allows us to rapidly score millions of structural analogs of the experimental hits, and so to efficiently select which compounds should be evaluated with more computationally expensive methods such as MMPBSA and FEP. Rapid prediction with the BGNN also means that we can enumerate a larger space of molecules, encompassing multi-step modifications to the lead molecules, which would not be possible if we relied solely on MD-based binding affinity estimations.
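
The modify-and-test cycle can be illustrated with a short simulated annealing loop. The sketch below uses the standard Metropolis criterion as a stand-in for the acceptance rule, which is an assumption on our part for the purpose of the example; the `propose` and `score_ddg` callables, the geometric cooling schedule, and the temperature values (in the same energy units as the free energy change) are likewise hypothetical placeholders. In practice `score_ddg` would wrap an FEP calculation.

```python
import math
import random


def accept_modification(ddg, temperature, rng=random):
    """Metropolis acceptance test for a proposed modification.

    ddg: change in binding free energy for the modification; negative means
    the modified molecule is predicted to bind more strongly.
    temperature: annealing temperature in the same energy units as ddg.
    """
    if ddg <= 0:
        return True  # improvements are always accepted
    return rng.random() < math.exp(-ddg / temperature)


def anneal(start, propose, score_ddg, n_steps=100, t_start=2.0, t_end=0.1, seed=0):
    """Run a modify-and-test cycle with geometric cooling.

    propose(current, rng) -> candidate molecule (hypothetical callable).
    score_ddg(current, candidate) -> free energy change (hypothetical
    callable standing in for an FEP calculation).
    """
    rng = random.Random(seed)
    current = start
    for step in range(n_steps):
        # Interpolate geometrically from t_start down to t_end.
        frac = step / max(n_steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac
        candidate = propose(current, rng)
        if accept_modification(score_ddg(current, candidate), temperature, rng):
            current = candidate
    return current
```

Early in the schedule the high temperature lets the search accept some unfavourable modifications and escape local optima; as the temperature falls, only modifications that improve the predicted binding free energy are retained.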

What makes your approach stand out from the community? (<100 words)

Our team includes researchers experienced in drug discovery from multiple disciplines including physics, pharmacy, structural molecular biology and computer science with additional industry experience in medicinal chemistry. Our approach reflects this diverse range of experience, combining expertise in state-of-the-art machine learning architectures, pharmacophore design, and advanced free energy calculations using classical and QM-based MD simulations.

Commercial software packages used

See section 1

Free software packages used

See section 1

Relevant publications of previous uses by your group of this software/method

See section 1