CACHE

CRITICAL ASSESSMENT OF COMPUTATIONAL HIT-FINDING EXPERIMENTS

Challenge #3

Hit Identification
Method Name
DockAI
Hybrid of the above
We propose a hybrid approach in which we train a model on docking scores and apply active learning strategies to identify promising hits from large libraries of molecules or from our generative AI technology.
Description of your approach (min 200 and max 800 words)

Our approach, called DockAI, combines docking with a state-of-the-art active learning methodology to significantly improve the efficiency and effectiveness of virtual screening and hit identification.

With the advent of make-on-demand commercial libraries, the number of purchasable compounds available for virtual screening has grown exponentially in recent years, with several libraries containing over one billion compounds. These ultra-large libraries offer a wealth of potential hit compounds, but traditional docking approaches that score every compound individually can be cost-prohibitive and time-consuming. That's where DockAI comes in.

Our advanced active learning methodology enables us to select the most informative compounds from a chemical library for docking and scoring, ensuring that we are focusing on the most promising examples and maximizing the chances of identifying hit compounds. This not only increases efficiency but also enhances the overall performance of the method, as demonstrated in case studies where we outperformed other virtual screening approaches, recovering more than 75% of the best docking compounds with a 100-fold reduction in compute cost.

In addition to our active learning methodology, DockAI utilizes a robust docking pipeline that has been carefully designed and tested to handle even the largest and most diverse chemical libraries. With DockAI, we can efficiently search ultra-large libraries of virtual compounds for hit compounds, saving time and resources for other important aspects of drug discovery research.

The pipeline starts by sampling a subset of the library that is docked and constitutes the first training set. Each active learning iteration then consists of five steps:

  • Train the model on the docked compounds,
  • Infer the whole library,
  • Pick the best-predicted compounds,
  • Dock them within the pocket,
  • Add them to the training set.

Iteration after iteration, the distribution of docking scores in the training set shifts toward better scores. Consequently, the model becomes better at identifying good docking compounds in the library.
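The loop above can be sketched in a few lines. Here `dock` is a toy stand-in for a real docking run and the surrogate "model" is a nearest-neighbour lookup over a one-dimensional descriptor; neither is the actual DockAI implementation.

```python
import random

random.seed(0)

def dock(x):
    # Toy stand-in for a real docking run (lower score = better).
    return (x - 0.3) ** 2

# Toy "chemical library": 1000 compounds encoded as a 1-D descriptor.
library = [i / 1000 for i in range(1000)]

# Initial random subset is docked and constitutes the first training set.
train = {x: dock(x) for x in random.sample(library, 20)}

for _ in range(10):                      # active learning iterations
    known = sorted(train)                # 1. "train" (memorise docked points)
    def predict(x):                      # 2. infer the whole library
        nearest = min(known, key=lambda k: abs(k - x))
        return train[nearest]
    scored = {x: predict(x) for x in library if x not in train}
    picks = sorted(scored, key=scored.get)[:20]   # 3. best-predicted
    train.update({x: dock(x) for x in picks})     # 4. dock + 5. add to set

best = min(train, key=train.get)
# Only ~22% of the library was docked, yet the best docked compound
# should sit close to the true optimum at 0.3.
```

The distribution of docked scores concentrates around the optimum over the iterations, mirroring the behaviour described above.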

Finally, a medicinal chemist filters the whole set of docked molecules, selecting a sample that maximizes chemical diversity among the molecules with the best docking scores.
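A diversity-maximizing selection of this kind can be sketched with a greedy max-min picker. Real pipelines typically operate on Morgan fingerprints (e.g. via RDKit's MaxMinPicker); the tiny bit-set "fingerprints" below are invented for illustration.

```python
def tanimoto(a, b):
    # Tanimoto similarity between two fingerprints stored as bit-index sets.
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def pick_diverse(fps, n):
    # Greedy max-min: seed with the top-scoring compound (index 0), then
    # repeatedly add the compound farthest from everything already chosen.
    chosen = [0]
    while len(chosen) < n:
        best = max(
            (i for i in range(len(fps)) if i not in chosen),
            key=lambda i: min(1 - tanimoto(fps[i], fps[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Invented fingerprints: compounds 0/1/3 are one chemotype, 2/4 another.
fps = [{0, 1, 2}, {0, 1, 3}, {5, 6, 7}, {0, 2, 3}, {5, 6, 8}]
print(pick_diverse(fps, 3))  # picks across both chemotypes
```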

We have also developed a generative AI technology based on synthesis templates. Combined with DockAI, it gives us a unique capability to explore a huge (10^15) chemical space of easy-to-make molecules, with an associated estimated cost of synthesis. The molecules identified using DockAI are not only promising but also readily synthesizable and accessible, making them ideal for further development and optimization.

The top molecules are then passed to single-point MMGBSA rescoring.

MMGBSA (Molecular Mechanics Generalized Born Surface Area) is a computational method used in drug discovery projects to rescore chemical compounds. The method combines molecular mechanics and continuum electrostatics to estimate the free energy of binding between a ligand and a protein receptor. A molecular dynamics simulation samples the conformational space of the protein-ligand complex, and the binding free energy is estimated by summing the energy components from the molecular mechanics force field and the continuum electrostatics calculations. The final MMGBSA score combines the ligand-receptor interaction energy and the solvation free energy of the complex. The approach is used to rank the predicted binding modes of a set of compounds and identify the most promising candidates for further experimental studies.
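In equation form, the score sums the standard MM-GBSA components (the entropy term is often omitted in single-point rescoring):

```latex
\Delta G_{\mathrm{bind}} \approx \Delta E_{\mathrm{MM}} + \Delta G_{\mathrm{GB}} + \Delta G_{\mathrm{SA}} - T\Delta S
```

where ΔE_MM collects the bonded, electrostatic, and van der Waals terms from the force field, ΔG_GB is the polar (Generalized Born) solvation term, and ΔG_SA is the nonpolar surface-area term.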

Finally, the listing is reviewed visually by the computational chemistry and medicinal chemistry teams.

What makes your approach stand out from the community? (<100 words)

We have developed what we believe is one of the best active learning protocols to retrieve the most promising molecules in a minimum of computational time. This drastic reduction of costs allows us to evaluate multiple combinations of protein structures, docking parameters, and scoring functions to increase the chances of success. All these elements set our approach apart from the community.

Commercial software packages used

Dock6

Free software packages used
  • Torch
  • TorchServe
  • Open Babel
  • RDKit
  • AmberTools
  • GROMACS
  • Sander
Virtual screening of merged selections
Method type (check all that apply)
Deep learning
High-throughput docking
Machine learning
Hybrid of the above
It’s a hybrid since we use all of them iteratively in a sequential approach.
Description of your approach (min 200 and max 800 words)

Our main approach relies on machine learning coupled with molecular descriptors and molecular representations. Docking and MMGBSA scores are used in post-processing.


We are applying an automated QSAR protocol. Once the molecules have been transformed into a molecular representation, a mathematical model tries to detect patterns and trends in the features that could explain the variations in the target property. Several model families exist (linear models, decision trees, ensemble models, neural networks, etc.) and can leverage the features in different ways. Since we generally handle relatively small datasets (< 5000 molecules), the type of model does not significantly impact the quality of the trained predictor. In practice, the best models we use are often the linear models: they are simple, efficient, less prone to unexpected behavior, and tend to generalize better.
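As a toy illustration of such a linear model, here is a one-descriptor ridge regressor written from scratch; the descriptor and activity values are invented and the production protocol is not shown.

```python
def ridge_fit(x, y, lam=1.0):
    # Closed-form ridge solution for a single centred feature:
    #   w = sum(xc * yc) / (sum(xc^2) + lambda)
    xm, ym = sum(x) / len(x), sum(y) / len(y)
    xc = [xi - xm for xi in x]
    yc = [yi - ym for yi in y]
    w = sum(a * b for a, b in zip(xc, yc)) / (sum(a * a for a in xc) + lam)
    b = ym - w * xm          # intercept recovered from the means
    return w, b

logp = [1.0, 2.0, 3.0, 4.0]      # invented molecular descriptor
activity = [0.9, 1.9, 3.1, 4.1]  # invented measured activity
w, b = ridge_fit(logp, activity)
print(w, b)  # slope shrunk slightly below the unpenalized fit
```

The ridge penalty (`lam`) is what keeps such models stable on small, noisy datasets.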

In this phase, we train and build several models using different representations, optimizations, and splits of the dataset. We perform a hyper-optimization of all the parameters for each model. For all models, we compute the predictive power and performance based on a validation dataset.

We have split the models into two categories: classification models and regression models.
Classification models handle categorical data, e.g. the toxicity of a compound (toxic = 1, non-toxic = 0). The model is fitted to discriminate between the different classes and outputs a likelihood score that a compound is either toxic or non-toxic. This score is usually between 0 and 1 and reflects the confidence of the model in its prediction. For instance, a value of 0.2 would mean that the model is fairly confident the compound will be non-toxic, 0.8 that it will be toxic, and at 0.5 the model is uncertain.

Estimating the performance of a model requires leaving a part of the initial dataset aside. This is the train/test split (called the outer split). The model is fitted on the training set, and its performance is computed on the test set. There are multiple ways of splitting the data, and the choice can dramatically impact the measured performance of the model. We use a random split because the models are subsequently applied to generated molecules close to the training set. Other examples of splits are the time split (most recent compounds in the test set, the others in the training set) or the scaffold split. For all cases with multiple folds, the predictions/values are concatenated and the metrics are computed on the full vector.

Aside from the outer split, which is used for model performance estimation, another split is made for parameter optimization: the inner split, performed after the outer split on the outer training set.
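The outer/inner split logic can be sketched as follows (a minimal illustration with placeholder indices, not the production protocol):

```python
import random

random.seed(0)
data = list(range(100))      # placeholder indices standing in for compounds
random.shuffle(data)

# Outer split: a held-out test set used only for the final performance estimate.
test, train_outer = data[:20], data[20:]

# Inner split: carved out of the outer training set for hyper-parameter
# optimisation, so the test set never influences parameter choices.
valid, train_inner = train_outer[:16], train_outer[16:]

print(len(train_inner), len(valid), len(test))  # 64 16 20
```

Because the inner split never touches the test set, hyper-parameter choices cannot leak information from the data used for the final performance estimate.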

Afterward, we validate the active compounds through our docking and pose-processing pipeline. We validate each compound's binding mode in 3D and then evaluate each molecule with MMGBSA if the scores correlate with activity values.

What makes your approach stand out from the community? (<100 words)

Our approach is robust, fully automated, and optimized through years of application to real use cases. It can be applied even to small datasets. Combined with our structure-based pipeline and expertise, we believe our approach is very competitive with other state-of-the-art methods.

Method Name
AutoML
Commercial software packages used

Dock6

Free software packages used
  • Hyperopt
  • Optuna
  • scikit-learn
  • LightGBM
  • Torch
  • RDKit
  • AmberTools
  • GROMACS
Hit Optimization Methods
Method type (check all that apply)
De novo design
Deep learning
Free energy perturbation
High-throughput docking
Machine learning
Physics-based
Hybrid of the above
We designed a pipeline that uses all of these approaches to optimize the results to the fullest, using all available data.
Description of your approach (min 200 and max 800 words)

We use an in-house generative model that can generate chemically valid molecules of interest. This general generative model is applied with different constraints and methodologies as well as different optimization protocols. We use our proprietary retrosynthesis routes predictor to predict synthetic routes for the generated compounds.

We also use a generator of synthesis routes. From a provided initial molecular fragment, the model selects which commercial compounds should react with the fragment. The goal is to optimize the final molecules generated toward a given scoring function. The generator comprises a reaction predictor and a building block selector (commercial compounds). The generator can select commercial compounds from a dataset of over 1 million compounds.

This method relies on already known molecules and can optimize either a part of the molecule or a core (it can also be used for fragment linking). The second approach that we have developed, called fine-tuning, can learn to generate similar compounds from a small dataset of selected molecules.

 

Fine-tuning:

This method tunes a sequence-based generative model for molecular de novo design that, through reinforcement learning, can learn to generate structures with specified desirable properties. It can be pre-trained on a large molecule database. The reinforcement reward is an aggregation of several objectives.
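One common way to aggregate several objectives into a single reward is a weighted geometric mean, which forces the generator to satisfy every objective at once; the objective names, scores, and weights below are invented for illustration, not the actual reward used here.

```python
def reward(scores, weights):
    # Weighted geometric mean of per-objective scores, each scaled to [0, 1].
    # A near-zero score on any single objective drags the whole reward down.
    total = sum(weights)
    r = 1.0
    for s, w in zip(scores, weights):
        r *= max(s, 1e-6) ** (w / total)
    return r

# Invented objectives: predicted docking score, drug-likeness, synthesizability.
r = reward([0.8, 0.9, 0.5], [2.0, 1.0, 1.0])
print(round(r, 3))
```

A weighted sum would let one strong objective mask a failing one; the geometric mean avoids that failure mode.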

MMGBSA calculations will be performed for the most promising compounds.

If needed, we will apply our pipeline for Thermodynamic Integration. To justify such heavy calculations, the selected compounds must meet several criteria:

  • A common core
  • One of the compounds must have a validated experimental value
  • The changes between molecules must remain small
  • The input conformation must be well aligned
  • Each molecule must have at least one “connection/edge” with one other molecule to compute the Relative Binding Free Energy (RBFE)
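The last criterion amounts to requiring a connected perturbation network. A minimal connectivity check might look like this (molecule names and edges are invented examples):

```python
from collections import defaultdict

def connected_to_reference(edges, molecules, reference):
    # Every compound must be reachable from the experimentally validated
    # reference through perturbation edges, otherwise its RBFE cannot be
    # placed on the same scale as the known value.
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, stack = {reference}, [reference]
    while stack:
        node = stack.pop()
        for nxt in graph[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return set(molecules) <= seen

mols = ["ref", "m1", "m2", "m3"]                     # invented series
edges = [("ref", "m1"), ("m1", "m2"), ("m2", "m3")]  # invented RBFE edges
print(connected_to_reference(edges, mols, "ref"))    # True
```

Tools such as Lomap build these perturbation maps automatically, subject to the same kind of connectivity constraint.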

Thermodynamic integration (TI) is a computational method used to calculate the relative binding free energies of compounds in drug discovery projects. The method introduces a coupling parameter, usually written λ, that smoothly transforms the system from one state (λ = 0) to another (λ = 1). In the context of binding free energy calculations, TI involves simulating a series of systems at intermediate values of λ in which the ligand-receptor interaction is gradually switched. The ensemble average of the derivative of the system's energy with respect to λ, integrated over the path, gives the free energy difference between the two end states.

The basic idea of TI is thus to calculate the free energy difference between two states using a path that connects them. In the case of ligand-receptor binding, the two states are the unbound and bound states of the protein and ligand. The path connecting them is defined by a set of intermediate λ windows in which the interaction energy between the protein and ligand is gradually changed, and the free energy difference is obtained by integrating the averaged energy derivative along this path.
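In equation form, TI evaluates:

```latex
\Delta G = \int_{0}^{1} \left\langle \frac{\partial H(\lambda)}{\partial \lambda} \right\rangle_{\lambda} \, d\lambda
```

where H(λ) is the λ-dependent Hamiltonian and ⟨·⟩_λ denotes an ensemble average from the simulation at that λ window; in practice the integral is approximated numerically over a discrete set of windows, e.g. by quadrature.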

The method can be applied with both implicit and explicit solvent models, and related estimators can be used on the same simulation data, such as free energy perturbation (FEP) or the Bennett acceptance ratio (BAR) method. These approaches are particularly useful for RBFE calculations, where one wants to compare the binding of different ligands to the same receptor or the binding of the same ligand to different receptors.

Method Name
Hit optimization protocol
Commercial software packages used

Dock6

Free software packages used
  • Torch
  • TorchServe
  • Open Babel
  • RDKit
  • AmberTools
  • GROMACS
  • Sander
  • Lomap
  • pmx
