Computational methods

Hit Identification

Method Name

AIsembleDD

Hybrid of the above

Ensemble of diverse DL methods, including post-processing via docking and FEP

Description of your approach (min 200 and max 800 words)

GPCRs are the largest class of drug targets, yet the identification of novel inhibitors is hampered by their complex mechanism of action. In this project, we aim to identify novel MCHR1 antagonists by building on the success of deep learning (DL) models in drug design.

We will develop and orchestrate an ensemble of DL methods for binding affinity prediction, building on our groundwork exemplified in the recent “TeachOpenCADD goes DL” manuscript [Backenköhler et al., 2023].

We plan to include ligand-based models focusing on different ligand representations: (i) a transformer model using string-based ligand representations, (ii) a message-passing neural network (MPNN) where ligands are represented as 2D graphs, and (iii) an equivariant GNN (EGNN) where the ligand-graph structure incorporates 3D information. Additionally, we plan to include models that use protein as well as ligand information, such as (i) a drug-target interaction (DTI) prediction model where protein and ligand are encoded separately, and (ii) a diffusion-based model that generates additional drug candidates. Since different architectures and representations have different advantages, combining them with ensemble techniques lets us exploit their complementary strengths.

With our approach, we plan to rank compounds from the Enamine REAL Space and apply consensus docking as well as MD and FEP simulations for further filtering.

Modeling details: We explored the proposed DL methods in our work “TeachOpenCADD goes DL” [Backenköhler et al., 2023], where we presented architectures based on different ligand representations, mainly string-based and graph-based (2D and 3D). Since language models, for example those based on transformers, can capture long-range relations between tokens in sentences, we can apply them to string-based representations of molecules, such as SMILES, to infer a latent representation of a molecule. Molecular graphs are another representation, where nodes and edges correspond to atoms and covalent bonds, respectively. To process such molecular graphs, we will explore strong GNN baselines (e.g. Kipf et al., 2016; Xu et al., 2018) combined with expressivity add-ins such as virtual nodes (Gilmer et al., 2017), graph positional encodings and graph attention (Rampasek et al., 2023). If additional 3D information (e.g. in silico ligand conformers) is to be incorporated, we intend to use GNNs that are invariant to rotations and translations of the entire ligand but can still make use of its conformational information. For this purpose, we will use the EGNN by Satorras et al. (2021).
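The invariance requirement can be illustrated with a small sketch. It is not the EGNN itself: it only checks that interatomic distances, the kind of geometric quantity such models build their messages on, are unchanged when a conformer is rotated and translated as a whole. The coordinates and helper names below are hypothetical and chosen purely for illustration.

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def translate(points, shift):
    """Shift every point by the same 3D vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in points]


def pairwise_distances(points):
    """All interatomic distances, in a fixed (i < j) pair order."""
    n = len(points)
    return [
        math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


# Hypothetical toy conformer: four atom positions in angstroms.
conformer = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.3), (0.2, 2.0, -0.8)]

# Apply a rigid-body motion to the whole ligand.
moved = translate(rotate_z(conformer, 0.7), (3.0, -1.0, 2.5))

original_d = pairwise_distances(conformer)
moved_d = pairwise_distances(moved)

# Distances agree up to floating-point error, although the coordinates differ.
is_invariant = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original_d, moved_d)
)
```

A model whose predictions depend on coordinates only through such quantities gives the same affinity estimate regardless of how the conformer happens to be placed in space.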

The DTI model will have two encoders, one for the ligand and one for the target. Their embeddings will be input into a discriminative model. This Y-shaped architecture is widely used in drug-target interaction (DTI) prediction (Öztürk et al., 2018; Nguyen et al., 2021). We will rely on a protein language model for the protein encoder and combine it with one of the ligand-based models explained above.
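The data flow of such a Y-shaped model can be sketched in a few lines. This is a structural illustration only: the two encoders are toy composition counters standing in for the learned ligand model and the protein language model, the weights are untrained placeholders, and the sequence string is an arbitrary example rather than the MCHR1 sequence. All names are hypothetical.

```python
LIGAND_ALPHABET = "CNOcno=()1"               # assumed toy SMILES alphabet
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"    # the 20 standard amino acids


def encode_ligand(smiles):
    """Toy ligand branch: normalized character counts of the SMILES string."""
    n = max(len(smiles), 1)
    return [smiles.count(ch) / n for ch in LIGAND_ALPHABET]


def encode_protein(sequence):
    """Toy protein branch: amino-acid composition of the sequence."""
    n = max(len(sequence), 1)
    return [sequence.count(aa) / n for aa in PROTEIN_ALPHABET]


def predict_affinity(smiles, sequence, weights, bias=0.0):
    """Encode both inputs separately, concatenate, apply a linear head."""
    joint = encode_ligand(smiles) + encode_protein(sequence)
    return bias + sum(w * x for w, x in zip(weights, joint))


# Untrained placeholder weights, one per joint-embedding dimension.
weights = [0.1] * (len(LIGAND_ALPHABET) + len(PROTEIN_ALPHABET))

# Arbitrary example inputs (ethanol and a made-up peptide fragment).
score = predict_affinity("CCO", "ACDEFGHIK", weights)
```

In the planned model, each branch would be a trained network and the linear head would be replaced by the discriminative model, but the separation into two encoders feeding one predictor is the same.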

We will also employ diffusion models to sample from the extensive chemical space of the Enamine REAL database. We can combine this with a scoring model (e.g., GNN-based) that guides and selects generated ligand candidates based on the target MCHR1. We expect this to yield information complementary to that obtained by applying the scoring model directly to the REAL database.

Training the models: All models described above will be trained on the provided MCHR1-specific bioactivity data and evaluated on a hold-out set that will be created using DataSAIL [Joeres et al., 2023] and post-processed to eliminate any potential data leakage. We will pre-train the DTI model on the BindingDB dataset (general drug-protein interaction data) and subsequently on the GLASS dataset (GPCR-specific interaction data) before fine-tuning it on the MCHR1-specific data. For ligand-centric models, we will skip any supervised pre-training and train directly on the MCHR1-specific data, with the exception of the chemical SMILES language model, which has been pre-trained on larger unlabelled datasets.
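The idea behind a leakage-aware hold-out set can be sketched as a group-wise split, in which closely related compounds never end up on both sides. The sketch below does not call DataSAIL and does not reproduce its algorithm; it assumes each compound already carries a cluster label (for example a scaffold identifier), and the records and function names are hypothetical.

```python
import random


def group_split(items, group_of, test_fraction=0.2, seed=0):
    """Assign whole groups to the hold-out set until it reaches test_fraction."""
    groups = {}
    for item in items:
        groups.setdefault(group_of(item), []).append(item)

    # Shuffle group keys reproducibly so the split does not depend on input order.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(items)
    train, test = [], []
    for key in keys:
        # A group is never divided: it goes entirely to one side.
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test


# Hypothetical records: (compound id, cluster label).
records = [(f"cpd{i}", f"cluster{i % 7}") for i in range(70)]

train, test = group_split(records, group_of=lambda r: r[1])
train_groups = {r[1] for r in train}
test_groups = {r[1] for r in test}
```

Because groups are kept whole, the hold-out set may slightly exceed the requested fraction; the benefit is that no cluster contributes to both training and evaluation.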

Application of the models, ensembling, selection and post-processing: We will use the ensemble as trained above to find potential MCHR1 antagonists in the REAL Diversity Set of ~48M molecules. Using this subset of the REAL database allows for a faster search of the immense REAL chemical space, with the potential to add a more local search once promising chemical scaffolds are identified.

The models' predictions will be combined using ensembling. The ensemble model will be selected and fitted using model predictions on cross-validation folds of the initial dataset. This way, we can leverage the strengths of different models and achieve more reliable predictions. As a by-product, we will gain insights into the strengths and weaknesses of diverse predictive methods in the GPCR domain.
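One simple way to fit an ensemble on cross-validation predictions is to choose blending weights that minimize the error on out-of-fold predictions. The sketch below does this for two models with a grid search over a single convex weight. The blending scheme, the numbers and the function names are assumptions for illustration; the final ensemble may use a different combiner.

```python
def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def fit_blend_weight(oof_a, oof_b, target, steps=100):
    """Grid-search the weight w in [0, 1] minimizing MSE of w*a + (1-w)*b."""
    best_w, best_err = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        blend = [w * a + (1 - w) * b for a, b in zip(oof_a, oof_b)]
        err = mse(blend, target)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Hypothetical measured affinities and out-of-fold predictions of two models.
measured = [6.0, 7.5, 5.2, 8.1, 6.8]
oof_model_a = [6.4, 7.1, 5.0, 7.6, 7.2]
oof_model_b = [5.5, 7.9, 5.6, 8.5, 6.3]

weight_a, blend_error = fit_blend_weight(oof_model_a, oof_model_b, measured)
```

Since the grid includes w = 0 and w = 1, the fitted blend is never worse on the out-of-fold data than either model alone. Fitting on out-of-fold predictions rather than training-set predictions keeps the weights from rewarding models that merely memorize their training data.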

Once we have ranked potential compounds, we will apply a consensus docking procedure using a predicted MCHR1 structure. We will first refine and optimize the structure using a mixture of MD simulations and subpocket sampling. We will then conduct a re-docking study to validate our selection of docking algorithms and scoring functions based on the data available in the GPCR_DB and the provided list of active compounds. Decoy generation will be used to help validate our chosen docking pipeline. The final pipeline will then be used to prioritize compounds identified as a result of our previous ensembling procedure.
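A consensus over several docking programs can be formed in different ways. As a minimal sketch, assuming mean-rank aggregation (one common choice, not necessarily the scheme we will finally adopt), each program ranks the compounds by its own score and the ranks are averaged. The program labels, compound ids and scores below are hypothetical, and lower scores are assumed to be better.

```python
def ranks(scores, lower_is_better=True):
    """Map compound -> rank (1 = best) for one scoring method."""
    ordered = sorted(scores, key=scores.get, reverse=not lower_is_better)
    return {compound: i + 1 for i, compound in enumerate(ordered)}


def consensus_rank(score_tables):
    """Average per-method ranks and return compounds ordered by mean rank."""
    per_method = [ranks(table) for table in score_tables.values()]
    compounds = list(per_method[0])
    mean_rank = {
        c: sum(r[c] for r in per_method) / len(per_method) for c in compounds
    }
    return sorted(compounds, key=lambda c: mean_rank[c]), mean_rank


# Hypothetical docking scores from three programs (lower = better).
score_tables = {
    "dock_a": {"cpd1": -9.1, "cpd2": -7.4, "cpd3": -8.3, "cpd4": -6.0},
    "dock_b": {"cpd1": -8.0, "cpd2": -8.6, "cpd3": -7.9, "cpd4": -5.5},
    "dock_c": {"cpd1": -10.2, "cpd2": -7.0, "cpd3": -9.5, "cpd4": -7.8},
}

order, mean_rank = consensus_rank(score_tables)
```

Working with ranks instead of raw scores avoids having to put scoring functions with different units and ranges on a common scale.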

Validation of compounds: Binding affinities of the most promising compounds will be tested using all-atom MD simulations and alchemical absolute binding free energy calculations.
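Computed absolute binding free energies can be compared with measured affinities through the standard relation ΔG = RT ln(Kd / c°), with c° = 1 M. The sketch below converts between the two. The input value of -10 kcal/mol is a hypothetical example, not a result, and the function names are illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dg_to_kd(delta_g_kcal, temperature=298.15):
    """Dissociation constant in mol/L from a standard binding free energy."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature))


def kd_to_dg(kd_molar, temperature=298.15):
    """Standard binding free energy in kcal/mol from Kd in mol/L."""
    return R_KCAL * temperature * math.log(kd_molar)


# Hypothetical computed binding free energy of -10 kcal/mol at 298.15 K.
kd = dg_to_kd(-10.0)
kd_nanomolar = kd * 1e9
```

At room temperature, a binding free energy of about -10 kcal/mol corresponds to a Kd in the tens-of-nanomolar range, and each additional 1.4 kcal/mol changes Kd by roughly a factor of ten.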

What makes your approach stand out from the community? (<100 words)

The approach's novelty lies in combining several established deep learning architectures into one pipeline to find hits for a specific protein target. Because of its end-to-end trainability, the model can be reused for other protein targets by simply retraining it on appropriate datasets. In addition, the pre-selected compounds are further validated using classical approaches such as docking and FEP calculations. Our team combines expertise from several groups on campus (CADD, computer science, language models, biophysics and medicinal chemistry), allowing us to approach the problem from diverse angles.

Free software packages used

Python

TeachOpenCADD

HuggingFace

Gypsum-DL

GNINA, SMINA, PLANTS, QVINA

RFScoreVS, RTMScore, KORP-PL, ConvexPLR, SCORCH

RDKit

Enamine REAL Diversity Set

GROMACS

Challenge #5