Challenge #5 – COMPUTATIONAL METHODS

Here is a list of all computational methods used for hit identification in CACHE Challenge #5. Click on the Description for more details. Some participants preferred not to release their publications to stay anonymous at this time.

Description

Method name

Commercial software

Free software

AlphaFold with MSA subsampling and possibly alternative models will be used to generate an ensemble of protein conformations. Selected conformations will be simulated with molecular dynamics to relax the system. Simulations will be analyzed to identify druggable pockets with some consistency to published data (e.g.

GNINA FTW

None.

GNINA, AMBER (partially free), OpenFold/LocalColabFold

We will use structure-based ultra-large virtual screenings using VirtualFlow 2.0 [Gorgulla 2023]. The procedure will consist of four steps.

VirtualFlow/Ultra-Large Virtual Screens

Maestro (protein preparation)

VirtualFlow, AutoDock Vina, QuickVina, GWOVina

In the first stage of virtual ligand screening, we will apply a structure-based, GPCR-specific QSAR model trained on GPCR-ligand bound complexes.

Tetra-d

Tetra-d

ICM-Pro

graphLambda, GROMACS, RDKit, DeepChem, SMINA, GNINA, TensorFlow, PyTorch

Our approach follows multiple stages that gradually funnel massive ligand libraries into hits, leads, and optimized leads. The earlier stages use data-driven methods and the later stages use principle/physics-driven methods, as follows.

DeepAffinity, CPAC, AutoDock Vina, NAMD

We will employ a structure-guided drug discovery approach based on a unique molecular generative model (SAGE) recently developed in our lab. This generative model is specifically tuned to produce Enamine REAL Space ligands targeting the 3D structure of an input protein binding pocket. Compared to traditional virtual screening, this approach offers unparalleled speed, enabling us to rapidly sample from the entire Enamine library space (30 billion compounds) in a structure-guided way.

SAGE

Schrödinger's Prime, Glide, Maestro

AlphaFold 2, Amber (MD simulations), MDAnalysis, PENSA, e3fp

STEP 1: Protein Structure Prediction

The first part of the solution is predicting the protein structure, using flow-matching methods to generate a protein ensemble prediction. Molecular dynamics simulation will be performed on the predicted structure to ensure stability. Binding sites will be identified using our in-house binding site prediction algorithm, which is based on a few geometric deep-learning methods.

STEP 2: Training/Fine-tuning different AI/ML models

Virtual Screening with QSAR, and binding affinity models with active learning

Boltchem

RDKit, FABind, AutoDock Vina, OpenMM, DiffDock, GROMACS

GPCRs are the largest class of drug targets, yet the identification of novel inhibitors is hampered by their complex mechanism of action. In this project, we aim to identify novel MCHR1 antagonists by building on the great success of deep learning (DL) models in drug design.

AIsembleDD

Python, TeachOpenCADD, HuggingFace, Gypsum-DL, GNINA, SMINA, PLANTS, QVINA, RFScoreVS, RTMScore, KORP-PL, ConvexPLR, SCORCH, RDKit, Enamine REAL Diversity Set, GROMACS

1. Primary screening with an evolutionary chemical binding similarity model

ECBS

RDKit, ECBS, DiffDock, AlphaFold

The approach will be a combination of machine learning, molecular docking, and molecular dynamics simulation.

Stage 1

We will perform a molecular dynamics simulation with the AlphaFold homology model of MCHR1. We will use the resulting ensemble of conformations to proceed with ensemble docking.
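Docking against several receptor conformations produces one score per ligand per conformation, and these are commonly collapsed into a single ranking by keeping each ligand's best score. The sketch below illustrates that aggregation step only. It is not the participants' implementation: the function name, the ligand and frame labels, and the "lower score is better" convention are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def best_score_per_ligand(
    scores: Dict[str, Dict[str, float]],
) -> List[Tuple[str, str, float]]:
    """Collapse per-conformation docking scores into one ranking.

    scores[ligand][conformation] holds a docking score, where a lower
    (more negative) value is assumed to be better.
    Returns (ligand, best conformation, best score) tuples, best first.
    """
    ranked = []
    for ligand, per_conformation in scores.items():
        # Keep the conformation in which this ligand scored best.
        conformation, score = min(per_conformation.items(), key=lambda kv: kv[1])
        ranked.append((ligand, conformation, score))
    ranked.sort(key=lambda row: row[2])
    return ranked


# Hypothetical scores for two ligands docked into two MD frames.
example_scores = {
    "lig_A": {"frame_1": -7.2, "frame_2": -8.5},
    "lig_B": {"frame_1": -9.1, "frame_2": -6.0},
}
ranking = best_score_per_ligand(example_scores)
```

Taking the minimum favors ligands that fit at least one conformation well; averaging across conformations is an alternative when consistency across the ensemble matters more.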

Stage 2

A complex approach utilizing structure-based and ligand-based strategies powered by ML

ICM-Pro is provided by MolSoft.

RDKit, KNIME

Our goal through this competition is to validate whether our enhanced hit-finding workflow demonstrates the anticipated efficacy compared to our existing workflow (CACHE challenge #4). This workflow integrates various in silico drug development techniques, from target protein structure modeling to ultra-high-throughput virtual screening and de novo design using generative models, while maintaining simplicity.

SNU-Dock

AutoDock-GPU, Amber

We are convinced that deep/machine learning applications to the drug discovery problem are only as good as the data they are trained on, hence our emphasis is primarily on a robust and stratified data strategy to avoid overfitting on narrow data regimes and instead promote generalization capabilities of our framework.

Orchestdrug

None.

We make use of standard open-source cheminformatics software such as RDKit (https://www.rdkit.org/) and pretrained protein language models such as ESM-2 (https://github.com/facebookresearch/esm). Furthermore, our approach is based on previously published models: ConPLex (https://github.com/samsledje/ConPLex), DrugBAN (https://github.com/peizhenbai/DrugBAN), TransformerCPI2.0 (https://github.com/lifanchen-simm/transformerCPI2.0), and PSICHIC (https://github.com/huankoh/PSICHIC). All models are implemented in PyTorch. Data collection, cleaning, model training, and evaluation are performed in Python, making use of web clients from PubChem and ChEMBL and standard data science packages such as pandas and NumPy.

Our approach combines the expertise of the Kozakov Lab at Stony Brook and the Tropsha Lab at UNC. For this specific challenge, where a substantial number of molecules active against the MCHR1 target are already known, we will use both structure-based (ML-accelerated docking) and ligand-based (QSAR) methods developed in our laboratories and published in the open literature; we will not use any commercial software.

Frag2Hits

None

FTMap server (https://ftmap.bu.edu/), HIDDEN GEM (https://github.com/molecularmodelinglab/HiDDEN-GEM), RDKit

This project will begin with a ligand-based approach by clustering the known ligands to identify groups of compounds with potentially similar interaction profiles. This will be done by generating Morgan fingerprints and calculating their Tanimoto distances, and by clustering the compounds in LigandScout by 3D pharmacophore similarity. The clusters will then be analyzed using LigandScout to generate ligand-based 3D pharmacophores.
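The fingerprint clustering step can be sketched in a few lines. This is a minimal illustration, not the participants' workflow: it assumes each Morgan fingerprint has already been computed (for example with RDKit) and is represented as a set of "on" bit indices, and it uses a simple greedy scheme similar in spirit to Butina clustering. The function names, the toy fingerprints, and the 0.35 cutoff are hypothetical.

```python
from typing import FrozenSet, List, Sequence

Fingerprint = FrozenSet[int]  # assumed representation: indices of the set bits


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """Return 1 - |a & b| / |a | b|; two empty fingerprints count as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def cluster_by_distance(fps: Sequence[Fingerprint], cutoff: float) -> List[List[int]]:
    """Greedy clustering: the unassigned compound with the most unassigned
    neighbours within `cutoff` becomes a cluster centre, and it takes those
    neighbours with it. Repeats until every compound is assigned."""
    n = len(fps)
    neighbours = [
        {j for j in range(n) if j != i and tanimoto_distance(fps[i], fps[j]) <= cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters: List[List[int]] = []
    while unassigned:
        # Ties are broken in favour of the lowest index.
        centre = max(sorted(unassigned), key=lambda i: len(neighbours[i] & unassigned))
        members = [centre] + sorted(neighbours[centre] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters


# Toy fingerprints: three related compounds and two from another series.
example_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11}),
    frozenset({10, 11, 12}),
]
clusters = cluster_by_distance(example_fps, cutoff=0.35)
```

Each inner list starts with the index of the cluster centre. The pairwise comparison is quadratic in the number of compounds, which is acceptable for a set of known ligands but not for a screening library.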

Ligand-guided 3D pharmacophore modeling

InteLigand LigandScout, OpenEye ROCS, CCG MOE, CCDC GOLD

OpenMMDL, OpenMM, PyRod, AlphaFold / alphafold-multistate, RDKit

1. Protein structure determination

Deep learning, diffusion generative model, molecular dynamics, free energy methods

GPCR-I-TASSER, RosettaLigandEnsemble, DiffDock, PoseBusters

In this CACHE challenge, we will prioritize implementing Active Learning (AL) into our workflows. Historically, we, and our software, have been limited by smaller compound libraries due to the compute-intensive framework of our virtual high throughput screening. In previous CACHE challenges, we have also identified a liability in ultimately considering smaller libraries.
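The general idea behind active learning in virtual screening is to score only a small, iteratively chosen part of the library with the expensive method and let a cheap surrogate model decide what to score next. The sketch below is a toy illustration of that loop, not the participants' in-house package. It assumes each compound is reduced to a single numeric descriptor, uses a 1-nearest-neighbour surrogate, and treats lower scores as better; all names and the example scoring function are hypothetical.

```python
import random
from typing import Callable, Dict, Sequence, Tuple


def predict(x: float, labelled: Dict[float, float]) -> Tuple[float, float]:
    """1-nearest-neighbour surrogate: return the score of the closest
    already-scored compound and the descriptor distance to it."""
    nearest = min(labelled, key=lambda known: abs(known - x))
    return labelled[nearest], abs(nearest - x)


def active_learning_screen(
    library: Sequence[float],
    oracle: Callable[[float], float],
    n_seed: int,
    batch_size: int,
    n_rounds: int,
    seed: int = 0,
) -> Dict[float, float]:
    """Score a random seed set, then repeatedly score the batch that the
    surrogate predicts to be best. Returns {descriptor: oracle score}."""
    rng = random.Random(seed)
    labelled = {x: oracle(x) for x in rng.sample(list(library), n_seed)}
    for _ in range(n_rounds):
        pool = [x for x in library if x not in labelled]
        if not pool:
            break
        # Greedy acquisition: best predicted score first, ties broken by
        # proximity to the compound the prediction was taken from.
        pool.sort(key=lambda x: predict(x, labelled))
        for x in pool[:batch_size]:
            labelled[x] = oracle(x)  # the only expensive calls in the loop
    return labelled


# Toy library of 100 descriptors and a stand-in for a docking score.
toy_library = [i / 10 for i in range(100)]


def toy_oracle(x: float) -> float:
    return (x - 7.0) ** 2 - 10.0


scored = active_learning_screen(toy_library, toy_oracle, n_seed=5, batch_size=5, n_rounds=5)
```

Here only 30 of the 100 compounds are ever passed to the expensive function. A purely greedy selection like this one can stay near the best seed compound, so practical workflows usually add an exploration term or an uncertainty estimate.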

Multi-stage hit identification workflow (Includes Active Learning and Pharmacophore modelling)

Forecaster Suite, In-house developed packages (Active learning, and Pharmacophore modelling)

RDKit, DeepChem, TensorFlow, PyTorch, GROMACS

We will employ a comprehensive computational protocol to enable the discovery and optimization of novel lead compounds for melanin-concentrating hormone receptor 1 (MCHR1). Since there is no existing crystal structure, we will execute a workflow for target analysis, specifically focusing on target validation and identification. This step involves analyzing structures generated through homology modeling, utilizing tools like AlphaFold or Schrödinger software.

Schrödinger, AMBER

AlphaFold, NAMD, Python

We have built a custom ligand-based model to predict the IC50 of ligands. It will be used to screen a large number of ligands in the Enamine database (1 billion small molecules). The model leverages transfer learning, using latent representations from previously published models trained on large datasets.

NYAN

NO

NYAN

A computer-implemented method for screening ligand candidates for a target protein. This is done through an in-house developed, integrated ensemble machine learning (ML) model for predicting binding affinity with very high speed and precision.

protein modeling and iScore

Desmond by Schrödinger

F-Pocket, D-Pocket, RDKit, AlphaFold2, RoseTTAFold, I-TASSER, YASARA

Decrypting orphan GPCR drug discovery via multitask learning. W. C. Huang, W. T. Lin, M. S. Hung, J. C. Lee and C. W. Tung. J Cheminform 2024, Vol. 16, Issue 1, Page 10. Accession Number: 38263092; PMCID: PMC10804799; DOI: 10.1186/s13321-024-00806-3

Multitask learning with high-throughput docking scoring assessment

Microsoft packages may be used in reporting or publication.

Programs were developed on the Ubuntu 20.04.3 operating system using Python version 3.7.11. The study used several Python packages, including numpy, pandas, matplotlib, beautifulsoup4, scikit-learn, bitarray, rdkit-pypi, torch, and AutoGluon v0.5.2, for data manipulation, visualization, machine learning, web scraping, and deep learning. In addition, AutoDock Vina was employed for docking against MCHR1.

We would like to start from a traditional ligand-based strategy, such as a QSAR model, and combine it with structure-based ranking. A QSAR model will be developed based on historical data of known hits, and Enamine REAL will be screened with it. The hit molecules will be docked, and poses will be refined with ML-accelerated QM (ML force fields) calculations and traditional MD simulations. If time permits, we will also attempt binding free energy (BFE) calculations.
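The first two stages of such a funnel (a cheap ligand-based filter followed by a more expensive structure-based ranking of the survivors) can be sketched as follows. This is an illustrative outline under stated assumptions, not the participants' code: both scoring functions are passed in as placeholders, a higher QSAR score and a lower docking score are assumed to be better, and all names and numbers are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def qsar_then_dock(
    compounds: Iterable[str],
    qsar_score: Callable[[str], float],
    dock_score: Callable[[str], float],
    qsar_keep: int,
    dock_keep: int,
) -> List[Tuple[str, float]]:
    """Two-stage funnel: keep the top `qsar_keep` compounds by QSAR score,
    dock only those, and return the best `dock_keep` by docking score."""
    # Stage 1: cheap ligand-based filter (higher predicted activity is better).
    survivors = sorted(compounds, key=qsar_score, reverse=True)[:qsar_keep]
    # Stage 2: expensive structure-based ranking (lower score is better).
    docked = sorted(((c, dock_score(c)) for c in survivors), key=lambda pair: pair[1])
    return docked[:dock_keep]


# Hypothetical precomputed scores standing in for real models.
toy_qsar = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.7}
toy_dock = {"a": -6.0, "b": -11.0, "c": -9.0, "d": -7.5}

hits = qsar_then_dock(
    toy_qsar,
    qsar_score=toy_qsar.__getitem__,
    dock_score=toy_dock.__getitem__,
    qsar_keep=3,
    dock_keep=2,
)
```

In this toy data, compound "b" has the best docking score but is never docked because the QSAR stage removes it. That is the trade-off of any funnel: later stages can only rank what earlier stages let through.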

N/A

ML: scikit-learn, XGBoost, RDKit, Python, PyTorch; ML & QM: Python, PyTorch, TorchANI, AIMNet2, Auto3D; MD: GROMACS, Amber, AmberTools; QM: ORCA

A combined ligand-, structure-, and interaction-based approach will be applied to identify novel ligands for MCHR1. In this combination, the structure-based component enables reaching far beyond the training set, while available structure-activity relationship (SAR) data help keep those false-positive-prone techniques in check.

FRASE-based hit-finding robot (FRASE-bot)

None

All components of our platform were developed in Python using public libraries such as RDKit, TensorFlow, pandas, and others. We use AutoDock Vina as the docking engine and GROMACS for MD simulations. We progressively share our code on the lab’s GitHub page (https://github.com/kireevlab) as soon as we find it robust enough (new models and algorithms are being posted regularly).

Our approach is a ligand-based method in which the most active compound will be used as a reference to find potential MCH antagonists. Our method partitions the reference structure into multiple fragments and compares each fragment against a library of the building blocks used to create the Enamine REAL library. The best building blocks are selected using the 3D hydrophobic profile derived from QM-based descriptors.

exaScreen

exaScreen (https://pharmacelera.com/exascreen/)

RDKit (https://www.rdkit.org/), rDock (https://rdock.github.io/)

Considering the availability of the two given sets of known MCHR1 binders, we will employ our already developed ML-enforced ligand-based virtual screening tool, PyRMD. At its core, this tool implements the Random Matrix Discriminant (RMD) machine learning (ML) algorithm, which has been demonstrated to stand out for its denoising capabilities.

PyRMD

none

PyRMD

In approaching the challenge of predicting novel MCHR1 antagonists, given the absence of available crystal structures, a multi-faceted strategy is essential. The first hurdle is to establish an accurate model of MCHR1, which can be achieved through either homology modeling or utilizing the structure from AlphaFold. To ensure robustness, both methods would be employed, and the resulting structures compared to gauge their accuracy relative to other G protein-coupled receptors (GPCRs).

Integrated Orthosteric Docking and Multiconformational Screening: A Comprehensive Approach for Predicting Novel MCHR1 Antagonists

Schrödinger Maestro

ZINC-22, SWISS-MODEL, AlphaFold