Computational methods

Hit Identification

Method type (check all that apply)

Deep learning

Description of your approach (min 200 and max 800 words)

We are convinced that machine and deep learning applications in drug discovery are only as good as the data they are trained on. Our emphasis is therefore primarily on a robust, stratified data strategy that avoids overfitting to narrow data regimes and instead promotes the generalization capabilities of our framework. This builds on lessons from our prior work on large-scale mutational data to predict protein properties and drug binding sites (https://www.nature.com/articles/s41588-019-0431-x, https://www.nature.com/articles/s41586-022-04586-4, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02091-3, https://www.biorxiv.org/content/10.1101/2023.10.27.564339v1).

Algorithms:

Our framework employs an ensemble of four sequence-based deep learning algorithms to identify drug-target interactions. What the algorithms have in common is that they predict drug-target interactions from small molecule and protein sequences only, with no need for 3D structural information on the protein or the protein-ligand complex. They differ in how they encode small molecules and protein sequences and how they decode their interactions, using convolutional, graph convolutional and attention-based approaches as well as pre-trained features from protein and chemical language models. This gives us varied assessments of any given drug-target interaction.
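One simple way to combine the assessments of several models is rank averaging. The following is a minimal sketch only, not our actual aggregation scheme: the function names, the model labels and the example scores are hypothetical, and it assumes each model outputs one raw score per compound, where a higher score means a stronger predicted interaction.

```python
from statistics import mean


def rank_normalize(scores):
    """Map raw scores to ranks in [0, 1] (1 = highest score); ties share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1])
    n = len(ordered)
    ranks = {}
    i = 0
    while i < n:
        # Find the end of the group of compounds tied with position i.
        j = i
        while j + 1 < n and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        avg_pos = (i + j) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = avg_pos / (n - 1) if n > 1 else 1.0
        i = j + 1
    return ranks


def ensemble_rank(model_scores):
    """model_scores: {model_name: {compound_id: raw_score}} -> {compound_id: mean rank}."""
    per_model = [rank_normalize(s) for s in model_scores.values()]
    # Only compounds scored by every model can be compared fairly.
    compounds = set.intersection(*(set(r) for r in per_model))
    return {c: mean(r[c] for r in per_model) for c in compounds}


# Hypothetical scores from two models that use different output scales.
example_scores = {
    "model_a": {"c1": 0.9, "c2": 0.1, "c3": 0.5},
    "model_b": {"c1": 2.0, "c2": -1.0, "c3": 3.0},
}
consensus = ensemble_rank(example_scores)
```

Working on ranks rather than raw scores avoids having to calibrate the output scales of the individual models against each other, at the cost of discarding information about score magnitudes.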

Data:

We have trained the ensemble of algorithms on a drug-target interaction dataset curated in-house specifically for hit identification scenarios. The dataset makes use of large amounts of published high-throughput screening (HTS) data and focuses on three properties: broad but balanced coverage of the chemical and protein space, skewed interaction distributions (at most weak binders, with a majority of non-binders), and minimal data leakage through appropriate data splitting strategies to ensure maximal generalization capabilities.
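To illustrate what a leakage-aware split can look like, the sketch below keeps all records that share a group in the same partition. It is an illustration under stated assumptions rather than our actual splitting procedure: the record layout, the group labels (for example protein sequence clusters or compound scaffolds) and the 20% test fraction are all hypothetical.

```python
import random
from collections import defaultdict


def group_split(records, group_of, test_fraction=0.2, seed=0):
    """Split records so that no group occurs in both the training and the test partition."""
    by_group = defaultdict(list)
    for record in records:
        by_group[group_of(record)].append(record)

    groups = sorted(by_group)              # deterministic order before shuffling
    random.Random(seed).shuffle(groups)

    target = test_fraction * len(records)  # approximate size of the test partition
    train, test = [], []
    for g in groups:
        # Whole groups go to the test partition until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(by_group[g])
    return train, test


# Hypothetical records: (compound_id, protein_cluster, binder_label).
example_records = [("cpd%d" % i, "cluster%d" % (i % 5), i % 2) for i in range(20)]
train_set, test_set = group_split(example_records, group_of=lambda r: r[1])
```

Because whole groups are assigned at once, the test partition only approximates the requested fraction when group sizes are uneven.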

Specifically for the MCHR1 antagonist prediction scenario in the CACHE Challenge, we will first evaluate our model on the available GPCR1 family and MCHR1 antagonist assay data and then fine-tune it on these data.

To summarize, our approach will leverage pre-existing knowledge about proteins, small molecule ligands and their interactions on several levels:

Using protein language models, we obtain informative feature representations derived from the evolutionary data in hundreds of millions of protein sequences.
Similarly, encoding small molecules through a pre-trained network yields meaningful representations.
We make use of a custom-made drug-target interaction training dataset, built from PubChem and ChEMBL and tailored towards hit identification.
We fine-tune the employed models on GPCR1 family and MCHR1 data to adapt them towards hit identification for MCHR1.

Hit identification for MCHR1 antagonists will then be performed by predicting MCHR1 interactions with compounds from a subsampled set of the Enamine REAL space of 38B compounds, followed by dense screening of analogues around the initial top hits.

Final compound selection will take into account predicted interaction strength, chemical diversity, and novelty compared to previously known MCHR1 ligands.
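A selection that balances predicted interaction strength, diversity and novelty can be sketched as a greedy filter. The example below is illustrative only and not our actual selection procedure: fingerprints are represented as plain Python sets of feature identifiers (in practice they could be derived with a cheminformatics toolkit such as RDKit), and the function names and both similarity thresholds are hypothetical values chosen for the example.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of feature ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_compounds(candidates, known_ligands, n_select,
                     max_sim_known=0.6, max_sim_picked=0.5):
    """candidates: list of (compound_id, predicted_score, fingerprint_set)."""
    picked = []
    # Visit candidates from the strongest to the weakest predicted interaction.
    for cid, score, fp in sorted(candidates, key=lambda c: c[1], reverse=True):
        if any(tanimoto(fp, k) > max_sim_known for k in known_ligands):
            continue  # too similar to a known ligand: not novel
        if any(tanimoto(fp, p[2]) > max_sim_picked for p in picked):
            continue  # too similar to an already selected compound: not diverse
        picked.append((cid, score, fp))
        if len(picked) == n_select:
            break
    return [cid for cid, _, _ in picked]


# Hypothetical fingerprints and scores.
example_known = [{1, 2, 3, 4}]
example_candidates = [
    ("a", 0.95, {1, 2, 3, 4, 5}),    # similarity 0.8 to the known ligand
    ("b", 0.90, {10, 11, 12, 13}),
    ("c", 0.85, {10, 11, 12, 14}),   # similarity 0.6 to "b"
    ("d", 0.80, {20, 21, 22}),
]
selection = select_compounds(example_candidates, example_known, n_select=2)
```

In this toy example, "a" is rejected for lack of novelty and "c" for lack of diversity, so the selection contains "b" and "d".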

What makes your approach stand out from the community? (<100 words)

To our knowledge, current machine learning approaches to sequence-based drug discovery each focus on an individual, specific way of encoding small molecule ligands and target protein sequences. Much less attention has been paid to preparing a training dataset suitable for hit identification. We expect that training a diverse set of models on such a dataset will allow us to score better than individual approaches, which sets us apart from current state-of-the-art algorithms in the field.

Method Name

Orchestdrug

Commercial software packages used

None.

Free software packages used

We make use of standard open-source cheminformatics software such as RDKit (https://www.rdkit.org/) and pre-trained protein language models such as ESM-2 (https://github.com/facebookresearch/esm). Furthermore, our approach is based on previously published models: ConPLex (https://github.com/samsledje/ConPLex), DrugBAN (https://github.com/peizhenbai/DrugBAN), TransformerCPI2.0 (https://github.com/lifanchen-simm/transformerCPI2.0) and PSICHIC (https://github.com/huankoh/PSICHIC). All models are implemented in PyTorch. Data collection, cleaning, model training and evaluation are performed in Python, making use of web clients for PubChem and ChEMBL and standard data science packages such as pandas and NumPy.

Relevant publications of previous uses by your group of this software/method

Our approach is proprietary / unpublished.

Challenge #5