We are convinced that machine learning and deep learning applications in drug discovery are only as good as the data they are trained on. Our emphasis is therefore on a robust, stratified data strategy that avoids overfitting to narrow data regimes and instead promotes the generalization capabilities of our framework. This is based on lessons from our prior work on large-scale mutational data to predict protein properties and drug binding sites (https://www.nature.com/articles/s41588-019-0431-x, https://www.nature.com/articles/s41586-022-04586-4, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02091-3, https://www.biorxiv.org/content/10.1101/2023.10.27.564339v1).
Algorithms:
Our framework employs an ensemble of four sequence-based deep learning algorithms to identify drug-target interactions. What the algorithms have in common is that they predict drug-target interactions from small molecule and protein sequences only, with no need for 3D structural information on the protein or the protein-ligand complex. They differ in how they encode small molecules and protein sequences and in how they decode their interactions, using convolutional, graph convolutional and attention-based approaches, as well as pre-trained features from protein and chemical language models. This provides us with varied assessments of any given drug-target interaction.
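As an illustration of how varied assessments from several models can be combined, the following minimal sketch averages rank-normalized scores per compound. The aggregation rule (rank averaging) and the function names `rank_normalize` and `ensemble_scores` are assumptions made for this example; the text above does not specify how the four models' outputs are merged.

```python
from statistics import mean


def rank_normalize(scores):
    """Map raw scores to ranks in [0, 1]; the highest score gets 1.0.

    Ties are broken by input order (sorted() is stable), which is a
    simplification chosen for this sketch.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    for position, idx in enumerate(order):
        ranks[idx] = position / (n - 1) if n > 1 else 0.0
    return ranks


def ensemble_scores(per_model_scores):
    """Average rank-normalized scores across models for each compound.

    per_model_scores: one list of scores per model, all in the same
    compound order. Rank normalization makes models with different
    output scales comparable before averaging.
    """
    normalized = [rank_normalize(s) for s in per_model_scores]
    return [mean(column) for column in zip(*normalized)]
```

Rank normalization is used here because models with different architectures often produce scores on different scales; a plain mean of raw scores would be an equally valid assumption if the outputs were calibrated.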
Data:
We have trained the ensemble of algorithms on a drug-target interaction dataset curated in-house specifically for hit identification scenarios, making use of large amounts of published high-throughput screening (HTS) data. The curation focuses on broad but balanced coverage of chemical and protein space, on skewed interaction distributions that reflect hit identification (a majority of non-binders and at most weak binders), and on minimal data leakage through appropriate data splitting strategies, to ensure maximal generalization capabilities.
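One common way to limit data leakage is to split by group, so that all records sharing a group (for example a chemical scaffold or a protein sequence cluster) land on the same side of the split. The sketch below shows that idea under stated assumptions: the record layout, the `group_key` field, and the function name `group_split` are hypothetical, and the actual splitting strategy used for the in-house dataset is not described in detail here.

```python
import random
from collections import defaultdict


def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records into train/test so that no group spans both sets.

    records:   list of dicts, each carrying a group label under group_key
               (e.g. a scaffold ID or protein cluster ID; assumed fields).
    Whole groups are assigned to the test set until it holds roughly
    test_fraction of all records; remaining groups go to train.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    group_ids = sorted(groups)  # deterministic order before shuffling
    random.Random(seed).shuffle(group_ids)

    target_test_size = test_fraction * len(records)
    train, test = [], []
    for gid in group_ids:
        bucket = test if len(test) < target_test_size else train
        bucket.extend(groups[gid])
    return train, test
```

Because whole groups are moved at once, the realized test fraction only approximates `test_fraction`, which is the usual trade-off of group-wise splitting.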
Specifically for the MCHR1 antagonist prediction scenario in the CACHE Challenge, we will first evaluate and then fine-tune our models on the available GPCR1 family and MCHR1 antagonist assay data.
To summarize, our approach will leverage pre-existing knowledge about proteins, small molecule ligands and their interactions on several levels:
- by using protein language models, we obtain informative feature representations derived from the evolutionary data in hundreds of millions of protein sequences
- by encoding small molecules through a pre-trained network, we similarly obtain meaningful representations
- by making use of a custom-made drug-target interaction training dataset, tailored towards hit identification and built from PubChem and ChEMBL
- by fine-tuning the employed models on GPCR1 family and MCHR1 data, we adapt them towards hit identification for MCHR1
Hit identification for MCHR1 antagonists will then be performed by predicting MCHR1 interactions with compounds from a subsampled set of the Enamine REAL space of 38B compounds, followed by dense screening of analogues around the initial top hits.
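The two-stage procedure can be sketched on a toy scale as follows. This is only an illustration under several assumptions: compounds are dicts with a hypothetical `id` and a fingerprint `fp` stored as a set of on-bit indices, `score_fn` stands in for the ensemble prediction, analogues are defined by a Tanimoto threshold, and the library fits in memory. None of these choices is stated in the text, and a space of 38B compounds would require a dedicated analogue search rather than the linear scan shown here.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def two_stage_screen(library, score_fn, sample_size, n_seeds,
                     analog_threshold=0.5, seed=0):
    """Score a random subsample, then screen analogues of the top hits.

    Stage 1: draw a random subsample and keep the n_seeds best-scoring
             compounds as seeds.
    Stage 2: collect all library compounds whose similarity to any seed
             is at least analog_threshold, and rank seeds plus analogues
             by predicted score.
    """
    rng = random.Random(seed)
    sample = rng.sample(library, min(sample_size, len(library)))
    seeds = sorted(sample, key=score_fn, reverse=True)[:n_seeds]
    seed_ids = {c["id"] for c in seeds}

    analogs = [
        c for c in library
        if c["id"] not in seed_ids
        and any(tanimoto(c["fp"], s["fp"]) >= analog_threshold for s in seeds)
    ]
    return sorted(seeds + analogs, key=score_fn, reverse=True)
```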
Final compound selection will take into account predicted interaction strength, chemical diversity, and novelty relative to previously known MCHR1 ligands.
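One simple way to balance these three criteria is a greedy pass over the score-ranked candidates that skips any compound too similar to a known ligand (novelty) or to an already selected compound (diversity). The sketch below assumes fingerprints stored as sets of on-bit indices and uses hypothetical similarity cut-offs (`max_sim_selected`, `max_sim_known`); the actual selection criteria and thresholds are not specified in the text.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def select_compounds(candidates, known_ligands, n_select,
                     max_sim_selected=0.6, max_sim_known=0.4):
    """Greedy selection by score with diversity and novelty filters.

    candidates:    list of dicts with "id", "score" and "fp" (assumed layout).
    known_ligands: fingerprints of previously known ligands.
    A candidate is skipped if it is too similar to any known ligand
    (novelty filter) or to any compound already selected (diversity filter).
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    selected = []
    for cand in ranked:
        fp = cand["fp"]
        if any(tanimoto(fp, known) > max_sim_known for known in known_ligands):
            continue  # not novel enough
        if any(tanimoto(fp, s["fp"]) > max_sim_selected for s in selected):
            continue  # too close to an already selected compound
        selected.append(cand)
        if len(selected) == n_select:
            break
    return selected
```

A greedy filter is easy to reason about but order-dependent: a high-scoring compound can block several slightly lower-scoring neighbours, so clustering-based selection would be a reasonable alternative.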