We developed a multi-scale, multi-task neural network to learn the interaction domains and binding affinities between compounds and proteins. The model takes graph representations of compounds and proteins as input. The compound is processed by a physics-driven graph neural network that integrates geometry and momentum information into the topological structure, while the protein is processed by a multi-scale graph neural network that connects surface, structure and sequence. Hierarchical pooling layers narrow the whole-protein information down to the pocket level; the pockets of each training protein are pre-defined as the top-predicted sites (including allosteric sites) from an ensemble of existing and self-developed tools. This strategy helps prune the model and avoids overfitting caused by oversized protein graphs. A multi-head attention mechanism captures the interactions between a compound and the pockets of a protein, with extra supervision from labels extracted from available high-quality co-crystal complex structures (PDBbind). A Siamese network is coupled to predict the binding affinity between a given compound and protein/pocket pair, using the previously extracted compound features and pocket features as well as the derived interaction features.
An industry-scale structure-activity relationship (SAR) dataset with compound structures and protein crystal structures or AlphaFold-predicted structures was cleaned for binding affinity learning. The training dataset contains more than 5 million SAR data points (mainly derived from ChEMBL and PubChem), covering 1252 protein targets and 1.3 million unique compounds. The whole model was trained end-to-end and has been validated on benchmark test sets.
The prediction process will focus on the WDR domain of LRRK2: we will select the compounds that our model predicts to bind to the WDR domain with high affinity.
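The compound-to-pocket attention step described above can be illustrated with a minimal sketch. It is not the model's actual implementation: it uses plain Python lists, omits the learned query/key/value projections a trained network would have, and the function names (`compound_pocket_attention`, `split_heads`) and toy vectors are hypothetical. It only shows how one compound embedding can be scored against several pocket embeddings, head by head, to yield per-pocket weights and an interaction feature.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vec, num_heads):
    """Cut one feature vector into num_heads equal-sized chunks."""
    size = len(vec) // num_heads
    return [vec[h * size:(h + 1) * size] for h in range(num_heads)]


def compound_pocket_attention(compound_vec, pocket_vecs, num_heads=2):
    """Scaled dot-product attention of one compound over several pockets.

    Returns (weights, interaction): weights[h][i] is the attention that
    head h places on pocket i; interaction is the concatenation of the
    per-head weighted sums of pocket features.
    Assumption: no learned projections, so keys and values are the raw
    pocket features.
    """
    if len(compound_vec) % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    q_heads = split_heads(compound_vec, num_heads)
    k_heads = [split_heads(p, num_heads) for p in pocket_vecs]
    head_dim = len(q_heads[0])

    weights, interaction = [], []
    for h in range(num_heads):
        # One score per pocket: dot(query, key) / sqrt(head_dim).
        scores = [
            sum(q * k for q, k in zip(q_heads[h], pocket[h])) / math.sqrt(head_dim)
            for pocket in k_heads
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of pocket features for this head.
        for d in range(head_dim):
            interaction.append(sum(w[i] * k_heads[i][h][d] for i in range(len(k_heads))))
    return weights, interaction


# Toy example: the first pocket matches the compound in both heads.
compound = [1.0, 0.0, 0.0, 1.0]
pockets = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
weights, interaction = compound_pocket_attention(compound, pockets)
```

In a setup like the one described, the per-pocket weights are the quantity that labels derived from co-crystal structures could supervise, and the interaction vector is the kind of feature that could be passed on to the affinity predictor.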
The compounds with the highest predicted affinity will be further filtered by predicted blood–brain barrier permeability score (relevant to Parkinson's disease, PD), solubility, structural diversity, and customized maximum common substructure (MCS) and drug-likeness filters to ensure the quality of the hits.
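The selection step above can be sketched as a filter followed by a greedy diversity pick. This is an illustrative outline only: the `Candidate` fields, the threshold values, and the use of Tanimoto similarity on fingerprint bit sets are assumptions made for the example, and the MCS and drug-likeness filters are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    affinity: float         # predicted affinity, higher is stronger (assumed scale)
    bbb_score: float        # predicted BBB permeability score in [0, 1] (assumed)
    log_s: float            # predicted aqueous solubility, logS (assumed)
    fingerprint: frozenset  # on-bit indices of a structural fingerprint (assumed)


def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_hits(candidates, n_hits, min_bbb=0.5, min_log_s=-5.0, max_similarity=0.6):
    """Filter by property thresholds, then greedily pick diverse top-affinity hits.

    All threshold defaults are hypothetical placeholders.
    """
    passed = [c for c in candidates if c.bbb_score >= min_bbb and c.log_s >= min_log_s]
    ranked = sorted(passed, key=lambda c: c.affinity, reverse=True)

    hits = []
    for cand in ranked:
        if len(hits) >= n_hits:
            break
        # Keep a candidate only if it is dissimilar to every hit chosen so far.
        if all(tanimoto(cand.fingerprint, h.fingerprint) <= max_similarity for h in hits):
            hits.append(cand)
    return hits


# Toy example with made-up values.
pool = [
    Candidate("a", 9.0, 0.9, -3.0, frozenset({1, 2, 3, 4})),
    Candidate("b", 8.8, 0.8, -3.5, frozenset({1, 2, 3, 4, 5})),  # near-duplicate of "a"
    Candidate("c", 8.5, 0.2, -2.0, frozenset({7, 8})),           # fails the BBB threshold
    Candidate("d", 7.0, 0.7, -4.0, frozenset({9, 10})),
]
selected = [c.name for c in select_hits(pool, n_hits=2)]
```

Ranking by affinity before the diversity pass means that, within each cluster of similar structures, the compound with the strongest predicted binding is the one retained.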
Python, PyTorch, RDKit, Biopython, P2Rank
1. Zheng, S.*, Li, Y., Chen, S., et al. (2020). Predicting drug–protein interaction using quasi-visual question answering system. Nature Machine Intelligence, 2(2), 134-140.
2. Zheng, S.*, Yan, X., Yang, Y., & Xu, J. (2019). Identifying structure–property relationships through SMILES syntax analysis with self-attention mechanism. Journal of Chemical Information and Modeling, 59(2), 914-923.
3. Song, Y., Zheng, S.*, Niu, Z., Fu, Z. H., Lu, Y., & Yang, Y. (2020). Communicative Representation Learning on Attributed Molecular Graphs. IJCAI (pp. 2831-2838).
4. Liu, Z., Huang, D., Zheng, S.*, et al. (2021). Deep learning enables discovery of highly potent anti-osteoporosis natural products. European Journal of Medicinal Chemistry, 210, 112982.
5. Yuan, Q., Chen, S., Rao, J., Zheng, S.*, Zhao, H., & Yang, Y. (2021). AlphaFold2-aware protein–DNA binding site prediction using graph transformer. Briefings in Bioinformatics.