Challenge #2
Application
HIT IDENTIFICATION
We developed a multi-scale, multi-task neural network that learns both binding poses and binding affinities between compounds and proteins. The model takes geometric graph representations of compounds and proteins as input. The compound is processed by a physics-driven graph neural network that integrates geometry and momentum information into the topological structure, while the protein is processed by a multi-scale graph neural network that connects surface, structure, and sequence. Hierarchical pooling layers narrow the whole-protein information down to the pocket level; the pockets of each training protein are pre-defined from the top-predicted sites (including RNA-binding and allosteric sites) produced by our self-developed tools[4]. This strategy prunes the model and helps avoid the overfitting caused by oversized protein graphs. A multi-head attention mechanism with trigonometry consistency captures the interactions between a compound and the pockets of a protein, with extra supervision from labels extracted from available high-quality co-crystal complex structures (PDBbind). A siamese network is coupled to predict the binding affinity for a given compound and protein/pocket pair, using the previously extracted compound features and pocket features as well as the derived interaction features.
An industry-scale structure-activity relationship (SAR) dataset, pairing compound structures with protein crystal structures or AlphaFold-predicted structures, was cleaned (with AstraZeneca) for binding affinity learning[2]. The training dataset contains more than 5 million SAR data points (mainly derived from ChEMBL and PubChem), covering 1252 protein targets and 1.3 million unique compounds. The whole model was trained end-to-end and has been validated on benchmark test sets.
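The idea of narrowing a whole protein down to the pocket level can be illustrated with a deliberately simplified sketch. In the model this is done by learned hierarchical pooling layers; the fixed distance cutoff below is only a stand-in, and the function name `crop_to_pocket`, the residue tuples, the pocket center, and the radius are all hypothetical values chosen for illustration.

```python
import math

def crop_to_pocket(residues, center, radius=15.0):
    """Keep residues whose coordinate lies within `radius` angstroms of a pocket center.

    `residues` is a list of (name, (x, y, z)) tuples. A fixed cutoff is an
    assumption here; it stands in for the learned pooling described in the text.
    """
    kept = []
    for name, coord in residues:
        if math.dist(coord, center) <= radius:
            kept.append((name, coord))
    return kept

# Toy protein: two residues near the (hypothetical) pocket center, one far away.
residues = [
    ("ALA1", (0.0, 0.0, 0.0)),
    ("GLY2", (3.8, 0.0, 0.0)),
    ("SER3", (30.0, 0.0, 0.0)),
]
pocket_center = (1.0, 0.0, 0.0)

pocket = crop_to_pocket(residues, pocket_center, radius=10.0)
print([name for name, _ in pocket])
```

Working on the cropped residue set rather than the full protein is what keeps the protein graph small, which is the motivation given above for pocket-level processing.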
The prediction process will focus on the RNA-binding site of the helicase NSP13: we will select compounds that our model predicts to bind the RNA-binding domain with high affinity. Ensemble predictions will be made across multiple states and crystal forms of NSP13. Thanks to the efficiency of the deep learning algorithm, we can screen the whole Enamine REAL libraries.
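One simple way to combine predictions across several NSP13 structures is to average the per-structure scores for each compound and rank by the average. The sketch below assumes this averaging scheme and assumes that higher scores mean stronger predicted binding; the structure labels, compound IDs, and score values are made up for illustration, and other aggregation rules (for example the maximum) would fit the same pattern.

```python
from statistics import mean

def ensemble_rank(scores_by_structure, top_k=2):
    """Average predicted affinities over structures and return the top_k compounds.

    `scores_by_structure` maps a structure label to {compound_id: score}.
    Only compounds scored against every structure are ranked.
    """
    per_structure = list(scores_by_structure.values())
    compounds = set.intersection(*(set(s) for s in per_structure))
    averaged = {c: mean(s[c] for s in per_structure) for c in compounds}
    # Higher score is assumed to mean stronger predicted binding.
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical scores for two conformational states.
scores_by_structure = {
    "state_A": {"cmpd_1": 7.0, "cmpd_2": 5.0, "cmpd_3": 6.0},
    "state_B": {"cmpd_1": 8.0, "cmpd_2": 6.0, "cmpd_3": 6.0},
}

ranked = ensemble_rank(scores_by_structure, top_k=2)
print(ranked)
```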
To reduce the false-positive rate, the compounds with the top predicted affinities will be further filtered by their predicted anti-coronavirus bioactivity (from a deep learning-driven QSAR model), solubility, diversity, and customized MCS/drug-likeness filters, ensuring the quality of the hits.
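The filtering cascade can be sketched as threshold filters followed by a greedy diversity pick. This is a minimal illustration, not the actual pipeline: the thresholds, field names, and toy fingerprints (stored as sets of on-bits) are assumptions, the greedy Tanimoto-based selection is one common choice for enforcing diversity, and the MCS/drug-likeness filters are omitted.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def select_hits(candidates, min_activity=0.5, min_logs=-4.0, max_sim=0.5):
    """Filter by predicted activity and solubility, then pick diverse hits greedily."""
    passed = [c for c in candidates
              if c["activity"] >= min_activity and c["logS"] >= min_logs]
    # Visit the strongest predicted binders first.
    passed.sort(key=lambda c: c["affinity"], reverse=True)
    hits = []
    for c in passed:
        # Keep a compound only if it is dissimilar to every hit chosen so far.
        if all(tanimoto(c["fp"], h["fp"]) <= max_sim for h in hits):
            hits.append(c)
    return hits

# Hypothetical candidates with made-up predictions and toy fingerprints.
candidates = [
    {"id": "cmpd_1", "affinity": 8.1, "activity": 0.9, "logS": -3.0, "fp": {1, 2, 3, 4}},
    {"id": "cmpd_2", "affinity": 7.9, "activity": 0.8, "logS": -3.5, "fp": {1, 2, 3, 5}},
    {"id": "cmpd_3", "affinity": 7.5, "activity": 0.2, "logS": -2.0, "fp": {7, 8, 9}},
    {"id": "cmpd_4", "affinity": 7.0, "activity": 0.7, "logS": -2.5, "fp": {10, 11, 12}},
]

hits = select_hits(candidates)
print([h["id"] for h in hits])
```

In this toy run, one compound is dropped for low predicted bioactivity and another for being too similar to a higher-ranked hit, which mirrors the roles of the bioactivity and diversity filters described above.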
Python, PyTorch, RDKit, Biopython, P2Rank
1. Zheng, S.*, Li, Y., Chen, S., et al. (2020). Predicting drug–protein interaction using quasi-visual question answering system. Nature Machine Intelligence, 2(2), 134-140.
2. Wang, P., Zheng, S.*, Jiang, Y., Li, C., Liu, J., Wen, C., ... & Yang, Y. (2022). Structure-aware multimodal deep learning for drug–protein interaction prediction. Journal of Chemical Information and Modeling, 62(5), 1308-1317.
3. Lu, W., Wu, Q., Zhang, J., Rao, J., Li, C., & Zheng, S.* (2022). TANKBind: Trigonometry-Aware Neural NetworKs for Drug-Protein Binding Structure Prediction. bioRxiv.
4. Sun, Z., Zheng, S.*, Zhao, H., Niu, Z., Lu, Y., Pan, Y., & Yang, Y. (2021). To improve the predictions of binding residues with DNA, RNA, carbohydrate, and peptide via multi-task deep neural networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
5. Liu, Z., Huang, D., Zheng, S.*, et al. (2021). Deep learning enables discovery of highly potent anti-osteoporosis natural products. European Journal of Medicinal Chemistry, 210, 112982.
6. Zheng, S.*, Yan, X., Yang, Y., & Xu, J. (2019). Identifying structure–property relationships through SMILES syntax analysis with self-attention mechanism. Journal of Chemical Information and Modeling, 59(2), 914-923.
7. Song, Y., Zheng, S.*, Niu, Z., Fu, Z. H., Lu, Y., & Yang, Y. (2020). Communicative Representation Learning on Attributed Molecular Graphs. IJCAI (pp. 2831-2838).