Computational methods

Hit Identification

Hybrid of the above

Physics-inspired deep learning method

Description of your approach (min 200 and max 800 words)

We developed a multi-scale, multi-task neural network that learns both the binding poses and the binding affinities of compound-protein pairs. The model takes geometric graph representations of compounds and proteins as input. Each compound is processed by a physics-driven graph neural network that integrates geometry and momentum information into the topological structure, while each protein is processed by a multi-scale graph neural network that connects surface, structure, and sequence. Hierarchical pooling layers narrow the whole-protein information down to the pocket level; the pockets of each training protein are pre-defined from the top-predicted sites (including RNA-binding and allosteric sites) produced by our self-developed tools[4]. This strategy helps prune the model and avoids overfitting caused by oversized protein graphs. A multi-head attention mechanism with trigonometric consistency captures the interactions between a compound and the pockets of a protein, with extra supervision from labels extracted from available high-quality co-crystal complex structures (PDBbind). A Siamese network is coupled to it to predict the binding affinity of a given compound and protein/pocket pair from the previously extracted compound features, pocket features, and derived interaction features. For binding-affinity learning, an industry-scale structure-activity relationship (SAR) dataset with compound structures and protein crystal structures or AlphaFold-predicted structures was cleaned (with AstraZeneca)[2]. The training dataset contains more than 5 million SAR data points (mainly derived from ChEMBL and PubChem), covering 1252 protein targets and 1.3 million unique compounds. The whole model was trained end-to-end and validated on benchmark test sets.
The prediction process will focus on the RNA-binding site of the helicase NSP13: we will select the compounds that our model predicts to bind the RNA-binding domain with high affinity. Ensemble predictions will be made across multiple states and crystal forms of NSP13. Because the deep learning algorithm is efficient, we can screen the whole Enamine REAL libraries.
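The ensemble step can be illustrated with a minimal sketch. It assumes that each NSP13 structure yields one predicted affinity per compound, that higher values mean stronger predicted binding, and that the consensus is a plain mean; the structure names, compound IDs, and scores are hypothetical, and the aggregation rule actually used in M-DPI is not specified here.

```python
from statistics import mean


def ensemble_scores(predictions):
    """Average per-structure predicted affinities for each compound.

    predictions maps a structure id to {compound id: predicted affinity}.
    Only compounds scored against every structure are kept (an assumption
    of this sketch, not a stated property of the method).
    """
    per_structure = list(predictions.values())
    shared = set(per_structure[0])
    for scores in per_structure[1:]:
        shared &= set(scores)
    return {cid: mean(scores[cid] for scores in per_structure) for cid in shared}


def top_k(scores, k):
    """Return the k compound ids with the highest consensus score."""
    # Ties are broken by compound id so the ranking is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:k]


# Hypothetical predictions for three NSP13 states / crystal forms.
predictions = {
    "apo": {"cmpd_A": 6.0, "cmpd_B": 7.5, "cmpd_C": 5.0},
    "rna_bound": {"cmpd_A": 7.0, "cmpd_B": 8.0, "cmpd_C": 5.5},
    "crystal_form_2": {"cmpd_A": 8.0, "cmpd_B": 8.5, "cmpd_C": 4.5},
}

consensus = ensemble_scores(predictions)
ranked = top_k(consensus, k=2)
print(ranked)
```

A mean rewards compounds that score well across all states; taking the minimum instead would be a stricter alternative that penalizes any state-specific weak prediction.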

To reduce the false-positive rate, the compounds with the top predicted affinities will be further filtered by their predicted anti-coronavirus bioactivity (from a deep learning-driven QSAR model), solubility, diversity, and customized MCS/drug-likeness filters, ensuring the quality of the hits.
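One possible shape for this filtering cascade is sketched below. It assumes that every candidate already carries a predicted affinity, a predicted bioactivity probability, a predicted solubility (logS), and a fingerprint stored as a set of bit indices; the thresholds, field names, and the greedy Tanimoto-based diversity pick are illustrative choices, not the settings used in the actual pipeline.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_hits(candidates, min_activity, min_solubility, max_similarity, n_hits):
    """Apply property thresholds, then greedily pick a diverse subset.

    Candidates are visited in order of decreasing predicted affinity; one is
    kept only if it is not too similar to any compound already selected.
    """
    passed = [
        c for c in candidates
        if c["activity"] >= min_activity and c["solubility"] >= min_solubility
    ]
    passed.sort(key=lambda c: -c["affinity"])

    selected = []
    for cand in passed:
        if len(selected) == n_hits:
            break
        if all(tanimoto(cand["fp"], s["fp"]) <= max_similarity for s in selected):
            selected.append(cand)
    return selected


# Hypothetical candidates: affinity (pKd-like), QSAR probability, logS, fingerprint bits.
candidates = [
    {"id": "c1", "affinity": 9.0, "activity": 0.9, "solubility": -3.0, "fp": {1, 2, 3, 4}},
    {"id": "c2", "affinity": 8.8, "activity": 0.8, "solubility": -3.5, "fp": {1, 2, 3, 5}},
    {"id": "c3", "affinity": 8.5, "activity": 0.3, "solubility": -3.0, "fp": {9, 10}},
    {"id": "c4", "affinity": 8.0, "activity": 0.7, "solubility": -4.0, "fp": {6, 7, 8}},
    {"id": "c5", "affinity": 7.9, "activity": 0.9, "solubility": -6.5, "fp": {11, 12}},
]

hits = select_hits(candidates, min_activity=0.5, min_solubility=-5.0,
                   max_similarity=0.5, n_hits=2)
hit_ids = [h["id"] for h in hits]
print(hit_ids)
```

In this toy run, c3 and c5 fail the bioactivity and solubility thresholds, and c2 is dropped as a near-duplicate of the higher-ranked c1.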

What makes your approach stand out from the community? (<100 words)

1. Advanced physics-inspired deep virtual screening algorithm

2. Abundant, high-quality training data for model learning

3. Our approach has multiple complementary mechanisms to prevent false-positive cases.

4. Targeted ensemble and screening strategies for the RNA-binding site of NSP13.

Method Name

Multi-scale Drug-Protein Interaction prediction (M-DPI)

Free software packages used

Python, Torch, RDKit, Biopython, P2Rank

Relevant publications of previous uses by your group of this software/method

1. Zheng, S.*, Li, Y., Chen, S. et al. (2020). Predicting drug–protein interaction using quasi-visual question answering system. Nature Machine Intelligence, 2(2), 134-140.

2. Wang, P., Zheng, S.*, Jiang, Y., Li, C., Liu, J., Wen, C., ... & Yang, Y. (2022). Structure-aware multimodal deep learning for drug–protein interaction prediction. Journal of Chemical Information and Modeling, 62(5), 1308-1317.

3. Lu, W., Wu, Q., Zhang, J., Rao, J., Li, C., & Zheng, S.* (2022). TANKBind: Trigonometry-Aware Neural NetworKs for Drug-Protein Binding Structure Prediction. bioRxiv.

4. Sun, Z., Zheng, S.*, Zhao, H., Niu, Z., Lu, Y., Pan, Y., & Yang, Y. (2021). To improve the predictions of binding residues with DNA, RNA, carbohydrate, and peptide via multi-task deep neural networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics.

5. Liu, Z., Huang, D., Zheng, S.* et al. (2021). Deep learning enables discovery of highly potent anti-osteoporosis natural products. European Journal of Medicinal Chemistry, 210, 112982.

6. Zheng, S.*, Yan, X., Yang, Y., & Xu, J. (2019). Identifying structure–property relationships through SMILES syntax analysis with self-attention mechanism. Journal of Chemical Information and Modeling, 59(2), 914-923.

7. Song, Y., Zheng, S.*, Niu, Z., Fu, Z. H., Lu, Y., & Yang, Y. (2020). Communicative Representation Learning on Attributed Molecular Graphs. IJCAI (pp. 2831-2838).

Hit Optimization Methods

Hybrid of the above

Physics-based, deep learning, and de novo design

Description of your approach (min 200 and max 800 words)

First, we will identify the dominant scaffolds in the returned data and fine-tune our model on these active/inactive data to improve both the model and the enrichment protocol.

For the optimization stage, we have developed a deep learning-based hit optimization algorithm to improve the properties and activity of the hits. The key idea is to use a generative model to learn the hit optimization strategies implicitly embedded in the SAR training dataset. Inspired by the matched molecular pair (MMP) cutting algorithm, we constructed an optimization training dataset by sampling protein-specific molecular pairs ((X; Y)|Z), where the source molecule X and the target molecule Y must be similar in 3D (3D similarity > threshold, to preserve the main binding mode) and Y must show a significant bioactivity improvement over X (an increase in pXC50 of at least 1) in the context of protein target Z. We measured 3D molecular similarity with the shape and color similarity score (SC score). Source molecules were restricted to those with low bioactivity (pXC50 ≤ 6.5) and target molecules to those with high bioactivity (pXC50 > 7).
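The pair-sampling rules above can be written down as a short sketch. It assumes that activity records are available as (protein, molecule, pXC50) tuples and that 3D similarity is supplied by a caller-provided function; the similarity threshold of 0.6, the lookup table standing in for the SC score, and all identifiers are hypothetical.

```python
from itertools import permutations


def build_pairs(records, similarity, sim_threshold=0.6, min_gain=1.0,
                source_max=6.5, target_min=7.0):
    """Collect (X, Y, Z) triples of source molecule, target molecule, protein.

    records: iterable of (protein_id, mol_id, pxc50) tuples.
    similarity: callable returning a 3D similarity for two molecule ids.
    """
    by_protein = {}
    for protein_id, mol_id, pxc50 in records:
        by_protein.setdefault(protein_id, []).append((mol_id, pxc50))

    pairs = []
    for protein_id, mols in by_protein.items():
        # Ordered pairs: each molecule may act as source X or as target Y.
        for (x, px), (y, py) in permutations(mols, 2):
            if px > source_max or py <= target_min:
                continue  # X must be weakly active, Y strongly active
            if py - px < min_gain:
                continue  # require at least a 1 log-unit improvement
            if similarity(x, y) > sim_threshold:
                pairs.append((x, y, protein_id))
    return pairs


# Hypothetical activity data and a lookup table standing in for the SC score.
records = [
    ("T1", "m1", 6.0), ("T1", "m2", 7.5), ("T1", "m3", 6.4), ("T1", "m4", 7.1),
    ("T2", "m5", 5.0), ("T2", "m6", 8.0),
]
sc_table = {
    frozenset(("m1", "m2")): 0.8,
    frozenset(("m1", "m4")): 0.7,
    frozenset(("m3", "m2")): 0.9,
    frozenset(("m3", "m4")): 0.9,
    frozenset(("m5", "m6")): 0.3,
}


def sc_score(a, b):
    return sc_table.get(frozenset((a, b)), 0.0)


pairs = build_pairs(records, sc_score)
print(pairs)
```

In this toy data, (m3, m4) is rejected because the activity gain is only about 0.7 log units, and (m5, m6) is rejected because the two molecules are not similar enough in 3D despite the large gain.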

A multi-modal graph transformer model, based on the Transformer neural network architecture, was employed to generate lead compounds from two inputs: a source compound (a hit) and a protein pocket structure. Briefly, the model comprises three main components: (1) a molecular 3D graph neural network for molecular conformer embedding, (2) a pre-trained encoder for protein pocket embedding, and (3) a Transformer neural network for mapping the constructed molecular pairs. We trained the model on the optimization training dataset.

With this trained model, we will input the experimentally validated hits bearing the dominant scaffold, together with the RNA-binding pocket structure of NSP13, to generate a large number of optimized virtual molecules. We will then use the optimized binding affinity prediction model to rank the generated molecules and select the most advantageous lead compounds. The enrichment process will be optimized based on a retrospective study of the returned data.

What makes your approach stand out from the community? (<100 words)

We developed a novel deep learning-based conditional molecular optimization method inspired by the matched molecular pair (MMP) cutting algorithm. It takes the feature representation of the protein pocket into account, so the optimization is directed toward the target.

Method Name

DeepHop, MetaMO and Molsty

Commercial software packages used

Free software packages used

Python, Torch, RDKit, Biopython, Open Babel

Relevant publications of previous uses by your group of this software/method

1. Zheng, S.*, Yan, X., Gu, Q., Yang, Y., Du, Y., Lu, Y., & Xu, J. (2019). QBMG: quasi-biogenic molecule generator with deep recurrent neural network. Journal of Cheminformatics, 11(1), 1-12.

2. Zheng, S.*, Lei, Z., Ai, H. et al. (2021). Deep scaffold hopping with multimodal transformer neural networks. Journal of Cheminformatics, 13, 87.

3. Yang, Y., Zheng, S.*, Su, S., Zhao, C., Xu, J., & Chen, H. (2020). SyntaLinker: automatic fragment linking with deep conditional transformer neural networks. Chemical Science, 11(31), 8312-8322.

4. Wang, J., Zheng, S.*, Chen, J., & Yang, Y. (2021). Meta Learning for Low-Resource Molecular Optimization. Journal of Chemical Information and Modeling, 61(4), 1627-1636.

5. Zheng, S.*, Song, Y., Pan, Z., Li, C., Song, L., & Yang, Y. (2021). Molecular Attributes Transfer from Non-Parallel Data. arXiv preprint arXiv:2111.15146.

Challenge #2