Computational methods

Hit Identification

Hybrid of the above

Physics-based and Deep learning

Description of your approach (min 200 and max 800 words)

We developed a multi-scale, multi-task neural network to learn the interaction domains and binding affinities between compounds and proteins. The model takes graph representations of compounds and proteins as input. The compound is processed by a physics-driven graph neural network that integrates geometry and momentum information into the topological structure, while the protein is processed by a multi-scale graph neural network that connects surface, structure, and sequence. Hierarchical pooling layers narrow the whole-protein information down to the pocket level; the multiple pockets of each training protein are pre-defined as the top-predicted sites (including allosteric sites) from an ensemble of existing and self-developed tools. This strategy helps prune the model and avoids overfitting caused by oversized protein graphs. A multi-head attention mechanism captures the interactions between a compound and the pockets of a protein, with extra supervision from labels extracted from available high-quality co-crystal complex structures (PDBbind). A Siamese network is coupled to predict the binding affinity for a given compound and protein/pocket pair, using the previously extracted compound features, pocket features, and derived interaction features. An industry-scale structure-activity relationship (SAR) dataset, with compound structures and protein crystal or AlphaFold-predicted structures, was cleaned for binding affinity learning. The training dataset contains more than 5 million SAR data points (mainly derived from ChEMBL and PubChem), covering 1252 protein targets and 1.3 million unique compounds. The whole model was trained end-to-end and has been validated on benchmark test sets. The prediction process will focus on the WDR domain of LRRK2, where we will select the compounds that our model predicts to bind the WDR domain with high affinity.
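As an illustration of the compound-to-pocket attention step described above, the following is a minimal sketch in plain Python, not the actual MS-DPI implementation. It assumes each compound and each pocket has already been embedded as a fixed-length feature vector, omits the learned query/key projections, and uses hypothetical names (`pocket_attention`, `split_heads`) and toy values throughout.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vec, n_heads):
    """Split a feature vector into n_heads equal-length chunks."""
    if len(vec) % n_heads != 0:
        raise ValueError("feature length must be divisible by n_heads")
    size = len(vec) // n_heads
    return [vec[i * size:(i + 1) * size] for i in range(n_heads)]


def pocket_attention(compound_vec, pocket_vecs, n_heads=2):
    """Return, for each head, the attention weights of one compound over the pockets.

    Assumption: raw features stand in for the learned query/key projections
    that a trained model would apply.
    """
    query_heads = split_heads(compound_vec, n_heads)
    pocket_heads = [split_heads(p, n_heads) for p in pocket_vecs]
    weights = []
    for h, q in enumerate(query_heads):
        scale = math.sqrt(len(q))  # scaled dot-product attention
        scores = [sum(a * b for a, b in zip(q, p[h])) / scale for p in pocket_heads]
        weights.append(softmax(scores))
    return weights


# Toy example: one compound embedding and three candidate pocket embeddings.
compound = [1.0, 0.0, 0.5, 0.5]
pockets = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
attn = pocket_attention(compound, pockets)
```

In a supervised setting such as the one described, per-pocket weights like these could be compared against binding-site labels derived from co-crystal structures; that loss term is not shown here.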
The compounds with the highest predicted affinity will then be filtered by predicted blood–brain barrier score (specific for PD), solubility, diversity, and customized MCS/drug-likeness filters to ensure the quality of the hits.
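The post-prediction selection described above can be sketched as a simple filter cascade followed by a greedy diversity pick. This is only an illustrative sketch: the `Candidate` fields, the cutoff values (`min_bbb`, `min_log_s`, `max_sim`), and the use of Tanimoto similarity on fingerprint bit sets are assumptions made for the example, and the MCS/drug-likeness filters are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    affinity: float         # predicted affinity, higher is better
    bbb_score: float        # predicted blood-brain barrier score in [0, 1]
    log_s: float            # predicted aqueous solubility (log mol/L)
    fingerprint: frozenset  # indices of the "on" fingerprint bits


def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_hits(cands, n_hits, min_bbb=0.5, min_log_s=-5.0, max_sim=0.6):
    """Filter by BBB score and solubility, then pick diverse top-affinity hits."""
    passed = [c for c in cands if c.bbb_score >= min_bbb and c.log_s >= min_log_s]
    passed.sort(key=lambda c: c.affinity, reverse=True)
    picked = []
    for c in passed:
        if len(picked) >= n_hits:
            break
        # Keep a compound only if it is dissimilar to every compound already picked.
        if all(tanimoto(c.fingerprint, p.fingerprint) <= max_sim for p in picked):
            picked.append(c)
    return picked


# Toy candidates with made-up property values.
candidates = [
    Candidate("A", 8.2, 0.9, -4.0, frozenset({1, 2, 3, 4})),
    Candidate("B", 8.0, 0.8, -4.5, frozenset({1, 2, 3, 4, 5})),  # too similar to A
    Candidate("C", 7.9, 0.2, -3.5, frozenset({10, 11})),         # fails BBB cutoff
    Candidate("D", 7.5, 0.7, -3.0, frozenset({7, 8, 9})),
    Candidate("E", 7.0, 0.9, -6.5, frozenset({20, 21})),         # fails solubility cutoff
]
hits = select_hits(candidates, n_hits=2)
hit_names = [c.name for c in hits]
```

Applying the property filters before the diversity pick keeps the greedy step from spending slots on compounds that would be rejected anyway.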

Method Name

Multi-scale drug-protein interaction prediction (MS-DPI)

Commercial software packages used

None

Free software packages used

Python, Torch, RDKit, Biopython, P2Rank

Relevant publications of previous uses by your group of this software/method

1. Zheng, S.*, Li, Y., Chen, S. et al. (2020). Predicting drug–protein interaction using quasi-visual question answering system. Nature Machine Intelligence, 2(2), 134-140.
2. Zheng, S.*, Yan, X., Yang, Y., & Xu, J. (2019). Identifying structure–property relationships through SMILES syntax analysis with self-attention mechanism. Journal of Chemical Information and Modeling, 59(2), 914-923.
3. Song, Y., Zheng, S.*, Niu, Z., Fu, Z. H., Lu, Y., & Yang, Y. (2020). Communicative Representation Learning on Attributed Molecular Graphs. IJCAI (pp. 2831-2838).
4. Liu, Z., Huang, D., Zheng, S.* et al. (2021). Deep learning enables discovery of highly potent anti-osteoporosis natural products. European Journal of Medicinal Chemistry, 210, 112982.
5. Yuan, Q., Chen, S., Rao, J., Zheng, S.*, Zhao, H., & Yang, Y. (2021). AlphaFold2-aware protein–DNA binding site prediction using graph transformer. Briefings in Bioinformatics.

Hit Optimization Methods

Method type (check all that apply)

De novo design

Deep learning

Physics-based

Hybrid of the above

Physics-based, Deep learning and De novo design

Description of your approach (min 200 and max 800 words)

First, we will determine the dominant scaffolds from the returned data and fine-tune our model on these active/inactive data to improve the model and the enrichment protocol. For the optimization stage, we have developed a deep learning-based hit optimization algorithm to improve the hits’ properties and activity. The key idea is to use a generative model to learn the hit optimization strategies implicitly embedded in the SAR training dataset. Inspired by the matched molecular pairs (MMPs) cutting algorithm, we constructed an optimization training dataset by sampling protein-specific molecular pairs ((X; Y)|Z), where the source molecule X and target molecule Y must be similar in 3D (3D similarity > threshold, to preserve the main binding mode) and Y shows a significant bioactivity improvement over X (ΔpXC50 ≥ 1) in the context of protein target Z. We measured 3D molecular similarity with the shape and color similarity score (SC score). The source molecules were restricted to those with low bioactivity (pXC50 ≤ 6.5) and the target molecules to those with high bioactivity (pXC50 > 7). A multi-modal graph transformer model, based on the Transformer neural network architecture, was employed to generate lead compounds from a source compound (a hit) and a protein pocket structure. In brief, the model comprises three main components: (1) a molecular 3D graph neural network for molecular conformer embedding, (2) a pre-trained encoder for protein pocket embedding, and (3) a Transformer neural network for mapping the constructed molecular pairs. We trained the model on the optimization training dataset. We will then feed the trained model the experimentally validated hits bearing the dominant scaffolds, together with the pocket structure of the WDR domain, to generate a large number of optimized virtual molecules.
We will then use the optimized binding affinity prediction model to rank the generated molecules and select the most promising lead compounds. The enrichment process will be optimized based on a retrospective study of the returned data.
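The pair-construction criteria described above for the optimization training dataset can be sketched as follows. This is a minimal sketch under stated assumptions: the `Record` type, the `build_pairs` function, and the toy similarity table are hypothetical, the SC score is replaced by a lookup function supplied by the caller, and the default `sim_threshold` of 0.6 is a placeholder because the description does not state the value used.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Record:
    mol_id: str
    target: str   # protein target Z
    pxc50: float  # measured bioactivity


def build_pairs(records, similarity, sim_threshold=0.6,
                min_gain=1.0, max_source=6.5, min_target=7.0):
    """Return (source X, target Y, protein Z) triples that satisfy all criteria."""
    by_target = {}
    for r in records:
        by_target.setdefault(r.target, []).append(r)

    pairs = []
    for target, recs in by_target.items():
        for x, y in permutations(recs, 2):
            # Source must have low activity, target must have high activity.
            if x.pxc50 > max_source or y.pxc50 <= min_target:
                continue
            # Y must improve on X by at least min_gain pXC50 units.
            if y.pxc50 - x.pxc50 < min_gain:
                continue
            # X and Y must be similar in 3D to keep the main binding mode.
            if similarity(x.mol_id, y.mol_id) <= sim_threshold:
                continue
            pairs.append((x.mol_id, y.mol_id, target))
    return pairs


# Toy similarity table standing in for a 3D shape/color similarity score.
SIM = {
    frozenset(("m1", "m2")): 0.8,
    frozenset(("m1", "m4")): 0.3,
    frozenset(("m5", "m6")): 0.9,
}


def lookup_similarity(a, b):
    return SIM.get(frozenset((a, b)), 0.0)


records = [
    Record("m1", "T1", 5.0),
    Record("m2", "T1", 7.5),
    Record("m3", "T1", 6.8),
    Record("m4", "T1", 8.0),
    Record("m5", "T2", 5.5),
    Record("m6", "T2", 7.2),
]
pairs = build_pairs(records, lookup_similarity)
```

The activity-gain check is kept separate from the two activity cutoffs because the cutoffs alone do not guarantee it: a source at 6.5 and a target at 7.1 pass both cutoffs yet differ by only 0.6.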

Method Name

DeepHop, MetaMO and Molsty

Commercial software packages used

None

Free software packages used

Python, Torch, RDKit, Biopython, Open Babel

Relevant publications of previous uses by your group of this software/method

1. Zheng, S.*, Yan, X., Gu, Q., Yang, Y., Du, Y., Lu, Y., & Xu, J. (2019). QBMG: quasi-biogenic molecule generator with deep recurrent neural network. Journal of Cheminformatics, 11(1), 1-12.
2. Zheng, S.*, Lei, Z., Ai, H. et al. (2021). Deep scaffold hopping with multimodal transformer neural networks. Journal of Cheminformatics, 13, 87.
3. Yang, Y., Zheng, S.*, Su, S., Zhao, C., Xu, J., & Chen, H. (2020). SyntaLinker: automatic fragment linking with deep conditional transformer neural networks. Chemical Science, 11(31), 8312-8322.
4. Wang, J., Zheng, S.*, Chen, J., & Yang, Y. (2021). Meta Learning for Low-Resource Molecular Optimization. Journal of Chemical Information and Modeling, 61(4), 1627-1636.
5. Zheng, S.*, Song, Y., Pan, Z., Li, C., Song, L., & Yang, Y. (2021). Molecular Attributes Transfer from Non-Parallel Data. arXiv preprint arXiv:2111.15146.

Challenge #1