Computational methods

Hit Identification

Method type (check all that apply)

High-throughput docking

Machine learning

Physics-based

Description of your approach (min 200 and max 800 words)

1. Primary screening with evolutionary chemical binding similarity model

We will apply our proprietary evolutionary chemical binding similarity (ECBS) model (PMID: 31504818) for primary virtual screening of the REAL database. The ECBS model is designed to learn chemical properties that are conserved among compounds binding evolutionarily related targets. Specifically, evolutionarily related chemical pairs (ERCPs) are distinguished from unrelated pairs by similarity learning framed as binary classification. A chemical pair is considered evolutionarily related if the targets of the two chemicals are the same or share an evolutionary annotation.

Among the ECBS variants for virtual screening (VS), we will use the TS-ensECBS model, which is specifically trained to detect chemical pairs that bind to a predetermined VS target. The TS-ensECBS model considers only ERCPs from targets that are evolutionarily linked to the VS target and uses multiple ECBS models to incorporate different evolutionary information about that target. The model assigns each chemical a similarity score to the known active molecules, with a higher score representing a higher likelihood of binding to the VS target. The underlying principles and the detailed construction process for the ECBS models are described in our previous work (PMID: 31504818).

2. Secondary screening with a binding affinity prediction model

The TS-ensECBS model has proven successful in identifying potential hits in various applications (PMID: 36142303, PMID: 36127129, PMID: 35887321); however, it only provides probabilities for binding (i.e., whether or not binding occurs) and does not yield the relative binding affinity values or binding site information required for the CACHE competition. Therefore, we will develop a binding affinity prediction model using the IC50 values of 895 compounds to rank ECBS-screened compounds by predicted binding affinity, and we will adopt a blind molecular docking approach to confirm the correct binding site.

We will train the binding affinity prediction model with several machine-learning approaches (e.g., DNN, Random Forest, SVM) to learn relative binding affinity, and we will compare their prediction performance. The method with the highest accuracy will be selected for calculating the final hit score. The hit score will combine the TS-ensECBS score and the binding affinity prediction score as follows:

Hit-Score = P(Active) × P(Higher Binding | Active), where P(Active) is estimated from the ECBS score and P(Higher Binding | Active) from the binding affinity prediction score.
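The combination above can be illustrated with a minimal Python sketch. The function names, compound identifiers, and probability values are hypothetical and serve only to show the arithmetic; the calibration of each probability from the ECBS and affinity models is not shown.

```python
def hit_score(p_active: float, p_higher_given_active: float) -> float:
    """Hit-Score = P(Active) * P(Higher Binding | Active)."""
    for name, p in (("p_active", p_active),
                    ("p_higher_given_active", p_higher_given_active)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1], got {p}")
    return p_active * p_higher_given_active


def rank_by_hit_score(candidates):
    """Sort (compound_id, P(Active), P(Higher | Active)) tuples by descending hit score."""
    scored = [(cid, hit_score(pa, ph)) for cid, pa, ph in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical compounds and probabilities, for illustration only.
example = [("cmpd_A", 0.9, 0.2), ("cmpd_B", 0.6, 0.8), ("cmpd_C", 0.5, 0.5)]
ranking = rank_by_hit_score(example)
```

In this toy example a compound with a very high ECBS-derived probability but a low predicted affinity (cmpd_A) ranks below compounds that are moderate on both terms, which is the intended behavior of the product.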

3. Candidate filtering by blind docking to confirm binding to the pocket in the closed conformation of the TKB domain

The final candidates, ranked by the combined hit score, will undergo blind docking simulation (e.g., DiffDock) to confirm that they bind the correct pocket in the closed conformation of the CBLB TKB domain. Each candidate will be docked to the receptor structure (PDB code 8GCY) without prior knowledge of the binding site, and only candidates whose predicted pose occupies the expected binding pocket will proceed to the next screening step.
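One simple way to decide whether a blind-docked pose lies in the expected pocket is a centroid distance check, sketched below. This is an assumption for illustration, not a fixed part of the protocol: the pocket center, the 5.0 Å cutoff, and the coordinates are hypothetical placeholders, and reading poses from docking output files is omitted.

```python
import math


def centroid(coords):
    """Geometric center of a list of (x, y, z) atom coordinates."""
    if not coords:
        raise ValueError("coords must contain at least one atom")
    n = len(coords)
    return tuple(sum(atom[i] for atom in coords) / n for i in range(3))


def binds_expected_pocket(ligand_coords, pocket_center, max_distance=5.0):
    """True if the ligand centroid lies within max_distance (angstroms) of the pocket center.

    Both pocket_center and max_distance are assumed inputs chosen by the user.
    """
    return math.dist(centroid(ligand_coords), pocket_center) <= max_distance


# Hypothetical pocket center and two hypothetical docked poses.
pocket = (10.0, 10.0, 10.0)
pose_in_pocket = [(9.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 12.0, 10.0)]
pose_elsewhere = [(30.0, 10.0, 10.0), (32.0, 10.0, 10.0)]

kept = binds_expected_pocket(pose_in_pocket, pocket)
rejected = not binds_expected_pocket(pose_elsewhere, pocket)
```

A contact-based criterion (for example, the fraction of pocket residues within a cutoff of any ligand atom) would be a stricter alternative to the centroid test.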

In addition to blind docking, molecular docking to the defined binding site will be carried out using various methods (e.g., AutoDock Vina, DOCK6, DiffDock) to assess the correlation between docking scores and IC50 values for the 895 compounds. The docking method and scoring scheme with the best correlation will be selected and used to prioritize active compounds. At a later stage, these structure-based scores will serve as an additional threshold when selecting, from the highest-scoring TS-ensECBS compounds, the final candidates for experimental validation.
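The method comparison can be sketched as a rank correlation between each method's docking scores and the measured IC50 values. The sketch below assumes Spearman correlation as the metric (the text does not fix one) and uses invented scores for two hypothetical methods; it also assumes that lower docking scores and lower IC50 values both indicate stronger binding, so a positive correlation is desirable.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented data: IC50 values and docking scores from two hypothetical methods.
ic50 = [0.05, 0.4, 1.2, 8.0, 25.0]
docking_scores = {
    "method_a": [-10.1, -9.0, -8.7, -7.2, -6.5],
    "method_b": [-7.0, -9.5, -6.8, -8.9, -7.4],
}
correlations = {name: spearman(scores, ic50) for name, scores in docking_scores.items()}
best_method = max(correlations, key=correlations.get)
```

Rank correlation is used here because docking scores and IC50 values are on different, nonlinear scales; only the ordering of compounds matters for prioritization.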

4. Candidate filtering by chemical structure similarity to ensure novel chemical templates

To ensure that the final candidates identified through virtual screening have novel chemical structures, we will apply structure-similarity filters that compare them with previously identified active molecules. Candidates whose Tanimoto coefficient to a known active exceeds a chosen threshold will be removed, so that only candidates below it are retained (e.g., MACCS Tanimoto coefficient < 0.23, p-value < 0.1).
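The novelty filter can be sketched in plain Python as follows. Fingerprints are represented here as sets of on-bit indices, which is an assumption made to keep the example self-contained; in practice they would be MACCS keys computed with a cheminformatics toolkit such as RDKit. The bit sets are invented, and the p-value criterion is not modeled.

```python
def tanimoto(fp_a: frozenset, fp_b: frozenset) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def is_novel(candidate_fp, active_fps, threshold=0.23):
    """True if the candidate is below the threshold against every known active."""
    return all(tanimoto(candidate_fp, fp) < threshold for fp in active_fps)


# Invented fingerprints (sets of on-bit indices), for illustration only.
known_actives = [frozenset({1, 2, 3, 4, 5, 6, 7, 8})]
candidates = {
    "close_analog": frozenset({1, 2, 3, 4, 9}),              # 4 shared / 9 total
    "new_scaffold": frozenset({1, 20, 21, 22, 23, 24}),      # 1 shared / 13 total
}
novel_ids = [cid for cid, fp in candidates.items() if is_novel(fp, known_actives)]
```

Comparing against the maximum similarity over all known actives, as done here through `all(...)`, keeps a candidate only if it is dissimilar to every reference compound rather than dissimilar on average.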

What makes your approach stand out from the community? (<100 words)

Combining ECBS with structure-based methods, including an AI-based VS approach and blind docking, can enhance high-affinity inhibitor discovery and provide valuable insights to the scientific community. Ligand-based VS is fast but may produce many false positives because it lacks structural information on receptor-ligand interactions. This bias can be particularly severe when screening a large number of compounds. Therefore, verifying the binding mode of candidate compounds with structural approaches, in combination with data-driven ligand-based VS, is critical to efficiently identifying active compounds during hit identification.

Method Name

Evolutionary Chemical Binding Similarity (ECBS)

Commercial software packages used

BIOVIA Discovery Studio

Free software packages used

RDKit, AutoDock Vina, DOCK6, DiffDock, ECBS

Relevant publications of previous uses by your group of this software/method

- Method and Algorithm

K. Park, Y. Ko, P. Durai, C. Pan, "Machine learning-based chemical binding similarity using evolutionary relationships of target genes", 2019, Nucleic Acids Research, 47(20), e128

P. Durai, Y. Ko, C. Pan, K. Park, "Evolutionary Chemical Binding Similarity Approach Integrated with 3D-QSAR Method for Effective Virtual Screening", 2020, BMC Bioinformatics, 21, 309

- Virtual Screening Applications

J. Lim, K. Park, K. Choi, C. Kim, J. Lee, R. Weicker, C. Pan, S. Kim, K. Park, "Drug Discovery Using Evolutionary Similarities in Chemical Binding to Inhibit Patient-Derived Hepatocellular Carcinoma", 2022, Int. J. Mol. Sci., 23(14), 7971

Y. Wijaya, T. Setiawan, I. Sari, K. Park, C. Lee, K. Cho, Y. Lee, J. Lim, J. Yoon, S. Lee, H. Kwon, "Ginsenoside Rd ameliorates muscle wasting by suppressing the signal transducer and activator of transcription 3 pathway", 2022, Journal of Cachexia, Sarcopenia and Muscle, 13(6), 3149-3162

S. Kim, K. Park, J. Lim, H. Yun, S. Kim, K. Choi, C. Kim, J. Lee, R. Weicker, C. Pan, K. Park, "Potential therapeutic agents against paclitaxel- and sorafenib-resistant papillary thyroid carcinoma", 2022, Int. J. Mol. Sci., 23(18), 10378

P. Durai, Y. Ko, J. Kim, C. Pan, K. Park, "Identification of tyrosinase inhibitors and their structural-activity relationships via evolutionary chemical binding similarity and structure-based methods", 2021, Molecules, 26(3), 566

Hit Optimization Methods

Method type (check all that apply)

Deep learning

High-throughput docking

Machine learning

Physics-based

Description of your approach (min 200 and max 800 words)

For chemicals found during the hit identification process, the TS-ensECBS model will be retrained with the new experimental validation data to further optimize it. The in-stock REAL database will then be screened with the retrained TS-ensECBS model, and top-ranked molecules will go through the same additional steps that were used to identify hits. The updated model will also be used to score molecules designed by deep learning methods such as REINVENT.

Another strategy uses deep learning-based drug design methods to create new virtual molecules with multiple desired properties based on structural information from the hit molecules. REINVENT combines a generative model, reinforcement learning, and a multi-component scoring function that serves as the reward. A generative model is first trained on the ChEMBL database. It is then adapted to the identified hit molecules by transfer learning and used to sample new molecules. To obtain compounds with high molecular docking scores, improved QED scores, and good synthesizability, the LibINVENT and DockStream methods will be integrated into REINVENT. The weight of the docking score will be increased up to two to emphasize direct receptor-ligand interaction compatibility. Since the core structures of the hit and output molecules will be identical, it will be possible to calculate relative binding free energies for ligand pairs.
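The effect of up-weighting the docking component can be illustrated with a weighted geometric mean of component scores. This is a sketch under stated assumptions, not REINVENT's actual configuration or API: the component names, the values, and the choice of a geometric mean as the aggregation are hypothetical, and each component is assumed to be already scaled to [0, 1].

```python
import math


def weighted_geometric_mean(components, weights):
    """Aggregate component scores in [0, 1] as prod(s_i ** w_i) ** (1 / sum(w_i))."""
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    log_sum = 0.0
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"component {name!r} must be in [0, 1], got {value}")
        if value == 0.0:
            return 0.0  # any zero component zeroes the whole reward
        log_sum += weights[name] * math.log(value)
    return math.exp(log_sum / total_weight)


# Hypothetical, already-normalized component scores for one generated molecule.
molecule = {"docking": 0.25, "qed": 1.0, "synthesizability": 1.0}

equal_weights = {"docking": 1.0, "qed": 1.0, "synthesizability": 1.0}
docking_emphasized = {"docking": 2.0, "qed": 1.0, "synthesizability": 1.0}

reward_equal = weighted_geometric_mean(molecule, equal_weights)
reward_emphasized = weighted_geometric_mean(molecule, docking_emphasized)
```

With the docking weight raised to two, a molecule with a poor docking component receives a lower reward than under equal weights, so the generator is pushed harder toward poses that score well against the receptor.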

All experimentally validated hit molecules will be further elaborated by deep learning-based design methods to obtain optimized, synthesizable analogs. The designed molecules will be evaluated with the TS-ensECBS model, the binding affinity prediction model, and the docking score used for hit identification. Blind docking will also be applied to check binding site compatibility. The same scoring and filtering procedure used for hit identification will be applied for the final selection.

What makes your approach stand out from the community? (<100 words)

We will retrain the TS-ensECBS model by incorporating both active and inactive compounds from the previous hit identification process. This retraining will fine-tune the model for the specific target (TKB domain) and bioassay. Additionally, we will test various retraining schemes using the experimental data to determine the optimal model for hit optimization. Combining deep generative modeling with ECBS fine-tuning to produce optimized hit molecules can offer the community valuable insights for devising an effective hit optimization strategy. The generative modeling approach will build on the core structures of previously discovered hits to generate synthesizable, drug-like compounds with the properties needed to bind the receptor structure.

Method Name

ECBS+REINVENT

Commercial software packages used

BIOVIA Discovery Studio Client

Free software packages used

REINVENT, LibINVENT, DockStream, GROMACS, pmx, RDKit, gmx_MMPBSA

Relevant publications of previous uses by your group of this software/method

Same as for Hit Identification above.

Challenge #4