First, we will use a machine-learning (ML) protein-ligand binding predictor as the primary filter, so that a massive number of explicit molecular docking calculations can be avoided. To train it, we will collect docking data for the given target protein using small molecules from the Enamine Diversity Library (approximately 3.9 million compounds) as ligands. The docking programs we will use are AutoDock-GPU, Vina-GPU, rDock, and LeDock, all of which are freely available and widely used in high-throughput virtual screening (HTVS) for drug discovery. We will normalize the docking data obtained from these programs so that their scores are comparable. Using this information, we will train docking-score prediction models that predict a molecule's docking scores from its SMILES representation alone. These prediction models make it feasible to screen ultra-large libraries. We will then re-rank the molecules using a machine-learning consensus docking model that combines their docking scores; this model is trained on the DUD-E and LIT-PCBA sets. Using the predicted docking scores and the consensus docking model, we will predict the binding likelihood for an even larger dataset, the Enamine REAL database (over 6 billion compounds), and select the molecules with high predicted binding likelihoods to the specified binding site of the target protein (1st Screening results).
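The normalization step above can be sketched as follows. This is a minimal illustration that assumes per-program z-score normalization; the text does not specify which normalization will be used. The function names `zscore_normalize` and `normalize_all` and the toy scores are hypothetical.

```python
from statistics import mean, pstdev


def zscore_normalize(scores):
    """Convert one program's raw scores {ligand_id: score} to z-scores."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All scores identical: no ligand stands out for this program.
        return {ligand: 0.0 for ligand in scores}
    return {ligand: (score - mu) / sigma for ligand, score in scores.items()}


def normalize_all(raw):
    """Apply z-score normalization separately to each docking program."""
    return {program: zscore_normalize(scores) for program, scores in raw.items()}


# Toy values for illustration only, not real docking output.
raw_scores = {
    "AutoDock-GPU": {"lig1": -9.0, "lig2": -7.0, "lig3": -5.0},
    "Vina-GPU": {"lig1": -8.0, "lig2": -8.5, "lig3": -6.0},
}
normalized = normalize_all(raw_scores)
```

After this step each program's scores have zero mean and unit variance, so a more negative value still indicates a more favorable score, but values from different programs share a common scale and can be fed to a single prediction or consensus model.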
Next, we will perform docking calculations against the target protein for the molecules selected in the first screening and predict the stability of the bound state for the estimated binding poses. Based on this, we will select the top molecules and obtain the primary hit candidate group (2nd Screening results). At this stage, we will use rescoring models to rescale the scores and identify the optimal docking pose for each ligand. The rescoring models we will use are RTM-score, RF-score-VS, and AK-score v2. We will standardize and rank the results from these three rescoring methods, then select the molecules with the best average rank.
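The average-rank selection described above can be sketched as follows. This is an illustration only: it assumes that a higher value is better for every rescoring method, which should be checked against each tool's actual output convention. The function names and the toy scores are hypothetical.

```python
def rank_descending(scores):
    """Return {ligand_id: rank}, where rank 1 is the highest score."""
    # Ties are broken by ligand ID so the result is deterministic.
    ordered = sorted(scores, key=lambda ligand: (-scores[ligand], ligand))
    return {ligand: position + 1 for position, ligand in enumerate(ordered)}


def average_rank(score_tables):
    """Average each ligand's rank over all rescoring methods."""
    # Only ligands scored by every method are compared.
    ligands = set.intersection(*(set(table) for table in score_tables.values()))
    ranks = [
        rank_descending({ligand: table[ligand] for ligand in ligands})
        for table in score_tables.values()
    ]
    return {ligand: sum(r[ligand] for r in ranks) / len(ranks) for ligand in ligands}


def select_top(score_tables, n):
    """Return the n ligands with the best (lowest) average rank."""
    avg = average_rank(score_tables)
    return sorted(avg, key=lambda ligand: (avg[ligand], ligand))[:n]


# Toy values for illustration only, not real rescoring output.
rescoring = {
    "RTM-score": {"a": 55.0, "b": 40.0, "c": 62.0},
    "RF-score-VS": {"a": 6.8, "b": 7.1, "c": 6.5},
    "AK-score v2": {"a": 7.9, "b": 6.0, "c": 7.2},
}
top_two = select_top(rescoring, 2)
```

Ranking before averaging removes the need to put the three methods on the same numerical scale, since only the ordering within each method matters.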
Using the primary hit candidates as a basis, we will generate novel hit candidates with molecular generative models. The generative models we will use are MolFinder and MolGAN. For the novel hit candidates obtained, we will perform docking calculations using three of the previously mentioned docking programs (AutoDock-GPU, Vina-GPU, LeDock) and select the top molecules as hit candidates. During this selection, we will prioritize molecules that appear among the top candidates of more than one docking program (2nd Hit candidates). We will then combine the two groups of hit candidates and analyze the binding stability of each molecule to identify the final hit candidates.
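The overlap-based prioritization described above can be sketched as follows. This is a minimal illustration that assumes lower docking scores are more favorable and that each program contributes its top `n` molecules; the function names, the cutoff, and the toy scores are hypothetical.

```python
from collections import Counter


def top_n(scores, n):
    """Return the set of n ligands with the lowest (most favorable) scores."""
    ordered = sorted(scores, key=lambda ligand: (scores[ligand], ligand))
    return set(ordered[:n])


def overlap_priority(docking, n):
    """Count in how many programs' top-n lists each ligand appears.

    Returns (ligand_id, count) pairs, most frequently selected first.
    """
    counts = Counter()
    for scores in docking.values():
        counts.update(top_n(scores, n))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy values for illustration only, not real docking output.
docking = {
    "AutoDock-GPU": {"g1": -10.1, "g2": -9.4, "g3": -7.0, "g4": -8.8},
    "Vina-GPU": {"g1": -9.0, "g2": -7.5, "g3": -8.1, "g4": -8.9},
    "LeDock": {"g1": -7.2, "g2": -6.9, "g3": -5.0, "g4": -7.5},
}
prioritized = overlap_priority(docking, n=2)
```

In this toy case `g1` is in the top two of all three programs and is ranked first, followed by `g4` (two programs) and `g2` (one program); `g3` never reaches a top-two list and is dropped.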
During the merging process, we will use RetroTRAE (Retrosynthetic translation of atomic environments with Transformer) and GASA (Graph Attention-based assessment of Synthetic Accessibility; Yu, J., et al., Journal of Chemical Information and Modeling, 2022, 62, 2973-2986) to evaluate the synthetic accessibility of the hit candidates. These deep learning-based models are used to assess synthetic accessibility, expressed as a Synthetic Accessibility Score (SAS), an indicator used in drug discovery to gauge the ease or difficulty of synthesizing a compound. We believe that the hit candidates identified through this entire process will have the highest practical feasibility.
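The merging and synthetic-accessibility filtering described above can be sketched as follows. This is an illustration under stated assumptions: each molecule is assumed to carry a single hypothetical ease-of-synthesis value in [0, 1], where higher means easier to synthesize. The actual output formats of RetroTRAE and GASA are not assumed here, and the function names, threshold, and toy values are hypothetical.

```python
def merge_candidates(*groups):
    """Combine candidate lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for group in groups:
        for molecule in group:
            if molecule not in seen:
                seen.add(molecule)
                merged.append(molecule)
    return merged


def filter_synthesizable(candidates, ease, threshold=0.5):
    """Keep candidates whose ease-of-synthesis value meets the threshold.

    Molecules without a value are treated as failing the filter.
    """
    return [m for m in candidates if ease.get(m, 0.0) >= threshold]


# Toy identifiers and values for illustration only.
screening_hits = ["m1", "m2", "m3"]
generated_hits = ["m3", "m4"]
ease_of_synthesis = {"m1": 0.9, "m2": 0.3, "m3": 0.7, "m4": 0.55}

merged = merge_candidates(screening_hits, generated_hits)
final_hits = filter_synthesizable(merged, ease_of_synthesis)
```

In practice the identifiers would need to be canonicalized (for example, as canonical SMILES) before deduplication so that the same molecule from both groups is recognized as one entry.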