- Selection of 10,000 molecules: Compounds available for purchase (in stock) will be obtained from the ZINC database, and their Morgan fingerprints will be computed using RDKit through BAMBU (https://pypi.org/project/bambu-qsar/). Outliers and the dimensionality of the dataset will be reduced using Principal Component Analysis (PCA), preserving 95% of the variance, followed by the UMAP algorithm, reducing to two dimensions. Subsequently, the molecules will be clustered using HDBSCAN. For each cluster, the centroid will be the molecule with the smallest mean Euclidean distance to the other members of the cluster. A total of 10,000 molecules will be selected in this step while keeping the clusters uniformly represented in the resulting dataset (DB-diverse). With DB-diverse, we aim to preserve the chemical diversity of the selected molecules.
- Exploration of DB-diverse: With these pre-selected ligands, we will create a set of molecular descriptors using RDKit 2D/3D, BINANA, PaDEL, etc. We will also compute Conceptual Density Functional Theory reactivity descriptors with PRIMoRDiA. We will examine the values of these molecular descriptors to better understand their variance, properties, correlations, etc. Based on these analyses, we can apply a new filtering step or return to step 1 and select more ligands from the clusters.
- Docking of DB-diverse: Using our virtual screening framework (FrameworkVS), we will perform molecular docking simulations with AutoDock Vina 1.1.2 on the ligands selected in DB-diverse. In this tool, the user configures an experiment on a web-based platform and specifies the paths of the input and output files (receptor (1,N) and ligands (1,N)), as well as the size, center, and variation of the binding site(s). The framework then generates a Python script that runs the VS experiment on the user's computer. The new version of FrameworkVS will be available on our GitHub soon. The energy values obtained with Vina in this step will be converted into z-scores and p-values, and ligands with high interaction energy (low dissociation energy) and p-value < 0.05 will be selected as positive hits.
- ML model DB-diverse: Using the docking results of DB-diverse, we will prepare a dataset consisting of Morgan fingerprints generated by RDKit through BAMBU, with the z-score as the target attribute. This dataset will then be divided into training and validation subsets (75%/25%), and a semi-supervised learning algorithm will be applied to the training data for classification. The labeled training data will be used to train a deep neural network model, which will then classify the unlabeled training data, producing weak labels that will be used in new training iterations. The final model will be evaluated with the validation subset. In this step, we will adapt scripts developed by our group for BAMBU.
- Filter ZINC - 1,000 most promising candidates: The ML model DB-diverse will be used to predict the probability of affinity for all ZINC molecules. The top 1,000 molecules, ranked by the model's prediction probability, will be selected, and this dataset will be called DB-candidates.
- Docking of DB-candidates: This ligand dataset will be docked to the crystal structure of NSP13 using the FrameworkVS described previously. In this step, it is very important to include receptor flexibility in the docking, which remains a challenge due to the many degrees of freedom involved. Therefore, we propose to study the target receptor and its flexibility using molecular dynamics (MD) simulations. We will evaluate the impact and importance of including receptor flexibility in the following docking steps by selecting snapshots from the MD simulation (using PCA or clustering algorithms) and/or by identifying the most flexible side chains, which will then be treated as flexible in docking.
- Re-scoring with RFL-Score: The docking results of DB-candidates will be re-scored with RFL-Score, the machine learning scoring function developed by our group. RFL-Score was trained with PDBbind and CSAR (Decoys and NRC HiQ). Molecular descriptors were obtained from these complexes using BINANA, RDKit 2D/3D, SASA, PaDEL and Vina. The most promising attributes were selected using LassoCV, and a Random Forest (RF) model was proposed. The RF model was validated with CASF 2013/2016 and performed very well compared to its counterparts in all evaluation types (scoring, docking, screening, and ranking power benchmarks). The top 500 docking results will be selected for the next step.
- Consensus scoring and final top 100 list: Other types of scoring functions, i.e., physics-based, empirical and knowledge-based, will also be used to re-score the ligands. These results will then be combined to produce a consensus score. We will use scoring functions such as Delta-Vina XGBoost, Convex-PL, KORP-PL, RF-Score and NNScore 2.0, among others. In addition, the pharmacokinetic properties of the 500 molecules will be calculated using the ADMETlab tool. Then, the top 100 molecules with the highest consensus score will be selected, excluding those with unfavorable pharmacokinetic/toxicological properties.
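
The hit-selection rule used after docking DB-diverse (Vina energies converted into z-scores and p-values, with hits at p-value < 0.05) can be sketched as follows. This is only an illustration under stated assumptions, not the FrameworkVS implementation: it assumes the energies are approximately normally distributed, that the p-value is one-sided (lower tail, since more negative Vina scores indicate stronger predicted binding), and the function name `select_hits` and the ligand identifiers are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def select_hits(energies, alpha=0.05):
    """Return {ligand: (z_score, p_value)} for ligands with significantly low energy.

    energies: dict mapping a ligand id to its Vina score in kcal/mol
    (more negative means stronger predicted binding).
    """
    mu = mean(energies.values())
    sigma = stdev(energies.values())  # sample standard deviation
    std_normal = NormalDist()  # standard normal, mean 0 and sd 1
    hits = {}
    for ligand, energy in energies.items():
        z = (energy - mu) / sigma
        # Assumption: one-sided, lower-tail p-value under a normal model.
        p = std_normal.cdf(z)
        if p < alpha:
            hits[ligand] = (z, p)
    return hits


# Hypothetical example: 19 average binders and one clearly stronger ligand.
example = {f"lig{i}": -6.0 - 0.1 * i for i in range(19)}
example["lig_strong"] = -11.0
example_hits = select_hits(example)
```

With this toy input only `lig_strong` passes the threshold. If the real energy distribution is clearly non-normal, an empirical percentile cutoff could replace the normal model without changing the rest of the step.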
AutoDock Vina 1.1.2, AutoDock Tools, PRIMoRDiA, RDKit, BINANA, PaDEL-Descriptor, ZINC Database, PDB Database, AlphaFold EBI Database, GROMACS, PyMOL, VMD, Python and Biopython
KADUKOVA, M.; DOS SANTOS MACHADO, KARINA; CHACON, P.; GRUDININ, S. KORP-PL: a coarse-grained knowledge-based scoring function for protein-ligand interactions. Bioinformatics, v. 1, p. 1-8, 2020.
SEUS, V. R.; SILVA JR., L. V.; GOMES, JORGE; SILVA, P. E. A.; PRATES, N. S.; ZANATTA, N.; WERHLI, A. V.; MACHADO, K. S. A Framework for Virtual Screening. In: ACM SAC 2016, 2016, Pisa, Italy. Proceedings of the 31st Annual ACM Symposium on Applied Computing. New York: Association for Computing Machinery, 2016. v. 1. p. 31-36.
DOS SANTOS, VITOR PIMENTEL; RODRIGUES, ANDRÉ; DUTRA, GABRIEL; BASTOS, LUANA; MARIANO, DIEGO; MENDONÇA, JOSÉ GUTEMBERGUE; LOBO, YAN JERÔNIMO GOMES; MENDES, EDUARDO; MAIA, GIOVANA; MACHADO, KARINA DOS SANTOS; WERHLI, ADRIANO VELASQUE; ROCHA, GERD; DE LIMA, LEONARDO HENRIQUE FRANÇA; DE MELO-MINARDI, RAQUEL. E-Volve: understanding the impact of mutations in SARS-CoV-2 variants spike protein on antibodies and ACE2 affinity through patterns of chemical interactions at protein interfaces. PeerJ, v. 10, p. e13099, 2022.
ROCHA, RAFAEL E. O.; CHAVES, ELTON J. F.; FISCHER, PEDRO H. C.; COSTA, LEON S. C.; GRILLO, IGOR BARDEN; DA CRUZ, LUIZ E. G.; GUEDES, FABIANA C.; DA SILVEIRA, CARLOS H.; SCOTTI, MARCUS T.; CAMARGO, ALEX D.; MACHADO, KARINA S.; WERHLI, ADRIANO V.; FERREIRA, RAFAELA S.; ROCHA, GERD B.; DE LIMA, LEONARDO H. F. A higher flexibility at the SARS-CoV-2 main protease active site compared to SARS-CoV and its potentialities for new inhibitor virtual screening targeting multi-conformers. Journal of Biomolecular Structure & Dynamics, v. -, p. 1-21, 2021.