Computational methods

Hit Identification

Method type (check all that apply)

Deep learning

High-throughput docking

Machine learning

Description of your approach (min 200 and max 800 words)

Selection of 10,000: compounds available for purchase (in stock) will be obtained from the ZINC database, and Morgan fingerprints will be computed for them using RDKit through BAMBU (https://pypi.org/project/bambu-qsar/). The dimensionality of the dataset and the influence of outliers will be reduced using Principal Component Analysis (PCA), preserving 95% of the variance, followed by the UMAP algorithm, reducing to two dimensions. Subsequently, the molecules will be clustered using HDBSCAN. For each cluster, a centroid will be chosen as the molecule with the smallest mean Euclidean distance to the other cluster members. A total of 10,000 molecules will be selected in this step while maintaining a uniform proportion of clusters in the dataset (DB-diverse). With DB-diverse, we aim to preserve the diversity of the selected molecules.
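The centroid choice and the per-cluster selection can be sketched as follows. This is a minimal illustration, not the BAMBU implementation: it assumes the 2D embedding and the cluster labels have already been computed, reads "uniform proportion of clusters" as an equal quota per cluster, and uses the hypothetical function names cluster_medoid and select_diverse.

```python
import math
from collections import defaultdict


def cluster_medoid(points):
    """Index of the point with the smallest mean Euclidean distance to the cluster."""
    best_idx, best_mean = 0, math.inf
    for i, p in enumerate(points):
        mean_dist = sum(math.dist(p, q) for q in points) / len(points)
        if mean_dist < best_mean:
            best_idx, best_mean = i, mean_dist
    return best_idx


def select_diverse(embedding, labels, n_select):
    """Pick about n_select molecule indices with an equal quota per cluster."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # HDBSCAN labels noise points as -1; they are skipped here
            clusters[label].append(idx)
    if not clusters:
        return []
    per_cluster = max(1, n_select // len(clusters))  # assumed reading of "uniform"
    selected = []
    for label in sorted(clusters):
        members = clusters[label]
        points = [embedding[i] for i in members]
        medoid = points[cluster_medoid(points)]
        # Take the medoid first, then its nearest neighbours within the cluster
        ranked = sorted(members, key=lambda i: math.dist(embedding[i], medoid))
        selected.extend(ranked[:per_cluster])
    return selected


# Toy 2D embedding: two clusters of three points and one noise point
embedding = [(0, 0), (0, 1), (0, 2), (10, 10), (10, 11), (10, 12), (50, 50)]
labels = [0, 0, 0, 1, 1, 1, -1]
print(select_diverse(embedding, labels, 2))  # [1, 4]
```

The medoid search is quadratic in cluster size, so for very large clusters a subsample or a vectorized distance computation would be needed.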
Exploration of DB-diverse: With these pre-selected ligands, we will build a set of molecular descriptors using RDKit 2D/3D, BINANA, PaDEL, etc. We will also compute Conceptual Density Functional Theory reactivity descriptors with PRIMoRDiA. We will examine the values of these molecular descriptors to better understand their variance, properties, correlations, etc. Based on these analyses, we can apply a new filtering step or return to step 1 and select more ligands from the clusters.
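One simple form this exploration could take is a variance and correlation screen over the descriptor table. The sketch below is only an assumption about how such a check might look; the thresholds, the descriptor names and the function screen_descriptors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def variance(values):
    """Population variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def screen_descriptors(table, min_var=1e-8, max_abs_r=0.95):
    """table maps descriptor name -> list of values, one per molecule."""
    # Near-constant descriptors carry no information and are dropped first
    informative = [name for name, v in table.items() if variance(v) > min_var]
    redundant_pairs = []
    for i, a in enumerate(informative):
        for b in informative[i + 1:]:
            r = pearson(table[a], table[b])
            if abs(r) > max_abs_r:
                redundant_pairs.append((a, b, round(r, 3)))
    return informative, redundant_pairs


# Illustrative values only, not real descriptor data
table = {
    "mol_weight": [100, 200, 300, 400],
    "heavy_atoms": [7, 14, 21, 28],
    "logp": [1.0, -0.5, 2.0, 0.3],
    "constant_flag": [1, 1, 1, 1],
}
informative, redundant = screen_descriptors(table)
```

Highly correlated pairs flagged this way are candidates for removal before the filtering step.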
Docking DB-diverse: Using our virtual screening (VS) tool, FrameworkVS, we will perform molecular docking simulations with AutoDock Vina 1.1.2 on the ligands selected in DB-diverse. In this tool, the user configures an experiment on a web-based platform and specifies the paths of the input and output files (receptor (1,N) and ligands (1,N)), as well as the size, center, and variation of the binding site(s). The framework then generates a Python script that runs the VS experiment on the user's computer. The new version of FrameworkVS will be available on our GitHub soon. The energy values obtained with Vina in this step will be converted into z-scores and p-values, and ligands with high interaction energy (low dissociation energy) and p-value < 0.05 will be selected as positive hits.
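The conversion of docking energies into z-scores and p-values could look like the sketch below. It assumes the scores are roughly normally distributed and uses a one-sided test in which more negative Vina scores (kcal/mol) indicate stronger predicted binding; the function name docking_hits and the ligand identifiers are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def docking_hits(energies, alpha=0.05):
    """energies maps ligand id -> Vina score in kcal/mol."""
    values = list(energies.values())
    mu, sigma = mean(values), stdev(values)
    standard_normal = NormalDist()
    stats = {}
    for ligand, energy in energies.items():
        z = (energy - mu) / sigma
        # One-sided p-value: probability of a score this low or lower
        stats[ligand] = (z, standard_normal.cdf(z))
    hits = [lig for lig, (_, p) in stats.items() if p < alpha]
    hits.sort(key=lambda lig: stats[lig][0])  # most negative z-score first
    return stats, hits


# Toy example: nineteen average ligands and one strong binder
energies = {f"lig{i}": -7.0 for i in range(19)}
energies["strong"] = -11.0
stats, hits = docking_hits(energies)
print(hits)  # ['strong']
```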
ML model DB-diverse: using the docking results of DB-diverse, we will prepare a dataset consisting of Morgan fingerprints generated by RDKit through BAMBU, with the z-score as the target attribute. This dataset will be divided into training and validation subsets (75%/25%), and a semi-supervised learning algorithm will be applied to the training data for classification. The labeled training data will be used to train a deep neural network model, which will then classify the unlabeled training data, producing weak labels that will be used in new iterations of training. The final model will be evaluated on the validation subset. In this step, we will adapt scripts developed by our group for BAMBU.
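The weak-labeling loop can be illustrated with a generic self-training sketch. To stay dependency-free, a simple nearest-centroid classifier stands in for the deep neural network; this is not the model we will use. The confidence threshold, the number of rounds and the function names fit_centroid_model and self_train are likewise assumptions.

```python
import math


def fit_centroid_model(X, y):
    """Stand-in classifier (not a neural network): probability from centroid distances."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def predict_proba(x):
        # Logistic squashing: closer to the class-1 centroid gives p > 0.5
        return 1.0 / (1.0 + math.exp(math.dist(x, c1) - math.dist(x, c0)))

    return predict_proba


def self_train(fit, X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=5):
    """Iteratively add confidently predicted unlabeled samples as weak labels."""
    X_train, y_train = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    model = fit(X_train, y_train)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            p = model(x)
            if p >= threshold:
                confident.append((x, 1))
            elif p <= 1.0 - threshold:
                confident.append((x, 0))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to learn from
        for x, weak_label in confident:
            X_train.append(x)
            y_train.append(weak_label)
        pool = remaining
        model = fit(X_train, y_train)  # retrain on labeled + weakly labeled data
    return model, len(X_train) - len(X_lab)


X_lab = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
y_lab = [0, 0, 1, 1]
X_unlab = [(0.2, 0.4), (5.1, 5.4), (2.5, 3.0)]
model, n_weak = self_train(fit_centroid_model, X_lab, y_lab, X_unlab)
```

Because self_train only needs a fit function that returns a probability predictor, the stand-in can be replaced by any model, including a neural network, without changing the loop.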
Filter ZINC - 1,000 most promising candidates: the ML model trained on DB-diverse will be used to predict the probability of affinity for all ZINC molecules. The top 1,000 molecules, ranked by the model's predicted probability, will be selected; this dataset is called DB-candidates.
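Since ZINC is too large to hold all predictions in memory comfortably, the ranking can be done in a streaming fashion. The following is a minimal sketch under that assumption; top_candidates and the identifiers in the example are hypothetical.

```python
import heapq


def top_candidates(scored_molecules, k=1000):
    """scored_molecules: any iterable of (molecule_id, predicted_probability) pairs."""
    # nlargest keeps at most k items in memory while consuming the stream
    return heapq.nlargest(k, scored_molecules, key=lambda pair: pair[1])


# Toy stream of ten scored molecules (illustrative ids and probabilities)
stream = ((f"mol{i:03d}", i / 10) for i in range(10))
best = top_candidates(stream, k=3)
print(best)  # [('mol009', 0.9), ('mol008', 0.8), ('mol007', 0.7)]
```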
Docking of DB-candidates: This ligand dataset will be docked to the crystal structure of NSP13 using the FrameworkVS described above. In this step, it is very important to include receptor flexibility in the docking, which remains a challenge because of the many degrees of freedom involved. Therefore, we propose to study the target receptor and its flexibility using molecular dynamics (MD) simulations. We will evaluate the impact and importance of including receptor flexibility in the following docking steps by selecting some snapshots from the MD simulation (using PCA or clustering algorithms) and/or identifying the most flexible side chains to be treated as flexible in docking.
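One common way to identify flexible side chains is the per-atom root mean square fluctuation (RMSF) over the trajectory. The sketch below assumes the frames have already been aligned to a common reference and that an atom-to-residue mapping is available; per_atom_rmsf, most_flexible and the residue labels are hypothetical, and in practice a trajectory analysis tool would normally compute this.

```python
import math


def per_atom_rmsf(frames):
    """frames: list of snapshots, each a list of (x, y, z) tuples, pre-aligned."""
    n_frames = len(frames)
    n_atoms = len(frames[0])
    rmsf = []
    for atom in range(n_atoms):
        mean_pos = [sum(f[atom][d] for f in frames) / n_frames for d in range(3)]
        # Mean squared deviation of this atom from its average position
        msd = sum(math.dist(f[atom], mean_pos) ** 2 for f in frames) / n_frames
        rmsf.append(math.sqrt(msd))
    return rmsf


def most_flexible(rmsf, atom_to_residue, top_n=5):
    """Rank residues by their largest per-atom RMSF."""
    by_residue = {}
    for value, residue in zip(rmsf, atom_to_residue):
        by_residue[residue] = max(value, by_residue.get(residue, 0.0))
    return sorted(by_residue, key=by_residue.get, reverse=True)[:top_n]


# Two frames, two atoms: the first atom is fixed, the second moves along x
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
rmsf = per_atom_rmsf(frames)
flexible = most_flexible(rmsf, ["RES1", "RES2"], top_n=1)
```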
Re-scoring with RFL-Score: the docking results of DB-candidates will be re-scored with RFL-Score, the machine learning scoring function developed by our group. RFL-Score was trained with PDBbind and CSAR (Decoys and NRC HiQ). Molecular descriptors were obtained from these complexes using BINANA, RDKit 2D/3D, SASA, PaDEL and Vina. The most promising attributes were selected using LassoCV, and a Random Forest (RF) model was built on them. The RF model was validated with CASF 2013/2016 and performed very well compared to its counterparts in all evaluation types (scoring, docking, screening, and ranking power benchmarks). The top 500 docking results will be selected for the next step.
Consensus scoring and final top 100 list: Other types of scoring functions (physics-based, empirical and knowledge-based) will also be used to re-score the ligands, and the results will be combined into a consensus score. We will use scoring functions such as Delta-Vina XGBoost, Convex-PL, KORP-PL, RF-Score and NNScore 2.0, among others. The pharmacokinetic properties of the 500 molecules will also be calculated using the ADMETlab tool. Finally, the top 100 molecules with the highest consensus score will be selected, excluding those with unfavorable pharmacokinetic/toxicological properties.
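The combination rule for the consensus score is not fixed above. One possible choice, shown here purely as an assumption, is to average per-function z-scores after orienting each function so that higher means better, and then to drop ligands flagged by the ADMET analysis. The function consensus_ranking and the scoring-function labels in the example are hypothetical.

```python
from statistics import mean, pstdev


def consensus_ranking(scores, lower_is_better, admet_ok, top_n=100):
    """
    scores: scoring-function name -> {ligand: raw score}
    lower_is_better: scoring-function name -> bool (True for energy-like scores)
    admet_ok: ligand -> bool (False excludes the ligand from the final list)
    """
    ligands = sorted(admet_ok)
    consensus = {lig: 0.0 for lig in ligands}
    for name, per_ligand in scores.items():
        values = [per_ligand[lig] for lig in ligands]
        mu, sigma = mean(values), pstdev(values)
        sign = -1.0 if lower_is_better[name] else 1.0
        for lig in ligands:
            z = (per_ligand[lig] - mu) / sigma if sigma > 0 else 0.0
            consensus[lig] += sign * z / len(scores)  # average oriented z-score
    kept = [lig for lig in ligands if admet_ok[lig]]
    return sorted(kept, key=lambda lig: consensus[lig], reverse=True)[:top_n]


# Illustrative values for three ligands and two scoring functions
scores = {
    "energy_like": {"A": -9.0, "B": -7.0, "C": -8.0},
    "affinity_like": {"A": 7.5, "B": 5.0, "C": 6.0},
}
lower_is_better = {"energy_like": True, "affinity_like": False}
admet_ok = {"A": True, "B": True, "C": False}
ranking = consensus_ranking(scores, lower_is_better, admet_ok)
print(ranking)  # ['A', 'B']
```

Rank averaging would be an equally simple alternative that is less sensitive to outlying scores.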

What makes your approach stand out from the community? (<100 words)

The group's experience with molecular docking and machine learning. Karina S. Machado has been working with docking and machine learning since 2005, specifically applying docking to different targets, flexible receptors in docking, and scoring functions. Frederico Kremer is developing BAMBU and has been working with QSAR, virtual screening, and machine learning since 2018. In addition, the group has been working with proteins from SARS-CoV-2 since March 2020.

Method Name

RFL-Bambu

Free software packages used

AutoDock Vina 1.1.2, AutoDock Tools, PRIMoRDiA, RDKit, BINANA, PaDEL Descriptors, ZINC Database, PDB Database, AlphaFold EBI Database, GROMACS, PyMOL, VMD, Python and Biopython

Relevant publications of previous uses by your group of this software/method

KADUKOVA, M. ; DOS SANTOS MACHADO, KARINA ; CHACON, P. ; GRUDININ, S. KORP-PL: a coarse-grained knowledge-based scoring function for protein-ligand interactions. BIOINFORMATICS, v. 1, p. 1-8, 2020.

SEUS, V. R. ; SILVA JR., L. V. ; GOMES, JORGE ; SILVA, P. E. A. ; Prates, N. S. ; ZANATTA, N. ; WERHLI, A. V. ; MACHADO, K. S. A Framework for Virtual Screening. In: ACM SAC 2016, 2016, Pisa, Italy. Proceedings of the 31st Annual ACM Symposium on Applied Computing. New York: Association for Computing Machinery, 2016. v. 1. p. 31-36.

DOS SANTOS, VITOR PIMENTEL ; RODRIGUES, ANDRÉ ; DUTRA, GABRIEL ; BASTOS, LUANA ; MARIANO, DIEGO ; MENDONÇA, JOSÉ GUTEMBERGUE ; LOBO, YAN JERÔNIMO GOMES ; MENDES, EDUARDO ; MAIA, GIOVANA ; MACHADO, KARINA DOS SANTOS ; WERHLI, ADRIANO VELASQUE ; ROCHA, GERD ; DE LIMA, LEONARDO HENRIQUE FRANÇA ; DE MELO-MINARDI, RAQUEL. E-Volve: understanding the impact of mutations in SARS-CoV-2 variants spike protein on antibodies and ACE2 affinity through patterns of chemical interactions at protein interfaces. PeerJ, v. 10, p. e13099, 2022.

ROCHA, RAFAEL E. O. ; CHAVES, ELTON J. F. ; FISCHER, PEDRO H. C. ; COSTA, LEON S. C. ; GRILLO, IGOR BARDEN ; DA CRUZ, LUIZ E. G. ; GUEDES, FABIANA C. ; DA SILVEIRA, CARLOS H. ; SCOTTI, MARCUS T. ; CAMARGO, ALEX D. ; MACHADO, KARINA S. ; WERHLI, ADRIANO V. ; FERREIRA, RAFAELA S. ; ROCHA, GERD B. ; DE LIMA, LEONARDO H. F. A higher flexibility at the SARS-CoV-2 main protease active site compared to SARS-CoV and its potentialities for new inhibitor virtual screening targeting multi-conformers. JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS, v. -, p. 1-21, 2021.

Hit Optimization Methods

Method type (check all that apply)

Deep learning

High-throughput docking

Machine learning

Description of your approach (min 200 and max 800 words)

Our proposed approach for this step is similar to the one used for hit identification, but we will re-train the machine learning models in some steps considering the results of the previous selection. We will also include a step for optimization using quantum chemistry.

- Our RFL-Score scoring function will be retrained including the subset of ligands proposed in the hit identification together with their corresponding experimental data. We will evaluate the impact of this on the scoring function's performance and assess whether a target-specific model is promising.

- We are developing a new scoring function for molecular docking using descriptors generated by the PRIMoRDiA software. This software calculates global reactivity descriptors (hardness, softness, ionization potential, electron affinity, and six more related to electronic energies), local reactivity descriptors (four working methods for local hardness, local softness, hyper-softness, local electrophilicity, multiphilicity, Fukui functions, electron density, and other common local electrostatic properties), and local reactivity descriptors for residues from biological systems, total electron density and molecular orbitals. If this function is trained and tested during this challenge, we will use it in the hit optimization step.
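For orientation, some of the global reactivity descriptors named above can be written as finite differences of electronic energies. The sketch below assumes energies of the N-1, N and N+1 electron systems at a fixed geometry and follows the convention in which hardness is (IP - EA)/2. Other conventions differ by a factor of 2, and this is not necessarily how PRIMoRDiA computes or scales these quantities. The function name global_descriptors is hypothetical.

```python
def global_descriptors(e_cation, e_neutral, e_anion):
    """Energies (same unit, e.g. eV) of the N-1, N and N+1 electron systems."""
    ionization_potential = e_cation - e_neutral
    electron_affinity = e_neutral - e_anion
    # Electronic chemical potential, the negative of Mulliken electronegativity
    chemical_potential = -(ionization_potential + electron_affinity) / 2.0
    # Hardness in the (IP - EA) / 2 convention (assumed here)
    hardness = (ionization_potential - electron_affinity) / 2.0
    softness = 1.0 / (2.0 * hardness)  # equals 1 / (IP - EA)
    return {
        "ionization_potential": ionization_potential,
        "electron_affinity": electron_affinity,
        "chemical_potential": chemical_potential,
        "hardness": hardness,
        "softness": softness,
    }


# Illustrative energies only: IP = 10, EA = 1
descriptors = global_descriptors(e_cation=-90.0, e_neutral=-100.0, e_anion=-101.0)
```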

What makes your approach stand out from the community? (<100 words)

Same as for hit identification.

Method Name

RFL-Primordia-Bambu

Commercial software packages used

Same as for hit identification.

Free software packages used

Same as for hit identification.

Relevant publications of previous uses by your group of this software/method

Same as for hit identification.

Challenge #2