Our approach proceeds in multiple stages that gradually funnel massive ligand libraries into hits, leads, and optimized leads. The stages combine data-driven methods early on with principle/physics-driven methods later, as detailed below.
0. Screening libraries. We will start with ZINC20 (https://zinc20.docking.org/tranches/home/), which contains about 1 billion compounds. By applying filters such as logP>3.5, molecular_weight>400Da, and heavy_atom_number<100, the size of the effective compound library would be reduced by almost half, to around 573 million. For hit identification, the library could be further reduced by at least an order of magnitude through clustering. For lead optimization, the library would also be reduced by focusing on cluster members of identified hits. As needed, we could perform a similar size reduction on Enamine REAL (https://enamine.net/compound-collections/real-compounds/real-database), which contains 2 billion compounds, or at least use this larger library during more focused lead optimization.
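As an illustration of the clustering step, the sketch below groups compounds by fingerprint similarity and keeps one head per cluster. It is a minimal example under stated assumptions, not the clustering method we commit to: the greedy leader algorithm, the Tanimoto threshold of 0.5, the function names, and the toy fingerprints (sets of "on" bit indices) are all hypothetical choices made for illustration.

```python
from typing import Dict, FrozenSet, List

# Assumption: each fingerprint is the set of indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit sets."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leader_cluster(
    fps: Dict[str, Fingerprint], threshold: float = 0.5
) -> Dict[str, List[str]]:
    """Greedy leader clustering.

    Each compound joins the first existing cluster head it is at least
    `threshold` similar to; otherwise it becomes a new cluster head.
    Returns a mapping from cluster head to cluster members (head included).
    """
    clusters: Dict[str, List[str]] = {}
    for cid, fp in fps.items():
        for head in clusters:
            if tanimoto(fp, fps[head]) >= threshold:
                clusters[head].append(cid)
                break
        else:
            # No sufficiently similar head was found.
            clusters[cid] = [cid]
    return clusters


# Toy library (hypothetical compound IDs and fingerprints).
toy_library = {
    "cpd_A": frozenset({1, 2, 3, 4}),
    "cpd_B": frozenset({1, 2, 3, 5}),   # Tanimoto to cpd_A: 3/5
    "cpd_C": frozenset({7, 8, 9}),
    "cpd_D": frozenset({7, 8, 9, 10}),  # Tanimoto to cpd_C: 3/4
}
clusters = leader_cluster(toy_library, threshold=0.5)
cluster_heads = list(clusters)
```

In this toy case the heads would be carried into hit identification, while the members of a hit's cluster would be revisited during lead optimization.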
1. Our own deep learning-based early-stage screening. We developed DeepAffinity (Bioinformatics 2019), a semi-supervised model that exploits massive unlabeled protein and ligand data, such as sequences and graphs, to pre-train molecular encoders, and that uses affinity-labeled protein-ligand pairs to jointly train the encoders and affinity predictors end-to-end. We also introduced joint attention mechanisms to explain affinity predictions in terms of residue-atom non-bonded interactions. DeepAffinity was later extended into DeepAffinity+ and DeepRelations to regularize and supervise the joint attention (JCIM 2021) and to use cross-modality and self-supervised learning (Bioinformatics 2022).
Considering the known, desired binding site of the target (the macrodomain of SARS-CoV-2 NSP3) and the rich structural information available, we will retrain or refine the latest DeepAffinity (Bioinformatics 2022) to embed the protein sequence input through pre-trained language models and, additionally, to embed the protein binding site input through roto-translationally equivariant neural networks. Training will use our previously curated dataset of over 3700 protein-compound pairs with affinity measurements and co-crystal structures. By using homology-based data splits and self-supervised pretraining, we expect to obtain an explainable and robust meta model for affinity and contact prediction, which would be tested against the target’s known fragments (54 crystal structures with fragments occupying the adenine binding cavity and 9 with fragments at the proximal ribose site) and known hits (inhibitors with low micromolar potency and, in some cases, cellular activity). The meta model will then be fine-tuned for the specific target using the data described above. The fine-tuned, target-specific deep learning model will be used differently for three tasks:
- For hit identification, it will be used as a filter on the initial compound library: the cluster heads of the initial screening library (expected to number about 10^8) will be screened using the meta model, and the top 10^4 candidates will be retained for the next stage of physics-based docking. Additionally, compounds containing the previously identified fragments or resembling the previously identified inhibitors will be screened with extra scrutiny.
- For virtual screening of merged lists, it will be used together with docking-based energy estimation for scoring. We will select the majority of compounds by consensus, using the sum of their ranks in the two scoring systems, and we will also retain a minority that stand out in either individual scoring system.
- For lead optimization, it will again be used together with docking-based energy estimation for scoring. Compared to hit identification, it will be applied to a different set of compounds: cluster members of identified hits from Enamine REAL. In addition, the attention mechanisms embedded in our models have proved useful (JCIM 2021) for predicting decomposed affinity contributions (which functional groups should be replaced and, if they are replaced by given candidates, what the resulting affinity prediction would be).
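The rank-sum consensus described for the merged-list screening can be sketched as follows. This is a minimal illustration rather than a finalized protocol: the function names, the selection sizes (`n_consensus`, `n_per_method`), the tie-breaking by compound ID, and the toy scores are hypothetical. It assumes a higher predicted affinity is better (for example, a pKd-like value) and a lower docking energy is better.

```python
from typing import Dict, List


def ranks(scores: Dict[str, float], higher_is_better: bool) -> Dict[str, int]:
    """Map each compound to its 1-based rank (1 = best)."""
    ordered = sorted(scores, key=lambda c: scores[c], reverse=higher_is_better)
    return {cid: i + 1 for i, cid in enumerate(ordered)}


def consensus_select(
    dl_affinity: Dict[str, float],
    docking_energy: Dict[str, float],
    n_consensus: int,
    n_per_method: int,
) -> List[str]:
    """Pick a consensus majority by rank sum, plus per-method standouts."""
    if dl_affinity.keys() != docking_energy.keys():
        raise ValueError("Both scoring systems must cover the same compounds")

    dl_rank = ranks(dl_affinity, higher_is_better=True)
    dock_rank = ranks(docking_energy, higher_is_better=False)

    # Majority: smallest rank sums (ties broken by compound ID).
    by_rank_sum = sorted(dl_rank, key=lambda c: (dl_rank[c] + dock_rank[c], c))
    selected = by_rank_sum[:n_consensus]

    # Minority: top compounds of each individual system not yet selected.
    for table in (dl_rank, dock_rank):
        top = sorted(table, key=lambda c: table[c])[:n_per_method]
        selected.extend(c for c in top if c not in selected)
    return selected


# Toy scores (hypothetical values).
dl_affinity = {"cpd_a": 8.0, "cpd_b": 7.8, "cpd_c": 6.0,
               "cpd_d": 9.0, "cpd_e": 5.0, "cpd_f": 5.5}
docking_energy = {"cpd_a": -9.0, "cpd_b": -8.8, "cpd_c": -10.0,
                  "cpd_d": -5.0, "cpd_e": -6.0, "cpd_f": -5.5}

picked = consensus_select(dl_affinity, docking_energy,
                          n_consensus=2, n_per_method=1)
```

In the toy data, cpd_d ranks first by predicted affinity but last by docking energy, so it misses the consensus cut and is retained only as a single-method standout.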
2. Structure-based docking and energy calculation/decomposition for the late stage. With only 10^3 to 10^4 candidates remaining after screening with the aforementioned deep learning models, we can afford slower, physics-driven methods such as structure-based docking and screening (AutoDock Vina) as well as energy calculation and decomposition (MM/PBSA). Molecular dynamics simulations would be performed for dozens of selected compounds for more mechanistic prediction of structures, dynamics, and activities, especially when lead optimization is needed. Positive controls for docking (and simulations) include the aforementioned fragments and inhibitors identified for the target, some of which have known co-crystal structures.
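For reference, the energy calculation and decomposition mentioned above can be written in the standard textbook MM/PBSA form below. The symbols are generic, and the exact protocol (for example, whether the entropy term is estimated or omitted, and single- versus multi-trajectory averaging) is an assumption left open here.

```latex
\begin{align}
\Delta G_{\mathrm{bind}} &= \langle G_{\mathrm{complex}} \rangle
  - \langle G_{\mathrm{receptor}} \rangle - \langle G_{\mathrm{ligand}} \rangle \\
G &= E_{\mathrm{MM}} + G_{\mathrm{solv}} - T S \\
E_{\mathrm{MM}} &= E_{\mathrm{bonded}} + E_{\mathrm{elec}} + E_{\mathrm{vdW}} \\
G_{\mathrm{solv}} &= G_{\mathrm{PB}} + G_{\mathrm{SA}}
\end{align}
```

Here the angle brackets denote averages over simulation snapshots, G_PB is the polar solvation term from the Poisson-Boltzmann equation, and G_SA is the nonpolar term estimated from solvent-accessible surface area. Decomposition then attributes the non-entropic terms to individual residues or ligand groups, which complements the attention-based affinity decomposition from the deep learning stage.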