Our approach proceeds in multiple stages that gradually funnel massive ligand libraries into hits, leads, and optimized leads. The earlier stages rely on data-driven methods and the later stages on first-principles, physics-driven methods, as detailed below.
0. Screening libraries. We will start with ZINC20 (https://zinc20.docking.org/tranches/home/), which contains about 1 billion compounds. Applying filters such as logP>3.5, molecular_weight>400Da, and heavy_atom_number<100 would reduce the effective compound library by almost half, to around 573 million. For hit identification, the library could be further reduced by at least an order of magnitude by clustering and screening only the cluster heads. For lead optimization, the library would also be reduced by focusing on cluster members of identified hits. As needed, we could perform a similar size reduction on Enamine REAL (https://enamine.net/compound-collections/real-compounds/real-database), which contains about 2 billion compounds, or at least use this larger library during the more focused lead optimization.
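The clustering step above can be illustrated with a minimal sketch of greedy leader clustering on fingerprint similarity. The proposal does not name a clustering algorithm, so the choice of leader clustering, the Tanimoto threshold, the representation of fingerprints as sets of on-bit indices, and all function and compound names here are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leader_cluster(
    fingerprints: Dict[str, Set[int]], threshold: float = 0.5
) -> Dict[str, List[str]]:
    """Greedy leader clustering (an assumed, illustrative choice).

    Each compound joins the first existing cluster head it resembles at or
    above `threshold`; otherwise it becomes the head of a new cluster.
    Returns a mapping from cluster head to cluster members (head included).
    """
    clusters: Dict[str, List[str]] = {}
    for name, fp in fingerprints.items():
        for head in clusters:
            if tanimoto(fp, fingerprints[head]) >= threshold:
                clusters[head].append(name)
                break
        else:
            # No sufficiently similar head was found: start a new cluster.
            clusters[name] = [name]
    return clusters


# Hypothetical toy fingerprints; real ones would come from a cheminformatics toolkit.
toy = {
    "cpd_a": {1, 2, 3, 4},
    "cpd_b": {1, 2, 3, 5},
    "cpd_c": {10, 11, 12},
    "cpd_d": {10, 11, 12, 13},
}
clusters = leader_cluster(toy, threshold=0.5)
heads = list(clusters)  # the compounds that would go into hit-identification screening
```

In this picture, hit identification would screen only `heads`, while lead optimization would revisit the member lists of the clusters whose heads turned out to be hits.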
1. Our own deep learning-based, structure-free screening for the early stage (Bioinformatics 2019, 2022). Using protein sequences and compound chemical graphs as inputs, our deep learning models exploit massive unlabeled molecular data to pretrain molecular encoders, then use affinity-labeled protein-compound pairs to fine-tune the encoders and jointly predict affinities. Joint attention mechanisms are introduced to explain affinity predictions in terms of residue-atom non-bonded interactions.
We will first re-tool our methods from affinity prediction to target-based compound screening. We will split datasets DUD-E and LIT-PCBA into non-homologous target series, revise the loss function for target-specific compound classification/ranking, and re-train our deep learning models accordingly.
We will then progressively fine-tune this retrained model, the meta-screener, first to classify/rank GPCR antagonists and then MCHR1 antagonists, using the GPCR subset of BindingDB data and the two compound libraries for MCHR1, respectively.
The fine-tuned target-specific deep learning model will be used differently in each of the three tasks:
- For hit identification, it will be used as a filter on the initial compound library: the cluster heads of the initial screening library (expected to number about 10^8) will be screened with the fine-tuned model, and the top 10^4 candidates will be retained for the next stage of physics-based docking. Additionally, compounds that contain previously identified fragments or resemble previously identified inhibitors will be screened with extra scrutiny.
- For virtual screening of merged lists, it will be used together with docking-based energy estimation for scoring. Besides emphasizing novelty, we will select the majority of compounds by consensus, using the sum of their ranks under the two scoring systems, and we will also retain a minority that stand out under either individual scoring system.
- For lead optimization, it will again be used together with docking-based energy estimation for scoring. Unlike in hit identification, it will be applied to a different set of compounds: cluster members of the identified hits, drawn from Enamine REAL.
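The rank-sum consensus described for the merged lists can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy scores, the tie-breaking by compound name, and the rule that "standing out" means ranking in the top few under a single scoring system are all hypothetical and not specified in the proposal.

```python
from typing import Dict, List, Tuple


def ranks(scores: Dict[str, float], higher_is_better: bool) -> Dict[str, int]:
    """Map each compound to its 1-based rank under one scoring system."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: i + 1 for i, name in enumerate(ordered)}


def select_candidates(
    dl_scores: Dict[str, float],      # deep learning scores, higher is better
    dock_energies: Dict[str, float],  # docking energies, lower is better
    n_consensus: int,
    n_standout: int,
) -> Tuple[List[str], List[str]]:
    """Pick a consensus majority by rank sum plus a minority of single-system standouts."""
    dl_rank = ranks(dl_scores, higher_is_better=True)
    dock_rank = ranks(dock_energies, higher_is_better=False)
    names = sorted(dl_scores)
    # Consensus: smallest sum of ranks first; ties broken by name (an assumption).
    by_sum = sorted(names, key=lambda n: (dl_rank[n] + dock_rank[n], n))
    consensus = by_sum[:n_consensus]
    # Standouts: top compounds of each individual system not already chosen.
    standouts: List[str] = []
    for rank in (dl_rank, dock_rank):
        for name in sorted(names, key=rank.get)[:n_standout]:
            if name not in consensus and name not in standouts:
                standouts.append(name)
    return consensus, standouts


# Hypothetical toy scores for six compounds.
dl = {"c1": 0.90, "c2": 0.80, "c3": 0.10, "c4": 0.95, "c5": 0.30, "c6": 0.20}
dock = {"c1": -9.0, "c2": -8.5, "c3": -10.5, "c4": -5.0, "c5": -6.0, "c6": -7.0}
consensus, standouts = select_candidates(dl, dock, n_consensus=2, n_standout=1)
```

Here `c1` and `c2` rank well under both systems and form the consensus, while `c4` (best deep learning score) and `c3` (best docking energy) are retained as standouts despite poor ranks in the other system.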
2. Our new structure-based deep learning methods for docking and de novo design.
Although the target structure is not available and the AlphaFold Database prediction for the extracellular domain remains low-confidence, SwissModel provides predictions based on quality homologous structures, such as agonist-bound human SSTR2 (PDB ID: 7WIG; sequence identity 35%) and human OPRD1 (PDB ID: 4N6H; sequence identity 29%). We will use these homology models as starting structures, dock multiple known MCHR1 antagonists (AutoDock Vina), and run molecular dynamics (NAMD2) to derive antagonist-bound conformations of MCHR1.
Given an ensemble of antagonist-bound MCHR1 conformations, we will (1) perform high-throughput docking of the 10^3 to 10^4 candidates screened with the aforementioned deep learning models; and (2) design de novo compounds using a target structure-conditioned diffusion model that simultaneously searches for molecular entities and poses. We will verify and filter the de novo designs using high-throughput docking, and then use the molecular fingerprints of the top designs as queries to search the library for an additional 10^3 compounds. With 10^4 candidates in total, we can afford slower, physics-driven methods such as structure-based docking and screening (AutoDock Vina) as well as energy calculation and decomposition (MM/PBSA). Molecular dynamics for hundreds of selected compounds would be performed for more mechanistic prediction of structures, dynamics, and activities, especially to validate the antagonism.
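For reference, the MM/PBSA energy calculation and decomposition mentioned above is commonly written in the following general form. The symbols are notation assumed here rather than taken from the proposal, and the proposal does not state which terms (for example, the entropy term) would be included in practice.

```latex
% Requires amsmath. Angle brackets denote averages over MD snapshots.
\begin{align}
\Delta G_{\text{bind}} &= \langle G_{\text{complex}} \rangle
                        - \langle G_{\text{receptor}} \rangle
                        - \langle G_{\text{ligand}} \rangle \\
% Free energy of each species x: molecular mechanics energy, polar
% (Poisson-Boltzmann) and nonpolar (surface area) solvation, and entropy.
G_{x} &= E_{\text{MM}} + G_{\text{PB}} + G_{\text{SA}} - T S \\
% Decomposition: approximate per-residue contributions to binding.
\Delta G_{\text{bind}} &\approx \sum_{i \in \text{residues}} \Delta G_{i}
\end{align}
```

The per-residue terms in the last line are what would allow the calculated energies to be compared against the residue-atom interactions highlighted by the attention-based deep learning models.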