We propose a large-scale library screening workflow that exhaustively screens the 4.5-billion-compound Enamine REAL database with a deep-learning-based drug-target interaction (DTI) prediction engine to identify molecules likely to bind the RNA binding site of the SARS-CoV-2 NSP13 helicase.
Recently, DTI tools have emerged as a new class of predictive drug discovery algorithms that train on large datasets of protein-ligand binding pairs available through bioactivity databases (e.g., ChEMBL, BindingDB, STITCH, and GOSTAR). At their core, DTI models feed protein and ligand feature vectors into neural networks that are trained to identify interacting pairs. The proposed DTI model differs from other literature-reported solutions in that it systematically maps DTI pairs derived from bioactivity databases onto 3D protein structures to generate local, site-specific structural protein features. These local, site-specific 3D features are designed to improve generalization across proteins and performance on novel targets. Model predictions are therefore structurally informed, but ligand-independent.
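As a minimal illustration of the generic idea described above (combining a protein feature vector and a ligand feature vector in a neural network), the sketch below concatenates the two vectors and passes them through a small feed-forward network. It is not the proposed model: the feature values, layer sizes, and untrained random weights are hypothetical placeholders, and all function names are illustrative.

```python
import math
import random

def dense(x, weights, biases):
    """One fully connected layer: y_j = sum_i x_i * W[i][j] + b_j."""
    return [sum(xi * w_row[j] for xi, w_row in zip(x, weights)) + biases[j]
            for j in range(len(biases))]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def init_layer(n_in, n_out, rng):
    """Small random weights; a real model would learn these from known DTI pairs."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out

def predict_interaction(pocket_features, ligand_features, params):
    """Score one (pocket, ligand) pair as a value in (0, 1)."""
    # Combine the two feature vectors by simple concatenation (an assumption;
    # other fusion schemes are possible).
    x = list(pocket_features) + list(ligand_features)
    (w1, b1), (w2, b2) = params
    hidden = relu(dense(x, w1, b1))
    return sigmoid(dense(hidden, w2, b2)[0])

rng = random.Random(0)
pocket = [0.2, 0.7, 0.1, 0.9]             # hypothetical site-specific 3D descriptors
ligand = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]   # hypothetical fingerprint bits
params = [init_layer(len(pocket) + len(ligand), 8, rng), init_layer(8, 1, rng)]
score = predict_interaction(pocket, ligand, params)
print(f"predicted interaction score: {score:.3f}")
```

Because the pocket features enter the network independently of any particular ligand, the same pocket representation can be reused when scoring every compound in a library.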
For the CACHE hit identification stage, the workflow will consist of the following steps:
- Create a machine learning pocket representation of the RNA binding site of SARS-CoV-2 NSP13 helicase that incorporates multiple data sources, including the 3D structure of the binding pocket and functional annotations for the NSP13 helicase.
- The NSP13 RNA binding site will be screened against all 4.5 billion molecules of the Enamine REAL database using the DTI prediction model to identify candidate hit molecules.
- The 15,000 highest-scoring molecules will be filtered by physicochemical properties, molecular docking score, predicted ADMET properties, and predicted off-target activity against human proteins. Off-target activity will be predicted through a counter-screen that applies the aforementioned DTI model to a human proteome composed of 79,817 P2Rank-predicted pockets from 16,818 AlphaFold2-modelled structures made available through the EBI AlphaFold2 (AF2) repository.
- Molecules that pass the filtering process will be clustered by fingerprint similarity. The final 100 compounds will be selected from the representatives of each cluster to maintain structural diversity.
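
The selection steps above (keep the top-scoring molecules, filter them, cluster by fingerprint similarity, pick one representative per cluster) can be sketched as follows. This is a toy illustration under stated assumptions, not the proposed implementation: the proposal does not specify a clustering algorithm, so greedy leader clustering with Tanimoto similarity is used here as a stand-in, and the compound IDs, scores, fingerprints, filter, and threshold are all hypothetical.

```python
import heapq

def top_k(scored_stream, k):
    """Keep the k highest-scoring (score, compound_id) pairs from a stream
    without holding the whole library in memory."""
    return heapq.nlargest(k, scored_stream)

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def leader_cluster(ranked_ids, fingerprints, threshold):
    """Greedy leader clustering: walk compounds in score order; a compound joins
    the first cluster whose leader it resembles (similarity >= threshold),
    otherwise it founds a new cluster. Returns {leader_id: [member_ids]}."""
    clusters = {}
    for cid in ranked_ids:
        for leader in clusters:
            if tanimoto(fingerprints[cid], fingerprints[leader]) >= threshold:
                clusters[leader].append(cid)
                break
        else:
            clusters[cid] = [cid]
    return clusters

def select_diverse(scored_stream, fingerprints, passes_filters, k, n_final, threshold):
    """Top-k by score, then filter, then cluster, then one representative per cluster."""
    shortlist = [cid for _, cid in top_k(scored_stream, k)]
    kept = [cid for cid in shortlist if passes_filters(cid)]
    clusters = leader_cluster(kept, fingerprints, threshold)
    # Each leader is the best-scoring member of its cluster (dicts keep insertion order).
    return list(clusters)[:n_final]

# Hypothetical toy data standing in for DTI scores and fingerprints.
scores = [(0.91, "A"), (0.15, "B"), (0.88, "C"), (0.86, "D"), (0.40, "E"), (0.83, "F")]
fps = {"A": {1, 2, 3, 4}, "B": {9}, "C": {1, 2, 3, 5},
       "D": {10, 11, 12}, "E": {20}, "F": {10, 11, 13}}
flagged = {"D"}  # e.g. rejected by a property, docking, ADMET, or off-target filter

picks = select_diverse(iter(scores), fps, lambda cid: cid not in flagged,
                       k=4, n_final=2, threshold=0.5)
print(picks)  # ['A', 'F'] for this toy data
```

In the workflow described above, `k` would correspond to 15,000 and `n_final` to 100. Walking the compounds in score order means each cluster representative is also the highest-scoring member of its cluster.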
MacKinnon, S. S., Madani Tonekaboni, S. A., & Windemuth, A. (2021). Proteome-scale drug-target interaction predictions: Approaches and applications. Current Protocols, 1, e302. https://doi.org/10.1002/cpz1.302
Krivák, R., & Hoksza, D. (2018). P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. Journal of Cheminformatics, 10, 39. https://doi.org/10.1186/s13321-018-0285-8
Varadi, M., Anyango, S., Deshpande, M., Nair, S., Natassia, C., Yordanova, G., Yuan, D., Stroe, O., Wood, G., Laydon, A., Žídek, A., Green, T., Tunyasuvunakool, K., Petersen, S., Jumper, J., Clancy, E., Green, R., Vora, A., Lutfi, M., Figurnov, M., Cowie, A., Hobbs, N., Kohli, P., Kleywegt, G., Birney, E., Hassabis, D., & Velankar, S. (2022). AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50(D1), D439–D444. https://doi.org/10.1093/nar/gkab1061