Computational methods

Hit Identification

Method type (check all that apply)

Deep learning

Machine learning

Hybrid of the above

Massive library screening using structurally-augmented Drug-Target Interaction (DTI) prediction models

Description of your approach (min 200 and max 800 words)

We propose a massive-library screening workflow that exhaustively screens the 4.5-billion-compound Enamine REAL database with a deep-learning-based Drug-Target Interaction (DTI) prediction engine to identify molecules likely to bind SARS-CoV-2 Nsp3.

DTI tools have recently emerged as a new class of predictive drug discovery algorithms [1] that train on large datasets of protein-ligand binding pairs drawn from bioactivity databases (ChEMBL, BindingDB, STITCH, and GOSTAR). At their core, DTI models feed protein and ligand feature vectors into neural networks that are trained to identify interacting pairs. The proposed DTI model differs from other literature-reported solutions in that it systematically maps DTI pairs derived from bioactivity databases onto 3D protein structures to generate local, site-specific structural protein features. These local 3D features are intended to improve generalizability across proteins and performance on novel targets. Model predictions are therefore structurally informed but ligand-independent.

For the CACHE hit identification stage, the workflow will consist of the following steps:

1. Create a machine learning pocket representation of documented ligand binding sites on SARS-CoV-2 Nsp3 that incorporates multiple data sources, including the 3D structure of the binding pocket and functional annotations for Nsp3.
2. Screen the binding sites against all 4.5 billion molecules of the Enamine REAL database using the DTI prediction model to identify candidate hit molecules.
3. Filter the 15,000 highest-scoring molecules by physicochemical properties, molecular docking score, salient ADMET properties, and predicted off-target activity against human proteins. Off-target activity will be predicted by a counter-screen that applies the same DTI model to a human proteome of 79,817 P2Rank-predicted pockets [2] from 16,818 AlphaFold2-modelled structures available from the EBI AlphaFold2 (AF2) repository [3], to ensure signal specificity.
4. Cluster the molecules that pass the filtering process by fingerprint similarity. The final 100 compounds will be selected from the representatives of each cluster to maintain structural diversity.
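
The counter-screen filter and the diversity selection described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the production workflow: the selectivity margin, the Tanimoto cutoff, the greedy leader-style clustering, and all names (`Candidate`, `passes_counter_screen`, `select_diverse`) are hypothetical, and fingerprints are represented as plain sets of on-bit indices rather than RDKit objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    target_score: float     # DTI score against the viral pocket (hypothetical scale)
    max_human_score: float  # highest DTI score over all human pockets
    fingerprint: frozenset  # on-bit indices of a structural fingerprint


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def passes_counter_screen(c, margin=0.2):
    """Keep a molecule only if it scores clearly higher on the target
    than on its best-scoring human pocket. The margin is an assumption."""
    return c.target_score - c.max_human_score >= margin


def select_diverse(candidates, similarity_cutoff=0.6, max_picks=100):
    """Greedy leader clustering: walk candidates from best to worst score
    and keep one only if it is dissimilar to every representative so far."""
    picks = []
    for c in sorted(candidates, key=lambda c: c.target_score, reverse=True):
        if all(tanimoto(c.fingerprint, p.fingerprint) < similarity_cutoff
               for p in picks):
            picks.append(c)
        if len(picks) == max_picks:
            break
    return picks


candidates = [
    Candidate("A", 0.90, 0.30, frozenset({1, 2, 3, 4})),
    Candidate("B", 0.85, 0.20, frozenset({1, 2, 3, 5})),   # near-duplicate of A
    Candidate("C", 0.80, 0.75, frozenset({10, 11})),        # non-selective
    Candidate("D", 0.70, 0.10, frozenset({7, 8, 9})),
]
selective = [c for c in candidates if passes_counter_screen(c)]
selected = select_diverse(selective)
selected_names = [c.name for c in selected]
```

In this toy run, C is dropped by the counter-screen and B falls into the same cluster as the higher-scoring A, so only A and D are retained. A real implementation would likely use RDKit fingerprints and a standard clustering routine in place of the set-based stand-ins.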

[1] MacKinnon, S. S., Madani Tonekaboni, S. A., & Windemuth, A. (2021). Proteome-scale drug-target interaction predictions: Approaches and applications. Current Protocols, 1, e302. doi: 10.1002/cpz1.302

[2] Krivák, R., & Hoksza, D. (2018). P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. Journal of Cheminformatics, 10, 39. doi: 10.1186/s13321-018-0285-8

[3] Varadi, M., Anyango, S., Deshpande, M., et al. (2022). AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50(D1), D439–D444. doi: 10.1093/nar/gkab1061

What makes your approach stand out from the community? (<100 words)

As a neural network, our DTI model is both high-throughput and generalizable across a broad range of proteins. Once trained, it can evaluate approximately 800 pre-featurized protein-ligand pairs per CPU-second on modern computing infrastructure, which allows us to exhaustively screen the 4.5 billion ligands of the Enamine REAL database. The method applies to both human and viral proteins, so we can control for off-target effects by screening candidate molecules against the human proteome within the same framework.
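
As a rough check of what the quoted throughput implies, the arithmetic can be worked through as below. The library size, scoring rate, shortlist size, and pocket count are taken from the text; the core count is purely an assumed example, and featurization time is not included because the quoted rate applies to pre-featurized pairs.

```python
PAIRS_PER_CPU_SECOND = 800       # quoted throughput for pre-featurized pairs
LIBRARY_SIZE = 4_500_000_000     # Enamine REAL database
SHORTLIST_SIZE = 15_000          # highest-scoring molecules kept for filtering
HUMAN_POCKETS = 79_817           # P2Rank-predicted pockets in the counter-screen
N_CORES = 1_000                  # assumption: illustrative cluster size only


def cpu_hours(n_ligands, n_pockets=1, rate=PAIRS_PER_CPU_SECOND):
    """CPU-hours needed to score every ligand against every pocket."""
    cpu_seconds = n_ligands * n_pockets / rate
    return cpu_seconds / 3600


primary_screen_hours = cpu_hours(LIBRARY_SIZE)                   # one viral pocket
counter_screen_hours = cpu_hours(SHORTLIST_SIZE, HUMAN_POCKETS)  # human proteome
primary_wall_hours = primary_screen_hours / N_CORES

print(f"Primary screen: {primary_screen_hours:,.1f} CPU-hours per pocket")
print(f"Counter-screen: {counter_screen_hours:,.1f} CPU-hours")
print(f"Primary screen on {N_CORES} cores: {primary_wall_hours:.2f} hours")
```

At the quoted rate, the primary screen comes to about 1,560 CPU-hours per pocket and the counter-screen to roughly 416 CPU-hours, assuming perfect parallel scaling.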

Method Name

Massive library screening using structurally-augmented Drug-Target Interaction (DTI) prediction models

Free software packages used

RDKit, Vina, PyTorch

Hit Optimization Methods

Method type (check all that apply)

De novo design

Machine learning

Description of your approach (min 200 and max 800 words)

We will apply a related approach to expand on the first-round actives and identify more potent molecules from the larger Enamine REAL Space (up to 31 billion molecules). The DTI prediction engine will remain the core predictive driver, but our approach to chemical space exploration will differ.

Synthon-based expansion: We will decompose the hit(s) identified in our first screen into the individual synthons provided by Enamine. These will seed an optimization algorithm that iteratively substitutes, adds, and removes synthons in search of better-scoring configurations. The iterative molecular reconfiguration will be based on RDKit [1] and will obey the reaction rules that govern the Enamine REAL Space. This approach will allow us to expand beyond the pre-enumerated 4.5-billion-molecule Enamine REAL database to the ~31-billion-molecule REAL Space, which is also available through Enamine's made-on-demand service.
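
The iterative search described above could look roughly like the sketch below. It is a simplified illustration rather than the actual optimizer: it covers only the substitution move (not adding or removing synthons), uses a greedy accept-if-better rule, and relies on a hypothetical `score` callable as a stand-in for the DTI model. The synthon identifiers, pools, and weights are invented for the example.

```python
import random


def optimize_synthons(seed, synthon_pools, score, n_iter=500, rng=None):
    """Greedy single-position substitution search over synthon combinations.

    seed          -- tuple of synthon IDs, one per reaction position
    synthon_pools -- list of allowed synthon IDs for each position
    score         -- callable mapping a synthon tuple to a float
                     (hypothetical stand-in for the DTI prediction)
    """
    rng = rng or random.Random(0)
    best = tuple(seed)
    best_score = score(best)
    for _ in range(n_iter):
        pos = rng.randrange(len(best))               # pick a position to mutate
        replacement = rng.choice(synthon_pools[pos])  # pick an allowed synthon
        trial = best[:pos] + (replacement,) + best[pos + 1:]
        trial_score = score(trial)
        if trial_score > best_score:                  # keep only improvements
            best, best_score = trial, trial_score
    return best, best_score


# Toy data: two reaction positions with three allowed synthons each.
pools = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
weights = {"a1": 0.1, "a2": 0.5, "a3": 0.3, "b1": 0.2, "b2": 0.1, "b3": 0.4}


def toy_score(combo):
    """Additive toy score; a real score would come from the trained model."""
    return sum(weights[s] for s in combo)


seed_combo = ("a1", "b2")
best_combo, best_value = optimize_synthons(seed_combo, pools, toy_score)
```

Restricting each substitution to the pool for its position is what keeps every trial molecule inside the allowed combinatorial space in this sketch; a real implementation would enforce the reaction rules when assembling products.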

We will apply the relevant filters and exclusion criteria described in steps 3 and 4 of the hit identification workflow to the top-scoring compounds from the synthon-expansion search. In addition, we will apply recent ML explainability approaches, including molecular counterfactuals and computational scaffold decomposition, to identify the core active moieties of the molecules.

[1] RDKit: Open-source cheminformatics. https://www.rdkit.org. https://doi.org/10.5281/zenodo.591637

What makes your approach stand out from the community? (<100 words)

Synthon-based expansion allows us to generate a highly diverse set of molecules while maintaining synthetic accessibility. Used in conjunction with the DTI prediction model, it lets us search a broad chemical space for the best-scoring chemical structures.

Method Name

Synthon Expansion

Free software packages used

RDKit, PyTorch

Challenge #3