Computational methods

Hit Identification

Method type (check all that apply)

Deep learning

High-throughput docking

Machine learning

Description of your approach (min 200 and max 800 words)

We propose to apply a massive library screening workflow that exhaustively screens the 4.5-billion-compound Enamine REAL database using a deep-learning-based Drug Target Interaction (DTI) prediction engine to identify molecules likely to bind to the RNA binding site of the SARS-CoV-2 NSP13 helicase.

Recently, DTI tools have emerged as a new class of predictive drug discovery algorithms [1] that train on large datasets of protein-ligand binding pairs available through bioactivity databases (e.g., ChEMBL, BindingDB, STITCH, and GOSTAR). At their core, DTI models feed protein and ligand feature vectors into neural networks trained to identify interacting pairs. The proposed DTI model differs from other literature-reported solutions in that it systematically maps DTI pairs derived from bioactivity databases onto 3D protein structures to generate local, site-specific structural protein features. These local, site-specific 3D features were designed to improve generalizability across proteins and performance on novel targets. Model predictions are therefore structurally informed, but ligand-independent.
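The general idea of combining a pocket feature vector with a ligand feature vector can be illustrated with a minimal sketch. This is not the MatchMaker architecture, which is not described here: the class name `ToyDTIScorer`, the layer sizes, the random untrained weights, and the simple concatenation of the two vectors are all assumptions made for illustration only.

```python
import math
import random


def dense(weights, bias, vector):
    """Apply one fully connected layer: one weight row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights; a real model would learn these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class ToyDTIScorer:
    """Hypothetical two-input scorer: pocket features + ligand features."""

    def __init__(self, n_pocket, n_ligand, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.hidden = init_layer(n_pocket + n_ligand, n_hidden, rng)
        self.out = init_layer(n_hidden, 1, rng)

    def score(self, pocket_features, ligand_features):
        # Assumption: the two feature vectors are simply concatenated.
        joint = list(pocket_features) + list(ligand_features)
        hidden = relu(dense(self.hidden[0], self.hidden[1], joint))
        logit = dense(self.out[0], self.out[1], hidden)[0]
        # Sigmoid maps the logit to an interaction score in (0, 1).
        return 1.0 / (1.0 + math.exp(-logit))
```

Because the pocket features are computed once per binding site, a scorer of this shape only needs new ligand features for each library molecule, which is what makes exhaustive screening tractable.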

For the CACHE hit identification stage, the workflow will consist of the following steps:

1. Create a machine learning pocket representation of the RNA binding site of SARS-CoV-2 NSP13 helicase that incorporates multiple data sources, including the 3D structure of the binding pocket and functional annotations for the NSP13 helicase.
2. Screen the NSP13 RNA binding site against all 4.5 billion molecules of the Enamine REAL database using the DTI prediction model to identify candidate hit molecules.
3. Filter the 15,000 highest-scoring molecules by physicochemical properties, molecular docking score, predicted ADMET properties, and predicted off-target activity against human proteins. Activity against human proteins will be predicted through a counter-screen that applies the same DTI model to a human proteome composed of 79,817 P2Rank-predicted pockets [2] from 16,818 AlphaFold2-modelled structures made available through the EBI AlphaFold2 (AF2) repository [3].
4. Cluster the molecules that pass the filtering process by fingerprint similarity. The final 100 compounds will be selected from the representatives of each cluster to maintain structural diversity.
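The shortlist, filter, and cluster steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual pipeline: fingerprints are represented as plain Python sets of on-bit indices (in practice RDKit would generate them), the property cut-offs in `passes_filters` are hypothetical placeholders, and greedy leader clustering is swapped in for whatever clustering method is ultimately used.

```python
import heapq


def top_k(scored_stream, k):
    """Keep the k best (score, mol_id) pairs without sorting the whole stream."""
    return heapq.nlargest(k, scored_stream, key=lambda pair: pair[0])


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def passes_filters(props, max_mw=500.0, max_logp=5.0):
    """Hypothetical property filter; the cut-offs are illustrative only."""
    return props["mw"] <= max_mw and props["logp"] <= max_logp


def leader_cluster(ranked_ids, fingerprints, threshold=0.6):
    """Greedy leader clustering over molecules ordered best score first.

    Each molecule joins the first cluster whose leader is at least
    `threshold` similar; otherwise it starts a new cluster as its leader.
    """
    clusters = {}  # leader id -> member ids, in order of creation
    for mol_id in ranked_ids:
        for leader in clusters:
            if tanimoto(fingerprints[mol_id], fingerprints[leader]) >= threshold:
                clusters[leader].append(mol_id)
                break
        else:
            clusters[mol_id] = [mol_id]
    return clusters


def select_diverse(scored, fingerprints, props, k, n_final, threshold=0.6):
    """Shortlist the top k, filter, cluster, and return cluster leaders."""
    shortlist = top_k(scored, k)
    kept = [mol_id for _, mol_id in shortlist if passes_filters(props[mol_id])]
    clusters = leader_cluster(kept, fingerprints, threshold)
    return list(clusters)[:n_final]
```

Because molecules are visited in descending score order, each cluster leader is the best-scoring member of its cluster, so taking leaders as representatives favours both score and diversity. In the workflow described above, `k` would be 15,000 and `n_final` would be 100.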

[1] MacKinnon, S. S., Madani Tonekaboni, S. A., & Windemuth, A. (2021). Proteome-scale drug-target interaction predictions: Approaches and applications. Current Protocols, 1, e302. doi: 10.1002/cpz1.302

[2] Krivák, R., Hoksza, D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminform 10, 39 (2018). https://doi.org/10.1186/s13321-018-0285-8

[3] Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John Jumper, Ellen Clancy, Richard Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard Kleywegt, Ewan Birney, Demis Hassabis, Sameer Velankar, AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models, Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages D439–D444, https://doi.org/10.1093/nar/gkab1061

What makes your approach stand out from the community? (<100 words)

As a neural network model, our DTI model is both high-throughput and generalizable across a broad array of proteins. Once trained, it can evaluate approximately 800 pre-featurized protein-ligand pairs per CPU-second on modern computing infrastructure, which allows us to exhaustively screen the 4.5 billion ligands from the Enamine REAL database. The method can also be applied to both human and viral proteins, allowing us to control for off-target effects by screening candidate molecules against the human proteome through the same framework.
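As a rough back-of-the-envelope check of the stated throughput, the compute budget for one exhaustive screen can be estimated as below. The library size and rate come from the text; the core count is a hypothetical example, and the estimate assumes perfect parallel scaling and ignores featurization and I/O costs.

```python
LIBRARY_SIZE = 4_500_000_000   # Enamine REAL molecules, from the text
PAIRS_PER_CPU_SECOND = 800     # approximate throughput, from the text


def cpu_hours(n_pairs, rate=PAIRS_PER_CPU_SECOND):
    """CPU-hours needed to score n_pairs pre-featurized pairs."""
    return n_pairs / rate / 3600.0


def wall_clock_hours(n_pairs, n_cores, rate=PAIRS_PER_CPU_SECOND):
    """Idealized wall-clock time, assuming perfect scaling across cores."""
    return cpu_hours(n_pairs, rate) / n_cores


if __name__ == "__main__":
    print(f"{cpu_hours(LIBRARY_SIZE):.1f} CPU-hours for one pocket")
    # Hypothetical 1,000-core allocation.
    print(f"{wall_clock_hours(LIBRARY_SIZE, 1000):.2f} hours on 1,000 cores")
```

Under these assumptions, one pocket against the full library costs about 1,560 CPU-hours, or under two hours of wall-clock time on 1,000 cores.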

Method Name

Massive Library Screening using Structurally-Augmented Drug-Target Interaction (DTI) prediction models

Commercial software packages used

MatchMaker (Cyclica Inc.)

Free software packages used

Python-based ML stack (PyTorch, scikit-learn)

BioPython computational biology toolkit

RDKit computational chemistry toolkit

Various structural biology tools for structural analysis and visualization, including P2Rank, NGL Viewer, and AutoDock Vina.

Relevant publications of previous uses by your group of this software/method

The recent review linked below [1], published by our group, outlines the general ML strategy behind "DTI models". The second link provides a sample application to drug repurposing, in which a smaller drug repurposing library was screened to discover previously unreported off-target interactions with distinct bioactivities; exhaustive screening of the 4.5-billion-molecule REAL database has only recently become technically feasible. The last reference provides a recent example of a small-molecule hit identified by a massive library screen on a novel target (publication currently in preparation).

[1] MacKinnon, S. S., Tonekaboni, S. A. M. & Windemuth, A. Proteome‐Scale Drug‐Target Interaction Predictions: Approaches and Applications. Curr Protoc 1, (2021). Link.
[2] Sugiyama, M. G. et al. Multiscale interactome analysis coupled with off-target drug predictions reveals drug repurposing candidates for human coronavirus disease. Sci Rep 11, 23315 (2021). Link.
[3] Kimani, S., Owen, J., Dong, A., Li, Y., Hutchinson, A., Seitova, A., Shahani, V.M., Schapira, M., Arrowsmith, C.H., Edwards, A.M., Halabelian, L., "Crystal structure of the WDR domain of human DCAF1 in complex with CYCA-117-70". Cyclica Press Release. SGC Link. PDB Link 7SSE.

Hit Optimization Methods

Method type (check all that apply)

Machine learning

Description of your approach (min 200 and max 800 words)

We will apply a related approach to expand on the first-round actives and identify more potent molecules from the deeper Enamine REAL Space (up to 31 billion molecules). The DTI prediction engine will remain the core predictive driver, but our approach to chemical space exploration will differ.

Synthon-based expansion: We will decompose the hit(s) identified in our first screen into the individual synthons provided by Enamine. These will be used to seed an optimization algorithm that iteratively substitutes, adds, and removes synthons in search of better-scoring configurations. The iterative molecular reconfiguration will be based on RDKit [1] and will obey the reaction rules that govern the Enamine REAL Space. This approach will allow us to expand beyond the 4.5-billion-molecule pre-enumerated Enamine REAL database to the ~31-billion-molecule REAL Space, which is also available through Enamine's made-on-demand service.
We will apply the relevant filters and exclusion criteria described in steps 3 and 4 of the hit identification workflow to the top-scoring compounds from the synthon-expansion search. Moreover, we will apply new ML explainability approaches, including molecular counterfactuals and computational scaffold decomposition, to identify the core active moieties of the molecules.
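The substitute/add/remove search described above can be sketched as a simple steepest-ascent hill climb. This is a toy illustration rather than the actual optimization algorithm: a molecule is reduced to a tuple of synthon identifiers, `score` is a hypothetical stand-in for the DTI model, `is_valid` is a hypothetical stand-in for the REAL Space reaction rules, and the 2-to-3 synthon size limits are assumptions.

```python
def neighbours(config, pool, min_size=2, max_size=3):
    """Yield every configuration one edit away from `config`."""
    config = tuple(config)
    # Substitute one synthon for another from the pool.
    for i in range(len(config)):
        for synthon in pool:
            if synthon != config[i]:
                yield config[:i] + (synthon,) + config[i + 1:]
    # Add a synthon if the size limit allows.
    if len(config) < max_size:
        for synthon in pool:
            yield config + (synthon,)
    # Remove a synthon if the size limit allows.
    if len(config) > min_size:
        for i in range(len(config)):
            yield config[:i] + config[i + 1:]


def optimize(seed, pool, score, is_valid, max_rounds=50):
    """Steepest-ascent search from a seed hit; stops at a local optimum.

    `score` and `is_valid` are caller-supplied callables (hypothetical
    stand-ins for the DTI model and the synthesis rules, respectively).
    """
    best = tuple(seed)
    best_score = score(best)
    for _ in range(max_rounds):
        candidates = [c for c in neighbours(best, pool) if is_valid(c)]
        if not candidates:
            break
        challenger = max(candidates, key=score)
        challenger_score = score(challenger)
        if challenger_score <= best_score:
            break  # no single edit improves the score
        best, best_score = challenger, challenger_score
    return best, best_score
```

Enumerating every neighbour is only practical for a toy pool; with the full synthon catalogue, candidates would more plausibly be sampled each round, and several seeds or restarts would be used to reduce the risk of stopping at a poor local optimum.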

[1] RDKit: Open-source cheminformatics. https://www.rdkit.org. https://doi.org/10.5281/zenodo.591637

What makes your approach stand out from the community? (<100 words)

Synthon-based expansion allows us to generate a highly diverse set of molecules while maintaining synthetic accessibility. When used in conjunction with the DTI prediction model, this approach can identify the optimal chemical structure over a broad chemical space.

Method Name

Synthon Expansion

Commercial software packages used

MatchMaker

Free software packages used

Deriver: An open-source Python library for chemical space exploration. GitHub. Publication.

Relevant publications of previous uses by your group of this software/method

Reeves, S, DiFrancesco, B, Shahani, V, MacKinnon, S, Windemuth, A, Brereton, AE. Assessing methods and obstacles in chemical space exploration. Applied AI Letters. 2020; 1:e17. https://doi.org/10.1002/ail2.17

Challenge #2