Computational methods

Hit Identification

Method type (check all that apply)

Deep learning

High-throughput docking

Machine learning

Hybrid of the above

Description of your approach (min 200 and max 800 words)

We propose applying Cyclica’s Ligand Design massive library screening workflow, which has been successfully applied in several commercial hit-finding programs. The multi-scale workflow exhaustively screens Enamine’s REAL database, composed of 4.1 billion compounds, through a series of three successive predictive tasks of increasing computational cost. The core predictive engine for the proposed workflow is Cyclica’s MatchMaker, a deep-learning-based Drug Target Interaction (DTI) prediction tool. DTI tools have recently emerged as a new class of predictive drug discovery algorithms (see reference 1 under relevant publications) that train on large datasets of protein-ligand binding pairs available through bioactivity databases (e.g. ChEMBL or BindingDB). At their core, DTI models feed combined protein and ligand feature vectors into neural networks to train models capable of identifying interacting pairs. MatchMaker differs from other reported DTI models in that it systematically maps DTI pairs derived from bioactivity databases onto 3D protein structures to generate local, site-specific structural protein features. These local 3D features are designed to improve MatchMaker’s generalizability across proteins and its performance on novel targets. Once trained, MatchMaker can evaluate approximately 800 pre-featurized protein-ligand pairs per CPU-second on modern computing infrastructure, which allows us to exhaustively score 4.1 billion ligands against a single target (about 5.1 million CPU-seconds, or roughly 1,400 CPU-hours) with manageable cloud computing resources.
For the CACHE hit-finding stage, we will apply this approach as follows:
1. Cyclica staff scientists will select LRRK2 pockets using an internal target feasibility app, which combines several pocket identification tools, protein visualization software and other MatchMaker-specific target pocket analytics.
2. The initial mass library screen will be conducted against a single pocket of the LRRK2 protein, using Cyclica’s MatchMaker predictive engine on Google Compute Engine.
We will screen all 4.1 billion Enamine REAL database molecules. The top-scoring molecules will be counter-screened against the full human proteome (~15,000 proteins) and subjected to additional predictive tasks, such as molecular docking or ligand-based activity models, to evaluate overall suitability.
3. We will apply standard clustering and filtering protocols to diversify structural characteristics among the top 100 selected compounds. This step is also intended to eliminate molecules with problematic functional groups or structural characteristics (e.g. reactive groups).
--- Revised methods 2022-05-11
While the LRRK2 target has over 5,500 publicly reported bioactivities, most have assay descriptions implying interactions with the target's kinase domain. Few, if any, reported bioactivities are known to engage the target's WDR domain, limiting the applicability of ligand-based bioactivity models. Moreover, it was unclear whether the top-scoring molecules from our DTI model screen would target the kinase or the WDR domain, given the combined use of global protein features and 3D pocket features in training our MatchMaker model. To overcome this challenge, we developed a ligand-based classification model to detect screening hits specific to the WDR domain. We first constructed a dataset of small-molecule ligands known to interact directly with WD40 domains, based on annotations provided in the ECOD database [1]. Following the removal of common crystal artefacts and visual inspection, we assembled 112 different ligands bound to WD40 domains. To build a reference non-binding set, we performed a BLAST search of the PDB for alternative crystal forms and homologs with co-crystal structures. We then structurally aligned each BLAST hit onto an AlphaFold2 model of LRRK2 [2-3] to superimpose and visually inspect the binding sites of their co-crystal ligands.
Once common crystal artefacts were removed, we identified 324 different ligands, none of which map directly to the LRRK2 WDR domain. We then built a ligand-based classification model using POEM [4] to discriminate between the 114 WDR-domain-binding ligands and the 324 ligands binding other domains in LRRK2 or its homologs. Using this model, we confirmed that some of the top-scoring compounds from our MatchMaker-based screen resemble known WD40 binders, but too few to confidently select 100 for experimental testing. Instead, the classifier described above was used to pre-screen the Enamine and Mcule datasets. For each dataset, we applied a POEM confidence threshold of 0.8, which reduced the screening libraries to ~3% of their original size. Following the pre-screening step, we proceeded in accordance with the originally proposed method: counter-screening the library with MatchMaker-based LRRK2 kinase pocket scores, filtering the remaining ligands on the basis of physicochemical properties, and using standard molecular docking and clustering workflows to select the top 150 compounds.
[1] Cheng, H., Schaeffer, R. D., Liao, Y., Kinch, L. N., Pei, J., Shi, S., Kim, B. H. & Grishin, N. V. ECOD: An evolutionary classification of protein domains. PLoS Comput Biol 10(12): e1003926 (2014).
[2] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021).
[3] Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021).
[4] Brereton, A. E. et al. Predicting drug properties with parameter-free machine learning: pareto-optimal embedded modeling (POEM). Mach. Learn.: Sci. Technol. 1, 025008 (2020).
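As an illustration only, the revised selection funnel described above (classifier pre-screen at a 0.8 confidence threshold, kinase-pocket counter-screen, then ranking and top-N selection) could be sketched as follows. This is a minimal sketch under stated assumptions, not Cyclica's implementation: the `Candidate` fields, the `max_kinase_score` cutoff, the inclusive comparison against the threshold, and the convention that lower docking scores are better are all hypothetical, and the physicochemical filtering and clustering steps are omitted.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    """One library compound with precomputed scores (all fields hypothetical)."""
    compound_id: str
    wdr_confidence: float  # classifier confidence that the ligand resembles WD40 binders
    kinase_score: float    # kinase-pocket score; assumed higher = more kinase-like
    docking_score: float   # docking score; assumed lower = better


def select_candidates(
    candidates: Iterable[Candidate],
    confidence_threshold: float = 0.8,  # threshold quoted in the revised methods
    max_kinase_score: float = 0.5,      # assumed cutoff, not from the source
    top_n: int = 150,
) -> List[Candidate]:
    """Apply the cheap filters first, then rank the survivors and keep top_n."""
    # Step 1: ligand-based pre-screen on classifier confidence.
    prescreened = [c for c in candidates if c.wdr_confidence >= confidence_threshold]
    # Step 2: counter-screen, dropping compounds that look like kinase-pocket binders.
    counter_screened = [c for c in prescreened if c.kinase_score <= max_kinase_score]
    # Step 3: rank by docking score (ascending) and truncate.
    return sorted(counter_screened, key=lambda c: c.docking_score)[:top_n]


# Toy data to exercise the funnel; values are invented for illustration.
toy_library = [
    Candidate("a", 0.90, 0.1, -8.0),
    Candidate("b", 0.70, 0.1, -9.0),   # fails the classifier pre-screen
    Candidate("c", 0.95, 0.9, -10.0),  # fails the kinase counter-screen
    Candidate("d", 0.85, 0.2, -9.5),
]
selected = select_candidates(toy_library, top_n=2)
```

Running the inexpensive classifier filter first means the costlier steps only see the fraction of the library that passes it (about 3% in the revised method).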

Method Name

Ligand Design - Massive Library Screening

Commercial software packages used

MatchMaker

Free software packages used

Python-based ML stack (PyTorch, scikit-learn); Biopython computational biology toolkit; RDKit computational chemistry toolkit; various structural biology tools for structural analysis and visualization, including P2Rank, NGL Viewer and AutoDock Vina.

Relevant publications of previous uses by your group of this software/method

The review linked below (reference 1), published by our group, outlines the general ML strategy behind “DTI models”, including a description of the MatchMaker approach. The second reference provides a sample application to drug repurposing, in which a smaller drug repurposing library was screened to discover previously unreported off-target interactions with distinct bioactivities, rather than the exhaustive 4.1-billion-molecule REAL database, screening of which has only recently become technically feasible. The last reference provides a recent example of a small-molecule hit identified by a massive library screen on a novel target (publication currently in preparation). This example demonstrates our current platform capabilities and most closely represents the workflow we intend to apply in CACHE.
1. MacKinnon, S. S., Tonekaboni, S. A. M. & Windemuth, A. Proteome‐Scale Drug‐Target Interaction Predictions: Approaches and Applications. Curr Protoc 1, (2021). Link: https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/cpz1.302
2. Sugiyama, M. G. et al. Multiscale interactome analysis coupled with off-target drug predictions reveals drug repurposing candidates for human coronavirus disease. Sci Rep 11, 23315 (2021). Link: https://www.nature.com/articles/s41598-021-02432-7
3. Kimani, S., Owen, J., Dong, A., Li, Y., Hutchinson, A., Seitova, A., Shahani, V.M., Schapira, M., Arrowsmith, C.H., Edwards, A.M., Halabelian, L., “Crystal structure of the WDR domain of human DCAF1 in complex with CYCA-117-70”. Cyclica press release link: https://www.cyclicarx.com/press-releases/cyclica-and-the-structural-genomics-consortium-co-crystallize-dcaf1 SGC link: https://www.thesgc.org/structures/7SSE PDB ID: 7SSE.

Hit Optimization Methods

Method type (check all that apply)

Machine learning

Other (specify)

Other

Description of your approach (min 200 and max 800 words)

Cyclica will apply a related approach to expand on the first-round actives and identify more potent molecules from the deeper Enamine REAL Space (up to 21 billion molecules). MatchMaker (described above) will remain the core predictive driver, but our approach to chemical space exploration will differ.
1. Synthon-based expansion: We will decompose the hit(s) identified in our first screen into the individual synthons provided by Enamine. These will be used to seed an optimization algorithm that iteratively substitutes, adds or removes synthons in search of better-scoring configurations. The iterative molecular reconfiguration will be performed with Cyclica’s open-source Deriver package, adapted to obey the molecular rules defined by Enamine, and each molecule will be scored using MatchMaker. This approach will allow us to expand beyond the 4.1 billion molecule pre-enumerated Enamine REAL database to the ~21 billion molecule REAL Space, whose compounds are also available through Enamine's made-on-demand service.
2. We will apply the relevant filters and exclusion criteria described in step 3 of the hit-identification workflow to the top-scoring molecules from the synthon-expansion search.
Moreover, Cyclica will apply new ML explainability approaches, including molecular counterfactuals and computational scaffold decomposition, to analyze which moieties the model treats as the core active parts of each molecule.
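The synthon-based expansion in step 1 can be pictured as a greedy local search over synthon combinations. The sketch below is illustrative only and does not use the Deriver or MatchMaker APIs: molecules are represented as tuples of hypothetical synthon identifiers, `score` stands in for a MatchMaker-style scoring function, and `is_valid` stands in for Enamine's construction rules. The greedy strategy itself is an assumption; the source does not name the optimization algorithm.

```python
from typing import Callable, Iterator, Sequence, Tuple

# Hypothetical representation: a molecule is an ordered tuple of synthon IDs.
Molecule = Tuple[str, ...]


def neighbours(mol: Molecule, pool: Sequence[str], max_synthons: int = 3) -> Iterator[Molecule]:
    """Yield every molecule one substitution, removal, or addition away from mol."""
    for i in range(len(mol)):
        for synthon in pool:
            if synthon != mol[i]:
                yield mol[:i] + (synthon,) + mol[i + 1:]  # substitute position i
        if len(mol) > 1:
            yield mol[:i] + mol[i + 1:]  # remove position i
    if len(mol) < max_synthons:
        for synthon in pool:
            yield mol + (synthon,)  # append a synthon


def expand(
    seed: Molecule,
    pool: Sequence[str],
    score: Callable[[Molecule], float],
    is_valid: Callable[[Molecule], bool],
    max_rounds: int = 20,
) -> Tuple[Molecule, float]:
    """Greedy hill climb: move to the best valid neighbour until none improves."""
    best, best_score = seed, score(seed)
    for _ in range(max_rounds):
        valid = [m for m in neighbours(best, pool) if is_valid(m)]
        if not valid:
            break
        challenger = max(valid, key=score)
        challenger_score = score(challenger)
        if challenger_score <= best_score:
            break  # local optimum reached
        best, best_score = challenger, challenger_score
    return best, best_score


# Toy example: additive synthon weights stand in for a learned score.
WEIGHTS = {"A": 1.0, "B": 2.0, "C": 3.0}


def toy_score(mol: Molecule) -> float:
    return sum(WEIGHTS[s] for s in mol)


def toy_valid(mol: Molecule) -> bool:
    # Stand-in rule: at most two synthons, no synthon used twice.
    return len(mol) <= 2 and len(set(mol)) == len(mol)


best, best_score = expand(("A",), list(WEIGHTS), toy_score, toy_valid)
```

A greedy search like this can stall in local optima; in practice one might restart from several seed hits or keep a beam of candidates per round, at the price of more scoring calls.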

Method Name

Ligand Design - Semi Generative

Commercial software packages used

MatchMaker

Free software packages used

Deriver: An open-source Python library for chemical space exploration, developed and maintained by Cyclica. GitHub: https://github.com/cyclica/deriver Publication: https://onlinelibrary.wiley.com/doi/full/10.1002/ail2.17

Relevant publications of previous uses by your group of this software/method

1. Reeves, S. et al. Assessing methods and obstacles in chemical space exploration. Appl AI Lett 1, (2020). Link: https://onlinelibrary.wiley.com/doi/full/10.1002/ail2.17

Challenge #1