We propose applying Cyclica’s Ligand Design massive library screening workflow, which has been successfully applied to several commercial hit-finding programs. The multi-scale workflow exhaustively screens Enamine’s REAL database, composed of 4.1 billion compounds, through a series of three successive predictive tasks with increasingly demanding computational requirements. The core predictive engine for the proposed workflow is Cyclica’s MatchMaker, a deep-learning-based Drug-Target Interaction (DTI) prediction tool. DTI tools have recently emerged as a new class of predictive drug discovery algorithms (MacKinnon et al., 2021) that train on large datasets of protein-ligand binding pairs available through bioactivity databases (e.g. ChEMBL or BindingDB). At their core, DTI models feed combined protein and ligand feature vectors into neural networks to train models capable of identifying interacting pairs. MatchMaker differs from other reported DTI models in that it systematically maps DTI pairs derived from bioactivity databases onto 3D protein structures to generate local, site-specific structural protein features. These local, site-specific 3D features were designed to improve MatchMaker’s generalizability across proteins and its performance on novel targets. Once trained, MatchMaker can evaluate approximately 800 pre-featurized protein-ligand pairs per CPU-second on modern computing infrastructure, which allows us to exhaustively screen 4.1 billion ligands against a single target with manageable cloud computing resources (roughly 1,400 CPU-hours of scoring at that rate).

For the CACHE hit-finding stage, we will apply this approach:

1. Cyclica staff scientists will select LRRK2 pockets using an internal target feasibility app, which combines several pocket identification tools, protein visualization software and other MatchMaker-specific target pocket analytics.
2. The initial mass library screen will be conducted against a single pocket from the LRRK2 protein, using Cyclica’s MatchMaker predictive engine on Google Compute Engine.
We will screen all 4.1 billion Enamine REAL database molecules. The top-scoring molecules will be counter-screened against the full human proteome (~15,000 proteins) and subjected to additional predictive tasks, such as molecular docking or ligand-based activity models, to evaluate overall suitability.
3. We will apply standard clustering and filtering protocols to diversify structural characteristics among the top 100 selected compounds. This step is also intended to eliminate molecules with problematic functional groups or structural characteristics (e.g. reactive groups).

--- Revised methods 2022-05-11

While the LRRK2 target has over 5,500 publicly reported bioactivities, most have assay descriptions implying interactions with the target's kinase domain. Few, if any, reported bioactivities are known to engage the target's WDR domain, limiting the applicability of ligand-based bioactivity models. Moreover, it was unclear whether the top-scoring molecules from our DTI model screen would target the kinase or the WDR domain, given the combined use of global protein features and 3D pocket features in training our MatchMaker model. To overcome this challenge, we developed a ligand-based classification model to detect screening hits specific to the WDR domain. We first constructed a dataset of small-molecule ligands known to interact directly with WD40 domains, based on annotations provided in the ECOD database. Following the removal of common crystal artefacts and visual inspection, we assembled 112 different ligands bound to WD40 domains. To build a reference non-binding set, we performed a BLAST search against the PDB for alternative crystal forms and homologs with co-crystal structures. We then structurally aligned each BLAST hit onto an AlphaFold2 model of LRRK2 [2-3] to superimpose and visually inspect the binding sites of their co-crystal ligands.
Once common crystal artefacts were removed, we identified 324 different ligands, none of which map directly to the LRRK2 WDR domain. We then built a ligand-based classification model using POEM to discriminate between the 114 WDR-domain-binding ligands and the 324 ligands binding other domains in LRRK2 or its homologs. Using this model, we confirmed that some of the top-scoring compounds from our MatchMaker-based screen resemble known WD40 binders, but too few to confidently select 100 for experimental testing. Instead, the classifier described above was used to pre-screen the Enamine and Mcule datasets. For each dataset, we applied a POEM confidence threshold of 0.8, which reduced the screening libraries to ~3% of their original size. Following the pre-screening step, we proceeded in accordance with the originally proposed method: counter-screening the library using MatchMaker-based LRRK2 kinase pocket scores, filtering the remaining ligands on the basis of physicochemical properties, and using standard molecular docking and clustering workflows to select the top 150 compounds.

1. Cheng, H., Schaeffer, R. D., Liao, Y., Kinch, L. N., Pei, J., Shi, S., Kim, B. H. & Grishin, N. V. ECOD: An evolutionary classification of protein domains. PLoS Comput Biol 10(12), e1003926 (2014).
2. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021).
3. Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021).
4. Brereton, A. E. et al. Predicting drug properties with parameter-free machine learning: pareto-optimal embedded modeling (POEM). Mach. Learn.: Sci. Technol. 1, 025008 (2020).
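
The selection funnel in the revised methods (classifier pre-screen, kinase counter-screen, final ranking) and the throughput figure quoted for MatchMaker can be illustrated with a minimal Python sketch. This is not Cyclica's implementation. The `Candidate` fields, the `select_hits` and `cpu_hours` helpers, the inclusive `>=` threshold, the kinase cutoff of 0.5, and the use of a single `target_score` for the final ranking (in place of the physicochemical filtering, docking and clustering steps) are all assumptions made for illustration. Only the 0.8 confidence threshold, the 150-compound selection, the 800 pairs per CPU-second rate and the 4.1 billion library size come from the text.

```python
from dataclasses import dataclass

PAIRS_PER_CPU_SECOND = 800      # throughput quoted in the text
LIBRARY_SIZE = 4_100_000_000    # Enamine REAL size quoted in the text


def cpu_hours(n_pairs: int, pairs_per_cpu_second: float = PAIRS_PER_CPU_SECOND) -> float:
    """CPU-hours needed to score n_pairs pre-featurized protein-ligand pairs."""
    return n_pairs / pairs_per_cpu_second / 3600.0


@dataclass(frozen=True)
class Candidate:
    """Hypothetical per-ligand record; field names are illustrative only."""
    ligand_id: str
    wdr_confidence: float   # classifier confidence that the ligand is WDR-like (0 to 1)
    kinase_score: float     # counter-screen score for the kinase pocket (higher = likelier binder)
    target_score: float     # stand-in for the final ranking criterion


def select_hits(candidates, wdr_threshold=0.8, max_kinase_score=0.5, top_n=150):
    """Pre-screen by classifier confidence, drop likely kinase binders, keep the top_n."""
    # Step 1: keep ligands at or above the confidence threshold (inclusive is an assumption).
    prescreened = [c for c in candidates if c.wdr_confidence >= wdr_threshold]
    # Step 2: counter-screen; the 0.5 cutoff is a hypothetical value.
    selective = [c for c in prescreened if c.kinase_score < max_kinase_score]
    # Step 3: rank by the stand-in score and truncate.
    ranked = sorted(selective, key=lambda c: c.target_score, reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    print(f"{cpu_hours(LIBRARY_SIZE):.0f} CPU-hours to score the full library")
    demo = [
        Candidate("lig-a", 0.95, 0.10, 0.70),
        Candidate("lig-b", 0.60, 0.10, 0.90),  # removed by the pre-screen
        Candidate("lig-c", 0.85, 0.90, 0.80),  # removed by the counter-screen
    ]
    print([c.ligand_id for c in select_hits(demo)])
```

At the quoted rate, scoring all 4.1 billion ligands against one pocket costs about 1,424 CPU-hours, which is why an exhaustive single-target screen fits within ordinary cloud resources. A pre-screen that keeps ~3% of a library shrinks the downstream scoring cost by the same factor.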
Python-based ML stack (PyTorch, scikit-learn); Biopython computational biology toolkit; RDKit computational chemistry toolkit; various structural biology tools for structural analysis and visualization, including P2Rank, NGL Viewer and AutoDock Vina.
The recent review listed below [1], published by our group, outlines the general ML strategy behind “DTI models”, including a description of the MatchMaker approach. The second reference provides a sample application to drug repurposing, in which a smaller drug repurposing library was screened to discover previously unreported off-target interactions with distinct bioactivities. That study did not use the exhaustive 4.1 billion molecule REAL database, which has only recently become technically feasible to screen. The last reference provides a recent example of a small-molecule hit identified by a massive library screen on a novel target (publication currently in preparation). This example demonstrates our current platform capabilities and more closely represents the workflow we intend to apply in CACHE.

1. MacKinnon, S. S., Tonekaboni, S. A. M. & Windemuth, A. Proteome‐Scale Drug‐Target Interaction Predictions: Approaches and Applications. Curr Protoc 1, (2021). Link: https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/cpz1.302
2. Sugiyama, M. G. et al. Multiscale interactome analysis coupled with off-target drug predictions reveals drug repurposing candidates for human coronavirus disease. Sci Rep 11, 23315 (2021). Link: https://www.nature.com/articles/s41598-021-02432-7
3. Kimani, S., Owen, J., Dong, A., Li, Y., Hutchinson, A., Seitova, A., Shahani, V. M., Schapira, M., Arrowsmith, C. H., Edwards, A. M. & Halabelian, L. “Crystal structure of the WDR domain of human DCAF1 in complex with CYCA-117-70”. Cyclica press release: https://www.cyclicarx.com/press-releases/cyclica-and-the-structural-genomics-consortium-co-crystallize-dcaf1 SGC link: https://www.thesgc.org/structures/7SSE PDB: 7SSE.