Computational methods

Hit Identification

Method type (check all that apply)

Deep learning

Free energy perturbation

High-throughput docking

Machine learning

Physics-based

Hybrid of the above

Our CODASS 3.0 workflow includes components from every category checked above (see details below)

Description of your approach (min 200 and max 800 words)

CODASS3

This proposal is a substantially enhanced version of our previously successful CACHE2 proposal. It improves every stage of the COmbined Docking And Similarity Search 2.0 (CODASS2) workflow that was applied to that challenge, and introduces additional tools and features to boost both its throughput (and thus the size of its screening library) and the reliability of its predictions. In summary, these improvements are:

Improved Virtual Chemical Library

Our approach strives to leverage the chemical diversity accessible through Enamine in an efficient manner. To do so, we will generate virtual libraries following a dynamic, hierarchical strategy. A small, diverse subset of Enamine’s catalogue (HitLocator, ca. 460 k compounds) will be evaluated with our computational models and serve as the starting point to identify virtual chemical seeds. The seeds will be used to compile a focused library from Enamine’s REAL catalogue (22 billion compounds), which will in turn be evaluated using our models. At each step, compounds will be filtered for suitable physicochemical and structural properties (e.g., MW, logP, absence of PAINS motifs and carboxylates). This process will be repeated, generating progressively more refined libraries based on the predictions from the parent library. This evolutionary process will allow us to minimise the computational resources spent on unpromising chemotypes and to exploit more deeply the chemical diversity accessible through Enamine.
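
The iterative library refinement described above can be sketched as follows. This is a minimal illustration only, not our production code: the function name hierarchical_screen and its parameters (score, passes_filters, expand, n_seeds, n_rounds) are hypothetical placeholders for the scoring models, the property filters, and the search of the larger catalogue around each seed.

```python
def hierarchical_screen(start_library, score, passes_filters, expand,
                        n_seeds=100, n_rounds=3):
    """Score a starting library, then repeatedly expand around the best seeds.

    All arguments are hypothetical stand-ins:
      score(compound)          -> float, higher is better
      passes_filters(compound) -> bool, property filters (MW, logP, ...)
      expand(seed)             -> iterable of related compounds from a larger catalogue
    Compounds must be hashable (for example, SMILES strings).
    """
    scores = {}
    library = list(start_library)
    for round_index in range(n_rounds):
        # Filter and score only compounds not evaluated in an earlier round.
        for compound in library:
            if compound not in scores and passes_filters(compound):
                scores[compound] = score(compound)
        if round_index == n_rounds - 1:
            break
        # The best-scoring members of the current library become the seeds.
        evaluated = [c for c in library if c in scores]
        seeds = sorted(evaluated, key=scores.get, reverse=True)[:n_seeds]
        # The next, more focused library is built around those seeds.
        library = [child for seed in seeds for child in expand(seed)]
    # Final ranking over every compound evaluated in any round.
    return sorted(scores, key=scores.get, reverse=True)
```

Because only the neighbourhoods of the current seeds are expanded, compute is concentrated on chemotypes that already score well rather than spread across the whole catalogue.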

SCORCH2

SCORCH2 is an evolution of the DL-based scoring function SCORCH (very recently published; see relevant publications) and adds several features: a larger number of data sources, the use of binding affinities in training, new descriptors to better capture binding entropy, a broader range of model architectures to improve accuracy and uncertainty estimates, additional data augmentation, improved data splitting and model finalisation, improved handling of class imbalance, and improved GPU compatibility. The result is superior docking, screening, and ranking power, as well as higher throughput.

See below for more details of the consensus scheme built into SCORCH2 itself.

Autodock-SS

A wholly new addition, Autodock-SS introduces 3D ligand-based virtual screening (LBVS) to our workflow, which previously relied only on structure-based virtual screening methods with some 2D similarity metrics. It is a development of Autodock-GPU that repurposes the docking algorithm to evaluate 3D molecular similarity (rather than fit to a protein's pocket) while treating the library molecules as conformationally flexible. This is more realistic and, unlike any freely available LBVS tool, obviates the need to pre-generate a multiconformer library.

In rigorous in-house benchmarking against databases of experimentally derived structures and binding data, both SCORCH2 and Autodock-SS have outperformed the state of the art as described in the literature; both are being prepared for publication in peer-reviewed journals (see list below).

GaMD+CaFE+MOPAC

Calculation of binding affinity via molecular dynamics and semiempirical quantum mechanics is a wholly new addition to our workflow. It leverages the new Gaussian Accelerated Molecular Dynamics technique (in conjunction with CaFE and MOPAC2016) to boost sampling and thus reliability. We have recently used this method to evaluate docking results from a virtual screening campaign carried out against Trypanosomatid phosphofructokinases, after benchmarking against experimentally determined structures and binding data showed that it usefully boosts the reliability of hit selection. It will be used to further validate the binding potential of top hits selected by the upstream methods, and will be particularly valuable during the Hit Optimization stage of the challenge.

Vina-GPU+

Vina-GPU+ offers pose-prediction reliability similar to that of our previous initial screening/docking tool, PSOVina2, but with an approximately fivefold increase in throughput. Using it as our initial docking tool therefore increases the size of the library that can be processed, and thus the probability that it contains true hits.

Consensus Methods:

Posing - also known as Consensus Docking (and not to be confused with consensus scoring), consensus posing leverages multiple docking programs to substantially boost the reliability of the predicted poses that are fed into the downstream scoring schemes. Rigorous evaluation by multiple research groups (including ours, which made the initial discovery) has shown this to hold true under a variety of conditions; see our publications and the following, which reference our work:

https://pubmed.ncbi.nlm.nih.gov/27311630/

https://www.mdpi.com/1420-3049/28/1/175/pdf
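
As an illustration of consensus posing, the sketch below keeps only those ligands for which the poses predicted by different docking programs agree with one another. It is a simplified example under stated assumptions, not our production code: the function names are hypothetical, the 2.0 Å RMSD cutoff is an assumed, commonly used threshold, poses are taken to be lists of (x, y, z) coordinates in identical atom order, and molecular symmetry is ignored.

```python
import math
from itertools import combinations


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses with identical atom order."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and contain the same atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


def consensus_ligands(poses_by_ligand, cutoff=2.0):
    """Keep ligands whose poses from every program agree pairwise within cutoff.

    poses_by_ligand maps a ligand name to a list of poses, one per docking
    program. The cutoff (in angstroms) is an assumed value.
    """
    kept = []
    for ligand, poses in poses_by_ligand.items():
        # At least two programs are needed for a consensus to exist.
        if len(poses) >= 2 and all(
            rmsd(a, b) <= cutoff for a, b in combinations(poses, 2)
        ):
            kept.append(ligand)
    return kept
```

Only the ligands returned by consensus_ligands would be passed on to the downstream scoring schemes in this simplified picture.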

Classical Scoring - a battery of 6 methods based on force-field, knowledge-based, and machine learning approaches, with a proven track record in reliability (see our publications).

DL Scoring - SCORCH2 is itself a consensus method as it combines the ML/DL methods of its predecessor SCORCH1 (GBDT using XGBoost, a FF NN, and a W&D NN) in a new way, namely by implementing a consensus model by average prediction score (known as a multi-balanced model), and also adds a Random Forest-based scoring method to this scheme.
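
A consensus by average prediction score can be illustrated with the short sketch below. It shows only the averaging step; the function name and the model labels used in the example call are hypothetical, and any normalisation or weighting applied inside SCORCH2 is not represented.

```python
def mean_consensus(predictions_by_model):
    """Average per-compound scores across models.

    predictions_by_model maps a model name to a list of scores, one per
    compound, with every list in the same compound order.
    """
    score_lists = list(predictions_by_model.values())
    if not score_lists:
        raise ValueError("at least one model is required")
    n_compounds = len(score_lists[0])
    if any(len(scores) != n_compounds for scores in score_lists):
        raise ValueError("all models must score the same compounds")
    # zip(*...) groups the scores each compound received from every model.
    return [sum(column) / len(score_lists) for column in zip(*score_lists)]


# Hypothetical scores for two compounds from three models.
consensus = mean_consensus({
    "gbdt": [0.9, 0.2],
    "ff_nn": [0.7, 0.4],
    "wd_nn": [0.8, 0.3],
})
```

Adding a further model, such as a Random Forest-based scorer, amounts to adding one more entry to the dictionary.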

For CACHE2, we used a standard rank-by-rank consensus scheme to merge all of these scores, but CODASS 3.0 will now use the new exponential ranking consensus scheme described here:

https://pubmed.ncbi.nlm.nih.gov/30914702/
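
The difference between the two consensus schemes can be illustrated with a short sketch. It assumes, as we understand the publication linked above, that the exponential consensus score of a compound is the sum over scoring methods of exp(-rank / sigma) / sigma; the function names, the toy ranks, and the value of sigma are hypothetical and chosen only for illustration.

```python
import math


def rank_by_rank(rank_tables):
    """Mean rank per compound across methods (lower is better)."""
    compounds = rank_tables[0].keys()
    return {c: sum(t[c] for t in rank_tables) / len(rank_tables) for c in compounds}


def exponential_consensus(rank_tables, sigma):
    """Sum of exp(-rank / sigma) / sigma over methods (higher is better)."""
    compounds = rank_tables[0].keys()
    return {
        c: sum(math.exp(-t[c] / sigma) / sigma for t in rank_tables)
        for c in compounds
    }


# Toy ranks (1 = best) from three hypothetical scoring methods.
tables = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 3, "b": 2, "c": 1},
    {"a": 1, "b": 2, "c": 3},
]
rbr = rank_by_rank(tables)
ecr = exponential_consensus(tables, sigma=1.0)
rbr_order = sorted(rbr, key=rbr.get)                # ascending mean rank
ecr_order = sorted(ecr, key=ecr.get, reverse=True)  # descending consensus score
```

In this toy example the two schemes order the compounds differently: rank-by-rank gives a, b, c, whereas the exponential scheme gives a, c, b, because it rewards a top rank from any one method more strongly than consistently middling ranks.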

What makes your approach stand out from the community? (<100 words)

Our focus is on leveraging a comprehensive battery of state-of-the-art and beyond-state-of-the-art structure-based and ligand-based methodologies in a sophisticated workflow that has been expanded and refined over the last 10 years. Our rigorous in-house testing, against many different targets, in both prospective and retrospective studies, has shown that consensus methods (which include consensus pose prediction and the distinct consensus scoring) are more reliable than single methods used alone, particularly when leveraged against a new target. Our successful CACHE2 methodology has been significantly enhanced, with many improvements made to our existing tools, as well as new tools introduced.

Method Name

COmbined Docking and Similarity Search 3.0 (CODASS3)

Commercial software packages used

None

Free software packages used

Autodock, Vina-GPU+, GWOVina, RF-Score-VS v2, SCORCH2, Osiris DataWarrior, PDB2PQR, OpenBabel, RDKit, Autodock-SS, NAMD, MOPAC2016 (free to academics), Filter-it

Relevant publications of previous uses by your group of this software/method

SCORCH: Improving structure-based virtual screening with machine learning classifiers, data augmentation, and uncertainty estimation. doi: 10.1016/j.jare.2022.07.001.

Comparison of ATP-binding pockets and discovery of homologous recombination inhibitors. doi: 10.1016/j.bmc.2022.116923

Consensus Docking: Improving the Reliability of Docking in a Virtual Screening Context. doi:10.1021/ci300399w

Design of drug-like hepsin inhibitors against prostate cancer and kidney stones. doi: 10.1016/j.apsb.2019.09.008

Structure- and ligand-based virtual screening identifies new scaffolds for inhibitors of the oncoprotein MDM2. doi: 10.1371/journal.pone.0121424

Identification and activity of inhibitors of the essential nematode-specific metalloprotease DPY-31. doi: 10.1016/j.bmcl.2015.10.077

Inhibition of the ERCC1-XPF structure-specific endonuclease to overcome cancer chemoresistance. doi: 10.1016/j.dnarep.2015.04.002

Discovery of a novel ligand that modulates the protein-protein interactions of the AAA+ superfamily oncoprotein reptin. doi: 10.1039/c4sc03885a

UFSRAT: Ultra-fast Shape Recognition with Atom Types - the discovery of novel bioactive small molecular scaffolds for FKBP12 and 11βHSD1. doi: 10.1371/journal.pone.0116570

Gao G, Houston DR. “A new Score Function Based on a Random Forest Model for Structure-based Virtual Screening”, manuscript in preparation

Gumbis G, Ben Y, Houston DR. “Creation of Pharmacophore Model for Small Molecule Inhibitors of T. brucei Phosphofructokinase and Analysis of Inhibitor-Protein Complex by Docking, Molecular Dynamics and SEQM”, manuscript in preparation

Boyang N, Wang R, Khalaf H, Blay-Roger V, Houston DR. “Autodock-SS: AutoDock for Multiconformational Ligand-Based Virtual Screening”, manuscript in preparation.

Note: preliminary methodologies, data, and findings for the above manuscripts in preparation will automatically be made publicly available in Mar/Apr 2023 in the form of MSc dissertations, according to the University of Edinburgh’s Central Library publishing schedule:

https://www.sps.ed.ac.uk/students/postgraduate/taught-msc/your-studies/msc-taught-dissertations/msc-dissertation-library

The original CODASS 1.0 method was presented at the “Computational Chemical Biology: probing biology with in silico tools” conference at The University of Manchester:

https://www.researchgate.net/publication/259216315_CODASS_A_New_Process_for_Ligand_Discovery_In_Silico

Our CODASS 2.0 workflow was described in detail in our CACHE2 application.

All of our tools, methodology, and workflow (which we have named COmbined Docking And Similarity Search 3.0, or CODASS3) are open-source and described in the literature (or soon will be). Some of our latest improvements are so new that they are not yet published; these are listed in the References section as Manuscripts in Preparation.

Challenge #3