Computational methods

Hit Identification

Method type (check all that apply)

High-throughput docking

Machine learning

Physics-based

Other (specify)

Hydration site analysis, 3D pharmacophoric screening

Description of your approach (min 200 and max 800 words)

Our team of computational chemists and machine learning experts is part of a science CRO that contributes continuously to the drug discovery community by collaborating with big pharma and incubating biotechs. In drug discovery projects, we prioritize compounds based on our million-scale in-house compound database, which includes structures, bioactivities, and PhysChem data. Additionally, we predict PhysChem properties with validated machine learning models that were trained on our legacy database.

The fragment binding modes to NSP13 observed in the X-ray structures will be summarized in a merged binding site model. The adjacent, so far unexploited subpockets will be investigated further with hydration site analysis tools such as 3D-RISM and PyRod. These tools will allow us to identify hotspots where ligand binding can gain enthalpic and entropic contributions. We believe that combining the fragment interactions with the energetics and pharmacophoric characteristics of the hydration sites will yield an augmented binding site model. This model will be instrumental in finding compounds that extend from the fragments towards unexploited subpockets.

The binding site model will be used in a pharmacophoric screen of the virtual compound space, including iterative optimization. Hits will be filtered in accordance with the traffic light system of CACHE (fraction of sp3 carbons, polar surface area, logD, solubility, number of rotatable bonds, and molecular weight). For optimal property prediction accuracy, our machine learning models trained on proprietary in-house data will be applied.
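A traffic light filter of this kind can be sketched in a few lines of Python. This is a minimal illustration only: the property names, the numeric ranges in `RANGES`, and the `max_yellow` tolerance are hypothetical placeholders, not the official CACHE cut-offs, and the property values are assumed to come from upstream prediction models.

```python
from typing import Dict, Tuple

# HYPOTHETICAL ranges, not the official CACHE thresholds.
# Each entry: (green_low, green_high, yellow_low, yellow_high)
RANGES: Dict[str, Tuple[float, float, float, float]] = {
    "mol_weight": (200.0, 450.0, 150.0, 550.0),
    "logd": (0.0, 3.5, -1.0, 5.0),
    "tpsa": (40.0, 120.0, 20.0, 140.0),
    "rot_bonds": (0.0, 7.0, 0.0, 10.0),
    "fsp3": (0.3, 1.0, 0.2, 1.0),
    "log_solubility": (-4.0, 1.0, -5.0, 1.0),
}


def light(name: str, value: float) -> str:
    """Classify one property value as green, yellow, or red."""
    g_lo, g_hi, y_lo, y_hi = RANGES[name]
    if g_lo <= value <= g_hi:
        return "green"
    if y_lo <= value <= y_hi:
        return "yellow"
    return "red"


def traffic_lights(props: Dict[str, float]) -> Dict[str, str]:
    """Return the colour for every property listed in RANGES."""
    return {name: light(name, props[name]) for name in RANGES}


def passes(props: Dict[str, float], max_yellow: int = 2) -> bool:
    """Keep a compound with no red light and at most max_yellow yellow lights."""
    colours = list(traffic_lights(props).values())
    return "red" not in colours and colours.count("yellow") <= max_yellow


# Illustrative compound with made-up property values.
compound = {
    "mol_weight": 380.4,
    "logd": 2.1,
    "tpsa": 85.0,
    "rot_bonds": 5,
    "fsp3": 0.25,
    "log_solubility": -3.8,
}
```

With these placeholder ranges, `passes(compound)` keeps the example compound because only `fsp3` falls into the yellow band.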

The filtered hitlist will be docked to a prepared 3D structure model of the target NSP13 (multiple conformations, if required) for confirmation, including optimal shape fit to the pocket and ligand strain. Depending on the number of entries, the following work packages might be applied in an iterative fashion.

Docking hits will be triaged using two docking scores implemented in ICM: a classical score including a desolvation term, and a machine learning score trained on protein-ligand interactions (Radial and Topological Convolutional Neural Net - RTCNN).
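One simple way to combine two docking scores is a rank-sum consensus, sketched below. This is an assumption for illustration, not necessarily the triage rule used in the campaign; it also assumes that lower (more negative) values are better for both scores and that the scores have already been exported from ICM into plain dictionaries. The compound identifiers and values are made up.

```python
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map compound id -> rank (1 = best), assuming lower scores are better."""
    ordered = sorted(scores, key=lambda cid: scores[cid])
    return {cid: position + 1 for position, cid in enumerate(ordered)}


def consensus(classical: Dict[str, float], ml: Dict[str, float]) -> List[str]:
    """Order compounds by the sum of their ranks in both scoring schemes."""
    shared = set(classical) & set(ml)  # only compounds scored by both
    r_classical = ranks({c: classical[c] for c in shared})
    r_ml = ranks({c: ml[c] for c in shared})
    # Ties in the rank sum are broken alphabetically for reproducibility.
    return sorted(shared, key=lambda c: (r_classical[c] + r_ml[c], c))


# Made-up scores for three hypothetical compounds.
classical_scores = {"cpd_a": -32.1, "cpd_b": -28.4, "cpd_c": -35.0}
ml_scores = {"cpd_a": -41.0, "cpd_b": -25.5, "cpd_c": -30.2}
ordering = consensus(classical_scores, ml_scores)
```

Rank-based combination avoids having to put the two scores on a common scale, at the cost of discarding the size of the score differences.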

Docking hit poses will undergo a physics-based estimation of the free energy of binding, such as Molecular Mechanics Poisson-Boltzmann Surface Area (MM-PBSA) continuum solvation on the docking pose or on a short molecular dynamics simulation. Protein-ligand interaction fingerprints might help to preselect poses for visual inspection to ensure binding pose plausibility.
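For reference, the MM-PBSA estimate is commonly written as follows. This is the textbook form of the method, not a statement about the exact protocol or settings used here; the entropy term is often omitted in practice when only a relative ranking is needed.

```latex
\Delta G_{\mathrm{bind}} \approx
  \langle G_{\mathrm{complex}} \rangle
  - \langle G_{\mathrm{receptor}} \rangle
  - \langle G_{\mathrm{ligand}} \rangle ,
\qquad
G = E_{\mathrm{MM}} + G_{\mathrm{PB}} + G_{\mathrm{SA}} - T S
```

Here $E_{\mathrm{MM}}$ is the molecular mechanics energy, $G_{\mathrm{PB}}$ the polar solvation term from the Poisson-Boltzmann equation, $G_{\mathrm{SA}}$ the nonpolar term estimated from the surface area, and the angle brackets denote an average over snapshots (a single snapshot when only the docking pose is evaluated).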

Optionally, the hitlist will be expanded by rescreening with a revised binding site model to enrich the hitlist with favored scaffolds, which will undergo the same workflow including docking and subsequent steps, if required.

Finally, the hitlist will be reviewed by a senior medicinal chemist for lead-likeness and medchem attractiveness. Hit confirmation, revision, and hit expansion might be repeated until a list of up to 100 promising virtual molecules is obtained.

What makes your approach stand out from the community? (<100 words)

We are a science CRO with a large in-house database of molecules that reflects the legacy of the pharmaceutical industry. While we generate new data for ongoing projects, we can apply our (machine) learnings to our clients' needs and to this CACHE challenge.

The lack of bioactivity data for the four published fragments calls for a critical assessment of the binding modes reported in the X-ray structures. A hydration site analysis will provide an unbiased physics-based readout to identify the most critical interactions for a pharmacophoric virtual screen. More expensive computational methods will be used to prioritize molecules further down the pipeline.

Method Name

Hydration site analysis guided virtual screening campaign

Commercial software packages used

Molsoft ICM-Pro

CCG MOE

AMBER

KNIME server

Free software packages used

Python + libraries (OpenMM, RDKit, pandas, matplotlib, numpy)

PyRod

Relevant publications of previous uses by your group of this software/method

Schaller et al. 2019 - https://doi.org/10.1021/acs.jcim.9b00281

Pach et al. 2020 - https://doi.org/10.1021/acsmedchemlett.9b00629

Hit Optimization Methods

Method type (check all that apply)

De novo design

Free energy perturbation

High-throughput docking

Machine learning

Physics-based

Other (specify)

Substructure search

Description of your approach (min 200 and max 800 words)

The 100 compounds submitted in Stage 1 will be carefully analyzed with respect to their activity results, synthetic feasibility, and CACHE-relevant molecular properties such as molecular weight and fraction of sp3 carbons. Up to 10 compounds will be picked for hit optimization.

The binding mode of each of the 10 selected compounds will be examined, and Murcko-like scaffolds will be generated that contain only the substituents essential for binding to NSP13. These scaffolds will be used for substructure-based searches of virtual compound libraries. Up to 10k hits will be accepted per substructure, resulting in up to 100k analogues.

If this strategy does not deliver enough analogues, we will discuss a combinatorial library enumeration approach in the binding site also involving medicinal chemists to utilize our in-house building blocks. (If such compounds pass all steps, we would also be willing to synthesize compounds in-house and deliver them to the CACHE team for in-vitro testing.)

Selected analogues will be docked into the NSP13 binding pocket using the parent compound’s docking pose as a shape- and electrostatics-based restraint (APF) in ICM-Pro. Compounds will be retained only if their docking poses reproduce the most critical interactions identified for the parent compound. For each of the 10 prioritized parent compounds, up to 1k compounds will be selected by docking score as well as CACHE-relevant molecular properties, resulting in up to 10k analogues. As stated for the first stage of this challenge, critical properties such as solubility and logD will be predicted using machine learning models trained on our in-house proprietary life science database.
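The per-parent selection step can be sketched as a grouped top-N pick. This is a minimal sketch under stated assumptions: the record layout `(parent_id, analogue_id, docking_score)` and all identifiers are hypothetical, lower docking scores are assumed to be better, and the property filter is assumed to have been applied beforehand.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (parent_id, analogue_id, docking_score)
Record = Tuple[str, str, float]


def top_per_parent(records: Iterable[Record], n_per_parent: int) -> Dict[str, List[str]]:
    """Keep the n best-scoring analogues (lowest score) for every parent."""
    by_parent: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for parent, analogue, score in records:
        by_parent[parent].append((score, analogue))
    return {
        parent: [analogue for _, analogue in heapq.nsmallest(n_per_parent, items)]
        for parent, items in by_parent.items()
    }


# Made-up docking results for two hypothetical parent compounds.
docked = [
    ("parent_1", "ana_1", -30.5),
    ("parent_1", "ana_2", -34.2),
    ("parent_1", "ana_3", -28.0),
    ("parent_2", "ana_4", -31.7),
]
selected = top_per_parent(docked, n_per_parent=2)
```

In the workflow described above, `n_per_parent` would be 1000, so that 10 parents yield at most 10k analogues.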

The docking poses of the 10k analogues will be evaluated in terms of their stability in short molecular dynamics simulations. The ligand’s root-mean-square deviation or similar descriptors will be calculated and used as a filter to reduce the number of compounds to a maximum of 1k.
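A pose stability filter based on ligand RMSD could look like the sketch below. It assumes that the trajectory frames have already been aligned on the protein, so no fitting is performed on the ligand itself; the 2.0 Å cutoff and the 80 % fraction are illustrative assumptions, not values taken from the workflow.

```python
import math
from typing import List, Sequence, Tuple

Coords = Sequence[Tuple[float, float, float]]


def rmsd(ref: Coords, frame: Coords) -> float:
    """Ligand heavy-atom RMSD between two coordinate sets, without fitting."""
    if len(ref) != len(frame):
        raise ValueError("coordinate sets must have the same number of atoms")
    squared = sum((a - b) ** 2 for p, q in zip(ref, frame) for a, b in zip(p, q))
    return math.sqrt(squared / len(ref))


def is_stable(ref: Coords, trajectory: List[Coords],
              cutoff: float = 2.0, min_fraction: float = 0.8) -> bool:
    """Call a pose stable if enough frames stay within cutoff of the docking pose."""
    within = sum(1 for frame in trajectory if rmsd(ref, frame) <= cutoff)
    return within / len(trajectory) >= min_fraction


# Toy two-atom ligand: docking pose and a copy shifted by 3 Å along z.
docking_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted_pose = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
```

For example, a trajectory in which nine of ten frames stay at the docking pose passes the filter, whereas one in which a quarter of the frames drift by 3 Å does not.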

Finally, free energy calculations will be performed. The estimated relative free energy will be used to prioritize compounds for a last round of visual inspection including experienced medicinal chemists.
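If an alchemical approach such as free energy perturbation is used for this step, the relative binding free energy between a parent compound A and an analogue B follows from the standard thermodynamic cycle. The relation is given here as general background under that assumption, not as a description of the specific protocol:

```latex
\Delta\Delta G_{\mathrm{bind}}(A \rightarrow B)
  = \Delta G_{\mathrm{bind}}(B) - \Delta G_{\mathrm{bind}}(A)
  = \Delta G_{\mathrm{complex}}(A \rightarrow B)
  - \Delta G_{\mathrm{solvent}}(A \rightarrow B)
```

Here $\Delta G_{\mathrm{complex}}$ and $\Delta G_{\mathrm{solvent}}$ are the free energies of transforming A into B in the protein-bound state and in solution, respectively. A negative $\Delta\Delta G_{\mathrm{bind}}$ indicates that the analogue is predicted to bind more tightly than its parent.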

What makes your approach stand out from the community? (<100 words)

We apply state-of-the-art methods in a pipeline to reach a sensible set of analogues. We combine this pipeline with tools that are applied in real-life drug discovery campaigns, such as machine learning models trained on our proprietary in-house life science database.

The hitlist might be enriched with attractive chemical matter through combinatorial library enumeration inside the binding pocket. De novo compounds can be synthesized by our MedChem colleagues using our in-house building block library.

Method Name

Scaffold-based analogues prioritized by free energy calculations

Commercial software packages used

Molsoft ICM-Pro

CCG MOE

AMBER

KNIME server

Free software packages used

Python + libraries (OpenMM, RDKit, pandas, matplotlib, numpy)

Challenge #2