Computational methods

Hit Identification

Method type (check all that apply)

High-throughput docking

Other (specify)

Genetic algorithms

Description of your approach (min 200 and max 800 words)

I have developed a genetic algorithm (GA) that can search Enamine REAL Space and will use it to find molecules with good docking scores for the target. My previous work has shown that GAs are very effective at searching a large chemical space for molecules with optimal properties [1-4], including docking scores [5]. Each gene is defined as a set of compatible synthons that can be combined into a molecule using a set of reaction rules. Starting from randomly generated genes, the genes are evolved to give molecules with good docking scores. The open-source program SynthI [6] is used to create the synthons from the building blocks (freely available from the Enamine web page) used to make the Enamine REAL Space library. The exact reaction rules used to make the Enamine REAL Space library are proprietary, but the reaction rules implemented in SynthI seem to be a close approximation. Preliminary data show that many of the molecules found by the GA can either be found in the Enamine REAL Space library or differ from such compounds by only a few atoms (as found by a search using the SmallWorld algorithm [7]). Library compounds that differ slightly from the GA molecules will then be docked to the target to confirm that these small differences do not appreciably affect the docking scores. The final list of 100 molecules will be assembled based on docking scores, inspection of the docking poses, number of commercially available derivatives, drug-likeness, and structural diversity. The synthon-based GA code (Synthon-GA) can also be used with other synthon sets, for example, sets based on in-house building blocks. It can also be used with ML models of activity instead of docking scores, so it will be a very general tool for molecule discovery using huge make-on-demand libraries.
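The evolution of synthon genes described above can be illustrated with a minimal sketch. This is not the Synthon-GA code: the synthon pools, the two-slot gene layout, the elitist selection scheme, and the function `toy_score` (a placeholder for a docking score, where lower is better) are all assumptions made for illustration.

```python
import random

# Hypothetical synthon pools, one per reaction slot.
# Real pools would be generated from the Enamine building blocks with SynthI.
SYNTHON_POOLS = [
    ["A1", "A2", "A3", "A4"],
    ["B1", "B2", "B3"],
]

def random_gene(rng):
    """A gene holds one synthon per slot of a (hypothetical) two-component reaction."""
    return tuple(rng.choice(pool) for pool in SYNTHON_POOLS)

def mutate(gene, rng):
    """Replace the synthon in one randomly chosen slot with one from the same pool."""
    slot = rng.randrange(len(gene))
    new_gene = list(gene)
    new_gene[slot] = rng.choice(SYNTHON_POOLS[slot])
    return tuple(new_gene)

def crossover(parent_a, parent_b, rng):
    """Build a child by taking each slot from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(parent_a, parent_b))

def toy_score(gene):
    """Placeholder for a docking score (lower is better); not a real scoring function."""
    return -sum(int(synthon[1]) for synthon in gene)

def run_ga(score, generations=20, population_size=8, mutation_rate=0.3, seed=0):
    """Evolve genes toward lower scores, starting from randomly generated genes."""
    rng = random.Random(seed)
    population = [random_gene(rng) for _ in range(population_size)]
    for _ in range(generations):
        children = []
        for _ in range(population_size):
            parent_a, parent_b = rng.sample(population, 2)
            child = crossover(parent_a, parent_b, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        # Elitist selection: keep the best-scoring genes among parents and children.
        population = sorted(population + children, key=score)[:population_size]
    return population

best = run_ga(toy_score)[0]
```

In a real run, `toy_score` would be replaced by a function that assembles the molecule from its synthons and docks it, which is by far the most expensive step.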

Method Name

Synthon-GA

Commercial software packages used

Glide for docking (though open source docking programs can also be used).

Free software packages used

SynthI, Synthon-GA

Relevant publications of previous uses by your group of this software/method

1. Jensen, Jan H. 2019. “A Graph-Based Genetic Algorithm and Generative model/Monte Carlo Tree Search for the Exploration of Chemical Space.” Chemical Science 10 (12): 3567–72.
2. Henault, Emilie S., Maria H. Rasmussen, and Jan H. Jensen. 2020. “Chemical Space Exploration: How Genetic Algorithms Find the Needle in the Haystack.” PeerJ Physical Chemistry 2 (July): e11.
3. Koerstz, Mads, Anders S. Christensen, Kurt V. Mikkelsen, Mogens Brøndsted Nielsen, and Jan H. Jensen. 2021. “High Throughput Virtual Screening of 230 Billion Molecular Solar Heat Battery Candidates.” PeerJ Physical Chemistry 3 (February): e16.
4. Ree, Nicolai, Mads Koerstz, Kurt V. Mikkelsen, and Jan H. Jensen. 2021. “Virtual Screening of Norbornadiene-Based Molecular Solar Thermal Energy Storage Systems Using a Genetic Algorithm.” The Journal of Chemical Physics 155 (18): 184105.
5. Steinmann, Casper, and Jan H. Jensen. 2021. “Using a Genetic Algorithm to Find Molecules with Good Docking Scores.” PeerJ Physical Chemistry 3 (May): e18.
6. Zabolotna, Yuliana, Dmitriy M. Volochnyuk, Sergey V. Ryabukhin, Kostiantyn Gavrylenko, Dragos Horvath, Olga Klimchuk, Oleksandr Oksiuta, Gilles Marcou, and Alexandre Varnek. 2021. “SynthI: A New Open-Source Tool for Synthon-Based Library Design.” Journal of Chemical Information and Modeling, November. https://doi.org/10.1021/acs.jcim.1c00754.
7. Irwin, John J., Khanh G. Tang, Jennifer Young, Chinzorig Dandarchuluun, Benjamin R. Wong, Munkhzul Khurelbaatar, Yurii S. Moroz, John Mayfield, and Roger A. Sayle. 2020. “ZINC20-A Free Ultralarge-Scale Chemical Database for Ligand Discovery.” Journal of Chemical Information and Modeling 60 (12): 6065–73.

Virtual screening of merged selections

Method type (check all that apply)

High-throughput docking

Description of your approach (min 200 and max 800 words)

Here we will simply use Glide to compute docking scores and label a molecule as active if its docking score is better than a chosen cutoff value.
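A minimal sketch of this labeling step is shown below. It assumes that more negative docking scores are better, and the molecule names, scores, and cutoff of -8.0 are hypothetical values chosen only for illustration.

```python
def label_actives(docking_scores, cutoff):
    """Label a molecule as active when its docking score is at or below the cutoff.

    Assumes more negative docking scores indicate better predicted binding.
    """
    return {name: score <= cutoff for name, score in docking_scores.items()}

# Hypothetical docking scores and cutoff, for illustration only.
scores = {"mol_a": -9.2, "mol_b": -6.1, "mol_c": -8.0}
labels = label_actives(scores, cutoff=-8.0)
```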

Method Name

Glide

Commercial software packages used

Glide

Hit Optimization Methods

Method type (check all that apply)

High-throughput docking

Machine learning

Other (specify)

Genetic algorithm

Description of your approach (min 200 and max 800 words)

The primary goal for this round is to identify the additional 100 best molecules for training a robust ML model of activity. The objective of the GA search [1] is thus to identify the 100 molecules that, together with the 100 molecules from the previous round, result in the optimum training set, i.e. the set that leads to the best performance on a predefined, diverse test set. Performance is judged by docking scores augmented by the experimental data [2], which are taken as a proxy for the actual activity. Each gene in the population is now a collection of 100 synthon-genes. The score for each “training set gene” is computed by docking the molecules encoded by its synthon-genes, training an ML model using these data plus the 100 molecules from Round 1, and computing the mean absolute error of that ML model on the predefined test set. The test set is chosen to be significantly different from the 100 molecules from Round 1, and the molecules in the population are constrained to be different from the test set, using a Tanimoto similarity cutoff. The training set genes are evolved to lower this score, starting from randomly generated genes. Mating operations consist of combining different training set genes, while mutation operations consist of randomly incorporating new synthons. The exact choice of ML model remains to be decided, but for datasets of this size I have had good luck with either random forest (RF) or LightGBM using either an ECFP4 or a CDDD latent vector [3] representation of the molecules. Once the Round 2 results are back, the model will be retrained on the experimental data. The method will then be tested on all data from this challenge once they are released.
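The scoring of one candidate set of molecules can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: fingerprints are represented as plain sets of on-bit indices (standing in for ECFP4), a nearest-neighbour predictor stands in for the RF or LightGBM model, and all data values and function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def differs_from_test_set(fp, test_fps, cutoff):
    """True if the molecule is less similar than `cutoff` to every test-set molecule."""
    return all(tanimoto(fp, test_fp) < cutoff for test_fp in test_fps)

def predict_nearest(fp, training):
    """Stand-in model: return the score of the most similar training molecule."""
    _, score = max(training, key=lambda item: tanimoto(fp, item[0]))
    return score

def training_set_score(candidate, round1, test_set):
    """Mean absolute error on the test set for a model trained on candidate + Round 1 data."""
    training = candidate + round1
    errors = [abs(predict_nearest(fp, training) - y) for fp, y in test_set]
    return sum(errors) / len(errors)

# Hypothetical (fingerprint, docking score) pairs, for illustration only.
round1 = [({1, 2, 3}, -7.0)]
test_set = [({4, 5, 6}, -9.0), ({1, 2, 7}, -7.5)]
candidate_a = [({4, 5, 8}, -8.5)]     # resembles a test-set molecule
candidate_b = [({9, 10, 11}, -5.0)]   # unrelated to the test set

score_a = training_set_score(candidate_a, round1, test_set)
score_b = training_set_score(candidate_b, round1, test_set)
```

Here `candidate_a` gives the lower error, so a GA minimizing `training_set_score` would prefer it over `candidate_b`. In the full method each candidate would hold 100 molecules, and `differs_from_test_set` would be applied to every molecule before it is admitted to the population.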

Method Name

Synthon-GA

Commercial software packages used

Glide

Free software packages used

Synthon-GA

Relevant publications of previous uses by your group of this software/method

1. The idea comes from this study: Browning, Nicholas J., Raghunathan Ramakrishnan, O. Anatole von Lilienfeld, and Ursula Roethlisberger. 2017. “Genetic Optimization of Training Sets for Improved Machine Learning Models of Molecular Properties.” Journal of Physical Chemistry Letters 8 (7): 1351–59.
2. Ji, Beihong, Xibing He, Yuzhao Zhang, Jingchen Zhai, Viet Hoang Man, Shuhan Liu, and Junmei Wang. 2021. “Incorporating Structural Similarity into a Scoring Function to Enhance the Prediction of Binding Affinities.” Journal of Cheminformatics 13 (1): 11.
3. Winter, Robin, Floriane Montanari, Frank Noé, and Djork-Arné Clevert. 2019. “Learning Continuous and Data-Driven Molecular Descriptors by Translating Equivalent Chemical Representations.” Chemical Science 10 (6): 1692–1701.

Challenge #1