Computational methods

Hit Identification

Method type (check all that apply)

Deep learning

High-throughput docking

Machine learning

Description of your approach (min 200 and max 800 words)

We propose a workflow for identifying potential hit compounds that combines natural language processing, molecular docking, and unsupervised learning. Our goal is to use the high-performance computing resources at Argonne National Laboratory to perform inference on the entire Enamine database.

First, we will estimate the binding affinity of all 5.5 billion Enamine compounds to the tyrosine kinase binding (TKB) domain of Casitas B-lineage lymphoma-b (CBLB) using a docking surrogate. To achieve this, we will create a diverse training subset of 1 million compounds by conducting scaffold-based clustering of the 5.5 billion Enamine molecules, followed by stratified sampling of each cluster. We will then use AutoDock Vina to dock each molecule in the training set to the TKB domain of CBLB. The computed docking scores will be used to train a docking surrogate built on a Bidirectional Encoder Representations from Transformers (BERT)-like Transformer that learns a vector embedding of the SMILES representation of each molecule. The extracted SMILES embeddings will then be used to train a surrogate regression model that predicts the docking scores.
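The stratified sampling step can be illustrated with a minimal sketch. It assumes each molecule has already been assigned a scaffold key (for example, a Murcko scaffold SMILES computed with RDKit); the function name `stratified_sample`, the proportional allocation rule, and the one-molecule-per-cluster minimum are illustrative assumptions rather than the final implementation.

```python
import random
from collections import defaultdict


def stratified_sample(scaffold_of, n_total, seed=0):
    """Draw roughly n_total molecule ids, proportionally from each scaffold cluster.

    scaffold_of: dict mapping molecule id -> scaffold key (hypothetical input;
    in practice the key might be a Murcko scaffold SMILES computed with RDKit).
    """
    rng = random.Random(seed)

    # Group molecule ids by their scaffold key.
    clusters = defaultdict(list)
    for mol_id, scaffold in scaffold_of.items():
        clusters[scaffold].append(mol_id)

    n_mols = len(scaffold_of)
    sample = []
    for scaffold in sorted(clusters):
        members = sorted(clusters[scaffold])
        # Proportional allocation, keeping at least one molecule per cluster
        # so that rare scaffolds stay represented (an assumption of this sketch).
        k = max(1, round(n_total * len(members) / n_mols))
        k = min(k, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

Because every cluster contributes at least one molecule, the returned sample can be slightly larger than `n_total` when there are many small clusters.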

Once we have trained the BERT-like Transformer, we will use it to encode all compounds in the Enamine dataset in a latent space. To enhance the diversity of potential hits, we will cluster the latent space using a Gaussian mixture model. We will discard clusters that contain none of the 895 known CBLB inhibitors, removing compounds that are unlikely to compete with the co-crystallized ligand. All remaining compounds will then be fed into the docking surrogate to predict docking scores. The compounds in each cluster will be ranked by surrogate docking score, and the 10% of each cluster with the best predicted binding affinity will be selected for further screening.
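The cluster filtering and per-cluster selection can be sketched as follows. The sketch assumes cluster labels and surrogate scores are already available as dictionaries, that the known inhibitors have been assigned to clusters separately (`inhibitor_clusters`), and that lower scores mean stronger predicted binding, as with Vina-style scores. All names here are hypothetical.

```python
from collections import defaultdict


def select_candidates(cluster_of, inhibitor_clusters, predicted_score, top_percent=10):
    """Keep clusters that contain a known inhibitor, then take the best-scoring
    top_percent of compounds in each kept cluster.

    cluster_of: dict mapping compound id -> cluster label
    inhibitor_clusters: set of cluster labels assigned to known inhibitors
    predicted_score: dict mapping compound id -> surrogate docking score
    """
    members = defaultdict(list)
    for mol_id, cluster in cluster_of.items():
        members[cluster].append(mol_id)

    selected = {}
    for cluster, mols in members.items():
        if cluster not in inhibitor_clusters:
            continue  # no known inhibitor landed in this cluster
        # Assumption: more negative score = stronger predicted binding.
        ranked = sorted(mols, key=lambda m: predicted_score[m])
        # Integer arithmetic avoids floating-point surprises; keep at least one.
        n_keep = max(1, len(ranked) * top_percent // 100)
        selected[cluster] = ranked[:n_keep]
    return selected
```

Selecting within each cluster, rather than taking a global top 10%, is what preserves chemical diversity across the retained clusters.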

Next, we will use AutoDock Vina to dock each compound in this reduced space to obtain more accurate binding affinity estimates and structural information about the docking pose. All compounds will be ranked based on two criteria: binding affinity and center-of-mass distance to the crystal structure compound (calculated using MDAnalysis). Before calculating the center-of-mass distance, we will run a molecular dynamics simulation of the CBLB crystal structure and use the most stable position of the co-crystallized compound as the reference. Since the scales of these two measures differ, we will standardize each measure according to its distribution and take the arithmetic mean of the standardized values as the final ranking score. We will recommend the top 100 compounds from this approach for experimental testing. Our workflow leverages advanced algorithms and high-performance computing to accelerate the discovery of lead compounds.
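The combined ranking score can be illustrated with a short sketch. It assumes standardization means a z-score (subtract the mean, divide by the standard deviation) and that lower is better for both measures: a more negative Vina score and a smaller center-of-mass distance. The function names are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical: no spread to scale by
    return [(v - mu) / sigma for v in values]


def rank_compounds(names, vina_scores, com_distances):
    """Rank compounds by the arithmetic mean of two standardized measures.

    Assumption: lower is better for both inputs, so the best compound
    has the smallest combined score and appears first.
    """
    z_aff = zscores(vina_scores)
    z_dist = zscores(com_distances)
    combined = [(a + d) / 2 for a, d in zip(z_aff, z_dist)]
    order = sorted(range(len(names)), key=lambda i: combined[i])
    return [(names[i], combined[i]) for i in order]
```

Under these assumptions, the recommended set would be the first 100 entries of the returned list.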

What makes your approach stand out from the community? (<100 words)

The inclusion of a Transformer allows for high-throughput screening of compounds without expensive preprocessing such as the generation of Mordred descriptors or molecular fingerprints. Furthermore, the proposed unsupervised segmentation of the data allows for a parallelized approach to hit identification. The supercomputing resources and cyberinfrastructure at Argonne National Laboratory would allow us to scale the workflow across many compute nodes, vastly increasing the number of compounds we could feasibly test.

Method Name

Parallelized Inhibitor Prediction using Transformers (PIPT)

Commercial software packages used

None

Free software packages used

AutoDock Vina 1.1.2, AutoDock Tools, RDKit, Enamine Database, VMD, Python, MDAnalysis, PDB Database, NAMD, TensorFlow, PyTorch, scikit-learn

Relevant publications of previous uses by your group of this software/method

Saadi, A.A., Alfe, D., Babuji, Y., Bhati, A., Blaiszik, B., Brace, A., Brettin, T., Chard, K., Chard, R., Clyde, A. and Coveney, P. ‘IMPECCABLE: Integrated modeling pipeline for COVID cure by assessing better leads’. In 50th International Conference on Parallel Processing (2021): 1-12.

Lee, H., Merzky, A., Tan, L., Titov, M., Turilli, M., Alfe, D., Bhati, A., Brace, A., Clyde, A., Coveney, P. and Ma, H. ‘Scalable HPC & AI infrastructure for COVID-19 therapeutics’. In Proceedings of the Platform for Advanced Scientific Computing Conference (2021): 1-13.

Clyde, A., Galanie, S., Kneller, D.W., Ma, H., Babuji, Y., Blaiszik, B., Brace, A., Brettin, T., Chard, K., Chard, R. and Coates, L. ‘High-throughput virtual screening and validation of a SARS-CoV-2 main protease noncovalent inhibitor’. Journal of Chemical Information and Modeling 62, no. 1 (2021): 116-128.

Zvyagin, M., Brace, A., Hippe, K., Deng, Y., Zhang, B., Bohorquez, C.O., Clyde, A., Kale, B., Perez-Rivera, D., Ma, H. and Mann, C.M. ‘GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics’. bioRxiv (2022).

Haloi, N., Vasan, A.K., Geddes, E.J., Prasanna, A., Wen, P.C., Metcalf, W.W., Hergenrother, P.J. and Tajkhorshid, E. ‘Rationalizing the generation of broad spectrum antibiotics with the addition of a positive charge’. Chemical Science 12, no. 45 (2021): 15028-15044.

Vasan, A.K., Haloi, N., Ulrich, R.J., Metcalf, M.E., Wen, P.C., Metcalf, W.W., Hergenrother, P.J., Shukla, D. and Tajkhorshid, E. ‘Role of internal loop dynamics in antibiotic permeability of outer membrane porins’. Proceedings of the National Academy of Sciences 119, no. 8 (2022): e2117009119.

Challenge #4