We built a custom ligand-based model to predict the IC50 of ligands. It will subsequently be used to screen a large number of ligands in the ENAMINE database (1 billion small molecules). The model leverages transfer learning, using latent representations from a previously published model trained on large datasets.
To represent the ligands, we used the NotYetAnotherNightshade (NYAN) model (our recent paper published in Nature Machine Intelligence, https://doi.org/10.1038/s42256-023-00683-9). NYAN is a graph variational encoder model that represents ligands in a latent space. The only input NYAN requires is the Simplified Molecular Input Line Entry System (SMILES) string of the ligand; information from the protein is not considered in the model. We used a supervised learning approach in which a multi-layer perceptron (MLP) was trained with the NYAN embeddings as features and the IC50 values available in the competition dataset (753 patent compounds) as targets. Hyper-parameter tuning was then conducted to improve how well the trained model predicts IC50. The test R2 values from 5-fold cross-validation were 0.56152, 0.46183, 0.33454, 0.48296 and 0.32956.
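As a reading aid for the cross-validation figures above, the sketch below shows how R2 is defined and summarises the five reported fold scores. It is a minimal illustration, not the training code: the helper names `r2_score` and `kfold_indices` are hypothetical, and the contiguous, unshuffled fold layout is an assumption, since the text does not say how the 753 compounds were split.

```python
import statistics

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def kfold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) for contiguous, near-equal folds.

    Assumption: no shuffling or stratification; the source does not
    describe the actual splitting scheme.
    """
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Per-fold test R2 values as reported in the text.
fold_r2 = [0.56152, 0.46183, 0.33454, 0.48296, 0.32956]
mean_r2 = statistics.mean(fold_r2)
sd_r2 = statistics.stdev(fold_r2)  # sample standard deviation across folds
print(f"mean R2 = {mean_r2:.3f}, SD = {sd_r2:.3f}")

# Test-fold sizes when 753 compounds are split five ways.
fold_sizes = [len(test) for _, test in kfold_indices(753)]
print(fold_sizes)
```

Averaging the five reported values gives a mean test R2 of about 0.434, which summarises the spread between the best fold (0.56152) and the worst (0.32956).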
We then tested this model on the 2605 compounds from ChEMBL (provided by the Challenge) by predicting the IC50 of each compound. 95.2% of the 2609 compounds were predicted to have an IC50 below 1 uM. Additionally, a Mann-Whitney U test was performed, and the null hypothesis that the IC50 distribution of ligands with confidence score 9 is equal to or greater than that of ligands with confidence score 8 was rejected, with a p-value of 1.20e-05. From these data, we believe that our current model can function as a good regression model for predicting the IC50 of small molecules binding to MCHR1.
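The one-sided comparison described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the analysis that produced the reported p-value: the function name `mann_whitney_u_less` is hypothetical, the p-value uses the normal approximation without tie or continuity correction, and the IC50 values are made-up toy numbers. In practice a library routine such as `scipy.stats.mannwhitneyu(x, y, alternative="less")` would typically be used.

```python
import math

def mann_whitney_u_less(x, y):
    """One-sided Mann-Whitney U test of H1: values in x tend to be smaller than in y.

    U counts the pairs with x_i > y_j (ties count 0.5), so a small U supports H1.
    The p-value is the lower tail of the normal approximation to U
    (no tie correction, no continuity correction).
    """
    n1, n2 = len(x), len(y)
    u = sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(-z / math.sqrt(2.0))  # P(Z <= z)
    return u, p

# Toy IC50 values in uM (hypothetical, for illustration only).
ic50_score9 = [0.05, 0.08, 0.10, 0.12, 0.20, 0.03, 0.07, 0.09]
ic50_score8 = [0.30, 0.50, 0.25, 0.80, 0.40, 0.60, 0.35, 0.90]

u_stat, p_value = mann_whitney_u_less(ic50_score9, ic50_score8)
print(f"U = {u_stat}, one-sided p = {p_value:.2e}")
```

A small p-value here rejects the null hypothesis that the score-9 IC50 values are equal to or greater than the score-8 values, which matches the direction of the test described in the text.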
To focus on identifying potential ligands that are structurally dissimilar from those in the competition dataset, pairwise Tanimoto scores will be computed between the ligands from the ChemBridge Database and those in the competition dataset. All ligands with a Tanimoto score larger than 0.7 will be discarded. Our model will then predict the IC50 of the remaining ligands, and the top 1000 molecules will be selected.
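The similarity filter and ranking step above can be sketched as follows. This is a minimal illustration under several assumptions: fingerprints are represented as plain Python sets of on-bit indices (a real pipeline would derive them from SMILES with a cheminformatics toolkit), a library ligand is discarded when its highest Tanimoto score against any competition ligand exceeds 0.7, and "top" is taken to mean lowest predicted IC50. The names `tanimoto`, `select_novel_top` and `predict_ic50` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def select_novel_top(library, reference_fps, predict_ic50, max_sim=0.7, top_n=1000):
    """Drop library ligands too similar to any reference, then rank by predicted IC50.

    library: dict mapping ligand name -> fingerprint (set of ints)
    reference_fps: fingerprints of the competition-dataset ligands
    predict_ic50: callable returning a predicted IC50 for a ligand name
    """
    kept = []
    for name, fp in library.items():
        nearest = max((tanimoto(fp, ref) for ref in reference_fps), default=0.0)
        if nearest <= max_sim:  # discard only scores larger than max_sim
            kept.append((predict_ic50(name), name))
    kept.sort()  # ascending: lowest predicted IC50 first
    return [name for _, name in kept[:top_n]]

# Toy data (hypothetical fingerprints and predictions).
reference_fps = [{1, 2, 3, 4}, {5, 6, 7, 8}]
library = {
    "A": {1, 2, 3, 4},     # identical to a reference, similarity 1.0
    "B": {1, 2, 3, 9},     # similarity 3/5 = 0.6
    "C": {10, 11, 12},     # no shared bits, similarity 0.0
    "D": {5, 6, 7, 8, 9},  # similarity 4/5 = 0.8
}
predicted = {"A": 0.01, "B": 0.50, "C": 0.20, "D": 0.05}

selected = select_novel_top(library, reference_fps, predicted.get, top_n=2)
print(selected)
```

In this toy run, ligands A and D are removed by the 0.7 cutoff even though they have the lowest predicted IC50, so the selection favours novel scaffolds over close analogues of known compounds.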
The final top 100 compounds will be selected after filtering by physicochemical properties.