Challenge #4

Hit Identification

Method type (check all that apply)

De novo design

Machine learning

Description of your approach (min 200 and max 800 words)

Our approach is a computer-implemented method for screening ligand candidates against a target protein. It uses an in-house developed, integrated ensemble machine learning (ML) model to predict binding affinity with very high speed and precision.

The inputs to the AI engine are drug candidates in SMILES format, generated by our in-house ML-based de novo molecular generator (iGen), and a protein structure in the form of a .pdb file. The binding pocket features are analyzed, and ligands capable of fitting into the target pocket are identified by matching the features of the binding pocket against those of the ligand molecules. The compounds are filtered and screened with our in-house Ultra-Fast Screening (UFS) approach to retain the best-fitting compounds based on their characteristics. The remaining candidates are ranked by predicted binding affinity, obtained using a novel ML-based scoring function (iScore) trained on the best data, handpicked from the largest available training sets.
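The filter-then-rank flow described above can be sketched as follows. This is only a minimal illustration, not the i-TripleD implementation: `pocket_fit` and `predict_pkd` are hypothetical stand-ins for the proprietary UFS matching and the iScore scoring function, and the threshold and list size are arbitrary assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def screen_and_rank(
    smiles: Iterable[str],
    pocket_fit: Callable[[str], float],   # hypothetical: higher means a better pocket match
    predict_pkd: Callable[[str], float],  # hypothetical: predicted binding affinity (pKd)
    fit_threshold: float = 0.5,           # assumed cut-off, not taken from the method
    top_n: int = 100,
) -> List[Tuple[str, float]]:
    """Keep ligands that pass the fit filter, then rank them by predicted pKd."""
    # Step 1: cheap filter, discard ligands that do not match the pocket features.
    survivors = [s for s in smiles if pocket_fit(s) >= fit_threshold]
    # Step 2: score only the survivors with the (more expensive) affinity model.
    scored = [(s, predict_pkd(s)) for s in survivors]
    # Step 3: highest predicted affinity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Filtering before scoring keeps the affinity model off compounds that cannot fit the pocket, which is the design idea the text describes.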

iGen produces 90.0 % valid SMILES and 87.4 % valid molecules, with compound uniqueness above 99.0 %, at a speed of around 2000 SMILES per second on a single A100 node. At reduced speed, without batched generation, valid SMILES increase to 98.4 % and valid molecules to 95.9 %, while uniqueness remains the same.
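Metrics of this kind can be computed along the following lines. The sketch assumes that validity means the fraction of generated strings that parse, and uniqueness the fraction of distinct canonical forms among the parseable ones; the exact definitions behind the figures above are not stated and may differ. `generation_metrics` is a hypothetical helper, and `rdkit_canonicalize` shows one possible canonicalizer based on RDKit.

```python
from typing import Callable, Dict, Iterable, Optional


def generation_metrics(
    smiles: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> Dict[str, float]:
    """Return validity and uniqueness for a batch of generated SMILES.

    validity   = parseable strings / all strings
    uniqueness = distinct canonical forms / parseable strings
    """
    total = 0
    canonical = []
    for smi in smiles:
        total += 1
        can = canonicalize(smi)  # None signals an unparseable string
        if can is not None:
            canonical.append(can)
    validity = len(canonical) / total if total else 0.0
    uniqueness = len(set(canonical)) / len(canonical) if canonical else 0.0
    return {"validity": validity, "uniqueness": uniqueness}


def rdkit_canonicalize(smi: str) -> Optional[str]:
    """Canonicalize with RDKit; returns None when the SMILES cannot be parsed."""
    from rdkit import Chem  # imported lazily so the metric code runs without RDKit

    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol) if mol is not None else None
```

With RDKit installed, `generation_metrics(batch, rdkit_canonicalize)` would score a list of generated strings; any other callable that returns `None` for invalid input works the same way.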

Given a list of top candidates, the compounds or their close analogues will be sought in the Enamine REAL database (and also in MolPort and eMolecules, if necessary). A database of 100 analogues for each of the top 100 candidates will then be generated and screened using iScore to properly rank the available compounds.
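One common way to find close analogues in a vendor catalogue is a fingerprint similarity search, sketched below. The text does not say how analogues are identified, so the Tanimoto metric and the fingerprint representation (sets of on-bit indices, which could for example be derived from an RDKit Morgan fingerprint) are assumptions, and `nearest_analogues` is a hypothetical helper.

```python
import heapq
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_analogues(
    query: Set[int],
    catalogue: Dict[str, Set[int]],  # hypothetical: compound ID -> fingerprint on-bits
    k: int = 100,
) -> List[Tuple[str, float]]:
    """Return the k catalogue entries most similar to the query, best first."""
    scored = ((cid, tanimoto(query, fp)) for cid, fp in catalogue.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

The default `k=100` mirrors the 100 analogues per candidate mentioned above; the resulting list would then be passed to the scoring function for ranking.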

What makes your approach stand out from the community? (<100 words)

Avoiding conformational sampling considerably speeds up hit identification while yielding some of the most accurate affinity predictions to date. In the CASF-2016 and CSAR benchmarks and in case studies, our tool consistently performs best in scoring power, ranking power, and screening power. With our novel Ultra-Fast Screening (UFS) approach, we can also screen compounds several orders of magnitude faster than any current software we are aware of. Our iGen module builds on the accuracy and speed of our proprietary methods, making exploration of the whole drug-like chemical space feasible, something that has so far been elusive.

Method Name

i-TripleD by ANYO Labs

Commercial software packages used

none

Free software packages used

F-Pocket, D-Pocket, RDKit

Relevant publications of previous uses by your group of this software/method

The software has been developed, thoroughly tested, and refined over the last couple of years. The team behind the project incorporated as ANYO Labs AB in December 2022, and the method became the subject of a patent filing after an in-depth freedom-to-operate (FTO) analysis in January 2023. For this reason, the team has kept the methodology a trade secret and will publish articles related to the method in about a year's time. We are currently preparing articles for 2 of our ongoing projects. Professor Leif Eriksson has several publications in theoretical and computational chemistry, but none related to our current method.

Virtual screening of merged selections

Method type (check all that apply)

Machine learning

Description of your approach (min 200 and max 800 words)

Our approach is a computer-implemented method for screening ligand candidates against a target protein. It uses an in-house developed, integrated ensemble machine learning (ML) model to predict binding affinity with very high speed and precision.

The inputs to the AI engine are drug candidates in SMILES format, either generated by our ML-based de novo molecular generator (iGen) or available in an external dataset, and a protein structure in the form of a .pdb file. The full database of proposed binders (~25 competitors, 100 compounds from each) will first be prepared with respect to the most likely protonation states. The ligand features are then estimated and matched against the binding pocket features of the protein, which are analyzed in the first step. Compounds capable of fitting into the target pocket are identified from this matching between the features of the binding pocket and the ligand molecules. The molecular candidates are ranked according to their predicted binding affinities, obtained using the novel ML-based scoring function iScore, trained on the largest available training sets.

The performance of i-TripleD, owing to its novel ML architectures and optimally trained models, is unprecedented. For example, we screen a database of 1 billion compounds on a single 64-core compute node in less than 36 hours. At that rate, screening the full dataset of the CACHE competition (~2500 compounds) will take less than 1 second.
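The sub-second estimate follows from the stated throughput by simple proportion. The short calculation below assumes that screening time scales linearly with the number of compounds and ignores any fixed start-up cost such as model loading; `screening_time_seconds` is a hypothetical helper written only for this check.

```python
def screening_time_seconds(
    n_compounds: int,
    ref_compounds: int = 1_000_000_000,  # 1 billion compounds, from the text
    ref_hours: float = 36.0,             # upper bound on the reference run time
) -> float:
    """Estimate screening time by linear scaling from the reference run."""
    rate = ref_compounds / (ref_hours * 3600.0)  # compounds per second (about 7700)
    return n_compounds / rate


estimate = screening_time_seconds(2500)
print(f"{estimate:.2f} s")  # about a third of a second under these assumptions
```

Because 36 hours is itself an upper bound, the result is an upper bound on the time for ~2500 compounds under the linear-scaling assumption.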

What makes your approach stand out from the community? (<100 words)

Avoiding conformational sampling speeds up the hit identification process considerably, and our scoring function produces some of the most accurate affinity predictions to date. In the CASF-2016 and CSAR benchmarks and in case studies, iScore was consistently best in scoring power, ranking power, and screening power. With our novel Ultra-Fast Screening (UFS) approach, we can furthermore screen compounds several orders of magnitude faster than any current software we are aware of.

Method Name

i-TripleD by ANYO Labs

Commercial software packages used

none

Free software packages used

F-Pocket, D-Pocket, RDKit

Relevant publications of previous uses by your group of this software/method

Hit Optimization Methods

Method type (check all that apply)

Machine learning

Hybrid of the above

Machine learning for scaffolding

Description of your approach (min 200 and max 800 words)

Based on the best-performing compounds, key scaffold variants will be derived for each molecule. The in-house developed iterative substitutive scaffold optimization (ISSO) is then applied, in which the scaffold, input in SMILES format, is decorated to generate a desired number of analogous compounds. Any 'accessible' atom site can be decorated. The resulting dataset, for example 10 000 derivatives of a given scaffold, is then screened and ranked against the protein active site using the Ultra-Fast Screening of i-TripleD and the iScore scoring function as outlined above. The best N derivatives are then used in a second round of decoration, filtering, and screening, generating successively improved pKd values in iterative cycles until saturation.
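The iterative cycle described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the ISSO implementation: `decorate` and `predict_pkd` are hypothetical stand-ins for the proprietary decoration and scoring steps, and the saturation criterion (`min_gain`) and round limit are assumed values.

```python
from typing import Callable, List, Tuple


def iterative_optimization(
    scaffold: str,
    decorate: Callable[[str, int], List[str]],  # hypothetical: (parent, n) -> derivatives
    predict_pkd: Callable[[str], float],        # hypothetical scoring function
    n_derivatives: int = 10_000,
    keep_best: int = 10,                        # the "best N" carried into the next round
    min_gain: float = 0.05,                     # assumed saturation criterion (pKd units)
    max_rounds: int = 10,                       # assumed safety limit
) -> Tuple[str, float, int]:
    """Decorate, score, keep the best N, and repeat until the gain saturates."""
    best_smiles, best_pkd = scaffold, predict_pkd(scaffold)
    parents = [scaffold]
    rounds = 0
    for _ in range(max_rounds):
        # Spread the derivative budget evenly over the current parents.
        per_parent = max(1, n_derivatives // len(parents))
        candidates = {d for p in parents for d in decorate(p, per_parent)}
        if not candidates:
            break
        ranked = sorted(((predict_pkd(c), c) for c in candidates), reverse=True)
        rounds += 1
        top_pkd, top_smiles = ranked[0]
        gain = top_pkd - best_pkd
        if gain > 0:
            best_smiles, best_pkd = top_smiles, top_pkd
        if gain < min_gain:
            break  # saturation: the best score no longer improves meaningfully
        parents = [c for _, c in ranked[:keep_best]]
    return best_smiles, best_pkd, rounds
```

The function returns the best compound found, its predicted pKd, and the number of rounds executed, so a caller can see when the optimization stopped improving.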

Once the full list of scaffolds has been worked through and a final list of top candidates generated, the compounds themselves or their close analogues will be sought in the Enamine REAL database (and also in MolPort and eMolecules, if necessary). A database of 100 analogues for each of the top 100 candidates will then be generated and screened using iScore to properly rank the available compounds.

What makes your approach stand out from the community? (<100 words)

The iterative substitutive scaffold optimization (ISSO) from ANYO Labs utilizes our i-TripleD software combined with the iGen module for scaffold decoration. Any site of the scaffold can be chosen for generating any number of de novo scaffold decorations aiming to optimize pKd. The generated compound set is screened and ranked as outlined above in the i-TripleD software, and any number from the dataset can be chosen for subsequent iterative rounds, giving maximum flexibility and performance. In a recent test, a scaffold was optimized from a pKd of 7.1 to 8.6 in 3 consecutive iterations.

Method Name

ISSO and i-TripleD by ANYO Labs

Commercial software packages used

none

Free software packages used

F-Pocket, D-Pocket, RDKit

Relevant publications of previous uses by your group of this software/method