Our approach consists of two general steps, each of which has some flexibility.
- Use a fully Bayesian model of binding activity, trained on a mixture of docking scores and experimental activity data from the literature.
- Use established techniques from batch Bayesian optimization to select a list of molecules that maximizes the probability that at least one molecule in the list is active.
Step 2 is the critical part of our proposal: it leans on our group's expertise in Bayesian optimization and will likely differentiate us from other approaches. To motivate it, consider that the final step for most methods will be to select a list of 100 molecules from a potentially much larger pool. One way to do this would be to select the 100 molecules with the highest predicted scores. Although this could work well, the top 100 molecules may all be quite similar (e.g. minor variations of the same scaffold), and therefore their activities will likely be highly correlated. This could be avoided using any number of heuristics, such as choosing at most 10 molecules per scaffold, but such heuristics may end up selecting molecules with low predicted scores (e.g. if only 5 scaffolds are promising, a 10-per-scaffold cap forces half of the list to come from weaker scaffolds). It is not clear how to trade off high predictions against diversity in a general way.
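To make the cost of correlation concrete (with purely illustrative numbers): if each of 100 statistically independent molecules were active with probability 0.05, the chance of at least one hit would be 1 − (1 − 0.05)^100 ≈ 0.99, whereas 100 near-duplicates whose activities are almost perfectly correlated offer little more than the 0.05 chance of a single molecule.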
Fortunately, this problem has been extensively studied in the context of Bayesian optimization (BO). A number of techniques have been proposed, which generally select a set of data points that jointly maximizes a probabilistic or information-theoretic quantity: for example, the total bits of information gained as measured by the model, the expected value of the best point in the batch, or the probability that at least one point in the batch exceeds the best point in the dataset. These objectives naturally avoid selecting batches of very similar data points, because their outcomes are highly correlated: if one molecule is inactive, a similar molecule is probably also inactive, so including it only marginally improves any of these objectives. At the same time, including molecules with low predicted scores also fails to optimize these objectives. In general, these techniques can optimally trade off quality and diversity, with the additional advantages of being principled and usually having an intuitive interpretation. However, calculating these quantities requires special kinds of models, such as Gaussian processes, whose full joint predictive distribution can be computed easily. Our group has extensive expertise in these methods.
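As a minimal sketch of one such criterion (not a commitment to the exact method we will use): given a model posterior summarized by a mean vector `mu` and covariance matrix `cov` over candidate molecules, and a hypothetical activity `threshold`, the batch maximizing the probability of at least one hit can be built greedily from joint posterior samples. Because the samples are drawn jointly, correlations between similar molecules are accounted for automatically.

```python
import numpy as np

def select_batch(mu, cov, threshold, batch_size, n_samples=10_000, seed=0):
    """Greedily build a batch maximizing a Monte Carlo estimate of
    P(at least one selected molecule is active), where a molecule is
    "active" if its latent activity exceeds `threshold`.

    mu:  (N,) posterior mean over candidate molecules
    cov: (N, N) posterior covariance (encodes similarity between molecules)
    """
    rng = np.random.default_rng(seed)
    # Draw joint posterior samples once; near-duplicate molecules are
    # active/inactive together in most samples, so they add little.
    samples = rng.multivariate_normal(mu, cov, size=n_samples)  # (S, N)
    active = samples > threshold                                # (S, N)
    covered = np.zeros(n_samples, dtype=bool)  # samples with >=1 hit so far
    batch = []
    for _ in range(batch_size):
        # Marginal gain of each candidate: fraction of posterior samples
        # it newly "covers" with a hit.
        gains = (~covered[:, None] & active).mean(axis=0)
        gains[batch] = -np.inf  # do not reselect
        best = int(np.argmax(gains))
        batch.append(best)
        covered |= active[:, best]
    return batch, covered.mean()  # indices and estimated P(>=1 active)

# Toy usage: candidates 0 and 1 are near-duplicates (correlation 0.99).
mu = np.array([1.0, 1.0, 0.8])
cov = np.array([[1.0, 0.99, 0.0],
                [0.99, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
batch, p_hit = select_batch(mu, cov, threshold=1.5, batch_size=2)
# The batch prefers the diverse pair {0, 2} over the correlated pair {0, 1},
# even though molecule 2 has a lower predicted score.
```

This greedy scheme is a standard heuristic for coverage-type objectives of this kind; the acquisition functions discussed above generalize the same idea.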
Although we cannot know exactly which model we will use before seeing the data, we expect it to be a Gaussian process trained either on molecular fingerprints or on the learned features of a deep neural network, fit to a mixture of docking data and real-world activity data; this will be the model in step 1. Similarly, we will examine many possible acquisition functions for selecting the batch, many of which are listed here: https://botorch.org/api/acquisition.html
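To illustrate the intended tooling (a sketch under assumed placeholder data, not a final design), the following shows how a BoTorch surrogate and a Monte Carlo batch acquisition function could be combined over a discrete candidate pool. The feature dimensions, dataset sizes, and the choice of qExpectedImprovement are all stand-ins, and a recent BoTorch version is assumed.

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qExpectedImprovement
from botorch.optim import optimize_acqf_discrete
from gpytorch.mlls import ExactMarginalLogLikelihood

# Placeholder data: rows of train_X would be molecular fingerprints or
# neural-network features; train_Y the corresponding activity labels.
train_X = torch.rand(200, 64, dtype=torch.double)
train_Y = torch.rand(200, 1, dtype=torch.double)
pool_X = torch.rand(5000, 64, dtype=torch.double)  # candidate molecules

# Step 1: fit a GP surrogate to the activity data.
model = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

# Step 2: select a batch with a joint (Monte Carlo) acquisition function.
# Because the batch is scored jointly, near-duplicate molecules contribute
# little and the selection stays diverse without ad hoc scaffold heuristics.
acqf = qExpectedImprovement(model=model, best_f=train_Y.max())
batch, _ = optimize_acqf_discrete(acqf, q=10, choices=pool_X)  # q=100 in the real pipeline
```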