The Transformer deep learning architecture offers a novel and efficient tool for accelerating hit identification in antimicrobial drug discovery. Transformers feature an attention mechanism that derives contextual information to improve generalisation on large datasets[1]. This is pertinent to improving predictive performance on imbalanced datasets that contain mostly inactive compounds, and it is relevant to antibiotic discovery because it makes better use of the relatively few active antimicrobial compounds, which often represent far less than 1% of existing large databases. In this study, we investigate whether our predictive Transformer modelling methodology can enhance in silico screening of commercial compound libraries to find novel hits for antimicrobial development.
We developed Transformer models using our CO-ADD database, a crowd-sourced collection of chemically diverse compounds in which over 500,000 compounds have been assayed for inhibitory activity under standardised conditions against five key ESKAPE pathogens (E. coli, K. pneumoniae, A. baumannii, P. aeruginosa, S. aureus (MRSA)) and two fungal species (C. albicans, C. neoformans). Compounds were represented using MolBERT fingerprints[2].
The F1 score measures the trade-off between true positive, false positive, and false negative predictions, which makes it suited to majority-negative datasets where metrics that count true negatives overestimate performance. Initial Transformer models achieved F1 scores of 0.88 (E. coli), 0.78 (K. pneumoniae), 0.83 (A. baumannii), 0.82 (P. aeruginosa), 0.80 (S. aureus), 0.62 (C. albicans), and 0.67 (C. neoformans) on the held-out external validation dataset (n=1607). A multitask fungal Transformer improved F1 to 0.72 (C. albicans) and 0.76 (C. neoformans). All Transformer models achieved double to more than quadruple the true positive rate of the previous CO-ADD deep learning models from 2020.
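To illustrate why F1 is preferred over accuracy on majority-negative data, the following is a minimal sketch that computes both metrics from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not taken from the CO-ADD validation set, and the function names are our own.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); true negatives do not enter the formula."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Accuracy = (TP + TN) / total; dominated by TN when actives are rare."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical screen of 1,000 compounds with 16 true actives (1.6%).
# These numbers are illustrative assumptions, not study results.
tp, fp, fn, tn = 8, 4, 8, 980

example_f1 = f1_score(tp, fp, fn)
example_accuracy = accuracy(tp, fp, fn, tn)

print(f"F1 = {example_f1:.3f}, accuracy = {example_accuracy:.3f}")
```

In this hypothetical case the model misses half of the actives, yet accuracy stays close to 1 because the 980 true negatives dominate the total, whereas F1 reflects the missed and spurious hits directly.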
The E. coli Transformer model set identified 465 potential hits in the ChemDiv Preplated Compound Library (n=100,000) and 145 potential hits in the ChemDiv Natural Products Library (n=17,056). Post-prediction analyses using internal antibiotic resistance mechanism rules, model uncertainty, and applicability domain checks are ongoing to filter and refine the selection of compounds to be acquired for in vitro experimental validation.
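As a quick arithmetic check, the predicted hit rate for each library can be derived from the counts reported above. This is a minimal sketch; the helper name is hypothetical, and the rates describe model predictions before any filtering or experimental confirmation.

```python
def hit_rate_percent(n_hits: int, n_screened: int) -> float:
    """Return predicted hits as a percentage of the compounds screened."""
    return 100.0 * n_hits / n_screened


# Counts as reported for the E. coli Transformer model set.
libraries = {
    "ChemDiv Preplated Compound Library": (465, 100_000),
    "ChemDiv Natural Products Library": (145, 17_056),
}

for name, (n_hits, n_screened) in libraries.items():
    rate = hit_rate_percent(n_hits, n_screened)
    print(f"{name}: {rate:.3f}% predicted hits")
```

Both predicted hit rates fall below 1%, in line with the low proportion of active compounds expected in large screening collections.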