Mathematics | Computer Science


Léon Albrecht, 2002 | Bern, BE


Breast cancer has for years been one of the leading causes of reduced life expectancy in women. A major obstacle to slowing the progression of the disease is the limited availability of drugs. Conventional drug discovery tests many different molecules and modifies them by trial and error, which is time-consuming and costly. This project used machine learning tools to design molecules that specifically inhibit aromatase, a key enzyme in oestrogen synthesis. Previous studies have revealed a correlation between high oestrogen concentrations and the development of hormone-receptor-positive breast cancer. The molecules generated by artificial intelligence (AI) were tested in silico through docking analyses for inhibition of aromatase. Comparative analyses led to the identification of 3 promising drug candidates with high binding affinities, which are predicted to bind competitively to the active site of the target enzyme and inhibit its key function. This work is part of an effort to treat and cure hormone-receptor-positive breast cancer by oestrogen deprivation.


The rise in breast cancer cases and the existing shortage of diverse breast cancer drugs are major obstacles to treating this disease. This led to the research question: Can AI generate potent competitive aromatase inhibitors?


To generate synthetic molecules, a natural language processing (NLP) sequence model was used. It was fed SMILES codes in order to learn their representation; SMILES strings are simplified chemical structures of molecules encoded as text. The model was first pre-trained on a non-specific dataset of over 315,000 random molecules. It was then trained on a target-specific dataset of 58 compounds labelled as active towards the target, selected for their low IC50 values: the better a drug binds to its target, the stronger its inhibitory effect, which is reflected in a low IC50 value. Through training and validation, the model recognized pattern similarities in the target-specific molecules and calculated the probability of each following character in a string. The model then autoregressively generated new sequences, i.e. synthetic molecules. Three datasets were generated that differ in their softmax sampling temperature: V1 (temperature 1), V2 (temperature 2) and V3 (temperature 1.5). A higher temperature is linked to greater diversity in molecular structure. The generated molecules were then filtered according to Lipinski's rule of five and docked using AutoDock Vina. Exemestane and anastrozole, commercially available aromatase inhibitors, served as the control group to validate the docking. They exhibited binding affinities of -8.3 kcal/mol and -7.1 kcal/mol respectively, and these values were used as thresholds of significance. Finally, the molecules with the highest binding affinity in each dataset were extracted and their protein-ligand interactions were analyzed for hydrogen bonds.
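The effect of the softmax sampling temperature on the next-character probabilities can be illustrated with a minimal sketch. It is not the project's actual code: the toy vocabulary, the raw scores and the function names are assumptions chosen for illustration, and a real model would produce the scores itself at every generation step.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_char(vocab, logits, temperature, rng):
    """Draw one character according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical SMILES characters and raw model scores (illustration only).
vocab = ["C", "N", "O", "=", "(", ")"]
logits = [3.0, 1.5, 1.0, 0.5, 0.2, 0.1]

# The three temperatures used for datasets V1, V3 and V2.
for temperature in (1.0, 1.5, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

rng = random.Random(0)  # fixed seed so the draw is reproducible
print(sample_next_char(vocab, logits, 1.5, rng))
```

Dividing the scores by a larger temperature flattens the distribution, so less likely characters are drawn more often. This is why higher temperatures tend to produce more diverse strings.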


The trained LSTM model achieved an accuracy of 99.49% in encoding and decoding SMILES, indicating that it converts the strings reliably. The model generated 29 ligands, 9 of which showed higher binding affinities than exemestane. The binding affinities of the generated molecules range from -4.3 to -9 kcal/mol, with an average of -7.06 kcal/mol. The molecules with the highest binding affinity in each of the 3 datasets featured between 1 and 4 hydrogen bond interactions, which were not present with exemestane.
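The comparison against the control drugs can be sketched as follows. The two thresholds are the control values reported in the methods (-8.3 kcal/mol and -7.1 kcal/mol); the ligand names and scores are made-up placeholders, not results from the project, and the function name is an assumption. A more negative docking score means stronger predicted binding.

```python
# Control binding affinities in kcal/mol, as reported for the docking validation.
EXEMESTANE_KCAL = -8.3
ANASTROZOLE_KCAL = -7.1


def classify_affinity(affinity):
    """Rank a docking score against the two control thresholds."""
    if affinity < EXEMESTANE_KCAL:
        return "stronger than exemestane"
    if affinity < ANASTROZOLE_KCAL:
        return "stronger than anastrozole"
    return "weaker than both controls"


# Hypothetical ligands and scores (illustration only).
scores = {"ligand_a": -8.9, "ligand_b": -7.6, "ligand_c": -5.2}

for name, affinity in sorted(scores.items(), key=lambda item: item[1]):
    print(name, affinity, classify_affinity(affinity))

mean_affinity = sum(scores.values()) / len(scores)
print("mean affinity:", round(mean_affinity, 2))
```

Sorting by score lists the strongest predicted binders first, which mirrors how the best molecule of each dataset was picked out for interaction analysis.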


Judging by the binding values, no clear advantage of any particular sampling temperature can be detected. Some limitations prevented more diverse outcomes, notably the very limited transfer learning dataset of only 58 compounds. The model could have produced more sophisticated molecules if further molecular characteristics, such as physical properties, had been taken into account when predicting molecular structure. It should also be noted that all results have so far been obtained in silico only; experimental validation remains a further hurdle.


In conclusion, we identified 3 potent drug candidates that most probably have an effect on the target protein aromatase. These findings show that applying AI in computational chemistry allows fast generation of target-specific molecules and speeds up processes in drug discovery.



Expert's Appraisal

Dr. Andreas Steiner

This project spans a wide range of disciplines, from computer science (sequence generation) to computational biology (docking analysis) and medicine (inhibiting tumor metabolism). Léon's personal inspiration for this application motivated him to initiate the project in addition to his regular Maturaarbeit and Matura exams. He has shown great enthusiasm for this interdisciplinary project, which combines his strong interest in biology with machine learning and adapts methods from a project targeting COVID metabolism to find new candidate molecules that might specifically inhibit tumor growth.


very good

Special Prize University of Fribourg – Department of Chemistry




Gymnasium Muristalden Bern, Bern
Teacher: Susanne Steiner