Application and Comparison of Deep Learning Methods in the Prediction of RNA Sequence Degradation and Stability

Chemistry | Biochemistry | Medicine

Ankit Singhal, 2004 | Reinach, BL

mRNA vaccines are receiving increased interest as potential alternatives to conventional methods for the prevention of several diseases, including COVID-19. This paper proposes and evaluates three deep learning models (Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Graph Convolutional Network (GCN)) as methods to predict the stability and degradation risk of RNA sequences. Such predictions can aid the development of mRNA vaccines: by helping to identify the most promising candidates, they reduce the number of sequences that must be synthesized and tested. Reasonably accurate results were obtained. The GCN was the best predictor of reactivity, while the GRU was the best overall at predicting the risk of degradation under various conditions, with an accuracy of 76%. The results suggest that applying such methods in mRNA vaccine research is feasible in the near future.

Introduction

How effectively can selected Recurrent Neural Networks predict experimental values for RNA sequences using just their primary structures?

Methods

The dataset was first preprocessed by augmenting it with the CONTRAfold method in the ARNiE package and one-hot encoding it into vector matrices, so that a computer could operate on it. The base-pairing probability (BPP) data was processed into a matrix for the LSTM and GRU, and into a graph for the GCN, with nodes representing the bases and edges representing the non-covalent interactions between them (i.e. hydrogen bonding). The three models were then built using Keras with a TensorFlow backend. Hyperparameter optimization took place at this stage using a grid search (epochs = 50, batch size = 64, K-fold cross-validation with k = 4). Each model was trained, validated, and tested. To generate the results, 20 trials were conducted, with performance measured by two loss functions: Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).
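The two preprocessing ideas described above, one-hot encoding and turning BPP data into a graph, can be illustrated with a minimal sketch. The function names, the base ordering "ACGU", the 0.5 probability cutoff, and the toy BPP values are all assumptions made for illustration; they are not taken from the project, which obtained its BPP data through CONTRAfold.

```python
# Hypothetical helpers; alphabet order and cutoff are illustrative assumptions.
BASES = "ACGU"

def one_hot_encode(sequence):
    """Return one row per base, with a single 1 marking which base it is."""
    rows = []
    for base in sequence.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        rows.append([1 if base == b else 0 for b in BASES])
    return rows

def bpp_to_edges(bpp, threshold=0.5):
    """Turn a symmetric base-pairing probability matrix into an edge list.

    An undirected edge (i, j, p) is kept when the pairing probability p
    between positions i and j is at least `threshold` (an assumed cutoff).
    """
    n = len(bpp)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, since bpp is symmetric
            if bpp[i][j] >= threshold:
                edges.append((i, j, bpp[i][j]))
    return edges

# Toy input with invented probabilities, not real CONTRAfold output.
sequence = "GAUC"
toy_bpp = [
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.6, 0.0],
    [0.0, 0.6, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
]

encoded = one_hot_encode(sequence)  # 4 x 4 matrix, one row per base
edges = bpp_to_edges(toy_bpp)       # graph edges between likely paired bases
print(encoded)
print(edges)
```

In this sketch the nodes are simply the sequence positions, so the one-hot rows could serve as node features while the edge list describes which bases are likely to pair.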

Results

After training, the LSTM performed the best, followed by the GRU and the GCN, with training RMSE values of 0.1089, 0.1143, and 0.1752, respectively. In testing, however, the GRU proved to be the best model overall: it was the best at predicting degradation values under all four degradation conditions, whereas the GCN was the best at predicting the reactivity of a sequence. This suggests that the LSTM was more prone to overfitting. Additionally, when predicting degradation at pH 10 with Mg and degradation at 50 °C, the LSTM had higher RMSE values but lower MAE values than the GCN. Because RMSE penalizes large errors more heavily than MAE does, this suggests that the LSTM is more prone to large individual errors. The Mean Absolute Percentage Error (MAPE) for the GRU was calculated to be 24%. The most accurate model therefore performed at an accuracy of about 76% overall, across all five indicators.
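The relationship between the metrics can be made concrete with a small sketch. It shows how a model can have a higher RMSE but a lower MAE than another when its error is concentrated in one large miss, and how an accuracy figure can be read off as 100% minus the MAPE. That reading is inferred from the wording of the results rather than stated as the project's exact formula, and the numbers below are invented for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: squaring weights large misses more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every miss counts in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented values: model_a makes one large miss, model_b many moderate ones.
y_true = [1.0, 1.0, 1.0, 1.0]
model_a = [1.0, 1.0, 1.0, 1.8]
model_b = [1.3, 0.7, 1.3, 0.7]

print(f"A: RMSE={rmse(y_true, model_a):.2f}  MAE={mae(y_true, model_a):.2f}")
print(f"B: RMSE={rmse(y_true, model_b):.2f}  MAE={mae(y_true, model_b):.2f}")

# Assumed reading of "accuracy": 100% minus the MAPE.
accuracy_b = 100.0 - mape(y_true, model_b)
print(f"B: accuracy={accuracy_b:.0f}%")
```

Here model A has the higher RMSE (0.40 vs 0.30) but the lower MAE (0.20 vs 0.30), the same pattern reported for the LSTM relative to the GCN.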

Discussion

In this manuscript, three deep learning models (two RNNs and a GCN) were applied to the regression task of predicting reactivity and degradation values of RNA sequences, as was the initial goal. The models achieved varying degrees of success, and using two performance metrics made it possible to gauge the relative sizes of individual errors. The models were also quite computationally efficient: they usually required only a couple of hours to train on a virtual machine instance on the Google Cloud Platform with 4 vCPUs, 128 GB RAM, and an 80 GB NVIDIA A100 GPU. This makes them scalable to larger datasets, especially given the resources available in commercial and academic environments. As with any ML project, data limitations proved to be a hurdle, in this case due to the short length of the RNA sequences tested. Although 76% is a reasonably high accuracy rate given the nature of the problem, the remaining inaccuracy is not insignificant. It is also difficult to determine whether this figure is an accurate indication of performance on the longer sequences used in mRNA vaccine technologies (107 vs 2000-3000 bases).

Conclusions

Ultimately, despite its limitations, the work presented demonstrates that deep learning algorithms are a promising way to save time in mRNA stability research, and time is especially valuable during disease outbreaks. In its current form, the GRU model could be useful with a binary classifier added at its end to predict whether a molecule is stable enough for further resources to be spent on it. By minimizing the false negative rate, it could serve as a screening tool to remove highly unstable sequences. As more robust datasets with longer RNA sequences are published, the above models could be retrained to determine their applicability to longer sequences as well, bringing them another step closer to becoming fully fledged research tools.
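The screening idea can be sketched as a simple cutoff applied to a predicted degradation score. This is only an illustration under stated assumptions: "positive" is taken to mean "stable", so a false negative is a stable sequence that the screen wrongly discards. The function names, scores, labels, and cutoffs are hypothetical and do not come from the project.

```python
def screen(pred_degradation, cutoff):
    """Label a sequence stable (True) when its predicted degradation is <= cutoff."""
    return [score <= cutoff for score in pred_degradation]

def false_negative_rate(is_stable_true, is_stable_pred):
    """Fraction of truly stable sequences that the screen wrongly rejects."""
    positives = [(t, p) for t, p in zip(is_stable_true, is_stable_pred) if t]
    if not positives:
        raise ValueError("no truly stable sequences to evaluate")
    missed = sum(1 for _, p in positives if not p)
    return missed / len(positives)

# Invented predictions and ground-truth labels for six candidate sequences.
predicted = [0.2, 0.4, 0.55, 0.7, 0.9, 1.3]
truly_stable = [True, True, True, False, True, False]

strict = screen(predicted, cutoff=0.5)   # rejects many candidates
lenient = screen(predicted, cutoff=1.0)  # rejects only the worst scores

print("strict  FNR:", false_negative_rate(truly_stable, strict))
print("lenient FNR:", false_negative_rate(truly_stable, lenient))
```

With these toy values the lenient cutoff discards only the sequence with the highest predicted degradation and loses no stable candidate, which matches the conservative screening role described above. The price is that some unstable sequences still pass through to later testing.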

Expert's Appraisal

Dr. Eileen Jackson

Ankit’s work applies a highly advanced technique, deep learning, to a highly relevant problem: the prediction of RNA sequence degradation and stability. The work shows an understanding of the field and applies it in practice to a real-life problem.

Rating:

excellent

Special Prize, Gebauer Stiftung – Regeneron International Science and Engineering Fair (ISEF)

International School Basel, Reinach
Teacher: Nicola Mason