Mathematics | Computer Science
Sean Findlay, 2004 | Reinach, BL
Machine learning allows computers to learn to solve complex problems by analysing data associated with the task at hand. This project explored how such methods could be applied to the task of music generation. Using a dataset of more than ten thousand pieces of music for solo piano, various machine learning models were trained with the task of producing similar, but original music. In a first step, Markov models and Long Short-Term Memory (LSTM) networks were developed for this task, along with a set of metrics for assessing how well the models performed. It was found that these models suffered from overfitting and so failed to generate truly original music. Thus, in a second stage, these models were exchanged for the Transformer, a more modern deep neural network architecture. Two adjustments of the Transformer architecture were developed. The quality of the output produced by each of the two architectures was analysed using a similarity metric as well as through subjective listening. Based on this analysis, it was shown that both of the Transformer architectures developed are capable of generating music of significantly higher quality than the earlier Markov models and LSTM networks.
Machine learning methods enable a computer to learn complex associations between variables by presenting a model with a large number of examples demonstrating the interrelation of these variables. Such methods allow the programmer to develop a program which completes a certain task without explicitly telling the computer how to do so, or the precise rules and intricacies of said task. Using a dataset consisting of more than ten thousand pieces of music for solo piano, I wanted to train various machine learning models to generate new pieces of music in a similar style to these original pieces.
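As a minimal illustration of learning to generate from examples, the sketch below builds a first-order Markov model over pitches, the simplest of the model families named in this project. It is not the project's implementation: the toy melodies, the function names `train_markov` and `generate`, and the use of MIDI pitch numbers as the only feature are all assumptions made for the example.

```python
import random
from collections import defaultdict

def train_markov(sequences):
    """Record, for every pitch, the pitches that followed it in training."""
    transitions = defaultdict(list)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            transitions[current].append(following)
    return dict(transitions)

def generate(transitions, start, length, rng):
    """Walk the chain, sampling each next pitch from the observed followers."""
    output = [start]
    while len(output) < length:
        followers = transitions.get(output[-1])
        if not followers:  # dead end: this pitch never had a successor
            break
        output.append(rng.choice(followers))
    return output

# Hypothetical toy data: MIDI pitch numbers of two short melodies.
training = [[60, 62, 64, 65, 67], [60, 64, 67, 64, 60]]
model = train_markov(training)
melody = generate(model, start=60, length=8, rng=random.Random(0))
```

Every transition in `melody` was seen in the training data, so no rule about which pitch may follow which had to be written by hand. The same property hints at the weakness reported for this model family: with longer contexts and limited data, such a model can do little more than replay fragments of its training pieces.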
A first step showed that Markov models and LSTM networks suffered from overfitting and so failed to generate truly original music. Transformer models were therefore trained on random subsequences of notes from the training data. Two adaptations of the basic Transformer architecture were employed. The first, a one-dimensional architecture, treated notes as a contiguous array of values, without distinguishing which values represent pitch and which represent duration. The second was a double-headed architecture that allows the model to consider pitch and duration in two separate steps: pitch is predicted first and then, based on this pitch prediction, a second step predicts the duration.
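The data flow described above can be sketched in plain Python. This is only an illustration under stated assumptions: `pitch_head` and `duration_head` are hypothetical hand-written stand-ins for what would be neural network outputs in a real Transformer, and the note format `(pitch, duration)` and all probabilities are invented for the example.

```python
import random

def random_subsequence(notes, length, rng):
    """Pick a random contiguous window of `length` notes from one piece."""
    start = rng.randrange(len(notes) - length + 1)
    return notes[start:start + length]

def flatten(notes):
    """One-dimensional view: pitch and duration values in a single flat list."""
    return [value for note in notes for value in note]

def sample(distribution, rng):
    """Draw one key from a {value: probability} dictionary."""
    values = list(distribution)
    weights = [distribution[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

# Hypothetical stand-ins for the two heads (assumption: toy rules, not a network).
def pitch_head(context):
    last_pitch = context[-1][0]
    return {last_pitch - 2: 0.3, last_pitch + 2: 0.4, last_pitch + 4: 0.3}

def duration_head(context, pitch):
    # The duration distribution depends on the pitch chosen in step one.
    if pitch > context[-1][0]:
        return {0.25: 0.7, 0.5: 0.3}
    return {0.5: 0.4, 1.0: 0.6}

def next_note(context, rng):
    pitch = sample(pitch_head(context), rng)               # step 1: pitch
    duration = sample(duration_head(context, pitch), rng)  # step 2: duration given pitch
    return (pitch, duration)

rng = random.Random(0)
piece = [(60, 0.5), (62, 0.5), (64, 1.0), (65, 0.25), (67, 0.5), (72, 1.0)]
context = random_subsequence(piece, 4, rng)
generated = list(context)
for _ in range(8):
    generated.append(next_note(generated, rng))
```

`flatten(context)` shows the input as the one-dimensional architecture would see it, a single list in which pitch and duration values alternate. `next_note` shows the double-headed order of operations, in which the duration is chosen only after the pitch is fixed.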
A statistical distance metric was developed to assess the quality of a model’s output by its similarity to the pieces in the original dataset; a lower value indicates greater similarity. All Transformer models had a significantly lower statistical distance than the LSTM models, confirming the empirical observation that the Transformer models generate output of much higher quality than the LSTM models. The data did not indicate that the double-headed architecture performs better overall than the one-dimensional architecture, although the double-headed model trained on absolute note data produced the output with the lowest statistical distance to the original dataset.
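The summary does not define the metric itself, so the following is only one hypothetical instance of a statistical distance where lower means more similar: the total variation distance between pitch-class histograms. The choice of feature (pitch class), the choice of distance, and the toy data are all assumptions and should not be read as the metric developed in the project.

```python
from collections import Counter

def pitch_class_distribution(pitches):
    """Relative frequency of each of the 12 pitch classes (C, C#, ..., B)."""
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    return [counts.get(pc, 0) / total for pc in range(12)]

def total_variation(p, q):
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical toy data given as MIDI pitch numbers.
dataset = [60, 62, 64, 65, 67, 69, 71, 72]        # C major scale
close_output = [60, 64, 67, 72, 62, 65, 69, 71]   # same notes, different order
far_output = [61, 63, 66, 68, 70, 61, 63, 66]     # only notes outside C major

reference = pitch_class_distribution(dataset)
d_close = total_variation(reference, pitch_class_distribution(close_output))
d_far = total_variation(reference, pitch_class_distribution(far_output))
```

Here `d_close` is smaller than `d_far`, matching the convention that a lower value indicates output closer to the dataset. A histogram-based distance of this kind ignores note order, so on its own it cannot tell original music from a shuffled copy of the training data; that is one reason to pair such a metric with listening.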
While both architectures successfully generated original music, on subjective listening the pieces generated by the double-headed architecture were the most tonal. Further, the pieces generated by the double-headed model trained on absolute pitches made better use of the full pitch range and sounded more pleasing than those from the same architecture trained on intervals. Pieces generated by the one-dimensional architecture often lacked polyphony, had large sections with very low note density, and were prone to repeating loops of notes in direct succession.
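The two pitch encodings compared here can be illustrated as follows. This is a sketch that assumes pitches are MIDI note numbers and that an interval is the signed difference in semitones between consecutive notes; the function names are hypothetical.

```python
def to_intervals(pitches):
    """Encode a melody as signed semitone steps between consecutive notes."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def to_absolute(first_pitch, intervals):
    """Decode interval steps back into absolute pitches by a running sum."""
    pitches = [first_pitch]
    for step in intervals:
        pitches.append(pitches[-1] + step)
    return pitches

melody = [60, 64, 67, 72]              # absolute encoding (C4, E4, G4, C5)
intervals = to_intervals(melody)       # interval encoding: [4, 3, 5]
transposed = [p + 5 for p in melody]   # the same melody a fourth higher
```

`to_intervals(transposed)` equals `intervals`, so the interval encoding cannot distinguish a melody from its transpositions, and the register of a decoded piece depends entirely on the starting pitch and the running sum. The absolute encoding keeps the register explicit in every note.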
In this project, various machine learning models were developed and trained to generate original music based on music provided as training data. Two Transformer-based architectures succeeded in learning to do so, with the double-headed architecture producing output judged subjectively as slightly more “musical” than the one-dimensional architecture’s output. Both performed significantly better at this task than the Markov models and LSTM networks developed. A possible next step would be to swap the order of the heads in the double-headed architecture and condition the prediction of the pitch on the duration chosen in advance. Additionally, employing hyperparameter tuning may help to improve the quality of the models’ output.
Expert's Appraisal
The project deals with the challenging task of generating music with machine learning (ML) methods. Working with the MIDI standard, Sean implemented several approaches to this goal, some of which produce good-quality music. In the second project iteration he explored novel architectures, using a so-called “double-headed Transformer”. A subjective evaluation shows that the generated outputs do indeed sound musical to human ears and are of satisfying quality.
Gymnasium Münchenstein, Münchenstein
Teacher: Alexandre Warin