[2106.05630] MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training

Significance

Learning musical representations of symbolic music scores

Keypoints

  • Build a large-scale symbolic music corpus
  • Propose encoding and self-supervised pre-training practice for symbolic MIDI data
  • Demonstrate performance on four downstream tasks for music understanding

Review

Background

Music shares certain similarities with natural language. While pre-trained natural language models (e.g. BERT) have provided powerful representation learning from unlabelled text corpora, applying the idea directly to music is more complicated. For example, music is more structural, and each note carries multiple attributes such as pitch and instrument. Other practical difficulties include the lack of large-scale music corpora, and the fact that a naive encoding of the note sequence is too long to be processed at once. The authors address these issues with practical solutions: building a large-scale symbolic music corpus, proposing an efficient encoding of music notes, and pre-training the model with a masking scheme that fits the musical structure.

Keypoints

Build a large-scale symbolic music corpus

The authors focus on symbolic music representation learning, where songs are written as a set of symbols denoting musical information such as the pitch, velocity, and duration of each note. The Million MIDI Dataset (MMD) is a new large-scale symbolic MIDI dataset which contains over 1.5 million songs and 2 billion notes. The songs are deduplicated so that models can learn powerful representations from the dataset.
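The review does not detail the deduplication step, but a minimal sketch of the general idea could look like the following: hash the note content of each song rather than the raw MIDI bytes, so that files that differ only in metadata still collide. The function names and the (pitch, duration) fingerprint are illustrative assumptions, not the authors' exact pipeline.

```python
import hashlib
from collections import defaultdict

def note_fingerprint(notes):
    """Hash the musical content of a song.
    `notes` is assumed to be a list of (pitch, duration) tuples."""
    payload = ",".join(f"{p}:{d}" for p, d in notes)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def deduplicate(songs):
    """Keep one representative song per fingerprint.
    `songs` maps a song name to its note list."""
    buckets = defaultdict(list)
    for name, notes in songs.items():
        buckets[note_fingerprint(notes)].append(name)
    return [names[0] for names in buckets.values()]
```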

Propose encoding and self-supervised pre-training practice for symbolic MIDI data

OctupleMIDI encoding

To encode musical notes more efficiently, the authors propose the OctupleMIDI encoding, which represents each note with eight elements: time signature, tempo, bar, position, instrument, pitch, duration, and velocity. OctupleMIDI significantly reduces the number of tokens compared to baseline encoding methods (CP, REMI).

[Figure 210611-1: Scheme of OctupleMIDI and other baseline encodings]

The embeddings of the eight elements are concatenated and linearly mapped to form a single token per note before being input to the model.
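The element names below follow the paper's octuple, but the vocabulary sizes, embedding dimension, and class/function names are illustrative assumptions. This is a minimal PyTorch sketch of the concatenate-and-project step that turns one octuple into one input token.

```python
from dataclasses import dataclass, astuple

import torch
import torch.nn as nn

# Illustrative octuple: one note = eight symbolic attributes.
@dataclass
class OctupleToken:
    time_signature: int
    tempo: int
    bar: int
    position: int
    instrument: int
    pitch: int
    duration: int
    velocity: int

class OctupleEmbedding(nn.Module):
    """Embed each of the eight elements separately, concatenate the
    embeddings, and project to the Transformer hidden size so that
    each note becomes a single input token."""

    def __init__(self, vocab_sizes, element_dim=64, hidden_dim=512):
        super().__init__()
        assert len(vocab_sizes) == 8
        self.embeddings = nn.ModuleList(
            nn.Embedding(v, element_dim) for v in vocab_sizes
        )
        self.proj = nn.Linear(8 * element_dim, hidden_dim)

    def forward(self, tokens):
        # tokens: (batch, seq_len, 8) integer ids, one octuple per note
        parts = [emb(tokens[..., i]) for i, emb in enumerate(self.embeddings)]
        return self.proj(torch.cat(parts, dim=-1))  # (batch, seq_len, hidden_dim)

# Example: a single C4 note becomes one token instead of several.
note = OctupleToken(time_signature=0, tempo=24, bar=0, position=0,
                    instrument=0, pitch=60, duration=8, velocity=20)
ids = torch.tensor([[astuple(note)]])        # shape (1, 1, 8)
hidden = OctupleEmbedding([256] * 8)(ids)    # shape (1, 1, 512)
```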

Bar-level masking strategy

To pre-train the Transformer encoder in a self-supervised way as in BERT, the authors employ a masking strategy. However, randomly masking individual tokens is not as effective for symbolic music data, since adjacent notes within the same bar can leak too much information about the masked note. Bar-level masking therefore masks all tokens of the same element type within a bar at once. MusicBERT is the self-supervised pre-training of symbolic music data with the OctupleMIDI encoding and the bar-level masking scheme.

[Figure 210611-2: Scheme of the proposed MusicBERT]

Experiments show that bar-level masking is essential for properly learning symbolic music representations with the Transformer encoder.
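A minimal sketch of that idea is shown below: instead of masking individual notes, pick (bar, element) pairs and mask that element for every note in the chosen bar. The function and argument names, the mask ratio, and the simple replace-with-mask-id scheme are assumptions for illustration; the paper's implementation details may differ.

```python
import random

def bar_level_mask(tokens, bars, mask_id, mask_ratio=0.15):
    """Mask one element type across an entire bar at a time.

    tokens: list of lists, each inner list holding the 8 element ids of a note
    bars:   list of bar indices, one per note
    """
    masked = [list(t) for t in tokens]
    # Candidate masking units are (bar, element) pairs, not single notes.
    pairs = sorted({(b, e) for b in set(bars) for e in range(8)})
    chosen = set(random.sample(pairs, int(mask_ratio * len(pairs))))
    for i, bar in enumerate(bars):
        for element in range(8):
            if (bar, element) in chosen:
                masked[i][element] = mask_id
    return masked
```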

Demonstrate performance on four downstream tasks for music understanding

The performance of MusicBERT is evaluated on four downstream tasks: melody completion, accompaniment suggestion, genre classification, and style classification. MusicBERT outperforms the other baseline models in all four tasks.

[Figure 210611-3: Performance of MusicBERT in four music understanding tasks]

Ablating either the OctupleMIDI encoding or the bar-level masking results in significantly degraded performance.

[Figure 210611-4: Ablation of OctupleMIDI encoding]
[Figure 210611-5: Ablation of bar-level masking]

Training the model from scratch without pre-training also shows the importance of self-supervised pre-training for better performance on the downstream tasks.

[Figure 210611-6: Effect of MusicBERT pre-training in music understanding tasks]
