[2111.09296] XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

Significance

Large-scale speech models are here

Keypoints

  • Introduce a self-supervised large-scale cross-lingual wav2vec model
  • Demonstrate the performance of the proposed model through experiments

Review

Background

Pre-trained language models at scale with billions of parameters, such as GPT-3 and HyperCLOVA, have opened a new horizon in the field of natural language representation learning. The authors bring this simple idea to the domain of speech representation learning: a wav2vec 2.0 model with up to 2 billion parameters is pre-trained on 436K hours of multilingual speech data drawn from multiple sources.

211118-1 Multi-lingual speech data & wav2vec model at scale

As can be expected (but cannot easily be verified without resources at this scale), the pre-trained model achieves state-of-the-art performance on downstream speech tasks.

Keypoints

Demonstrate the performance of the proposed model through experiments

For a detailed description of the dataset and the model, refer to the original paper. In short, the pre-training dataset comprises 436K hours of speech from VoxPopuli, Multilingual LibriSpeech, CommonVoice, VoxLingua107, and BABEL, and the model is trained by solving a contrastive task over masked latent feature representations.
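The sketch below illustrates the core of this wav2vec 2.0-style contrastive objective in PyTorch. It is a minimal sketch under assumed tensor shapes (not the authors' implementation, and the codebook diversity loss is omitted): context vectors at masked positions must identify the true quantized latent among sampled distractors.

```python
# Minimal sketch of the wav2vec 2.0-style contrastive objective used for
# XLS-R pre-training (hypothetical tensor names; not the authors' code).
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, distractors, temperature=0.1):
    """
    context:     (B, T, D) transformer outputs at masked positions
    targets:     (B, T, D) quantized latents at the same positions (positives)
    distractors: (B, T, K, D) K quantized latents sampled from other masked
                 positions of the same utterance (negatives)
    """
    # Cosine similarity between context vectors and positive targets: (B, T)
    pos = F.cosine_similarity(context, targets, dim=-1)
    # Cosine similarity against each distractor: (B, T, K)
    neg = F.cosine_similarity(context.unsqueeze(2), distractors, dim=-1)
    # Logits over [positive, K negatives], scaled by temperature: (B, T, 1 + K)
    logits = torch.cat([pos.unsqueeze(-1), neg], dim=-1) / temperature
    # The positive is always at index 0, so the loss is cross-entropy towards 0.
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```

Minimizing this loss forces the context network to predict the quantized representation of the masked audio from its surroundings, which is what makes the learned representations transferable across languages and tasks.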

211118-2 The XLS-R model comes in three sizes depending on the number of parameters

The performance of the pre-trained model is evaluated on various speech representation tasks with benchmark datasets.

211118-3 Speech translation BLEU performance (X->En) on CoVoST-2

211118-4 Speech translation BLEU performance (En->X) on CoVoST-2

211118-5 Speech recognition WER performance on BABEL

211118-6 Phoneme recognition PER performance on CommonVoice

211118-7 Language identification ER performance on VoxLingua107

211118-8 Speaker identification accuracy performance on VoxCeleb1

It can be seen that the pre-trained XLS-R outperforms most of the baseline methods across these speech tasks and datasets.
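As a concrete illustration of how such a pre-trained model is put to use, the snippet below loads a released XLS-R checkpoint as a feature extractor. This is a minimal sketch assuming the Hugging Face transformers library and the publicly released facebook/wav2vec2-xls-r-300m checkpoint (the 0.3B-parameter variant), not a recipe from the paper.

```python
# Minimal sketch: extract contextual representations from an XLS-R checkpoint,
# assuming the Hugging Face `transformers` library and the public
# `facebook/wav2vec2-xls-r-300m` checkpoint (0.3B-parameter variant).
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

model_id = "facebook/wav2vec2-xls-r-300m"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2Model.from_pretrained(model_id)
model.eval()

# Dummy 1-second waveform at 16 kHz; replace with real audio for actual use.
waveform = torch.zeros(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# (batch, frames, hidden) contextual representations over ~20 ms frames.
print(outputs.last_hidden_state.shape)
```

Downstream heads (a CTC layer for speech recognition, classifiers for language and speaker identification, a sequence-to-sequence decoder for translation) are then fine-tuned on top of such representations, which mirrors how the downstream evaluations above are set up.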
