[2112.10684] Efficient Large Scale Language Modeling with Mixture-of-Experts

Significance

Meta is working on efficient language models with MoE too

Keypoints

  • Demonstrate the performance of large-scale language models with mixture-of-experts through experiments

Review

Background

A week ago, I reviewed a paper from Google that demonstrated a compute-efficient large-scale language model by introducing Mixture-of-Experts (MoE). This paper has a very similar motivation and idea, again introducing MoE into the language model for efficient computation while maintaining performance. For a brief review of the importance of efficiency in model scaling, please refer to my previous post.

Keypoints

Demonstrate the performance of large-scale language models with mixture-of-experts through experiments

The dense model architecture is based on GPT-3, with two differences: (i) only dense attention is used, and (ii) sinusoidal positional embeddings are used. The sparse counterpart is based on GShard, with 512 experts in each expert layer and top-2 expert selection.

211221-1 Specification of the experimented models
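To make the routing concrete, below is a minimal sketch of a top-2 gated expert layer in the spirit of the GShard-style layers described above. The class name, layer sizes, and the per-expert gathering loop are illustrative assumptions for readability; the paper's actual implementation shards experts across GPUs and uses capacity limits and a load-balancing loss, none of which is shown here.

```python
# Minimal sketch (assumed names/shapes) of a top-2 gated mixture-of-experts
# feed-forward layer. The paper follows GShard, which additionally shards
# experts across workers, enforces expert capacity, and adds a load-balancing
# loss; those details are omitted here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 512):
        super().__init__()
        # One gating projection scores every expert for every token.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token is sent to its top-2 experts.
        gate_probs = F.softmax(self.gate(x), dim=-1)        # (tokens, experts)
        top2_p, top2_idx = gate_probs.topk(2, dim=-1)       # (tokens, 2)
        top2_p = top2_p / top2_p.sum(dim=-1, keepdim=True)  # renormalize the pair
        out = torch.zeros_like(x)
        for k in range(2):                                  # first and second choice
            for e in top2_idx[:, k].unique():
                sel = top2_idx[:, k] == e                   # tokens routed to expert e
                out[sel] += top2_p[sel, k:k + 1] * self.experts[int(e)](x[sel])
        return out


# Tiny usage example (small sizes so it runs anywhere; the paper uses 512 experts).
layer = Top2MoELayer(d_model=16, d_ff=64, num_experts=8)
tokens = torch.randn(10, 16)
print(layer(tokens).shape)  # torch.Size([10, 16])
```

Because each token activates only 2 of the experts, per-token compute stays close to that of a dense feed-forward layer even as the total parameter count grows, which is where the compute savings reported below come from.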

As expected, MoE models achieve performance comparable to their dense counterparts with roughly 4 times less compute.

211221-2 Perplexity as a function of ZFLOPs for in-domain (left) and out-of-domain (right) data.

211221-3 Average zero-shot priming accuracy as a function of ZFLOPs.

Other results are consistent with those of GLaM, further supporting the efficiency of MoE models.

Efficient language models are expected to reduce the energy consumption and CO2 emissions associated with large-scale computing.

211221-4 Estimated training time and CO2 emissions of the experimented models

Can MoE become a standard option for further scaling of language models?
