[2205.01068] OPT: Open Pre-trained Transformer Language Models

Significance

Pre-trained large language models opened to the public for responsible AI

Keypoints

  • Report technical details for implementing, training, evaluating, and sharing pre-trained language models

Review

Background

A well-known downside of large-scale language models is that they are pre-trained on large text corpora that tend to contain biased and toxic language. Furthermore, these pre-trained models consist of hundreds of billions of parameters and cannot easily be shared with the public in their full form. The authors argue that this situation can be ethically harmful, and pre-train 8 Transformer-based language models in order to open them to the public. The paper reports the technical details for implementing, training, evaluating, and sharing these pre-trained models.

Models

The authors train 8 Transformer-based models, named Open Pre-trained Transformers (OPT), ranging from 125M to 175B parameters. The 175B-parameter OPT model is designed to roughly match the GPT-3 model.

220503-1 Specification of 8 Transformer models

The number of layers (#L), heads (#H), embedding size (d_model), and peak learning rate (LR) are reported in the table above.
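As a rough sanity check on these numbers, the usual decoder-only Transformer estimate of about 12 · d_model² weights per layer (attention plus feed-forward, ignoring embeddings) recovers the headline sizes. The sketch below is illustrative only; it assumes the GPT-3-style 96-layer / 12288-dimensional configuration for the largest model, a GPT-3-style 12-layer / 768-dimensional configuration for the smallest, and a ~50k byte-level BPE vocabulary.

```python
# Rough decoder-only Transformer size estimate (illustrative, not the paper's
# exact accounting): ~4*d^2 attention weights plus ~8*d^2 feed-forward weights
# per layer (FFN width 4*d), i.e. ~12*d^2 per layer, plus token embeddings.
def approx_params(n_layers: int, d_model: int, vocab_size: int = 50272) -> int:
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model          # ~50k byte-level BPE vocabulary
    return n_layers * per_layer + embeddings

print(f"{approx_params(96, 12288) / 1e9:.0f}B")   # GPT-3-scale config -> ~175B
print(f"{approx_params(12, 768) / 1e9:.2f}B")     # smallest config    -> ~0.12B
```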

Pre-training corpus

The pre-training corpus contains roughly 180B tokens, obtained by concatenating and filtering the datasets used in previous works (RoBERTa, the Pile, PushShift.io Reddit).
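For a sense of what "180B tokens" measures, here is a minimal sketch of counting byte-level BPE tokens over a concatenated text corpus. It assumes the GPT-2 tokenizer (which OPT reuses) and a placeholder directory layout of .txt files.

```python
from pathlib import Path
from transformers import GPT2TokenizerFast

# GPT-2 byte-level BPE tokenizer, which OPT reuses for its corpus.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(corpus_dir: str) -> int:
    """Count BPE tokens over all .txt files under corpus_dir (placeholder layout)."""
    total = 0
    for path in Path(corpus_dir).glob("**/*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        total += len(tokenizer(text)["input_ids"])
    return total

# e.g. print(f"{count_tokens('corpus/') / 1e9:.1f}B tokens")
```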

Training process

The OPT-175B model was trained on 992 80GB A100 GPUs over roughly two months. Training was restarted from earlier checkpoints whenever the loss diverged or a hardware failure was detected.
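The restart procedure can be pictured as a thin wrapper around an ordinary training loop: checkpoint periodically, and roll back to the last checkpoint when the loss spikes or turns NaN. The PyTorch-style sketch below is not the authors' actual infrastructure; the smoothing window, spike threshold, and Hugging Face-style `model(**batch).loss` interface are assumptions.

```python
import torch

def train_with_restarts(model, optimizer, loader, max_steps,
                        ckpt_path="ckpt.pt", ckpt_every=1000, spike_factor=2.0):
    """Sketch: checkpoint periodically; if the loss spikes or goes NaN, reload
    the last checkpoint and continue (the paper also lowered the learning rate
    when restarting)."""
    running_loss, step = None, 0
    while step < max_steps:
        for batch in loader:
            loss = model(**batch).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Track a smoothed loss to detect divergence.
            value = loss.item()
            running_loss = value if running_loss is None else 0.99 * running_loss + 0.01 * value
            if value != value or value > spike_factor * running_loss:   # NaN or spike
                state = torch.load(ckpt_path)                           # roll back
                model.load_state_dict(state["model"])
                optimizer.load_state_dict(state["optimizer"])
                step, running_loss = state["step"], None
                break                                                   # restart from checkpoint

            step += 1
            if step % ckpt_every == 0:
                torch.save({"model": model.state_dict(),
                            "optimizer": optimizer.state_dict(),
                            "step": step}, ckpt_path)
            if step >= max_steps:
                break
```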

Evaluation

The pre-trained OPT models were evaluated with zero- and few-shot prompting on 16 standard NLP tasks: HellaSwag, StoryCloze, PIQA, ARC Easy and Challenge, OpenBookQA, WinoGrad, WinoGrande, and SuperGLUE. Zero/few-shot prompting performance of OPT was comparable to that of GPT-3.

220503-2 Zero-shot performance results

220503-3 Few-shot performance results
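Zero-shot prompting on these multiple-choice tasks is typically scored by comparing the model's log-likelihood of each candidate completion and picking the most likely one. Below is a minimal sketch using the publicly released OPT-125M checkpoint on Hugging Face; the Winograd-style prompt and candidates are made-up examples, and the paper's actual evaluation settings follow GPT-3's.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities of the completion tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits.log_softmax(dim=-1)
    # Score only the completion tokens, each predicted from the previous position.
    n_prompt = prompt_ids.shape[1]
    target = full_ids[0, n_prompt:]
    pred = logits[0, n_prompt - 1:-1]
    return pred[torch.arange(len(target)), target].sum().item()

# Illustrative binary-choice item: pick the candidate the model finds more likely.
prompt = "The trophy didn't fit in the suitcase because it was too"
candidates = [" big", " small"]
print(max(candidates, key=lambda c: completion_logprob(prompt, c)))
```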

Perplexity and Unigram F1 were also evaluated on multiple open-source dialogue datasets against other benchmark models, and OPT showed performance competitive with supervised models.

220503-4 Dialogue evaluation results
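Unigram F1, reported above, is the harmonic mean of unigram precision and recall between a generated response and the reference. A small self-contained sketch follows; plain lowercased whitespace tokenization is an assumption, and benchmark implementations normally apply extra text normalization.

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """F1 of unigram overlap between a predicted and a reference response."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("i love hiking in the mountains", "i really love the mountains"))
```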

Bias & Toxicity

Potential harms of OPT-175B were studied with benchmark datasets: hate speech detection (ETHOS), bias measurement (CrowS-Pairs, StereoSet), tendency to generate toxic language (RealToxicityPrompts), and dialogue safety evaluations (SaferDialogues, Safety Bench Unit Tests).

220503-5 Hate speech detection results. F1 scores are reported.

220503-7 CrowS-Pairs bias measure results. Lower is better.

220503-8 StereoSet bias measure results.

220503-9 RealToxicityPrompts results. OPT-175B is more likely to generate toxic responses.

220503-10 Dialogue safety evaluation results. OPT-175B performs worse in the ‘Unsafe’ setting
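A RealToxicityPrompts-style check boils down to sampling continuations for a set of prompts and scoring how toxic the continuations are. The sketch below uses the released OPT-125M checkpoint and a toy word-list scorer as a stand-in for the external toxicity classifier the paper relies on; the prompts, word list, and sampling settings are made up.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

TOXIC_WORDS = {"idiot", "stupid", "hate"}   # toy stand-in word list

def toxicity_score(text: str) -> float:
    """Toy heuristic; a real evaluation plugs in a trained toxicity classifier."""
    words = text.lower().split()
    return sum(w.strip(".,!?") in TOXIC_WORDS for w in words) / max(len(words), 1)

def mean_continuation_toxicity(prompts, n_samples=5, max_new_tokens=20):
    """Sample continuations per prompt and average their toxicity scores."""
    scores = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, do_sample=True, top_p=0.9,
                                 max_new_tokens=max_new_tokens,
                                 num_return_sequences=n_samples)
        for seq in outputs:
            continuation = tokenizer.decode(seq[inputs.input_ids.shape[1]:],
                                            skip_special_tokens=True)
            scores.append(toxicity_score(continuation))
    return sum(scores) / len(scores)

# e.g. print(mean_continuation_toxicity(["So I turned around and said"]))
```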

The authors conclude that large language models like OPT-175B are still premature for commercial deployment, and comment that sharing the full pre-trained model parameters can encourage people from various fields to improve the model toward responsible applications.
