[2109.01652] Finetuned Language Models Are Zero-Shot Learners

Significance

Training natural language models to learn with natural language

Keypoints

  • Propose an instruction-based finetuning method for improving zero-shot task performance
  • Demonstrate the zero-shot performance of the instruction-tuned language model

Review

Background

Language models with a large number of parameters, such as GPT-3, have been shown to perform well in few-shot learning. These models are pre-trained on a large language corpus and finetuned on a specific task to improve performance on that task. The authors propose an intuitive finetuning method called instruction tuning, which leverages the fact that pre-trained language models are already capable of extracting representations of natural language. Specifically, instruction tuning finetunes a language model on NLP tasks described with natural language instructions.

210906-1 Finetuning, prompting, and the proposed instruction tuning of a pre-trained language model
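As a concrete illustration, instruction tuning recasts an ordinary supervised example as plain text that states the task in natural language, so the model is finetuned on instruction-to-answer pairs. Below is a minimal Python sketch; the instruction wording and the helper to_instruction_example are made up for illustration, not taken from the paper's templates.

```python
# Minimal sketch (not the authors' code) of recasting a supervised QA example
# as a natural-language instruction for instruction tuning.

def to_instruction_example(context: str, question: str, answer: str) -> dict:
    """Turn a QA example into an (input, target) pair of plain text."""
    instruction = (
        "Answer the question based on the passage.\n\n"
        f"Passage: {context}\n"
        f"Question: {question}"
    )
    return {"input": instruction, "target": answer}

example = to_instruction_example(
    context="The 2016 Summer Olympics took place in Rio de Janeiro, Brazil.",
    question="Where were the 2016 Summer Olympics held?",
    answer="Rio de Janeiro",
)
print(example["input"])
print("->", example["target"])
```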

Teaching a language model in natural language to do better at language tasks sounds strange, but interesting.

Keypoints

Propose an instruction-based finetuning method for improving zero-shot task performance

To instruct the language model in natural language, the authors reformulated available natural language datasets into twelve task clusters.

210906-2 Sixty tasks in twelve clusters for instruction tuning

Sixty tasks are included in the twelve clusters, where a task is defined as a particular set of input-output pairs given by a dataset. When a task is used for zero-shot evaluation, all tasks belonging to its cluster are held out of instruction tuning, so the evaluation task is never seen during training. To describe the original tasks with more diversity, each task is further composed into ten templates, each designed by the authors to fit the objective of the corresponding task.

210906-3 Example of template composition for a natural language inference task
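The evaluation protocol can be sketched as follows; the cluster names, task names, and template wordings here are hypothetical stand-ins rather than the paper's actual lists. When evaluating zero-shot on a task, every task from its cluster is dropped from the instruction-tuning mixture, and each remaining task is rendered through one of its templates.

```python
import random

# Hypothetical sketch of the leave-one-cluster-out setup and per-task templates.
# Cluster contents and template wordings are illustrative only.

TASK_CLUSTERS = {
    "nli": ["rte", "cb", "anli"],
    "translation": ["wmt16_en_de", "wmt14_en_fr"],
    "summarization": ["cnn_dailymail", "xsum"],
}

TEMPLATES = {
    "nli": [
        "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
        "{premise}\nBased on the paragraph above, can we conclude that \"{hypothesis}\"?",
    ],
}

def training_tasks(eval_cluster: str) -> list[str]:
    """Tasks eligible for instruction tuning: everything outside the held-out cluster."""
    return [task for cluster, tasks in TASK_CLUSTERS.items()
            if cluster != eval_cluster for task in tasks]

def render(cluster: str, example: dict) -> str:
    """Render one example through a randomly chosen template of its cluster."""
    return random.choice(TEMPLATES[cluster]).format(**example)

print(training_tasks("nli"))  # NLI tasks are held out when evaluating on NLI
print(render("nli", {"premise": "A dog runs.", "hypothesis": "An animal moves."}))
```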

The model used for the instruction tuning experiments is named Base LM, a dense left-to-right, decoder-only transformer with 137B parameters. It is pre-trained on a collection of web documents, similar to GPT-3, but its training data is not as clean as GPT-3's. The pre-trained Base LM is then instruction tuned to obtain FLAN (Finetuned LAnguage Net).

Demonstrate the zero-shot performance of the instruction-tuned language model

FLAN (137B) is compared mostly with GPT-3 (175B) on zero-shot performance across various tasks. The following tables demonstrate that the instruction-tuned FLAN significantly outperforms GPT-3 in zero-shot NLI, question answering, reasoning, and translation.

Natural language inference

210906-4

Question-answering

210906-5

Reasoning

210906-6

Translation

210906-7
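For context on how these zero-shot comparisons work: the model receives only a natural-language instruction for the held-out task, with no in-context exemplars. The sketch below shows zero-shot prompting with the Hugging Face transformers API, using GPT-2 purely as a small stand-in model since the 137B FLAN weights are not assumed to be available.

```python
# Minimal sketch of zero-shot prompting, not the paper's evaluation code.
# GPT-2 stands in for the instruction-tuned model; the prompt is the task
# instruction alone, with no in-context exemplars.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "Translate the following sentence to German.\n"
    "Sentence: The house is wonderful.\n"
    "Translation:"
)
inputs = tok(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tok.eos_token_id,  # GPT-2 has no pad token; silence the warning
)
# Decode only the newly generated continuation.
print(tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

In a few-shot setting, solved exemplars of the same task would be prepended to this prompt; the zero-shot setting omits them entirely.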

Further ablation studies and training details can be found in the original paper.
