[2111.07991] LiT: Zero-Shot Transfer with Locked-image Text Tuning

Significance

Image-text pre-training with a pre-trained image model enhances zero-shot transfer performance

Keypoints

  • Propose a method for image-text contrastive learning with a pre-trained image model
  • Demonstrate the performance gain of the proposed method through experiments

Review

Background

Contrastive learning usually refers to a training scheme that minimizes the latent-space distance between different views of the same input sample while maximizing the distance between different samples. A related approach has recently been proposed: minimize (or maximize) the latent-space distance between paired (or unpaired) image-text data encoded by separate text and image models. This contrastive image-text training has been shown to enable zero-shot image classification. In this paper, the authors propose contrastive-tuning, which adopts a heavily pre-trained image model in this contrastive image-text training scheme. The idea does not seem very novel, but the performance gain from contrastive-tuning is exceptional, as demonstrated by extensive comparative experiments.
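For concreteness, the contrastive image-text objective described above is typically implemented as a symmetric in-batch softmax over cosine similarities between image and text embeddings. Below is a minimal PyTorch sketch of such a loss, not the authors' implementation; the batch of paired embeddings and the temperature value are assumptions for illustration.

```python
# Minimal sketch of a CLIP-style symmetric image-text contrastive loss.
# Paired embeddings are pulled together, unpaired ones pushed apart,
# via an in-batch softmax over cosine similarities (shapes/temperature assumed).
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) embeddings from the two towers."""
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```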

Keypoints

Propose a method for image-text contrastive learning with a pre-trained image model

Two options exist for adopting a pre-trained image model. The first is to load the weights of the pre-trained model and keep them frozen during image-text contrastive learning (denoted L for locked). The second is to let the pre-trained model be fine-tuned during image-text contrastive learning (denoted U for unlocked). Previous methods trained both the image and text models from randomly initialized weights (denoted u).
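As a rough illustration of the three setups (not the paper's code), the sketch below builds an image tower in L, U, or u mode. It assumes a recent torchvision and ResNet-50 as the pre-trained image model; in the L setting only the text tower (and any projection head) would receive gradients during contrastive tuning.

```python
# Illustrative sketch of the L / U / u image-tower setups (assumptions:
# torchvision ResNet-50 as the pre-trained image model).
import torchvision

def build_image_tower(mode: str):
    if mode in ("L", "U"):
        # L and U start from pre-trained weights; u starts from random init.
        model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
    else:
        model = torchvision.models.resnet50(weights=None)

    if mode == "L":
        # Locked: freeze every parameter so only the text tower (and any
        # projection head) is updated during contrastive tuning.
        for p in model.parameters():
            p.requires_grad = False
        model.eval()  # also fix batch-norm running statistics
    return model
```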

211116-1 Lu, Uu, uu training schemes for image-text contrastive learning

The authors find that the Lu scheme (or LiT, short for Locked-image Text Tuning), which pairs a locked pre-trained image model with a randomly initialized text model, improves zero-shot transfer performance the most.

211116-3 Design choice comparison experiment results

Another important question is whether pre-trained text models can also help zero-shot transfer performance.

211116-4 Final performance for different training durations for all training combinations

As can be seen from the figure above, pre-trained text models did not provide a significant performance gain.

Demonstrate the performance gain of the proposed method through experiments

LiT-tuning is performed on the public CC12M and YFCC100m datasets, along with an in-house dataset of 4 billion image/alt-text pairs. The image models experimented with include ResNet, ViT, and MLP-Mixer. Performance is evaluated on zero-shot ImageNet classification and MSCOCO image-text retrieval tasks.
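For context, zero-shot classification with such a model works by embedding one text prompt per class and picking the class whose prompt is most similar to the image embedding. The sketch below assumes hypothetical `image_encoder`, `text_encoder`, and `tokenize` callables standing in for the tuned towers, and an illustrative prompt template.

```python
# Sketch of zero-shot classification with a contrastively tuned image/text
# encoder pair. `image_encoder`, `text_encoder`, and `tokenize` are
# placeholders, not a specific released API.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    # Embed one text prompt per class and normalize.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)   # (C, D)

    # Embed the image (assumed to be a (C, H, W) tensor) and normalize.
    img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # (1, D)

    # Cosine similarity against every class prompt; the highest score wins.
    scores = img_emb @ text_emb.t()                                   # (1, C)
    return class_names[scores.argmax(dim=-1).item()]
```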

The performance of the proposed LiT-tuning scheme is compared to baseline methods including CLIP and ALIGN.

211116-2 Quantitative performance of the proposed method compared to baseline methods

It can be seen that, on five out-of-distribution (OOD) test variants, the proposed method is significantly more robust than the baseline methods.

211116-5 Training on YFCC or private datasets both show significant performance gain over baseline methods

The authors performed extensive comparative/ablation studies to identify the best training scheme for the proposed method; the reader is referred to the original paper for details. These are not covered further in this review, since the most important point of this work can be summarized as: contrastive learning with a locked pre-trained image model and a randomly initialized text model significantly boosts zero-shot transfer performance.
