[2103.10697] ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Significance

Another improvement to vision-transformer-based models, with a theoretical rationale for its design

Keypoints

  • Propose a self-attention layer that can account for locality
  • Show the sample efficiency of the model
  • Investigate the importance and evolution of locality in vision transformers

Review

Background

Replacing the convolution layer with the self-attention layer (as used in the Transformer encoder) for computer vision tasks has been an active research topic since the introduction of the Transformer. The Vision Transformer (ViT) uses self-attention layers without any convolution operation and achieves performance comparable to the SOTA for the image classification task. Attempts are being made to further improve ViT-based models, and this work claims that sample efficiency is achieved by softly injecting the locality inductive bias (as in convolution layers) into the self-attention layer, which is inherently non-local.

Keypoints

Propose a self-attention layer that can account for locality

The Positional Self-Attention (PSA) layer is defined as follows: \begin{equation*} \boldsymbol{A}^{h} _ {ij} := \texttt{softmax} (\boldsymbol{Q}^{h} _ {i} \boldsymbol{K}^{h\top} _ {j} + \boldsymbol{v}^{h\top} _ {pos}\boldsymbol{r} _ {ij}) \end{equation*} where $\boldsymbol{Q}^{h} _ {i}$ and $\boldsymbol{K}^{h} _ {j}$ are the query and key embeddings of patches $i$ and $j$ for head $h$, $\boldsymbol{r} _ {ij}$ is the relative positional encoding of the two patches, and $\boldsymbol{v}^{h} _ {pos}$ is a learned positional vector.
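
Below is a minimal sketch of how this attention map could be computed for a single head; the tensor names and shapes are illustrative assumptions rather than the authors' implementation, and the usual $1/\sqrt{d}$ scaling of the content term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def psa_attention(q, k, rel_pos, v_pos):
    """Attention map of a single PSA head (illustrative sketch, not the authors' code).

    q, k:    (num_patches, d_head)              query / key projections of the patches
    rel_pos: (num_patches, num_patches, d_pos)  relative positional encodings r_ij
    v_pos:   (d_pos,)                           learned positional vector of this head
    """
    content = q @ k.transpose(-2, -1)             # content term:    Q_i K_j^T
    position = (rel_pos * v_pos).sum(dim=-1)      # positional term: v_pos^T r_ij
    return F.softmax(content + position, dim=-1)  # normalize over patches j

# Example with 4 patches, 8-dim heads, 3-dim positional encodings (hypothetical sizes)
q, k = torch.randn(4, 8), torch.randn(4, 8)
rel_pos, v_pos = torch.randn(4, 4, 3), torch.randn(3)
attn = psa_attention(q, k, rel_pos, v_pos)  # (4, 4), rows sum to 1
```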

The authors note that summing the two terms on the right-hand side of the equation can lead to the smaller one being effectively ignored because of the softmax function. The two terms respectively represent the content and the position of the image patches, and it is better not to ignore either one.

Figure: Proposed ConViT and GPSA

The Gated Positional Self-Attention (GPSA) layer is proposed to address this issue: it forms a linear interpolation of the content and positional attentions after the softmax function. To inject the locality inductive bias at the beginning of training, the GPSA layer is initialized to mimic a convolution layer based on a previous finding. Finally, the proposed ConViT consists of 10 GPSA blocks followed by 2 blocks of the conventional self-attention layer, and progressively gains non-locality throughout training.
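
For reference, the gated form described above can be written as follows (following the paper's notation, where $\sigma$ is the sigmoid and $\lambda _ {h}$ is a learnable per-head gating parameter): \begin{equation*} \boldsymbol{A}^{h} _ {ij} := (1 - \sigma(\lambda _ {h})) \, \texttt{softmax} (\boldsymbol{Q}^{h} _ {i} \boldsymbol{K}^{h\top} _ {j}) + \sigma(\lambda _ {h}) \, \texttt{softmax} (\boldsymbol{v}^{h\top} _ {pos}\boldsymbol{r} _ {ij}) \end{equation*} The gate $\sigma(\lambda _ {h})$ controls how much each head attends based on position versus content, and it is initialized so that the positional (convolution-like) term dominates at the beginning of training.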

Show the sample efficiency of the model

Figure: Sample and parameter efficiency of ConViT

The authors show that the proposed ConViT has an advantage over DeiT in both sample and parameter efficiency. The paper suggests that this efficiency may come from the convolutional initialization of the GPSA layers.

Investigate the importance and evolution of locality in vision transformers

How the (non-)locality of ConViT evolves throughout training, and how much locality matters for classification performance, are presented as experimental results.

Figure: Tendency of non-locality in ConViT

Figure: Relationship between locality and performance
