[2111.06377] Masked Autoencoders Are Scalable Vision Learners

Significance

Fast representation learning with autoencoders by masked image reconstruction

Keypoints

  • Propose a method for training an image representation encoder by autoencoding masked images
  • Demonstrate speed and performance of representation learning with the proposed method

Review

Background

Self-supervised learning has greatly benefited the field of natural language processing by enabling models to learn representations from the input data itself. Masked and autoregressive language models such as BERT and GPT have led this self-supervised learning scheme, in which a portion of the input words is held out and the model is trained to recover them. Similar training schemes have been applied to computer vision models, but they usually exploit image augmentation rather than masking the image. This work adopts the mask-and-reconstruct scheme for the self-supervised training of computer vision models. The authors address the differences between natural language processing tasks and computer vision tasks, namely (i) the difference in model architecture, (ii) the difference in information density of the data, and (iii) the difference in the nature of the reconstruction task. It should be noted that the proposed method can be thought of as an extension of the denoising autoencoder (DAE), which has long been known for its capability of learning image representations in the encoded latent space.

Keypoints

Propose a method for training an image representation encoder by autoencoding masked images

211112-1 Schematic illustration of the proposed method

The method is very simple and straightforward. The authors employ ViT for both the encoder and the decoder, which addresses (i) the difference in model architecture. During training, the input image is randomly patch-masked at a prespecified ratio, and only the unmasked patches are fed to the encoder. The encoded latents are concatenated with mask tokens, and the decoder takes the combined sequence to reconstruct the original image. After training, only the pretrained encoder (which has learned to extract the image representation) is used for downstream tasks. Based on comparative studies, the authors report that the best practice is a mean squared error reconstruction objective with 75% of the input patches masked at uniformly random positions.
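Below is a minimal sketch of the masking-and-reconstruction forward pass described above, assuming PyTorch. The `encoder`, `decoder`, and `mask_token` arguments are hypothetical placeholders for ViT-style modules kept at a shared embedding dimension for simplicity; this is not the authors' implementation.

```python
# A minimal sketch of MAE-style masked autoencoding, assuming PyTorch.
# `encoder`, `decoder`, and `mask_token` are hypothetical placeholders
# for ViT-style modules, not the authors' code.
import torch
import torch.nn.functional as F


def random_masking(patch_tokens, mask_ratio=0.75):
    """Keep a uniformly random subset of patch tokens per sample and return
    the kept tokens plus the permutation needed to restore patch order."""
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patch_tokens.device)  # one score per patch
    shuffle = noise.argsort(dim=1)                         # random permutation
    restore = shuffle.argsort(dim=1)                       # inverse permutation
    keep_idx = shuffle[:, :n_keep]
    kept = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, restore, n_keep


def mae_loss(patch_tokens, target_patches, encoder, decoder, mask_token,
             mask_ratio=0.75):
    """Encode only the visible patches, pad with mask tokens, decode the full
    sequence, and compute the MSE reconstruction loss over all patches."""
    B, N, D = patch_tokens.shape
    kept, restore, n_keep = random_masking(patch_tokens, mask_ratio)
    latent = encoder(kept)                                 # runs on ~25% of tokens
    masks = mask_token.expand(B, N - n_keep, -1)           # learned (1, 1, D) token
    full = torch.cat([latent, masks], dim=1)
    full = torch.gather(full, 1, restore.unsqueeze(-1).expand(-1, -1, D))
    pred = decoder(full)                                   # per-patch pixel predictions
    return F.mse_loss(pred, target_patches)
```

After pre-training, the decoder and mask tokens are discarded and only the encoder is kept for downstream tasks.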

211112-2 Masking ratio of 75% shows a good tradeoff between performance and speed

211112-3 Comparative study results for the best training practices

Demonstrate speed and performance of representation learning with the proposed method

Since the encoder processes only 25% of the input image patches, computation speed and scalability are greatly improved compared to other self-supervised training methods. A comparative study of the fine-tuned model against baseline self-supervised pre-training methods demonstrates the exceptional performance and scalability of the proposed method.
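As a rough illustration of why this helps (my own back-of-the-envelope estimate, not a figure from the paper): the encoder's self-attention cost scales roughly quadratically with the number of visible tokens, while its MLP cost scales roughly linearly, so a 75% masking ratio cuts encoder compute to well under half.

```python
# Back-of-the-envelope estimate of the relative encoder cost at a 75% masking
# ratio. `attn_weight` is an assumed fraction of compute spent in self-attention;
# the numbers are illustrative, not measurements from the paper.
def encoder_cost_ratio(mask_ratio=0.75, attn_weight=0.3):
    visible = 1.0 - mask_ratio
    attn = visible ** 2          # self-attention scales ~quadratically in tokens
    mlp = visible                # MLP blocks scale ~linearly in tokens
    return attn_weight * attn + (1 - attn_weight) * mlp


print(f"encoder cost ≈ {encoder_cost_ratio():.2f}x of the full-image cost")  # ≈ 0.19x
```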

211112-4 Comparison with other self-supervised training methods of ViT

211112-5 Comparison with supervised pre-training results

211112-6 Comparison with MoCo-v3 with limited number of Transformer blocks (partial fine-tuning)

For further transfer learning results, please refer to the original paper. I would say the proposed method is very meaningful in that (i) it significantly improves the scalability and performance of ViT pre-training, and (ii) it generalizes a well-known previous method (DAE) with a simple approach.


