[2106.07631] Improved Transformer for High-Resolution GANs

Significance

Attention is all you need for GANs too

Keypoints

  • Propose a self-attention generator for high-resolution image generation with adversarial training
  • Demonstrate the performance of HiT as a generator/decoder

Review

Background

Transformer models that rely on self-attention are being widely adopted for computer vision tasks. However, applying the Transformer to image generation with adversarial training still lags behind this trend. The main challenges of applying self-attention to GANs lie in (i) the quadratic scaling of computational complexity with the number of pixels, and (ii) a higher demand for spatial coherency. The paper addresses these issues by introducing an efficient generator architecture built on self-attention and self-modulation.
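To make the first challenge concrete, here is a rough back-of-the-envelope estimate of the cost of global self-attention over all pixels; the head dimension is an assumed value for illustration, not a number from the paper.

```python
# Rough cost of full (global) self-attention over all pixels of a feature map.
# The head dimension below is an assumed value for illustration only.
def attention_cost(height, width, dim=64):
    n = height * width                 # one token per pixel
    entries = n * n                    # size of the N x N attention matrix
    flops = 2 * n * n * dim            # QK^T plus the attention-weighted sum of V
    return entries, flops

for res in (32, 128, 1024):
    entries, flops = attention_cost(res, res)
    print(f"{res}x{res}: {entries:.3e} attention entries, ~{flops:.3e} FLOPs per head")
# Every doubling of resolution multiplies the N^2 term by 16, so global
# attention at 1024x1024 (N ~ 10^6 tokens) is far beyond practical budgets.
```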

Keypoints

Propose a self-attention generator for high-resolution image generation with adversarial training

The authors propose HiT, a GAN generator without any convolution operation. To reduce computational cost, HiT includes self-attention only in the earlier layers, which are responsible for the low-resolution stages. The efficiency of self-attention is further improved by the multi-axis blocked self-attention, which is similar to the Axial Transformer operation but operates on block-split input features.
210615-1 Multi-axis blocked self-attention
Attention is applied along two axes, within each block and across blocks, which can be thought of as regional and dilated attention, respectively.
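A minimal NumPy sketch of the idea is given below, assuming square blocks and single-head attention. Note that the paper splits the attention heads between the two axes and runs them in parallel, whereas here the two branches are simply summed for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # q, k, v: (..., n, d); standard scaled dot-product attention over n tokens.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def multi_axis_blocked_attention(x, block=4):
    # x: (H, W, C) feature map; block: side length of each square block.
    H, W, C = x.shape
    gh, gw = H // block, W // block
    # Split the map into a (gh x gw) grid of (block x block) blocks.
    b = x.reshape(gh, block, gw, block, C).transpose(0, 2, 1, 3, 4)
    b = b.reshape(gh * gw, block * block, C)

    # Regional axis: attention among the pixels inside each block.
    regional = attend(b, b, b)

    # Dilated axis: attention among pixels sharing the same position
    # across different blocks (attention over the grid of blocks).
    p = b.transpose(1, 0, 2)                      # (block*block, gh*gw, C)
    dilated = attend(p, p, p).transpose(1, 0, 2)  # back to (gh*gw, block*block, C)

    out = regional + dilated                      # the paper splits heads instead
    out = out.reshape(gh, gw, block, block, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C)

x = np.random.randn(8, 8, 16)
print(multi_axis_blocked_attention(x).shape)      # (8, 8, 16)
```

Because each attention matrix covers only one block (or one grid of block positions), the cost grows linearly with the number of pixels rather than quadratically.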

The later layers of HiT do not include self-attention; they use only multi-layer perceptrons (MLPs) with linear complexity, which further reduces the computational cost. This is based on the assumption that spatial dependencies are already modeled in the earlier low-resolution stages of the generator.
210615-2 Schematic illustration of the proposed HiT
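As a rough illustration of why such a stage is cheap, a per-pixel (channel-mixing) MLP block with a residual connection is sketched below; the hidden width, normalization, and activation are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pixelwise_mlp_block(x, w1, w2):
    # x: (H, W, C); w1: (C, hidden); w2: (hidden, C).
    # Every pixel is transformed independently, so the cost is linear in H*W
    # instead of quadratic as in global self-attention.
    h = np.maximum(layer_norm(x) @ w1, 0.0)    # ReLU here; a GELU-like unit in practice
    return x + h @ w2                          # residual connection

x = np.random.randn(64, 64, 32)
w1 = np.random.randn(32, 128) * 0.02
w2 = np.random.randn(128, 32) * 0.02
print(pixelwise_mlp_block(x, w1, w2).shape)    # (64, 64, 32)
```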

The last component of HiT is cross-attention for self-modulation. This improves the global information flow by letting the intermediate features directly attend to the input latent.
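A minimal sketch of this modulation path is shown below, assuming the latent code is reshaped into a few tokens of the same channel width as the features; the projection matrices and the residual form are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def cross_attention_modulation(features, latent, wq, wk, wv):
    # features: (H*W, C) intermediate activations, used as queries.
    # latent:   (L, C)   tokens derived from the input latent code, used as keys/values.
    q, k, v = features @ wq, latent @ wk, latent @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return features + attn @ v                 # the latent directly modulates every pixel

feats = np.random.randn(16 * 16, 32)
z = np.random.randn(4, 32)
w = [np.random.randn(32, 32) * 0.02 for _ in range(3)]
print(cross_attention_modulation(feats, z, *w).shape)   # (256, 32)
```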

Demonstrate the performance of HiT as a generator/decoder

HiT can be used not only as the generator of a GAN but also as the decoder of a VAE. The performance of unconditional image generation with HiT as a GAN generator is first evaluated on the ImageNet 128$\times$128 dataset.
210615-3 Performance of unconditional image generation on ImageNet 128$\times$128
210615-5 Exemplar images of unconditional image generation from HiT
The reconstruction FID of HiT as the decoder of a VQ-VAE is also reported on the ImageNet 256$\times$256 dataset.
210615-4 Performance of image reconstruction on ImageNet 256$\times$256
HiT achieves better FID scores than its CNN counterparts both as a generator and as a decoder.

Higher-resolution (256$\times$256 and 1024$\times$1024) image generation is evaluated on the CelebA-HQ and FFHQ datasets. The results demonstrate that HiT obtains state-of-the-art FID scores at a resolution of 256$\times$256. However, HiT falls slightly behind StyleGAN2 in terms of FID at the higher 1024$\times$1024 resolution.
210615-6 Quantitative results of the HiT on CelebA-HQ and FFHQ datasets
210615-7 Qualitative results of the HiT on CelebA-HQ dataset

For ablation studies, throughput comparisons, and the evaluation of regularization effects, the reader is referred to the original paper.

