Mobile-Former: Bridging MobileNet and Transformer
MobileNet and Transformer are bridged, rather than merged
ViTGAN: Training GANs with Vision Transformers
Attention is all you need for GAN discriminators too
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
A white paper for training your ViT
XCiT: Cross-Covariance Image Transformers
Self-attention across features performs better and faster for ViTs
Improved Transformer for High-Resolution GANs
Attention is all you need for GANs too
Scaling Vision Transformers
Scaling up vision transformers takes them higher
Anticipative Video Transformer
Action anticipation from video with transformers
Self-Supervised Learning with Swin Transformers
Swin-T + (MoCo + BYOL) = encouraging results
Multiscale Vision Transformers
CNNs have pooling layers. Why not ViTs?
LocalViT: Bringing Locality to Vision Transformers
Seamlessly merging the locality of CNNs into any ViT
Understanding Robustness of Transformers for Image Classification
Keep calm and use vision transformers
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Another improvement to vision-transformer-based models, with a theoretical rationale