Significance
Keypoints
- Propose a transformer module which replaces dot-product attention with element-wise multiplication
- Demonstrate performance and efficiency of the proposed method
Review
Background
Transformers take dot products between sequences of query and key vectors to compute attention between every pair of items in the sequence.
This means that for a sequence of length $T$, the time and memory cost of attention grows quadratically with $T$.
Keypoints
Propose a transformer module which replaces dot-product attention with element-wise multiplication
The AFT is defined as:
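As a rough sketch, and assuming the formulation given in the AFT paper, the output at position $t$ is $Y_t = \sigma_q(Q_t) \odot \frac{\sum_{t'=1}^{T} \exp(K_{t'} + w_{t,t'}) \odot V_{t'}}{\sum_{t'=1}^{T} \exp(K_{t'} + w_{t,t'})}$, where $Q$, $K$, $V$ are linear projections of the input, $w$ is a learnable $T \times T$ matrix of pairwise position biases, and $\sigma_q$ is a sigmoid. The NumPy code below is only an illustration of that formula for a single head; the function names are hypothetical and the input projections are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_full(Q, K, V, w):
    """Illustrative AFT-full forward pass (single head, no projections).

    Q, K, V: arrays of shape (T, d).
    w: array of shape (T, T), pairwise position biases (assumed learnable).
    """
    # exp(K[t'] + w[t, t']) factorizes into exp(w[t, t']) * exp(K[t']),
    # so both sums over t' become matrix products.
    # The max shifts are for numerical stability and cancel in the ratio.
    exp_w = np.exp(w - w.max(axis=1, keepdims=True))   # (T, T)
    exp_k = np.exp(K - K.max(axis=0, keepdims=True))   # (T, d)
    num = exp_w @ (exp_k * V)                          # (T, d)
    den = exp_w @ exp_k                                # (T, d)
    # Element-wise gating by the query replaces the dot-product attention map.
    return sigmoid(Q) * num / den
```

Note that no $T \times T$ attention map per head is formed from queries and keys; the only $T \times T$ object is the position bias itself.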
Computational complexity of the proposed method and Transformer variants
The authors suggest variants of the AFT to achieve locality while further reducing the computational complexity.
AFT-local refers to the AFT variant in which the learnable position biases are applied only within a local window and set to zero elsewhere.
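A minimal sketch of the windowing behind AFT-local, assuming (as described in the AFT paper) that the position bias $w_{t,t'}$ is kept only when $|t - t'| < s$ and set to zero otherwise. The function name `local_position_bias` and the parameter name `s` are illustrative, not taken from the authors' code.

```python
import numpy as np

def local_position_bias(w, s):
    """Keep w[t, t'] inside a window of size s around the diagonal.

    w: array of shape (T, T), full pairwise position biases.
    s: window size; entries with |t - t'| >= s are set to zero.
    """
    T = w.shape[0]
    idx = np.arange(T)
    # Boolean mask that is True only where the two positions are close.
    mask = np.abs(idx[:, None] - idx[None, :]) < s
    return np.where(mask, w, 0.0)
```

Under this assumption, a zero bias still contributes a factor of exp(0) = 1, so positions outside the window are not masked out, and setting `s = 0` removes the position bias entirely, which corresponds to the AFT-simple variant.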
Demonstrate performance and efficiency of the proposed method
The performance and efficiency of the proposed method are demonstrated by experiments on image autoregressive modeling, language modeling, and image classification tasks.
For image autoregressive modeling, the negative log likelihood on the CIFAR-10 test set is compared with that of baseline methods.
Image autoregressive modeling result on CIFAR-10
It can be seen that the proposed AFT-local and AFT-simple achieve better performance at a lower computational cost.
Language modeling performance is evaluated with the Enwik8 dataset, with the same negative log likelihood metric.
Language modeling result on Enwik8
The test results show performance comparable to the Transformer, with faster computation.
Results of the ablation/comparative study on window size
Lastly, image classification performance on the ImageNet-1K dataset is evaluated.
Image classification result on ImageNet-1K
The proposed AFT variants achieve better performance and greater computational efficiency than the baseline models.
Another finding is that initializing some of the parameters of the AFT (value weight matrix, etc.) from the pre-trained DeiT improves image classification performance.