### Significance

### Keypoints

- Propose a Transformer-based method for video action anticipation
- Demonstrate the performance of the proposed method

### Review

#### Background

Predicting future actions from a video requires not only understanding the current frame, but also modeling temporal context across frames. For example, a frame showing a plate of food with a fork suggests that the current action is eating, but anticipating that action before the frame appears requires reasoning over the preceding events. The authors propose a Transformer-based method for predicting future actions from video that learns to attend jointly across space and time, and show state-of-the-art performance.

#### Keypoints

##### Propose a Transformer-based method for video action anticipation

The authors propose the *Anticipative Video Transformer (AVT)*, which performs spatio-temporal attention with Transformers in two steps.
The first step is the backbone, where each image frame $\mathbf{X}_{t}$ of the video is encoded into a latent $\mathbf{z}_{t}$ with a Vision Transformer.
The second step is the head, where the sequence of encoded latents is processed with a masked causal Transformer decoder, as in GPT-2, to output the predicted future latents $\hat{\mathbf{z}}_{t}$.
*Schematic illustration of the proposed method*
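The key mechanism in the head is causal (masked) self-attention over the frame latents, so that each prediction $\hat{\mathbf{z}}_{t}$ depends only on frames up to $t$. Below is a minimal single-head NumPy sketch of this masking, not the authors' implementation (which uses a multi-head, multi-layer GPT-2-style decoder):

```python
import numpy as np

def causal_self_attention(Z, W_q, W_k, W_v):
    """Single-head masked (causal) self-attention over frame latents Z: (T, d).

    Each output row t attends only to inputs at positions <= t, so the
    prediction for the future cannot peek at future frames.
    """
    T, d = Z.shape
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    scores = Q @ K.T / np.sqrt(d)
    # Mask out strictly-future positions before the softmax.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # (T, d)
```

With this mask, the first output position can only attend to itself, so its output equals its own value vector.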

The AVT is trained with a supervised cross-entropy loss on the labelled future action $c_{T+1}$: \begin{align} \mathcal{L}_{\text{next}} = -\log \hat{\mathbf{y}}[c_{T+1}], \end{align} where $T$ is the length of the given video sequence. To encourage the model to learn anticipatory features, two auxiliary losses are added: the future feature matching loss \begin{align} \mathcal{L}_{\text{feat}} = \sum^{T-1}_{t=1} \lVert \hat{\mathbf{z}}_{t}-\mathbf{z}_{t+1} \rVert^{2}_{2}, \end{align} and the action class-level anticipative loss \begin{align} \mathcal{L}_{\text{cls}} = \sum^{T-1}_{t=1} \mathcal{L}^{t}_{\text{cls}}, \qquad \mathcal{L}^{t}_{\text{cls}}=\begin{cases} -\log \hat{\mathbf{y}}_{t}[c_{t+1}], \quad &\text{if } c_{t+1}\geq 0, \\ 0, &\text{otherwise}.\end{cases} \end{align} The final loss $\mathcal{L}$ is the sum of these three losses.
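The three losses can be sketched as follows in NumPy. This is a hedged illustration with assumed conventions: `y_hat[t]` holds softmax probabilities for the action at frame $t+1$, and `c[t]` is the corresponding label (negative when the intermediate frame is unlabelled), so the last entries correspond to $c_{T+1}$:

```python
import numpy as np

def avt_losses(y_hat, z_hat, z, c):
    """Sum of the three AVT training losses (illustrative sketch).

    y_hat: (T, C) softmax probabilities, row t predicting the action at t+1
    z_hat: (T, d) predicted future latents, z_hat[t] should match z[t+1]
    z:     (T, d) encoded frame latents
    c:     (T,) labels, c[t] = action class of frame t+1 (< 0 if unlabelled)
    """
    T = z.shape[0]
    # Next-action cross-entropy on the final prediction (label c_{T+1}).
    L_next = -np.log(y_hat[-1, c[-1]])
    # Future feature matching: squared L2 between z_hat_t and z_{t+1}.
    L_feat = sum(np.sum((z_hat[t] - z[t + 1]) ** 2) for t in range(T - 1))
    # Per-frame anticipative classification, skipping unlabelled frames.
    L_cls = sum(-np.log(y_hat[t, c[t]]) for t in range(T - 1) if c[t] >= 0)
    return L_next + L_feat + L_cls
```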

##### Demonstrate the performance of the proposed method

The proposed model is evaluated on four popular datasets: EpicKitchens-100 (EK100), EpicKitchens-55 (EK55), EGTEA Gaze+, and 50-Salads (50S).
The proposed method outperformed previous methods as well as the other submissions to the CVPR 2021 EK100 challenge.
*Performance of the proposed method on the EK100 dataset*
The AVT also outperformed all previous methods on the validation set and on the seen (S1) and unseen (S2) test sets of the EK55 dataset.
*Performance of the proposed method on the EK55 dataset*
AVT likewise achieved the best performance on the EGTEA Gaze+ and 50-Salads datasets.
*Performance of the proposed method on the EGTEA Gaze+ dataset*
*Performance of the proposed method on the 50-Salads dataset*

AVT can also anticipate over longer horizons by rolling out its predictions autoregressively: each predicted future latent is fed back as the next input, and the process is repeated.
*Long-term anticipation by rolling out predictions autoregressively*
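The rollout loop can be sketched generically; here `head` is any function mapping a sequence of latents to a sequence of predicted next-step latents (the decoder head in AVT's case, shown with a dummy stand-in below):

```python
import numpy as np

def rollout(head, Z, n_steps):
    """Longer-term anticipation by autoregressive rollout (sketch).

    head: maps an (L, d) latent sequence to (L, d) next-step predictions
    Z:    (T, d) latents of the observed frames
    Returns (n_steps, d) predicted future latents.
    """
    latents = list(Z)
    preds = []
    for _ in range(n_steps):
        z_hat = head(np.stack(latents))[-1]  # prediction for the next frame
        preds.append(z_hat)
        latents.append(z_hat)                # condition later steps on it
    return np.stack(preds)
```

A trivial identity head (whose "prediction" is just the last observed latent) shows the mechanics: every rolled-out step then repeats that latent.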

For other results, including the ablation study and the qualitative analysis of the learned attention, the reader is referred to the original paper.