Multiscale Vision Transformers (MViT)
Facebook AI has built Multiscale Vision Transformers (MViT), a Transformer architecture for representation learning from visual data such as images and videos. It’s a family of visual recognition models that incorporate the seminal concept of hierarchical representations into the powerful Transformer architecture. MViT is the first such system that can train entirely from scratch on a video recognition data set (like Kinetics-400) and achieve state-of-the-art performance across a variety of transfer learning tasks, such as video classification and human action localization.
The central advance of MViT is a spatiotemporal feature hierarchy built into the Transformer backbone. Typical Vision Transformer models keep a constant resolution and feature dimension throughout all layers and use an attention mechanism to determine which tokens each position should focus on. In MViT, we replace that mechanism with pooling attention, which pools the projected query, key, and value vectors and thereby reduces the visual resolution. We couple this with an increasing channel dimension to construct a hierarchy that runs from simple features at high visual resolution to more complex, high-dimensional features at low resolution.
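To make the idea of pooling attention concrete, here is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not the MViT implementation: it treats the input as a 1D token sequence rather than a space-time grid, uses non-overlapping average pooling as a stand-in for whatever pooling operator the model actually uses, and omits multiple heads, residual connections, and channel expansion. The names `pooling_attention`, `avg_pool`, `q_stride`, and `kv_stride` are hypothetical and chosen for this example only.

```python
import math


def linear(x, w):
    """Project each token (a row of x) with a weight matrix w of shape D_in x D_out."""
    return [
        [sum(xi * w[i][j] for i, xi in enumerate(row)) for j in range(len(w[0]))]
        for row in x
    ]


def avg_pool(x, stride):
    """Average non-overlapping windows of `stride` tokens along the sequence axis."""
    return [
        [sum(col) / len(col) for col in zip(*x[i:i + stride])]
        for i in range(0, len(x), stride)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pooling_attention(x, wq, wk, wv, q_stride, kv_stride):
    """Single-head attention where Q, K, and V are pooled after projection.

    Pooling the queries shortens the output sequence (lower resolution);
    pooling the keys and values shrinks the attention matrix.
    """
    q = avg_pool(linear(x, wq), q_stride)
    k = avg_pool(linear(x, wk), kv_stride)
    v = avg_pool(linear(x, wv), kv_stride)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        out.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        )
    return out


# Toy input (assumption): 8 tokens of dimension 4, identity projections.
tokens = [[float(t + d) for d in range(4)] for t in range(8)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
out = pooling_attention(tokens, identity, identity, identity, q_stride=2, kv_stride=4)
print(len(tokens), "->", len(out), "tokens of dim", len(out[0]))
```

With a query stride of 2, the 8 input tokens become 4 output tokens, which is the resolution reduction described above. In this sketch the feature dimension stays fixed; in a hierarchical model the channel dimension would be increased alongside each such reduction.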
MViT marks a significant improvement over prior attempts at video understanding with Transformers, which require computationally expensive pretraining on massive data sets (such as ImageNet-21K), are extremely parameter-dense, and need multistep training schemes. In contrast, MViT trains from scratch in a single step with no external pretraining. It also significantly improves state-of-the-art performance across well-studied recognition benchmarks such as ImageNet, Kinetics-400, Kinetics-600, and AVA.
Further, MViT models demonstrate a superior understanding of temporal cues without falling back on spurious spatial biases, a common pitfall of prior methods. Though much more work is needed, the advances enabled by MViT could significantly improve detailed human action understanding, which is a crucial component of real-world AI applications such as robotics and autonomous vehicles. In addition, innovations in video recognition architectures are an essential component of robust, safe, and human-centric AI.