Efficient and High-Performance Transformer Design for Real-Time Applications on Mobile Devices
In recent years, the vision transformer (ViT) has shown promising results in computer vision tasks. However, due to their large number of parameters and model design, ViT-based models are usually several times slower than lightweight convolutional neural networks. This poses a particular challenge for real-time applications and resource-constrained hardware such as mobile devices. In this work, we propose an efficient transformer design that runs at MobileNet speed on mobile devices while achieving high accuracy. We first review the network architectures and operators used in ViT-based models and identify inefficient designs. Then, we introduce a dimension-consistent pure transformer as a design paradigm. We show that the proposed design smoothly combines hardware-friendly 4D blocks with powerful 3D multi-head self-attention (MHSA), yielding a significant reduction in inference latency.
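To make the dimension-consistent paradigm concrete, the following is a minimal NumPy sketch (not the paper's implementation): early stages operate on a hardware-friendly 4D layout (B, C, H, W), then the feature map is flattened once into a 3D layout (B, N, C) on which multi-head self-attention runs. The identity projections and shapes here are illustrative assumptions, not the actual model.

```python
import numpy as np

def mhsa_3d(x, num_heads=2):
    """Minimal multi-head self-attention on a 3D (B, N, C) tensor.
    Projections are identity for brevity; a real block learns Q/K/V weights."""
    B, N, C = x.shape
    d = C // num_heads
    # split channels into heads: (B, heads, N, d)
    q = k = v = x.reshape(B, N, num_heads, d).transpose(0, 2, 1, 3)
    # scaled dot-product attention with a numerically stable softmax
    attn = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)
    attn = np.exp(attn - attn.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    out = attn @ v  # (B, heads, N, d)
    # merge heads back to (B, N, C)
    return out.transpose(0, 2, 1, 3).reshape(B, N, C)

# 4D stage: conv-style (B, C, H, W) layout, efficient on mobile hardware
x4d = np.random.rand(1, 8, 4, 4)

# dimension-consistent transition: flatten spatial dims once, then stay 3D
B, C, H, W = x4d.shape
x3d = x4d.reshape(B, C, H * W).transpose(0, 2, 1)  # (B, N=H*W, C)

y = mhsa_3d(x3d)
print(y.shape)  # (1, 16, 8)
```

The key property the sketch highlights is that the costly 4D-to-3D reshape happens only once at the stage boundary, rather than repeatedly inside every block.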