Efficient and High-Performance Transformer Design for Real-Time Applications on Mobile Devices

In recent years, the vision transformer (ViT) has shown promising results on computer vision tasks. However, due to their large parameter counts and model design, ViT-based models are usually several times slower than lightweight convolutional neural networks. This poses a particular challenge for real-time applications and resource-constrained hardware such as mobile devices. In this work, we propose an efficient transformer design that runs at MobileNet speed on mobile devices while achieving high performance. We first revisit the network architectures and operators used in ViT-based models and identify inefficient designs. We then introduce a dimension-consistent pure transformer as a design paradigm. We show that the proposed design smoothly combines hardware-friendly 4D blocks with powerful 3D multi-head self-attention (MHSA), yielding a significant improvement in inference latency.
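To make the dimension-consistent idea concrete, below is a minimal PyTorch sketch, not the repository's actual implementation; the class names Conv4DBlock and MHSA3DBlock and all hyperparameters are illustrative assumptions. Early stages keep activations in hardware-friendly 4D (B, C, H, W) form and mix tokens with pooling and pointwise convolutions; a single reshape then produces 3D (B, N, C) tokens for standard MHSA, so no per-block reshaping is needed.

import torch
import torch.nn as nn

class Conv4DBlock(nn.Module):
    """4D stage: pooling token mixer + pointwise-conv MLP on (B, C, H, W)."""
    def __init__(self, dim: int):
        super().__init__()
        # Spatial-size-preserving pooling acts as a cheap token mixer.
        self.token_mixer = nn.AvgPool2d(3, stride=1, padding=1)
        # 1x1 convolutions play the role of the channel MLP.
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * 4, 1), nn.GELU(), nn.Conv2d(dim * 4, dim, 1)
        )

    def forward(self, x):
        x = x + self.token_mixer(x)  # residual token mixing, stays 4D
        x = x + self.mlp(x)          # residual channel mixing, stays 4D
        return x

class MHSA3DBlock(nn.Module):
    """3D stage: standard transformer block on flattened tokens (B, N, C)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global 3D MHSA
        x = x + self.mlp(self.norm2(x))
        return x

# Dimension-consistent stack: 4D conv stages, one flatten, then 3D MHSA stages.
x = torch.randn(1, 64, 56, 56)    # (B, C, H, W) feature map
x = Conv4DBlock(64)(x)            # remains 4D: no per-block reshapes
x = x.flatten(2).transpose(1, 2)  # single reshape to (B, N, C) tokens
x = MHSA3DBlock(64)(x)            # 3D MHSA on the token sequence
print(x.shape)                    # torch.Size([1, 3136, 64])

Because the reshape happens exactly once at the 4D-to-3D boundary, the convolutional stages avoid the repeated permute/reshape overhead that slows many hybrid designs on mobile hardware.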
File list

EfficientFormer-main (estimated ~2,000 files)
cpu_popcnt.c (1 KB)
cpu_avx512cd.c (779 B)
cpu_avx512_cnl.c (972 B)
cpu_ssse3.c (725 B)
cpu_avx512_knl.c (981 B)
cpu_avx512f.c (775 B)
cpu_avx512_skx.c (1 KB)
cpu_vxe.c (813 B)
cpu_avx512_knm.c (1 KB)
cpu_asimd.c (845 B)