In recent years, Vision Transformers (ViT) have shown promising results on computer vision tasks. However, due to their large parameter counts and model design, ViT-based models are generally several times slower than lightweight convolutional neural networks, which makes deployment in real-time applications particularly challenging, especially on resource-constrained hardware such as mobile devices. In this work, we propose an efficient transformer design that runs at MobileNet speed on mobile devices while maintaining high accuracy. We first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then, we introduce a dimension-consistent pure transformer as a design paradigm. We show that the proposed design seamlessly combines hardware-friendly 4D blocks with powerful 3D multi-head self-attention (MHSA), yielding a significant reduction in inference latency.
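To make the dimension-consistent idea concrete, the following is a minimal PyTorch sketch, not the paper's implementation: early stages keep hardware-friendly 4D (B, C, H, W) tensors and mix tokens with a cheap pooling operator, while the final stage flattens the feature map to 3D (B, N, C) tokens and applies standard MHSA. The module names (Block4D, Block3D, ToyNet) and the specific pooling mixer are illustrative assumptions.

```python
# Minimal sketch (hypothetical, not the authors' code) of a
# dimension-consistent design: 4D conv/pool blocks followed by
# 3D MHSA blocks, with a single reshape between the two regimes.
import torch
import torch.nn as nn

class Block4D(nn.Module):
    """4D block: pooling token mixer + pointwise-conv MLP on (B, C, H, W)."""
    def __init__(self, dim):
        super().__init__()
        self.mixer = nn.AvgPool2d(3, stride=1, padding=1)  # cheap spatial mixing
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * 4, 1), nn.GELU(), nn.Conv2d(dim * 4, dim, 1)
        )

    def forward(self, x):
        x = x + self.mixer(x)   # residual pooling mixer
        return x + self.mlp(x)

class Block3D(nn.Module):
    """3D block: MHSA + MLP on flattened (B, N, C) tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class ToyNet(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, 4, stride=4)   # patchify into a 4D feature map
        self.stage4d = nn.Sequential(Block4D(dim), Block4D(dim))
        self.stage3d = Block3D(dim)

    def forward(self, x):
        x = self.stage4d(self.stem(x))               # stays (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)        # one reshape -> (B, N, C)
        return self.stage3d(tokens)

out = ToyNet()(torch.randn(1, 3, 64, 64))            # -> shape (1, 256, 64)
```

The point of the sketch is that expensive tensor-layout changes happen exactly once: the 4D stages run entirely on convolution-friendly layouts, and only the attention stage pays the cost of flattening to token form.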