SMYRF: Efficient Attention using Asymmetric Clustering

We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from $O(N^2)$ to $O(N \log N)$, where $N$ is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. In contrast, prior fast attention methods impose constraints (e.g. queries and keys must share the same vector representations) and require retraining from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and report significant memory and speed benefits. Notably, SMYRF-BERT slightly outperforms BERT on GLUE while using $50\%$ less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention at high resolutions. Using a single TPU, we were able to scale attention to 128x128=16k and 256x256=65k tokens on BigGAN on CelebA-HQ.
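The idea of combining asymmetric transformations, hashing, and balanced clusters can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the particular pair of asymmetric maps, the single random projection used as a hash, and the sort-and-split clustering are all assumptions chosen for illustration, and the sketch assumes equally many queries and keys with a count divisible by `cluster_size`.

```python
import numpy as np

def asymmetric_transform(Q, K):
    """Map queries and keys (with different maps) so that Euclidean
    distance between a mapped query and a mapped key tracks their inner
    product. Illustrative choice of maps, assumed for this sketch."""
    q_sq = (Q ** 2).sum(axis=1)
    k_sq = (K ** 2).sum(axis=1)
    M = q_sq.max() + k_sq.max()  # upper bound on every squared norm
    F = np.column_stack([Q, np.zeros(len(Q)), np.sqrt(M - q_sq)])
    G = np.column_stack([K, np.sqrt(M - k_sq), np.zeros(len(K))])
    # By construction: ||F_i - G_j||^2 = 2M - 2 * (Q_i . K_j)
    return F, G

def balanced_clusters(X, direction, cluster_size):
    """Sort points by a 1-D random projection (a simple LSH-style hash)
    and cut the sorted order into equal-size groups."""
    order = np.argsort(X @ direction, kind="stable")
    return order.reshape(-1, cluster_size)  # requires N % cluster_size == 0

def clustered_attention(Q, K, V, cluster_size, seed=0):
    """Softmax attention restricted to matching query/key clusters."""
    rng = np.random.default_rng(seed)
    F, G = asymmetric_transform(Q, K)
    direction = rng.standard_normal(F.shape[1])
    q_groups = balanced_clusters(F, direction, cluster_size)
    k_groups = balanced_clusters(G, direction, cluster_size)
    out = np.zeros((Q.shape[0], V.shape[1]))
    for q_idx, k_idx in zip(q_groups, k_groups):
        scores = Q[q_idx] @ K[k_idx].T                # cluster-local scores
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[q_idx] = weights @ V[k_idx]
    return out
```

In this sketch the sort costs $O(N \log N)$ and each cluster costs `cluster_size`² score evaluations, so with a fixed cluster size the attention step itself grows linearly in $N$. Setting `cluster_size` equal to $N$ recovers dense attention exactly, which gives a convenient sanity check.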
