SMYRF - Efficient Attention using Asymmetric Clustering

We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from $O(N^2)$ to $O(N \log N)$, where $N$ is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. In contrast, prior fast attention methods impose constraints (e.g. tight queries and keys) and require re-training from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and report significant memory and speed benefits. Notably, SMYRF-BERT slightly outperforms BERT on GLUE, while using 50% less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention in high resolutions. Using a single TPU, we train BigGAN on Celeba-HQ with attention at resolutions 128x128 and 256x256, capable of generating realistic human faces.
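To make the idea concrete, here is a minimal sketch of cluster-restricted attention. It is not the authors' SMYRF implementation: the paper's asymmetric LSH transformations and adaptive balancing scheme are replaced by a simple signed random-projection hash, and the function name `lsh_clustered_attention` and the parameters `n_hashes` and `cluster_size` are illustrative. The sketch only shows the core mechanism, hashing queries and keys, sorting by hash, cutting the sequence into equal-size clusters, and running dense attention within each cluster, which is what brings the cost down from $O(N^2)$ to roughly $O(N \cdot \text{cluster\_size})$.

```python
# Minimal sketch of LSH-clustered attention (illustrative, not the SMYRF code).
import math
import torch

def lsh_clustered_attention(q, k, v, n_hashes=4, cluster_size=64):
    """q, k, v: (batch, seq_len, dim). seq_len must be divisible by cluster_size."""
    b, n, d = q.shape
    assert n % cluster_size == 0, "sequence length must be divisible by cluster_size"

    # Signed random projections as a stand-in for the paper's asymmetric
    # LSH transformations; summing the signs gives one scalar hash per position.
    proj = torch.randn(d, n_hashes, device=q.device)
    q_hash = (q @ proj).sign().sum(-1)            # (b, n)
    k_hash = (k @ proj).sign().sum(-1)

    # Sort positions by hash value, then cut into equal-size (balanced) clusters.
    q_idx = q_hash.argsort(dim=-1)                # (b, n)
    k_idx = k_hash.argsort(dim=-1)
    n_clusters = n // cluster_size

    gather = lambda x, idx: x.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
    q_sorted = gather(q, q_idx).view(b, n_clusters, cluster_size, d)
    k_sorted = gather(k, k_idx).view(b, n_clusters, cluster_size, d)
    v_sorted = gather(v, k_idx).view(b, n_clusters, cluster_size, d)

    # Dense attention inside each cluster only.
    scores = q_sorted @ k_sorted.transpose(-1, -2) / math.sqrt(d)
    out_sorted = (scores.softmax(dim=-1) @ v_sorted).view(b, n, d)

    # Scatter results back to the original query order.
    out = torch.empty_like(out_sorted)
    out.scatter_(1, q_idx.unsqueeze(-1).expand(-1, -1, d), out_sorted)
    return out
```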
