Objective-Based Hierarchical Clustering of Deep Embedding Vectors

We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. The study covers a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g., ResNet, ResNext, Inception V3, SBERT). It includes datasets with up to $4.5$ million entries and embedding dimensions up to $2048$. To address the challenge of scaling hierarchical clustering to such large datasets, we propose a new practical hierarchical clustering algorithm, B++&C. Compared to a wide range of classic methods and recent heuristics, it improves the normalized Moseley-Wang (MW) and Cohen-Addad et al. (CKMM) objectives by 5% and 20% on average, respectively. We also introduce a theoretical algorithm, B2SAT&C, which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best polynomial-time approximation, $\approx 2/3 + 0.0004$, was due to Charikar et al. (SODA'19).
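To make the $2/3$ baseline concrete, the following is a minimal Python sketch, not the authors' B++&C or B2SAT&C. It assumes the form of the CKMM objective commonly stated in the literature (the abstract itself does not define it): for pairwise dissimilarities $d(i,j)$, a tree scores $\sum_{i<j} d(i,j)\,|\mathrm{leaves}(\mathrm{lca}(i,j))|$, which is at most $n \sum_{i<j} d(i,j)$. The function names `random_binary_tree`, `ckmm_objective`, and `mean_ratio` are illustrative and do not come from the paper.

```python
import random


def random_binary_tree(items, rng):
    """Recursively split items at random; leaves are ints, internal nodes are (left, right)."""
    if len(items) == 1:
        return items[0]
    items = list(items)
    rng.shuffle(items)
    cut = rng.randint(1, len(items) - 1)  # both sides stay non-empty
    return (random_binary_tree(items[:cut], rng),
            random_binary_tree(items[cut:], rng))


def leaves(tree):
    """Return the list of leaf labels under a node."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def ckmm_objective(tree, d):
    """Assumed CKMM form: sum over pairs of d[i][j] times the leaf count under their LCA."""
    if not isinstance(tree, tuple):
        return 0.0
    left, right = leaves(tree[0]), leaves(tree[1])
    # Pairs separated at this node have this node as their lowest common ancestor.
    cross = sum(d[i][j] for i in left for j in right)
    size = len(left) + len(right)
    return size * cross + ckmm_objective(tree[0], d) + ckmm_objective(tree[1], d)


def mean_ratio(d, trials, seed=0):
    """Average objective of random trees divided by the upper bound n * sum of d."""
    n = len(d)
    upper = n * sum(d[i][j] for i in range(n) for j in range(i + 1, n))
    rng = random.Random(seed)
    total = sum(ckmm_objective(random_binary_tree(range(n), rng), d)
                for _ in range(trials))
    return total / (trials * upper)
```

Under this splitting rule, any third point lies below the lowest common ancestor of a given pair with probability $2/3$ by symmetry, so the expected ratio is $(2 + \tfrac{2}{3}(n-2))/n \ge 2/3$ regardless of the dissimilarities. Calling `mean_ratio(d, trials=2000)` on a small symmetric matrix `d` should return a value close to that figure.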
