Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations

In this study, we propose global context guided channel and time-frequency transformations to model the long-range, non-local time-frequency dependencies and channel variances in speaker representations. We use global context information to enhance important channels and to recalibrate salient time-frequency locations by computing the similarity between the global context and local features. The proposed modules, together with a popular ResNet-based model, are evaluated on the VoxCeleb1 dataset, a large-scale speaker verification corpus collected in the wild. The lightweight block can be incorporated into a CNN model at little additional computational cost, and it improves speaker verification performance by a large margin over both the baseline ResNet-LDE model and the Squeeze&Excitation block. Detailed ablation studies are also performed to analyze the factors that may affect the performance of the proposed modules. With the proposed L2-tf-GTFC transformation block, the Equal Error Rate decreases from 4.56% to 3.07%, a relative reduction of 32.68%, and the DCF score improves by a relative 27.28%. The results indicate that the proposed global context guided transformation modules efficiently improve the learned speaker representations through time-frequency and channel-wise feature recalibration.
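The abstract describes the recalibration only in words, so the following is a minimal NumPy sketch of the general idea rather than the authors' module. Everything concrete in it is an assumption made for illustration: the function name `global_context_recalibrate`, the use of global average pooling as the global context, the parameter-free sigmoid gates, and the cosine (L2-normalized) similarity between the context vector and each local feature vector. The actual blocks presumably contain learned layers that are not specified here.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here to squash gate values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def global_context_recalibrate(features, eps=1e-8):
    """Illustrative recalibration of a CNN feature map (hypothetical helper).

    features: array of shape (C, F, T) = (channels, frequency bins, frames).
    Returns an array of the same shape.
    """
    num_channels, num_freq, num_frames = features.shape

    # Assumed global context: one value per channel, averaged over all
    # time-frequency locations.
    context = features.mean(axis=(1, 2))  # shape (C,)

    # Channel gate: channels with a larger context value are scaled down less.
    # No learned weights are used in this sketch.
    channel_gate = sigmoid(context)  # shape (C,)

    # Time-frequency gate: cosine similarity between the global context
    # vector and the C-dimensional local feature at each (f, t) location.
    local = features.reshape(num_channels, num_freq * num_frames)
    local_unit = local / (np.linalg.norm(local, axis=0, keepdims=True) + eps)
    context_unit = context / (np.linalg.norm(context) + eps)
    similarity = context_unit @ local_unit  # shape (F * T,)
    tf_gate = sigmoid(similarity).reshape(1, num_freq, num_frames)

    # Apply both gates by broadcasting over the feature map.
    return features * channel_gate[:, None, None] * tf_gate
```

Calling `global_context_recalibrate` on an array of shape `(C, F, T)` returns an array of the same shape. Because both gates lie strictly between 0 and 1, this sketch can only attenuate feature values, so locations that resemble the global context are attenuated least.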
