MSAF: Multimodal Split Attention Fusion


Multimodal learning mimics the reasoning process of the human multi-sensory system, which is used to perceive the surrounding world. While making a prediction, the human brain tends to relate crucial cues from multiple sources of information. In this work, we propose a novel multimodal fusion module that learns to emphasize the more contributive features across all modalities. Specifically, the proposed Multimodal Split Attention Fusion (MSAF) module splits each modality into channel-wise equal feature blocks and creates a joint representation that is used to generate soft attention for each channel across the feature blocks. Further, the MSAF module is designed to be compatible with features of various spatial dimensions and sequence lengths, making it suitable for both CNNs and RNNs. Thus, MSAF can be easily added to fuse the features of any unimodal networks and to utilize existing pretrained unimodal model weights. To demonstrate the effectiveness of our fusion module, we design three multimodal networks with MSAF for emotion recognition, sentiment analysis, and action recognition tasks. Our approach achieves competitive results in each task and outperforms other application-specific networks and multimodal fusion benchmarks.
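The split-attention mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the learned excitation layers (replaced here by an identity mapping) and uses NumPy in place of a deep-learning framework, but it shows the block split, the joint representation, and the per-channel softmax attention across blocks.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def msaf_sketch(modalities, block_channels):
    """Illustrative MSAF-style fusion (simplified; no learned weights).

    modalities: list of arrays shaped (channels_m, spatial). Each modality
    is split channel-wise into equal blocks of `block_channels`; a joint
    representation is built by global-average pooling and summing all
    blocks; a softmax across blocks then yields per-channel soft attention
    that rescales each block.
    """
    # 1) Split each modality into channel-wise equal feature blocks.
    blocks = []
    for feat in modalities:
        c = feat.shape[0]
        assert c % block_channels == 0, "channels must divide evenly"
        blocks.extend(np.split(feat, c // block_channels, axis=0))

    # 2) Joint representation: global-average pool each block, then sum.
    pooled = np.stack([b.mean(axis=1) for b in blocks])  # (n_blocks, block_channels)
    joint = pooled.sum(axis=0)                           # (block_channels,)

    # 3) Soft attention for each channel across the feature blocks.
    #    (The paper uses learned fully connected layers here; this sketch
    #    simply broadcasts the joint signal back onto each block.)
    attn = softmax(pooled + joint, axis=0)               # sums to 1 over blocks

    # 4) Reweight each block channel-wise and regroup per modality.
    out, i = [], 0
    for feat in modalities:
        n = feat.shape[0] // block_channels
        scaled = [blocks[i + k] * attn[i + k][:, None] for k in range(n)]
        out.append(np.concatenate(scaled, axis=0))
        i += n
    return out
```

Because the attention is a softmax over blocks, the weights for a given channel sum to one across all blocks of all modalities, so the module redistributes emphasis between modalities rather than amplifying everything uniformly. For example, fusing two all-ones modalities of shapes (8, 5) and (4, 5) with `block_channels=4` yields three blocks, each weighted 1/3, and the outputs keep the input shapes.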

