On estimating gaze by self-attention augmented convolutions

Estimation of 3D gaze is highly relevant to multiple fields, including but not limited to interactive systems, specialized human-computer interfaces, and behavioral research. Although deep learning methods have recently boosted the accuracy of appearance-based gaze estimation, there is still room for improvement in the network architectures for this particular task. We therefore propose a novel network architecture based on self-attention augmented convolutions to improve the quality of the features learned during the training of a shallower residual network. The rationale is that the self-attention mechanism can help a shallower network outperform deeper architectures by learning dependencies between distant regions in full-face images. This mechanism can also create better and more spatially-aware feature representations derived from the face and eye images before gaze regression. We call our framework ARes-gaze, which uses our Attention-augmented ResNet (ARes-14) as twin convolutional backbones. In our experiments, the framework decreased the average angular error by 2.38% compared to state-of-the-art methods on the MPIIFaceGaze data set, and ranked second on the EyeDiap data set. Notably, our proposed framework was the only one to reach high accuracy on both data sets simultaneously.
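For readers unfamiliar with the metric, the sketch below shows one common way to compute the angular error between a predicted and a ground-truth 3D gaze direction, and how a reduction between two mean errors could be expressed as a percentage. It is an illustration only, not the authors' evaluation code: the function names are hypothetical, and reading the reported 2.38% as a relative reduction is an assumption that the abstract does not state explicitly.

```python
import math


def angular_error_deg(pred, true):
    """Angle in degrees between two 3D gaze direction vectors (hypothetical helper)."""
    dot = sum(p * t for p, t in zip(pred, true))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in true))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(preds, trues):
    """Average angular error over paired predictions and ground-truth vectors."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)


def relative_decrease_percent(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent.

    Assumption: this is one possible reading of a percentage decrease in error.
    """
    return 100.0 * (baseline - new) / baseline
```

With this convention, identical gaze vectors give an error of 0 degrees and orthogonal vectors give 90 degrees. The inputs do not need to be unit length, since the function normalizes them.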
