Can gradient clipping mitigate label noise?

Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum. This intuition has been made precise in a line of recent works, which show that suitable clipping can yield significantly faster convergence than vanilla gradient descent. In this paper, we propose a new lens for studying gradient clipping, namely, robustness: informally, one expects clipping to provide robustness to noise, since one does not overly trust any single sample. Surprisingly, we prove that for the common problem of label noise in classification, standard gradient clipping does not in general provide robustness. On the other hand, we show that a simple variant of gradient clipping is provably robust, and corresponds to suitably modifying the underlying loss function. This yields a simple, noise-robust alternative to the standard cross-entropy loss which performs well empirically.
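To make the contrast concrete, below is a minimal NumPy sketch of the two ideas for a softmax classifier: standard gradient clipping, which only rescales the overall gradient, versus a loss-level variant that caps the derivative of the cross-entropy with respect to the predicted probability (equivalently, linearising the loss below a threshold). The threshold `tau` and the specific linearisation are illustrative assumptions for this sketch, not necessarily the paper's exact construction.

```python
# Minimal sketch (NumPy): standard gradient clipping vs. a loss-level variant
# for cross-entropy on softmax logits. The threshold `tau` and the linearised
# loss below are illustrative assumptions, not the paper's exact formulation.
import numpy as np

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ce_grad(logits, y):
    """Gradient of the cross-entropy -log p_y with respect to the logits."""
    p = softmax(logits)
    g = p.copy()
    g[y] -= 1.0
    return g

def clip_by_norm(g, max_norm):
    """Standard gradient clipping: rescale g if its norm exceeds max_norm.
    Only the magnitude of the update changes; its direction is untouched."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

def linearised_ce_grad(logits, y, tau):
    """Loss-level variant: cap |d(-log p_y)/d p_y| = 1/p_y at tau, i.e. keep
    -log p_y for p_y >= 1/tau and continue linearly below that threshold.
    This bounds how strongly any single (possibly mislabelled) example pulls."""
    p = softmax(logits)
    scale = min(1.0 / p[y], tau)        # capped |dloss/dp_y|
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    # Chain rule through softmax: dp_y/dlogits = p_y * (onehot - p).
    return -scale * p[y] * (onehot - p)

# A confidently misclassified (e.g. mislabelled) example: the model assigns
# almost no probability mass to the labelled class y.
logits = np.array([8.0, 0.0, -2.0])
y = 2

g = ce_grad(logits, y)
print("unclipped CE grad      :", g)
print("norm-clipped grad      :", clip_by_norm(g, max_norm=1.0))
print("loss-level (tau=2) grad:", linearised_ce_grad(logits, y, tau=2.0))
```

On this example, norm clipping merely shrinks the gradient while keeping its direction, so the mislabelled sample still drives the update; the loss-level variant instead bounds the per-example influence, which is the sense in which modifying the loss (rather than clipping the aggregate gradient) can confer robustness.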
