A block coordinate descent optimizer for classification problems exploiting convexity

Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and of the linear layer as providing an optimal fit of that basis to data. By alternating between a second-order method that finds globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer, not with the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden-layer architectures. Previous work applying this adaptive-basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and the present work can be viewed as an extension of that approach to classification problems. We first prove that the resulting Hessian matrix is symmetric positive semi-definite, and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point-cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained with NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting possible gains of second-order methods over gradient descent.
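The alternation described above can be sketched in a minimal NumPy example. This is a hypothetical illustration, not the authors' implementation: it uses binary classification with a logistic output so the convex linear-layer subproblem and its small Hessian are explicit, alternating a Newton solve for the output weights (basis fixed) with a gradient step on a single tanh hidden layer (output weights fixed). The toy two-blob data, layer sizes, step size, and ridge term are all illustrative choices.

```python
# Hypothetical sketch of a hybrid Newton/gradient alternation: the hidden
# layer provides an adaptive basis; a Newton method solves the convex
# cross-entropy problem in the linear-layer weights for that basis.
import numpy as np

rng = np.random.default_rng(0)

# Toy two-dimensional point-cloud data: two Gaussian blobs.
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)), rng.normal(1.0, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

# One hidden (basis) layer and a linear output layer.
W1 = rng.normal(0.0, 0.5, (2, 16)); b1 = np.zeros(16)
w2 = np.zeros(16); b2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))  # clip avoids overflow

for it in range(20):
    # --- Newton step: optimal linear layer for the current fixed basis ---
    Phi = np.tanh(X @ W1 + b1)                    # adaptive basis functions
    A = np.hstack([Phi, np.ones((len(X), 1))])    # absorb the output bias
    theta = np.append(w2, b2)
    for _ in range(10):                           # Newton iterations
        p = sigmoid(A @ theta)
        g = A.T @ (p - y) + 1e-4 * theta          # ridge-regularized gradient
        H = A.T @ (A * (p * (1 - p))[:, None]) + 1e-4 * np.eye(A.shape[1])
        theta -= np.linalg.solve(H, g)            # SPD Hessian -> unique step
    w2, b2 = theta[:-1], theta[-1]

    # --- Gradient step: adapt the hidden-layer basis, linear layer fixed ---
    Phi = np.tanh(X @ W1 + b1)
    p = sigmoid(Phi @ w2 + b2)
    delta = ((p - y)[:, None] * w2) * (1 - Phi**2)  # backprop through tanh
    W1 -= 0.1 * (X.T @ delta) / len(X)
    b1 -= 0.1 * delta.mean(axis=0)

# Training accuracy on the toy problem.
acc = np.mean((sigmoid(np.tanh(X @ W1 + b1) @ w2 + b2) > 0.5) == y)
```

Because the linear-layer subproblem is convex, the inner Newton loop converges to its global minimizer for the current basis, so the basis is optimally fit to the data after every outer iteration; a multi-class version would replace the logistic output with a softmax and a correspondingly larger (but still linear-layer-sized) Hessian.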