Predicting Training Time Without Training

We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function. To do so, we leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model. This allows us to approximate the training loss and accuracy at any point during training by solving a low-dimensional Stochastic Differential Equation (SDE) in function space. Using this result, we are able to predict the time it takes for Stochastic Gradient Descent (SGD) to fine-tune a model to a given loss without having to perform any training. In our experiments, we are able to predict the training time of a ResNet within a 20% error margin on a variety of datasets and hyper-parameters, at a 30- to 45-fold reduction in cost compared to actual training. We also discuss how to further reduce the computational and memory cost of our method; in particular, we show that by exploiting the spectral properties of the gradients' matrix it is possible to predict training time on a large dataset while processing only a subset of the samples.
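The mechanism behind this kind of prediction is easiest to see in the full-batch, mean-squared-error case, where the dynamics of the linearized model have a closed form: the training residual contracts independently along each eigenvector of the Gram matrix of the gradients. The sketch below illustrates only that simplified case; the paper's actual method solves an SDE that additionally accounts for SGD noise. The function `predict_steps_to_loss` and its arguments are illustrative names, not an interface from the paper.

```python
import numpy as np

def predict_steps_to_loss(J, residual, lr, target_loss, max_steps=100_000):
    """Predict how many full-batch gradient-descent steps a linearized
    model needs to bring the MSE training loss down to `target_loss`.

    J        : (n, p) Jacobian of the network outputs w.r.t. the weights,
               computed once at the pre-trained initialization.
    residual : (n,) initial error f(x; w0) - y on the training set.
    lr       : learning rate (needs lr < 2 / lambda_max for convergence).
    """
    # Under f(w) ~ f(w0) + J (w - w0), gradient descent on the MSE loss
    # decouples into independent modes along the eigenvectors of J J^T:
    # mode i of the residual contracts by (1 - lr * lambda_i) per step.
    theta = J @ J.T                        # empirical Gram (NTK) matrix, (n, n)
    lam, V = np.linalg.eigh(theta)         # eigenvalues and eigenvectors
    r0 = V.T @ residual                    # residual in the eigenbasis

    for t in range(max_steps + 1):
        decay = (1.0 - lr * lam) ** t      # per-mode contraction after t steps
        loss = np.mean((decay * r0) ** 2)  # MSE; norm preserved by orthonormal V
        if loss <= target_loss:
            return t                       # no training was performed
    return None                            # target not reached within max_steps
```

Working in the eigenbasis of the Gram matrix also suggests why the subsampling result is plausible: if the loss trajectory is dominated by the top few eigenvalues of J J^T, those can be estimated from a subset of the samples, which appears to be the spectral shortcut the abstract alludes to.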

