Densely Guided Knowledge Distillation using Multiple Teacher Assistants
With the success of deep neural networks, knowledge distillation, which guides the learning of a small student network using a large teacher network, is being actively studied for model compression and transfer learning. However, few studies have addressed the poor learning of the student network when the student and teacher model sizes differ significantly. In this paper, we propose densely guided knowledge distillation using multiple teacher assistants that gradually decrease in model size to efficiently bridge the gap between the teacher and student networks. To stimulate more efficient learning of the student network, each teacher assistant guides every smaller teacher assistant, step by step. Specifically, when teaching a smaller teacher assistant at the next step, the existing larger teacher assistants from the previous steps are used together with the teacher network to increase the learning efficiency. Moreover, we design stochastic teaching where, for each mini-batch during training, a teacher or a teacher assistant is randomly dropped. This acts as a regularizer, like dropout, to improve the accuracy of the student network. Thus, the student can always learn rich distilled knowledge from multiple sources, ranging from the teacher to multiple teacher assistants. We verified the effectiveness of the proposed method on classification tasks using CIFAR-10, CIFAR-100, and Tiny ImageNet. We also achieved significant performance improvements with various backbone architectures, such as a simple stacked convolutional neural network, ResNet, and WideResNet.
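The following is a minimal sketch of the two ideas described in the abstract: a student guided by the teacher and all larger teacher assistants at once, and stochastic teaching that drops guides per mini-batch. The abstract does not state the loss, so this sketch assumes the standard temperature-scaled distillation loss (cross-entropy plus KL divergence); the function names, the `temperature`, `alpha`, and `drop_prob` parameters, and the rule of always keeping at least one guide are all hypothetical choices made for illustration, not the paper's implementation.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(logits, label):
    """Cross-entropy of the student's prediction against the hard label."""
    return -math.log(softmax(logits)[label])


def sample_guides(num_guides, drop_prob, rng):
    """Stochastic teaching (assumed form): drop each guide independently
    with probability drop_prob, but always keep at least one guide."""
    kept = [i for i in range(num_guides) if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.randrange(num_guides)]
    return kept


def dense_kd_loss(student_logits, guide_logits, label, kept,
                  temperature=4.0, alpha=0.5):
    """Assumed loss: (1 - alpha) * CE + alpha * sum of KD terms, one KD term
    per kept guide (the teacher and every larger teacher assistant)."""
    p_student = softmax(student_logits, temperature)
    kd = 0.0
    for i in kept:
        p_guide = softmax(guide_logits[i], temperature)
        # The T^2 factor is the usual rescaling for soft-target gradients.
        kd += kl_divergence(p_guide, p_student) * temperature ** 2
    return (1.0 - alpha) * cross_entropy(student_logits, label) + alpha * kd


# Toy usage for one sample: guide 0 is the teacher, guides 1 and 2 are
# progressively smaller teacher assistants (logits are made up).
rng = random.Random(0)
guides = [[4.0, 1.0, 0.5], [3.0, 1.2, 0.8], [2.0, 1.5, 1.0]]
student = [1.0, 0.9, 0.8]
kept = sample_guides(len(guides), drop_prob=0.5, rng=rng)
loss = dense_kd_loss(student, guides, label=0, kept=kept)
```

In a dense schedule, the same loss would be reused at each step: the first teacher assistant is trained with the teacher as its only guide, the second with the teacher and the first assistant, and so on until the student is trained with every larger network as a guide.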