Parallel Blockwise Knowledge Distillation for Deep Neural Network Compression

Deep neural networks (DNNs) have been extremely successful in solving many challenging AI tasks in natural language processing, speech recognition, and computer vision. However, DNNs are typically computation intensive, memory demanding, and power hungry, which significantly limits their usage on platforms with constrained resources. Therefore, a variety of compression techniques (e.g., quantization, pruning, and knowledge distillation) have been proposed to reduce the size and power consumption of DNNs. Blockwise knowledge distillation is one of the compression techniques that can effectively reduce the size of a highly complex DNN. However, it is not widely adopted due to its long training time. In this paper, we propose a novel parallel blockwise distillation algorithm to accelerate the distillation process of sophisticated DNNs. Our algorithm leverages local information to conduct independent blockwise distillation, utilizes depthwise separable layers as the efficient replacement block architecture, and properly addresses factors that limit parallelism (e.g., dependency, synchronization, and load balancing). Experimental results on an AMD server with four GeForce RTX 2080 Ti GPUs show that our algorithm achieves a 3x speedup with 19% energy savings on VGG distillation, and a 3.5x speedup with 29% energy savings on ResNet distillation, both with negligible accuracy loss. The speedup of ResNet distillation improves further to 3.87x when using four RTX 6000 GPUs in a distributed cluster.
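The abstract names depthwise separable layers as the replacement block architecture. A minimal sketch of why such blocks shrink the model, using illustrative layer sizes not taken from the paper: a standard k x k convolution is replaced by a depthwise k x k convolution followed by a 1x1 pointwise convolution, which multiplies far fewer weights.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weight count of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def sep_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weight count of a depthwise separable replacement:
    depthwise k x k conv (k*k per input channel) + 1x1 pointwise conv."""
    return k * k * c_in + c_in * c_out

# Illustrative block: 3x3 convolution, 256 input and 256 output channels.
standard = conv_params(3, 256, 256)      # 589824 weights
separable = sep_conv_params(3, 256, 256) # 67840 weights
print(f"reduction: {standard / separable:.2f}x")  # prints "reduction: 8.69x"
```

The same ratio, roughly k*k for wide layers, is what makes a distilled student built from separable blocks substantially smaller than the teacher it replaces.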
