Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation

Although pre-trained language models such as BERT have achieved appealing performance on a wide range of natural language processing tasks, they are computationally expensive to deploy in real-time applications. A typical remedy is to use knowledge distillation to compress these large pre-trained models (teacher models) into small student models. However, for a target domain with scarce training data, the teacher can hardly pass useful knowledge to the student, which degrades the performance of the student models. To tackle this problem, we propose a method that learns to augment data for BERT knowledge distillation in data-scarce domains: it learns a cross-domain manipulation scheme that automatically augments the target domain with the help of resource-rich source domains. Specifically, the proposed method generates samples drawn from a stationary distribution near the target data and adopts a reinforced selector to automatically refine the augmentation strategy according to the performance of the student. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art baselines on four different tasks. For the data-scarce domains, the compressed student models even perform better than the original large teacher model, with far fewer parameters (only ${\sim}13.3\%$), when only a few labeled examples are available.
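For readers unfamiliar with the teacher-student setup, the following is a minimal sketch of a generic soft-target distillation loss. It is not the objective used in this work, which the abstract does not specify. The function names `log_softmax` and `distillation_loss`, the `temperature` and `alpha` parameters, and the example logits are all assumptions made for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum_exp for z in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic distillation loss (illustrative, not the paper's objective).

    Combines KL(teacher || student) on temperature-softened distributions
    with the cross-entropy of the student on the hard label.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)

    # Soft-target term: how far the student is from the teacher's distribution.
    soft = sum(math.exp(lt) * (lt - ls)
               for lt, ls in zip(log_p_teacher, log_p_student))

    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -log_softmax(student_logits)[label]

    # The temperature**2 factor keeps the soft term's gradient scale
    # comparable across temperatures (a common convention).
    return alpha * (temperature ** 2) * soft + (1.0 - alpha) * hard


# Hypothetical logits for a three-class example.
teacher = [2.0, 0.5, -1.0]
agreeing_student = [2.0, 0.5, -1.0]
disagreeing_student = [-1.0, 0.5, 2.0]

loss_agree = distillation_loss(agreeing_student, teacher, label=0)
loss_disagree = distillation_loss(disagreeing_student, teacher, label=0)
```

In this sketch, a student whose logits match the teacher's incurs only the hard-label term, while a student that disagrees is penalized by both terms. With few labeled target examples, both terms are estimated from very little data, which is the setting the augmentation scheme above is meant to address.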
