Computer Engineering (计算机工程), 2023, Vol. 49, Issue (5): 90-96. doi: 10.19678/j.issn.1000-3428.0064366

• Artificial Intelligence and Pattern Recognition •

Enterprise-Named Entity Recognition Model Based on Knowledge Distillation

MAO Liang1, ZHAO Linjun1, YU Dunhui1,2, SUN Bin1,2

  1. College of Computer and Information Engineering, Hubei University, Wuhan 430062, China;
    2. Hubei Education Informationization Engineering and Technology Center, Wuhan 430062, China
  • Received: 2022-04-02  Revised: 2022-06-02  Published: 2022-08-22
  • About the authors: MAO Liang (born 1998), male, M.S. candidate, whose research focuses on knowledge graphs; ZHAO Linjun, undergraduate student; YU Dunhui, professor, Ph.D.; SUN Bin (corresponding author), lecturer, M.S.
  • Funding:
    National Key Research and Development Program of China (2017YFB1400602); National Natural Science Foundation of China (61977021); Technological Innovation Special Project of Hubei Province (2018ACA13).

Abstract: The Bidirectional Encoder Representations from Transformers (BERT) word embedding model can resolve the low prediction accuracy of simple named entity recognition models, but complex BERT-based word embedding models suffer from high computational complexity and long prediction time. To address this problem, a named entity recognition model based on knowledge distillation is constructed. A BERT+Conditional Random Field (CRF) model serves as the teacher model to obtain high named entity recognition accuracy, and, following the principle of structural similarity, a Bidirectional Gated Recurrent Unit (BiGRU)+CRF model serves as the student model, with knowledge distillation performed while the student model is trained. The tag probability matrices output by the Softmax layers of the teacher model and the student model are taken as the teacher's knowledge and the student's knowledge, respectively; the gap between them, measured with a mean-square loss function, is the soft-label error, and the error between the tags predicted by the student model and the ground-truth tags is the hard-label error. The total error is a weighted sum of the soft-label and hard-label errors. The model is trained through error backpropagation, which narrows the gap between the teacher's and the student's knowledge while reducing the total error, so that the prediction accuracy of the student model approaches that of the teacher model. The student model is then used for prediction, achieving accuracy close to the teacher model's while keeping the prediction time relatively short. Experimental results on the DuIE2.0 dataset show that, at the cost of a 2.6% loss in F1 value, the model's parameter count is reduced by 93.7% and its computation time by 65.2%.
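As an illustration of the distillation objective described in the abstract, the following PyTorch-style snippet shows how the soft-label error (mean-square loss between the teacher's and student's Softmax probability matrices) and the hard-label error (loss against the ground-truth tags) can be combined into a weighted total error. It is a minimal sketch rather than the authors' implementation: the function name distillation_loss, the weight alpha, and the teacher_model/student_model/batch_tokens/batch_tags names are placeholders, and plain cross-entropy stands in for the CRF negative log-likelihood used by the student model in the paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, alpha=0.5):
    # Soft-label error: mean-square loss between the tag probability matrices
    # produced by the Softmax layers of the student and the teacher.
    soft_error = F.mse_loss(F.softmax(student_logits, dim=-1),
                            F.softmax(teacher_logits, dim=-1))
    # Hard-label error: error between the student's predicted tags and the
    # ground-truth tags (plain cross-entropy here; the paper's student ends in
    # a CRF layer, whose negative log-likelihood would take this term's place).
    hard_error = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                                 true_labels.reshape(-1))
    # Total error: weighted sum of soft- and hard-label errors; backpropagating
    # it trains the student while pulling its outputs toward the teacher's.
    return alpha * soft_error + (1.0 - alpha) * hard_error

# Hypothetical usage: the teacher's outputs are computed without gradients so
# that only the student model is updated.
# with torch.no_grad():
#     teacher_logits = teacher_model(batch_tokens)   # (batch, seq_len, num_tags)
# student_logits = student_model(batch_tokens)       # (batch, seq_len, num_tags)
# loss = distillation_loss(student_logits, teacher_logits, batch_tags)
# loss.backward()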

Key words: knowledge distillation, named entity recognition, teacher model, student model, BERT model

CLC Number: