
Computer Engineering ›› 2024, Vol. 50 ›› Issue (9): 104-112. doi: 10.19678/j.issn.1000-3428.0068078

• Artificial Intelligence and Pattern Recognition •

Named Entity Recognition of Mechanical Equipment Failure for Imbalanced Data

DANG Xiaochao1, LIU Jian1, DONG Xiaohui1,*, ZHU Zhongyan2, LI Fenfang1

  1. School of Computer Science and Engineering, Northwest Normal University, Lanzhou 730000, Gansu, China
  2. Longshou Mine, Jinchuan Group Co., Ltd., Jinchang 737103, Gansu, China
  • Received: 2023-07-14 Published: 2024-09-15 Released: 2024-03-19
  • Contact: DONG Xiaohui
  • Funding:
    National Natural Science Foundation of China (62162056); Gansu Province Industrial Support Program (021CYZC-06)

Abstract:

Named Entity Recognition (NER) is a fundamental task in building knowledge graphs, and its accuracy directly affects graph quality. In real production settings, mechanical failure data usually contain a large amount of domain-specific vocabulary, and the distribution of entity types is generally imbalanced, which makes it difficult to identify failure entities accurately. NER methods designed for general domains perform poorly on such data, which in turn lowers the quality of the resulting knowledge graph. To address these problems, this paper proposes an entity recognition method that combines a Focal Loss function with a domain-specific dictionary. The method uses Focal Loss to handle the imbalance among entity types: it modifies the conventional cross-entropy loss by introducing a balancing factor and a modulating coefficient. In addition, domain-specific vocabulary is embedded into the model; the dictionary contains terminology for mechanical failures and helps the model recognize mechanical equipment failure entities more accurately. Experiments on a self-built mine hoist dataset show that incorporating Focal Loss raises the F1 value by 5.57 percentage points over the mainstream Bidirectional Encoder Representations from Transformers (BERT)-Bidirectional Long Short-Term Memory (BiLSTM)-Conditional Random Field (CRF) model, and that it outperforms the Synthetic Minority Over-sampling Technique (SMOTE), a typical method for handling imbalanced data. Embedding the domain dictionary on top of this further improves the F1 value, which reaches 89.13%.

Key words: Named Entity Recognition (NER), imbalanced data, Focal Loss function, mechanical equipment failure, Bidirectional Long Short-Term Memory (BiLSTM) network, Conditional Random Field (CRF)
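
The abstract describes Focal Loss as cross-entropy modified by a balancing factor and a modulating coefficient. The following is a minimal sketch of that idea, assuming the standard formulation FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The three-tag label set, the per-class α weights, the γ value, and the example logits are illustrative assumptions, not values reported by the authors, and the sketch scores single tokens only, without the BERT-BiLSTM-CRF architecture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy for one token: -log(p_t)."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, alpha=None, gamma=2.0):
    """Focal loss for one token: -alpha_t * (1 - p_t)**gamma * log(p_t).

    alpha: optional per-class balancing weights (assumed values, see below).
    gamma: modulating exponent; gamma = 0 with no alpha gives cross-entropy.
    """
    p_t = softmax(logits)[target]
    alpha_t = 1.0 if alpha is None else alpha[target]
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical tag set: 0 = "O" (majority), 1 = frequent entity, 2 = rare entity.
# These weights are illustrative only; the paper's settings are not given here.
alpha = [0.25, 0.5, 1.0]

easy = [4.0, 0.0, 0.0]  # confident prediction, gold tag 0 (majority class)
hard = [1.0, 0.5, 0.8]  # uncertain prediction, gold tag 2 (rare class)

for name, logits, target in [("easy", easy, 0), ("hard", hard, 2)]:
    ce = cross_entropy(logits, target)
    fl = focal_loss(logits, target, alpha, gamma=2.0)
    print(f"{name}: CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.4f}")
```

In this sketch the well-classified majority-class token keeps only a small fraction of its cross-entropy loss, while the uncertain rare-class token keeps a much larger share, which is the mechanism by which training attention shifts toward under-represented entity types.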