Computer Engineering (计算机工程), 2024, Vol. 50, Issue 5: 354-362. doi: 10.19678/j.issn.1000-3428.0068955

• Development Research and Engineering Application •

Research on Fine-grained Named-Entity-Recognition Method for Public-Opinion Texts in Northeast Asia

Hao WEI1,2,3, Hongyue DIAO1,3, Liangchen KONG1, Yaochen DENG2,3,*

  1. School of Software, Dalian University of Foreign Languages, Dalian 116044, Liaoning, China
  2. China Research Center for Northeast Asian Languages, Dalian University of Foreign Languages, Dalian 116044, Liaoning, China
  3. Liaoning New Lab for Innovations in Digital Humanities, Dalian University of Foreign Languages, Dalian 116044, Liaoning, China
  • Received: 2023-12-05 Published: 2024-05-15 Online: 2024-05-14
  • Contact: Yaochen DENG
  • Funding: Basic Scientific Research Project of Higher Education Institutions of Liaoning Province (LJKQZ20222451)

Abstract:

The evolving international situation in Northeast Asia is closely tied to China's development. Constructing a public-opinion knowledge graph for this region enables effective monitoring of public-opinion hotspots. This helps guide the healthy development of public opinion and supports government decision-making, and it is also of significant value for guarding against political marketing, enhancing national language capacity, and building harmonious and stable international relations. Named Entity Recognition (NER) is a key technology and core task in constructing knowledge graphs and has attracted extensive attention from researchers. This study uses real-time trending public-opinion texts related to Northeast Asia from social media and portal websites as data sources. Taking into account the regional characteristics and geopolitical structure of Northeast Asia, it establishes a fine-grained NER dataset comprising 10 major categories and 35 subcategories. It further proposes a public-opinion entity-recognition model based on the pretrained language model RoBERTa and a multilayer residual BiLSTM-CRF architecture (RoBERTa-ResBiLSTM-CRF). A rule-template-based post-processing strategy, applied after the model completes label prediction, is designed to improve overall entity-recognition performance. Experimental results show that the proposed public-opinion NER model outperforms mainstream traditional neural-network models, validating the effectiveness of the approach.

Key words: fine-grained, Named Entity Recognition (NER), public-opinion texts, deep learning, pretrained language models
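
The abstract mentions a rule-template post-processing step applied after label prediction but does not list the rules themselves. The following is only a minimal sketch of what one such rule might look like, assuming a BIO tagging scheme (the paper may use a different scheme) and using hypothetical fine-grained label names such as `LOC.city`. The function names `repair_bio` and `extract_spans` are illustrative and do not come from the paper.

```python
from typing import List, Optional, Tuple


def repair_bio(tags: List[str]) -> List[str]:
    """Hypothetical rule: an I-X tag that does not continue an entity of
    type X is promoted to B-X so that it starts a new entity."""
    fixed: List[str] = []
    prev_type: Optional[str] = None  # type of the previous token's entity, None if outside
    for tag in tags:
        if tag == "O":
            fixed.append(tag)
            prev_type = None
            continue
        prefix, _, ent_type = tag.partition("-")
        if prefix == "I" and ent_type != prev_type:
            tag = "B-" + ent_type  # orphan I- tag becomes the start of an entity
        fixed.append(tag)
        prev_type = ent_type
    return fixed


def extract_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, type) spans from a well-formed BIO sequence."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    cur: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel closes the last entity
        if tag.startswith("I-") and cur == tag[2:]:
            continue  # still inside the current entity
        if cur is not None:
            spans.append((start, i, cur))
            start, cur = None, None
        if tag.startswith("B-"):
            start, cur = i, tag[2:]
    return spans


# Illustrative predicted tags containing two illegal I- tags (made-up labels).
predicted = ["O", "I-LOC.city", "I-LOC.city", "O", "B-ORG.gov", "I-PER.name"]
repaired = repair_bio(predicted)
entities = extract_spans(repaired)
```

Promoting an orphan `I-` tag to `B-` is one common repair choice; an equally plausible rule would discard the orphan token, and which option the authors use is not stated in the abstract.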