Research on Fine-grained Named-Entity-Recognition Method for Public-Opinion Texts in Northeast Asia

doi:10.19678/j.issn.1000-3428.0068955

Abstract

The evolving international situation in Northeast Asia is closely tied to China's development. Constructing a public-opinion knowledge graph for this region enables effective monitoring of public-opinion hotspots. This not only helps guide the healthy development of public opinion and assists government decision-making, but is also of great value for guarding against political marketing, enhancing national language capacity, and building harmonious and stable international relations. Named Entity Recognition (NER) is a key technology and core task in knowledge-graph construction and has attracted extensive attention from researchers. This study uses real-time hot-topic public-opinion texts related to Northeast Asia, collected from social media and portal websites, as its data source. Taking full account of the regional characteristics and geopolitical structure of Northeast Asia, a fine-grained NER dataset comprising 10 major categories and 35 subcategories is established. A public-opinion entity-recognition model based on the pre-trained language model RoBERTa and a multilayer residual BiLSTM-CRF architecture (RoBERTa-ResBiLSTM-CRF) is then proposed. After the model completes label prediction, a post-processing strategy based on rule templates is applied to improve overall entity-recognition performance. Experimental results show that the proposed public-opinion NER model outperforms mainstream traditional neural-network models, validating the effectiveness of the method.

Keywords: fine-grained, Named Entity Recognition (NER), public-opinion texts, deep learning, pre-trained language models
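
The abstract mentions a rule-template post-processing step applied after label prediction but does not list the rules themselves. The following is only a minimal sketch of what one such rule could look like, assuming BIO-style tags; the function name `repair_bio`, the rule, and the labels `LOC`, `ORG`, and `PER` are hypothetical illustrations and are not taken from the paper's 35 subcategories or its actual templates.

```python
from typing import List, Optional


def repair_bio(labels: List[str]) -> List[str]:
    """Apply one hypothetical rule template to a predicted tag sequence.

    Rule (an assumption, not the paper's): an I-X tag must continue a B-X or
    I-X tag of the same type. An I-X tag that opens a span (at sentence start,
    after O, or after a different entity type) is rewritten to B-X.
    """
    repaired: List[str] = []
    prev_type: Optional[str] = None  # type of the previous token, None if outside
    for label in labels:
        if label == "O":
            repaired.append(label)
            prev_type = None
            continue
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and ent_type != prev_type:
            label = "B-" + ent_type  # ill-formed span start: promote I- to B-
        repaired.append(label)
        prev_type = ent_type
    return repaired


# Hypothetical model output with two ill-formed spans.
predicted = ["O", "I-LOC", "I-LOC", "O", "B-ORG", "I-PER"]
print(repair_bio(predicted))
# ['O', 'B-LOC', 'I-LOC', 'O', 'B-ORG', 'B-PER']
```

A rule of this kind only repairs tag sequences that are structurally invalid; domain-specific templates, such as ones keyed to Northeast Asian place or organization names, would need the paper's own rule set.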

Hao WEI, Hongyue DIAO, Liangchen KONG, Yaochen DENG. Research on Fine-grained Named-Entity-Recognition Method for Public-Opinion Texts in Northeast Asia[J]. Computer Engineering, 2024, 50(5): 354-362.

https://www.ecice06.com/CN/Y2024/V50/I5/354

Figures and tables: 11

Fig.1 Framework of RoBERTa-ResBiLSTM-CRF

Fig.2 Structure of the BERT pre-trained language model

Fig.3 Structure of the LSTM neural unit
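
For readers without access to the figures, the LSTM unit named in Fig.3 is commonly written as below. This is the standard textbook formulation, not the paper's own notation, which may differ. The final line is an assumption about how a residual connection between stacked BiLSTM layers could be formed, and it presumes that the input and output dimensions of each layer match.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}; x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}; x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}; x_t] + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o [h_{t-1}; x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)} \\
h_t^{\mathrm{bi}} &= [\overrightarrow{h}_t ; \overleftarrow{h}_t] && \text{(BiLSTM: forward and backward states concatenated)} \\
H^{(l)} &= \mathrm{BiLSTM}^{(l)}\left(H^{(l-1)}\right) + H^{(l-1)} && \text{(assumed residual connection at layer } l\text{)}
\end{aligned}
```

Here $x_t$ is the input at position $t$ (in this architecture, presumably the RoBERTa representation of the token), $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication.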
