Chinese Compound Word Extraction Algorithm Based on Word Co-occurrence Directed Graph

doi:10.3969/j.issn.1000-3428.2011.23.060

Computer Engineering, 2011, Vol. 37, Issue 23: 177-180.

LIU Xing-lin^1,2, ZHENG Qi-lun^1, MA Qian-li^1

(1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China; 2. School of Computer Science, Wuyi University, Jiangmen, Guangdong 529020, China)

Received: 2011-06-01  Online: 2011-12-05  Published: 2011-12-05
About the authors: LIU Xing-lin (born 1976), male, experimentalist and Ph.D. candidate, whose main research interests are text knowledge acquisition, intelligent computing, and data mining; ZHENG Qi-lun, professor, Ph.D., doctoral supervisor; MA Qian-li, lecturer, Ph.D.
Supported by:
Natural Science Foundation of Guangdong Province (9451064101003233, S2011010003681); Science and Technology Planning Project of Guangdong Province (2010B010600039); Fundamental Research Funds for the Central Universities, South China University of Technology (2009ZM0125, 2009ZM0189, 2009ZM0255)

Abstract

Word segmentation systems cannot recognize compound words because these words are not included in their dictionaries. To address this problem, this paper proposes a Chinese compound word extraction algorithm based on a word co-occurrence directed graph. The algorithm obtains word strings from a document by part-of-speech detection, builds a word co-occurrence directed graph from those word strings, and, borrowing the idea of the Bellman-Ford algorithm, searches the graph from multiple source vertices for the longest paths whose weights satisfy given conditions. The word strings corresponding to these paths are taken as compound words. Experimental results show that the algorithm achieves a compound word extraction precision of 91.16%.

Key words: compound word extraction, part-of-speech detection, word co-occurrence directed graph, Natural Language Processing (NLP), Bellman-Ford algorithm
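
The pipeline summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the weight condition or the search details, so the code assumes an edge weight equal to the number of times one word directly follows another, a simple threshold `min_weight` as the acceptance condition, and a cap `max_len` on path length. The part-of-speech detection step is not modeled; the word strings are assumed to be given. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def build_cooccurrence_graph(word_strings):
    """Directed graph: graph[u][v] = number of times word v directly follows word u."""
    graph = defaultdict(dict)
    for words in word_strings:
        for u, v in zip(words, words[1:]):
            graph[u][v] = graph[u].get(v, 0) + 1
    return graph


def longest_paths_from(graph, source, min_weight, max_len):
    """Grow simple paths from `source` one edge per round, using only edges
    whose weight is at least `min_weight` (an assumed acceptance condition).
    The round-by-round extension loosely echoes Bellman-Ford iterations."""
    frontier, best = [(source,)], []
    for _ in range(max_len - 1):
        extended = [
            path + (v,)
            for path in frontier
            for v, w in graph.get(path[-1], {}).items()
            if w >= min_weight and v not in path
        ]
        if not extended:
            break
        frontier = best = extended
    return best


def is_subpath(short, long):
    """True if `short` occurs as a contiguous run inside the strictly longer path `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def extract_compounds(word_strings, min_weight=2, max_len=5):
    """Search from every source vertex and keep only paths not contained in a longer one."""
    graph = build_cooccurrence_graph(word_strings)
    paths = [
        p
        for s in list(graph)
        for p in longest_paths_from(graph, s, min_weight, max_len)
    ]
    maximal = [p for p in paths if not any(is_subpath(p, q) for q in paths)]
    return ["".join(p) for p in maximal]


# Invented sample data: segmented word strings (natural / language / processing / disaster).
word_strings = [
    ["自然", "语言", "处理"],
    ["自然", "语言", "处理"],
    ["语言", "处理"],
    ["自然", "灾害"],
]
compounds = extract_compounds(word_strings)
print(compounds)
```

In this toy input the edge from 自然 to 灾害 occurs only once and falls below the assumed threshold, while the path through 自然, 语言, and 处理 survives and absorbs the shorter path that starts at 语言.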

CLC Number: TP391

LIU Xing-lin, ZHENG Qi-lun, MA Qian-li. Chinese Compound Word Extraction Algorithm Based on Word Co-occurrence Directed Graph[J]. Computer Engineering, 2011, 37(23): 177-180.

http://www.ecice06.com/CN/Y2011/V37/I23/177

