
Computer Engineering ›› 2011, Vol. 37 ›› Issue (23): 177-180. doi: 10.3969/j.issn.1000-3428.2011.23.060

• Artificial Intelligence and Recognition Technology •

Chinese Compound Word Extraction Algorithm Based on Word Co-occurrence Directed Graph

LIU Xing-lin 1,2, ZHENG Qi-lun 1, MA Qian-li 1

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China
  2. School of Computer Science, Wuyi University, Jiangmen, Guangdong 529020, China
  • Received: 2011-06-01 Published: 2011-12-05 Online: 2011-12-05
  • About the authors: LIU Xing-lin (born 1976), male, laboratory technician and Ph.D. candidate; main research interests: text knowledge acquisition, intelligent computing, data mining. ZHENG Qi-lun, professor, Ph.D., doctoral supervisor. MA Qian-li, lecturer, Ph.D.
  • Funding:
    Natural Science Foundation of Guangdong Province (9451064101003233, S2011 010003681); Science and Technology Planning Project of Guangdong Province (2010B010600039); Fundamental Research Funds for the Central Universities, South China University of Technology (2009ZM0125, 2009ZM0189, 2009ZM0255)

Abstract: Word segmentation systems cannot recognize compound words because compound words are not included in their dictionaries. To address this problem, this paper proposes a Chinese compound word extraction algorithm based on a word co-occurrence directed graph. The algorithm obtains word strings from the text by part-of-speech detection, generates a word co-occurrence directed graph from these word strings, and, borrowing the idea of the Bellman-Ford algorithm, searches the graph from multiple source points for the longest paths whose weight values satisfy given conditions. The word strings corresponding to these paths are taken as compound words. Experimental results show that the algorithm achieves a compound word extraction precision of 91.16%.

Key words: compound word extraction, part-of-speech detection, word co-occurrence directed graph, Natural Language Processing (NLP), Bellman-Ford algorithm
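The pipeline described in the abstract can be illustrated with a rough Python sketch. This is not the paper's algorithm: the abstract does not define the edge weights, the weight condition, or how the search adapts Bellman-Ford, so several pieces here are assumptions. Edge weights are taken to be adjacent co-occurrence counts, the condition is a hypothetical `min_count` threshold on every edge, and the search is a simple round-based path extension that only loosely echoes Bellman-Ford's repeated edge relaxation. The function names and the `max_len` cap are likewise illustrative, and the input word strings are assumed to be already segmented.

```python
from collections import defaultdict


def build_cooccurrence_graph(word_strings):
    """Count how often word b directly follows word a across all word strings."""
    graph = defaultdict(lambda: defaultdict(int))
    for words in word_strings:
        for a, b in zip(words, words[1:]):
            graph[a][b] += 1
    return graph


def longest_paths(graph, min_count=2, max_len=6):
    """Return maximal simple paths whose every edge has count >= min_count.

    Assumption: min_count stands in for the paper's unspecified weight condition.
    """
    # Keep only edges that meet the assumed weight condition.
    strong = {
        a: [b for b, count in succ.items() if count >= min_count]
        for a, succ in graph.items()
    }
    targets = {b for succ in strong.values() for b in succ}
    # Sources: words with a strong outgoing edge and no strong incoming edge.
    sources = [a for a, succ in strong.items() if succ and a not in targets]

    frontier = [(s,) for s in sources]
    results = []
    # Round-based extension: each round lengthens every open path by one edge.
    for _ in range(max_len - 1):
        next_frontier = []
        for path in frontier:
            extensions = [b for b in strong.get(path[-1], []) if b not in path]
            if extensions:
                next_frontier.extend(path + (b,) for b in extensions)
            else:
                results.append(path)  # cannot grow further: maximal path
        frontier = next_frontier
        if not frontier:
            break
    results.extend(frontier)  # paths cut off by the max_len cap
    return sorted(p for p in results if len(p) >= 2)


if __name__ == "__main__":
    strings = [
        ["自然", "语言", "处理"],
        ["自然", "语言", "处理"],
        ["自然", "语言"],
        ["语言", "模型"],
    ]
    graph = build_cooccurrence_graph(strings)
    for path in longest_paths(graph, min_count=2):
        print("".join(path))  # prints 自然语言处理
```

In this toy input, the edge from 语言 to 模型 occurs only once and is dropped by the threshold, so the only surviving maximal path is the three-word candidate. Restricting sources to words with no strong incoming edge keeps sub-paths of a longer candidate from being reported separately, at the cost of finding nothing in a graph where every strong edge lies on a cycle.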
