
Computer Engineering ›› 2006, Vol. 32 ›› Issue (23): 188-190. doi: 10.3969/j.issn.1000-3428.2006.23.067

• Artificial Intelligence and Recognition Technology •

Automatic Chinese Term Extraction Based on Decomposition of Prime String

HE Tingting1,2, ZHANG Yong3

  (1. School of Software, Tsinghua University, Beijing 100084; 2. National Language Resources Monitor & Research Center (Network Media), Wuhan 430079; 3. Department of Computer Science, Central China Normal University, Wuhan 430079)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2006-12-05 Published: 2006-12-05

Abstract: Based on the compositional characteristics of Chinese terms, this paper proposes an automatic term extraction (ATE) method based on the decomposition of prime strings. Words are divided into two classes: prime words, which have a simple structure, and combined words, which have a complex structure. Prime words are extracted using the F-MI parameter; building on these, combined words are then extracted by prime-string decomposition. Experimental results show that the algorithm effectively improves the precision of Chinese ATE. The method has been applied in a project of the National Language Resources Monitor & Research Center (Network Media) for extracting words from online text, with good results.

Key words: decomposition of prime string, automatic term extraction, C-value, mutual information
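
The abstract names mutual information as one ingredient but does not define the F-MI parameter, so the following is only a minimal sketch of ordinary pointwise mutual information over adjacent character pairs, not the paper's F-MI. The function name `bigram_pmi` and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return {bigram: PMI in bits} for adjacent character pairs.

    Standard pointwise mutual information, log2(p(xy) / (p(x) * p(y))),
    estimated from raw counts. This is NOT the paper's F-MI parameter,
    which the abstract does not define.
    """
    unigrams = Counter()
    bigrams = Counter()
    for s in sentences:
        unigrams.update(s)  # count single characters
        bigrams.update(s[i:i + 2] for i in range(len(s) - 1))  # adjacent pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[bg[0]] / n_uni
        p_y = unigrams[bg[1]] / n_uni
        scores[bg] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy corpus, for illustration only.
corpus = ["术语抽取方法", "术语自动抽取", "中文术语抽取"]
scores = bigram_pmi(corpus)
```

On this toy corpus the pair "术语" scores higher than "语抽", which straddles a word boundary. Plain PMI is also known to overrate pairs seen only once, which is one reason term extractors usually combine it with frequency information.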