
Computer Engineering (计算机工程)

• Artificial Intelligence and Recognition Technology

New Words Identification Based on Iterative Algorithm (基于迭代算法的新词识别)

ZHAO Xiao-bao (赵小宝), ZHANG Hua-ping (张华平)

  1. (School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China)
  • Received: 2013-03-28 Online: 2014-07-15 Published: 2014-07-14
  • About the authors: ZHAO Xiao-bao (born 1987), male, master's student, whose main research interests are natural language processing and information retrieval; ZHANG Hua-ping, associate research fellow, Ph.D.
  • Supported by:
    National Natural Science Foundation of China (61272362).

Abstract: New word identification is an important foundation of Chinese information processing, but the strong word-forming ability of Chinese characters makes new words hard to detect automatically. Inspired by the duality principle, this paper proposes an iterative algorithm for new word identification. The target corpus is first segmented and tagged with parts of speech, and two scans then collect string statistics and extract repeated patterns. Combined with word-structure features, the mutual information (MI) of the repeated patterns, their left (right) entropy, and the right (left) average entropy of their left (right) neighbors are applied iteratively to produce a candidate list of new words. A final filtering pass against a Chinese word collocation library yields the final list of new words. Experimental results show that the method reaches a P@10 of 100% and raises P@100 to 90%, and that the right (left) average entropy of the left (right) neighbor improves the accuracy of new word identification to some extent.

Key words: duality principle, new word identification, iterative algorithm, information entropy, repeated pattern, Chinese word collocation library
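The statistics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas or thresholds, so the code below assumes the standard definitions: pointwise mutual information for a two-token pattern, and Shannon entropy over the tokens seen immediately to its left and right. The function names `neighbor_entropy` and `pattern_statistics` are hypothetical, and the neighbor average entropy feature is left out because the abstract does not define it.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy (in bits) of a list of neighboring tokens."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pattern_statistics(tokens, pattern):
    """Frequency, left/right entropy and PMI of a two-token pattern.

    tokens  -- the segmented corpus as a flat list of strings
    pattern -- a tuple of two adjacent tokens (assumed pattern length: 2)
    """
    first, second = pattern
    unigrams = Counter(tokens)
    left, right, freq = [], [], 0
    for i in range(len(tokens) - 1):
        if tokens[i] == first and tokens[i + 1] == second:
            freq += 1
            if i > 0:
                left.append(tokens[i - 1])   # token just before the pattern
            if i + 2 < len(tokens):
                right.append(tokens[i + 2])  # token just after the pattern
    pmi = None
    if freq:
        n = len(tokens)
        p_xy = freq / (n - 1)                # n - 1 adjacent pairs in total
        p_x = unigrams[first] / n
        p_y = unigrams[second] / n
        pmi = math.log2(p_xy / (p_x * p_y))
    return {
        "freq": freq,
        "left_entropy": neighbor_entropy(left),
        "right_entropy": neighbor_entropy(right),
        "pmi": pmi,
    }


if __name__ == "__main__":
    corpus = ["a", "x", "y", "b", "c", "x", "y", "d", "a", "x", "y", "b"]
    print(pattern_statistics(corpus, ("x", "y")))
```

Under these assumptions, a repeated pattern whose parts co-occur far more often than chance (high PMI) and whose left and right contexts vary freely (high entropy on both sides) is a stronger new word candidate. How the paper combines and iterates these scores is not specified in the abstract.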

CLC number: