
Computer Engineering ›› 2012, Vol. 38 ›› Issue (22): 163-166. doi: 10.3969/j.issn.1000-3428.2012.22.040

• Artificial Intelligence and Recognition Technology •

Chinese Webpage Keyword Extraction Based on Semantics Extension Model

WANG Yang, SHUAI Jian-mei

  1. (School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China)
  • Received: 2012-03-01 Revised: 2012-03-21 Online: 2012-11-20 Published: 2012-11-17
  • About the authors: WANG Yang (born 1984), male, master's student, main research interest: data mining; SHUAI Jian-mei, associate professor
  • Supported by:
    National "863" Program of China, project "Automatic Discovery, Analysis, and Evaluation of Video Service Websites Incorporating Semantics" (2008AA01Z408)

Abstract: This paper proposes a stepwise, unsupervised keyword extraction method for Chinese Webpages based on a semantics extension model. An evaluation function scores each word by its Webpage structure features, part of speech, length, and TF-IDF value and is used to transform the term-document matrix, and a clustering algorithm then extracts the candidate keywords. Following n-gram language model theory, features such as Accessor Variety (AV) are introduced to build a word-based semantics extension model, which extends the candidate keywords into key phrases without supervision. Experimental results show that the method improves the extraction of out-of-vocabulary words and phrases and performs better than traditional keyword extraction algorithms, raising the quality of Chinese Webpage keyword extraction.

Key words: Chinese Webpage keyword extraction, semantics extension model, Accessor Variety (AV), clustering algorithm, n-gram language model
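
The abstract names Accessor Variety (AV) as one feature of the extension model but does not give its formula. As a rough illustration only, the sketch below uses the commonly cited definition, AV(s) = min(number of distinct left neighbours, number of distinct right neighbours), computed over word tokens. The function name `accessor_variety`, the boundary markers, the treatment of sentence boundaries as ordinary neighbours, and the toy corpus are all assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict

# Hypothetical sentence-boundary markers (an assumption of this sketch).
BOS, EOS = "<s>", "</s>"


def accessor_variety(sentences, n=1):
    """Return {word n-gram: AV} for every n-gram of length n.

    AV(s) = min(distinct left neighbours of s, distinct right neighbours of s).
    Sentence boundaries are treated as ordinary neighbours here, which is a
    simplification chosen for this sketch.
    """
    left = defaultdict(set)
    right = defaultdict(set)
    for tokens in sentences:
        padded = [BOS] + list(tokens) + [EOS]
        # i is the start of an n-gram made only of real tokens.
        for i in range(1, len(padded) - n):
            gram = tuple(padded[i:i + n])
            left[gram].add(padded[i - 1])
            right[gram].add(padded[i + n])
    return {gram: min(len(left[gram]), len(right[gram])) for gram in left}


# Toy, pre-segmented corpus (illustrative data only).
sentences = [
    ["语义", "扩展", "模型"],
    ["语义", "扩展", "方法"],
    ["基于", "语义", "扩展", "模型"],
]

av_unigram = accessor_variety(sentences, n=1)
av_bigram = accessor_variety(sentences, n=2)

print(av_unigram[("语义",)], av_unigram[("扩展",)])
print(av_bigram[("语义", "扩展")])
```

In this toy corpus the bigram ("语义", "扩展") has AV 2 while each of its two words alone has AV 1: the pair appears in more varied contexts than its parts, which is the kind of signal that could justify extending a candidate keyword into a longer key phrase. How the paper actually combines AV with its other features is not stated in the abstract.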
