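Abstract: This paper proposes a step-by-step, unsupervised keyword extraction method based on a semantic extension model. Features of each word, including its Web page structure features, part of speech, word length, and TF-IDF value, are selected, and candidate keywords are extracted with a clustering algorithm. Drawing on n-gram language model theory, features such as accessor variety are introduced to build a word-based semantic extension model, and an unsupervised method expands the candidate keywords into keyword strings. Experimental results show that the method effectively improves the extraction of out-of-vocabulary words and phrases and raises the quality of keyword extraction from Chinese Web pages.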
Keywords:
Chinese Webpage keyword extraction,
semantic extension model,
accessor variety,
clustering algorithm,
n-gram language model
Abstract: This paper presents a Chinese Webpage keyword extraction algorithm based on a word extension model. It builds an evaluation function that transforms the term-document matrix by scoring each candidate keyword on its Web structure, part of speech, length, and TF-IDF value. It then uses the word extension model, which is based on the n-gram language model, to extend the candidate keywords into key phrases. Experimental results show that the proposed algorithm performs better than traditional keyword extraction algorithms.
Key words:
Chinese Webpage keyword extraction,
semantics extension model,
Accessor Variety (AV),
clustering algorithm,
n-gram language model
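The accessor variety feature named in the abstract can be illustrated with a minimal sketch. It assumes the text has already been segmented into a list of word tokens, and it counts the distinct left and right neighbours of a candidate phrase. The function names (`occurrences`, `accessor_variety`, `extend_right`), the boundary markers, and the stopping rule (extend only while exactly one distinct word follows the phrase) are hypothetical choices made for illustration; they are not the extension model described in the paper.

```python
def occurrences(tokens, phrase):
    """Start indices at which the token tuple `phrase` occurs in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == tuple(phrase)]


def accessor_variety(tokens, phrase):
    """Return (left_av, right_av): counts of distinct neighbouring tokens.

    Text boundaries are counted as neighbours via the markers <s> and </s>
    (an assumption of this sketch).
    """
    n = len(phrase)
    left, right = set(), set()
    for i in occurrences(tokens, phrase):
        left.add(tokens[i - 1] if i > 0 else "<s>")
        right.add(tokens[i + n] if i + n < len(tokens) else "</s>")
    return len(left), len(right)


def extend_right(tokens, phrase, max_len=4, min_count=2):
    """Greedily extend a candidate keyword to the right.

    Hypothetical rule: append the next word while the phrase occurs at
    least `min_count` times and is always followed by the same word.
    """
    phrase = tuple(phrase)
    while len(phrase) < max_len:
        n = len(phrase)
        starts = occurrences(tokens, phrase)
        followers = {tokens[i + n] if i + n < len(tokens) else "</s>"
                     for i in starts}
        if len(starts) < min_count or len(followers) != 1:
            break
        follower = next(iter(followers))
        if follower == "</s>":
            break
        phrase = phrase + (follower,)
    return phrase


tokens = ["a", "support", "vector", "machine",
          "b", "support", "vector", "machine", "c"]
print(accessor_variety(tokens, ("support",)))
print(extend_right(tokens, ("support",)))
```

A low right-side count suggests the candidate is a fragment of a longer unit, which is why this sketch keeps extending until the set of following words becomes varied.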
CLC number:
汪洋, 帅建梅. 基于语义扩展模型的中文网页关键词抽取[J]. 计算机工程, 2012, 38(22): 163-166.
HONG Xiang, SHUAI Jian-Mei. Chinese Webpage Keyword Extraction Based on Semantics Extension Model[J]. Computer Engineering, 2012, 38(22): 163-166.