Computer Engineering (计算机工程)

• Artificial Intelligence and Recognition Technology

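Feature Extension Algorithm Fusing Statistical Information and Semantic Similarity

LI Xiaohong (李晓红), CAO Lin (曹林), SU Yun (宿云), MA Huifang (马慧芳)

  1. (College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China)
  • Received: 2016-04-25 Online: 2017-06-15 Published: 2017-06-15
  • About the authors: LI Xiaohong (born 1978), female, lecturer, whose main research interests are data mining and intelligent information processing; CAO Lin, master's student; SU Yun, lecturer and Ph.D. candidate; MA Huifang, associate professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61163039); Gansu Province Youth Science and Technology Fund (1606RJYA269, 145RJYA259); Scientific Research Project of Higher Education Institutions of Gansu Province (2015A-008); Northwest Normal University Young Teachers' Research Capability Enhancement Program, Backbone Project (NWNU-LKQN-14-5, NWNU-LKQN-16-20).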

Abstract: Based on an analysis of the high dimensionality and sparsity of short texts, this paper proposes a short-text feature extension algorithm that fuses the statistical information between feature words with their semantic similarity. First, the candidate feature set is filtered by word contribution degree to obtain the initial extension set. Next, the statistical correlation between feature words is calculated and a set of binary correlated word pairs is constructed. Finally, using the semantic relations in the external knowledge base HowNet, the algorithm obtains the sense sets of the correlated word pairs, calculates their semantic similarity, and adds the senses that satisfy the condition to the short text as feature words, yielding the extended feature set. Experimental results show that extending short-text features with the proposed algorithm significantly improves the classification performance of classifiers.

Key words: short text, statistical correlation, semantic similarity, HowNet, feature extension
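
The three steps summarized in the abstract can be illustrated with a minimal sketch, under loudly stated assumptions. The abstract gives no formulas, so document frequency stands in for the contribution degree, pointwise mutual information (PMI) stands in for the statistical correlation, and two small hand-written dictionaries (`TOY_SENSES`, `TOY_SIM`) stand in for HowNet lookups. The rule for which senses are added is only one possible reading of the abstract. All function names, thresholds, and data below are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def select_candidates(docs, top_k):
    """Keep the top_k words by document frequency.

    Assumption: document frequency is a stand-in for the paper's
    "contribution degree", which the abstract does not define.
    """
    df = Counter(w for doc in docs for w in set(doc))
    ranked = sorted(df, key=lambda w: (-df[w], w))
    return set(ranked[:top_k]), df


def correlated_pairs(docs, candidates, df, min_pmi):
    """Return candidate word pairs whose document-level PMI >= min_pmi.

    Assumption: PMI is a stand-in for the paper's statistical correlation.
    """
    n = len(docs)
    co = Counter()
    for doc in docs:
        present = sorted(set(doc) & candidates)
        co.update(combinations(present, 2))
    pairs = {}
    for (a, b), c in co.items():
        pmi = math.log(c * n / (df[a] * df[b]))
        if pmi >= min_pmi:
            pairs[(a, b)] = pmi
    return pairs


def extend_features(doc, pairs, senses, sim, min_sim):
    """Add senses of a correlated partner word that are similar enough
    to a word already present in the short text (hypothetical rule)."""
    words = set(doc)
    extended = set(words)
    for a, b in pairs:
        for present, partner in ((a, b), (b, a)):
            if present in words:
                for sense in senses.get(partner, ()):
                    if sim(present, sense) >= min_sim:
                        extended.add(sense)
    return extended


# Toy corpus and toy "knowledge base"; real sense sets and similarities
# would come from HowNet, which is not queried here.
DOCS = [
    ["apple", "phone", "screen"],
    ["apple", "phone", "battery"],
    ["apple", "fruit", "juice"],
    ["phone", "battery", "charge"],
]
TOY_SENSES = {"phone": ["telephone", "handset"],
              "battery": ["cell", "power source"]}
TOY_SIM = {frozenset(("phone", "power source")): 0.7,
           frozenset(("phone", "cell")): 0.2}


def toy_sim(x, y):
    """Look up a hand-written similarity score; unknown pairs score 0."""
    return TOY_SIM.get(frozenset((x, y)), 0.0)


candidates, df = select_candidates(DOCS, top_k=3)
pairs = correlated_pairs(DOCS, candidates, df, min_pmi=0.0)
extended = extend_features(["phone", "screen"], pairs,
                           TOY_SENSES, toy_sim, min_sim=0.5)
```

In this toy run only word pairs with non-negative PMI survive the second step, and the short text `["phone", "screen"]` gains a feature only when a sense of a correlated partner word clears the similarity threshold. Swapping in a different correlation measure or a real HowNet-based similarity would change only the helper passed to each step.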

CLC Number: