Abstract:
To address the high dimensionality and sparsity of short texts, this paper proposes a feature extension algorithm for short text that fuses statistical information between feature words with semantic similarity. First, the candidate feature set is filtered by each word's contribution degree to obtain the initial feature extension set. Next, the statistical correlation between feature words is calculated and used to build a set of correlated word pairs. Finally, using the semantic relations in the external knowledge base HowNet, the algorithm obtains the sense sets of the correlated word pairs, calculates their semantic similarity, and adds the senses that meet the conditions to the short text as feature words, yielding the extended feature set. Experimental results show that extending short-text features with the proposed algorithm significantly improves the classification performance of classifiers.
Key words: short text, statistical correlation, semantic similarity, HowNet, feature extension
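The three stages summarized in the abstract (feature selection, correlated word pairs, similarity-gated extension) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything concrete here is an assumption: a TF-IDF-style score stands in for the contribution degree, Jaccard co-occurrence over documents stands in for the statistical correlation, and a hand-written lookup table stands in for HowNet sense sets and HowNet-based similarity. The function names, thresholds, and toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations
import math


def select_features(docs, top_k):
    """Keep the top_k words by a TF-IDF-style score (assumed contribution degree)."""
    n = len(docs)
    tf = Counter(w for d in docs for w in d)
    df = Counter(w for d in docs for w in set(d))
    score = {w: tf[w] * math.log((n + 1) / df[w]) for w in tf}
    ranked = sorted(score, key=lambda w: (-score[w], w))  # ties broken alphabetically
    return set(ranked[:top_k])


def correlated_pairs(docs, features, min_corr):
    """Word pairs whose Jaccard co-occurrence (assumed correlation) reaches min_corr."""
    doc_ids = {w: {i for i, d in enumerate(docs) if w in d} for w in features}
    pairs = set()
    for a, b in combinations(sorted(features), 2):
        union = doc_ids[a] | doc_ids[b]
        if union and len(doc_ids[a] & doc_ids[b]) / len(union) >= min_corr:
            pairs.add((a, b))
    return pairs


def extend_features(doc, pairs, sense_sets, similarity, min_sim):
    """Append senses of correlated words that are similar enough to a word in doc."""
    extended = list(doc)
    present = set(doc)
    for a, b in sorted(pairs):
        for src, other in ((a, b), (b, a)):
            if src not in present:
                continue
            for cand in sense_sets.get(other, [other]):
                if cand not in extended and similarity(src, cand) >= min_sim:
                    extended.append(cand)
    return extended


# Toy stand-ins for HowNet sense sets and similarity (hypothetical values).
TOY_SENSES = {"phone": ["phone", "handset"], "fruit": ["fruit", "produce"]}
TOY_SIM = {frozenset(("apple", "phone")): 0.75, frozenset(("apple", "handset")): 0.8}


def toy_similarity(a, b):
    return 1.0 if a == b else TOY_SIM.get(frozenset((a, b)), 0.0)


corpus = [
    ["apple", "phone", "screen"],
    ["apple", "phone", "battery"],
    ["banana", "fruit", "sweet"],
    ["banana", "fruit", "fresh"],
]
features = select_features(corpus, top_k=4)
pairs = correlated_pairs(corpus, features, min_corr=0.5)
extended = extend_features(["apple", "screen"], pairs, TOY_SENSES, toy_similarity, min_sim=0.7)
print(extended)
```

In this sketch the short text `["apple", "screen"]` gains features only through a pair that is both statistically correlated in the corpus and semantically close under the toy similarity, which mirrors the two-stage filtering the abstract describes. A real implementation would replace the lookup table with queries against HowNet.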
LI Xiaohong, CAO Lin, SU Yun, MA Huifang. Feature Extension Algorithm Fusing Statistical Information and Semantic Similarity[J]. Computer Engineering.