Feature Extension Algorithm Fusing Statistical Information and Semantic Similarity

doi:10.3969/j.issn.1000-3428.2017.06.028

Computer Engineering

LI Xiaohong, CAO Lin, SU Yun, MA Huifang

(College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China)

Received: 2016-04-25 Online: 2017-06-15 Published: 2017-06-15
About the authors: LI Xiaohong (born 1978), female, lecturer, main research interests: data mining and intelligent information processing; CAO Lin, master's student; SU Yun, lecturer and Ph.D. candidate; MA Huifang, associate professor, Ph.D.
Funding:
National Natural Science Foundation of China (61163039); Gansu Province Youth Science and Technology Fund (1606RJYA269, 145RJYA259); Scientific Research Project of Gansu Higher Education Institutions (2015A-008); Northwest Normal University Young Teachers' Research Capacity Promotion Plan, Backbone Project (NWNU-LKQN-14-5, NWNU-LKQN-16-20).

Abstract

Abstract: To address the high dimensionality and sparsity of short texts, this paper proposes a short-text feature extension algorithm that fuses statistical information between feature words with semantic similarity. First, the candidate feature set is filtered by the contribution degree of each word to obtain the initial extension set. Then, the statistical correlation between feature words is computed and a set of binary correlated word pairs is constructed. Finally, using the semantic relations in the external knowledge base HowNet, the algorithm obtains the sense sets of the correlated word pairs, computes their semantic similarity, and adds the senses that satisfy the conditions to the short text as feature words, yielding the extended feature set. Experimental results show that extending short-text features with the proposed algorithm significantly improves the classification performance of classifiers.

Key words: short text, statistical correlation, semantic similarity, HowNet, feature extension
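
The three stages named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: a TF-IDF-style weight stands in for the contribution degree, document-level Jaccard co-occurrence stands in for the statistical correlation, and a hypothetical `similarity` callable with a toy lookup table stands in for HowNet sense similarity. All function names, thresholds, and data are invented for the example and are not the authors' implementation.

```python
from collections import Counter
from itertools import combinations
from math import log


def select_by_contribution(docs, top_k):
    """Keep the top_k words by a TF-IDF-style score (assumed contribution degree)."""
    n = len(docs)
    tf = Counter(w for d in docs for w in d)
    df = Counter(w for d in docs for w in set(d))
    score = {w: tf[w] * log((n + 1) / df[w]) for w in tf}
    ranked = sorted(score, key=lambda w: (-score[w], w))
    return set(ranked[:top_k])


def correlated_pairs(docs, vocab, min_corr):
    """Build binary word pairs whose Jaccard co-occurrence (assumed measure) >= min_corr."""
    df, co = Counter(), Counter()
    for d in docs:
        present = sorted(set(d) & vocab)
        df.update(present)
        co.update(combinations(present, 2))
    pairs = {}
    for (a, b), c in co.items():
        corr = c / (df[a] + df[b] - c)
        if corr >= min_corr:
            pairs[(a, b)] = corr
    return pairs


def extend_features(doc, pairs, senses, similarity, min_sim):
    """Append senses of correlated pairs whose semantic similarity >= min_sim."""
    extended = list(doc)
    present = set(doc)
    for a, b in pairs:
        if a not in present and b not in present:
            continue  # the pair is unrelated to this short text
        for sa in senses.get(a, []):
            for sb in senses.get(b, []):
                if similarity(sa, sb) >= min_sim:
                    for s in (sa, sb):
                        if s not in extended:
                            extended.append(s)
    return extended


# Toy data; the sense inventory and scores are hypothetical stand-ins for HowNet.
docs = [
    ["apple", "phone", "screen"],
    ["apple", "phone", "battery"],
    ["apple", "fruit", "juice"],
    ["phone", "battery", "charger"],
]
senses = {
    "apple": ["fruit_apple", "apple_brand"],
    "phone": ["telephone_device"],
    "battery": ["power_cell"],
}
TOY_SIM = {
    frozenset(("telephone_device", "apple_brand")): 0.8,
    frozenset(("telephone_device", "power_cell")): 0.7,
    frozenset(("telephone_device", "fruit_apple")): 0.1,
}


def toy_similarity(s1, s2):
    return TOY_SIM.get(frozenset((s1, s2)), 0.0)


top_word = select_by_contribution(docs, top_k=1)
pairs = correlated_pairs(docs, {"apple", "phone", "battery"}, min_corr=0.5)
extended = extend_features(["phone", "screen"], pairs, senses, toy_similarity, min_sim=0.6)
print(top_word, pairs, extended)
```

In this sketch the semantic step acts as a filter on top of the statistical step: a pair must first co-occur often enough, and only then are its senses compared, so a frequent but semantically distant sense (here `fruit_apple`) is not added to the short text.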

CLC number: TP18

Cite this article:

LI Xiaohong, CAO Lin, SU Yun, MA Huifang. Feature Extension Algorithm Fusing Statistical Information and Semantic Similarity[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2017.06.028.

http://www.ecice06.com/CN/Y2017/V43/I6/177

Editor: JIN Hukao

