Document Representation Fused with Term Contribution and Word2Vec Word Vector

doi:10.19678/j.issn.1000-3428.0056370

Abstract

Abstract: Existing document vector representation methods are affected by noise words and capture the semantics of important words incompletely. To address these problems, this paper proposes a new document representation method that fuses Term Contribution (TC) with Word2Vec word vectors. A Word2Vec model is trained on the dataset, and the TC of each word in the dataset is calculated. A contribution threshold is then set, and the words whose TC exceeds the threshold are extracted to construct a word set. On this basis, the words that occur in both the document and the word set are extracted, and their word vectors are fused with their TC values to generate the document vector. Experimental results show that the average accuracy, recall and F1 score of the proposed method on the Sogou Chinese text corpus and the Fudan University Chinese text classification corpus are better than those of traditional methods such as TF-IDF, mean Word2Vec and PTF-IDF weighted Word2Vec. The method can also effectively classify English texts.

Key words: Term Contribution (TC), Word2Vec word vector, word embedding, document representation, text classification
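
The pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the TC formula or the fusion rule, so everything specific here is an assumption: TC is computed as the sum over document pairs (i ≠ j) of the product of a term's TF-IDF weights, a definition commonly used in text clustering; the fusion is taken to be a TC-weighted average of word vectors; and the names `term_contribution`, `document_vector`, the toy corpus, the hand-written 3-dimensional vectors and the threshold value are all hypothetical. In practice the vectors would come from a trained Word2Vec model rather than a hard-coded dictionary.

```python
import math
from collections import Counter

import numpy as np


def term_contribution(docs):
    """Assumed TC: sum over ordered document pairs i != j of w(t, d_i) * w(t, d_j),
    where w is a plain TF-IDF weight (tf = count / doc length, idf = ln(N / df))."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    weights = {}
    for doc in docs:
        for t, c in Counter(doc).items():
            w = (c / len(doc)) * math.log(n / df[t])
            weights.setdefault(t, []).append(w)
    tc = {}
    for t, ws in weights.items():
        total = sum(ws)
        # sum_i w_i * (total - w_i) equals the sum of w_i * w_j over all i != j
        tc[t] = sum(w * (total - w) for w in ws)
    return tc


def document_vector(doc, vectors, tc, threshold):
    """Assumed fusion: TC-weighted average of the vectors of the words in `doc`
    whose TC exceeds `threshold` (a non-negative threshold is assumed)."""
    dim = len(next(iter(vectors.values())))
    kept = [t for t in doc if t in vectors and tc.get(t, 0.0) > threshold]
    if not kept:
        return np.zeros(dim)
    w = np.array([tc[t] for t in kept])
    v = np.array([vectors[t] for t in kept], dtype=float)
    return (w[:, None] * v).sum(axis=0) / w.sum()


# Toy tokenized corpus (hypothetical data, for illustration only).
docs = [
    ["the", "stock", "market", "rose"],
    ["the", "stock", "price", "fell"],
    ["the", "team", "won", "match"],
    ["the", "team", "lost", "match"],
]

# Stand-in for trained Word2Vec vectors (hypothetical 3-dimensional values).
vectors = {
    "the": [1.0, 1.0, 1.0],
    "stock": [1.0, 0.0, 0.0],
    "market": [0.9, 0.1, 0.0],
    "team": [0.0, 1.0, 0.0],
    "match": [0.0, 0.8, 0.2],
}

tc = term_contribution(docs)
threshold = 0.01  # hypothetical contribution threshold
doc_vectors = [document_vector(d, vectors, tc, threshold) for d in docs]

for d, vec in zip(docs, doc_vectors):
    print(" ".join(d), "->", vec)
```

In this toy setting, a word that occurs in every document ("the") has zero IDF and therefore zero TC, and a word that occurs in only one document has no document pair to contribute to, so both fall below the threshold and are treated as noise. Only words shared by some, but not all, documents shape the document vector.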


CLC Number: TP391

PENG Junli, GU Yu, ZHANG Zhen, GENG Xiaohang. Document Representation Fused with Term Contribution and Word2Vec Word Vector[J]. Computer Engineering, 2021, 47(4): 62-67.



URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0056370

http://www.ecice06.com/EN/Y2021/V47/I4/62


