Computer Engineering ›› 2021, Vol. 47 ›› Issue (4): 62-67. doi: 10.19678/j.issn.1000-3428.0056370

• Artificial Intelligence and Pattern Recognition •

Document Representation Fused with Term Contribution and Word2Vec Word Vector

PENG Junli, GU Yu, ZHANG Zhen, GENG Xiaohang

  1. National Defense Key Discipline Laboratory of Communication Information Transmission and Fusion Technology, Hangzhou Dianzi University, Hangzhou 310000, China
  • Received: 2019-10-22 Revised: 2020-01-02 Published: 2020-03-31
  • About the authors: PENG Junli (born 1993), male, master's student, whose main research interests are machine learning and natural language processing; GU Yu (corresponding author), associate professor, Ph.D.; ZHANG Zhen and GENG Xiaohang, master's students.
  • Funding:
    National Natural Science Foundation of China (61673146).

Abstract: Existing document vector representation methods are affected by noise words and capture the semantics of important words incompletely. To address these problems, this paper proposes a new document representation method that fuses Term Contribution (TC) with Word2Vec word vectors. A Word2Vec model is trained on the dataset, and the TC of each word in the dataset is calculated. A contribution threshold is then set, and the words whose TC exceeds the threshold are extracted to construct a word set. On this basis, the words that appear both in a document and in the word set are identified, and their word vectors are fused with their TC values to generate the document vector. Experimental results show that on the Sogou Chinese text corpus and the Fudan University Chinese text classification corpus, the average accuracy, recall, and F1 score of the proposed method are better than those of traditional methods such as TF-IDF, mean Word2Vec, and PTF-IDF weighted Word2Vec. The method can also classify English texts effectively.

Key words: Term Contribution (TC), Word2Vec word vector, word embedding, document representation, text classification
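
The pipeline summarized in the abstract (threshold the contribution scores, keep the words shared by the document and the word set, fuse their vectors with their contributions) can be sketched as follows. This is only an illustration under stated assumptions: the abstract gives neither the TC formula nor the fusion rule, so the TC scores are taken as precomputed inputs and the fusion is assumed to be a contribution-weighted average. The function names, the toy two-dimensional vectors (standing in for trained Word2Vec embeddings), and the example scores are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def build_word_set(contribution: Dict[str, float], threshold: float) -> Set[str]:
    """Keep only the words whose contribution is greater than the threshold."""
    return {word for word, tc in contribution.items() if tc > threshold}


def document_vector(
    tokens: Sequence[str],
    word_vectors: Dict[str, List[float]],
    contribution: Dict[str, float],
    threshold: float,
    dim: int,
) -> List[float]:
    """Build a document vector from the words shared by the document and the word set.

    Assumption: fusion is a contribution-weighted average of the word vectors.
    The source abstract does not specify the actual fusion rule.
    """
    word_set = build_word_set(contribution, threshold)
    # Words present in the document, in the word set, and in the embedding table.
    shared = [w for w in tokens if w in word_set and w in word_vectors]
    total = sum(contribution[w] for w in shared)
    if not shared or total == 0:
        # No usable word: fall back to a zero vector.
        return [0.0] * dim

    vector = [0.0] * dim
    for word in shared:
        weight = contribution[word] / total
        for i, component in enumerate(word_vectors[word]):
            vector[i] += weight * component
    return vector


# Toy inputs (hypothetical): stand-ins for trained Word2Vec vectors and TC scores.
toy_vectors = {"football": [1.0, 0.0], "match": [0.0, 1.0], "the": [0.5, 0.5]}
toy_contribution = {"football": 3.0, "match": 1.0, "the": 0.1}

doc = ["the", "football", "match"]
print(document_vector(doc, toy_vectors, toy_contribution, threshold=0.5, dim=2))
```

In this toy setting the low-contribution word "the" falls below the threshold and is ignored, which mirrors the abstract's stated goal of reducing the influence of noise words.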

CLC Number: