Computer Engineering, 2019, Vol. 45, Issue (3): 309-314. doi: 10.19678/j.issn.1000-3428.0050407

• Development Research and Engineering Application •

Natural Language Steganalysis Method Based on Word2vec

YU Jingmin a,b, XIANG Lingyun a,b,c, ZENG Daojian a,b

  1. a. Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation; b. School of Computer and Communication Engineering; c. Hunan Provincial Key Laboratory of Smart Roadway and Cooperative Vehicle-Infrastructure Systems, Changsha University of Science and Technology, Changsha 410114, China
  • Received: 2018-02-05 Online: 2019-03-15 Published: 2019-03-15
  • About the authors: YU Jingmin (born 1993), male, master's student, research interests: steganalysis and natural language processing; XIANG Lingyun and ZENG Daojian, lecturers, Ph.D.
  • Funding: National Natural Science Foundation of China (61202439, 61602059); Key Scientific Research Project of the Hunan Provincial Education Department (16A008).

Abstract:

To represent the semantic information of text content numerically and to improve the accuracy of detecting stego texts generated by synonym substitution, a novel natural language steganalysis method is proposed. Word2vec is trained on a large-scale corpus to obtain multi-dimensional word vectors that contain rich semantic information. The cosine distance between the word vectors of a synonym and its context words is then used to measure the correlation between two words, and the fitness of a synonym in a specific context is calculated. Based on the effect that synonym substitutions in the embedding process have on the context fitness of synonyms, detection features are extracted to form a feature vector, and a Bayesian classification model is trained on the feature vectors to identify stego texts. Experimental results show that the proposed method has good detection performance: its average detection precision and average recall for stego texts with different embedding rates reach 97.71% and 92.64%, respectively.
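
The context-fitness idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact fitness formula, so the code below makes two assumptions: relatedness is scored with cosine similarity (the complement of cosine distance), and fitness is the mean similarity between a candidate synonym and its context words. The function names and the toy three-dimensional vectors are hypothetical; in practice the vectors would come from a Word2vec model trained on a large corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def context_fitness(word, context, vectors):
    """Mean cosine similarity between `word` and the context words.

    Averaging is an assumption made for illustration; the paper's exact
    aggregation is not stated in the abstract. Words without a vector
    are skipped.
    """
    if word not in vectors:
        return 0.0
    sims = [cosine_similarity(vectors[word], vectors[c])
            for c in context if c in vectors]
    return sum(sims) / len(sims) if sims else 0.0

def rank_synonyms(synonyms, context, vectors):
    """Return the synonyms sorted from best to worst fit for the context."""
    return sorted(synonyms,
                  key=lambda w: context_fitness(w, context, vectors),
                  reverse=True)

# Hypothetical toy vectors standing in for trained Word2vec embeddings.
TOY_VECTORS = {
    "big":    [0.90, 0.10, 0.00],
    "large":  [0.80, 0.20, 0.10],
    "grand":  [0.10, 0.90, 0.20],
    "house":  [0.85, 0.15, 0.05],
    "garden": [0.70, 0.30, 0.10],
}

if __name__ == "__main__":
    context = ["house", "garden"]
    for word in rank_synonyms(["big", "large", "grand"], context, TOY_VECTORS):
        print(word, round(context_fitness(word, context, TOY_VECTORS), 4))
```

A steganalyzer in this spirit would compute such fitness scores for every synonym position in a text and summarize them as a feature vector for a classifier. A word that fits its context noticeably worse than an alternative from the same synonym set is a hint that a substitution may have occurred.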

Key words: natural language, word vector, synonym substitution, steganalysis, context fitness

CLC number: