
Computer Engineering


A Domain Feature Word Vector Description Method for Military Texts

QIN Jie, CAO Lei, PENG Hui, LAI Jun

  1. (College of Command Information System, PLA University of Science and Technology, Nanjing 210007, China)
  • Received: 2015-06-18 Online: 2016-08-15 Published: 2016-08-15

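  • About the authors: QIN Jie (born 1990), male, master's student, research interest: text recommendation; CAO Lei (corresponding author), professor; PENG Hui, lecturer, Ph.D.; LAI Jun, lecturer, M.S.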

Abstract:

Military texts contain many named entities, and their feature words are strongly domain-specific. To address these characteristics, this paper proposes a vector description method for domain feature words. The method compresses the vector space by optimizing word segmentation and domain feature word selection. It refines the extraction rules for four important types of named entity (time, place name, troop name, and weapon equipment) and extends the word segmentation dictionary. It also improves the domain feature word selection algorithm, which combines domain relevance and domain consistency, to widen the gap between domain feature words and common words and to further filter the domain feature words. Experimental results show that, after word segmentation is optimized, the method can extract the named entities and some specialized vocabulary in military texts and reduce the number of feature words. The improved domain feature word selection algorithm raises accuracy and recall by 20% and 16.7%, respectively. The feature word vectors generated by the proposed method are strongly domain-specific.

Key words: military text, named entity, vector space, word segmentation, domain feature word
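
The abstract names domain relevance and domain consistency but does not give their formulas or the authors' improvements. As a rough illustration only, the sketch below assumes one commonly used formulation: relevance as the term's probability in the target domain divided by its highest probability in any domain, and consistency as the normalized entropy of the term's distribution across the target domain's documents. The function names, the weight `alpha`, the `threshold`, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_probabilities(docs):
    """Relative frequency of each term over all documents of one domain."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def domain_relevance(term, target, probs_by_domain):
    """P(term | target) divided by the highest P(term | any domain)."""
    peak = max(p.get(term, 0.0) for p in probs_by_domain.values())
    if peak == 0:
        return 0.0
    return probs_by_domain[target].get(term, 0.0) / peak


def domain_consistency(term, docs):
    """Normalized entropy (0..1) of the term's spread over the domain's documents."""
    freqs = [doc.count(term) for doc in docs]
    total = sum(freqs)
    if total == 0 or len(docs) < 2:
        return 0.0
    entropy = -sum((f / total) * math.log(f / total) for f in freqs if f > 0)
    return entropy / math.log(len(docs))


def select_feature_words(target, corpora, alpha=0.7, threshold=0.8):
    """Keep terms whose weighted relevance/consistency score reaches the threshold.

    alpha and threshold are illustrative values, not values from the paper.
    """
    probs_by_domain = {name: term_probabilities(docs) for name, docs in corpora.items()}
    docs = corpora[target]
    vocab = {term for doc in docs for term in doc}
    selected = []
    for term in vocab:
        score = (alpha * domain_relevance(term, target, probs_by_domain)
                 + (1 - alpha) * domain_consistency(term, docs))
        if score >= threshold:
            selected.append(term)
    return sorted(selected)


# Toy, already-segmented corpus (hypothetical data for illustration).
corpora = {
    "military": [
        ["brigade", "deploys", "missile", "today"],
        ["missile", "brigade", "exercise", "today"],
    ],
    "general": [
        ["today", "market", "today"],
        ["weather", "today"],
    ],
}

print(select_feature_words("military", corpora))
```

In this toy setup a word such as "today" scores low on relevance because it is more frequent outside the target domain, while a word that appears in only one military document scores zero on consistency. Combining the two scores is what separates domain feature words from common words.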

