Computer Engineering

Research on Sensitive Topic Detection Model Based on Conditional Random Fields

ZHAI Dong-hai 1,2, CUI Jing-jing 1, NIE Hong-yu 1, YU Lei 1, DU Jia 2

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China; 2. Engineering School, Tibet University, Lhasa 850000, China
  • Received: 2013-07-15 Online: 2014-08-15 Published: 2014-08-15

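  • About the authors: ZHAI Dong-hai (born 1974), male, associate professor, Ph.D.; main research interests: massive data mining and digital image processing. CUI Jing-jing, Ph.D. candidate. NIE Hong-yu and YU Lei, master's students. DU Jia, bachelor's degree.
  • Supported by:
    Scientific Research Planning Fund of the State Language Commission for the 12th Five-Year Plan (YB125-49); Key Project of the Science and Technology Research Fund of the Ministry of Education (212167); Science and Technology Innovation Fund under the Fundamental Research Funds for the Central Universities (SWJTU12CX096); National Undergraduate Innovation and Entrepreneurship Training Program (201210694017).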

Abstract: Sensitive topics usually carry an attitudinal tendency and some prior knowledge, and how to use this prior knowledge effectively to judge the sensitivity of online text is both a difficulty and a research focus in sensitive topic detection. Exploiting the strong knowledge-fitting capability of Conditional Random Fields (CRFs), this paper proposes a CRF-based sensitive topic detection model. Feature items are extracted and combined with a sensitive-term lexicon, so that the document under test and the sensitive topic categories are represented as the observation sequence and the state sequence of a CRF, respectively. Feature functions are constructed from the prior knowledge in the sensitive topic categories, and they link the observation sequence to the state sequence. The Viterbi algorithm is used to estimate the credibility of the observation sequence, and the feature items of the document under test are labeled for sensitivity, as items of the sensitive topic categories, according to the estimated probabilities. Experimental results show that the approach achieves good precision, recall and F-measure.

Key words: sensitive topic detection, Conditional Random Fields (CRFs), feature function, feature item, Viterbi algorithm, sensitivity labeling
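
The labeling step described in the abstract can be illustrated with a minimal sketch of Viterbi decoding over a linear-chain model. This is not the authors' implementation: the label set, the lexicon entries, the two feature functions and their weights are all hypothetical values chosen for illustration, and the scores are unnormalized rather than the probabilities a trained CRF would yield.

```python
from typing import List, Tuple

# Hypothetical label set and sensitive-term lexicon (illustration only).
LABELS = ["SENSITIVE", "NEUTRAL"]
SENSITIVE_LEXICON = {"riot", "bomb"}


def state_score(token: str, label: str) -> float:
    """Toy state feature: reward labels that agree with lexicon membership."""
    in_lexicon = token in SENSITIVE_LEXICON
    if in_lexicon and label == "SENSITIVE":
        return 2.0
    if not in_lexicon and label == "NEUTRAL":
        return 1.0
    return 0.0


def transition_score(prev_label: str, cur_label: str) -> float:
    """Toy transition feature: mild preference for keeping the same label."""
    return 0.5 if prev_label == cur_label else 0.0


def viterbi(tokens: List[str]) -> Tuple[List[str], float]:
    """Return the highest-scoring label sequence and its unnormalized score."""
    if not tokens:
        return [], 0.0
    # best[t][label] = best score of any label path ending in `label` at step t
    best = [{lab: state_score(tokens[0], lab) for lab in LABELS}]
    back = []  # back[t-1][label] = best predecessor of `label` at step t
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + transition_score(p, cur))
            scores[cur] = best[-1][prev] + transition_score(prev, cur) + state_score(tok, cur)
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    last = max(LABELS, key=lambda lab: best[-1][lab])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


if __name__ == "__main__":
    labels, score = viterbi(["city", "riot", "bomb", "news"])
    print(labels, score)
```

In a trained CRF the feature weights would be learned from labeled data, and the path score would be divided by the partition function to obtain a probability; the sketch fixes the weights by hand only to show how the observation sequence (feature items) is mapped to a state sequence (sensitivity labels).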
