
Computer Engineering

• Artificial Intelligence and Recognition Technology •

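  • About the authors: ZHOU Jian-feng (b. 1986), male, assistant librarian and master's student; main research interests: machine learning, text analysis, and data mining. YANG Ai-min, professor, postdoctoral researcher. ZHOU Yong-mei, associate professor, master's degree. WANG Xuan-xuan, master's degree.
  • Supported by:
    National Social Science Foundation of China (12BYY045); Ministry of Education Humanities and Social Sciences Research Youth Fund (10YJCZH247); Ministry of Education Humanities and Social Sciences Fund, General Project (09YJCZH019); Ministry of Education Program for New Century Excellent Talents (NCET-12-0939); Guangdong Provincial Science and Technology Plan (2010B031000014); Guangdong University of Foreign Studies University-level Fund (12Q22); Guangdong University of Foreign Studies Graduate Research Innovation Fund.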

Micro-blog Sentimental Feature Selection Based on Bigram Collocation

ZHOU Jian-feng a, YANG Ai-min b, ZHOU Yong-mei b, WANG Xuan-xuan c   

  1. (a. Library; b. Cisco School of Informatics; c. Faculty of European Languages & Cultures, Guangdong University of Foreign Studies, Guangzhou 510006, China)
  • Received: 2013-04-28 Online: 2014-06-15 Published: 2014-06-13

Abstract: Analyzing and monitoring the sentiment information in micro-blog texts helps mine user behavior and provides a reference for supervising public opinion on micro-blogs. However, micro-blog texts are short and informal, and they contain many variant spellings and newly coined words, so classifying them with sentiment words alone as features yields low accuracy and falls short of practical needs. To address this, a bigram collocation dictionary is built from a micro-blog corpus, and PMI-IR-P, a method for computing the sentiment weight of a collocation, is proposed by combining the PMI-IR algorithm with corpus statistics. Together with a sentiment dictionary, a statistical method generates the micro-blog sentiment feature vector, and the C4.5 algorithm is used to build a classification model that classifies the sentiment polarity of micro-blog texts. Separate data sets are used to build the collocation dictionary and the classification model, and the approach is compared with a sentiment-dictionary-based classification method and with the Naive Bayes method. Experimental results show that, with the proposed sentiment features, the C4.5 classifier reaches an accuracy of 87% on micro-blog sentiment classification.
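The abstract does not give the PMI-IR-P formula, so the following is only a minimal sketch of the standard PMI-IR-style orientation score that it builds on: the log ratio of a collocation's co-occurrence with a positive seed word to its co-occurrence with a negative seed word, computed here from local corpus counts rather than search-engine hits. The seed words, the toy posts, and the function names are all hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import math

# Hypothetical seed words and toy, pre-tokenized posts (illustration only).
POSITIVE_SEED = "good"
NEGATIVE_SEED = "bad"
POSTS = [
    ["battery", "life", "good"],
    ["battery", "life", "good", "screen"],
    ["battery", "life", "bad"],
    ["slow", "delivery", "bad"],
    ["slow", "delivery", "bad", "service"],
    ["screen", "good"],
]


def contains_bigram(tokens, bigram):
    """Return True if the two words of `bigram` appear adjacently in `tokens`."""
    return any((a, b) == bigram for a, b in zip(tokens, tokens[1:]))


def sentiment_orientation(bigram, posts, smoothing=0.01):
    """PMI(bigram, positive seed) - PMI(bigram, negative seed), in bits.

    Counts are per-post co-occurrences. `smoothing` is added to the joint
    counts so that a collocation never seen with one seed does not produce
    log(0) or a division by zero.
    """
    hits_pos = sum(POSITIVE_SEED in p for p in posts)
    hits_neg = sum(NEGATIVE_SEED in p for p in posts)
    if hits_pos == 0 or hits_neg == 0:
        raise ValueError("both seed words must occur in the corpus")
    with_pos = sum(contains_bigram(p, bigram) and POSITIVE_SEED in p for p in posts)
    with_neg = sum(contains_bigram(p, bigram) and NEGATIVE_SEED in p for p in posts)
    return math.log2(
        ((with_pos + smoothing) * hits_neg) / ((with_neg + smoothing) * hits_pos)
    )


if __name__ == "__main__":
    for collocation in [("battery", "life"), ("slow", "delivery")]:
        print(collocation, round(sentiment_orientation(collocation, POSTS), 3))
```

A positive score suggests the collocation leans positive in this corpus and a negative score suggests the opposite. Scores of this kind could serve as the per-collocation weights that are aggregated into a feature vector for a classifier, although the exact weighting and aggregation used in the paper are not stated in the abstract.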

Key words: collocation dictionary, micro-blog sentimental feature, micro-blog sentimental classification, machine learning, C4.5 algorithm
