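Micro-blog Sentimental Feature Selection Based on Bigram Collocation

doi:10.3969/j.issn.1000-3428.2014.06.035

Computer Engineering

ZHOU Jian-feng^a, YANG Ai-min^b, ZHOU Yong-mei^b, WANG Xuan-xuan^c

(a. Library; b. Cisco School of Informatics; c. Faculty of European Languages & Cultures, Guangdong University of Foreign Studies, Guangzhou 510006, China)

Received: 2013-04-28 Online: 2014-06-15 Published: 2014-06-13
About the authors: ZHOU Jian-feng (born 1986), male, assistant librarian and master's student; main research interests: machine learning, text analysis, and data mining. YANG Ai-min, professor, postdoctoral researcher. ZHOU Yong-mei, associate professor, master's degree. WANG Xuan-xuan, master's degree.
Supported by:
National Social Science Foundation of China (12BYY045); Youth Fund for Humanities and Social Sciences Research of the Ministry of Education (10YJCZH247); General Project of the Humanities and Social Sciences Fund of the Ministry of Education (09YJCZH019); Program for New Century Excellent Talents of the Ministry of Education (NCET-12-0939); Science and Technology Planning Project of Guangdong Province (2010B031000014); Guangdong University of Foreign Studies university-level fund (12Q22); Guangdong University of Foreign Studies Graduate Research and Innovation Fund.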

Abstract: Analyzing and monitoring the sentiment information in micro-blog texts helps mine user behavior and provides a reference for supervising micro-blog public opinion. However, micro-blog texts are short and non-standard and contain many deformed words and new words, so classifying them with sentiment words as the only features yields low accuracy and is hard to use in practice. To address this, a bigram collocation lexicon is built from a micro-blog corpus, and PMI-IR-P, a method for computing the sentiment weight of a collocation, is proposed on the basis of the PMI-IR algorithm combined with corpus statistics. Together with a sentiment dictionary, statistical methods are used to generate a sentiment feature vector for each micro-blog, and the C4.5 algorithm is used to build a classification model that classifies the sentiment polarity of micro-blog texts. In the experiments, different data sets are used to build the collocation lexicon and the classification model, and the approach is compared with a sentiment-dictionary-based classification method and with a Naive Bayes classifier. The results show that, with the C4.5 algorithm, the proposed sentiment features reach an accuracy of 87% on micro-blog sentiment classification, a comparatively good result.
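The abstract does not spell out the PMI-IR-P formula, so the following is only a minimal sketch of the general idea it builds on: a Turney-style PMI semantic orientation score computed for bigram collocations from corpus co-occurrence counts. The seed word sets, the smoothing constant, and the function names `so_pmi` and `collocation_orientation` are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def so_pmi(co_pos, co_neg, n_pos, n_neg, smoothing=0.01):
    """Turney-style semantic orientation from co-occurrence counts.

    co_pos / co_neg: posts where the collocation co-occurs with a
    positive / negative seed word; n_pos / n_neg: posts containing any
    positive / negative seed word. Smoothing avoids division by zero.
    """
    numerator = (co_pos + smoothing) * (n_neg + smoothing)
    denominator = (co_neg + smoothing) * (n_pos + smoothing)
    return math.log2(numerator / denominator)


def bigrams(tokens):
    """Adjacent token pairs of one segmented post."""
    return list(zip(tokens, tokens[1:]))


def collocation_orientation(posts, pos_seeds, neg_seeds, smoothing=0.01):
    """Score every bigram in a corpus of tokenized posts (hypothetical helper)."""
    co_pos, co_neg = Counter(), Counter()
    n_pos = n_neg = 0
    for tokens in posts:
        has_pos = any(t in pos_seeds for t in tokens)
        has_neg = any(t in neg_seeds for t in tokens)
        n_pos += has_pos
        n_neg += has_neg
        for pair in set(bigrams(tokens)):  # count each bigram once per post
            co_pos[pair] += has_pos
            co_neg[pair] += has_neg
    return {
        pair: so_pmi(co_pos[pair], co_neg[pair], n_pos, n_neg, smoothing)
        for pair in set(co_pos) | set(co_neg)
    }


# Toy corpus of already-segmented posts (invented data, for illustration only).
posts = [
    ["service", "slow", "bad"],
    ["service", "slow", "awful"],
    ["food", "tasty", "good"],
]
scores = collocation_orientation(posts, {"good"}, {"bad", "awful"})
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

A positive score suggests the collocation tends to appear alongside positive seed words, a negative score the opposite. The paper's PMI-IR-P additionally draws on corpus statistics in a way the abstract does not describe, so this sketch should not be read as that method.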

Key words: collocation dictionary, micro-blog sentimental feature, micro-blog sentimental classification, machine learning, C4.5 algorithm
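C4.5 selects the attribute to split on by gain ratio, that is, information gain normalized by the entropy of the split itself. As a rough illustration of how a sentiment feature could be ranked that way, here is a small sketch for discrete features. The feature names `collocation_polarity` and `post_length` and the toy labels are hypothetical and do not come from the paper's data.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of one discrete feature with respect to the class labels."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A feature with a single value carries no split information.
    return info_gain / split_info if split_info > 0 else 0.0


# Invented toy data: four posts with a polarity label and two features.
labels = ["pos", "pos", "neg", "neg"]
collocation_polarity = ["pos", "pos", "neg", "neg"]
post_length = ["short", "long", "short", "long"]

print(gain_ratio(collocation_polarity, labels))
print(gain_ratio(post_length, labels))
```

In this toy data the collocation-based feature separates the two classes perfectly and gets a gain ratio of 1.0, while post length tells nothing about the label and gets 0.0, so a C4.5-style tree would split on the former first.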

CLC number: TP18

Cite this article: ZHOU Jian-feng, YANG Ai-min, ZHOU Yong-mei, WANG Xuan-xuan. Micro-blog Sentimental Feature Selection Based on Bigram Collocation[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2014.06.035.

http://www.ecice06.com/CN/Y2014/V40/I6/162

