Computer Engineering ›› 2017, Vol. 43 ›› Issue (12): 192-196, 202. doi: 10.3969/j.issn.1000-3428.2017.12.035

• Artificial Intelligence and Recognition Technology •

Text Sentiment Analysis Based on Hybrid Chi-square Statistic and Logistic Regression

LI Ping, DAI Yueming, WANG Yan

  1. (School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China)
  • Received: 2016-09-26 Online: 2017-12-15 Published: 2017-12-15
  • About the authors: LI Ping (born 1990), female, master's student, whose research interests are natural language processing, sentiment analysis and machine learning; DAI Yueming, associate professor; WANG Yan, professor.
  • Supported by:
    National Natural Science Foundation of China (61572238); Jiangsu Province Outstanding Youth Fund (BK20160001).

Abstract: In text sentiment analysis, feature extraction based on the Chi-square statistic (CHI) tends to ignore term frequency within individual documents, which lowers text classification accuracy. To address this, a feature extraction method based on a hybrid Chi-square statistic is proposed. By adding feature frequency, inverse document frequency and a negative-correlation indicator, the method selects feature words that are concentrated in one particular class and thereby reduces interference from negatively correlated features. Text sentiment is then classified with logistic regression trained by stochastic gradient descent, and the simulated annealing principle is used to select the step size adaptively, which addresses the difficulty of choosing a step size for stochastic gradient descent. Experimental results show that the proposed method achieves higher sentiment classification quality than feature extraction based on the Chi-square statistic alone.

Key words: Chi-square statistic (CHI), feature extraction, negative correlation, stochastic gradient descent, logistic regression, sentiment classification
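
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract gives neither the hybrid Chi-square formula nor the annealing rule, so the code shows only the standard (baseline) Chi-square score, a sign check that exposes negative correlation, and logistic regression trained by stochastic gradient descent with a geometrically shrinking step. The function names, the counts `a`, `b`, `c`, `d`, and the schedule `step0 * cooling ** epoch` are all assumptions made for illustration.

```python
import math
import random


def chi_square(a, b, c, d):
    """Standard Chi-square score of a term for one class (baseline CHI).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def is_negatively_correlated(a, b, c, d):
    """True when the term is rarer inside the class than outside (a*d < b*c).

    The square in chi_square hides this sign, so such terms can still score high.
    """
    return a * d - b * c < 0


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(samples, labels, epochs=50, step0=0.5, cooling=0.95, seed=0):
    """Logistic regression trained by SGD with a shrinking step size.

    Assumption: the step follows step0 * cooling**epoch, a cooling schedule
    borrowed from simulated annealing; the paper's adaptive rule may differ.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for epoch in range(epochs):
        step = step0 * cooling ** epoch
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - step * err * v for w, v in zip(weights, x)]
            bias -= step * err
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (positive) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5 else 0
```

In this sketch, a term that appears only inside the class (`chi_square(10, 0, 0, 10)`) and a term that appears only outside it (`chi_square(0, 10, 10, 0)`) receive the same score, which is the kind of negative-correlation interference the abstract says the hybrid statistic is meant to reduce.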
