
Computer Engineering

• Artificial Intelligence and Recognition Technology

Domain Question Classification Method Based on Topic Expansion

ZHANG Qing, Lü Zhao

  1. (Department of Computer Science and Technology, East China Normal University, Shanghai 200241, China)
  • Received: 2015-08-27 Online: 2016-09-15 Published: 2016-09-15
  • About the authors: ZHANG Qing (born 1989), female, master's student, whose research interests are big data analysis and knowledge processing; Lü Zhao (corresponding author), associate professor.
  • Funding:
    Scientific Research Program of the Science and Technology Commission of Shanghai Municipality (1451110700, 14511106803); Special Development Fund of the Shanghai Zhangjiang National Independent Innovation Demonstration Zone (201411-JA-B108-002).

Abstract: Domain question classification plays an important role in question answering (Q&A) systems, yet most current work on question classification targets open domains and little of it addresses specific domains. Domain questions are short and suffer from data sparseness. This paper therefore proposes a domain question classification method based on topic expansion, consisting of two main components: feature selection and feature expansion. Feature words are first selected from the question text with the chi-square statistic (CHI) and serve as the basis for feature expansion. A Latent Dirichlet Allocation (LDA) topic model is then applied to an external knowledge base (a universal dataset) to obtain the corresponding topic distribution. To avoid introducing noisy topics, topic entropy is used to identify high-quality topics. The words covered by the high-quality topics are added to the question text, and the expanded text is classified with a Support Vector Machine (SVM) classifier. Experimental results show that the proposed method classifies better than the traditional TF-IDF text classification method and can help improve the performance of Q&A systems.

Key words: domain question classification, data sparseness, feature selection, topic model, high-quality topic, feature expansion
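
The feature selection and feature expansion steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several points are assumptions: the topic-word distributions are given as toy dictionaries rather than learned by LDA from an external corpus, "topic entropy" is read as the Shannon entropy of each topic's word distribution (lower entropy treated as higher quality), and the function names, thresholds, and data are hypothetical. The final SVM classification step is omitted.

```python
import math

def chi_square(docs, labels, term, cls):
    """Chi-square statistic between a term and a class (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term, in_cls = term in tokens, label == cls
        if has_term and in_cls:
            a += 1          # term present, document in class
        elif has_term:
            b += 1          # term present, document not in class
        elif in_cls:
            c += 1          # term absent, document in class
        else:
            d += 1          # term absent, document not in class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, labels, top_k):
    """Keep the top_k terms ranked by their maximum chi-square score over classes."""
    vocab = {t for tokens in docs for t in tokens}
    classes = set(labels)
    score = {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}
    return set(sorted(score, key=lambda t: (-score[t], t))[:top_k])

def topic_entropy(word_probs):
    """Shannon entropy of a topic's word distribution (assumed quality measure)."""
    return -sum(p * math.log(p) for p in word_probs.values() if p > 0)

def expand(tokens, features, topics, max_entropy, words_per_topic):
    """Append top words of low-entropy topics that contain a selected feature word."""
    good_topics = [t for t in topics if topic_entropy(t) <= max_entropy]
    extra = []
    for tok in tokens:
        if tok not in features:
            continue
        for topic in good_topics:
            if tok not in topic:
                continue
            top_words = sorted(topic, key=lambda w: (-topic[w], w))[:words_per_topic]
            for w in top_words:
                if w not in tokens and w not in extra:
                    extra.append(w)
    return list(tokens) + extra

# Toy labeled questions (hypothetical data, already tokenized).
docs = [["fund", "fee"], ["fund", "rate"], ["loan", "rate"], ["loan", "term"]]
labels = ["fund", "fund", "loan", "loan"]
features = select_features(docs, labels, top_k=2)

# Toy topic-word distributions standing in for LDA output on an external corpus.
topics = [
    {"fund": 0.5, "nav": 0.3, "dividend": 0.2},             # focused topic
    {"fund": 0.25, "the": 0.25, "of": 0.25, "is": 0.25},    # flat, noisy topic
]
expanded = expand(["fund", "fee"], features, topics, max_entropy=1.2, words_per_topic=3)
print(features, expanded)
```

In this toy run the flat topic is discarded by the entropy threshold, so only words from the focused topic are appended to the question. The expanded token lists would then be vectorized and passed to a classifier such as an SVM.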

CLC Number: