
Computer Engineering ›› 2021, Vol. 47 ›› Issue (7): 59-66. doi: 10.19678/j.issn.1000-3428.0058372

• Artificial Intelligence and Pattern Recognition •

Multi-Source Text Topic Model Based on DMA and Feature Division

XU Weijia1,2, QIN Yongbin1,2, HUANG Ruizhang1,2, CHEN Yanping1,2   

  1. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China;
  2. State Key Laboratory of Public Big Data, Guiyang 550025, China
  • Received: 2020-05-19 Revised: 2020-07-04 Published: 2020-07-13

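  • About the authors: XU Weijia (born 1996), female, master's student, whose main research interests are data and text mining and machine learning; QIN Yongbin, professor and doctoral supervisor; HUANG Ruizhang and CHEN Yanping, associate professors, Ph.D.
  • Funding:
    Key Program of the Joint Funds of the National Natural Science Foundation of China (U1836205); Major Research Plan of the National Natural Science Foundation of China (91746116); Major Special Project of the Guizhou Provincial Department of Science and Technology (Qiankehe Major Special Project No. 2017-3002); Key Project of the Guizhou Provincial Science and Technology Foundation (Qiankehe Jichu No. 2020-1Z055).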

Abstract: Existing topic models discover topics poorly when mining multi-source text data sets. To address this problem, a multi-source text topic model based on Dirichlet Multinomial Allocation (DMA) and feature division is designed. Building on the DMA model, the proposed model relaxes the requirement that the number of topics be specified in advance, assigns a dedicated topic distribution parameter to each data source, and uses the Gibbs sampling algorithm to estimate the number of topics for each data source automatically. In addition, the model assigns a dedicated noise-word distribution parameter and topic-word distribution parameter to each data source. Feature division is used to separate the feature words of each data source from its noise words, and the word usage characteristics of each data source are learned, so that the noise word set does not interfere with clustering. Experimental results show that, compared with existing topic models, the proposed model preserves the word features unique to each data source and achieves better topic discovery performance and robustness.

Key words: multi-source text topic model, text clustering, Dirichlet Multinomial Allocation (DMA), feature division, Gibbs sampling
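
As a rough illustration of how Gibbs sampling over a DMA-style mixture can estimate the number of topics, the following minimal sketch runs collapsed Gibbs sampling on a single data source. It is not the model proposed in the paper: the per-source topic distribution parameters and the split into feature words and noise words are omitted, and the function name `dma_gibbs`, the parameters `max_topics`, `alpha`, `beta`, and the default values are all assumptions made for this example. The idea shown is only that `max_topics` acts as an upper bound, and that topics left without documents after sampling are effectively discarded.

```python
import math
import random
from collections import Counter


def dma_gibbs(docs, vocab_size, max_topics=10, alpha=1.0, beta=0.1,
              iters=50, seed=0):
    """Collapsed Gibbs sampling for a single-source Dirichlet multinomial
    mixture (illustrative only; hypothetical names and defaults).

    docs: list of documents, each a list of integer word ids
          in the range [0, vocab_size).
    Returns one topic label per document.
    """
    rng = random.Random(seed)
    z = [rng.randrange(max_topics) for _ in docs]   # random initial labels
    doc_counts = [Counter(d) for d in docs]
    m = [0] * max_topics                            # documents per topic
    n = [0] * max_topics                            # word tokens per topic
    nw = [[0] * vocab_size for _ in range(max_topics)]  # topic-word counts

    def update(d, k, sign):
        # Add (sign=+1) or remove (sign=-1) document d from topic k.
        m[k] += sign
        n[k] += sign * len(docs[d])
        for w, c in doc_counts[d].items():
            nw[k][w] += sign * c

    for d, k in enumerate(z):
        update(d, k, +1)

    for _ in range(iters):
        for d in range(len(docs)):
            update(d, z[d], -1)                     # take document d out
            log_p = []
            for k in range(max_topics):
                # Prior term: topic size plus symmetric Dirichlet mass.
                lp = math.log(m[k] + alpha / max_topics)
                # Likelihood term: Dirichlet-multinomial predictive.
                for w, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(nw[k][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(n[k] + vocab_size * beta + i)
                log_p.append(lp)
            top = max(log_p)                        # stabilise the exponent
            weights = [math.exp(lp - top) for lp in log_p]
            z[d] = rng.choices(range(max_topics), weights=weights)[0]
            update(d, z[d], +1)                     # put it back
    return z
```

Calling `len(set(dma_gibbs(docs, vocab_size)))` then gives the number of non-empty topics for that run, which is the quantity a DMA-based model estimates instead of taking as input. Extending the sketch to several data sources would require separate count tables and parameters per source, which the abstract describes but does not specify in detail.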

