Multi-Source Text Topic Model Based on DMA and Feature Division

doi:10.19678/j.issn.1000-3428.0058372

Abstract

Existing topic models perform poorly when mining information from multi-source text data sets. To address this, a multi-source text topic model based on Dirichlet Multinomial Allocation (DMA) and feature division is designed. Building on the DMA model, the proposed model relaxes the restriction on the number of topics supplied in advance, assigns a dedicated topic distribution parameter to each data source, and uses the Gibbs sampling algorithm to estimate the number of topics for each data source automatically. In addition, the model assigns each data source its own noise word distribution parameter and topic-word distribution parameter. A feature division method distinguishes the feature words from the noise words of each data source, and the word-usage features of each data source are learned so that the noise word set does not interfere with model clustering. Experimental results show that, compared with existing topic models, the proposed model preserves the word features unique to each data source and achieves better topic discovery performance and robustness.

Key words: multi-source text topic model, text clustering, Dirichlet Multinomial Allocation (DMA), feature division, Gibbs sampling
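The idea of estimating the number of topics through Gibbs sampling can be illustrated with a minimal sketch. It is not the authors' model: it assumes a single data source, assigns one topic per document, and omits the feature division into feature words and noise words. It also assumes the common finite DMA formulation in which the mixing weights have a symmetric Dirichlet prior with parameter `alpha / max_topics`, where `max_topics` is only an upper bound. The function name `gibbs_dma`, its parameters, and the toy corpus are hypothetical and chosen for illustration.

```python
import math
import random
from collections import Counter


def gibbs_dma(docs, vocab_size, max_topics=10, alpha=1.0, beta=0.1,
              iters=50, seed=0):
    """Collapsed Gibbs sampling for a finite Dirichlet-multinomial mixture.

    docs is a list of documents, each a list of integer word ids in
    [0, vocab_size). Returns one topic label per document. Simplified
    sketch: single source, no feature/noise word division.
    """
    rng = random.Random(seed)
    z = [rng.randrange(max_topics) for _ in docs]   # topic of each document
    m = [0] * max_topics                            # documents per topic
    n = [0] * max_topics                            # tokens per topic
    nw = [Counter() for _ in range(max_topics)]     # word counts per topic
    for d, k in zip(docs, z):
        m[k] += 1
        n[k] += len(d)
        nw[k].update(d)

    for _ in range(iters):
        for i, d in enumerate(docs):
            counts = Counter(d)
            # Remove document i from its current topic.
            k_old = z[i]
            m[k_old] -= 1
            n[k_old] -= len(d)
            nw[k_old].subtract(counts)

            # Log of the unnormalised conditional probability of each topic.
            log_p = []
            for k in range(max_topics):
                lp = math.log(m[k] + alpha / max_topics)
                for w, c in counts.items():
                    for j in range(c):
                        lp += math.log(nw[k][w] + beta + j)
                for j in range(len(d)):
                    lp -= math.log(n[k] + vocab_size * beta + j)
                log_p.append(lp)

            # Sample a new topic; subtracting the maximum avoids underflow.
            mx = max(log_p)
            weights = [math.exp(lp - mx) for lp in log_p]
            k_new = rng.choices(range(max_topics), weights=weights)[0]
            z[i] = k_new
            m[k_new] += 1
            n[k_new] += len(d)
            nw[k_new].update(counts)
    return z


# Toy corpus (hypothetical): two groups of documents with disjoint vocabularies.
demo_rng = random.Random(1)
docs = ([[demo_rng.randrange(0, 5) for _ in range(8)] for _ in range(10)]
        + [[demo_rng.randrange(5, 10) for _ in range(8)] for _ in range(10)])
labels = gibbs_dma(docs, vocab_size=10)
print("topics in use:", len(set(labels)))
```

In this sketch the estimated number of topics is simply the number of non-empty clusters left after sampling, so `max_topics` does not need to match the true number. A multi-source version along the lines of the abstract would keep separate topic, topic-word, and noise word parameters for each data source.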


CLC Number: TP391.1

XU Weijia (许伟佳), QIN Yongbin (秦永彬), HUANG Ruizhang (黄瑞章), CHEN Yanping (陈艳平). Multi-Source Text Topic Model Based on DMA and Feature Division[J]. Computer Engineering (计算机工程), 2021, 47(7): 59-66.


URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0058372

http://www.ecice06.com/EN/Y2021/V47/I7/59

Figures/Tables 10

