
Computer Engineering ›› 2022, Vol. 48 ›› Issue (11): 89-95. doi: 10.19678/j.issn.1000-3428.0062694

• Artificial Intelligence and Pattern Recognition •

Topic Model of Lifelong Machine Learning Based on Hellinger Distance and Word Vector

LEI Henglin, Gulanbaier Tuerhong, Mairidan Wushouer, ZENG Qi

  1. School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  • Received: 2021-09-15 Revised: 2021-12-27 Published: 2022-01-05
  • About the authors: LEI Henglin (born 1996), male, master's student, whose main research interests are topic mining and machine learning; Gulanbaier Tuerhong (corresponding author) and Mairidan Wushouer, associate professors; ZENG Qi, master's student.
  • Funding:
    Natural Science Foundation of the Autonomous Region (2021D01C118).

Abstract: Compared with conventional machine learning methods, lifelong machine learning can effectively use the knowledge accumulated in a knowledge base to improve performance on the current learning task. However, the classic Lifelong Topic Model (LTM) shows no preference among domains during domain selection and does not fully exploit the contextual information of target words when computing their similarity. From the standpoint of word and topic selection, this study proposes an improved model, HW-LTM, which uses the cosine similarity of Word2vec word vectors and the Hellinger distance between topics to find the words and domains with higher similarity. This yields better word and domain selection and more effective knowledge acquisition during iterative learning. Preloading a word-vector similarity matrix avoids repeatedly computing the cosine distance between word vectors, and using the Hellinger distance to compute topic similarity speeds up model convergence. Experiments on a JD product review dataset show that HW-LTM outperforms the baseline topic mining models; compared with the LTM model, it improves the topic coherence metric by 48 and reduces running time by 43.75%.

Key words: lifelong machine learning, topic model, Hellinger distance, word vector, domain selection
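
The two similarity measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `hellinger_distance` and `cosine_similarity_matrix` are hypothetical, the toy vectors stand in for trained Word2vec embeddings, and the topic-word distributions are invented for illustration. The sketch only shows the standard definitions and the idea of computing the word similarity matrix once so that later lookups do not repeat the cosine computation.

```python
import numpy as np


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities for all rows of `vectors`.

    Computing this once up front means each later word-pair lookup
    is a single array index rather than a fresh dot product.
    """
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Toy topic-word distributions over a 4-word vocabulary (illustrative only).
topic_a = [0.5, 0.5, 0.0, 0.0]
topic_b = [0.0, 0.0, 0.5, 0.5]
topic_c = [0.4, 0.4, 0.1, 0.1]

# Toy 2-dimensional "word vectors" (stand-ins for Word2vec embeddings).
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
similarity = cosine_similarity_matrix(word_vectors)

if __name__ == "__main__":
    print("H(a, b) =", hellinger_distance(topic_a, topic_b))
    print("H(a, c) =", hellinger_distance(topic_a, topic_c))
    print("cos(word 0, word 2) =", similarity[0, 2])
```

In this toy setting, a smaller Hellinger distance marks a prior topic as more similar to the current one, and the precomputed matrix is indexed by word position instead of recomputing cosine values in every iteration.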

Chinese Library Classification (CLC) number: