Computer Engineering, 2018, Vol. 44, Issue (9): 279-. doi: 10.19678/j.issn.1000-3428.0047862

• Development Research and Engineering Application •

Science and Technology Video Text Classification Based on Improved Labeled LDA Model

MA Jianhong (马建红), FAN Yuexiang (樊跃翔)

  1. School of Artificial Intelligence and Data Science, Hebei University of Technology, Tianjin 300401, China
  • Received: 2017-07-06 Online: 2018-09-15 Published: 2018-09-15
  • About the authors: MA Jianhong (b. 1965), female, professor, Ph.D.; main research interests are natural language processing and data mining. FAN Yuexiang, M.S.

Abstract:

When classifying video texts in the field of science and technology, technical terms that contribute strongly to classification are easily overlooked. To address the bias of the traditional Labeled Latent Dirichlet Allocation (LDA) model toward high-frequency words, this paper improves that model and establishes the MulCHI-Labeled LDA model for science and technology video texts. A domain termbase is built to highlight technical terms, and chi-square weighting and text position weighting algorithms are used to improve the quality of topic words. Experimental results show that, compared with the Labeled LDA model, the proposed model resolves the neglect of technical terms and effectively improves topic word quality and classification accuracy.

Key words: science and technology video, text classification, label, chi-square weighting, domain termbase
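
The abstract names chi-square weighting, text position weighting, and a domain termbase but does not give their formulas. As a rough illustration only, the sketch below computes the standard chi-square statistic between a term and a label from a 2x2 document-count table. The function names, the multiplicative combination in `combined_weight`, and the factor values `title_factor` and `term_factor` are assumptions made for this example, not the MulCHI-Labeled LDA formulation.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/label table.

    a: documents with the label that contain the term
    b: documents without the label that contain the term
    c: documents with the label that lack the term
    d: documents without the label that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or label never varies
    return n * (a * d - b * c) ** 2 / denom


def term_label_chi2(docs, term, label):
    """docs is a list of (set_of_tokens, label) pairs."""
    a = b = c = d = 0
    for tokens, doc_label in docs:
        has_term = term in tokens
        has_label = doc_label == label
        if has_term and has_label:
            a += 1
        elif has_term:
            b += 1
        elif has_label:
            c += 1
        else:
            d += 1
    return chi_square(a, b, c, d)


def combined_weight(chi2, in_title, in_termbase,
                    title_factor=2.0, term_factor=1.5):
    """Hypothetical combination: scale the chi-square score by a
    position factor and a termbase factor (both values are assumed)."""
    weight = chi2
    if in_title:
        weight *= title_factor
    if in_termbase:
        weight *= term_factor
    return weight
```

In this sketch a term that occurs only in documents of one label receives a high score, while a term spread evenly across labels scores zero, which is the property that lets a chi-square weight counteract a bias toward merely frequent words.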

CLC Number: