Research of Document Hierarchical Classification Based on Incremental Mode

doi:10.3969/j.issn.1000-3428.2014.01.044

Computer Engineering

GU Ping, LUO Zhi-heng, OUYANG Yuan-you

(College of Computer Science, Chongqing University, Chongqing 400044, China)

Received: 2012-12-17 Online: 2014-01-15 Published: 2014-01-13
About the authors: GU Ping (b. 1976), male, associate professor, Ph.D.; main research interests: machine learning and data mining. LUO Zhi-heng and OUYANG Yuan-you, master's students.
Funding:
Natural Science Foundation Project of the Chongqing Science and Technology Commission (CSTC2012jjA40002)

Abstract

Blocking and the adaptive adjustment of classifiers are two key issues that affect the accuracy of hierarchical document classification. To address them, this paper proposes a hierarchical classification model based on category context features, together with an incremental learning algorithm. Following the taxonomy, the algorithm incrementally builds and maintains a category-specific context feature set for each decision node. A document is classified by evaluating its support in these feature sets to find the most probable classification path, from the root to a leaf category. Because incremental learning can leave the context feature sets incomplete, semantic similarity is incorporated into the path confidence computation to alleviate this problem. Experimental results show that, compared with hierarchical Bayes and hierarchical SVM models, the algorithm is adaptive and improves classification accuracy on the test document set by nearly 8%.

Key words: incremental learning, semantic concept, hierarchical classification, self-adaptive, degree of confidence
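
The abstract gives no formulas, so the following Python sketch only illustrates the general idea and is not the authors' algorithm. It rests on three assumptions: support is taken to be the share of a document's terms found in a node's feature set, path confidence is taken to be the product of the supports along the path, and semantic similarity comes from a hypothetical lookup table `similar`. All names (`Node`, `learn`, `support`, `classify`) are illustrative.

```python
from collections import defaultdict


class Node:
    """A taxonomy node with a context feature set that grows incrementally."""

    def __init__(self, name):
        self.name = name
        self.children = {}
        self.features = defaultdict(int)  # term -> number of times seen

    def child(self, name):
        # Create the child on first use so the tree can grow with new categories.
        if name not in self.children:
            self.children[name] = Node(name)
        return self.children[name]


def learn(root, path, terms):
    """Incremental update: add one labelled document along its category path."""
    node = root
    for label in path:
        node = node.child(label)
        for term in terms:
            node.features[term] += 1


def support(node, terms, similar):
    """Assumed support: share of document terms matched by the node's features.

    A term missing from the feature set gets partial credit through `similar`,
    a hypothetical table of term -> {related term: similarity in [0, 1]}.
    """
    if not terms:
        return 0.0
    total = 0.0
    for term in terms:
        if term in node.features:
            total += 1.0
        else:
            related = similar.get(term, {})
            total += max(
                (s for r, s in related.items() if r in node.features),
                default=0.0,
            )
    return total / len(terms)


def classify(root, terms, similar=None):
    """Score every root-to-leaf path and return (best_path, confidence)."""
    similar = similar or {}
    best = ([], 0.0)
    stack = [(root, [], 1.0)]
    while stack:
        node, path, conf = stack.pop()
        if not node.children and path:  # reached a leaf category
            if conf > best[1]:
                best = (path, conf)
            continue
        for child in node.children.values():
            score = conf * support(child, terms, similar)
            stack.append((child, path + [child.name], score))
    return best


root = Node("root")
learn(root, ["sports", "football"], ["goal", "league", "striker"])
learn(root, ["science", "physics"], ["quantum", "energy", "particle"])
path, conf = classify(root, ["goal", "match"], {"match": {"league": 0.5}})
print(path, conf)
```

In this toy run the unseen term "match" earns partial credit through its assumed similarity to "league", so the document is assigned to sports/football with confidence 0.5625. Because every path is scored before a decision is made, a weak score at an upper node lowers a path's confidence without discarding the path outright.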

CLC number: TP18

GU Ping, LUO Zhi-heng, OUYANG Yuan-you. Research of Document Hierarchical Classification Based on Incremental Mode[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2014.01.044.

http://www.ecice06.com/CN/Y2014/V40/I1/209

