Computer Engineering ›› 2010, Vol. 36 ›› Issue (20): 52-54. doi: 10.3969/j.issn.1000-3428.2010.20.018

• Software Technology and Database •

适用于非平衡数据的多关系多分类模型

杨鹤标,王 健   

  1. (江苏大学计算机科学与通信工程学院,江苏 镇江 212013)
  • 出版日期:2010-10-20 发布日期:2010-10-18
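  • About the authors: YANG He-biao (b. 1960), male, professor, main research interests: software engineering, data mining, and systems integration; WANG Jian, master's student
  • Funding:
    Jiangsu Province High-Technology Research Fund (BG2007028); Natural Science Foundation of the Jiangsu Higher Education Institutions (09KJB52003)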

Multi-relational Multi-class Model for Imbalanced Data

YANG He-biao, WANG Jian   

  1. (School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang 212013, China)
  • Online:2010-10-20 Published:2010-10-18

摘要: 针对多关系多分类的非平衡数据,提出一种分类模型。在预处理阶段,建立目标类纠错输出编码(ECOC)、目标关系与背景关系间的虚拟连接并完成属性聚集处理,进而划分训练集和验证集。在训练阶段,依据一对多划分思想,结合CrossMine算法构造多个子分类器,采用AUC法评估验证各子分类器。在验证阶段,比较目标类ECOC与各子分类器分类结果连接字的海明距离,选择最小海明距离的目标类为最终分类。经合成和真实数据的实验,验证了模型有效性及分类效果。

关键词: 多关系分类, 非平衡数据, 多类分类, 纠错输出编码, 一对多划分

Abstract: This paper proposes a classification model for multi-relational, multi-class imbalanced data. In the preprocessing stage, each target class is assigned an Error-Correcting Output Code (ECOC), virtual joins are established between the target relation and the background relations, and attributes are aggregated; the data are then divided into a training set and a validation set. In the training stage, multiple sub-classifiers are built following the One-vs-All decomposition in combination with the CrossMine algorithm, and each sub-classifier is evaluated by its AUC. In the validation stage, the codeword formed by concatenating the sub-classifiers' outputs is compared with the ECOC of each target class, and the class with the smallest Hamming distance is chosen as the final classification. Experiments on both synthetic and real datasets demonstrate the validity and classification performance of the model.
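The decoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`one_vs_all_codes`, `hamming`, `decode`), the class labels, and the tie-breaking rule (the first class in insertion order wins) are assumptions made for illustration, and the CrossMine sub-classifiers are replaced by a hand-written list of 0/1 outputs.

```python
from typing import Dict, List, Sequence


def one_vs_all_codes(classes: Sequence[str]) -> Dict[str, List[int]]:
    """Give each class a codeword with a single 1 (one-vs-all layout).

    Bit j of every codeword corresponds to sub-classifier j, which is
    assumed to separate class j from all other classes.
    """
    return {
        cls: [1 if i == j else 0 for j in range(len(classes))]
        for i, cls in enumerate(classes)
    }


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length codewords differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have equal length")
    return sum(x != y for x, y in zip(a, b))


def decode(word: Sequence[int], codes: Dict[str, List[int]]) -> str:
    """Return the class whose codeword is closest to the observed word.

    Ties go to the class listed first (an assumption of this sketch).
    """
    return min(codes, key=lambda cls: hamming(codes[cls], word))


# Hypothetical example: three target classes and the concatenated
# outputs of three sub-classifiers for one test tuple.
codes = one_vs_all_codes(["A", "B", "C"])
print(decode([0, 1, 0], codes))  # B
```

Here the observed word `[0, 1, 0]` matches the codeword of class `B` exactly, so its Hamming distance is 0 and `B` is returned.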

Key words: multi-relational classification, imbalanced data, multi-class classification, Error-Correcting Output Codes (ECOC), One-vs-All classification
