
Computer Engineering ›› 2008, Vol. 34 ›› Issue (21): 12-14,1. doi: 10.3969/j.issn.1000-3428.2008.21.005

• Doctoral Dissertation •

Detection Method of Deviated Group Data Based on Confident Interval

XIA Xiu-feng, XIE Guang-yu, SHI Xiang-bin, XU Lei

  1. (School of Computer, Shenyang Institute of Aeronautical Engineering, Shenyang 110136)
  • Online:2008-11-05 Published:2008-11-05

Abstract: Detecting and handling abnormal data is an active research topic in data cleansing for data warehouse systems. After analyzing current detection techniques, this paper proposes a confidence-interval-based method for detecting deviated group data (outliers). An effective sample is first screened out of the population; a genetic algorithm then finds a credible sample within the effective sample; the credible sample determines a confidence interval; and the population is finally checked and processed against that interval. The data handled by this method need not be time-dependent, and “dirty data” in large data volumes can be identified and detected quickly. Experimental results indicate that the method effectively detects deviated group data in irregular data, and it has achieved good results in practical applications.

Key words: dirty data, confidence interval, deviated group data, genetic algorithm
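
The four stages named in the abstract (effective sample, credible sample, confidence interval, detection) can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual rules, so everything specific here is an assumption made for illustration: the screening rule (drop missing and non-finite values), the fitness function (subset size divided by one plus the standard deviation), the genetic operators, and the interval itself, which is taken as mean ± z·std of the credible sample as a simple stand-in for the paper's confidence interval. All function names are hypothetical.

```python
import math
import random
import statistics


def effective_sample(population):
    """Stage 1 (assumed rule): keep only present, finite numeric values."""
    return [x for x in population if x is not None and math.isfinite(x)]


def fitness(mask, sample):
    """Assumed fitness: reward large subsets, penalise their spread."""
    chosen = [x for bit, x in zip(mask, sample) if bit]
    if len(chosen) < 2:
        return 0.0
    return len(chosen) / (1.0 + statistics.pstdev(chosen))


def credible_sample(sample, pop_size=40, generations=60,
                    mutation_rate=0.05, seed=0):
    """Stage 2: bit-mask genetic algorithm (truncation selection,
    one-point crossover, bit-flip mutation, elitism)."""
    n = len(sample)
    if n < 2:
        return list(sample)
    rng = random.Random(seed)
    masks = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        masks.sort(key=lambda m: fitness(m, sample), reverse=True)
        parents = masks[:pop_size // 2]
        children = [parents[0][:]]  # elitism: carry the best mask forward
        while len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        masks = children
    best = max(masks, key=lambda m: fitness(m, sample))
    return [x for bit, x in zip(best, sample) if bit]


def interval(credible, z=3.0):
    """Stage 3 (assumed form): mean +/- z * std of the credible sample."""
    m = statistics.mean(credible)
    s = statistics.pstdev(credible)
    return m - z * s, m + z * s


def detect(population, z=3.0, seed=0):
    """Stage 4: return the interval and the indices of suspect values."""
    lo, hi = interval(
        credible_sample(effective_sample(population), seed=seed), z)
    flagged = [i for i, x in enumerate(population)
               if x is None or not math.isfinite(x) or not lo <= x <= hi]
    return (lo, hi), flagged


# Hypothetical readings: a cluster near 10, two gross outliers, one gap.
readings = [10.1, 9.8, 10.3, 100.0, 9.9, 10.0, None,
            10.2, -50.0, 9.7, 10.1, 10.0, 9.9]
bounds, suspects = detect(readings)
print(bounds, suspects)
```

Nothing in this sketch uses the order of the readings, which matches the abstract's point that the data need not be time-dependent. How flagged values are then processed (dropped, replaced, or sent for review) is left open because the abstract does not say.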
