Computer Engineering ›› 2007, Vol. 33 ›› Issue (01): 50-52. doi: 10.3969/j.issn.1000-3428.2007.01.017

• Software Technology and Database •

Application of Position-coding in ETL of Data Warehouse

ZHANG Yong1,2, CHI Zhongxian1

  (1. Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024; 2. Department of Computer, Liaoning Normal University, Dalian 116029)
  • Online: 2007-01-05 Published: 2007-01-05

Abstract: To ensure the quality of data in a data warehouse, data cleaning must be performed before data mining. ETL is a crucial step in building a data warehouse, and data cleaning is part of it. Detecting and eliminating approximately duplicated records is one of the key problems to solve in data cleaning and data quality improvement. This paper introduces position-coding technology into data warehouse ETL, proposes an algorithm for detecting approximately duplicated records, and gives a method for dynamically determining match thresholds at different levels. Experiments, including a comparison with previous work, indicate that the proposed algorithm detects such records effectively.

Key words: Data cleaning, Position-coding, Data warehouse, ETL, Approximately duplicated records
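
To make the notion of approximately duplicated records concrete, the following is a minimal sketch of threshold-based duplicate detection. It is not the paper's position-coding algorithm, which the abstract does not describe in detail. As assumptions for illustration only, it uses Python's standard `difflib.SequenceMatcher` as a stand-in field similarity measure, hypothetical field weights, and a fixed record-level threshold rather than the dynamically determined thresholds mentioned above. The function names and sample records are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (stand-in measure)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(r1: Sequence[str], r2: Sequence[str],
                      weights: Sequence[float]) -> float:
    """Weighted average of field similarities; weights are assumed, not from the paper."""
    total = sum(weights)
    return sum(w * field_similarity(x, y)
               for w, x, y in zip(weights, r1, r2)) / total


def find_similar_duplicates(records: List[Sequence[str]],
                            weights: Sequence[float],
                            threshold: float) -> List[Tuple[int, int]]:
    """Return index pairs whose record similarity reaches the threshold.

    Compares every pair, so it is quadratic in the number of records;
    a real ETL step would normally narrow the candidates first.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample data: records 0 and 1 differ only in spacing and abbreviation.
records = [
    ("Zhang Yong", "Dalian University of Technology"),
    ("Zhang  Yong", "Dalian Univ. of Technology"),
    ("Li Ming", "Peking University"),
]
pairs = find_similar_duplicates(records, weights=(1.0, 1.0), threshold=0.8)
print(pairs)
```

In this sketch the threshold is a single hand-picked constant. The abstract indicates that the paper instead determines match thresholds dynamically at different levels, which this example does not attempt to reproduce.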