
Computer Engineering ›› 2007, Vol. 33 ›› Issue (01): 71-73. doi: 10.3969/j.issn.1000-3428.2007.01.024

• Software Technology and Database •

Digitization Algorithm Study of SEDS in Data Cleaning

XIA Jiaoxiong, XU Jun, WU Gengfeng

  1. (School of Computer Engineering and Science, Shanghai University, Shanghai 200072)
  • Received:1900-01-01 Revised:1900-01-01 Online:2007-01-05 Published:2007-01-05

Abstract: Finding "same entity from different sources" (SEDS) data has long been a difficult part of data cleaning when building a data warehouse. SEDS data arise when a single real-world entity is stored or represented differently in different data sources. Traditional data cleaning techniques spend a great deal of time and system resources on comparisons to find and correct such data, and the results are still unsatisfactory. This paper proposes a new algorithm that exploits the characteristics of digitized data storage to find SEDS data. The algorithm effectively reduces the number of comparisons between data items while preserving the quality of the cleaning result.

Key words: Same entity from different sources (SEDS), Digitization, Data cleaning
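
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea it describes: mapping each record to a number so that expensive record-to-record comparisons are limited to records sharing the same number. The function names `numeric_key` and `candidate_pairs`, and the choice of a code-point sum as the numeric form, are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def numeric_key(name: str) -> int:
    """Hypothetical digitization step (an assumption, not the paper's method).

    Folds a string into one integer that ignores case, punctuation,
    spacing, and token order by summing the code points of its
    lower-cased alphanumeric characters.
    """
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in name)
    return sum(ord(ch) for token in cleaned.split() for ch in token)


def candidate_pairs(records):
    """Return only the record pairs whose numeric keys are equal.

    Records with different keys are never compared, which is where the
    reduction in comparisons comes from.
    """
    buckets = defaultdict(list)
    for record in records:
        buckets[numeric_key(record)].append(record)

    pairs = []
    for group in buckets.values():
        pairs.extend(combinations(group, 2))
    return pairs


if __name__ == "__main__":
    records = [
        "Shanghai University",
        "university, SHANGHAI",
        "Fudan University",
        "Tongji University",
    ]
    for left, right in candidate_pairs(records):
        print(left, "<->", right)
```

With the four sample records, exhaustive matching would need six comparisons, while this sketch yields a single candidate pair. A code-point sum is a deliberately weak key (different strings can collide), so each candidate pair would still need a detailed check before two records are treated as the same entity.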