Duplicate Master Data Detection Algorithm Based on Credibility Model

doi:10.3969/j.issn.1000-3428.2014.05.007

Computer Engineering (计算机工程)

WANG Ji-kui^1,2,3, LI Shao-bo^1,2

(1. Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, China; 2. Key Laboratory of Advanced Manufacturing Technology, Ministry of Education, Guizhou University, Guiyang 550003, China; 3. College of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou 730020, China)

Received: 2013-04-02 Online: 2014-05-15 Published: 2014-05-14
About the authors: WANG Ji-kui (born 1978), male, associate professor and Ph.D. candidate; main research interests: data management, software process technology, and intelligent computing. LI Shao-bo, professor and doctoral supervisor.
Funding:
Supported by the National Key Technology R&D Program of China (2012BAF12B14).

Abstract

Abstract: Duplicate master data originating from multiple business systems degrades master data quality and hinders master data synchronization and master data mining. To address this, the paper proposes a duplicate master data detection algorithm, fastCdrDetection (Fast Cluster Duplicate Records Detection). From the perspective of data credibility, a master data credibility model is built that takes into account data source credibility, the time of the last update, and data length, and a credible-record generation algorithm is implemented on top of it. A non-recursive string similarity algorithm, FiledMatch, is designed to handle duplicates caused by shortened forms, abbreviations, and misspellings in Chinese input. The sourceKeys algorithm preprocesses duplicate records that come from the same business system and share the same business primary key, which improves the efficiency of duplicate detection. Experiments on more than 630,000 existing supplier records from the infrastructure materials of a power grid and more than 230,000 simulated records show that, compared with the PQS algorithm, fastCdrDetection raises the recall rate from 74% to 88% and the accuracy from 61% to 95%, which demonstrates the effectiveness of the algorithm.

Key words: multiple data source, duplicate master data, credibility model, detection algorithm, data credibility
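
The abstract names the inputs of the credibility model (data source credibility, last update time, data length) and the sourceKeys preprocessing step, but gives no formulas. The Python sketch below is one possible reading, not the authors' implementation. The weighted-sum form, the weights, the `SOURCE_CREDIBILITY` table, the `Record` fields, and the choice to keep the most credible record per (source, business key) group are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    source: str        # originating business system (hypothetical field)
    business_key: str  # primary key inside that system (hypothetical field)
    name: str          # attribute being compared, e.g. a supplier name
    updated: date      # time of the last update


# Hypothetical per-source credibility in [0, 1]; not taken from the paper.
SOURCE_CREDIBILITY = {"erp": 0.9, "procurement": 0.6}


def credibility(rec, newest, oldest, max_len, weights=(0.5, 0.3, 0.2)):
    """Assumed form: weighted sum of source credibility, recency and length."""
    w_src, w_time, w_len = weights
    span = (newest - oldest).days
    # Recency is scaled to [0, 1] within the group; a single date counts as 1.
    recency = 1.0 if span == 0 else (rec.updated - oldest).days / span
    length = len(rec.name) / max_len if max_len else 0.0
    return (w_src * SOURCE_CREDIBILITY.get(rec.source, 0.0)
            + w_time * recency
            + w_len * length)


def most_credible(records):
    """Return the record with the highest assumed credibility score."""
    newest = max(r.updated for r in records)
    oldest = min(r.updated for r in records)
    max_len = max(len(r.name) for r in records)
    return max(records, key=lambda r: credibility(r, newest, oldest, max_len))


def collapse_same_source_keys(records):
    """Keep one record per (source, business_key) pair before fuzzy matching."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.source, r.business_key)].append(r)
    return [most_credible(group) for group in groups.values()]
```

Collapsing exact (source, business key) duplicates first shrinks the set of records that the more expensive string-similarity comparison has to examine, which matches the efficiency motivation stated in the abstract.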

CLC number: TP311

WANG Ji-kui, LI Shao-bo. Duplicate Master Data Detection Algorithm Based on Credibility Model[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2014.05.007.

http://www.ecice06.com/CN/Y2014/V40/I5/31

Editor: LU Yan-fei

