
Computer Engineering

• Advanced Computing and Data Processing

Duplicate Master Data Detection Algorithm Based on Credibility Model

WANG Ji-kui 1,2,3, LI Shao-bo 1,2

  (1. Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, China; 2. Key Laboratory of Advanced Manufacturing Technology, Ministry of Education, Guizhou University, Guiyang 550003, China; 3. College of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou 730020, China)
  • Received: 2013-04-02 Published: 2014-05-15 Released: 2014-05-14
  • About the authors: WANG Ji-kui (born 1978), male, associate professor and Ph.D. candidate; main research interests: data management, software process technology, and intelligent computing. LI Shao-bo, professor and doctoral supervisor.
  • Funding:
    National Key Technology R&D Program of China (2012BAF12B14).

Abstract: Duplicate master data originating from multiple business systems degrades master data quality and hampers master data synchronization and master data mining. To address this problem, this paper proposes fastCdrDetection (Fast Cluster Duplicate Records Detection), a duplicate master data detection algorithm. Starting from the notion of data credibility, a master data credibility model is built that takes into account data source credibility, the time of the last update, and data length, and a credible record generation algorithm is implemented on top of it. A non-recursive string similarity algorithm, FiledMatch, is designed to handle duplicates caused by Chinese short forms, abbreviations, and misspellings. The sourceKeys algorithm preprocesses duplicate records that come from the same business system and share the same business primary key, which improves the efficiency of duplicate master data detection. Experiments were carried out on more than 630 thousand existing supplier records from the infrastructure materials data of a power grid and more than 230 thousand simulated records. The results show that, compared with the PQS algorithm, fastCdrDetection raises the recall rate from 74% to 88% and the accuracy from 61% to 95%, which demonstrates the effectiveness of the algorithm.

Key words: multiple data sources, duplicate master data, credibility model, detection algorithm, data credibility
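
The abstract names the ingredients of the method but gives no formulas, so the following Python sketch is only one possible reading of them, not the authors' implementation. It assumes a hypothetical `Record` type, interprets the sourceKeys step as keeping the latest record per (source system, business key) pair, uses plain iterative Levenshtein distance as a stand-in for FiledMatch, and combines source credibility, recency, and length with made-up weights (0.5, 0.3, 0.2). All function names and the `source_trust` table are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Record:
    source: str        # originating business system
    business_key: str  # primary key inside that system
    name: str          # compared field, e.g. a supplier name
    updated: date      # time of the last update


def dedupe_by_source_key(records: List[Record]) -> List[Record]:
    """Keep the latest record per (source, business_key) pair.

    Assumed reading of the sourceKeys preprocessing step.
    """
    latest: Dict[Tuple[str, str], Record] = {}
    for rec in records:
        key = (rec.source, rec.business_key)
        if key not in latest or rec.updated > latest[key].updated:
            latest[key] = rec
    return list(latest.values())


def similarity(a: str, b: str) -> float:
    """Iterative (non-recursive) edit-distance similarity in [0, 1].

    Plain Levenshtein distance, a stand-in for FiledMatch.
    """
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def credibility(rec: Record, cluster: List[Record],
                source_trust: Dict[str, float]) -> float:
    """Hypothetical score mixing source trust, recency, and length."""
    oldest = min(r.updated for r in cluster)
    newest = max(r.updated for r in cluster)
    span = (newest - oldest).days or 1          # avoid division by zero
    recency = (rec.updated - oldest).days / span
    longest = max(len(r.name) for r in cluster) or 1
    length = len(rec.name) / longest
    # Weights are illustrative assumptions, not values from the paper.
    return 0.5 * source_trust.get(rec.source, 0.0) + 0.3 * recency + 0.2 * length


def credible_record(cluster: List[Record],
                    source_trust: Dict[str, float]) -> Record:
    """Pick the cluster member with the highest credibility score."""
    return max(cluster, key=lambda r: credibility(r, cluster, source_trust))
```

In this sketch, records would first pass through `dedupe_by_source_key`, the survivors would be grouped into clusters wherever `similarity` exceeds some chosen threshold, and `credible_record` would then select one representative per cluster.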

CLC number: