Computer Engineering, 2008, Vol. 34, Issue (16): 16-18. doi: 10.3969/j.issn.1000-3428.2008.16.006

• Doctoral Dissertation •

Intelligence Data Cleaning Strategy for XML Database

LIU Bo¹, YANG Lu-ming¹, LEI Gang-yue², DENG Yun-long³

  1. College of Information, Central South University, Changsha 410083; 2. Hunan College of Information, Changsha 410200; 3. The 3rd Xiangya Hospital, Central South University, Changsha 410013
  • Online: 2008-08-20 Published: 2008-08-20

Abstract: To address quality problems in XML data, this paper builds a metadata model of the XML data cleaning process by introducing a Bayesian learning method and a Markov chain probability-transition strategy. Drawing on the idea of comprehensively cleaning similar duplicate records in structured data, it proposes a new method for intelligently cleaning XML data. Experimental results show that, compared with other methods, the proposed method is more automated and requires less manual involvement, and that it improves precision and recall by 2% to 5%.

Key words: XML database, data cleaning, Bayes formula, Markov chain
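
The abstract does not give the model's details, so the following is only a minimal sketch of how Bayes' formula could be used to score two XML records as similar duplicates. It is not the authors' method. The element names (`patient`, `name`, `birth`, `city`), the prior, the per-field probabilities, and the naive assumption that fields are independent are all hypothetical. The Markov chain transition strategy is not covered.

```python
import xml.etree.ElementTree as ET

def field_matches(a, b, fields):
    """For each field, report whether the two records agree after
    trimming whitespace and lowercasing (empty values never match)."""
    result = []
    for f in fields:
        ta = (a.findtext(f) or "").strip().lower()
        tb = (b.findtext(f) or "").strip().lower()
        result.append(ta != "" and ta == tb)
    return result

def duplicate_posterior(matches, prior, p_match_dup, p_match_nondup):
    """Bayes' formula: P(duplicate | field evidence), assuming the
    fields are conditionally independent (a simplifying assumption)."""
    like_dup = prior          # accumulates P(evidence | dup) * P(dup)
    like_non = 1.0 - prior    # accumulates P(evidence | non-dup) * P(non-dup)
    for m, pd, pn in zip(matches, p_match_dup, p_match_nondup):
        like_dup *= pd if m else (1.0 - pd)
        like_non *= pn if m else (1.0 - pn)
    return like_dup / (like_dup + like_non)

# Hypothetical sample data: two records that differ in case, spacing, and city.
xml_text = """<patients>
  <patient><name>Li Ming</name><birth>1970-01-02</birth><city>Changsha</city></patient>
  <patient><name>li ming </name><birth>1970-01-02</birth><city>Hunan</city></patient>
</patients>"""

root = ET.fromstring(xml_text)
r1, r2 = root.findall("patient")
fields = ["name", "birth", "city"]

matches = field_matches(r1, r2, fields)
posterior = duplicate_posterior(
    matches,
    prior=0.1,                           # assumed share of duplicate pairs
    p_match_dup=[0.9, 0.95, 0.8],        # assumed P(field agrees | duplicate)
    p_match_nondup=[0.05, 0.01, 0.3],    # assumed P(field agrees | not duplicate)
)
print(matches, round(posterior, 4))
```

With these assumed numbers, the name and birth date agree while the city does not, and the posterior comes out at about 0.98, so the pair would be flagged as a likely duplicate. In practice the probabilities would have to be learned from labeled record pairs rather than set by hand.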

CLC Number: