Computer Engineering (计算机工程) ›› 2009, Vol. 35 ›› Issue (3): 211-213. doi: 10.3969/j.issn.1000-3428.2009.03.071

• Artificial Intelligence and Recognition Technology •

Matching Approach for Similar Records Based on Active Learning

CHEN Bo1,2, WANG Yan-zhang1

  (1. Management College, Dalian University of Technology, Dalian 116023; 2. Credit Reference Center, The People’s Bank of China, Beijing 100140)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2009-02-05 Published: 2009-02-05

Abstract: To address the problems in current methods for matching similar records, this paper proposes an improved machine-learning matching method. The method clusters similar records and generates a committee of decision-tree learners. The committee actively selects the most informative pairs of similar records, a user labels each selected pair as duplicate or non-duplicate, and the labeled pairs are then used to train the committee members. In this way, the matching rules for similar records across different data sources are learned automatically. Experiments on real data show that the method achieves high matching accuracy while effectively reducing the number of training instances.

Key words: information integration, similar record matching, active learning, decision tree
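
The committee-based selection loop summarized in the abstract can be illustrated with a minimal query-by-committee sketch. This is not the authors' implementation; it rests on several assumptions: each candidate record pair is already reduced to a tuple of similarity scores in [0, 1], the clustering step is omitted, each committee member is a depth-1 decision tree (a stump) trained on a bootstrap sample, and "most informative" is taken to mean the pair on which the committee's votes are most evenly split. All function names (`train_stump`, `train_committee`, `disagreement`, `active_learn`) and the `oracle` callback standing in for the user are hypothetical.

```python
import random


def train_stump(pairs, labels):
    """Fit a depth-1 decision tree: predict 'match' when one feature >= threshold.

    Assumption: higher similarity scores indicate a more likely match.
    """
    best = None
    for f in range(len(pairs[0])):
        # Candidate thresholds: every observed value, plus "never match".
        candidates = sorted({p[f] for p in pairs}) + [float("inf")]
        for t in candidates:
            errors = sum((p[f] >= t) != y for p, y in zip(pairs, labels))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    _, f, t = best
    return lambda p: p[f] >= t


def train_committee(labeled, size, rng):
    """Train `size` stumps, each on a bootstrap resample of the labeled pairs."""
    committee = []
    for _ in range(size):
        sample = [rng.choice(labeled) for _ in labeled]
        committee.append(
            train_stump([p for p, _ in sample], [y for _, y in sample])
        )
    return committee


def disagreement(committee, pair):
    """Size of the minority vote; 0 means the committee is unanimous."""
    votes = sum(member(pair) for member in committee)
    return min(votes, len(committee) - votes)


def active_learn(pool, oracle, seed_labeled, rounds, committee_size=5, seed=0):
    """Repeatedly ask the oracle (the user) to label the most contested pair."""
    rng = random.Random(seed)
    labeled = list(seed_labeled)
    pool = list(pool)
    for _ in range(rounds):
        if not pool:
            break
        committee = train_committee(labeled, committee_size, rng)
        query = max(pool, key=lambda p: disagreement(committee, p))
        pool.remove(query)
        labeled.append((query, oracle(query)))
    # Retrain once more so the returned committee has seen every label.
    return train_committee(labeled, committee_size, rng), labeled
```

In this sketch the user is asked for one label per round, so the number of training instances equals the seed set size plus the number of rounds. A fuller version would use deeper decision trees and stop once committee disagreement on the remaining pool falls below a chosen threshold.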

CLC Number: