摘要: 对联邦数字图书馆中重复元数据记录进行检测和管理,是保证元数据质量、提高联邦检索服务质量的关键。针对现有联邦数字图书馆中重复记录检测方法计算集中、准确度不高等缺点,提出一种快速高效的相似重复元数据记录检测方法,该方法基于改进的N-Gram方法,适合较大规模联邦数字图书馆。模拟实验结果表明,该方法能有效提高重复检测的性能,加快重复检测的速度。
关键词:
元数据,
重复记录检测,
N-Gram方法,
相似度
Abstract: Detecting and managing duplicate metadata records in a federated digital library is a key issue in ensuring metadata quality and improving federated retrieval services. Many duplicate record detection methods exist for conventional federated digital libraries, but they are computationally intensive and have low accuracy. This paper proposes an efficient method, based on an improved N-Gram method, for detecting approximately duplicate metadata records in relatively large federated digital libraries. Simulation results show that the method effectively improves the performance of duplicate detection and speeds it up.
Key words:
metadata,
duplicate record detection,
N-Gram method,
similarity
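
For readers unfamiliar with the N-Gram idea named in the abstract, the following is a minimal sketch of plain character n-gram similarity between two metadata records. It is not the improved method proposed in the paper, which the abstract does not describe. The function names, the field names ("title", "creator"), the Dice coefficient, and the 0.8 threshold are all assumptions chosen for illustration.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a case- and space-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Strings shorter than n are kept whole so they can still be compared.
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Dice coefficient over the two n-gram sets; the result lies in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # assumption: two empty values count as identical
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def is_near_duplicate(record_a, record_b, fields=("title", "creator"), threshold=0.8):
    """Average the per-field similarity and compare it with a hypothetical threshold."""
    scores = [ngram_similarity(record_a.get(f, ""), record_b.get(f, "")) for f in fields]
    return sum(scores) / len(scores) >= threshold
```

Two records whose fields differ only in letter case or spacing produce identical n-gram sets and are flagged as near duplicates. Records with no shared n-grams score 0. A real deployment would need to tune the fields, n, and threshold against labeled data.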
王常武;韩菁华;张付志. 一种相似重复元数据记录检测方法[J]. 计算机工程, 2009, 35(21): 85-87.
WANG Chang-wu; HAN Jing-hua; ZHANG Fu-zhi. Method for Approximately Duplicate Metadata Record Detection[J]. Computer Engineering, 2009, 35(21): 85-87.