Entity Canonicalization Method Based on Dual-View Contrastive Learning and Submodular Optimization

Computer Engineering, 2026, Vol. 52, Issue (6): 80-95. doi: 10.19678/j.issn.1000-3428.0252914

XUE Hanbing¹, NI Chen¹,*, LI Yuying², GUAN Jia¹, FANG Kai¹, CUI Wenqian¹

1. School of Physical Science and Engineering, Tongji University, Shanghai 200092, China
2. Education Technology and Computing Center, Tongji University, Shanghai 200092, China

Received: 2025-08-15  Revised: 2025-12-17  Published: 2026-06-15  Released online: 2026-01-23
Corresponding author: NI Chen
About the authors:
XUE Hanbing, female, master's degree; main research interests: knowledge graphs and physics education
NI Chen (corresponding author), senior engineer
LI Yuying, assistant engineer
GUAN Jia, engineer
FANG Kai, professor-level senior engineer
CUI Wenqian, teaching assistant
Funding:
2024 Research Project of the Mechanics Course Research Association, Physics Teaching Steering Committee of the Ministry of Education (JZW-24-LX-02); 20th Experimental Teaching Reform Special Project of Tongji University

Abstract:

Knowledge graphs (KGs) often acquire entity redundancy during construction: multiple nodes come to represent the same real-world entity because of heterogeneous data sources or information-extraction errors, which severely degrades graph quality and downstream application performance. To address Entity Canonicalization (EC) within a single knowledge graph, we propose a two-stage method with three core innovations. 1) We propose a Contrastive Representation-Guided Clustering (CRGC) method that performs contrastive learning over two views of each entity (its context and its definition) and adaptively cuts the hierarchical clustering result using the Minimum Description Length (MDL) criterion, which avoids manual threshold setting. 2) We design a Submodular Redundancy Minimization (SRM) algorithm that formulates representative-entity selection as submodular coverage maximization under a partition matroid constraint; the algorithm retains an approximation-ratio guarantee while explicitly balancing the Knowledge Coverage Rate (KCR) against redundancy. 3) Tailored to the EC task, we introduce a type-consistency penalty and a hard-negative mining strategy that effectively suppress the "over-merging" caused by homographic entities (same surface form, different meaning). Experiments on multiple public and internal datasets show that the combined method, CRGC-SRM, improves clustering quality by about 2.7 percentage points on average over the strongest baseline, which in turn reduces the Entity Redundancy Rate (ERR) from 29.7% to 7.8% on average (a 73.7% reduction relative to the original graph) while keeping KCR at or above 98%, significantly improving graph quality. On SPARQL workloads, CRGC-SRM raises the Mean Reciprocal Rank (MRR) by about 15.4% and Hits@1 by about 18.5%, and lowers the 95th-percentile (P95) query latency by 27.7% to 35.9%, effectively improving query performance. CRGC-SRM thus offers an efficient solution to single-graph EC that combines theoretical guarantees with engineering practicality.

Key words: Knowledge Graph (KG), Entity Canonicalization (EC), contrastive learning, Minimum Description Length (MDL), submodular optimization
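
The representative-selection step described in the abstract (coverage maximization under a partition matroid constraint) can be illustrated with a minimal sketch. It is not the authors' SRM algorithm: the abstract does not give the objective, the redundancy term, or the optimization procedure, so the code below assumes a plain set-cardinality coverage objective and the textbook greedy rule, which is known to give a 1/2-approximation for monotone submodular functions under a matroid constraint. The function name `greedy_representatives`, the toy entities, the fact identifiers, and the per-cluster budgets are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple


def greedy_representatives(
    facts: Dict[str, Set[Hashable]],
    cluster_of: Dict[str, int],
    budget: Dict[int, int],
) -> Tuple[List[str], Set[Hashable]]:
    """Greedy coverage maximization under a partition matroid.

    facts:      entity -> set of knowledge items (e.g. triple ids) it carries
    cluster_of: entity -> cluster id; the clusters form the partition blocks
    budget:     cluster id -> maximum number of representatives allowed
    Returns the chosen representatives and the set of covered items.
    """
    selected: List[str] = []
    covered: Set[Hashable] = set()
    used: Dict[int, int] = defaultdict(int)
    remaining = set(facts)

    while remaining:
        best, best_gain = None, 0
        # Sorted iteration makes tie-breaking deterministic.
        for entity in sorted(remaining):
            cluster = cluster_of[entity]
            if used[cluster] >= budget.get(cluster, 0):
                continue  # this block of the partition is already full
            gain = len(facts[entity] - covered)  # marginal coverage gain
            if gain > best_gain:
                best, best_gain = entity, gain
        if best is None:
            break  # no feasible entity adds new coverage
        selected.append(best)
        covered |= facts[best]
        used[cluster_of[best]] += 1
        remaining.remove(best)
    return selected, covered


# Toy data (hypothetical): two clusters of duplicate mentions.
facts = {
    "NYC": {"t1", "t2"},
    "New York City": {"t1", "t2", "t3"},
    "Big Apple": {"t3"},
    "Paris": {"t4", "t5"},
    "Paris, France": {"t5"},
}
cluster_of = {
    "NYC": 0, "New York City": 0, "Big Apple": 0,
    "Paris": 1, "Paris, France": 1,
}
budget = {0: 1, 1: 1}  # keep at most one representative per cluster

reps, covered = greedy_representatives(facts, cluster_of, budget)
all_facts = set().union(*facts.values())
coverage = len(covered) / len(all_facts)  # simple stand-in for KCR
print(reps, coverage)
```

On this toy input the sketch keeps `New York City` and `Paris`, two of the five nodes, and still covers all five toy facts. The coverage ratio here is only a stand-in for KCR, whose exact definition is not given in the abstract.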

XUE Hanbing, NI Chen, LI Yuying, GUAN Jia, FANG Kai, CUI Wenqian. Entity Canonicalization Method Based on Dual-View Contrastive Learning and Submodular Optimization[J]. Computer Engineering, 2026, 52(6): 80-95.

https://www.ecice06.com/CN/Y2026/V52/I6/80

Figures/Tables: 10

Fig.1 Overall framework of the CRGC-SRM method

Fig.2 Impact of coverage target β on ERR and KCR

Fig.3 Sensitivity analysis of hard-negative pool size k on clustering performance

Fig.4 Character-level noise robustness analysis

Fig.5 Abbreviation rate robustness analysis

