
Computer Engineering

   

Entity Canonicalization with Dual-View Contrastive Learning and Submodular Optimization

  

  • Published: 2026-01-23


Abstract: Entity redundancy, in which multiple nodes represent the same real-world entity because of heterogeneous data sources or extraction errors, degrades the quality of a knowledge graph and the performance of applications built on it. To address entity canonicalization within a single knowledge graph, we propose a two-stage framework, CRGC-SRM, with three core contributions: (1) Contrastive Representation-Guided Clustering (CRGC), which applies contrastive learning over two views of each entity (its context and its definition) and uses the Minimum Description Length (MDL) principle to cut the hierarchical clustering result adaptively, removing the need for manually set thresholds; (2) Submodular Redundancy Minimization (SRM), which models representative-entity selection as submodular coverage maximization under a partition matroid constraint, explicitly balancing knowledge coverage against redundancy while retaining an approximation-ratio guarantee; (3) task-specific enhancements, namely a type-consistency penalty and hard-negative mining, which suppress the over-merging caused by polysemous (same-name, different-meaning) entities. Experiments on multiple public and internal datasets show that CRGC-SRM improves clustering quality by about 2.7 percentage points on average over the strongest baseline, and in turn reduces the entity redundancy rate from 29.7% to 7.8% on average (a 73.7% relative reduction from the original graph) while keeping knowledge coverage at or above 98%. On a SPARQL workload, it raises Mean Reciprocal Rank (MRR) by about 15.4% and Hits@1 by about 18.5%, and lowers the 95th-percentile (P95) query latency by 27.7–35.9%. The framework offers an efficient, theoretically grounded, and practical solution for single-graph entity canonicalization.
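To make the representative-selection idea concrete, the following is a minimal sketch of greedy coverage maximization under a partition matroid constraint. It is not the paper's SRM algorithm: the objective here is plain fact coverage (the number of distinct facts covered), whereas the abstract describes an objective that also accounts for redundancy. The function name `greedy_representatives`, the `facts` and `cluster_of` mappings, the `per_cluster` cap, and the toy data are all hypothetical. For monotone submodular objectives under a matroid constraint, the classical greedy rule is known to achieve at least half of the optimal value; whether the paper relies on this bound or a different one is not stated in the abstract.

```python
from typing import Dict, Hashable, List, Set


def greedy_representatives(
    facts: Dict[str, Set[Hashable]],
    cluster_of: Dict[str, int],
    per_cluster: int = 1,
) -> List[str]:
    """Greedily pick entities that add the most uncovered facts, taking at
    most `per_cluster` entities from each cluster (a partition matroid).

    Assumption: coverage is the count of distinct facts; this is a stand-in
    for whatever objective the paper actually optimizes.
    """
    covered: Set[Hashable] = set()
    used: Dict[int, int] = {}  # number of representatives chosen per cluster
    chosen: List[str] = []
    remaining = set(facts)

    while remaining:
        # Only entities whose cluster still has spare capacity are feasible.
        feasible = [e for e in remaining
                    if used.get(cluster_of[e], 0) < per_cluster]
        if not feasible:
            break
        # Largest marginal gain; sorting first makes tie-breaking deterministic.
        best = max(sorted(feasible), key=lambda e: len(facts[e] - covered))
        if not facts[best] - covered:
            # No feasible entity adds a new fact. Note that a cluster whose
            # facts are already covered may end up without a representative.
            break
        chosen.append(best)
        covered |= facts[best]
        used[cluster_of[best]] = used.get(cluster_of[best], 0) + 1
        remaining.discard(best)

    return chosen


# Toy data (hypothetical): two clusters of duplicate entity nodes.
facts = {
    "NYC": {"f1", "f2"},
    "New York City": {"f1", "f2", "f3"},
    "Big Apple": {"f3"},
    "Paris": {"f4"},
    "Paris, France": {"f4", "f5"},
}
cluster_of = {
    "NYC": 0, "New York City": 0, "Big Apple": 0,
    "Paris": 1, "Paris, France": 1,
}

representatives = greedy_representatives(facts, cluster_of)
covered_facts = set().union(*(facts[e] for e in representatives))
all_facts = set().union(*facts.values())
print(representatives, len(covered_facts) / len(all_facts))
```

On this toy input, one representative per cluster is enough to cover every fact, which mirrors the trade-off the abstract describes: fewer nodes are kept while the knowledge they carry is retained.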