
Computer Engineering (计算机工程), 2026, Vol. 52, Issue 6: 80-95. doi: 10.19678/j.issn.1000-3428.0252914

• Computational Intelligence and Pattern Recognition •

Entity Canonicalization Method Based on Dual-View Contrastive Learning and Submodular Optimization

(Chinese title: 基于双视图对比学习与子模优化的实体规范化方法)

XUE Hanbing1, NI Chen1,*, LI Yuying2, GUAN Jia1, FANG Kai1, CUI Wenqian1

  1. School of Physical Science and Engineering, Tongji University, Shanghai 200092, China
  2. Education Technology and Computing Center, Tongji University, Shanghai 200092, China
  • Received: 2025-08-15 Revised: 2025-12-17 Publication date: 2026-06-15 Release date: 2026-01-23
  • Corresponding author: NI Chen
  • About the authors:

    XUE Hanbing (薛寒冰), female, master's degree; main research interests: knowledge graphs and physics education

    NI Chen (倪晨, corresponding author), Senior Engineer

    LI Yuying (李渔迎), Assistant Engineer

    GUAN Jia (关佳), Engineer

    FANG Kai (方恺), Professor-level Senior Engineer

    CUI Wenqian (崔文倩), Teaching Assistant

  • Funding:
    2024 Mechanics Course Research Society Project of the Ministry of Education's Teaching Steering Committee for Physics Programs (JZW-24-LX-02); Tongji University 20th Experimental Teaching Reform Special Project

Abstract:

Knowledge Graphs (KGs) often acquire redundant entities during construction: heterogeneous data sources or information-extraction errors cause multiple nodes to represent the same real-world entity, which severely degrades graph quality and downstream application performance. To address Entity Canonicalization (EC) within a single knowledge graph, we propose a two-stage method, CRGC-SRM, with three core contributions. 1) We propose Contrastive Representation-Guided Clustering (CRGC), which performs contrastive learning over two views of each entity (its context and its definition) and cuts the resulting hierarchical clustering adaptively using the Minimum Description Length (MDL) criterion, thereby avoiding manual threshold setting. 2) We design a Submodular Redundancy Minimization (SRM) algorithm that formulates representative entity selection as submodular coverage maximization under a partition matroid constraint; this formulation retains an approximation-ratio guarantee while explicitly balancing the Knowledge Coverage Rate (KCR) against redundancy. 3) Tailored to the EC task, we introduce a type-consistency penalty and a hard-negative mining strategy, which effectively suppress the "over-merging" caused by homonymous entities (same surface form, different meanings). Experiments on multiple public and internal datasets show that CRGC-SRM improves clustering quality by about 2.7 percentage points on average over the strongest baseline, which in turn reduces the Entity Redundancy Rate (ERR) from 29.7% to 7.8% on average (a 73.7% reduction relative to the original graph) while keeping KCR at no less than 98%. On SPARQL workloads, it raises the Mean Reciprocal Rank (MRR) by about 15.4% and Hits@1 by about 18.5%, and lowers the 95th-percentile (P95) query latency by 27.7% to 35.9%. CRGC-SRM thus offers an efficient solution to single-graph EC that combines theoretical guarantees with engineering practicality.
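As a quick consistency check on the redundancy figures in the abstract, the 73.7% relative reduction follows directly from the two reported ERR averages. This is a worked calculation rather than a formula taken from the paper, and the subscripts "orig" and "canon" are illustrative names for the graph before and after canonicalization:

```latex
\[
\frac{\mathrm{ERR}_{\text{orig}} - \mathrm{ERR}_{\text{canon}}}{\mathrm{ERR}_{\text{orig}}}
  = \frac{29.7 - 7.8}{29.7}
  = \frac{21.9}{29.7}
  \approx 0.737
\]
```

In other words, the absolute drop is 21.9 percentage points, and the 73.7% figure expresses that drop relative to the original redundancy rate.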

Key words: Knowledge Graph (KG), Entity Canonicalization (EC), contrastive learning, Minimum Description Length (MDL), submodular optimization
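The abstract frames representative selection as submodular coverage maximization under a partition matroid constraint. A minimal sketch of that general idea is given below. It is not the authors' SRM algorithm: it assumes each cluster is one part of the partition, that at most `per_cluster` representatives may be taken from each part, and that "coverage" is simply the number of distinct facts the chosen entities carry. The function names, the toy entities, and the fact identifiers are all hypothetical. For a monotone submodular objective under a matroid constraint, this kind of greedy selection is known to reach at least half of the optimum; the abstract does not state which guarantee the paper itself proves.

```python
from typing import Dict, Hashable, List, Set


def greedy_representatives(
    clusters: Dict[str, List[str]],
    facts: Dict[str, Set[Hashable]],
    per_cluster: int = 1,
) -> List[str]:
    """Greedily pick entities to maximize fact coverage.

    Partition matroid constraint (assumed for this sketch): at most
    `per_cluster` entities may be chosen from each cluster.
    """
    chosen: List[str] = []
    covered: Set[Hashable] = set()
    used = {cid: 0 for cid in clusters}  # picks made so far in each part
    owner = {e: cid for cid, members in clusters.items() for e in members}
    candidates = set(owner)

    while True:
        best, best_gain = None, -1
        for e in sorted(candidates):  # sorted so ties break deterministically
            if used[owner[e]] >= per_cluster:
                continue  # this part of the partition is already full
            gain = len(facts.get(e, set()) - covered)  # marginal coverage
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:  # every part is full or no candidates remain
            break
        chosen.append(best)
        covered |= facts.get(best, set())
        used[owner[best]] += 1
        candidates.remove(best)
    return chosen


def coverage_rate(chosen: List[str], facts: Dict[str, Set[Hashable]]) -> float:
    """Fraction of all distinct facts carried by the chosen entities."""
    all_facts = set().union(*facts.values()) if facts else set()
    got = set().union(*(facts[e] for e in chosen)) if chosen else set()
    return len(got) / len(all_facts) if all_facts else 1.0


# Hypothetical toy data: two clusters of duplicate entities.
demo_clusters = {"c1": ["NYC", "New_York_City"], "c2": ["Paris", "Paris_FR"]}
demo_facts = {
    "NYC": {"f1", "f2"},
    "New_York_City": {"f1", "f2", "f3"},
    "Paris": {"f4"},
    "Paris_FR": {"f4", "f5"},
}
demo_reps = greedy_representatives(demo_clusters, demo_facts)
print(demo_reps, coverage_rate(demo_reps, demo_facts))
```

On the toy data, the sketch keeps one node per cluster and prefers the duplicate that carries the most facts not yet covered, which is the intuition behind trading redundancy against coverage.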