
Computer Engineering ›› 2024, Vol. 50 ›› Issue (12): 151-162. doi: 10.19678/j.issn.1000-3428.0068536

• Artificial Intelligence and Pattern Recognition •


Self-Supervised Taxonomy Completion Based on Prior Knowledge-Guided Prompt Learning

CHEN Zhiqiang (陈志强)1, QIU Yu (仇瑜)2, ZHU Yu (朱宇)1,*, WANG Xiaoying (王晓英)1

  1. Department of Computer Technology and Application, Qinghai University, Xining 810000, Qinghai, China
    2. Beijing ZhiPu HuaZhang Technology Limited Company, Beijing 100084, China
  • Received: 2023-10-10  Online: 2024-12-15  Published: 2024-03-11
  • Contact: ZHU Yu
  • Supported by: the National Natural Science Foundation of China (62166032, 62162053), the Natural Science Foundation of Qinghai Province (2022-ZJ-961Q), and the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0114102)


Abstract:

Seed taxonomies in various domains are incomplete, and large numbers of new domain terms emerge over time, so these taxonomies must be completed automatically. Existing self-supervised taxonomy completion methods rely on graph embedding techniques; they do not fully exploit the rich semantic information provided by pre-trained language models, and they focus only on local node relationships in the graph while ignoring the information carried by the overall graph structure. To address these problems, a self-supervised taxonomy completion model based on prior knowledge-guided prompt learning, named Pro-tax, is proposed. The model integrates the semantic information of a pre-trained language model with the structural information of the seed taxonomy. First, based on the coarse-grained triplets that a query node forms on its vertical path, the construction strategy for the self-supervised dataset is improved. Second, for large samples, matching is performed under a pre-training and fine-tuning paradigm. During the fine-tuning stage, to strengthen the attention of the pre-trained language model to the true hypernym, prior knowledge attention over synonyms or abbreviations of the true hypernym is integrated into the prompt, so that the prompt guides the fine-tuning of the pre-trained language model more effectively. During the matching stage, soft beam search rules are adopted to reduce time complexity: on the local graph structure, the node embeddings generated under prompt guidance are used to evaluate the query confidence with respect to sibling nodes at the same level, whereas on the global graph structure, a vertical-path walk is used for path interception and ranking-based filtering. Third, in the few-shot setting, matching is based on prompt learning, and different template combinations and in-context demonstrations are used to fine-tune the pre-trained language model. Finally, experimental results on large public datasets from four different domains show that, compared with the baseline models, Pro-tax improves the Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hit@10 metrics by 15%, 0.057, and 0.030, respectively.
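To make the pipeline above concrete, the following is a minimal Python sketch of the first two steps: extracting coarse-grained (query, parent, grandparent) triplets from the vertical paths of a toy seed taxonomy, and assembling a cloze prompt that injects synonym/abbreviation prior knowledge. It is reconstructed from this abstract alone; all identifiers (PARENT, SYNONYMS, build_triplets, make_prompt) are hypothetical, and the plain-text hint merely stands in for the paper's prior-knowledge attention mechanism.

```python
# A minimal sketch, assuming a toy taxonomy; not the authors' code.
from typing import Dict, List, Tuple

# Toy seed taxonomy, child -> parent (a tree for simplicity).
PARENT: Dict[str, str] = {
    "convolutional neural network": "neural network",
    "neural network": "machine learning",
    "machine learning": "artificial intelligence",
}

# Prior knowledge: synonyms/abbreviations of candidate hypernyms (invented).
SYNONYMS: Dict[str, List[str]] = {
    "neural network": ["NN", "neural net"],
    "machine learning": ["ML"],
    "artificial intelligence": ["AI"],
}

def vertical_path(node: str) -> List[str]:
    """Walk the vertical (ancestor) path from a node up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def build_triplets(node: str) -> List[Tuple[str, str, str]]:
    """Coarse-grained (query, parent, grandparent) triplets taken from the
    vertical path, used as self-supervised positive samples."""
    path = vertical_path(node)
    return [(path[i], path[i + 1], path[i + 2]) for i in range(len(path) - 2)]

def make_prompt(query: str, candidate: str) -> str:
    """Cloze-style prompt; the synonym/abbreviation hint is a plain-text
    stand-in for the paper's prior-knowledge attention."""
    hints = SYNONYMS.get(candidate, [])
    hint = f" (also known as {', '.join(hints)})" if hints else ""
    return f"Question: is {candidate}{hint} the hypernym of {query}? Answer: [MASK]."

if __name__ == "__main__":
    print(build_triplets("convolutional neural network"))
    print(make_prompt("convolutional neural network", "neural network"))
```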

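The soft beam search of the matching stage can likewise be illustrated with a stub scorer in place of the fine-tuned language model. The sketch below walks the taxonomy top-down along vertical paths, smooths each candidate's score with the query's would-be siblings on the local structure, and keeps only the beam_width best nodes per level; CHILDREN, plm_confidence, and the 0.1 smoothing weight are illustrative assumptions, not the paper's exact rules.

```python
# A minimal sketch of soft beam search over a toy taxonomy; assumptions only.
import heapq
from typing import Dict, List, Tuple

# Toy taxonomy as parent -> children; a stub scorer replaces the PLM.
CHILDREN: Dict[str, List[str]] = {
    "artificial intelligence": ["machine learning", "knowledge representation"],
    "machine learning": ["neural network", "decision tree"],
    "neural network": [],
    "decision tree": [],
    "knowledge representation": [],
}

def plm_confidence(query: str, candidate: str, siblings: List[str]) -> float:
    """Stub for the prompt-guided PLM score: toy lexical overlap, smoothed
    with the query's would-be siblings on the local graph structure."""
    overlap = lambda a, b: len(set(a.split()) & set(b.split()))
    return overlap(query, candidate) + 0.1 * sum(overlap(query, s) for s in siblings)

def soft_beam_search(query: str, root: str, beam_width: int = 2) -> List[Tuple[float, str]]:
    """Walk vertical paths top-down, keeping only the beam_width best
    candidate parents per level instead of scoring the whole taxonomy."""
    beam = [(plm_confidence(query, root, []), root)]
    results = list(beam)
    while beam:
        level: List[Tuple[float, str]] = []
        for _, node in beam:
            for child in CHILDREN[node]:
                siblings = [c for c in CHILDREN[node] if c != child]
                level.append((plm_confidence(query, child, siblings), child))
        beam = heapq.nlargest(beam_width, level)  # soft per-level pruning
        results.extend(beam)
    return heapq.nlargest(beam_width, results)

if __name__ == "__main__":
    # Attaching a new term: "neural network" should rank as the best parent.
    print(soft_beam_search("graph neural network", "artificial intelligence"))
```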
Key words: taxonomy completion, prior knowledge, prompt learning, self-supervised, pre-trained language model
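For the few-shot mode, the abstract mentions combining different templates with in-context demonstrations; a minimal sketch of that prompt assembly is given below. The two templates and the demonstration pairs are invented for illustration, and the paper's actual template set and example selection are not specified here.

```python
# A minimal sketch of few-shot prompt assembly; templates are assumptions.
from typing import List, Tuple

# Invented templates and labelled demonstration pairs, for illustration only.
TEMPLATES = [
    "{hyper} is a hypernym of {query}.",
    "{query} is a kind of {hyper}.",
]

DEMONSTRATIONS: List[Tuple[str, str]] = [
    ("convolutional neural network", "neural network"),
    ("decision tree", "machine learning"),
]

def few_shot_prompt(query: str, candidate: str, template_id: int = 0) -> str:
    """Prepend labelled demonstrations so the PLM can be adapted from only a
    handful of examples; varying template_id yields the template combinations
    mentioned in the abstract."""
    template = TEMPLATES[template_id]
    demos = " ".join(template.format(query=q, hyper=h) for q, h in DEMONSTRATIONS)
    return f"{demos} {template.format(query=query, hyper=candidate)}"

if __name__ == "__main__":
    print(few_shot_prompt("graph neural network", "neural network", template_id=1))
```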