
Computer Engineering ›› 2023, Vol. 49 ›› Issue (10): 97-104. doi: 10.19678/j.issn.1000-3428.0066057

• Artificial Intelligence and Pattern Recognition •

  • Author biographies:

    LIU Dong (born 1986), male, senior engineer, M.S.; his research interests include artificial intelligence and information security.

    YANG Hui, senior engineer, M.S.

    CAO Yang, senior engineer, M.S.

  • Funding:
    Joint Fund for Enterprise Innovation and Development of the National Natural Science Foundation of China (U20B2049); 2018-2019 Open Fund of the Big Data Application on Improving Government Governance Capabilities National Engineering Laboratory (w-2019010)

Text Similarity Computing Model Based on Weighted Combination of Multiple Models

Dong LIU1,2, Hui YANG2,3, Shaopei JI1,2,*, Yang CAO2,3   

  1. The 30th Research Institute of China Electronics Technology Group Corporation, Chengdu 610041, China
    2. CETC Big Data Research Institute Co., Ltd., Guiyang 550022, China
    3. Big Data Application on Improving Government Governance Capabilities National Engineering Laboratory, Guiyang 550022, China
  • Received:2022-10-20 Online:2023-10-15 Published:2023-10-10
  • Contact: Shaopei JI


Abstract:

A text similarity computing model based on a weighted combination of multiple models is established to address two shortcomings of traditional text similarity computing models: they do not consider semantic and structural information, and they tend to ignore fine-grained text features. Taking word order, topic, semantics, and other content into account, an embedding representation is produced for each word in a sentence; a max pooling operation and a Bi-directional Gated Recurrent Unit (Bi-GRU) neural network serve as the encoder that generates the sentence embedding, and the similarity relationship between sentence embeddings is learned through multi-level comparison. In parallel, the text is converted into a structured representation, and phrase-based shallow syntax-tree features are extracted as input to a Tree-GRU for text similarity computing. The two similarity scores are then linearly weighted to obtain the final text similarity. Experimental results show that the model achieves its best similarity results when the weight parameters C1 and C2 are set to 0.6 and 0.4, respectively. The precision, recall, and F1 score reach 90.32%, 90.89%, and 90.52% on the STSB dataset; 85.41%, 85.95%, and 85.61% on the SICK dataset; and 90.32%, 90.89%, and 90.52% on the MRPC dataset. The proposed model fully exploits the multi-level content and structural information of text, is suitable for processing complex long texts, and achieves better text similarity results than models such as DT-TEAM and ECNU.

Key words: text feature, multi-word embedding, multi-level comparison, shallow syntax tree, linear weighting, text similarity
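
The final linear-weighting step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `s1` stands in for the Bi-GRU multi-level comparison score (approximated here as a cosine similarity over max-pooled word vectors) and `s2` is a hypothetical placeholder for the Tree-GRU structural score; the weights C1=0.6 and C2=0.4 follow the paper's reported best setting.

```python
import numpy as np

def max_pool_encode(word_vectors):
    """Element-wise max over word vectors -> one sentence embedding."""
    return np.max(np.stack(word_vectors), axis=0)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_similarity(s1, s2, c1=0.6, c2=0.4):
    """Linear weighting of the two model scores: sim = C1*s1 + C2*s2."""
    return c1 * s1 + c2 * s2

# Toy word vectors for two sentences (illustrative values only).
sent_a = [np.array([0.2, 0.9, 0.1]), np.array([0.8, 0.3, 0.5])]
sent_b = [np.array([0.1, 0.8, 0.2]), np.array([0.9, 0.2, 0.4])]

s1 = cosine(max_pool_encode(sent_a), max_pool_encode(sent_b))
s2 = 0.75  # hypothetical Tree-GRU structural score
print(round(combined_similarity(s1, s2), 4))
```

In practice both scores would come from trained encoders; the sketch only shows how the two branches are fused by the fixed linear weights.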