
Computer Engineering ›› 2025, Vol. 51 ›› Issue (6): 174-183. doi: 10.19678/j.issn.1000-3428.0069112

• Artificial Intelligence and Pattern Recognition •

Tibetan Short Text Summarization Based on Fusion of Different Basic Units Information

XIA Wuji1,2, HUANG Heming1,2,*(), FAN Yonghong1,2, Gengzangcuomao1,2, FAN Yutao1,2

  1. School of Computer Science and Technology, Qinghai Normal University, Xining 810008, Qinghai, China
    2. State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, Qinghai, China
  • Received: 2023-12-27 Online: 2025-06-15 Published: 2024-05-21
  • Corresponding author: HUANG Heming
  • Supported by:
    National Natural Science Foundation of China (62066039, 62166034); Natural Science Foundation of Qinghai Province (2022-ZJ-925); Independent Project of the State Key Laboratory of Tibetan Intelligent Information Processing and Application (2022-SKL-007)

Tibetan Short Text Summarization Based on Fusion of Different Basic Units Information

XIA Wuji1,2, HUANG Heming1,2,*(), FAN Yonghong1,2, Gengzangcuomao1,2, FAN Yutao1,2   

  1. School of Computer Science and Technology, Qinghai Normal University, Xining 810008, Qinghai, China
    2. State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, Qinghai, China
  • Received: 2023-12-27 Online: 2025-06-15 Published: 2024-05-21
  • Contact: HUANG Heming

Abstract:

Tibetan text summarization enables users to understand the content of Tibetan texts quickly and effectively. However, the scarcity of public, multi-domain, large-scale Tibetan summarization datasets hinders progress in Tibetan text summary generation. Moreover, existing research builds models by borrowing summarization techniques from Chinese, English, and other languages that use words as the basic unit; because Tibetan word segmentation technology is still limited, directly using words as the basic unit for summary generation has a considerable impact on performance. To address these problems, a multi-domain Tibetan short text summarization dataset, TB-SUM, containing 10 523 text-summary pairs is constructed. Building on an analysis of the constituent units of Tibetan text, a fusion method for different basic units suited to Tibetan summary generation is proposed, and a summarization model, Fusion_GloVe_GRU_Atten, that fuses different basic units is built. The model vectorizes Tibetan text with the Global Vectors for Word Representation (GloVe) module, encodes the input vectors with a Bidirectional Gated Recurrent Unit (Bi-GRU) module, and uses an attention mechanism to capture the complete semantic information of the input so that the decoder focuses on the encoder outputs most relevant to the current word; a GRU serves as the decoder to generate the Tibetan summary. Experimental results on the TB-SUM and Ti-SUM datasets show that when the fusion of syllables and words is used as the basic unit for training and syllables are used as the basic unit for testing, the Fusion_GloVe_GRU_Atten model produces better short text summaries and achieves higher Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores.
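As a rough illustration of what "basic units" mean here, the sketch below (not taken from the paper) assumes that Tibetan syllables are delimited by the tsheg mark (་, U+0F0B) and that some word segmenter is available; the word_segment stub and the simple order-preserving fusion are hypothetical placeholders, not the authors' actual fusion method.

# Minimal sketch: deriving syllable and word basic units from Tibetan text
# and fusing them into one token stream. The word_segment() stub and the
# concatenate-and-deduplicate fusion are illustrative assumptions only.

TSHEG = "\u0f0b"  # Tibetan syllable delimiter "་"

def syllable_units(text: str) -> list[str]:
    """Split a Tibetan string into syllables on the tsheg mark."""
    return [s for s in text.split(TSHEG) if s]

def word_segment(text: str) -> list[str]:
    """Placeholder for a Tibetan word segmenter (assumed, not specified here)."""
    # A real system would call a trained segmenter; falling back to
    # syllables keeps the sketch self-contained.
    return syllable_units(text)

def fused_units(text: str) -> list[str]:
    """Fuse syllable-level and word-level units into a single sequence."""
    syllables = syllable_units(text)
    words = word_segment(text)
    seen, fused = set(), []
    for unit in syllables + words:   # keep first occurrence, preserve order
        if unit not in seen:
            seen.add(unit)
            fused.append(unit)
    return fused

if __name__ == "__main__":
    sample = "བོད་ཡིག་ཚིག་ཐུང་"  # arbitrary example string, syllables separated by tsheg
    print(fused_units(sample))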

Keywords: basic unit, information fusion, word vector, dataset construction, Tibetan short text summarization

Abstract:

Tibetan text summarization enables users to quickly and effectively understand the content of Tibetan texts. However, the scarcity of public, multi-domain, and large-scale Tibetan summarization datasets hinders the further development of Tibetan text summarization techniques. Furthermore, most studies on Tibetan text summarization build models on Chinese and English summarization techniques that use words as basic units; owing to the limitations of Tibetan word segmentation technology, directly using words as the basic units for summarization has a significant impact on performance. Therefore, a multi-domain Tibetan short text summarization dataset, TB-SUM, containing 10 523 text-summary pairs is constructed in this study. Based on an analysis of the constituent units of Tibetan texts, a fusion method for different basic units suitable for Tibetan text summarization is proposed. Finally, a Tibetan text summarization model called Fusion_GloVe_GRU_Atten that integrates different basic units is constructed. The model utilizes the Global Vectors for Word Representation (GloVe) module to vectorize Tibetan text and encodes the input vectors using a Bidirectional Gated Recurrent Unit (Bi-GRU) module. An attention mechanism is used to obtain the complete semantic information of the input, allowing the decoder to pay more attention to the encoder outputs related to the current word, and a GRU is used as the decoder to generate the Tibetan summary. Experiments are conducted on the TB-SUM and Ti-SUM datasets. The results show that when the fusion of syllables and words is used as the basic unit for model training and syllables are used as the basic unit for testing, the Fusion_GloVe_GRU_Atten model generates better summaries and achieves higher Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores.
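To make the described pipeline concrete, the following is a minimal PyTorch sketch of a GloVe-embedding, Bi-GRU encoder, attention, GRU decoder stack; the vocabulary size, hidden sizes, dot-product attention variant, and the randomly initialized stand-in for the pretrained GloVe matrix are all assumptions for illustration, not the authors' reported configuration.

# Minimal PyTorch sketch of the described layout: GloVe-style embeddings,
# a Bi-GRU encoder, dot-product attention over encoder states, and a GRU
# decoder. Sizes and the attention variant are assumptions.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 8000, 128, 256  # assumed sizes

class Summarizer(nn.Module):
    def __init__(self):
        super().__init__()
        glove = torch.randn(VOCAB, EMB)             # stand-in for pretrained GloVe vectors
        self.embed = nn.Embedding.from_pretrained(glove, freeze=False)
        self.encoder = nn.GRU(EMB, HID, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(EMB + 2 * HID, HID, batch_first=True)
        self.bridge = nn.Linear(2 * HID, HID)       # map Bi-GRU final state to decoder state
        self.attn_proj = nn.Linear(HID, 2 * HID)    # score decoder state against encoder outputs
        self.out = nn.Linear(HID + 2 * HID, VOCAB)

    def forward(self, src, tgt):
        enc_out, enc_h = self.encoder(self.embed(src))           # enc_out: (B, S, 2H)
        h = torch.tanh(self.bridge(torch.cat([enc_h[0], enc_h[1]], dim=-1))).unsqueeze(0)
        logits = []
        for t in range(tgt.size(1)):                              # teacher forcing, one step at a time
            emb_t = self.embed(tgt[:, t:t + 1])                   # (B, 1, E)
            query = self.attn_proj(h[-1]).unsqueeze(1)            # (B, 1, 2H)
            scores = torch.bmm(query, enc_out.transpose(1, 2))    # (B, 1, S)
            context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)  # (B, 1, 2H)
            dec_out, h = self.decoder(torch.cat([emb_t, context], dim=-1), h)
            logits.append(self.out(torch.cat([dec_out, context], dim=-1)))
        return torch.cat(logits, dim=1)                           # (B, T, VOCAB)

if __name__ == "__main__":
    model = Summarizer()
    src = torch.randint(0, VOCAB, (2, 20))   # a batch of 2 source token-ID sequences
    tgt = torch.randint(0, VOCAB, (2, 8))    # shifted target summary token IDs
    print(model(src, tgt).shape)             # torch.Size([2, 8, 8000])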

Key words: basic unit, information fusion, word vector, dataset construction, Tibetan short text summarization