
Computer Engineering ›› 2025, Vol. 51 ›› Issue (6): 174-183. doi: 10.19678/j.issn.1000-3428.0069112

• Artificial Intelligence and Pattern Recognition •

Tibetan Short Text Summarization Based on Fusion of Different Basic Units Information

XIA Wuji1,2, HUANG Heming1,2,*(), FAN Yonghong1,2, Gengzangcuomao1,2, FAN Yutao1,2

  1. School of Computer Science and Technology, Qinghai Normal University, Xining 810008, Qinghai, China
    2. State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, Qinghai, China
  • Received: 2023-12-27 Online: 2025-06-15 Published: 2024-05-21
  • Corresponding author: HUANG Heming
  • Supported by:
    National Natural Science Foundation of China (62066039, 62166034); Natural Science Foundation of Qinghai Province (2022-ZJ-925); Independent Project of the State Key Laboratory of Tibetan Intelligent Information Processing and Application (2022-SKL-007)

Tibetan Short Text Summarization Based on Fusion of Different Basic Units Information

XIA Wuji1,2, HUANG Heming1,2,*(), FAN Yonghong1,2, Gengzangcuomao1,2, FAN Yutao1,2   

  1. School of Computer Science and Technology, Qinghai Normal University, Xining 810008, Qinghai, China
    2. State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, Qinghai, China
  • Received: 2023-12-27 Online: 2025-06-15 Published: 2024-05-21
  • Contact: HUANG Heming

Abstract:

Tibetan text summarization enables users to understand the content of Tibetan texts quickly and effectively. However, the scarcity of public, multi-domain, large-scale Tibetan summarization datasets hinders progress in Tibetan text summary generation. Moreover, existing research builds models by borrowing summarization techniques from Chinese, English, and other languages that use words as the basic unit; because Tibetan word segmentation technology is still limited, directly using words as the basic unit for summary generation has a considerable impact on performance. To address these problems, a multi-domain Tibetan short text summarization dataset, TB-SUM, containing 10 523 text-summary pairs is constructed. Building on an analysis of the constituent units of Tibetan text, a fusion method for different basic units suited to Tibetan summary generation is proposed, and a summarization model, Fusion_GloVe_GRU_Atten, that fuses different basic units is built. The model vectorizes Tibetan text with the Global Vectors for Word Representation (GloVe) module, encodes the input vectors with a Bidirectional Gated Recurrent Unit (Bi-GRU) module, and uses an attention mechanism to capture the complete semantic information of the input so that the decoder focuses on the encoder outputs most relevant to the current word; a GRU serves as the decoder to generate the Tibetan summary. Experimental results on the TB-SUM and Ti-SUM datasets show that when the fusion of syllables and words is used as the basic unit for training and syllables are used as the basic unit for testing, the Fusion_GloVe_GRU_Atten model produces better short text summaries and achieves higher Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores.
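As a rough illustration of what "basic units" mean here, the sketch below (not taken from the paper) assumes that Tibetan syllables are delimited by the tsheg mark (་, U+0F0B) and that some word segmenter is available; the word_segment stub and the simple order-preserving fusion are hypothetical placeholders, not the authors' actual fusion method.

# Minimal sketch: deriving syllable and word basic units from Tibetan text
# and fusing them into one token stream. The word_segment() stub and the
# concatenate-and-deduplicate fusion are illustrative assumptions only.

TSHEG = "\u0f0b"  # Tibetan syllable delimiter "་"

def syllable_units(text: str) -> list[str]:
    """Split a Tibetan string into syllables on the tsheg mark."""
    return [s for s in text.split(TSHEG) if s]

def word_segment(text: str) -> list[str]:
    """Placeholder for a Tibetan word segmenter (assumed, not specified here)."""
    # A real system would call a trained segmenter; falling back to
    # syllables keeps the sketch self-contained.
    return syllable_units(text)

def fused_units(text: str) -> list[str]:
    """Fuse syllable-level and word-level units into a single sequence."""
    syllables = syllable_units(text)
    words = word_segment(text)
    seen, fused = set(), []
    for unit in syllables + words:   # keep first occurrence, preserve order
        if unit not in seen:
            seen.add(unit)
            fused.append(unit)
    return fused

if __name__ == "__main__":
    sample = "བོད་ཡིག་ཚིག་ཐུང་"  # arbitrary example string, syllables separated by tsheg
    print(fused_units(sample))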

Keywords: basic unit, information fusion, word vector, dataset construction, Tibetan short text summarization

Abstract:

Tibetan text summarization enables users to quickly and effectively understand the content of Tibetan texts. However, the scarcity of public, multi-domain, and large-scale Tibetan summarization datasets hinders the further development of Tibetan text summarization techniques. Furthermore, most studies on Tibetan text summarization build models on Chinese and English summarization techniques that use words as basic units; owing to the limitations of Tibetan word segmentation technology, directly using words as the basic units for summarization has a significant impact on performance. Therefore, a multi-domain Tibetan short text summarization dataset, TB-SUM, containing 10 523 text-summary pairs is constructed in this study. Based on an analysis of the constituent units of Tibetan texts, a fusion method for different basic units suitable for Tibetan text summarization is proposed. Finally, a Tibetan text summarization model called Fusion_GloVe_GRU_Atten that integrates different basic units is constructed. The model utilizes the Global Vectors for Word Representation (GloVe) module to vectorize Tibetan text and encodes the input vectors using a Bidirectional Gated Recurrent Unit (Bi-GRU) module. An attention mechanism is used to obtain the complete semantic information of the input, allowing the decoder to pay more attention to the encoder outputs related to the current word, and a GRU is used as the decoder to generate the Tibetan summary. Experiments are conducted on the TB-SUM and Ti-SUM datasets. The results show that when the fusion of syllables and words is used as the basic unit for model training and syllables are used as the basic unit for testing, the Fusion_GloVe_GRU_Atten model generates better summaries and achieves higher Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores.
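To make the described pipeline concrete, the following is a minimal PyTorch sketch of a GloVe-embedding, Bi-GRU encoder, attention, GRU decoder stack; the vocabulary size, hidden sizes, dot-product attention variant, and the randomly initialized stand-in for the pretrained GloVe matrix are all assumptions for illustration, not the authors' reported configuration.

# Minimal PyTorch sketch of the described layout: GloVe-style embeddings,
# a Bi-GRU encoder, dot-product attention over encoder states, and a GRU
# decoder. Sizes and the attention variant are assumptions.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 8000, 128, 256  # assumed sizes

class Summarizer(nn.Module):
    def __init__(self):
        super().__init__()
        glove = torch.randn(VOCAB, EMB)             # stand-in for pretrained GloVe vectors
        self.embed = nn.Embedding.from_pretrained(glove, freeze=False)
        self.encoder = nn.GRU(EMB, HID, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(EMB + 2 * HID, HID, batch_first=True)
        self.bridge = nn.Linear(2 * HID, HID)       # map Bi-GRU final state to decoder state
        self.attn_proj = nn.Linear(HID, 2 * HID)    # score decoder state against encoder outputs
        self.out = nn.Linear(HID + 2 * HID, VOCAB)

    def forward(self, src, tgt):
        enc_out, enc_h = self.encoder(self.embed(src))           # enc_out: (B, S, 2H)
        h = torch.tanh(self.bridge(torch.cat([enc_h[0], enc_h[1]], dim=-1))).unsqueeze(0)
        logits = []
        for t in range(tgt.size(1)):                              # teacher forcing, one step at a time
            emb_t = self.embed(tgt[:, t:t + 1])                   # (B, 1, E)
            query = self.attn_proj(h[-1]).unsqueeze(1)            # (B, 1, 2H)
            scores = torch.bmm(query, enc_out.transpose(1, 2))    # (B, 1, S)
            context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)  # (B, 1, 2H)
            dec_out, h = self.decoder(torch.cat([emb_t, context], dim=-1), h)
            logits.append(self.out(torch.cat([dec_out, context], dim=-1)))
        return torch.cat(logits, dim=1)                           # (B, T, VOCAB)

if __name__ == "__main__":
    model = Summarizer()
    src = torch.randint(0, VOCAB, (2, 20))   # a batch of 2 source token-ID sequences
    tgt = torch.randint(0, VOCAB, (2, 8))    # shifted target summary token IDs
    print(model(src, tgt).shape)             # torch.Size([2, 8, 8000])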

Key words: basic unit, information fusion, word vector, dataset construction, Tibetan short text summarization