
Computer Engineering ›› 2024, Vol. 50 ›› Issue (12): 396-406. doi: 10.19678/j.issn.1000-3428.0068599

• Development Research and Engineering Application •


Punctuation Restoration Method Based on MEGA Network and Hierarchical Prediction

ZHANG Wenbo1,*, HUANG Hao1,2, WU Di1, TANG Minjie1

  1. College of Computer Science and Technology, Xinjiang University, Urumqi 830046, Xinjiang, China
    2. Xinjiang Key Laboratory of Multi-language Information Technology, Urumqi 830046, Xinjiang, China
  • Received: 2023-10-16  Online: 2024-12-15  Published: 2024-04-10
  • Contact: ZHANG Wenbo
  • Supported by: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0107902)


Abstract:

Punctuation restoration, also known as punctuation prediction, is the classic Natural Language Processing (NLP) task of adding appropriate punctuation marks to unpunctuated text to improve its readability. With the development of pretrained models and deepening research on punctuation restoration, performance on this task has continuously improved. However, Transformer-based pretrained models are limited in their ability to extract local information from long input sequences, which hinders the prediction of the final punctuation marks. In addition, previous studies have treated punctuation labels merely as symbols to be predicted, overlooking the contextual attributes of different punctuation marks and the relationships between them. To address these issues, this study introduces a Moving average Equipped Gated Attention (MEGA) network as an auxiliary module to strengthen the model's ability to extract local information, and constructs a hierarchical prediction module that fully exploits the contextual attributes of punctuation marks and the relationships between them for the final classification. Experiments are conducted with various Transformer-based pretrained models on datasets in different languages. The results on the English punctuation dataset IWSLT show that applying the MEGA and hierarchical prediction modules yields performance gains for most pretrained models; notably, DeBERTaV3 xlarge achieves an F1 score of 85.5% on the REF test set of IWSLT, an improvement of 1.2 percentage points over the baseline. The proposed method also achieves high accuracy on the Chinese punctuation dataset.
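To make the pipeline described in the abstract concrete, the following is a minimal, illustrative sketch in Python/PyTorch, not the authors' released code: a pretrained Transformer encoder loaded through Hugging Face transformers (microsoft/deberta-v3-base stands in for the DeBERTaV3 models used in the paper), a simplified MEGA-style block in which a damped per-channel exponential moving average feeds a gated single-head attention (the paper's MEGA uses a richer multi-dimensional EMA), and a two-stage hierarchical head that first decides whether any punctuation follows a token and then which mark it is, which is one plausible reading of the hierarchical prediction module. All class names and the label hierarchy are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class SimpleEMA(nn.Module):
    """Damped per-channel exponential moving average over the sequence;
    a simplified stand-in for MEGA's multi-dimensional EMA."""

    def __init__(self, dim):
        super().__init__()
        self.alpha_logit = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                        # x: (batch, seq_len, dim)
        a = torch.sigmoid(self.alpha_logit)      # per-channel decay in (0, 1)
        state = torch.zeros_like(x[:, 0])
        steps = []
        for t in range(x.size(1)):               # O(seq_len) recurrence biases toward local context
            state = a * x[:, t] + (1.0 - a) * state
            steps.append(state)
        return torch.stack(steps, dim=1)


class MegaStyleBlock(nn.Module):
    """EMA-smoothed queries/keys feeding gated single-head attention,
    loosely following the moving-average-equipped gated attention idea."""

    def __init__(self, dim):
        super().__init__()
        self.ema = SimpleEMA(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, h):                        # h: (batch, seq_len, dim)
        smooth = self.ema(h)                     # locality-emphasising representation
        attn_out, _ = self.attn(smooth, smooth, h)
        return self.norm(h + self.gate(h) * attn_out)


class HierarchicalHead(nn.Module):
    """Stage 1: is the token followed by any punctuation at all?
    Stage 2: which mark (e.g. comma / period / question mark)?"""

    def __init__(self, dim, num_marks=3):
        super().__init__()
        self.has_punct = nn.Linear(dim, 2)
        self.which_mark = nn.Linear(dim, num_marks)

    def forward(self, h):
        return self.has_punct(h), self.which_mark(h)


tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
encoder = AutoModel.from_pretrained("microsoft/deberta-v3-base")
dim = encoder.config.hidden_size

mega = MegaStyleBlock(dim)
head = HierarchicalHead(dim)

batch = tokenizer(["how are you i am fine thanks"], return_tensors="pt")
hidden = encoder(**batch).last_hidden_state     # (1, seq_len, dim)
has_punct_logits, mark_logits = head(mega(hidden))  # per-token logits for both stages

In training, the two stages would presumably be optimized jointly, e.g. by summing per-token cross-entropy losses and masking the fine-grained loss at positions where no punctuation follows; the reported 85.5% F1 refers to the paper's full model, not this sketch.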

Key words: punctuation restoration, Natural Language Processing (NLP), pretrained model, Transformer structure, hierarchical prediction