
Computer Engineering (计算机工程)


Punctuation Restoration Method Based on MEGA Network and Hierarchical Prediction

  • Published: 2024-04-10


Abstract: Punctuation restoration, also known as punctuation prediction, is the classic natural language processing task of adding appropriate punctuation marks to unpunctuated text to improve its readability. In recent years, with the development of pre-trained models and deepening research on punctuation restoration, performance on the task has steadily improved. However, Transformer-based pre-trained models are limited in their ability to extract local information from long input sequences, which hinders the prediction of the final punctuation marks. Additionally, previous studies treated punctuation labels simply as symbols to be predicted, overlooking the contextual attributes of different punctuation marks and the relationships between them. To address these issues, this paper introduces the Moving Average Equipped Gated Attention (MEGA) network as an auxiliary module to strengthen the model's ability to extract local information, and constructs a hierarchical prediction module that exploits the contextual attributes of different punctuation marks and the relationships between them for the final classification. Experiments are conducted with various Transformer-based pre-trained models on datasets in different languages. Results on the English punctuation dataset IWSLT show that applying the MEGA module and the hierarchical prediction module yields performance gains for most pre-trained models. Notably, DeBERTaV3 xlarge achieves an F1 score of 85.5% on the IWSLT REF test set, a 1.2% improvement over the baseline and the best result reported on the REF test set to date. The proposed model also achieves the highest accuracy to date on the Chinese punctuation dataset.
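The abstract states that MEGA augments attention with a moving-average component to capture local context. As a rough illustration only, and not the paper's implementation, the heart of such local mixing is a damped exponential moving average over the token sequence; the scalar, one-feature parameterization below is a deliberate simplification of MEGA's multi-dimensional EMA:

```python
def damped_ema(xs, alpha, delta):
    """Damped exponential moving average over a 1-D feature sequence.

    xs:    list of floats (one feature per time step, for simplicity)
    alpha: smoothing weight in (0, 1)
    delta: damping factor in (0, 1]
    Each output mixes the current input with a decayed running state,
    so nearby tokens dominate the result -- a simple form of the local
    context mixing that MEGA adds alongside gated attention.
    """
    ys, state = [], 0.0
    for x in xs:
        state = alpha * x + (1.0 - alpha * delta) * state
        ys.append(state)
    return ys
```

With `alpha = 0.5` and `delta = 1.0`, a constant input sequence converges geometrically toward the input value, showing how the running state emphasizes recent tokens.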
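The hierarchical prediction idea (first deciding whether any punctuation follows a token, and only then choosing which mark) can be sketched as follows. The two linear heads, the 0.5 gate threshold, and the label set are hypothetical and may differ from the paper's actual module design:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vec, mat):
    """Multiply a dim-vector by a dim x n_out weight matrix (list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def hierarchical_predict(h, w_gate, w_cls, labels=("COMMA", "PERIOD", "QUESTION")):
    """Two-stage punctuation decision per token representation.

    Stage 1: binary gate -- does any punctuation follow this token?
    Stage 2: only if gated in, choose which punctuation mark.
    w_gate (dim x 2) and w_cls (dim x 3) are hypothetical linear heads.
    """
    preds = []
    for vec in h:
        p_none, p_punct = softmax(matvec(vec, w_gate))
        if p_punct <= 0.5:
            preds.append("O")  # no punctuation after this token
        else:
            scores = softmax(matvec(vec, w_cls))
            preds.append(labels[scores.index(max(scores))])
    return preds
```

Splitting the decision this way lets the gate specialize in the common "no punctuation" case, while the second head only distinguishes among marks, which is one plausible reading of how punctuation relationships could be exploited.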