
Computer Engineering ›› 2024, Vol. 50 ›› Issue (12): 396-406. doi: 10.19678/j.issn.1000-3428.0068599

• Development Research and Engineering Application •


Punctuation Restoration Method Based on MEGA Network and Hierarchical Prediction

ZHANG Wenbo1,*, HUANG Hao1,2, WU Di1, TANG Minjie1

  1. College of Computer Science and Technology, Xinjiang University, Urumqi 830046, Xinjiang, China
    2. Xinjiang Key Laboratory of Multi-language Information Technology, Urumqi 830046, Xinjiang, China
  • Received: 2023-10-16  Online: 2024-12-15  Published: 2024-04-10
  • Contact: ZHANG Wenbo
  • Supported by: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0107902)


Abstract:

Punctuation restoration, also known as punctuation prediction, is the classic Natural Language Processing (NLP) task of adding appropriate punctuation marks to unpunctuated text to improve its readability. With the development of pretrained models and deepening research on punctuation restoration, performance on this task has continuously improved. However, Transformer-based pretrained models are limited in their ability to extract local information from long input sequences, which hinders the prediction of the final punctuation marks. In addition, previous studies have treated punctuation labels merely as symbols to be predicted, overlooking the contextual attributes of different punctuation marks and the relationships between them. To address these issues, this study introduces a Moving average Equipped Gated Attention (MEGA) network as an auxiliary module to strengthen the model's ability to extract local information, and constructs a hierarchical prediction module that fully exploits the contextual attributes of punctuation marks and the relationships between them for the final classification. Experiments are conducted with various Transformer-based pretrained models on datasets in different languages. The results on the English punctuation dataset IWSLT show that applying the MEGA and hierarchical prediction modules yields performance gains for most pretrained models; notably, DeBERTaV3 xlarge achieves an F1 score of 85.5% on the REF test set of IWSLT, an improvement of 1.2 percentage points over the baseline. The proposed method also achieves high accuracy on the Chinese punctuation dataset.
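To make the pipeline described in the abstract concrete, the following is a minimal, illustrative sketch in Python/PyTorch, not the authors' released code: a pretrained Transformer encoder loaded through Hugging Face transformers (microsoft/deberta-v3-base stands in for the DeBERTaV3 models used in the paper), a simplified MEGA-style block in which a damped per-channel exponential moving average feeds a gated single-head attention (the paper's MEGA uses a richer multi-dimensional EMA), and a two-stage hierarchical head that first decides whether any punctuation follows a token and then which mark it is, which is one plausible reading of the hierarchical prediction module. All class names and the label hierarchy are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class SimpleEMA(nn.Module):
    """Damped per-channel exponential moving average over the sequence;
    a simplified stand-in for MEGA's multi-dimensional EMA."""

    def __init__(self, dim):
        super().__init__()
        self.alpha_logit = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                        # x: (batch, seq_len, dim)
        a = torch.sigmoid(self.alpha_logit)      # per-channel decay in (0, 1)
        state = torch.zeros_like(x[:, 0])
        steps = []
        for t in range(x.size(1)):               # O(seq_len) recurrence biases toward local context
            state = a * x[:, t] + (1.0 - a) * state
            steps.append(state)
        return torch.stack(steps, dim=1)


class MegaStyleBlock(nn.Module):
    """EMA-smoothed queries/keys feeding gated single-head attention,
    loosely following the moving-average-equipped gated attention idea."""

    def __init__(self, dim):
        super().__init__()
        self.ema = SimpleEMA(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, h):                        # h: (batch, seq_len, dim)
        smooth = self.ema(h)                     # locality-emphasising representation
        attn_out, _ = self.attn(smooth, smooth, h)
        return self.norm(h + self.gate(h) * attn_out)


class HierarchicalHead(nn.Module):
    """Stage 1: is the token followed by any punctuation at all?
    Stage 2: which mark (e.g. comma / period / question mark)?"""

    def __init__(self, dim, num_marks=3):
        super().__init__()
        self.has_punct = nn.Linear(dim, 2)
        self.which_mark = nn.Linear(dim, num_marks)

    def forward(self, h):
        return self.has_punct(h), self.which_mark(h)


tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
encoder = AutoModel.from_pretrained("microsoft/deberta-v3-base")
dim = encoder.config.hidden_size

mega = MegaStyleBlock(dim)
head = HierarchicalHead(dim)

batch = tokenizer(["how are you i am fine thanks"], return_tensors="pt")
hidden = encoder(**batch).last_hidden_state     # (1, seq_len, dim)
has_punct_logits, mark_logits = head(mega(hidden))  # per-token logits for both stages

In training, the two stages would presumably be optimized jointly, e.g. by summing per-token cross-entropy losses and masking the fine-grained loss at positions where no punctuation follows; the reported 85.5% F1 refers to the paper's full model, not this sketch.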

Key words: punctuation restoration, Natural Language Processing (NLP), pretrained model, Transformer structure, hierarchical prediction