
Computer Engineering, 2022, Vol. 48, Issue (5): 91-97. doi: 10.19678/j.issn.1000-3428.0061106

• Artificial Intelligence and Pattern Recognition •

Data Augmentation Method for AMR-to-Text Generation

FU Yeqiang, LI Junhui

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2021-03-12 Revised: 2021-05-17 Published: 2021-05-21
  • About the authors: FU Yeqiang (born 1996), female, master's student; main research interests: natural language processing, machine translation, and AMR-to-text generation. LI Junhui, associate professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61876120).

Abstract: In Abstract Meaning Representation (AMR)-to-text generation, the quality of the conversion from AMR graphs to text depends heavily on corpus size. This paper proposes a simple and effective dynamic data augmentation method that improves AMR-to-text generation when the annotated dataset is of limited size. The decoder of the AMR-to-text generation model is treated as a language model, and a word-level augmentation method dynamically replaces target-side words at random to produce noisy data, which strengthens the model's generalization ability. When data is loaded, some words in the target sentence are randomly selected and noised; the constraint encoder then predicts the covered words and restores the original sentence, giving the model a deeper language representation ability. Experiments on the AMR2.0 and AMR3.0 English benchmark datasets show that the method effectively improves the performance of the AMR-to-text generation system. Compared with a baseline Transformer model trained without noise, it achieves better BLEU, Meteor, and chrF++ scores: BLEU improves by 0.68 and 0.64 on the two datasets, respectively, when training on manually annotated data, and by 0.60 and 0.68 when large-scale automatically annotated data is used.

Keywords: Abstract Meaning Representation (AMR), corpus size, AMR-to-text generation, dynamic data augmentation, noise

CLC number:
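The dynamic, word-level noising described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 15% noise rate, and uniform sampling of replacement words from a vocabulary list are all assumptions made for illustration, since the abstract does not give these details. The sketch only shows the general idea of re-sampling noise each time a target sentence is loaded while keeping the original sentence as the prediction target.

```python
import random


def add_noise(tokens, vocab, noise_prob=0.15, rng=None):
    """Randomly replace target-side words (hypothetical helper).

    Returns (noisy_tokens, noised_positions). The input list is not modified.
    noise_prob and uniform sampling from vocab are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    noisy = list(tokens)
    positions = []
    for i in range(len(noisy)):
        # Each position is noised independently with probability noise_prob.
        if rng.random() < noise_prob:
            noisy[i] = rng.choice(vocab)
            positions.append(i)
    return noisy, positions


def make_training_example(amr_tokens, target_tokens, vocab,
                          noise_prob=0.15, rng=None):
    """Build one training example at data-loading time (hypothetical helper).

    Because this runs on every load, the same sentence receives different
    noise in different epochs, which is what makes the augmentation dynamic.
    The usual one-token shift of the decoder input is omitted for brevity.
    """
    noisy, positions = add_noise(target_tokens, vocab, noise_prob, rng)
    return {
        "source": list(amr_tokens),      # linearized AMR graph
        "decoder_input": noisy,          # noised target sentence
        "labels": list(target_tokens),   # original sentence to restore
        "noised_positions": positions,
    }


if __name__ == "__main__":
    amr = ["want-01", ":ARG0", "boy", ":ARG1", "go-02"]
    target = "the boy wants to go".split()
    vocab = ["a", "girl", "likes", "run", "home"]
    for epoch in range(2):
        print(epoch, make_training_example(amr, target, vocab, 0.3))
```

Running the script twice, or over several epochs, generally yields different noised inputs for the same sentence, while `labels` always holds the unmodified target.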