
Computer Engineering, 2021, Vol. 47, Issue (5): 277-284. doi: 10.19678/j.issn.1000-3428.0057677

• Development Research and Engineering Application •

Triple Extraction Model for Legal Texts

CHEN Yanguang1, WANG Lei2, SUN Yuanyuan1, WANG Zhizheng1, ZHANG Shuchen1

  1. School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China;
  2. The Third Procuratorial Department, People's Procuratorate of Liaoning Province, Shenyang 110033, China
  • Received: 2020-03-11 Revised: 2020-05-08 Published: 2020-05-25
  • About the authors: CHEN Yanguang (born 1995), female, master's student, whose main research interest is natural language processing; WANG Lei, Ph.D. candidate; SUN Yuanyuan (corresponding author), professor and doctoral supervisor; WANG Zhizheng, Ph.D. candidate; ZHANG Shuchen, master's student.
  • Funding:
    National Key Research and Development Program of China (2018YFC0830603).

Abstract: The open-source criminal judgment documents on China Judgments Online contain important legal information. However, these documents are usually recorded in natural language, which machines cannot readily interpret. To transform the unstructured, natural-language text of criminal judgments into structured triples, this paper proposes a judicial triple extraction model for legal texts. The model treats triple extraction as a two-stage pipeline: the pretrained Bidirectional Encoder Representations from Transformers (BERT) model first performs Named Entity Recognition (NER), and the recognition results are then passed to the relation extraction stage to obtain the corresponding triple representations, completing information extraction from the unstructured text of criminal judgments. Experimental results on a manually annotated dataset of criminal judgments show that the F1 score of the proposed model is 28.1 percentage points higher than that of a combined model based on recurrent neural networks, indicating better triple extraction performance.

Key words: Named Entity Recognition (NER), relation extraction, pretrained language model, Transformer encoder, pipeline structure
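
The two-stage pipeline described in the abstract (entity recognition first, then relation extraction over the recognized entities) can be illustrated with a minimal sketch. This is not the authors' implementation: the BERT-based models are replaced by toy rule-based stand-ins, and the entity labels (`PERSON`, `CRIME`), the relation name (`convicted_of`), and all function names are hypothetical, chosen only to show how the output of stage one feeds stage two.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Entity:
    text: str   # surface form found in the sentence
    label: str  # entity type (hypothetical label set)
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity


Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def toy_ner(sentence: str) -> List[Entity]:
    """Stage 1 stand-in: dictionary lookup instead of a fine-tuned BERT tagger."""
    lexicon = {"Zhang San": "PERSON", "theft": "CRIME"}  # hypothetical lexicon
    entities = []
    for surface, label in lexicon.items():
        start = sentence.find(surface)
        if start != -1:
            entities.append(Entity(surface, label, start, start + len(surface)))
    return sorted(entities, key=lambda e: e.start)


def toy_relation(sentence: str, head: Entity, tail: Entity) -> Optional[str]:
    """Stage 2 stand-in: one hand-written rule instead of a BERT classifier."""
    if head.label == "PERSON" and tail.label == "CRIME" and "convicted of" in sentence:
        return "convicted_of"  # hypothetical relation name
    return None


def extract_triples(
    sentence: str,
    ner: Callable[[str], List[Entity]] = toy_ner,
    classify: Callable[[str, Entity, Entity], Optional[str]] = toy_relation,
) -> List[Triple]:
    """Run NER, then classify every ordered pair of distinct entities."""
    entities = ner(sentence)
    triples: List[Triple] = []
    for head in entities:
        for tail in entities:
            if head is tail:
                continue
            relation = classify(sentence, head, tail)
            if relation is not None:
                triples.append((head.text, relation, tail.text))
    return triples


if __name__ == "__main__":
    print(extract_triples("Zhang San was convicted of theft."))
```

Because the second stage only sees entity pairs produced by the first, any entity missed during recognition cannot appear in a triple; this dependency is inherent to a pipeline structure. In a real system, the `ner` and `classify` arguments would be replaced by trained models.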

CLC number: