
Computer Engineering

• Development Research and Engineering Application •

Named Entity Extraction of Traditional Chinese Medicine Medical Records Based on Conditional Random Field

LIU Kai 1a, ZHOU Xue-zhong 1a,1b, YU Jian 1a,1b, ZHANG Run-shun 2

  (1a. School of Computer and Information Technology; 1b. Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China; 2. Guang'anmen Hospital, China Academy of Chinese Medical Sciences, Beijing 100053, China)
  • Received: 2013-06-07  Online: 2014-09-15  Published: 2014-09-12
  • About the authors: LIU Kai (born 1986), male, M.S., main research interest: text information extraction; ZHOU Xue-zhong (corresponding author), associate professor; YU Jian and ZHANG Run-shun, professors.
  • Funding:
    National Natural Science Foundation of China (61105055, 81230086); National High Technology Research and Development Program of China ("863" Program) (2012AA02A609); Fundamental Research Funds for the Central Universities (K13JB00140).

Abstract: Traditional Chinese Medicine (TCM) clinical medical records are an important data resource for TCM research. They are still recorded mainly as free text, so in-depth analysis of the record data first requires structured information extraction, of which named entity extraction is the basic step. To extract named entities such as symptoms, diseases, and inducing factors from TCM clinical records, this paper applies Conditional Random Field (CRF), Hidden Markov Model (HMM), and Maximum Entropy Markov Model (MEMM) to 413 manually annotated medical records, using Chinese characters as features together with four types of feature templates, and compares the results. With appropriate feature templates, CRF performs well, reaching F1 values of 0.80 for symptoms, 0.74 for disease names, and 0.74 for inducing factors. Compared with HMM and MEMM, CRF has the highest precision and recall. These preliminary results indicate that CRF is a suitable method for named entity extraction from TCM clinical medical records.

Key words: Traditional Chinese Medicine (TCM) medical records, named entity extraction, corpus annotation system, Conditional Random Field (CRF), feature template

CLC number:
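
The abstract describes character-level features combined with feature templates and reports F1 per entity type. The Python sketch below illustrates those two ideas only and is not the authors' implementation: it builds one hypothetical character-window template and scores predictions at the entity level. The template shape, the BIO tagging scheme, the `SYM` tag name, and the example sentence are all assumptions made for illustration; the paper's four templates are not given on this page.

```python
def char_features(chars, i, window=2):
    """Features for position i from a character window (hypothetical template)."""
    n = len(chars)
    feats = {}
    # Unigram features: the character at each offset inside the window.
    for off in range(-window, window + 1):
        j = i + off
        feats[f"U[{off}]"] = chars[j] if 0 <= j < n else "<PAD>"
    # Bigram features: the current character paired with each neighbour.
    prev_c = chars[i - 1] if i > 0 else "<PAD>"
    next_c = chars[i + 1] if i + 1 < n else "<PAD>"
    feats["B[-1,0]"] = prev_c + "/" + chars[i]
    feats["B[0,1]"] = chars[i] + "/" + next_c
    return feats


def bio_spans(tags):
    """Return the set of (start, end, type) entities in a BIO tag sequence.

    `end` is exclusive. An I- tag with no matching open entity is ignored.
    """
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def prf1(gold_tags, pred_tags):
    """Entity-level precision, recall and F1 (exact span and type match)."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    chars = list("咳嗽三天")  # made-up example: "cough for three days"
    print(char_features(chars, 0, window=1))
    gold = ["B-SYM", "I-SYM", "O", "B-SYM", "I-SYM"]
    pred = ["B-SYM", "I-SYM", "O", "B-SYM", "O"]
    print(prf1(gold, pred))
```

In a real pipeline, feature dictionaries like these would be passed to a CRF toolkit for training. Exact-match scoring over whole entities is stricter than per-character accuracy, which is one common reason entity-level F1 is the figure reported.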