
Computer Engineering ›› 2014, Vol. 40 ›› Issue (12): 57-62,67. doi: 10.3969/j.issn.1000-3428.2014.12.010

• Advanced Computing and Data Processing •

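A Hybrid Algorithm of Automatic Domain Concept Taxonomy Construction

LUO Nianjie, LV Zhao

  1. Department of Computer Science and Technology, East China Normal University, Shanghai 200241, China
  • Received: 2013-12-24 Revised: 2014-01-15 Online: 2014-12-15 Published: 2015-01-16
  • About the authors: LUO Nianjie (born 1989), female, master's student; main research interests: big data analysis and knowledge processing. LV Zhao (corresponding author), associate professor.
  • Funding:
    National Key Technology R&D Program of China (2012BAH74F02); Scientific Research Fund of the Science and Technology Commission of Shanghai Municipality (12dz1500205).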

A Hybrid Algorithm of Automatic Domain Concept Taxonomy Construction

LUO Nianjie, LV Zhao

  1. Department of Computer Science and Technology, East China Normal University, Shanghai 200241, China
  • Received: 2013-12-24 Revised: 2014-01-15 Online: 2014-12-15 Published: 2015-01-16

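Abstract: Automatic construction of domain concept taxonomies plays an important role in artificial intelligence, natural language processing, and information retrieval. Existing work, however, focuses mostly on general knowledge, with little research targeting specific domains, and it suffers from low accuracy in extracting relations between domain concepts and low efficiency of the construction algorithms. To address this, a hybrid algorithm for automatic domain concept taxonomy construction is proposed, consisting mainly of a module for extracting relations between domain concepts and a taxonomy construction module. The relation extraction module is designed around the characteristics of Chinese and combines a syntax tree method with a rule-based method to improve the precision and recall of relation extraction. The taxonomy construction module uses an improved BRT algorithm, which lowers the algorithm's complexity while raising the precision of the constructed domain taxonomy. Experimental results in the telecommunications, finance, and computer domains all show that the proposed algorithm builds better taxonomies than the BRT algorithm, with precision reaching as high as 89.3%.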

Keywords: domain concept taxonomy, Bayesian rose tree, syntax tree

Abstract: Automatic construction of domain concept taxonomies plays an important role in artificial intelligence, natural language processing, and information retrieval. Existing approaches pay more attention to common knowledge, while fewer studies address domain concepts. The two main challenges in automatic domain concept taxonomy construction are identifying relationships between concepts accurately and the low efficiency of current algorithms. In this paper, a Hybrid algorithm of Automatic Domain concept Taxonomy construction (HADT) is proposed, which has two main modules: extraction of relationships between domain concepts, and automatic taxonomy construction. Considering the characteristics of Chinese, the first module combines a syntax tree method with a rule-based method to achieve higher precision and higher recall. The second module uses an improved BRT algorithm to reduce time complexity and improve the precision of taxonomy construction. Experiments conducted on three datasets from the mobile, finance, and computer domains show that the HADT algorithm is effective compared with the BRT algorithm, with the highest precision reaching 89.3%.

Key words: domain concept taxonomy, Bayesian Rose Tree (BRT), syntax tree
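
The rule-based half of the relation extraction module can be illustrated with a minimal sketch. The abstract does not list the paper's actual rules, so the two lexical patterns below ("X是一种Y", roughly "X is a kind of Y", and "Y包括A、B和C", roughly "Y includes A, B and C"), the function name `extract_isa_pairs`, and the sample sentences are all assumptions made for illustration. The syntax tree component and the improved BRT step are not shown.

```python
import re

# Hypothetical rules: the paper's real rule set is not given in the abstract.
# Rule 1: "X是一种Y" / "X是一类Y"  ->  X is a kind of Y
IS_A = re.compile(r"(?P<hypo>\w+?)是一[种类](?P<hyper>\w+)")
# Rule 2: "Y包括A、B和C" / "Y包含A、B和C"  ->  A, B, C are kinds of Y
INCLUDES = re.compile(r"(?P<hyper>\w+?)(?:包括|包含)(?P<items>[\w、]+)")


def extract_isa_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the two illustrative rules."""
    pairs = []

    match = IS_A.search(sentence)
    if match:
        pairs.append((match.group("hypo"), match.group("hyper")))

    match = INCLUDES.search(sentence)
    if match:
        hypernym = match.group("hyper")
        # Split the enumeration on the list separator and on "和"/"及" (and).
        # This is naive: a concept that itself contains "和" would be split.
        for item in re.split(r"[、和及]", match.group("items")):
            item = item.rstrip("等")  # drop a trailing "等" (etc.)
            if item:
                pairs.append((item, hypernym))

    return pairs


if __name__ == "__main__":
    for text in ["蓝牙是一种无线通信技术。", "金融产品包括股票、债券和基金。"]:
        print(text, extract_isa_pairs(text))
```

Surface patterns like these are typically precise but miss many sentences, which is one plausible reason to pair them with syntax tree analysis as the abstract describes.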
