
Computer Engineering (计算机工程), 2007, Vol. 33, Issue 02: 47-49. doi: 10.3969/j.issn.1000-3428.2007.02.016

• Software Technology and Database •

Research on Automatic Extraction of Chinese New Domain-specific Terms Comprising Lettered-words

JIANG Shaohua, DANG Yanzhong

  1. (Institute of Systems Engineering, Dalian University of Technology, Dalian 116024)
  • Online: 2007-01-20 Published: 2007-01-20

Abstract: Extracting new domain-specific terms is an important research topic in Chinese natural language processing. To address the shortcomings of existing extraction methods, and because many domain-specific terms take the form of lettered-words, this paper proposes an approach that combines statistical techniques with rule-based filtering. The text is first segmented by matching longer strings first, combined with string-frequency statistics, which yields co-occurring character strings. Meaningless strings are then filtered out with word-collocation rules. Finally, the remaining candidates are screened against a domain lexicon and an evaluation function (membership degree) to obtain new domain-specific terms. The method can find special semantic strings, phrases, and words of arbitrary length whose frequency is at least 2, including unknown words such as lettered-words and domain-specific terms. Experiments demonstrate the effectiveness of the method and show how the precision of the extracted new terms is distributed.

Key words: special semantic strings, matching longer strings first, lettered-words, Chinese natural language processing
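
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `BAD_EDGE_CHARS` rule set, and the `max_len` cap are all assumptions made for illustration, and the evaluation function (membership degree) is omitted because the abstract does not define it. The sketch counts repeated character strings, keeps longer strings in preference to the shorter strings they fully account for, applies a toy collocation rule, and drops terms already in a domain lexicon.

```python
import re
from collections import Counter

# Hypothetical collocation rule: a candidate may not start or end
# with one of these common function characters.
BAD_EDGE_CHARS = set("的了是在和与及或")


def repeated_strings(text, min_freq=2, max_len=8):
    """Count substrings of length 2..max_len and keep those seen min_freq+ times."""
    counts = Counter()
    # Split on anything that is not a CJK character, ASCII letter, or digit,
    # so that candidates never span punctuation.
    for run in re.split(r"[^0-9A-Za-z\u4e00-\u9fff]+", text):
        for n in range(2, max_len + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_freq}


def longest_first(freqs):
    """Drop a string when a longer kept string contains it with the same frequency."""
    kept = {}
    for s in sorted(freqs, key=len, reverse=True):
        if not any(s in longer and freqs[s] == c for longer, c in kept.items()):
            kept[s] = freqs[s]
    return kept


def passes_rules(s):
    """Toy rule filter: reject bad edge characters and purely numeric strings."""
    return (s[0] not in BAD_EDGE_CHARS
            and s[-1] not in BAD_EDGE_CHARS
            and not s.isdigit())


def extract_new_terms(text, lexicon, min_freq=2, max_len=8):
    """Return candidate new terms that pass the rules and are not in the lexicon."""
    candidates = longest_first(repeated_strings(text, min_freq, max_len))
    return sorted(s for s in candidates if passes_rules(s) and s not in lexicon)
```

For the three-sentence text "采用GPS定位系统的方案。新的GPS定位系统已经上线。传统定位系统精度较低。" and a lexicon containing only "定位系统", `extract_new_terms` returns `['GPS定位系统']`: the lettered term occurs twice and absorbs all of its equally frequent substrings, while "定位系统" occurs three times, survives the longest-first step, and is then removed as a known lexicon entry.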