Research on Automatic Extraction of Chinese New Domain-specific Terms Comprising Lettered-words

doi:10.3969/j.issn.1000-3428.2007.02.016

Computer Engineering, 2007, Vol. 33, Issue (02): 47-49.

JIANG Shaohua, DANG Yanzhong

(Institute of Systems Engineering, Dalian University of Technology, Dalian 116024)

Online: 2007-01-20 Published: 2007-01-20

Abstract: Extracting new terms is an important research topic in Chinese information processing. Existing extraction methods have shortcomings, and many technical terms take the form of lettered-words. To address both points, this paper proposes a method that combines statistical techniques with rule-based filtering. The text is first segmented by matching longer strings first together with string-frequency statistics, which yields co-occurring character strings. Meaningless strings are then removed with word collocation rules. Finally, the remaining strings are screened by a domain lexicon and an evaluation function (membership degree) to extract new domain-specific terms. The method can discover special semantic strings, phrases and words of any length whose frequency is at least 2, including unknown words such as lettered-words and technical terms. Experiments show that the method is effective and indicate how the precision of the extracted new terms is distributed.

Key words: Special semantic strings, Matching longer string first, Lettered-words, Chinese natural language processing
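The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the collocation rules, the lexicon, or the evaluation function, so `STOP_CHARS`, the `frequency * token count` score, the `min_score` threshold, and all function names are assumptions made for illustration. "Matching longer strings first" is approximated here by keeping the longest repeated string and dropping any shorter string it contains with the same frequency.

```python
import re
from collections import Counter

# Hypothetical collocation rule: characters that rarely begin or end a term.
STOP_CHARS = set("的了和是在与及或")


def tokenize(segment):
    """One token per run of Latin letters/digits, one token per CJK character."""
    return re.findall(r"[A-Za-z0-9]+|[\u4e00-\u9fff]", segment)


def repeated_strings(text, min_freq=2, max_tokens=8):
    """Return {string: frequency} for repeated token sequences, longest first."""
    counts = Counter()
    # Punctuation and whitespace end a segment, so no candidate crosses them.
    for segment in re.split(r"[^A-Za-z0-9\u4e00-\u9fff]+", text):
        tokens = tokenize(segment)
        for n in range(2, max_tokens + 1):
            for i in range(len(tokens) - n + 1):
                counts["".join(tokens[i:i + n])] += 1

    kept = {}
    for s in sorted(counts, key=len, reverse=True):
        freq = counts[s]
        if freq < min_freq:
            continue
        # A shorter string with the same frequency as a kept longer string
        # that contains it is treated as absorbed by the longer one.
        if any(s in longer and kept[longer] == freq for longer in kept):
            continue
        kept[s] = freq
    return kept


def extract_new_terms(text, lexicon, min_score=4):
    """Apply the rule filter, the lexicon filter, and a placeholder score."""
    results = []
    for s, freq in repeated_strings(text).items():
        if s[0] in STOP_CHARS or s[-1] in STOP_CHARS:
            continue  # rejected by the (assumed) collocation rule
        if s in lexicon:
            continue  # already a known term, so not new
        score = freq * len(tokenize(s))  # placeholder evaluation function
        if score >= min_score:
            results.append((s, score))
    return sorted(results, key=lambda item: -item[1])


sample = "采用GPRS网络传输数据。通过GPRS网络接入。"
print(extract_new_terms(sample, lexicon={"网络"}))  # [('GPRS网络', 6)]
```

Treating a run of Latin letters as a single token keeps a lettered-word such as GPRS intact inside a candidate term, instead of splitting it into character n-grams.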

JIANG Shaohua; DANG Yanzhong. Research on Automatic Extraction of Chinese New Domain-specific Terms Comprising Lettered-words[J]. Computer Engineering, 2007, 33(02): 47-49.

http://www.ecice06.com/CN/Y2007/V33/I02/47
