
Computer Engineering ›› 2010, Vol. 36 ›› Issue (06): 1-4. doi: 10.3969/j.issn.1000-3428.2010.06.001

• Doctoral Dissertation •

Research and Design of General Text Processing Method

SONG You¹, LIANG Shi-xing², HUANG Lu¹,²

  1. College of Software, Beihang University, Beijing 100191; 2. IBM China Development Lab, Beijing 100193
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2010-03-20 Published: 2010-03-20

Abstract: This paper designs rules that describe general text processing logic, together with an engine that executes them, so that developing a text processing program reduces to writing application rules. The rule data model is described in XML; its elements are atomic rules, rule sets, preconditions (rule-applications), and data contexts. Within a rule, regular expressions perform the text matching, while escape characters and a script language implement the transformation logic applied to the matched results. An experiment that extracts topic text from Web pages verifies the soundness of the rules and the effectiveness of the engine.

Key words: text processing, regular expression, script language
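
The rule-and-engine idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the XML element and attribute names (`ruleset`, `rule`, `pattern`, `replace`, `precondition`, `script`), the `SCRIPTS` registry, and the sample rules are all hypothetical, chosen only to show how atomic rules with regular-expression matching, backreference escapes, an optional precondition, and a scripted transformation might fit together.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule schema: every element and attribute name here is an
# assumption made for illustration, not the schema defined in the paper.
RULES_XML = r"""
<ruleset name="clean-web-text">
  <rule name="strip-tags" pattern="&lt;[^&gt;]+&gt;" replace=""/>
  <rule name="collapse-space" pattern="\s+" replace=" "/>
  <rule name="iso-date" precondition="\d{2}/\d{2}/\d{4}"
        pattern="(\d{2})/(\d{2})/(\d{4})" replace="\3-\1-\2"/>
  <rule name="upper-label" pattern="^\w+:" script="upper"/>
</ruleset>
"""

# Stand-in for the "script language" hook: named Python callables that
# receive a match object and return the replacement text.
SCRIPTS = {
    "upper": lambda match: match.group(0).upper(),
}


def load_rules(xml_text):
    """Parse the XML rule set into a list of atomic rules, in document order."""
    root = ET.fromstring(xml_text)
    rules = []
    for node in root.iter("rule"):
        precondition = node.get("precondition")
        rules.append({
            "name": node.get("name"),
            "pattern": re.compile(node.get("pattern")),
            "replace": node.get("replace", ""),
            "script": node.get("script"),
            "precondition": re.compile(precondition) if precondition else None,
        })
    return rules


def apply_rules(rules, text):
    """Run each rule over the text; skip rules whose precondition fails."""
    for rule in rules:
        if rule["precondition"] and not rule["precondition"].search(text):
            continue
        # A scripted rule transforms each match with a callable; otherwise the
        # replacement string may use escapes such as \1 for captured groups.
        replacement = SCRIPTS[rule["script"]] if rule["script"] else rule["replace"]
        text = rule["pattern"].sub(replacement, text)
    return text


rules = load_rules(RULES_XML)
result = apply_rules(rules, "<p>date:  03/20/2010</p>")  # "DATE: 2010-03-20"
```

In this sketch, changing the processing logic means editing `RULES_XML` rather than the engine code, which is the simplification the abstract describes; data contexts are omitted for brevity.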
