Computer Engineering (计算机工程)

• Artificial Intelligence and Recognition Technology •

General Semi-structured Data Retrieval Based on Graph Model (基于图模型的通用半结构化数据检索)

KANG Jihua (康积华) 1, ZHANG Qi (张奇) 1,2

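  1. (1. School of Computer Science, Fudan University, Shanghai 201203; 2. Shanghai Key Laboratory of Intelligent Information Processing, Shanghai 200433)
  • Received: 2014-07-30 Publication date: 2015-08-15 Release date: 2015-08-15
  • About the authors: KANG Jihua (born 1991), male, master's student, main research interests: artificial intelligence and information retrieval; ZHANG Qi, associate professor.
  • Funding:

    National Natural Science Foundation of China (61472088, 61473092).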

General Semi-structured Data Retrieval Based on Graph Model

KANG Jihua 1, ZHANG Qi 1,2

  1. (1. School of Computer Science, Fudan University, Shanghai 201203, China; 2. Shanghai Key Laboratory of Intelligent Information Processing, Shanghai 200433, China)
  • Received: 2014-07-30 Online: 2015-08-15 Published: 2015-08-15

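Abstract (translated from the Chinese):

As users gain more freedom in how they phrase queries, existing semi-structured data retrieval models can no longer meet their needs. To address this problem, a new semi-structured data retrieval model is proposed. After the original query is segmented into words, the resulting terms are taken as the basic elements. Each term is assigned a weight through feature functions, a naive Bayes based content-attribute matching method sets the content-attribute matching probabilities, and a string similarity algorithm based on edit distance is used to improve retrieval quality. Real query records are randomly sampled from the query log of a commercial search site and manually labeled with correct answers for performance evaluation. Experimental results show that, compared with the hierarchical language model and the probabilistic retrieval model for semi-structured data, the proposed model effectively improves retrieval performance on semi-structured data.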
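
The abstract does not give the exact form of the naive Bayes content-attribute matching step, so the following is only a minimal sketch of the general idea, under assumptions: records are plain dictionaries mapping attribute names to text, terms are whitespace-separated, and add-alpha smoothing is used. The names `train_attribute_model`, `attribute_posterior`, and the toy `records` are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def train_attribute_model(records):
    """Count term occurrences per attribute over a list of {attribute: text} records."""
    term_counts = defaultdict(Counter)  # attribute -> Counter of terms
    for record in records:
        for attribute, text in record.items():
            term_counts[attribute].update(text.lower().split())
    return term_counts


def attribute_posterior(term, term_counts, alpha=1.0):
    """Estimate P(attribute | term) with Bayes' rule and add-alpha smoothing (assumed)."""
    vocabulary = {t for counts in term_counts.values() for t in counts}
    total_tokens = sum(sum(c.values()) for c in term_counts.values())
    if total_tokens == 0:
        return {}
    scores = {}
    for attribute, counts in term_counts.items():
        attr_tokens = sum(counts.values())
        prior = attr_tokens / total_tokens  # P(attribute)
        likelihood = (counts[term] + alpha) / (attr_tokens + alpha * len(vocabulary))
        scores[attribute] = prior * likelihood  # unnormalized posterior
    norm = sum(scores.values())
    return {a: s / norm for a, s in scores.items()}


# Toy data, invented purely for illustration.
records = [
    {"title": "star wars", "director": "george lucas"},
    {"title": "jurassic park", "director": "steven spielberg"},
]
model = train_attribute_model(records)
posterior = attribute_posterior("spielberg", model)
print(max(posterior, key=posterior.get))
```

In this toy collection the query term "spielberg" is assigned most of its probability mass to the `director` attribute, which is the kind of signal a retrieval model can use to decide which field a free-form query term most likely refers to.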

Keywords (translated from the Chinese): semi-structured data, query, data retrieval, graph model, global factor, feature set

Abstract:

As users gain more freedom in how they enter queries, existing semi-structured data retrieval methods can no longer meet their requirements. To solve this problem, a novel semi-structured data retrieval model based on the factor graph model is proposed. The framework combines term weighting, Bayesian attribute mapping, and edit-distance-based string similarity metrics to improve retrieval performance. A number of queries are randomly selected from the logs of a commercial search engine and manually labeled for analysis and evaluation. Experimental results show that, compared with baselines such as the Hierarchical Language Model (HLM) and the Probabilistic Retrieval Model for Semi-structured Data (PRMS), this model effectively improves retrieval performance on semi-structured data.
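
To make the edit-distance component concrete, here is a minimal sketch of a string similarity score built on Levenshtein distance. The abstract does not specify how the distance is turned into a similarity, so normalizing by the length of the longer string is an assumption, and the function names `edit_distance` and `similarity` are illustrative rather than taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical (normalization is assumed)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))  # 3
print(round(similarity("spielberg", "spilberg"), 3))  # 0.889
```

A score of this kind lets a misspelled query term such as "spilberg" still match the indexed value "spielberg" with high similarity instead of being treated as a complete mismatch.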

Key words: semi-structured data, query, data retrieval, graph model, global factor, feature set

CLC number: