Computer Engineering (计算机工程)

• Advanced Computing and Data Processing •

Concept Index Supporting Large Scale Text Retrieval Under MapReduce Environment

ZHANG Sheng (张生), HU Jiajing (胡加靖)

  1. (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)
  • Received: 2014-05-19 Online: 2015-07-15 Published: 2015-07-15
  • About the authors: ZHANG Sheng (born 1968), male, senior engineer; main research interests: cloud computing and data mining. HU Jiajing, master's student.

Abstract: With the rapid development of information technology, the explosive growth and increasing variety of data pose challenges for big data retrieval. MapReduce, a parallel processing framework, has clear advantages for processing big data. Drawing on concept lattice theory, this paper uses Formal Concept Analysis (FCA) to discover the relationships among text documents and represents them as a lattice. It proposes a novel formal concept index structure that supports large-scale text retrieval and gives the algorithms for building the concept index on the MapReduce framework. A comparison with the Lucene index verifies the effectiveness of the proposed index, which answers queries more efficiently. Experimental results show that representing the relationships among documents as a concept lattice and building a concept index on it can improve the performance of large-scale text retrieval.

Key words: big data, MapReduce framework, data retrieval, Formal Concept Analysis (FCA), concept lattice, concept index
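
To make the idea in the abstract concrete, the following is a minimal Python sketch of how formal concepts (pairs of a document set and the terms those documents share) can be derived from a toy corpus and then used to answer a conjunctive query. It is an illustration under stated assumptions, not the algorithm from the paper: the names `map_phase`, `reduce_phase`, `formal_concepts`, `search`, and `CORPUS` are hypothetical, the map and reduce steps run in a single process rather than on a MapReduce cluster, and the intersection-closure enumeration is only practical for small inputs.

```python
from collections import defaultdict
from itertools import chain

# Toy corpus (hypothetical): document id -> set of terms.
CORPUS = {
    "d1": {"mapreduce", "index"},
    "d2": {"mapreduce", "lattice"},
    "d3": {"mapreduce", "index", "lattice"},
}

def map_phase(doc_id, terms):
    """Emit (term, doc_id) pairs, as a mapper would."""
    return [(term, doc_id) for term in set(terms)]

def reduce_phase(pairs):
    """Group document ids by term, as a reducer would."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return dict(postings)

def extent(intent, postings, all_docs):
    """Return the documents that contain every term in `intent`."""
    docs = set(all_docs)
    for term in intent:
        docs &= postings.get(term, set())
    return frozenset(docs)

def formal_concepts(corpus):
    """Return all formal concepts of the corpus as (extent, intent) pairs."""
    all_docs = set(corpus)
    all_terms = frozenset(chain.from_iterable(corpus.values()))
    pairs = chain.from_iterable(map_phase(d, ts) for d, ts in corpus.items())
    postings = reduce_phase(pairs)
    # The intents are the full term set plus every intersection of
    # document term sets; build that closure one document at a time.
    intents = {all_terms}
    for terms in corpus.values():
        doc_terms = frozenset(terms)
        intents |= {doc_terms & known for known in intents}
    return {(extent(i, postings, all_docs), i) for i in intents}

def search(concepts, query):
    """Return the documents containing all query terms, using the concepts."""
    q = frozenset(query)
    # Among concepts whose intent covers the query, the one with the
    # largest extent holds exactly the matching documents.
    matches = [ext for ext, intent in concepts if q <= intent]
    return max(matches, key=len, default=frozenset())
```

With this sketch, `search(formal_concepts(CORPUS), {"index"})` looks up the concept whose shared terms cover the query instead of scanning every document. Ordering the concepts by set inclusion of their extents would yield the concept lattice itself.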

CLC number: