Abstract: With the rapid development of information technology, the explosive growth and increasing variety of data pose a challenge for big data retrieval. MapReduce, a parallel processing framework, has clear advantages for processing big data. Drawing on concept lattice theory, this paper uses Formal Concept Analysis (FCA) to discover the relationships among text documents and represents them as a lattice. It proposes a novel conceptual index structure that supports large-scale text retrieval and gives the algorithms for building the conceptual index on the MapReduce framework. In a comparison with the Lucene index, the conceptual index answers queries more efficiently, which validates the proposed index. Experimental results show that representing the relationships among documents as a concept lattice and building a conceptual index on it can significantly improve the performance of large-scale text retrieval.
Key words:
big data,
MapReduce framework,
data retrieval,
Formal Concept Analysis (FCA),
concept lattice,
conceptual index
CLC number: