Computer Engineering

• Advanced Computing and Data Processing

Online New Event Detection for Large-scale Data

CAI Yan-wu 1, GAO Da-qi 1, RUAN Tong 1, JIANG Rui-quan 2

  1. Department of Computer Science and Technology, East China University of Science and Technology, Shanghai 200237, China; 2. Technology Development Department, Shanghai Stock Exchange, Shanghai 200120, China
  • Received: 2013-10-18 Online: 2014-10-15 Published: 2014-10-13
  • About the authors: CAI Yan-wu (born 1989), male, master's student, main research interests: topic detection, pattern recognition, and neural networks; GAO Da-qi, professor, Ph.D.; RUAN Tong, associate professor, Ph.D.; JIANG Rui-quan, Ph.D.
  • Supported by:
    National Key Technology R&D Program of China, project "Securities Industry Cloud Platform Research, Development and Operation" (2012BAH13F02).

Abstract: By analyzing the time consumption of an existing online New Event Detection (NED) algorithm based on news elements, this paper proposes an online NED algorithm for large-scale data environments. The algorithm uses an efficient similar-report search mechanism based on an inverted index to reduce the number of similarity comparisons in the single-pass clustering algorithm. By parallelizing three processes (report preprocessing, report-event comparison, and index search), it improves the running efficiency and scalability of the algorithm in a multi-machine environment. Experimental results show that the algorithm increases the speed of new event detection without affecting the miss probability or the false-alarm probability, and that its throughput reaches 150 to 200 reports/s at a scale of 10 to 100 million reports.

Key words: New Event Detection (NED), single-pass clustering, large-scale data, parallel computing, inverted index, MapReduce architecture
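
The core idea in the abstract, using an inverted index so that single-pass clustering compares a new report only with reports that share a term with it, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the whitespace tokenizer, the plain term-frequency weights, the threshold of 0.5, the nearest-report assignment rule, and all names (`SinglePassDetector`, `tf_vector`, `cosine`) are assumptions made for illustration, and news-element weighting and parallelization are left out.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Turn a report into an L2-normalised term-frequency vector (assumed weighting)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


class SinglePassDetector:
    """Single-pass clustering with inverted-index candidate search (illustrative)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold      # hypothetical similarity threshold
        self.reports = []               # report id -> vector
        self.event_of = []              # report id -> event id
        self.index = defaultdict(set)   # term -> ids of reports containing it
        self.num_events = 0
        self.comparisons = 0            # similarity computations performed

    def add(self, text):
        """Process one report and return (event_id, is_new_event)."""
        vec = tf_vector(text)
        # Only reports sharing at least one term can have non-zero similarity,
        # so the inverted index yields the full set of useful candidates.
        candidates = set()
        for term in vec:
            candidates.update(self.index.get(term, ()))
        best_id, best_sim = None, 0.0
        for rid in candidates:
            self.comparisons += 1
            sim = cosine(vec, self.reports[rid])
            if sim > best_sim:
                best_id, best_sim = rid, sim
        is_new = best_sim < self.threshold
        if is_new:
            event = self.num_events
            self.num_events += 1
        else:
            event = self.event_of[best_id]
        # Register the report so later reports can find it through the index.
        rid = len(self.reports)
        self.reports.append(vec)
        self.event_of.append(event)
        for term in vec:
            self.index[term].add(rid)
        return event, is_new
```

Without the index, each incoming report would be compared with every stored report. With it, reports that share no term are never touched, which is the source of the reduction in similarity comparisons that the abstract describes.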
