作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2022, Vol. 48 ›› Issue (2): 79-85,91. doi: 10.19678/j.issn.1000-3428.0060269

• 人工智能与模式识别 • 上一篇    下一篇

连续时间区间内的频繁词序列挖掘算法

王璐1, 刘晓清2, 何震瀛2   

  1. 1. 复旦大学 软件学院, 上海 200441;
    2. 复旦大学 计算机科学技术学院, 上海 200433
  • 收稿日期:2020-12-14 修回日期:2021-01-25 发布日期:2022-02-14
  • 作者简介:王璐(1996-),女,硕士研究生,主研方向为文本数据分析;刘晓清,研究员、博士;何震瀛,副教授。
  • 基金资助:
    国家自然科学基金(61732004)。

Frequent Word Sequence Mining Algorithm in Continuous Time Interval

WANG Lu1, LIU Xiaoqing2, HE Zhenying2   

  1. 1. School of Software, Fudan University, Shanghai 200441, China;
    2. School of Computer Science, Fudan University, Shanghai 200433, China
  • Received:2020-12-14 Revised:2021-01-25 Published:2022-02-14

摘要: 查询文本中频繁出现的短语可快速掌握文本内容,然而传统频繁词序列挖掘算法面向挖掘任务时的时间复杂度较高,无法满足频繁更换查询条件及快速获得反馈的查询需求。利用基于频率树的快速频繁词序列挖掘算法(TS_Mining),在保持后缀树线性构造时间的情况下实现文本集合中频繁词序列的查询,并采用树型索引结构避免多次扫描文本集合,降低算法时间复杂度。针对连续时间区间内的频繁词序列查询问题,提出改进的剪枝挖掘算法(TS_Pruning),通过减少频率树的扫描范围进一步提高挖掘效率。实验结果表明,TS_Mining与TS_Pruning算法的运行时间相比经典Apriori挖掘算法约减少了2个数量级,具有更高的频繁词序列挖掘效率。

关键词: 频繁词序列, 后缀树, 数据挖掘, 频繁项集, 热点话题检测

Abstract: Mining frequent phrases in a text assists in the quick understanding of the content, but traditional algorithms for frequent word sequence mining usually have high time complexity in mining tasks, and fail to deal with frequently changing query criteria or respond to queries in real time.This paper gives an algorithm for rapidly mining frequent word sequences based on frequency tree, called TS_Mining.The algorithm can implement the query of the frequent word sequence in a text set while maintaining the linear construction time of the suffix tree.Through the tree index structure, repeated scanning of the text set can be avoided to reduce the algorithm complexity.In addition, in order to increase the efficiency of frequent word sequence mining in a continuous time interval, an improved pruning mining algorithm, called TS_Pruning is designed, which reduces the scanning range of the frequency tree.The experimental results show that compared with the Apriori mining algorithm, the TS_Mining algorithm and TS_Pruning algorithm reduces the running time by about 2 orders of magnitude, and displays a higher efficiency in frequent word sequence mining.

Key words: frequent word sequence, suffix tree, data mining, frequent itemset, hot topic detection

中图分类号: