
Computer Engineering ›› 2022, Vol. 48 ›› Issue (7): 13-21, 28. doi: 10.19678/j.issn.1000-3428.0064163

• Hot Topics and Reviews •

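An Efficient Cross-Platform Workflow Optimization Method

DU Qinghua1, ZHANG Kai2

  1. School of Software, Fudan University, Shanghai 200438, China;
  2. School of Computer Science, Fudan University, Shanghai 200438, China
  • Received: 2022-03-14 Revised: 2022-04-21 Published: 2022-07-15 Released: 2022-05-04
  • About the authors: DU Qinghua (born 1996), male, master's student, whose main research interests are data science and data engineering; ZHANG Kai, associate professor, Ph.D.
  • Funding:
    National Key Research and Development Program of China (2018YFB1402602).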



Abstract: To handle complex data analysis tasks, researchers have developed cross-platform data processing systems that combine multiple platforms. In such a system, the choice of platform for each operator in a cross-platform workflow is critical to performance, because the same operator can perform very differently on different platforms. Platform selection is currently done mainly with cost-based optimization; however, existing cost models cannot mine the latent information in cross-platform workflows, which leads to inaccurate cost estimates. This paper proposes an efficient cross-platform workflow optimization method. The method uses the GAT-BiGRU-FC Network (GGFN) as the cost model, with operator features and workflow features as model inputs. A graph attention mechanism captures the structural information of the Directed Acyclic Graph (DAG)-type cross-platform workflow and the information of each operator's neighbor nodes, while a Gated Recurrent Unit (GRU) memorizes the execution timing information of the operators; together these yield accurate cost estimates. On this basis, an enumeration algorithm over operator implementation platforms is designed according to the characteristics of cross-platform workflows. The algorithm uses the GGFN-based cost model and a delay-greedy pruning method during enumeration to select a suitable implementation platform for each operator. Experimental results show that the method improves the execution performance of cross-platform workflows by 3x and reduces the runtime by more than 60%.
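The general idea of cost-based platform enumeration with pruning can be illustrated with a minimal sketch. It is not the paper's algorithm: the hand-written cost table stands in for the learned GGFN cost model, and the pruning rule (keep the `keep` cheapest partial plans after each operator, i.e. a small beam search) stands in for the delay-greedy pruning named in the abstract. The operator names, the platform names `A` and `B`, the cost values, and the helpers `topological_order`, `make_toy_cost`, and `select_platforms` are all hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Plan = Dict[str, str]  # operator name -> chosen platform


def topological_order(dag: Dict[str, List[str]]) -> List[str]:
    """Kahn's algorithm. `dag` maps each operator to its downstream operators."""
    indegree = {op: 0 for op in dag}
    for successors in dag.values():
        for succ in successors:
            indegree[succ] += 1
    ready = deque(op for op, deg in indegree.items() if deg == 0)
    order: List[str] = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for succ in dag[op]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    if len(order) != len(dag):
        raise ValueError("workflow graph contains a cycle")
    return order


def make_toy_cost(
    dag: Dict[str, List[str]],
    op_cost: Dict[Tuple[str, str], float],
    transfer_cost: float,
) -> Callable[[Plan], float]:
    """Toy stand-in for a learned cost model (assumption, not GGFN)."""
    def cost(plan: Plan) -> float:
        # Per-operator execution cost on the chosen platform.
        total = sum(op_cost[(op, platform)] for op, platform in plan.items())
        # Penalty for every assigned edge that crosses platforms.
        for src, successors in dag.items():
            for dst in successors:
                if src in plan and dst in plan and plan[src] != plan[dst]:
                    total += transfer_cost
        return total
    return cost


def select_platforms(
    dag: Dict[str, List[str]],
    platforms: Dict[str, List[str]],
    estimate_cost: Callable[[Plan], float],
    keep: int = 2,
) -> Plan:
    """Extend partial plans operator by operator, keeping the `keep` cheapest."""
    partial: List[Plan] = [{}]
    for op in topological_order(dag):
        extended = [{**plan, op: p} for plan in partial for p in platforms[op]]
        extended.sort(key=estimate_cost)
        partial = extended[:keep]  # prune: drop all but the cheapest candidates
    return partial[0]


# Hypothetical three-operator workflow: read -> filter -> train.
dag = {"read": ["filter"], "filter": ["train"], "train": []}
platforms = {op: ["A", "B"] for op in dag}
op_cost = {
    ("read", "A"): 1, ("read", "B"): 4,
    ("filter", "A"): 2, ("filter", "B"): 3,
    ("train", "A"): 10, ("train", "B"): 3,
}
cost = make_toy_cost(dag, op_cost, transfer_cost=2)
plan = select_platforms(dag, platforms, cost, keep=2)
print(plan, cost(plan))
```

In this toy setup the pruned search happens to reach the same plan as exhaustive enumeration over all eight assignments; in general, pruning trades that guarantee for a smaller search space, which is why the accuracy of the cost model matters.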

Key words: cross-platform workflow, GGFN model, graph attention mechanism, Gated Recurrent Unit (GRU), enumeration algorithm

CLC Number: