[1] GROPP W, LUSK E, SKJELLUM A. Using MPI: portable parallel programming with the message passing interface[M]. Cambridge, USA: MIT Press, 1994.
[2] STELLNER G. CoCheck: checkpointing and process migration for MPI[C]//Proceedings of International Conference on Parallel Processing. Washington D.C., USA: IEEE Press, 1996: 526-531.
[3] GABRIEL E, FAGG G E, BOSILCA G, et al. Open MPI: goals, concept, and design of a next generation MPI implementation[C]//Proceedings of European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting. Berlin, Germany: Springer, 2004: 97-104.
[4] STONE J E, GOHARA D, SHI G C. OpenCL: a parallel programming standard for heterogeneous computing systems[J]. Computing in Science & Engineering, 2010, 12(3): 66-73.
[5] JÄÄSKELÄINEN P, LAMA C S, SCHNETTER E, et al. POCL: a performance-portable OpenCL implementation[J]. International Journal of Parallel Programming, 2015, 43(5): 752-785.
[6] DAGUM L, MENON R. OpenMP: an industry standard API for shared-memory programming[J]. IEEE Computational Science and Engineering, 1998, 5(1): 46-55.
[7] CHAPMAN B, JOST G, VAN DER PAS R. Using OpenMP: portable shared memory parallel programming[M]. Cambridge, USA: MIT Press, 2008.
[8] ATZENI S, GOPALAKRISHNAN G, RAKAMARIC Z, et al. ARCHER: effectively spotting data races in large OpenMP applications[C]//Proceedings of 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Washington D.C., USA: IEEE Press, 2016: 53-62.
[9] WIENKE S, SPRINGER P, TERBOVEN C. OpenACC: first experiences with real-world applications[C]//Proceedings of European Conference on Parallel Processing. Berlin, Germany: Springer, 2012: 859-870.
[10] KIRK D. NVIDIA CUDA software and GPU parallel computing architecture[C]//Proceedings of the 6th International Symposium on Memory Management. New York, USA: ACM Press, 2007: 103-104.
[11] YANG Z Y, ZHU Y T, PU Y. Parallel image processing based on CUDA[C]//Proceedings of International Conference on Computer Science and Software Engineering. Washington D.C., USA: IEEE Press, 2008: 198-201.
[12] SANDERS J, KANDROT E. CUDA by example: an introduction to general-purpose GPU programming[M]. [S.l.]: Addison-Wesley Professional, 2010.
[13] FONSECA A, CABRAL B, RAFAEL J, et al. Automatic parallelization: executing sequential programs on a task-based parallel runtime[J]. International Journal of Parallel Programming, 2016, 44(6): 1337-1358.
[14] ZHANG Y Q, CAO T, LI S G, et al. Parallel processing systems for big data: a survey[J]. Proceedings of the IEEE, 2016, 104(11): 2114-2136.
[15] OH T, BEARD S R, JOHNSON N P, et al. A generalized framework for automatic scripting language parallelization[C]//Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). Washington D.C., USA: IEEE Press, 2017: 356-369.
[16] BASKARAN M M, RAMANUJAM J, SADAYAPPAN P. Automatic C-to-CUDA code generation for affine programs[M]. Berlin, Germany: Springer, 2010.
[17] BONDHUGULA U, RAMANUJAM J. Pluto: a practical and fully automatic polyhedral parallelizer and locality optimizer[EB/OL]. [2022-01-11]. https://www.xueshufan.com/publication/2034761517.
[18] BASTOUL C. Efficient code generation for automatic parallelization and optimization[C]//Proceedings of the 2nd International Symposium on Parallel and Distributed Computing. Washington D.C., USA: IEEE Press, 2003: 23-30.
[19] VERDOOLAEGE S, JUEGA J C, COHEN A, et al. Polyhedral parallel code generation for CUDA[J]. ACM Transactions on Architecture and Code Optimization, 2013, 9(4): 1-23.
[20] NUGTEREN C, CORPORAAL H. Bones: an automatic skeleton-based C-to-CUDA compiler for GPUs[J]. ACM Transactions on Architecture and Code Optimization, 2015, 11(4): 1-35.
[21] LEE S I, JOHNSON T A, EIGENMANN R. Cetus: an extensible compiler infrastructure for source-to-source transformation[C]//Proceedings of International Workshop on Languages and Compilers for Parallel Computing. Berlin, Germany: Springer, 2003: 539-553.
[22] DAVE C, BAE H, MIN S J, et al. Cetus: a source-to-source compiler infrastructure for multicores[J]. Computer, 2009, 42(12): 36-42.
[23] BAE H S, MUSTAFA D, LEE J W, et al. The Cetus source-to-source compiler infrastructure: overview and evaluation[J]. International Journal of Parallel Programming, 2013, 41(6): 753-767.
[24] AMINI M, CREUSILLET B, EVEN S, et al. Par4All: from convex array regions to heterogeneous computing[EB/OL]. [2022-01-11]. https://hal-mines-paristech.archives-ouvertes.fr/hal-00744733.
[25] QUINLAN D. ROSE: compiler support for object-oriented frameworks[J]. Parallel Processing Letters, 2000, 10(2): 215-226.
[26] QUINLAN D, LIAO C H. ROSE source-to-source compiler infrastructure[EB/OL]. [2022-01-11]. https://engineering.purdue.edu/Cetus/cetusworkshop/papers/4-1.pdf.
[27] LIU S, ZHAO B, JIANG Q, et al. A semi-automatic coarse-grained parallelization approach for loop optimization and irregular code sections[J]. Chinese Journal of Computers, 2017, 40(9): 2127-2147. (in Chinese)
[28] LI Y B, ZHAO R C, HAN L, et al. Parallelizing compilation framework for heterogeneous many-core processors[J]. Journal of Software, 2019, 30(4): 981-1001. (in Chinese)
[29] WANG P X, HAN L, DING L L, et al. Effect and evaluation of automatic parallelization of typical compilers[J]. Journal of Information Engineering University, 2018, 19(2): 186-190. (in Chinese)
[30] DING L L, LI Y B, ZHANG S P, et al. Auto-parallelization research based on branch nested loops[J]. Computer Science, 2017, 44(5): 14-19, 52. (in Chinese)
[31] GAO Y C, ZHAO R C, HAN L, et al. Research on automatic parallelization of loops[J]. Journal of Information Engineering University, 2019, 20(1): 82-89. (in Chinese)
[32] PARR T J, QUONG R W. ANTLR: a predicated-LL(k) parser generator[J]. Software: Practice and Experience, 1995, 25(7): 789-810.
[33] PARR T. The definitive ANTLR 4 reference[M]. [S.l.]: Pragmatic Bookshelf, 2013.