Computer Engineering (计算机工程)


Realization and Optimization of Canny Edge Detection on the FT Platform

  • Published: 2020-12-15

Abstract: To improve low-level image library support on the domestically developed Feiteng (飞腾) DSP platform and to narrow its performance gap with mainstream accelerators in image processing, this paper addresses the high time cost of Canny edge detection and proposes a parallel algorithm for the Canny gradient computation based on the FT-M7002. First, drawing on the Feiteng architecture and modern parallel program optimization strategies, the algorithm is analyzed to expose data-level parallelism, and the program is rewritten with manual SIMD vectorization to make full use of the platform's 512-bit SIMD units. Second, taking into account the platform's distinctive vector memory hierarchy, the algorithm's memory access patterns are analyzed: non-contiguous accesses are handled by loading from offset start addresses, and "ping-pong" buffering is used to hide DMA memory access latency behind computation. Experiments on the FT-M7002 platform show that, at the same accuracy as the original algorithm, the parallel Canny gradient algorithm speeds up the gradient computation for the commonly used 3×3, 5×5, and 7×7 kernels by 2.0, 2.5, and 2.8 times on average, respectively, and improves overall running speed by 1.49 to 2.11 times.
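The two optimizations named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's FT-M7002 code and uses no DSP intrinsics. Several things here are assumptions made only for illustration: the 3×3 gradient operator is taken to be Sobel (the abstract does not name the kernel), `LANES = 16` assumes 32-bit elements in a 512-bit vector, and all function names (`vload`, `sobel_rows`, `sobel_pingpong`, `dma_load`) are hypothetical. The sketch shows neighbouring columns fetched as loads from offset start addresses, and row tiles alternating between two buffers so that, on real hardware, the DMA transfer of the next tile could overlap with computation on the current one.

```python
LANES = 16  # assumption: 512-bit vector register / 32-bit elements


def sobel_scalar(img):
    """Reference 3x3 Sobel gradients on interior pixels, one pixel at a time."""
    h, w = len(img), len(img[0])
    gx = [[0] * w for _ in range(h)]
    gy = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx[y][x] = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                        - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
            gy[y][x] = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                        - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
    return gx, gy


def vload(row, start, n):
    """Stand-in for a vector load of n lanes from an offset start address."""
    return row[start:start + n]


def sobel_rows(rows, w):
    """Lane-wise Sobel over three consecutive rows; returns the middle row's (gx, gy)."""
    top, mid, bot = rows
    gx, gy = [0] * w, [0] * w
    for x0 in range(1, w - 1, LANES):
        n = min(LANES, w - 1 - x0)  # shorter final chunk at the right edge
        # Left/centre/right neighbours come from loads at offsets x0-1, x0, x0+1.
        tl, tc, tr = (vload(top, x0 + d, n) for d in (-1, 0, 1))
        ml, mr = vload(mid, x0 - 1, n), vload(mid, x0 + 1, n)
        bl, bc, br = (vload(bot, x0 + d, n) for d in (-1, 0, 1))
        for i in range(n):  # on a SIMD unit this loop would be a few vector ops
            gx[x0 + i] = tr[i] + 2*mr[i] + br[i] - tl[i] - 2*ml[i] - bl[i]
            gy[x0 + i] = bl[i] + 2*bc[i] + br[i] - tl[i] - 2*tc[i] - tr[i]
    return gx, gy


def sobel_pingpong(img, tile_rows=4):
    """Process row tiles while alternating between two buffers (ping-pong)."""
    h, w = len(img), len(img[0])
    gx = [[0] * w for _ in range(h)]
    gy = [[0] * w for _ in range(h)]
    # Each tile produces output rows [y0, y1) and needs input rows [y0-1, y1+1).
    tiles = [(y0, min(y0 + tile_rows, h - 1)) for y0 in range(1, h - 1, tile_rows)]

    def dma_load(t):
        """Stand-in for a DMA copy of one tile (plus halo rows) into fast memory."""
        y0, y1 = tiles[t]
        return [list(r) for r in img[y0 - 1:y1 + 1]]

    buffers = [None, None]
    if tiles:
        buffers[0] = dma_load(0)
    for t, (y0, y1) in enumerate(tiles):
        cur = buffers[t % 2]
        if t + 1 < len(tiles):
            # Sequential here; on hardware this transfer would run during the compute below.
            buffers[(t + 1) % 2] = dma_load(t + 1)
        for k in range(y1 - y0):
            gx[y0 + k], gy[y0 + k] = sobel_rows(cur[k:k + 3], w)
    return gx, gy
```

Calling `sobel_pingpong(img)` on a list-of-lists image should give the same gradients as `sobel_scalar(img)`, which mirrors the abstract's point that the parallel version keeps the accuracy of the original algorithm. Python runs the load and compute steps one after another, so the sketch shows only the buffer indexing pattern, not any actual latency hiding.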