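Abstract
The patch-based multi-view stereo (PMVS) algorithm is widely used in digital-city modeling and related fields because of its good three-dimensional reconstruction quality, but it runs inefficiently on large-scale computations. To address this, a fine-grained parallel optimization method is proposed that targets three aspects: task partitioning and load balancing, main system memory and GPU memory, and communication overhead. In addition, a GPU and multi-threaded parallel adaptation of the patch-based PMVS feature extraction step is designed, achieving multi-granularity cooperative parallelism across CPUs and GPUs (CPUs_GPUs). Experimental results show that the CPU multi-threading strategy achieves a 4x speed-up, the parallel strategy based on the compute unified device architecture (CUDA) achieves a speed-up of up to 34x, and the proposed strategy improves performance by a further 30% over the CUDA parallel strategy. The method can be applied to fast scheduling of computing resources for big data processing in other fields.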
We address the problem of fine-grained parallel optimization for large-scale data. The patch-based multi-view stereo (PMVS) algorithm has been widely applied to digital cities and other fields because of its good three-dimensional reconstruction quality; however, its execution efficiency is low for large-scale computation. To address this limitation, this paper proposes a fine-grained parallel optimization method that covers three aspects: task allocation and load balancing, strategies for main system memory and GPU memory, and optimization of communication overhead. On the CPU side, we implement multi-threading with the pthreads library to take full advantage of the computing power of multi-core CPUs. On the GPU side, we use the CUDA framework and optimize thread organization and memory access. In addition, we adopt a memory pool model and a pipelining model to improve bandwidth utilization. The memory pool model reduces the time that CPUs and GPUs (CPUs_GPUs) spend waiting for data to be transferred over the bus; the pipelining model hides the communication time the CPU spends reading data from memory. We use the Harris-DoG feature extraction step of the PMVS algorithm on image sequences as a case study to verify the optimization strategies. The experiments show that the multi-threaded CPU-based strategy achieves a 4x speed-up, that the parallel CUDA-based strategy achieves a speed-up of up to 34x, and that our strategy improves performance by a further 30% over the parallel CUDA-based strategy. In the future, the optimization strategy could be applied to fast scheduling of computing resources for big data processing in other domains.
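The pipelining idea mentioned in the abstract, overlapping data loading with computation so that communication time is hidden, can be illustrated with a minimal sketch. This is not the paper's pthreads/CUDA implementation: Python's standard `threading` and `queue` modules stand in for it, and the names `run_pipeline`, `load`, `compute`, and `depth` are hypothetical, introduced here only for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input stream


def run_pipeline(items, load, compute, depth=2):
    """Overlap load(item) with compute(data) through a bounded buffer.

    `depth` is the number of loaded items that may wait in the buffer;
    depth=2 corresponds to classic double buffering (an assumption of
    this sketch, not a value taken from the paper).
    """
    buffer = queue.Queue(maxsize=depth)

    def loader():
        # Producer stage: fetch data ahead of the consumer.
        try:
            for item in items:
                buffer.put(load(item))  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always signal the end, even on error

    worker = threading.Thread(target=loader, daemon=True)
    worker.start()

    results = []
    while True:
        data = buffer.get()  # blocks until the next item is loaded
        if data is _DONE:
            break
        # Consumer stage: runs while the loader fetches the next item.
        results.append(compute(data))
    worker.join()
    return results


if __name__ == "__main__":
    # Each "image" is simulated as a short list of pixel values.
    print(run_pipeline(range(5), load=lambda i: [i] * 3, compute=sum))
```

Results come back in input order because a single loader feeds a single consumer through a FIFO queue. In Python the overlap only pays off when `load` is I/O-bound; a native implementation would more likely pair a reader thread with asynchronous host-to-device copies.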