摘要
稀疏下三角方程求解器(SpTRSV)作为基础线性代数库中一个重要的算法,在大规模科学计算中有着广泛应用。在非结构网格中,由于非结构网格具有数据存储无序性、数据强相关性以及频繁地离散访存等特点,该算法在众核架构上难以实现有效的并行。文中基于国产异构众核处理器SW26010体系结构的特点,针对非结构网格计算,提出了一种基于流水线串行-局部并行思想的通用众核优化方法。该方法能够有效减少非结构网格计算中的随机访存,提高计算效率,并且具有很好的扩展性。基于该算法对多个实际应用算例进行众核优化,实验结果表明:该方法能够实现单核组3倍以上的加速,显著降低了运行时间。
Sparse Triangular Solver(SpTRSV),as an important algorithm in basic linear algebraic library,has been widely used in large-scale scientific computing.In unstructured-grids,because unstructured grid have the characte-ristics of data storage disorder,data depth correlation and frequent discrete-time memory access,this algorithm is difficult to achieve effective parallelism in the many-core architecture.In this paper,based on the architecture of the domestic heterogeneous multiprocessor SW26010 architecture,a general kernel optimization method based on pipelined serial and local parallel was proposed for unstructured grid computing.This method can effectively reduce random access in unstructured grid computing,improve the computing efficiency,and have the good scalability.Based on this algorithm,multiple kernel optimization is carried out for several practical applications.The experimental results show that the method can achieve more than 3 times acceleration of the single core group and significantly reduce the running time.
引文
[1] ANDERSON E,SAAD Y.Solving sparse triangular linear systems on parallel computers[J].International Journal of High Speed Computing,1989,1(1):73-95.
[2] DUFF I S,ERISMAN A M,REID J K.Direct Methods for Sparse Matrices[J].Matehematics of Computation,1986,52(185):250.
[3] SAAD Y.Iterative methods for sparse linear systems[OL].http://ieeexplore.ieee.org/ielx5/99/12132/001231631.pdf.
[4] REGULY I Z,MUDALIGE G R,BERTOLLI C,et al.Acceleration of a full-scale industrial cfd application with op2[J].IEEE Transactionson Parallel and Distributed Systems,2016,27(5):1265-1278.
[5] YAO C,YANG X,WANG W,et al.26 pflops stencil computations for atmospheric modeling on sunway taihulight[C]//2017 IEEE International Parallel and Distributed Processing Sympo-sium(IPDPS).IEEE,2017:535-544.
[6] CHEN T,LI M,LI Y,et al.Mxnet:A flexible and efficient machine learning library for heterogeneous distributed systems[J].arXiv preprint arXiv:1512.01274,2015.
[7] ANDERSON E,SAAD Y.Solving sparse triangular linear systems on parallel computers[J].International Journal of High Speed Computing,1989,1(1):73-95.
[8] KALIKAR S,NASRE R.Domlock:A new multi-granularity locking technique for hierarchies[C]//Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,2016.
[9] SCHREIBER R,TANG W P.Vectorizing the conjugate gradient method[D].Los Angeles:Stanford University,1982.
[10] LIU W,LI A,HOGG J D,et al.Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides[J].Concurrency and Computation:Practice and Experience,2017,29(21):e4244-n/a.
[11] PICCIAU A,INGGS G E,WICKERSON J,et al.Balancing locality and concurrency:Solving sparse triangular systems on gpus[C]//2016 IEEE 23rd International Conference on High Performance Computing(HiPC).2016:183-192.
[12] 敖玉龙.国产大型众核系统上稀疏矩阵和Stencil运算的性能优化关键技术研究[D].北京:中科院软件研究所,2017:27-47.
[13] 王欣亮.关键稀疏数值计算核心在国产众核架构上的性能优化研究[D].北京:清华大学,2018:24-43.
[14] QI F B.Sunway TaihuLight super computer[J].Communications of the CCF,2017,13(10):16-22.
[15] ANDERSON W K,GROPP W D,KAUSHIK D K,et al.Achieving High Sustained Performance in an Unstructured Mesh CFD Application[C]//Conference on High Performance Computing (Super Computing).1999.
[16] 林恒.基于超大规模异构体系结构的图计算系统研究[D].北京:清华大学,2018:23-53.