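NPB Performance Evaluation of Tera-Scale Cluster Systems, and Implementation and Performance Analysis of a Parallel Non-Numerical Algorithm (万亿次机群系统NPB性能评测与并行非数值算法实现及性能分析)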


English title: NPB Performance Evaluation of Tera-Scale Clusters and Implementation of Parallel Non-Numerical Algorithm with Performance Analysis
Author: Yuan Wei (袁伟)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: tera-scale cluster system; performance evaluation; NPB (NAS Parallel Benchmarks); parallel data mining; association rules
Degree year: 2005
Advisor: Zhang Yunquan (张云泉)
Discipline code: 081202
Degree-granting institution: Graduate School of the Chinese Academy of Sciences (Institute of Software)
Thesis submission date: 2005-05-01

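Abstract

High performance computing is in a new period of rapid development, and two phenomena deserve attention. On the one hand, the peak performance of parallel computers is rising quickly: systems with peak speeds of 100 Tflops have been built, and cost-effective clusters have become the mainstream architecture for high performance computers, which has spread high performance computing to more application areas. On the other hand, parallel application software is scarce and the actual efficiency of high performance computers has long been low: current large parallel applications deliver less than 20% of system peak performance.
Application performance is what users care about most, and it is what matters most. Parallel software and the level of parallel applications have become the weak link in the development of high performance computing and deserve more attention. Parallel computers and parallel applications are the two main factors that affect parallel computing performance, and they are the focus of this thesis.
This thesis uses three tera-scale cluster systems as platforms and evaluates them with NPB (NAS Parallel Benchmarks), a suite with a strong application background. All eight programs in the NPB suite come from real application areas and are typical of parallel applications in scientific computing. NPB evaluation is therefore application-oriented and gives a fairly realistic picture of the application-like performance of a system.
The NPB tests focus on the performance characteristics and trends of the systems under large-scale parallel processing, with processor counts reaching into the thousands. The thesis analyzes how system configuration, such as the processors and the interconnection network, affects NPB performance. The eight NPB programs behave differently on the three tera-scale clusters, which indicates that domestic high performance clusters are gradually moving away from homogeneous designs and toward diversity. Further analysis shows that NPB programs currently scale to several hundred processors but not to thousands, and that the share of system peak performance they achieve still hovers around 10%. Both the parallel scalability of cluster systems and the extent to which applications exploit the computing potential of the machine need further improvement. For tera-scale clusters with thousands of processors, support for collective communication and fine-grained communication urgently needs to improve.
High performance parallel computing has broad application prospects in non-numerical fields. This thesis presents an independently developed, MPI-based parallel data mining system (association rule mining), reports speedup tests on two cluster systems, and analyzes the parallel characteristics of the program. The results indicate that a non-numerical parallel application should partition its data well, design and optimize its data structures carefully, and exploit as far as possible the ways in which it resembles an embarrassingly parallel program. Doing so effectively reduces inter-process communication, achieves load balance and synchronized computation, and gives the program good parallel performance.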
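The data-partitioning idea in the last paragraph can be illustrated with a small sketch. The abstract does not say which algorithm the thesis system uses, so this is only an assumption: a count-distribution style scheme in which each process counts itemsets in its own share of the transactions and the local counts are then summed. MPI is not used here; the partitions are processed in a plain loop, and the summation stands in for what a reduction across processes would do. All function names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def partition(transactions, n_parts):
    """Split transactions round-robin so each simulated process gets a similar share."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def local_counts(part, k):
    """Count every k-item subset in one partition (the work of a single process)."""
    counts = Counter()
    for t in part:
        for itemset in combinations(sorted(set(t)), k):
            counts[itemset] += 1
    return counts


def frequent_itemsets(transactions, k, min_support, n_parts=2):
    """Sum per-partition counts, then keep itemsets that meet the support threshold.

    The summation is the only step that would need communication in a real
    message-passing version; here it is a local loop (an assumption of this sketch).
    """
    total = Counter()
    for part in partition(transactions, n_parts):
        total.update(local_counts(part, k))
    return {s: c for s, c in total.items() if c >= min_support}


# Hypothetical sample data, not taken from the thesis.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
result = frequent_itemsets(transactions, k=2, min_support=2, n_parts=2)
```

Because the counting in each partition is independent, the result does not depend on how many partitions are used; only the final summation requires data from every partition, which is why this kind of scheme keeps inter-process communication low.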
Two facts deserve attention in the rapid progress of high performance computing. First, the peak performance of parallel computers is rising quickly and has reached the 100 Tflops level; clusters, with their high performance/cost ratio, have become the mainstream architecture and are being adopted in more applications. Second, the sustained performance of parallel applications is very low compared with the peak performance of the machine: most parallel applications exploit less than 20 percent of peak performance.

Real application performance is more important than peak performance, and it is what users care about most. The shortage of parallel applications and the low level of sustained performance have become the bottleneck in the progress of high performance computing. Both the parallel computer and the parallel application affect real performance, so we carried out application-oriented performance benchmarking and application performance analysis on tera-scale cluster systems.

NPB benchmarking was performed on three domestic tera-scale cluster systems, with emphasis on the performance characteristics and trends of tera-scale parallel computing on systems with thousands of processors. We analyzed the effects of different system configurations (processor, interconnection network, etc.) on NPB performance and found that different programs in the NPB suite achieved their best performance on different clusters. Further analysis showed that the scalability of NPB programs reaches hundreds of processors but not thousands. Most NPB programs exploit only around 10% of system peak performance, so both the scalability of cluster systems and real application performance on tera-scale cluster systems need further improvement. For manufacturers of tera-scale cluster systems with thousands of processors, the performance of collective communication and fine-grained message passing needs further improvement.

Performance research on parallel non-numerical applications is also very important. We developed a parallel data mining program (association rule mining) and tested its speedup on two cluster systems. With good data partitioning and optimized data structures, this program achieves good parallel performance.

The main contributions of this thesis are:
· I performed NPB benchmarking on three domestic tera-scale cluster systems, analyzed the effects of different system configurations on NPB performance, and studied the sustained performance and scalability of NPB programs on thousands of processors.
· I developed a parallel data mining system (association rule mining) and tested its speedup on two cluster systems. Using the characteristics of this program, I analyzed the main factors that affect its performance.
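For readers unfamiliar with the metrics the abstract relies on, the following sketch shows how speedup, parallel efficiency, and fraction of peak performance are conventionally computed. The definitions are the standard ones; the function names and all numbers are hypothetical and are not measurements from the thesis.

```python
def speedup(t_base, t_parallel):
    """Ratio of the baseline run time to the parallel run time."""
    return t_base / t_parallel


def parallel_efficiency(t_base, t_parallel, p_base, p):
    """Speedup divided by the increase in processor count (1.0 means ideal scaling)."""
    return speedup(t_base, t_parallel) * p_base / p


def fraction_of_peak(sustained_gflops, peak_gflops):
    """Share of the theoretical peak rate that the application actually sustains."""
    return sustained_gflops / peak_gflops


# Hypothetical numbers: 100 s on 1 processor, 8 s on 16 processors,
# and 100 Gflops sustained on a machine with a 1000 Gflops peak.
s = speedup(100.0, 8.0)
e = parallel_efficiency(100.0, 8.0, p_base=1, p=16)
f = fraction_of_peak(100.0, 1000.0)
```

In this made-up case the program runs 12.5 times faster on 16 processors, which is a parallel efficiency of about 78%, while sustaining 10% of peak. The two kinds of number answer different questions: efficiency measures how well a program scales, and fraction of peak measures how much of the hardware it uses at all.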

