Research on Availability Evaluation of Supercomputer Systems
Abstract
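Supercomputer systems are a strategic resource that countries around the world compete to secure, and performance is their lifeline. As their performance leaps forward and their functions and structures grow more complex, the availability problem of supercomputer systems becomes increasingly serious. To improve availability and to minimize the impact of failure and repair events on delivered performance, availability evaluation of the system is indispensable. However, a supercomputer system differs from an ordinary computer system and places its own requirements on availability evaluation, so the traditional metrics and methods cannot be applied directly and deeper research is needed.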
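     Building on an analysis of the state of research on availability evaluation for supercomputer systems, and on a summary of the general principle and basic elements of availability evaluation for ordinary systems, this thesis addresses the shortcomings of existing work and the problems that arise when the general principle and its elements are applied directly to supercomputer systems. It pursues three lines of research: (1) an availability evaluation framework and method for supercomputers with a degree of generality; (2) application-oriented availability evaluation metrics that reflect the essential characteristics of supercomputer systems; (3) a solution to the state space explosion problem in the numerical analysis of state space models for supercomputer availability evaluation.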
     The contributions of this thesis fall into four areas:
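     (1) An application-oriented hierarchical availability modeling method for supercomputers (Application-Oriented Hierarchical Availability Modeling, abbreviated AOHAM) is proposed. The method is based on the overall characteristics of supercomputer systems and takes the perspectives of different observers as its starting point. It uses hierarchical, modular SANs modeling and relates system behaviors to one another by sharing places and activities between model modules. With the Mobius modeling tool, a single evaluation can then satisfy several evaluation requirements at once, which reduces repeated evaluation work.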
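     (2) Two new availability evaluation metrics are proposed, available power and powerful availability, and their definitions and measurement rules are described and derived in detail. Both metrics are based on the computing power of the supercomputer system: the former directly measures how much computing power the system can provide, and the latter measures the proportion of that power in the system's total computing power. A set of simple example models with variable parameters was measured with both powerful availability and basic availability. The results demonstrate that the new metrics better reflect the essential characteristics of supercomputer systems and are therefore better suited to their availability evaluation.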
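     (3) An automated distributed state space generation scheme based on the MapReduce mechanism is designed and implemented. State space models are the main method for availability evaluation of supercomputer systems, and numerical analysis is one of the important ways of solving them. However, numerical analysis of state space models suffers from the state space explosion problem: the state space of the model grows nonlinearly with the scale of the modeled system, which severely limits the scale of supercomputer systems for which state space models are practical. One important remedy is to generate the state space in parallel in a distributed environment. Existing parallelization schemes place high demands on the platform and on the user and are hard to deploy widely. This thesis therefore proposes an automated parallel scheme for state space generation based on the Hadoop platform and its core MapReduce mechanism. The scheme has been implemented in a distributed environment, and the experimental results show that: (a) the scheme achieves a good speedup; (b) it is largely independent of the host platform and easy to scale out as the modeled system grows; (c) it is simple to implement and convenient for ordinary users to program against. The scheme therefore has good application prospects.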
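     (4) Availability evaluation is carried out for the two core parts of a particular supercomputer system, the host system and the peripheral system. For the host system, the evaluation uses the powerful availability metric. From the perspectives of observers at several levels (system maintenance staff, system administrators, and job users), the logical hierarchy of the system and the behavior models of each level are analyzed, a SANs evaluation model is built for each, and the model modules are integrated with the Mobius tool, so that solving the evaluation model once meets the availability evaluation needs of multiple observers. For the peripheral system, whose users share a single perspective, the traditional availability metric is used; a hierarchical SANs availability model is also built, and experimental evaluation is performed for given model parameter values. The conclusion from evaluating the different parts of the example system is that the choice of metric depends on the situation. For a peripheral system whose availability state is Boolean and whose users share a single perspective, traditional basic availability remains appropriate. For a host system that must reflect how much computing power is delivered and that has observers at several levels, powerful availability is the better choice.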
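To make the contrast between the metrics in contribution (2) concrete, here is a minimal sketch under assumptions that are not taken from the thesis: the system is modeled as `n` identical, independent nodes, each with steady-state availability MTTF / (MTTF + MTTR), and basic availability treats the system as "up" only when at least `min_nodes` nodes are up. All function names are hypothetical, and the thesis's own definitions and derivations may differ.

```python
from math import comb


def node_availability(mttf, mttr):
    """Steady-state availability of one node (assumed model, not from the thesis)."""
    return mttf / (mttf + mttr)


def up_count_distribution(n, a):
    """P(exactly k of n independent nodes are up), for k = 0..n."""
    return [comb(n, k) * a**k * (1 - a) ** (n - k) for k in range(n + 1)]


def available_power(n, a, node_power):
    """Expected computing power the system can deliver."""
    dist = up_count_distribution(n, a)
    return sum(p * k * node_power for k, p in enumerate(dist))


def powerful_availability(n, a, node_power):
    """Expected deliverable power as a fraction of total installed power."""
    return available_power(n, a, node_power) / (n * node_power)


def basic_availability(n, a, min_nodes):
    """Boolean view: probability that at least min_nodes nodes are up."""
    dist = up_count_distribution(n, a)
    return sum(dist[min_nodes:])
```

In this toy model, a four-node system that requires all four nodes has a basic availability of a⁴, while its powerful availability stays at a, because partial capacity still counts. This is the kind of gap between a Boolean metric and a power-based one that the abstract describes.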
Supercomputer systems are important strategic resources that every country strives to own, and performance is their lifeline. Supercomputer performance now improves by orders of magnitude and is no longer the bottleneck of supercomputer design, but availability is becoming more and more critical: it keeps systems from operating normally, so their performance is never fully delivered and is sometimes heavily discounted unless the availability problem is handled properly. To improve availability and reduce, as far as possible, the impact of failure and maintenance events on performance, availability evaluation of these systems is essential. However, supercomputer availability cannot be evaluated in the same way as that of common computer systems, because supercomputers have many characteristics that common systems lack. Availability evaluation for supercomputers therefore deserves further research in several directions, such as evaluation methods, measurement metrics, and the solution of evaluation models.
     Based on a survey of existing research on availability evaluation for supercomputer systems, and on a summary of the general principle of availability evaluation for common systems and its three primary elements, this thesis works in three areas. The aim is to overcome shortcomings in present research and to solve the problems encountered when the general principle and its elements are applied directly to supercomputers. The three areas are: (1) availability evaluation principles and methods designed specifically for supercomputers; (2) application-oriented availability evaluation metrics that reflect the essential characteristics of supercomputers; (3) solutions to the state space explosion problem that inevitably arises when the state space models of supercomputers are solved with numerical analysis methods.
     The contributions of this dissertation include:
     (1) An evaluation method named AOHAM (short for Application-Oriented Hierarchical Availability Modeling) is proposed for supercomputer availability evaluation. It is based on the general characteristics of supercomputer systems and takes multiple observation subjects into account. Using a hierarchical, modular SANs modeling method, AOHAM captures the relationships between system behaviors through places and activities shared by different model modules. With the help of the modeling tool Mobius, the requirements of different observation subjects can be satisfied by solving the integrated model just once, which avoids much of the repeated work needed when a supercomputer system is modeled separately for each observer, as the general principle does for common systems.
     (2) Two new availability evaluation metrics are proposed, Available Power (AP) and Powerful Availability (PA), and their definitions and measurement rules are stated and derived in detail. Both are motivated by the observation that measuring how much computing power the supercomputer can provide to the user is more meaningful than judging only whether it is available at a given time. The difference is that the former directly measures the computing power the system can provide, while the latter measures the ratio of this power to the system's total computing power. By evaluating a set of simple example models with variable parameters using both Powerful Availability and traditional primary availability, we conclude from the experimental results that the new metrics better reflect the essential characteristics of supercomputers and are therefore more suitable for supercomputer availability evaluation.
     (3) An automatic distributed state space generation scheme based on the MapReduce mechanism is designed and implemented. Numerical analysis is an important way of solving state space models, which are the most significant method for measuring supercomputer availability. Unfortunately, it faces a critical problem as the target system grows in scale, the state space explosion problem, which hinders the application of state space models to supercomputer availability evaluation. One important countermeasure is to generate the model's state space in parallel in a distributed environment. Current implementations of this approach have shortcomings such as high demands on platforms and programmers and difficulty of extension to other applications. The scheme proposed in this thesis is instead built on the open Hadoop platform and its MapReduce mechanism, which automatically parallelizes the state space generation process. It has been realized in a common distributed environment, and the experimental results show that: (a) it achieves a good speed-up ratio; (b) it is independent of the host platform and easy to scale, which helps it keep up with growth of the modeled system; (c) it is easy for ordinary programmers to use, whether or not they know parallel programming. The scheme therefore has broad and promising application prospects.
     (4) Availability evaluation is carried out for the two core parts of a certain supercomputer system, the host system and the peripheral system, with Powerful Availability and traditional primary availability respectively. For the host system, the PA metric is adopted, the system's logical hierarchy is analyzed, and the behavior models corresponding to the different levels are built independently in SANs. These model modules are then integrated with Mobius into a single model of the whole system, which is solved once and fulfills the requirements of the different observation subjects. The availability of the peripheral system is evaluated with the traditional primary availability metric, since its availability is Boolean. Hierarchical SANs models are also set up for it, and a series of experiments is run on them with several sets of parameters. The conclusion drawn from this work is that the choice of metric, PA or primary availability, is determined by the properties of the target system. If its availability is Boolean, primary availability is equivalent to PA and is more convenient to measure; otherwise PA should be chosen, because only PA properly shows the system's computing power.
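The idea behind contribution (3), generating a state space in map and reduce rounds, can be sketched in plain Python. This is only an in-process illustration under stated assumptions, not the thesis's Hadoop implementation: the toy model (each component is up or down, and one component fails or is repaired per transition), the hash partitioning, and every function name are hypothetical. Each round expands the current frontier (map), groups successor states by partition (shuffle), and discards states already seen in that partition (reduce).

```python
from collections import defaultdict


def successors(state):
    """Toy model (assumption): flip one component between up (1) and down (0)."""
    for i, bit in enumerate(state):
        yield state[:i] + (1 - bit,) + state[i + 1:]


def map_phase(frontier, num_partitions):
    """Expand every frontier state and bucket successors by partition."""
    buckets = defaultdict(list)
    for state in frontier:
        for nxt in successors(state):
            buckets[hash(nxt) % num_partitions].append(nxt)
    return buckets


def reduce_phase(candidates, visited):
    """Keep only states this partition has not seen before."""
    return set(candidates) - visited


def generate_state_space(initial, num_partitions=4):
    """Breadth-first generation in map/reduce rounds until no new states appear."""
    visited = [set() for _ in range(num_partitions)]
    visited[hash(initial) % num_partitions].add(initial)
    frontier = [initial]
    while frontier:
        buckets = map_phase(frontier, num_partitions)
        frontier = []
        for part, candidates in buckets.items():
            new_states = reduce_phase(candidates, visited[part])
            visited[part] |= new_states
            frontier.extend(new_states)
    return set().union(*visited)
```

With this toy model, `n` components yield 2**n reachable states, which gives a small-scale picture of the state space explosion the abstract refers to. Because each partition owns a disjoint slice of the visited set, the reduce work for different partitions is independent, which is what makes the rounds parallelizable in principle.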
