Research on Rollback Recovery Technology in Distributed Systems


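English title: Research on Rollback Recovery Technology in Distributed Systems
Author: Liu Guoliang (刘国良)
Degree level: Doctoral
Discipline: Computer Science and Technology
Keywords (Chinese): 分布式系统 ; 回卷恢复 ; 检查点 ; XMPP协议 ; 原型系统
Keywords (English): Distributed Systems ; Rollback Recovery ; Checkpointing ; XMPP Protocol ; Prototype System
Degree year: 2012
Advisor: Chen Shuyu (陈蜀宇)
Discipline code: 0812
Degree-granting institution: Chongqing University
Thesis submission date: 2012-10-01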

Abstract

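Distributed systems offer low investment risk for users, good structural scalability, the ability to reuse existing hardware and software resources, and simple construction, and they are applied in an ever wider range of fields, including large-scale scientific computing, weather forecasting, time-sharing telephone systems, airline reservation systems, banking systems, stock trading systems, and online shopping systems. As systems keep growing in scale, the probability of a failure during computation grows exponentially, and a system failure can have disastrous consequences, so fault-tolerance mechanisms for distributed computing systems are urgently needed. Checkpoint and rollback-recovery is an important class of software fault-tolerance techniques; it is simple to implement and use and has low resource requirements, which makes it well suited to distributed computing environments.
     In a distributed computing environment, uncertain communication bandwidth, limited storage space, node dynamism, and frequent disconnections mean that rollback-recovery techniques developed for single-machine systems cannot be applied directly to distributed computing systems. While preserving system consistency, the core problems of rollback-recovery research in distributed environments are reducing the storage overhead of checkpoints and message logs, reducing the communication overhead introduced by the rollback-recovery mechanism, improving node autonomy, reducing the coupling between nodes caused by inter-process dependencies, and making the rollback-recovery mechanism transparent to nodes. This thesis addresses these problems; its main contributions are as follows.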
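     (1) A non-blocking coordinated checkpointing and rollback-recovery algorithm for distributed environments is proposed. In practical distributed computing applications, nodes are highly autonomous, and the desired fault-tolerance mechanism is a transparent service. The proposed checkpointing algorithm relies on the sending process to ensure that no orphan messages arise and needs no information from the receiving process. Every set of checkpoints the algorithm takes is a globally consistent checkpoint. Processes take permanent checkpoints directly, skipping the temporary-checkpoint phase, which shortens the time needed to form a checkpoint. Whether a process takes a checkpoint is independent of other processes and depends only on its send flag, which gives the algorithm a high degree of parallelism. After a node fails, only a single synchronization message needs to be broadcast; each other process, on receiving it, acts independently according to the algorithm without further messages from other processes, so nodes execute the rollback-recovery algorithm transparently and in parallel. Performance analysis and simulation experiments confirm the algorithm's low overhead both in failure-free operation and during rollback recovery.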
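     (2) A two-level checkpointing and rollback-recovery fault-tolerance algorithm based on dynamic grouping is proposed. In an application comprising many nodes, the frequency with which nodes exchange information is not uniform and may differ greatly, so a mechanism is needed that adapts to the dynamic cooperation among processes in a distributed system. The proposed algorithm forms groups dynamically according to metrics such as inter-node communication frequency, communication latency, communication bandwidth, and the number of nodes in a group, achieving high cohesion within groups and low coupling between them. Within a group, latency is low and nodes are few, which suits coordinated checkpointing, so a coordinated checkpointing algorithm is used at the group level. Groups are typically connected by high-latency, low-bandwidth networks and communicate infrequently, and the proposed system-level checkpointing algorithm takes these characteristics into account: whether a group takes a checkpoint is independent of other groups, and groups can take system-level checkpoints independently and in parallel. The sending group ensures that no orphan messages arise between groups, so every system-level checkpoint is a globally consistent checkpoint and the domino effect is avoided. The algorithm thus adapts dynamically to the application's own requirements and improves overall resource efficiency, and by making the sending group responsible for preventing inter-group orphan messages it replaces the traditional two-phase commit algorithm with a single-phase one. Experimental results show a low execution time; relative to the traditional two-phase commit algorithm, the time complexity is reduced from the usual O(n²) to O(n).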
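     (3) A general-purpose message-passing mechanism is built on the XMPP protocol. Existing checkpointing and rollback-recovery algorithms each define their own messaging, with differing message-passing schemes and no generality. Based on the characteristics of distributed systems and of the messages that checkpoint algorithms exchange, we build a general message-passing mechanism on XMPP that provides cross-platform, near-real-time message delivery. By extending the XML tags of the XMPP protocol, it unifies the transport format of the various checkpoint messages and improves code reusability.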
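     (4) Design and implementation of a prototype system. Designing and implementing a prototype on the basis of the theoretical work, to verify that the theory can be realized, is a very important step in moving from theoretical research to practical engineering application. Drawing on the theoretical results above, we study the prototype's system construction, the requirements analysis of the client software, the client's overall framework, its functional modules, and its processing flow, and we implement a prototype system in code, demonstrating that the theoretical results can be realized.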
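The consistency condition that contributions (1) and (2) depend on can be made concrete with a small check. The sketch below is not the thesis's algorithm; it only encodes the standard definition of an orphan message, namely a message whose receipt is recorded in the receiver's checkpoint while its sending is not recorded in the sender's checkpoint. The `Message` record, the per-process event sequence numbers, and the function names are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """One application message (hypothetical record, for illustration only)."""
    sender: str
    receiver: str
    send_seq: int  # position of the send event in the sender's local history
    recv_seq: int  # position of the receive event in the receiver's local history


def orphan_messages(checkpoints, messages):
    """Return the messages that are orphans with respect to a global checkpoint.

    `checkpoints` maps each process name to the number of local events its
    checkpoint covers. A message is an orphan when the receiver's checkpoint
    already includes the receive event but the sender's checkpoint does not
    yet include the send event.
    """
    return [
        m for m in messages
        if m.recv_seq <= checkpoints[m.receiver]
        and m.send_seq > checkpoints[m.sender]
    ]


def is_consistent(checkpoints, messages):
    """A global checkpoint is consistent when it contains no orphan messages."""
    return not orphan_messages(checkpoints, messages)
```

For a single message sent as event 3 of `P1` and received as event 2 of `P2`, the checkpoint `{"P1": 2, "P2": 2}` is inconsistent (the receipt is recorded but the send is not), while `{"P1": 3, "P2": 1}` is consistent: the message is merely in transit, which is the case the abstract leaves to retransmission.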
Distributed systems are applied ever more widely because they carry low investment risk, scale well structurally, can reuse existing software and hardware resources, and are simple to build. Their application fields include large-scale scientific computing, telephone systems, airline booking systems, banking systems, stock trading systems, and shopping systems. As systems grow in scale, the probability of failure grows exponentially, and a system failure may have disastrous consequences, so fault-tolerance mechanisms for distributed computing systems are urgently needed. Checkpointing and rollback recovery is an important class of software fault-tolerance techniques that is easy to implement and use and is well suited to distributed computing environments.
     In a distributed computing environment, uncertain communication bandwidth, storage space constraints, node dynamism, and frequent disconnections mean that rollback-recovery techniques developed for a single computer cannot be applied directly to distributed computing systems. The core problems of rollback-recovery research are, while ensuring system consistency, to reduce the storage cost of checkpoints and message logs, reduce the communication cost of the rollback-recovery mechanism, improve node autonomy, decrease the coupling between nodes caused by process dependencies, and make the rollback-recovery mechanism transparent to nodes. This thesis focuses on these aspects.
     (1) In distributed computing environments, many network structures are loosely coupled and nodes are highly autonomous, so the fault-tolerance mechanism should be a transparent service and rollback recovery should be asynchronous. We present a non-blocking coordinated checkpointing and rollback-recovery algorithm for distributed systems that differs from the conventional approach, in which processes first take temporary checkpoints and then convert them to permanent ones. The proposed algorithm lets processes take permanent checkpoints directly, without temporary checkpoints, which contributes to its speed of execution. Orphan messages are eliminated by the sender processes, and in-transit messages are handled by the checkpointing interval and a retransmission mechanism. Each node needs to keep only a recent checkpoint, and log information is used to avoid synchronization, which reduces failure-free run-time overhead. After a node fails, only one synchronization message needs to be broadcast to the other processes, which then proceed independently according to the algorithm once they receive it. Rollback recovery is thereby transparent to nodes and highly parallel.
     (2) In an application that spans many nodes, the frequency of information exchange between nodes is not uniform and may vary greatly, so an adaptive mechanism is needed. For such systems we present a two-level checkpointing and rollback-recovery fault-tolerance algorithm based on dynamic grouping, which uses a coordinated checkpointing algorithm at the group level and a single-phase checkpointing algorithm at the system level. Groups are formed dynamically according to communication frequency, communication delay, bandwidth, the number of nodes, and other indicators. As a result, communication delay within a group is small and the group has few nodes, so we adopt a coordinated checkpointing algorithm at the group level. Groups are usually connected to one another by high-latency, low-bandwidth networks and communicate infrequently, and the proposed system-level checkpointing algorithm takes full account of these characteristics. Orphan messages are eliminated by the sender groups, and in-transit messages are handled by the checkpointing interval and a retransmission mechanism, so the system-level checkpoints obtained are consistent global checkpoints and the domino effect is avoided. On the one hand, the algorithm adapts dynamically to the application's requirements and improves the efficiency of the application as a whole; on the other hand, because orphan messages are eliminated on the sender side, it moves from a two-phase commit algorithm to a single-phase commit algorithm, which contributes to its speed of execution.
     (3) How to build a general message-passing mechanism for a distributed environment that carries the various checkpoint messages across platforms in quasi-real time is worth studying. Based on the characteristics of distributed systems and checkpointing algorithms, we present a messaging mechanism that is extensible and can be adapted to transfer a variety of checkpoint messages. Its advantages are cross-platform operation, ease of extension, and quasi-real-time transmission.
     (4) Designing and implementing a prototype system on the basis of the theoretical research, to verify that the theory can be realized, is important engineering work in moving from theory to practical application. We study the prototype's system construction, the requirements analysis of the client software, the software framework, the functional modules, and the processing flow, and we implement a prototype system that incorporates the theoretical results, showing that they can be realized.
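The idea in contribution (3), carrying checkpoint messages as XMPP stanzas with extended XML tags, might look roughly like the following. The abstract does not give the actual tag names, so the namespace `urn:example:checkpoint`, the `checkpoint` element, and its `kind` and `seq` attributes are hypothetical. The sketch also omits the `jabber:client` stanza namespace and any real XMPP connection, and uses only Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for the checkpoint extension (not from the thesis).
CKPT_NS = "urn:example:checkpoint"


def build_checkpoint_message(sender, receiver, kind, seq):
    """Serialize a <message/> stanza carrying one namespaced extension element."""
    msg = ET.Element("message", {"from": sender, "to": receiver, "type": "normal"})
    # The payload lives in its own namespace, so entities that do not
    # understand the extension can ignore it.
    ET.SubElement(msg, f"{{{CKPT_NS}}}checkpoint", {"kind": kind, "seq": str(seq)})
    return ET.tostring(msg, encoding="unicode")


def parse_checkpoint_message(xml_text):
    """Return the checkpoint payload as a dict, or None if the stanza has none."""
    msg = ET.fromstring(xml_text)
    ext = msg.find(f"{{{CKPT_NS}}}checkpoint")
    if ext is None:
        return None
    return {
        "from": msg.get("from"),
        "to": msg.get("to"),
        "kind": ext.get("kind"),
        "seq": int(ext.get("seq")),
    }
```

Because every checkpoint message type shares one envelope and differs only in the extension element's attributes, a single parser can serve all of them, which is the kind of format unification and code reuse the abstract describes.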

