Research on Dependable Computing Oriented Distributed Fault Detection System (面向可信计算的分布式故障检测系统研究)

English title: Research on Dependable Computing Oriented Distributed Fault Detection System
Author: Lu Huawei (卢华玮)
Degree level: Doctoral
Discipline: Computer Science and Technology
Keywords: Dependable Computing; Large-Scale Distributed Systems; Fault Detection; Self-Organized Management; Support Vector Machine
Degree year: 2012
Advisor: Chen Shuyu (陈蜀宇)
Discipline code: 0812
Degree-granting institution: Chongqing University (重庆大学)
Thesis submission date: 2012-04-01

Abstract

With the continuous development of computer software, hardware, and network technologies, service computing models keep changing. Driven by the rapid growth of the Internet, open large-scale distributed computing systems such as wide-area P2P systems, grids, and clouds have emerged in recent years. Many business systems that are closely tied to economic production and social life run on these network computing systems, which have greatly promoted economic development and social progress. When such systems fail or their quality of service (QoS) degrades, the result is considerable inconvenience and economic loss. How to guarantee and improve the dependability of computing systems, so that they can continuously offer highly available and highly reliable computing services, has therefore become one of the key issues in the application and development of distributed computing technology. Industry and academia have invested substantial manpower and funding in related research. Correctly controlling and tolerating faults is an important means of improving a system's dependability, and effective fault detection based on the states of system entities is an important foundation for that means.
     Fault detection covers not only the correct recognition of faults but also the effective monitoring of the entities being detected. This thesis studies several key problems of dependable-computing-oriented fault detection in large-scale distributed network computing environments. Based on a summary and in-depth analysis of existing technologies and research results, it proposes the architecture of a dependable-computing-oriented distributed fault detection system and designs the related algorithms: a distributed self-organizing entity monitoring algorithm, a state message dissemination algorithm, a detection system survivability algorithm, and a fault recognition algorithm. Finally, a self-organizing distributed fault detection prototype system is implemented.
     The specific work and contributions of this thesis are as follows.
     ① The problems facing fault detection in distributed computing systems under the dependable computing framework are clarified. On that basis, and in view of the application characteristics and fault tolerance requirements of open networked distributed systems, a fault detection architecture decoupled from upper-layer policy is designed, and an overall framework for distributed fault detection is built, comprising modules for state data collection, system entity monitoring, state information dissemination, and fault recognition.
     ② Traditional centralized or hierarchical node monitoring systems aimed at failure detection adapt poorly to open large-scale distributed network computing environments, which are characterized by widely distributed nodes, a large number of participating nodes, unstable message transmission delay, and uncertain service dependencies. Based on the idea of self-organization, a neighborhood-based monitoring method is proposed that partitions monitoring neighborhoods according to the distance between system entities. The method effectively reduces the delay of mutual monitoring within a neighborhood and thus improves entity monitoring efficiency in large-scale distributed environments.
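As a rough illustration of distance-based neighborhood selection (not the thesis's actual algorithm, which the abstract does not specify), the sketch below assumes that each node has a measured round-trip time to its peers and monitors its `k` closest ones. The function name `pick_neighbors`, the RTT table, and the parameter `k` are all hypothetical.

```python
from typing import Dict, List


def pick_neighbors(rtt_ms: Dict[str, Dict[str, float]], node: str, k: int) -> List[str]:
    """Return the k peers closest to `node` by measured round-trip time.

    Assumption: "distance" is approximated by RTT; the abstract does not
    say which distance metric the thesis uses.
    """
    peers = rtt_ms[node]
    # Sort by RTT, breaking ties by name so the result is deterministic.
    ranked = sorted(peers, key=lambda p: (peers[p], p))
    return ranked[:k]


# Hypothetical RTT measurements (milliseconds) between four nodes.
rtt = {
    "a": {"b": 5.0, "c": 80.0, "d": 12.0},
    "b": {"a": 5.0, "c": 75.0, "d": 9.0},
    "c": {"a": 80.0, "b": 75.0, "d": 70.0},
    "d": {"a": 12.0, "b": 9.0, "c": 70.0},
}

# Each node monitors its two nearest peers.
neighborhoods = {n: pick_neighbors(rtt, n, k=2) for n in rtt}
```

Keeping monitoring traffic between nearby nodes is what bounds the heartbeat delay inside a neighborhood; distant nodes such as `c` still obtain monitors, only with longer round trips.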
     ③ Message flooding causes high network load, while message unicasting causes high system delay. To address both problems, the advantages and disadvantages of message dissemination with traditional gossip protocols are analyzed, and a directional message dissemination algorithm based on the gossip protocol, D-Gossip, is proposed. D-Gossip reduces the uncertainty of message dissemination in traditional gossip protocols, effectively improves dissemination efficiency and coverage, and reduces the amount of redundant information in the system.
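The abstract does not describe how D-Gossip chooses its targets, so the following is only a minimal simulation of the traditional push gossip baseline it improves on. The function `push_gossip` and its parameters (`fanout`, `rounds`, `seed`) are illustrative assumptions. The sketch makes the trade-off visible: random target selection sends redundant messages and does not guarantee full coverage.

```python
import random
from typing import Set, Tuple


def push_gossip(n: int, fanout: int, rounds: int, seed: int = 0) -> Tuple[Set[int], int]:
    """Simulate plain push gossip among nodes 0..n-1, starting at node 0.

    Each round, every informed node forwards the message to `fanout`
    peers chosen uniformly at random. Returns (informed nodes, messages sent).
    """
    rng = random.Random(seed)
    informed = {0}
    sent = 0
    for _ in range(rounds):
        newly_informed = set()
        for node in informed:
            peers = [p for p in range(n) if p != node]
            # Random choice is the source of dissemination uncertainty:
            # a target may already hold the message (redundant send).
            for target in rng.sample(peers, min(fanout, len(peers))):
                sent += 1
                newly_informed.add(target)
        informed |= newly_informed
    return informed, sent


informed, sent = push_gossip(n=50, fanout=3, rounds=6)
coverage = len(informed) / 50          # fraction of nodes reached
redundant = sent - (len(informed) - 1)  # sends that informed nobody new
```

A directed variant would replace the `rng.sample` call with a biased or deterministic target choice; which rule D-Gossip uses is not stated in the abstract.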
     ④ In the distributed detection system, nodes are peers of one another and monitoring domains are constructed by self-organization. As a result, critical nodes exist in the monitoring domains: once a critical node leaves the system, a large number of nodes can no longer be monitored, the distributed detection system partially fails, and the survivability of the fault detection function is reduced. The problem is particularly pronounced in high-churn distributed environments. This thesis therefore designs a set of methods for critical nodes, comprising adaptive detection, active detection, and repair (neutralization), which effectively address the survivability problem of the distributed fault detection system.
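One common way to model a critical node is as a cut vertex (articulation point) of the monitoring graph: removing it disconnects some nodes from the rest. That modeling choice is an assumption here, since the abstract does not define critical nodes formally. The sketch below finds cut vertices with the standard depth-first low-link method; the example topology is hypothetical.

```python
from typing import Dict, List, Optional, Set


def articulation_points(graph: Dict[str, List[str]]) -> Set[str]:
    """Return the cut vertices of an undirected simple graph (adjacency lists)."""
    disc: Dict[str, int] = {}  # discovery order of each visited node
    low: Dict[str, int] = {}   # earliest discovered node reachable from the subtree
    cut: Set[str] = set()
    counter = 0

    def dfs(u: str, parent: Optional[str]) -> None:
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier node without its parent.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # v's subtree cannot bypass u, so u is a cut vertex.
                if parent is not None and low[v] >= disc[u]:
                    cut.add(u)
        # A DFS root is a cut vertex only if it has several DFS children.
        if parent is None and children > 1:
            cut.add(u)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return cut


# Hypothetical monitoring topology: a triangle a-b-c with a chain c-d-e.
topology = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
critical = articulation_points(topology)
```

In this topology, `c` and `d` are critical: if either leaves, node `e` (and, for `c`, also `d`) loses every monitoring path to the triangle. Adding a redundant link such as `e`-`a` would remove both cut vertices.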
     ⑤ In large-scale distributed computing services, fault samples are limited and traditional methods face difficulties in fault classification and identification. The support vector machine (SVM) is therefore introduced into the distributed fault detection system, providing a new approach to fault classification and identification. The thesis studies the key problems of applying SVMs to fault classification and identification and gives the basic implementation steps of SVM-based fault identification. Because the standard SVM cannot be applied directly to dependable-computing-oriented fault detection, which is a typical multi-class classification problem, a multi-class classification algorithm based on the decision directed acyclic graph (DDAG) is adopted. A multi-fault classifier model is built, and its correctness is verified by fault injection.
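The DDAG scheme combines one binary classifier per pair of classes: at each step the first and last remaining candidate classes are compared and the loser is eliminated, so a decision over k classes takes k-1 binary evaluations. The sketch below shows only that evaluation order. To stay self-contained it uses a nearest-centroid rule as a stand-in for each pairwise SVM, and the fault classes and feature values are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
BinaryClassifier = Callable[[Vector], str]


def centroid(points: List[Vector]) -> List[float]:
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_pairwise(samples: Dict[str, List[Vector]]) -> Dict[Tuple[str, str], BinaryClassifier]:
    """Build one binary classifier per class pair.

    Stand-in: nearest centroid. A real system would train an SVM per pair.
    """
    centroids = {label: centroid(pts) for label, pts in samples.items()}
    classifiers: Dict[Tuple[str, str], BinaryClassifier] = {}
    for a, b in combinations(sorted(samples), 2):
        def decide(x: Vector, a: str = a, b: str = b) -> str:
            return a if sq_dist(x, centroids[a]) <= sq_dist(x, centroids[b]) else b
        classifiers[(a, b)] = decide
    return classifiers


def ddag_predict(classifiers: Dict[Tuple[str, str], BinaryClassifier],
                 labels: List[str], x: Vector) -> str:
    """Evaluate the DDAG: test first vs. last remaining class, drop the loser."""
    remaining = sorted(labels)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if classifiers[(first, last)](x) == first:
            remaining.pop()    # `last` is eliminated
        else:
            remaining.pop(0)   # `first` is eliminated
    return remaining[0]


# Hypothetical training data: (cpu load, disk usage, packet loss rate).
samples = {
    "normal": [(0.20, 0.30, 0.00), (0.30, 0.40, 0.01)],
    "cpu_overload": [(0.95, 0.30, 0.00), (0.90, 0.35, 0.02)],
    "disk_full": [(0.30, 0.98, 0.00), (0.25, 0.95, 0.01)],
    "network_loss": [(0.20, 0.30, 0.40), (0.30, 0.35, 0.50)],
}
classifiers = train_pairwise(samples)
prediction = ddag_predict(classifiers, list(samples), (0.92, 0.30, 0.01))
```

With four classes there are six pairwise classifiers, but each prediction consults only three of them, which is the main practical advantage of DDAG evaluation over voting across all pairs.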
     ⑥ A dependable-computing-oriented distributed fault detection prototype system is designed, and the implementation of each of its components is described in detail. A series of experiments on the detection system described in this thesis are carried out on the prototype, verifying the functions of each component.
     In conclusion, this thesis analyzes and studies several key problems that current fault detection technologies face in large-scale distributed dependable computing environments, and designs and improves a series of algorithms. Theoretical analysis and experimental results indicate that the algorithms are correct and effective, that they can perform dependable-computing-oriented fault detection in large-scale distributed network computing environments, and that they provide a solid basis for decisions on safeguarding system dependability.

