Research on Job Scheduling Algorithms for the Hadoop Platform
Abstract
With the rapid development of Internet technology, Internet data is growing explosively, creating the problem of processing massive data sets. Cloud computing was proposed as a new model and has developed very quickly. Hadoop, an open-source cloud computing system, imitates and implements the main technologies of Google's cloud computing and is widely used. Hadoop is a platform that is still evolving and improving, and job scheduling is one of the hot topics in Hadoop research in both academia and industry. Improving job scheduling increases the capacity to process massive data, so it has practical significance for improving the performance and resource utilization of the Hadoop platform.
     This thesis first introduces the technical background of Hadoop and then the core of the Hadoop platform, namely the Hadoop Distributed File System (HDFS) and the MapReduce computing framework, and analyzes the Hadoop job scheduling process in detail. It then studies the existing scheduling algorithms on the Hadoop platform, namely the FIFO algorithm, the capacity scheduling algorithm, and the fair scheduling algorithm, with a detailed study of the fair scheduling algorithm.
     Building on this understanding of the Hadoop platform and the detailed study of its job scheduling algorithms, the thesis proposes improvements to job scheduling. First, it analyzes the data locality problem of the fair scheduling algorithm and the delay scheduling algorithm that improves it; on that basis it proposes a delay algorithm that guarantees a response time T, in order to meet the service level agreement (SLA) requirements of special users (for example, paying users). This algorithm mainly targets short jobs. Second, aiming to improve job scheduling continuously by using the historical records of nodes and learning job attributes, it proposes applying a feature-weighted naive Bayes classifier to task assignment in job scheduling, analyzes the design of the algorithm in detail, and designs and implements a prototype.
     Finally, an experimental environment was set up to test the improved algorithms. The first test covered the delay algorithm that guarantees a specific response time T; the experiments show that it met the response time requirement T at the cost of some data locality. The second test covered the feature-weighted naive Bayes classification scheduling algorithm, with experimental comparison and analysis of its learning ability, the effect of feature weighting on performance, the accuracy of its decisions, and its performance relative to existing scheduling algorithms.
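The trade-off reported for the first algorithm (the response time bound is met, some data locality is lost) can be illustrated with a minimal sketch of delay scheduling with a wait bound. This is not the thesis's implementation, which the abstract does not describe in detail. The `Job` class, the `pick_job` function, and the `max_wait` field (standing in for a bound derived from T) are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: float   # seconds; when the job entered the queue
    local_nodes: set     # nodes that hold this job's input blocks
    max_wait: float      # assumed bound on waiting for a local slot (derived from T)


def pick_job(jobs, free_node, now):
    """Choose a job for the free slot on `free_node`.

    Jobs are scanned in queue order. A job takes the slot if the node holds
    its data. Otherwise it skips the slot, unless it has already waited
    `max_wait` seconds, in which case it runs non-locally.
    Returns (job, is_local), or None if every job prefers to keep waiting.
    """
    for job in jobs:
        if free_node in job.local_nodes:
            return job, True
        if now - job.submit_time >= job.max_wait:
            return job, False
    return None


# A paying user's short job with a tight bound, and a batch job with a loose one.
queue = [
    Job("paid", submit_time=0.0, local_nodes={"n1"}, max_wait=5.0),
    Job("batch", submit_time=0.0, local_nodes={"n2"}, max_wait=60.0),
]
```

In this sketch, a slot on a node that holds no data for either job is left idle while both jobs are within their bounds. Once the "paid" job has waited 5 seconds it accepts any slot, which keeps its waiting time bounded but gives up locality for that task.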
