Research and Simulation of Climate Cooperation Strategies Based on a Multi-Agent Q-Learning Algorithm
Abstract
In recent years climate deterioration has become an indisputable fact. The problem has drawn attention worldwide, and countries are trying to join hands to meet the climate challenge. Climate cooperation, however, is carried out by participating countries that each pursue their own national interests, and their rationality dictates that their behaviour aims at maximizing their own benefit. The optimal strategy that a participating country seeks is therefore one that reaches the common goal while still protecting its own interests. This thesis takes the climate cooperation strategy as its research object and applies multi-agent reinforcement learning to simulate the optimal strategies of the participating countries under different punishment rates.
     The main contributions of this thesis are as follows:
     (1) Drawing on the game-equilibrium idea behind the NashQ algorithm, a multi-agent Q-learning algorithm based on the Meta equilibrium (MetaQ) is proposed: the Q-values are updated with a Meta-equilibrium strategy so as to obtain the optimal joint policy of the multi-agent system. The theoretical basis of MetaQ is given, and the analysis shows that MetaQ can reach a Pareto-optimal solution and that its time complexity is far lower than that of NashQ. Grid-world game simulations show that MetaQ converges well; in the experiments MetaQ converged nearly six times faster than NashQ to the optimal number of moves (see the first sketch below).
     (2) The climate cooperation strategy problem is studied. It is formulated as a non-cooperative multi-agent system, and its investment model and punishment model are given. Game-equilibrium strategies have clear advantages for non-cooperative multi-agent systems, so Q-learning algorithms based on the Nash equilibrium and on the Meta equilibrium are both applied to the problem, and the climate cooperation strategy is simulated with the NashQ and MetaQ algorithms. Because the Meta equilibrium is a pure-strategy equilibrium, it is guaranteed to find a Pareto-optimal solution whenever one exists, and computing a Meta-equilibrium point has a lower time complexity than computing a Nash-equilibrium point. The simulation results show that, for the climate cooperation strategy, MetaQ converges faster than NashQ under a high punishment probability, while under a low punishment rate the joint strategy it finds is more humane and credible than that of NashQ (see the second sketch below).
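The first sketch below illustrates, in Python, the kind of equilibrium-valued Q-learning backup that NashQ-style algorithms, and the proposed MetaQ, build on: the next-state value used in the update is the value of an equilibrium of the stage game formed by both agents' Q-tables rather than an individual maximum. It is only a minimal illustration; the thesis defines the Meta equilibrium formally, whereas here a simple Pareto-dominance filter with a sum-of-payoffs tie-break stands in for the equilibrium selection, and the function names and parameters are assumptions made for the example.

import numpy as np

def pareto_joint_value(q1_s, q2_s):
    # q1_s, q2_s: (n1, n2) arrays holding each agent's Q-values for every joint action in one state.
    pairs = [(float(q1_s[a1, a2]), float(q2_s[a1, a2]))
             for a1 in range(q1_s.shape[0])
             for a2 in range(q1_s.shape[1])]
    # Keep the joint actions whose payoff pair is not Pareto-dominated by another joint action.
    front = [p for p in pairs
             if not any(o[0] >= p[0] and o[1] >= p[1] and o != p for o in pairs)]
    # Tie-break by the payoff sum (an assumption for the sketch, not the thesis's Meta-equilibrium rule).
    return max(front, key=lambda p: p[0] + p[1])

def metaq_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha=0.1, gamma=0.9):
    # One joint Q-learning backup for both agents: the next-state value is the
    # equilibrium value of the stage game given by both Q-tables, not a per-agent max.
    v1, v2 = pareto_joint_value(Q1[s_next], Q2[s_next])
    Q1[s][a1, a2] += alpha * (r1 + gamma * v1 - Q1[s][a1, a2])
    Q2[s][a1, a2] += alpha * (r2 + gamma * v2 - Q2[s][a1, a2])

# Example: two agents, two states, two actions each.
Q1 = {s: np.zeros((2, 2)) for s in range(2)}
Q2 = {s: np.zeros((2, 2)) for s in range(2)}
metaq_update(Q1, Q2, s=0, a1=1, a2=0, r1=1.0, r2=0.5, s_next=1)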
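The second sketch shows one way an investment and punishment model of this kind could be encoded as the stage game of the multi-agent system: each country makes a binary invest/free-ride decision, every country receives an equal share of the benefit created by the total investment, investors bear their own cost, and free-riders are fined with a given punishment probability. The payoff form, parameter values, and function name are assumptions made for illustration; the thesis specifies its own investment and punishment models.

import random

def stage_payoffs(invest, benefit_per_unit=0.6, cost=1.0, penalty=1.5, punish_prob=0.5):
    # invest: list of 0/1 investment decisions, one per country.
    total = sum(invest)
    payoffs = []
    for x in invest:
        p = benefit_per_unit * total       # every country shares the benefit of the total investment
        if x:
            p -= cost                      # an investing country bears its own cost
        elif random.random() < punish_prob:
            p -= penalty                   # a free-riding country is punished with this probability
        payoffs.append(p)
    return payoffs

# Example: three countries, two invest and one free-rides, under a 50% punishment rate.
print(stage_payoffs([1, 1, 0]))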
