Research on Hierarchical Reinforcement Learning Methods
Abstract
Reinforcement learning improves an agent's policy through trial-and-error interaction with a dynamic environment; its self-learning and online-learning capabilities have made it an important branch of machine learning research. Reinforcement learning has, however, long been plagued by the curse of dimensionality. In recent years, hierarchical reinforcement learning has made significant progress against the curse of dimensionality, with Option, HAM, and MAXQ among the representative methods; of these, Option and MAXQ are currently the most widely used. The Option framework makes it easy to generate subtasks automatically (especially by partitioning regions or stages) and to control subtask granularity, but when subtasks are constructed by hand from prior knowledge, the resulting decomposition is not expressed clearly and the internal policies of the subtasks are hard to determine. The MAXQ approach has strong online-learning ability but weak ability to discover hierarchies automatically; in addition, its decomposition granularity is not fine enough, so subtasks that remain very large can hardly be decomposed further.
     This dissertation proposes a new hierarchical reinforcement learning approach, OMQ, which integrates Option into MAXQ, and studies in depth the theoretical and computational issues raised by the integration, as well as the problems that must be solved when the approach is applied in practice.
     The main contributions of the dissertation are:
     (1) The OMQ hierarchical reinforcement learning approach is proposed, and its theoretical framework and learning algorithm are presented. The framework combines the advantages of Option and MAXQ: a learning task can be decomposed in advance using prior knowledge and can also be decomposed automatically during learning, which extends the ability to construct task hierarchies. Using results from stochastic approximation theory, an inductive proof shows that the learning algorithm converges with probability 1 to the unique recursively optimal policy under the same convergence conditions as MAXQ. Experiments show that the OMQ learning algorithm outperforms the Q-learning, Option, and MAXQ learning algorithms.
     (2) An automatic task-decomposition algorithm for OMQ based on immune clustering is proposed. The algorithm clusters the state space using the aiNet artificial immune network model and an immune clonal selection algorithm, and constructs subtasks from the resulting state-cluster subspaces; experiments show that it overcomes the heavy dependence of earlier automatic decomposition algorithms on the separability of the state space. Borrowing the secondary-response mechanism of the immune system, the algorithm is further improved into a dynamic automatic-decomposition algorithm, DOMQ, which performs automatic decomposition immediately after a preliminary exploration of the state space and can then adjust the generated subtasks according to the results of subsequent exploration.
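     For orientation, the two formalisms that OMQ integrates can be summarized by their standard value-function forms. The notation below follows the usual Option (semi-Markov decision process) formulation of Sutton, Precup and Singh and the MAXQ decomposition of Dietterich, not the dissertation's own derivation, and is given only as background.

     Q(s, o) \leftarrow Q(s, o) + \alpha \left[ r + \gamma^{k} \max_{o'} Q(s', o') - Q(s, o) \right]

Here o is an option (a temporally extended action) executed from state s, terminating in s' after k steps with accumulated discounted reward r; this is the SMDP Q-learning update used in the Option framework.

     Q^{\pi}(i, s, a) = V^{\pi}(a, s) + C^{\pi}(i, s, a)

Here i is a parent subtask that invokes child subtask a in state s under a hierarchical policy \pi: V^{\pi}(a, s) is the expected return accumulated while a executes, and C^{\pi}(i, s, a) is the completion function, the expected discounted return for finishing i after a terminates. Recursive optimality, referred to in contribution (1), is defined with respect to a decomposition of this kind.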
