Research on Network Congestion Control Algorithms Based on Reinforcement Learning Theory
Abstract
With the rapid development of the Internet, the number of users has grown quickly and new network applications appear almost daily, causing a sharp increase in network traffic. The resulting congestion has become a bottleneck that restricts the development and application of networks, and it is the main factor degrading quality of service (QoS). Solving the congestion problem effectively is therefore important for improving network performance. However, a network is a complex, large-scale system that is time-varying and uncertain, and mathematical models are usually either too complex or too inaccurate to meet its real-time requirements. Congestion control algorithms based on learning are therefore needed to obtain better control performance.
     Reinforcement learning does not depend on a mathematical model or prior knowledge of the controlled object. It acquires knowledge through trial and error and continual interaction with the environment, improving its behavior policy as it learns; in other words, it is self-learning. This makes it well suited to complex, time-varying network systems. Accordingly, this dissertation proposes several congestion control algorithms based on reinforcement learning theory. The main contributions are summarized as follows.
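The trial-and-error update at the heart of these methods can be sketched with tabular Q-learning (a minimal illustration; the states, actions, and reward below are hypothetical, not the network models studied in the dissertation):

```python
def q_learning_step(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One model-free update: move Q(s,a) toward reward + gamma * max_a' Q(s',a')."""
    best_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
    return Q

# Toy example: two states, two actions; action 1 in state 0 is rewarded.
Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}}
for _ in range(100):
    q_learning_step(Q, state=0, action=1, reward=1.0, next_state=1)
print(Q[0][1] > Q[0][0])  # → True: the rewarded action gains value
```

No environment model is consulted anywhere in the update, which is what makes the approach attractive when the network dynamics are unknown.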
     For congestion control in ATM (asynchronous transfer mode) networks with a single bottleneck node, a hierarchical reinforcement learning ABR (available bit rate) flow controller is designed based on the adaptive heuristic critic method. The controller's action-selection element uses a hierarchical mechanism: one sub-element acts on the queue length in the buffer and the other on the cell loss ratio. The ABR sending rate is a weighted combination of the two sub-elements' outputs. The controller's parameters are then learned with a simulated annealing procedure, which accelerates learning and avoids local extrema.
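The role of simulated annealing in the parameter learning can be illustrated generically (a sketch under assumed settings, not the controller's actual weight update; the objective `f` below is a made-up multimodal function):

```python
import math
import random

def anneal_parameters(loss, theta0, T0=1.0, cooling=0.95, steps=200, seed=0):
    """Simulated-annealing search over a scalar parameter: a worse candidate is
    accepted with probability exp(-delta/T), which lets the search escape local
    extrema; the temperature T is cooled geometrically each step."""
    rng = random.Random(seed)
    theta, best, T = theta0, theta0, T0
    for _ in range(steps):
        cand = theta + rng.gauss(0.0, 0.5)      # random perturbation
        delta = loss(cand) - loss(theta)
        if delta < 0 or rng.random() < math.exp(-delta / T):
            theta = cand                        # accept (always if better)
        if loss(theta) < loss(best):
            best = theta                        # track the best point seen
        T *= cooling
    return best

# Made-up multimodal objective: global minimum near x = 2, plus ripples.
f = lambda x: (x - 2.0) ** 2 + 0.3 * math.sin(8.0 * x)
print(f(anneal_parameters(f, theta0=-3.0)) < f(-3.0))  # → True: the search improves on the start
```

The occasional acceptance of worse candidates at high temperature is exactly what a greedy gradient-style update lacks, and is why the abstract credits annealing with avoiding local extrema.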
     For congestion control in ATM networks with two bottleneck nodes, a Q-learning ABR flow controller is designed. With the network model parameters unknown, the design of the Q-function transforms the search for the optimal control policy into the search for an optimal matrix H. The matrix H is learned with recursive least squares (RLS), yielding the control policy that optimizes the network performance index.
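The RLS recursion can be sketched on a generic linear regression (the dissertation applies the same kind of recursion to the entries of its H matrix; the two-parameter toy target below is invented for illustration):

```python
import random

def rls_update(theta, P, phi, y, lam=1.0):
    """One recursive-least-squares step for y ≈ phi·theta (2 parameters, written
    out explicitly): k = P phi / (lam + phiᵀ P phi); theta += k·err; P = (P - k (P phi)ᵀ)/lam."""
    Pphi = [P[0][0] * phi[0] + P[0][1] * phi[1],
            P[1][0] * phi[0] + P[1][1] * phi[1]]
    denom = lam + phi[0] * Pphi[0] + phi[1] * Pphi[1]
    k = [Pphi[0] / denom, Pphi[1] / denom]
    err = y - (phi[0] * theta[0] + phi[1] * theta[1])
    theta = [theta[0] + k[0] * err, theta[1] + k[1] * err]
    # P stays symmetric, so phiᵀP equals (P phi)ᵀ in the rank-one downdate.
    P = [[(P[0][0] - k[0] * Pphi[0]) / lam, (P[0][1] - k[0] * Pphi[1]) / lam],
         [(P[1][0] - k[1] * Pphi[0]) / lam, (P[1][1] - k[1] * Pphi[1]) / lam]]
    return theta, P

# Toy: recover the weights [2, -1] from noiseless samples y = 2*x1 - x2.
random.seed(0)
theta, P = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]
for _ in range(50):
    phi = [random.gauss(0, 1), random.gauss(0, 1)]
    y = 2.0 * phi[0] - 1.0 * phi[1]
    theta, P = rls_update(theta, P, phi, y)
print([round(t, 2) for t in theta])  # ≈ [2.0, -1.0]
```

Because each sample refines the estimate in place, no batch of data or model parameters needs to be stored, which matches the model-free setting described above.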
     For congestion control in TCP (transmission control protocol) networks, an active queue management (AQM) algorithm is designed based on Q-learning. The controller learns the Q-value of each state-action pair and adjusts the learning rate with a confidence value that measures how closely a Q-value reflects the current network state. The state space is then simplified through a state-space transformation, and the action-selection policy is improved with the Metropolis criterion to balance exploration of unknown states against exploitation of acquired knowledge. Finally, the controller is extended to networks with multiple bottleneck nodes by means of a cooperative reward.
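The Metropolis-style balance between exploration and exploitation can be sketched as follows (a generic form of the rule; the Q-values and temperatures are illustrative, and the exact acceptance function used in the dissertation may differ):

```python
import math
import random

def metropolis_select(q_values, greedy_action, T, rng=random):
    """Propose a random action; accept it outright if its Q-value is at least
    the greedy one's, otherwise accept with probability exp(-(Q_greedy - Q_prop)/T).
    High temperature T favors exploration, low T favors exploitation."""
    proposal = rng.randrange(len(q_values))
    dq = q_values[greedy_action] - q_values[proposal]
    if dq <= 0 or rng.random() < math.exp(-dq / T):
        return proposal
    return greedy_action

random.seed(1)
q = [0.2, 1.0, 0.4]                                       # action 1 currently looks best
cold = [metropolis_select(q, 1, T=0.1) for _ in range(1000)]
hot = [metropolis_select(q, 1, T=10.0) for _ in range(1000)]
print(cold.count(1) / 1000, len(set(hot)))  # cold run exploits; hot run tries every action
```

Lowering T over time shifts the policy smoothly from exploring unknown actions toward exploiting the knowledge already learned.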
     For the continuous state space of TCP networks, an AQM algorithm is designed based on fuzzy Q-learning. During learning, both the actions chosen by the learning agent and the corresponding Q-values are obtained by fuzzy inference. The consequent parts of the fuzzy rules are then optimized with a genetic algorithm, yielding the optimal action for each rule.
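A zero-order Takagi-Sugeno form illustrates how a Q-value can be read off a continuous state by fuzzy inference (the membership functions and rule consequents below are invented; the dissertation's rule base, and its GA-optimized consequents, differ):

```python
def triangular(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_q(x, rules):
    """Each rule i contributes its consequent q_i weighted by its normalized
    firing strength, so Q varies smoothly over the continuous state x."""
    strengths = [mf(x) for mf, _ in rules]
    total = sum(strengths) or 1.0
    return sum(w * q for w, (_, q) in zip(strengths, rules)) / total

# Hypothetical rule base over a normalized queue-length error in [0, 1]:
rules = [
    (lambda x: triangular(x, -0.5, 0.0, 0.5), 1.0),   # "small error"  -> high q
    (lambda x: triangular(x, 0.0, 0.5, 1.0), 0.2),    # "medium error"
    (lambda x: triangular(x, 0.5, 1.0, 1.5), -1.0),   # "large error"  -> low q
]
print(round(fuzzy_q(0.25, rules), 3))  # → 0.6, a blend of the "small" and "medium" rules
```

A genetic algorithm would then search over the consequent values (the second element of each rule tuple) rather than over the full continuous action space.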
     For networks with non-cooperative users, a flow controller is designed based on Nash Q-learning. Using a pricing scheme, different price levels are set for different services, and for different QoS requirements within the same service, and these prices enter the calculation of the reward. Learning proceeds by selecting Q-values that satisfy the Nash equilibrium condition, so each user obtains as high a sending rate as possible while the performance of the network as a whole remains optimal.
     For the routing problem, a dual-metric Q-routing algorithm is designed first. It learns separate Q-values for packet transmission time and path cost, and the weights assigned to the two metrics steer the routing decision. A memory-based Q-learning routing algorithm is designed next. The Q-values of the paths come to reflect the network state through learning; the agent predicts traffic trends from the best Q-values kept in memory and from the recovery rates of previously congested paths, and chooses the routing strategy accordingly.
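The dual-metric trade-off can be sketched as a weighted choice among next hops (the neighbors, Q-values, and weight w are all hypothetical):

```python
def select_next_hop(q_delay, q_cost, w=0.7):
    """For each neighbor, combine the delay Q-value and the path-cost Q-value
    with weight w and pick the neighbor minimizing the combination."""
    combined = {n: w * q_delay[n] + (1 - w) * q_cost[n] for n in q_delay}
    return min(combined, key=combined.get)

# Neighbor A is fast but expensive; neighbor B is slow but cheap.
q_delay = {"A": 2.0, "B": 8.0}
q_cost = {"A": 9.0, "B": 1.0}
print(select_next_hop(q_delay, q_cost, w=0.9))  # → "A" (delay-weighted)
print(select_next_hop(q_delay, q_cost, w=0.1))  # → "B" (cost-weighted)
```

Adjusting w thus shifts the routing policy between delay-optimal and cost-optimal behavior without changing the learning rule itself.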
