Improving Q-Learning Using Simultaneous Updating and Adaptive Policy Based on Opposite Action

Subject Areas : electrical and computer engineering

M. Pouyan ¹ *, S. Golzari ², A. Mousavi ³, Ahmad Hatam ⁴

Received: 2017-07-13 Accepted : 2017-07-13 Published : 2016-09-21

Keywords: Adaptive policy, convergence speed, opposite action, simultaneous updating, Q-learning

Abstract :

Q-learning is one of the most popular and frequently used model-free reinforcement learning methods. Among its advantages are that it requires no prior knowledge of the environment and that it is proven to converge to the optimal policy. One of its main limitations is its low convergence speed, especially in high-dimensional problems, and accelerating its convergence remains a challenge. The notion of opposite action can accelerate Q-learning, since two Q-values are updated simultaneously at each learning step. In this paper, an adaptive policy and the notion of opposite action are combined in an integrated approach to speed up the learning process. The methods are simulated on the grid world problem. The results demonstrate a marked improvement in learning in terms of success rate, the percentage of optimal states, the number of steps to the goal, and average reward.
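The idea of updating two Q-values per learning step can be illustrated with a minimal Python sketch. The abstract does not give the update rule, so everything here is an assumption made for illustration: the 5x5 deterministic grid, the reward values, the rule that the opposite of a move is the move in the reverse direction, the premise that the outcome of the opposite action can be simulated, and the helper names (`opposite`, `step`, `q_update`, `paired_update`, `train`). A fixed epsilon-greedy policy stands in for the adaptive policy, which is not modeled.

```python
import random

# Hypothetical action set: up, right, down, left (row/column offsets).
ACTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def opposite(action):
    """Index of the reverse move (up<->down, right<->left)."""
    return (action + 2) % 4


def step(state, action, size, goal):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    nxt = (min(max(row + d_row, 0), size - 1),
           min(max(col + d_col, 0), size - 1))
    reward = 1.0 if nxt == goal else -0.01  # assumed reward scheme
    return nxt, reward


def q_update(Q, state, action, reward, nxt, alpha, gamma):
    """Standard one-step Q-learning update."""
    target = reward + gamma * max(Q[nxt])
    Q[state][action] += alpha * (target - Q[state][action])


def paired_update(Q, state, action, size, goal, alpha, gamma):
    """Update the chosen action and its opposite in the same learning step."""
    nxt, reward = step(state, action, size, goal)
    q_update(Q, state, action, reward, nxt, alpha, gamma)
    # Assumption: the result of the opposite action can be simulated
    # without actually moving the agent.
    opp = opposite(action)
    opp_nxt, opp_reward = step(state, opp, size, goal)
    q_update(Q, state, opp, opp_reward, opp_nxt, alpha, gamma)
    return nxt


def train(size=5, goal=(4, 4), episodes=300, alpha=0.5, gamma=0.9,
          epsilon=0.1, seed=0):
    """Epsilon-greedy Q-learning with two Q-value updates per step."""
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * 4 for r in range(size) for c in range(size)}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(200):  # step cap per episode
            if rng.random() < epsilon:
                action = rng.randrange(4)
            else:
                action = max(range(4), key=lambda a: Q[state][a])
            state = paired_update(Q, state, action, size, goal, alpha, gamma)
            if state == goal:
                break
    return Q
```

Calling `train()` returns a table mapping each grid cell to four action values. Because each step also refreshes the value of the reverse move, the table fills in roughly twice as fast per interaction as with the single standard update, which is the intuition behind the speed-up described above.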

