A new Q(lambda) with interim forward view and Monte Carlo equivalence

Rich Sutton, Ashique Rupam Mahmood, Doina Precup, Hado Hasselt
Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):568-576, 2014.

Abstract

Q-learning, the most popular of reinforcement learning algorithms, has always included an extension to eligibility traces to enable more rapid learning and improved asymptotic performance on non-Markov problems. The lambda parameter smoothly shifts on-policy algorithms such as TD(lambda) and Sarsa(lambda) from a pure bootstrapping form (lambda=0) to a pure Monte Carlo form (lambda=1). In off-policy algorithms, including Q(lambda), GQ(lambda), and off-policy LSTD(lambda), the lambda parameter is intended to play the same role, but does not; on every exploratory action these algorithms bootstrap regardless of the value of lambda, and as a result they fail to approximate Monte Carlo learning when lambda=1. It may seem that this is inevitable for any online off-policy algorithm; if updates are made on each step on which the target policy is followed, then how could just the right updates be ‘un-made’ upon deviation from the target policy? In this paper, we introduce a new version of Q(lambda) that does exactly that, without significantly increased algorithmic complexity. En route to our new Q(lambda), we introduce a new derivation technique based on the forward-view/backward-view analysis familiar from TD(lambda) but extended to apply at every time step rather than only at the end of episodes. We apply this technique to derive first a new off-policy version of TD(lambda), called PTD(lambda), and then our new Q(lambda), called PQ(lambda).
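
To make the abstract's point concrete, below is a minimal tabular sketch of conventional Watkins's Q(lambda) with accumulating traces. It is not the PQ(lambda) algorithm introduced in the paper; the function name, defaults, and NumPy setup are illustrative assumptions. Note how any exploratory (non-greedy) action zeroes the eligibility traces, so the update bootstraps at that step regardless of lambda, which is why lambda=1 does not recover Monte Carlo learning.

```python
import numpy as np

def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next,
                          alpha=0.1, gamma=0.99, lam=1.0):
    """One tabular Watkins's Q(lambda) backup with accumulating traces.

    Q : (n_states, n_actions) action-value table
    e : (n_states, n_actions) eligibility traces
    (s, a, r, s_next) : the observed transition
    a_next : the action that will actually be taken at s_next
             (it may be exploratory, i.e. non-greedy)
    """
    greedy_next = np.argmax(Q[s_next])
    # The TD error always bootstraps from the greedy action's value.
    delta = r + gamma * Q[s_next, greedy_next] - Q[s, a]
    e[s, a] += 1.0           # accumulate a trace for the visited pair
    Q += alpha * delta * e   # credit all recently visited pairs
    if a_next == greedy_next:
        e *= gamma * lam     # traces persist while the target policy is followed
    else:
        e[:] = 0.0           # exploratory action: traces are cut,
                             # forcing a bootstrap here even when lam == 1
    return Q, e
```

With lam=1 and no exploration this behaves like an every-visit Monte Carlo update, but a single exploratory action mid-episode resets the traces and severs credit to earlier state-action pairs; the paper's PQ(lambda) is designed so that exactly the right updates are 'un-made' upon deviation from the target policy instead.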

Cite this Paper


BibTeX
@InProceedings{pmlr-v32-sutton14,
  title     = {A new Q(lambda) with interim forward view and Monte Carlo equivalence},
  author    = {Sutton, Rich and Mahmood, Ashique Rupam and Precup, Doina and Hasselt, Hado},
  booktitle = {Proceedings of the 31st International Conference on Machine Learning},
  pages     = {568--576},
  year      = {2014},
  editor    = {Xing, Eric P. and Jebara, Tony},
  volume    = {32},
  number    = {2},
  series    = {Proceedings of Machine Learning Research},
  address   = {Beijing, China},
  month     = {22--24 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v32/sutton14.pdf},
  url       = {https://proceedings.mlr.press/v32/sutton14.html}
}
APA
Sutton, R., Mahmood, A. R., Precup, D., & Hasselt, H. (2014). A new Q(lambda) with interim forward view and Monte Carlo equivalence. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):568-576. Available from https://proceedings.mlr.press/v32/sutton14.html.