Bias in Natural Actor-Critic Algorithms

Philip Thomas
Proceedings of the 31st International Conference on Machine Learning, PMLR 32(1):441-448, 2014.

Abstract

We show that several popular discounted-reward natural actor-critics, including NAC-LSTD and eNAC, do not generate unbiased estimates of the natural policy gradient as claimed. We derive the first unbiased discounted-reward natural actor-critics, using both batch and iterative approaches to gradient estimation. We argue that the bias makes the existing algorithms more appropriate for the average-reward setting. We also show that, when Sarsa(λ) is guaranteed to converge to an optimal policy, the objective function used by natural actor-critics is concave, so policy gradient methods are guaranteed to converge to globally optimal policies as well.
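As context for the abstract, the following is a brief sketch of the quantities involved, written in standard policy-gradient notation rather than notation taken from the paper itself. The discounted objective, its gradient (via the policy gradient theorem), and the natural policy gradient are commonly defined as:

\[
J(\theta) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R_{t}\Big],
\qquad
\nabla J(\theta) = \sum_{s} d_{\gamma}^{\pi_\theta}(s) \sum_{a} \nabla \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a),
\]
\[
d_{\gamma}^{\pi_\theta}(s) = \sum_{t=0}^{\infty} \gamma^{t} \Pr(S_t = s \mid \pi_\theta),
\qquad
\tilde{\nabla} J(\theta) = F(\theta)^{-1} \nabla J(\theta),
\]

where \(F(\theta)\) is the Fisher information matrix of the policy \(\pi_\theta\) and \(d_{\gamma}^{\pi_\theta}\) is the discounted state distribution. An estimator that weights states by their undiscounted visitation frequency, rather than by \(d_{\gamma}^{\pi_\theta}\), does not in general yield \(\nabla J(\theta)\); it is closer to an average-reward gradient, which is consistent with the abstract's remark that the biased algorithms are more appropriate for the average-reward setting.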

Cite this Paper


BibTeX
@InProceedings{pmlr-v32-thomas14,
  title     = {Bias in Natural Actor-Critic Algorithms},
  author    = {Thomas, Philip},
  booktitle = {Proceedings of the 31st International Conference on Machine Learning},
  pages     = {441--448},
  year      = {2014},
  editor    = {Xing, Eric P. and Jebara, Tony},
  volume    = {32},
  number    = {1},
  series    = {Proceedings of Machine Learning Research},
  address   = {Beijing, China},
  month     = {22--24 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v32/thomas14.pdf},
  url       = {https://proceedings.mlr.press/v32/thomas14.html}
}
APA
Thomas, P. (2014). Bias in Natural Actor-Critic Algorithms. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research 32(1):441-448. Available from https://proceedings.mlr.press/v32/thomas14.html.
