Online Learning in Markov Decision Processes with Changing Cost Sequences

Travis Dick, Andras Gyorgy, Csaba Szepesvari
Proceedings of the 31st International Conference on Machine Learning, PMLR 32(1):512-520, 2014.

Abstract

In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full- and bandit-information feedback. We propose to view this problem as an instance of online linear optimization and present two methods for it: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques together with near-optimal regret bounds; in particular, we take into account the computational cost of performing approximate projections in MD^2. In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback, we consider the online stochastic shortest path problem, a special case of the above MDP problems, and improve existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
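As a rough illustration of the online-linear-optimization view (and not the authors' MD^2 algorithm), the sketch below ignores the MDP flow constraints and relaxes the decision set to the probability simplex over state-action pairs, where online mirror descent with an entropic regularizer reduces to an exponentiated update followed by an exact projection (normalization). All function names and parameters are illustrative.

import numpy as np

# Minimal sketch: entropic online mirror descent against a changing sequence of
# linear cost vectors. The decision set is the probability simplex, a relaxation
# of the occupancy-measure polytope used in the paper; this is NOT MD^2 itself.

def entropic_omd(cost_sequence, num_points, learning_rate):
    """Run entropic mirror descent on the simplex and return the final point and total cost."""
    x = np.full(num_points, 1.0 / num_points)   # start at the uniform distribution
    total_cost = 0.0
    for cost in cost_sequence:
        total_cost += float(np.dot(x, cost))    # linear loss suffered at this round
        x = x * np.exp(-learning_rate * cost)   # multiplicative (mirror) update
        x /= x.sum()                            # exact projection back onto the simplex
    return x, total_cost

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, n = 1000, 10
    costs = rng.uniform(0.0, 1.0, size=(T, n))  # changing cost sequence
    eta = np.sqrt(2.0 * np.log(n) / T)          # standard step-size tuning on the simplex
    x, total = entropic_omd(costs, n, eta)
    best_fixed = costs.sum(axis=0).min()        # best single point in hindsight
    print(f"regret vs. best fixed point: {total - best_fixed:.2f}")

MD^2 works over the true occupancy-measure polytope, where the Bregman projection is no longer a simple normalization and has to be computed approximately; the paper's regret analysis explicitly accounts for the computational cost of those approximate projections.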

Cite this Paper


BibTeX
@InProceedings{pmlr-v32-dick14,
  title     = {Online Learning in Markov Decision Processes with Changing Cost Sequences},
  author    = {Dick, Travis and Gyorgy, Andras and Szepesvari, Csaba},
  booktitle = {Proceedings of the 31st International Conference on Machine Learning},
  pages     = {512--520},
  year      = {2014},
  editor    = {Xing, Eric P. and Jebara, Tony},
  volume    = {32},
  number    = {1},
  series    = {Proceedings of Machine Learning Research},
  address   = {Bejing, China},
  month     = {22--24 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v32/dick14.pdf},
  url       = {https://proceedings.mlr.press/v32/dick14.html},
  abstract  = {In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.}
}
Endnote
%0 Conference Paper
%T Online Learning in Markov Decision Processes with Changing Cost Sequences
%A Travis Dick
%A Andras Gyorgy
%A Csaba Szepesvari
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-dick14
%I PMLR
%P 512--520
%U https://proceedings.mlr.press/v32/dick14.html
%V 32
%N 1
%X In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
RIS
TY - CPAPER
TI - Online Learning in Markov Decision Processes with Changing Cost Sequences
AU - Travis Dick
AU - Andras Gyorgy
AU - Csaba Szepesvari
BT - Proceedings of the 31st International Conference on Machine Learning
DA - 2014/01/27
ED - Eric P. Xing
ED - Tony Jebara
ID - pmlr-v32-dick14
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 32
IS - 1
SP - 512
EP - 520
L1 - http://proceedings.mlr.press/v32/dick14.pdf
UR - https://proceedings.mlr.press/v32/dick14.html
AB - In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
ER -
APA
Dick, T., Gyorgy, A. & Szepesvari, C. (2014). Online Learning in Markov Decision Processes with Changing Cost Sequences. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(1):512-520. Available from https://proceedings.mlr.press/v32/dick14.html.

Related Material

Download PDF: http://proceedings.mlr.press/v32/dick14.pdf