The adversarial stochastic shortest path problem with unknown transition probabilities

Gergely Neu, Andras Gyorgy, Csaba Szepesvari
Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, PMLR 22:805-813, 2012.

Abstract

We consider online learning in a special class of episodic Markovian decision processes, namely, loop-free stochastic shortest path problems. In this problem, an agent has to traverse a finite directed acyclic graph with random transitions while maximizing the rewards obtained along the way. We assume that the reward function can change arbitrarily between consecutive episodes, and is entirely revealed to the agent at the end of each episode. Previous work was concerned with the case where the stochastic dynamics are known ahead of time, whereas the main novelty of this paper is that this assumption is lifted. We propose an algorithm called “follow the perturbed optimistic policy” that combines ideas from the “follow the perturbed leader” method for online learning of arbitrary sequences and “upper confidence reinforcement learning”, an algorithm for regret minimization in Markovian decision processes (with a fixed reward function). We prove that the expected cumulative regret of our algorithm is of order $L|\mathcal{X}||\mathcal{A}|\sqrt{T}$ up to logarithmic factors, where $L$ is the length of the longest path in the graph, $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space and $T$ is the number of episodes. To our knowledge, this is the first algorithm that simultaneously learns and controls stochastic and adversarial components in an online fashion.
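To make the algorithmic idea concrete, below is a minimal, illustrative Python sketch (not the authors' implementation) of one episode's policy computation in the spirit of "follow the perturbed optimistic policy" on a layered, loop-free SSP. All names are placeholders I introduce for illustration: per-layer state lists (layers), empirical transition estimates p_hat[(x, a)] with UCRL-style L1 confidence radii radius[(x, a)], and FPL-style perturbed cumulative rewards r_tilde[(x, a)] (observed cumulative rewards plus, e.g., exponential noise). The backward dynamic program picks, for each state, the action that looks best under the most optimistic transition model in the confidence set.

import numpy as np

def optimistic_expectation(p_hat, radius, next_values):
    """UCRL-style inner maximization:
        max_p  sum_y p(y) * next_values[y]
        s.t.   p is a distribution with ||p - p_hat||_1 <= radius.
    Shift up to radius/2 of probability mass onto the best successor,
    then remove the resulting excess from the worst successors."""
    p = np.array(p_hat, dtype=float)
    order = np.argsort(next_values)          # worst successors first
    best = order[-1]
    p[best] = min(1.0, p[best] + radius / 2.0)
    excess = p.sum() - 1.0
    for y in order[:-1]:                     # never touch the best successor
        if excess <= 0:
            break
        cut = min(p[y], excess)
        p[y] -= cut
        excess -= cut
    return float(p @ next_values)

def perturbed_optimistic_policy(layers, actions, p_hat, radius, r_tilde):
    """Backward dynamic programming over a layered (loop-free) SSP.
    layers  : list of lists of states, layers[0] = {start}, layers[-1] = {goal}
    p_hat   : dict (x, a) -> estimated distribution over layers[k+1] (as an array
              ordered like layers[k+1])
    radius  : dict (x, a) -> L1 confidence radius of that estimate
    r_tilde : dict (x, a) -> perturbed cumulative reward (FPL-style)
    Returns a deterministic policy: dict state -> action."""
    V = {x: 0.0 for x in layers[-1]}          # terminal layer has value zero
    policy = {}
    for k in range(len(layers) - 2, -1, -1):  # walk the layers backwards
        next_values = np.array([V[y] for y in layers[k + 1]])
        for x in layers[k]:
            q = {a: r_tilde[(x, a)]
                    + optimistic_expectation(p_hat[(x, a)], radius[(x, a)], next_values)
                 for a in actions}
            policy[x] = max(q, key=q.get)
            V[x] = q[policy[x]]
    return policy

In the paper, the confidence radii shrink as state-action pairs are visited more often and the scale of the perturbations is tuned to the horizon; those details (and the resulting constants) are what drive the regret bound above, so this sketch should be read only as a structural illustration under the stated assumptions.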

Cite this Paper


BibTeX
@InProceedings{pmlr-v22-neu12,
  title     = {The adversarial stochastic shortest path problem with unknown transition probabilities},
  author    = {Neu, Gergely and Gyorgy, Andras and Szepesvari, Csaba},
  booktitle = {Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics},
  pages     = {805--813},
  year      = {2012},
  editor    = {Lawrence, Neil D. and Girolami, Mark},
  volume    = {22},
  series    = {Proceedings of Machine Learning Research},
  address   = {La Palma, Canary Islands},
  month     = {21--23 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v22/neu12/neu12.pdf},
  url       = {https://proceedings.mlr.press/v22/neu12.html},
  abstract  = {We consider online learning in a special class of episodic Markovian decision processes, namely, loop-free stochastic shortest path problems. In this problem, an agent has to traverse a finite directed acyclic graph with random transitions while maximizing the rewards obtained along the way. We assume that the reward function can change arbitrarily between consecutive episodes, and is entirely revealed to the agent at the end of each episode. Previous work was concerned with the case where the stochastic dynamics are known ahead of time, whereas the main novelty of this paper is that this assumption is lifted. We propose an algorithm called ``follow the perturbed optimistic policy'' that combines ideas from the ``follow the perturbed leader'' method for online learning of arbitrary sequences and ``upper confidence reinforcement learning'', an algorithm for regret minimization in Markovian decision processes (with a fixed reward function). We prove that the expected cumulative regret of our algorithm is of order $L|\mathcal{X}||\mathcal{A}|\sqrt{T}$ up to logarithmic factors, where $L$ is the length of the longest path in the graph, $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space and $T$ is the number of episodes. To our knowledge, this is the first algorithm that simultaneously learns and controls stochastic and adversarial components in an online fashion.}
}
APA
Neu, G., Gyorgy, A. & Szepesvari, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 22:805-813. Available from https://proceedings.mlr.press/v22/neu12.html.
