## Near-optimal Regret Bounds for Reinforcement Learning

** Thomas Jaksch, Ronald Ortner, Peter Auer**; 11(51):1563−1600, 2010.

### Abstract

For undiscounted reinforcement learning in Markov decision
processes (MDPs) we consider the *total regret* of
a learning algorithm with respect to an optimal policy.
In order to describe the transition structure of an MDP we propose a new parameter:
An MDP has *diameter* *D* if for any pair of states *s,s'* there is
a policy which moves from *s* to *s'* in at most *D* steps (on average).
We present a reinforcement learning algorithm with total regret
*Õ(DS√AT)* after *T* steps for any unknown MDP
with *S* states, *A* actions per state, and diameter *D*.
A corresponding lower bound of *Ω(√DSAT)* on the
total regret of any learning algorithm is given as well.

These results are complemented by a sample complexity bound on the
number of suboptimal steps taken by our algorithm. This bound can be
used to achieve a (gap-dependent) regret bound that is logarithmic in *T*.

Finally, we also consider a setting where the MDP is allowed to change
a fixed number of *l* times. We present a modification of our algorithm
that is able to deal with this setting and show a regret bound of
*Õ(l ^{1/3}T^{2/3}DS√A)*.

© JMLR 2010. (edit, beta) |