## Near-optimal Regret Bounds for Reinforcement Learning

** Thomas Jaksch, Ronald Ortner, Peter Auer**; 11(Apr):1563−1600, 2010.

### Abstract

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the*total regret*of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has

*diameter*

*D*if for any pair of states

*s,s'*there is a policy which moves from

*s*to

*s'*in at most

*D*steps (on average). We present a reinforcement learning algorithm with total regret

*Õ(DS√AT)*after

*T*steps for any unknown MDP with

*S*states,

*A*actions per state, and diameter

*D*. A corresponding lower bound of

*Ω(√DSAT)*on the total regret of any learning algorithm is given as well.

These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in

*T*.

Finally, we also consider a setting where the MDP is allowed to change a fixed number of

*l*times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of

*Õ(l*.

^{1/3}T^{2/3}DS√A)[abs][pdf]