## Contextual Bandit Algorithms with Supervised Learning Guarantees

**Notable paper award**

* Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, Robert Schapire*; JMLR W&CP 15:19-26, 2011.

### Abstract

We address the problem of competing with any large set of $N$ policies in the non-stochastic bandit setting, where the learner must repeatedly select among $K$ actions but observes only the reward of the chosen action. We present a modification of the Exp4 algorithm of [Auer et al. 2002] called Exp4.P, which with high probability incurs regret at most $O(\sqrt{KT\ln N})$. Such a bound does not hold for Exp4 due to the large variance of the importance-weighted estimates used in the algorithm. The new algorithm is tested empirically in a large-scale, real-world dataset. For the stochastic version of the problem, we can use Exp4.P as a subroutine to compete with a possibly infinite set of policies of VC-dimension $d$ while incurring regret at most $O(\sqrt{Td\ln T})$ with high probability. These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing guarantees for this setting that are comparable to those in standard supervised learning.[pdf][supplementary][discussion]