Contextual Bandit Algorithms with Supervised Learning Guarantees

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, Robert Schapire
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR 15:19-26, 2011.

Abstract

We address the problem of competing with any large set of $N$ policies in the non-stochastic bandit setting, where the learner must repeatedly select among $K$ actions but observes only the reward of the chosen action. We present a modification of the Exp4 algorithm of Auer et al., called Exp4.P, which with high probability incurs regret at most $O(\sqrt{KT\ln N})$. Such a bound does not hold for Exp4 due to the large variance of the importance-weighted estimates used in the algorithm. The new algorithm is tested empirically on a large-scale, real-world dataset. For the stochastic version of the problem, we can use Exp4.P as a subroutine to compete with a possibly infinite set of policies of VC dimension $d$ while incurring regret at most $O(\sqrt{Td\ln T})$ with high probability. These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing guarantees for this setting that are comparable to those in standard supervised learning.
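To make the setting concrete, the sketch below shows an Exp4-style exponential-weights update over the $N$ policies using the importance-weighted reward estimates whose variance the abstract refers to. It is a minimal illustration under assumed interfaces, not the paper's Exp4.P pseudocode, and it omits the confidence-interval correction that Exp4.P adds to obtain its high-probability bound; the names exp4_sketch, gamma, and the policies/contexts/rewards arguments are illustrative assumptions.

import numpy as np

def exp4_sketch(policies, contexts, rewards, K, gamma=0.1):
    """Exponential weights over N policies with importance-weighted reward
    estimates (a sketch of plain Exp4; Exp4.P's confidence term is omitted)."""
    N = len(policies)
    w = np.ones(N)                                     # one weight per policy
    for t, x in enumerate(contexts):
        advice = np.array([pi(x) for pi in policies])  # N x K action distributions
        q = w / w.sum()
        p = (1 - gamma) * q.dot(advice) + gamma / K    # mix in uniform exploration
        a = np.random.choice(K, p=p)
        r = rewards(t, a)                              # only the chosen action's reward is observed
        r_hat = np.zeros(K)
        r_hat[a] = r / p[a]                            # importance-weighted reward estimate
        y_hat = advice.dot(r_hat)                      # estimated reward of each policy
        w = w * np.exp(gamma * y_hat / K)
    return w / w.sum()

Here policies is a list of functions mapping a context to a length-$K$ probability vector and rewards(t, a) returns the reward in $[0, 1]$ of the action actually played. Because only that reward is observed, the r / p[a] estimate is unbiased but its variance can be on the order of 1/p[a]; controlling this variance is precisely what Exp4.P's modification addresses.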

Cite this Paper


BibTeX
@InProceedings{pmlr-v15-beygelzimer11a,
  title     = {Contextual Bandit Algorithms with Supervised Learning Guarantees},
  author    = {Beygelzimer, Alina and Langford, John and Li, Lihong and Reyzin, Lev and Schapire, Robert},
  booktitle = {Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics},
  pages     = {19--26},
  year      = {2011},
  editor    = {Gordon, Geoffrey and Dunson, David and Dudík, Miroslav},
  volume    = {15},
  series    = {Proceedings of Machine Learning Research},
  address   = {Fort Lauderdale, FL, USA},
  month     = {11--13 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v15/beygelzimer11a/beygelzimer11a.pdf},
  url       = {https://proceedings.mlr.press/v15/beygelzimer11a.html}
}
APA
Beygelzimer, A., Langford, J., Li, L., Reyzin, L. & Schapire, R. (2011). Contextual Bandit Algorithms with Supervised Learning Guarantees. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 15:19-26. Available from https://proceedings.mlr.press/v15/beygelzimer11a.html.