Adaptive Bandits: Towards the best history-dependent strategy

Maillard Odalric, Rémi Munos
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR 15:570-578, 2011.

Abstract

We consider multi-armed bandit games with possibly adaptive opponents. We introduce models $\Theta$ of constraints based on equivalence classes on the common history (the information shared by the player and the opponent), which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of the equivalence classes defined by some model $\theta^* \in \Theta$. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in $\Theta$. This allows us to model opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits, and other situations. When $\Theta=\{ \theta \}$, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time $T$) bounded by $\tilde{O}(\sqrt{TAC})$, where $A$ is the number of actions and $C$ is the number of classes of $\theta$. When many models are available, all known algorithms achieving a regret of order $O(\sqrt{T})$ are unfortunately not tractable and scale poorly with the number of models $|\Theta|$. Our contribution here is to provide tractable algorithms with regret bounded by $T^{2/3}C^{1/3}\log(|\Theta|)^{1/2}$.
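To make the single-model scenario ($\Theta=\{\theta\}$, case 1) concrete, the sketch below runs one standard Exp3 learner per equivalence class of the shared history. This is only an illustrative sketch under assumed placeholders, not the algorithm analyzed in the paper: the class map `history_class`, the reward function, the horizon, and the exploration rate are all invented for the example.

```python
import numpy as np

# Minimal sketch (not the paper's algorithm): one Exp3 learner per
# equivalence class of the common history, for a single known model theta.
# The class map, horizon, reward function and tuning below are placeholders.

A = 3          # number of actions
C = 4          # number of history classes under the model theta
T = 10_000     # horizon
rng = np.random.default_rng(0)

def history_class(history, memory=2):
    """Placeholder model theta: the class is determined by the last `memory` plays."""
    return hash(tuple(history[-memory:])) % C

# One Exp3 weight vector per class; exploration rate gamma tuned for ~T/C rounds per class.
gamma = min(1.0, np.sqrt(C * A * np.log(A) / T))
weights = np.ones((C, A))

history, total_reward = [], 0.0
for t in range(T):
    c = history_class(history)
    probs = (1 - gamma) * weights[c] / weights[c].sum() + gamma / A
    action = rng.choice(A, p=probs)

    # Constrained opponent (case 1): the reward is a stochastic function of the class.
    reward = float(rng.random() < (0.3 + 0.5 * ((action + c) % A == 0)))
    total_reward += reward

    # Importance-weighted Exp3 update, restricted to the active class.
    weights[c, action] *= np.exp(gamma * reward / (probs[action] * A))
    weights[c] /= weights[c].max()  # rescale for numerical stability (does not change probs)
    history.append((action, c))

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```

Intuitively, each class sees roughly $T/C$ rounds, so summing the per-class Exp3 guarantees yields a regret of order $\sqrt{TAC}$ against the best class-to-action mapping, which matches the rate cited above.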

Cite this Paper


BibTeX
@InProceedings{pmlr-v15-odalric11a,
  title     = {Adaptive Bandits: Towards the best history-dependent strategy},
  author    = {Odalric, Maillard and Munos, Rémi},
  booktitle = {Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics},
  pages     = {570--578},
  year      = {2011},
  editor    = {Gordon, Geoffrey and Dunson, David and Dudík, Miroslav},
  volume    = {15},
  series    = {Proceedings of Machine Learning Research},
  address   = {Fort Lauderdale, FL, USA},
  month     = {11--13 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v15/odalric11a/odalric11a.pdf},
  url       = {https://proceedings.mlr.press/v15/odalric11a.html}
}
APA
Odalric, M. & Munos, R. (2011). Adaptive Bandits: Towards the best history-dependent strategy. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 15:570-578. Available from https://proceedings.mlr.press/v15/odalric11a.html.