Beating Bandits in Gradually Evolving Worlds

Chao-Kai Chiang, Chia-Jung Lee, Chi-Jen Lu
Proceedings of the 26th Annual Conference on Learning Theory, PMLR 30:210-227, 2013.

Abstract

Consider the online convex optimization problem, in which a player has to choose actions iteratively and suffers corresponding losses according to some convex loss functions, and the goal is to minimize the regret. In the full-information setting, the player, after choosing her action, can observe the whole loss function in that round, while in the bandit setting, the only information the player can observe is the loss value of that action. Designing such bandit algorithms appears challenging, as the best regret currently achieved for general convex loss functions is much higher than that in the full-information setting, while for strongly convex loss functions, there is even a regret lower bound which is exponentially higher than that achieved in the full-information setting. To aim for smaller regrets, we adopt a relaxed two-point bandit setting in which the player can play two actions in each round and observe the loss values of those two actions. Moreover, we consider loss functions parameterized by their deviation D, which measures how fast they evolve, and we study how regrets depend on D. We show that two-point bandit algorithms can in fact achieve regrets matching those in the full-information setting in terms of D. More precisely, for convex loss functions, we achieve a regret of O(√D), while for strongly convex loss functions, we achieve a regret of O(ln D), which is much smaller than the Ω(√D) lower bound in the traditional bandit setting.
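
For readers skimming the abstract, it may help to spell out the quantities involved. The regret over T rounds is the standard comparison against the best fixed action in hindsight; the deviation D is written below in the gradual-variation form used in the same authors' earlier full-information work (COLT 2012). The second formula is an indicative formalization rather than a quotation from this paper, whose exact definition may differ slightly.

    R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x),
    \qquad
    D = \sum_{t=2}^{T} \max_{x \in \mathcal{K}} \left\| \nabla f_t(x) - \nabla f_{t-1}(x) \right\|^2 .

When consecutive loss functions barely change, D can be much smaller than T, so D-dependent bounds such as O(√D) and O(ln D) can beat the usual O(√T) and O(ln T) guarantees in slowly evolving environments.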

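The two-point feedback protocol itself is easy to demonstrate. The sketch below (Python with NumPy) runs plain online gradient descent on top of the standard two-point gradient estimator from the multi-point bandit feedback literature (in the style of Agarwal, Dekel, and Xiao, 2010): in each round the player commits to two nearby points, observes only their two loss values, and uses their difference as a gradient surrogate. All names, step sizes, and the drifting quadratic losses are illustrative assumptions; the actual algorithms of this paper are more refined and exploit the deviation D, but they operate under the protocol shown here.

    import numpy as np

    def two_point_step(loss, x, delta, rng):
        # Play the two points x + delta*u and x - delta*u for a random unit
        # direction u, observe only their two loss values, and turn the
        # difference into the classic two-point gradient estimate.
        d = x.size
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        loss_plus = loss(x + delta * u)
        loss_minus = loss(x - delta * u)
        grad_est = (d / (2.0 * delta)) * (loss_plus - loss_minus) * u
        return grad_est, 0.5 * (loss_plus + loss_minus)

    def two_point_ogd(losses, x0, eta=0.05, delta=1e-3, radius=2.0, seed=0):
        # Online gradient descent driven purely by two-point bandit feedback.
        # A generic baseline sketch, not the algorithm analyzed in the paper.
        rng = np.random.default_rng(seed)
        x = np.array(x0, dtype=float)
        paid = 0.0
        for loss in losses:
            grad_est, round_loss = two_point_step(loss, x, delta, rng)
            paid += round_loss              # player pays the average of the two plays
            x = x - eta * grad_est
            norm = np.linalg.norm(x)
            if norm > radius:               # project back onto the feasible ball
                x *= radius / norm
        return paid, x

    # A gradually evolving world: quadratic losses whose minimizer drifts a tiny
    # amount per round, so consecutive loss functions stay close to each other.
    centers = [np.array([0.5 + 0.001 * t, -0.3]) for t in range(1000)]
    losses = [lambda x, c=c: float(np.sum((x - c) ** 2)) for c in centers]
    total_loss, final_x = two_point_ogd(losses, x0=[0.0, 0.0])
    print(f"cumulative loss: {total_loss:.2f}, final point: {final_x}")

With a slowly drifting sequence of minimizers, consecutive losses change very little, which is exactly the "gradually evolving world" regime in which the D-dependent bounds are most useful.
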
Cite this Paper


BibTeX
@InProceedings{pmlr-v30-Chiang13,
  title     = {Beating Bandits in Gradually Evolving Worlds},
  author    = {Chiang, Chao-Kai and Lee, Chia-Jung and Lu, Chi-Jen},
  booktitle = {Proceedings of the 26th Annual Conference on Learning Theory},
  pages     = {210--227},
  year      = {2013},
  editor    = {Shalev-Shwartz, Shai and Steinwart, Ingo},
  volume    = {30},
  series    = {Proceedings of Machine Learning Research},
  address   = {Princeton, NJ, USA},
  month     = {12--14 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v30/Chiang13.pdf},
  url       = {https://proceedings.mlr.press/v30/Chiang13.html},
  abstract  = {Consider the online convex optimization problem, in which a player has to choose actions iteratively and suffers corresponding losses according to some convex loss functions, and the goal is to minimize the regret. In the full-information setting, the player after choosing her action can observe the whole loss function in that round, while in the bandit setting, the only information the player can observe is the loss value of that action. Designing such bandit algorithms appears challenging, as the best regret currently achieved for general convex loss functions is much higher than that in the full-information setting, while for strongly convex loss functions, there is even a regret lower bound which is exponentially higher than that achieved in the full-information setting. To aim for smaller regrets, we adopt a relaxed two-point bandit setting in which the player can play two actions in each round and observe the loss values of those two actions. Moreover, we consider loss functions parameterized by their deviation D, which measures how fast they evolve, and we study how regrets depend on D. We show that two-point bandit algorithms can in fact achieve regrets matching those in the full-information setting in terms of D. More precisely, for convex loss functions, we achieve a regret of O(\sqrt{D}), while for strongly convex loss functions, we achieve a regret of O(\ln D), which is much smaller than the Ω(\sqrt{D}) lower bound in the traditional bandit setting.}
}
APA
Chiang, C., Lee, C. & Lu, C. (2013). Beating Bandits in Gradually Evolving Worlds. Proceedings of the 26th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 30:210-227. Available from https://proceedings.mlr.press/v30/Chiang13.html.