Delay and Cooperation in Nonstochastic Bandits

Nicolò Cesa-Bianchi; Claudio Gentile; Yishay Mansour

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm, and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d+1 + \tfrac{K}{N}\alpha_{\le d}\bigr)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the $d$-th power of the communication graph $G$. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
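
To make the quantities in the regret bound concrete, the following is a minimal sketch, not the authors' code, that evaluates the expression $\sqrt{(d+1 + \tfrac{K}{N}\alpha_{\le d})(T\ln K)}$ on a small graph. The function names (`within_d_hops`, `independence_number`, `coop_bound`) and the 6-node ring are illustrative assumptions. The independence number is found by brute force, which is only feasible for tiny graphs, and the constants hidden by "of order" are ignored.

```python
import math
from collections import deque
from itertools import combinations


def within_d_hops(adj, d):
    """d-th power of a graph: u and v are adjacent iff 0 < dist(u, v) <= d.

    `adj` maps each node to the set of its neighbours (hypothetical format).
    """
    power = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search truncated at depth d
            u = queue.popleft()
            if dist[u] == d:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        power[source] = set(dist) - {source}
    return power


def independence_number(adj):
    """Exact independence number by exhaustive search (exponential time)."""
    nodes = list(adj)
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(v not in adj[u] for u, v in combinations(subset, 2)):
                return size
    return 0


def coop_bound(adj, d, K, T):
    """Evaluate sqrt((d + 1 + (K / N) * alpha_{<=d}) * T * ln K), constants omitted."""
    N = len(adj)
    alpha = independence_number(within_d_hops(adj, d))
    return math.sqrt((d + 1 + K / N * alpha) * T * math.log(K))


# Illustrative communication graph: a ring of N = 6 agents.
n = 6
ring = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

for d in range(4):
    alpha = independence_number(within_d_hops(ring, d))
    print(d, alpha, round(coop_bound(ring, d, K=6, T=1000), 1))
```

On the ring, $\alpha_{\le d}$ shrinks as $d$ grows because more agents fall within $d$ hops of each other, while the additive $d+1$ term grows. This is the trade-off that the choice of $d$ balances in the bound above.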

Abstract