Experimental Design for Regret Minimization in Linear Bandits
| Main Authors | |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | 01.11.2020 |
| DOI | 10.48550/arxiv.2011.00576 |
| Summary: | In this paper we propose a novel experimental design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While existing literature tends to focus on optimism-based algorithms, which have been shown to be suboptimal in many cases, our approach carefully plans which action to take by balancing the tradeoff between information gain and reward, overcoming the failures of optimism. In addition, we leverage tools from the theory of suprema of empirical processes to obtain regret guarantees that scale with the Gaussian width of the action set, avoiding wasteful union bounds. We provide state-of-the-art finite-time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regimes. In the combinatorial semi-bandit setting, we show that our algorithm is computationally efficient and relies only on calls to a linear maximization oracle. In addition, we show that with slight modification our algorithm can be used for pure exploration, obtaining state-of-the-art pure exploration guarantees in the semi-bandit setting. Finally, we provide, to the best of our knowledge, the first example where optimism fails in the semi-bandit regime, and show that in this setting our algorithm succeeds. |
|---|---|
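For context, the Gaussian width mentioned in the summary is a standard quantity from high-dimensional probability, not notation specific to this paper. For a compact action set A in d-dimensional Euclidean space it is defined as follows:

```latex
% Gaussian width of a set A in R^d: the expected supremum of the
% correlation between elements of A and a standard Gaussian vector.
w(\mathcal{A}) = \mathbb{E}_{\eta \sim \mathcal{N}(0, I_d)}
    \Big[ \sup_{a \in \mathcal{A}} \langle a, \eta \rangle \Big]
```

Guarantees that scale with w(A) rather than with a log-cardinality factor over the action set avoid taking a union bound over every action, which is the "wasteful union bounds" point the summary makes.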
| DOI: | 10.48550/arxiv.2011.00576 |
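The summary describes an action-selection rule that balances information gain against estimated reward. The following Python sketch is purely illustrative and is not the paper's algorithm: the function select_action, the log-det information term, and the trade-off weight gamma are all assumptions made here to show the general shape of such a rule in a linear bandit.

```python
# Illustrative only: a generic "reward + information gain" selection rule
# in the spirit of experimental-design-based bandit methods. Not the
# algorithm from the paper; select_action and gamma are hypothetical.
import numpy as np

def select_action(actions, theta_hat, V, gamma=1.0):
    """Pick the action maximizing estimated reward plus information gain.

    actions   : (n, d) array, one candidate action per row
    theta_hat : (d,) current least-squares estimate of the reward parameter
    V         : (d, d) regularized design matrix (sum of past a a^T plus I)
    gamma     : assumed trade-off weight between reward and information
    """
    V_inv = np.linalg.inv(V)
    rewards = actions @ theta_hat
    # Information gain of playing a: log det(V + a a^T) - log det(V)
    # = log(1 + a^T V^{-1} a) by the matrix determinant lemma.
    info = np.log1p(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    return int(np.argmax(rewards + gamma * info))

# Usage: with no reward information yet (theta_hat = 0), the rule reduces
# to pure information maximization, i.e. classical experimental design.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))   # 20 candidate actions in R^5
print("chosen action:", select_action(A, theta_hat=np.zeros(5), V=np.eye(5)))
```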
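The summary also states that, in the combinatorial semi-bandit setting, the algorithm relies only on calls to a linear maximization oracle. As a minimal sketch of what such an oracle is, here is one for the common top-k action set (the name topk_oracle is hypothetical): given a weight vector w, it returns the maximizer of the inner product with w over all 0/1 vectors with exactly k ones.

```python
# Hypothetical example of a linear maximization oracle for the top-k
# combinatorial action set: argmax_{a in A} <a, w> over 0/1 vectors a
# with exactly k ones, solved exactly by picking the k largest weights.
import numpy as np

def topk_oracle(w, k):
    a = np.zeros_like(w, dtype=float)
    a[np.argsort(w)[-k:]] = 1.0
    return a

w = np.array([0.2, 1.5, -0.3, 0.9])
print(topk_oracle(w, k=2))  # -> [0. 1. 0. 1.]
```

The point of oracle-based algorithms is that maximizing a linear function over such a set is tractable even when the set itself is exponentially large, which is what makes the computational-efficiency claim in the summary possible.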