A Linear Response Bandit Problem

Bibliographic Details
Published in	Stochastic Systems, Vol. 3, no. 1, pp. 230-261
Main Authors	Goldenshluger, Alexander; Zeevi, Assaf
Format	Journal Article
Language	English
Published	Institute for Operations Research and the Management Sciences (INFORMS) 01.06.2013
Subjects	bandit problems; estimation; minimax; rate-optimal policy; regret; sequential allocation
ISSN	1946-5238
DOI	10.1287/11-SSY032

Summary:	We consider a two-armed bandit problem which involves sequential sampling from two non-homogeneous populations. The response in each is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting, and develop rate-optimal policies that combine myopic action based on least squares estimates with a suitable "forced sampling" strategy. It is shown that the regret grows logarithmically in the time horizon n and no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy is subject to sampling from the inferior population at a rate that grows like [Formula: see text].
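The summary describes policies that mix a myopic (greedy) choice based on least squares estimates with forced sampling. The Python sketch below is only a rough illustration of that general idea, not the policy analyzed in the article. The power-of-two forcing schedule, the covariate distribution, the noise level, and the names `forced_arm` and `simulate` are all assumptions made for this example.

```python
import numpy as np


def _is_pow2(m):
    """True if m is a positive power of two."""
    return m > 0 and (m & (m - 1)) == 0


def forced_arm(t, warmup):
    """Arm to force at round t (0-based), or None for a greedy round.

    Hypothetical schedule: alternate arms during warm-up, then force
    arm 0 when t + 1 is a power of two and arm 1 on the following round.
    """
    if t < warmup:
        return t % 2
    p = t + 1
    if _is_pow2(p):
        return 0
    if _is_pow2(p - 1):
        return 1
    return None


def simulate(beta, n, seed=0, noise_sd=0.1):
    """Run the sketch policy for n rounds; return (regret, pulls per arm).

    beta has shape (2, d): one parameter vector per arm (unknown to the policy).
    """
    rng = np.random.default_rng(seed)
    d = beta.shape[1]
    gram = np.zeros((2, d, d))   # per-arm sum of x x^T
    xy = np.zeros((2, d))        # per-arm sum of y * x
    pulls = np.zeros(2, dtype=int)
    regret = 0.0
    for t in range(n):
        # Assumed covariates: intercept plus uniform entries on [-1, 1].
        x = np.concatenate(([1.0], rng.uniform(-1.0, 1.0, size=d - 1)))
        arm = forced_arm(t, warmup=2 * d)
        if arm is None:
            # Greedy step: least squares estimate for each arm,
            # then pick the arm with the larger predicted response.
            est = [np.linalg.lstsq(gram[k], xy[k], rcond=None)[0]
                   for k in range(2)]
            arm = int(x @ est[1] > x @ est[0])
        means = beta @ x                      # true expected responses
        y = means[arm] + noise_sd * rng.standard_normal()
        gram[arm] += np.outer(x, x)
        xy[arm] += y * x
        pulls[arm] += 1
        regret += means.max() - means[arm]
    return regret, pulls
```

For instance, `simulate(np.array([[0.0, 1.0], [0.0, -1.0]]), n=500)` runs the sketch on an instance where the better arm depends on the sign of the covariate, which mirrors the summary's remark that the identity of the sub-optimal action changes with the covariate vector.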