A Linear Response Bandit Problem
We consider a two-armed bandit problem which involves sequential sampling from two non-homogeneous populations. The response in each is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting, and develop rate-optimal policies that combine myopic action based on least squares estimates with a suitable "forced sampling" strategy. It is shown that the regret grows logarithmically in the time horizon n and no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy is subject to sampling from the inferior population at a rate that grows like $\sqrt{n}$.
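For intuition, here is a minimal sketch of the kind of policy the abstract describes: act myopically on per-arm least-squares estimates, except at forced-sampling steps that keep each arm's pull count on a logarithmic schedule. This is not the authors' exact construction; the `c * log(t)` forcing rule, the ridge regularization, and all parameter names are illustrative assumptions.

```python
import numpy as np

def run_forced_sampling_greedy(T=5000, d=3, c=2.0, sigma=0.5, seed=0):
    """Greedy two-armed linear bandit with forced sampling (sketch).

    Arm a yields reward x @ beta[a] + noise. We keep per-arm
    least-squares estimates and act myopically, except that any arm
    lagging a c*log(t) pull schedule is sampled by force.
    """
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=(2, d))             # unknown true parameters
    A = [np.eye(d) * 1e-3 for _ in range(2)]   # ridge-regularized Gram matrices
    b = [np.zeros(d) for _ in range(2)]
    pulls = [0, 0]
    regret = 0.0
    for t in range(1, T + 1):
        x = rng.normal(size=d)                 # fresh covariate vector
        # Forced sampling: pull any arm behind the log schedule.
        forced = [a for a in range(2) if pulls[a] < c * np.log(t + 1)]
        if forced:
            a = forced[0]
        else:
            # Myopic step: pick the arm with the larger estimated response.
            est = [x @ np.linalg.solve(A[a], b[a]) for a in range(2)]
            a = int(np.argmax(est))
        r = x @ beta[a] + sigma * rng.normal()
        A[a] += np.outer(x, x)                 # update least-squares statistics
        b[a] += r * x
        pulls[a] += 1
        regret += max(x @ beta[0], x @ beta[1]) - x @ beta[a]
    return regret
```

Note how the identity of the better arm depends on the realized covariate x, which is why even an optimal policy ends up pulling the "inferior" arm on covariates near the decision boundary.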
| Published in | Stochastic Systems Vol. 3, No. 1, pp. 230-261 |
|---|---|
| Main Authors | Goldenshluger, Alexander; Zeevi, Assaf |
| Format | Journal Article |
| Language | English |
| Published | Institute for Operations Research and the Management Sciences (INFORMS), 01.06.2013 |
| ISSN | 1946-5238 |
| DOI | 10.1287/11-SSY032 |