Closing the Gap: A Learning Algorithm for Lost-Sales Inventory Systems with Lead Times
| Published in | Management Science, Vol. 66, no. 5, pp. 1962-1980 |
|---|---|
| Main Authors | , , | 
| Format | Journal Article | 
| Language | English | 
| Published | Linthicum: INFORMS (Institute for Operations Research and the Management Sciences), 01.05.2020 |
| Subjects | |
| ISSN | 0025-1909 1526-5501  | 
| DOI | 10.1287/mnsc.2019.3288 | 
| Summary: | We consider a periodic-review, single-product inventory system with lost sales and positive lead times under censored demand. In contrast to the classical inventory literature, we assume the firm does not know the demand distribution a priori and makes an adaptive inventory-ordering decision in each period based only on the past sales (censored demand) data. The standard performance measure is regret, which is the cost difference between a learning algorithm and the clairvoyant (full-information) benchmark. When the benchmark is chosen to be the (full-information) optimal base-stock policy, Huh et al. [Huh WT, Janakiraman G, Muckstadt JA, Rusmevichientong P (2009a) An adaptive algorithm for finding the optimal base-stock policy in lost sales inventory systems with censored demand. Math. Oper. Res. 34(2):397–416.] developed a nonparametric learning algorithm with a cubic-root convergence rate on regret. An important open question is whether there exists a nonparametric learning algorithm whose regret rate matches the theoretical lower bound of any learning algorithm. In this work, we provide an affirmative answer to this question. More precisely, we propose a new nonparametric algorithm termed the simulated cycle-update policy and establish a square-root convergence rate on regret, which is proven to be the lower bound of any learning algorithm. Our algorithm uses a random cycle-updating rule based on an auxiliary simulated system running in parallel and also involves two new concepts, namely the withheld on-hand inventory and the double-phase cycle gradient estimation. The techniques developed are effective for learning a stochastic system with complex system dynamics and lasting impact of decisions. This paper was accepted by Yinyu Ye, optimization. |
|---|---|
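
The setting described in the summary (lost sales, a positive lead time, censored demand, and regret against a base-stock benchmark) can be illustrated with a small simulation. The sketch below is not the simulated cycle-update policy from the paper; it only simulates a fixed base-stock (order-up-to) level and measures its cost gap to the best level found by grid search on the same demand sample, which is an empirical stand-in for the clairvoyant benchmark. The function names, the holding cost `h`, the lost-sales penalty `p`, and the starting inventory are all assumptions made for illustration.

```python
import random


def simulate_base_stock(S, demands, lead_time, h=1.0, p=4.0):
    """Run an order-up-to-S policy with lost sales; return (avg cost, sales)."""
    on_hand = S                  # assumption: the system starts with S units on hand
    pipeline = [0] * lead_time   # outstanding orders, oldest first
    total_cost = 0.0
    sales_history = []
    for d in demands:
        # Order enough to bring on-hand plus in-transit stock back up to S.
        order = max(S - on_hand - sum(pipeline), 0)
        pipeline.append(order)
        on_hand += pipeline.pop(0)   # the oldest outstanding order arrives
        sales = min(d, on_hand)      # the firm observes sales, not demand
        on_hand -= sales
        # Holding cost on leftover stock plus penalty on unmet (lost) demand.
        total_cost += h * on_hand + p * (d - sales)
        sales_history.append(sales)
    return total_cost / len(demands), sales_history


def best_base_stock(demands, lead_time, candidates, h=1.0, p=4.0):
    """Grid-search the base-stock level with the lowest average cost."""
    costs = {S: simulate_base_stock(S, demands, lead_time, h, p)[0]
             for S in candidates}
    best = min(costs, key=costs.get)
    return best, costs[best]


def empirical_regret(S, demands, lead_time, candidates, h=1.0, p=4.0):
    """Average-cost gap between level S and the best candidate level."""
    cost, _ = simulate_base_stock(S, demands, lead_time, h, p)
    _, best_cost = best_base_stock(demands, lead_time, candidates, h, p)
    return cost - best_cost


if __name__ == "__main__":
    rng = random.Random(0)
    demands = [rng.randint(0, 10) for _ in range(5000)]
    candidates = range(0, 41)
    best_S, best_cost = best_base_stock(demands, 2, candidates)
    print("best level on this sample:", best_S, "avg cost:", best_cost)
    print("regret of S=5:", empirical_regret(5, demands, 2, candidates))
```

A learning algorithm in this setting would only see `sales_history`, never `demands`, which is what makes the problem harder than the full-information case: whenever stock runs out, the observed sales understate the true demand.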