Closing the Gap: A Learning Algorithm for Lost-Sales Inventory Systems with Lead Times

Bibliographic Details
Published in: Management Science, Vol. 66, No. 5, pp. 1962-1980
Main Authors: Zhang, Huanan; Chao, Xiuli; Shi, Cong
Format: Journal Article
Language: English
Published: Linthicum: INFORMS (Institute for Operations Research and the Management Sciences), 01.05.2020
ISSN: 0025-1909; 1526-5501
DOI: 10.1287/mnsc.2019.3288

Summary: We consider a periodic-review, single-product inventory system with lost sales and positive lead times under censored demand. In contrast to the classical inventory literature, we assume the firm does not know the demand distribution a priori and makes an adaptive inventory-ordering decision in each period based only on past sales (censored demand) data. The standard performance measure is regret, the cost difference between a learning algorithm and the clairvoyant (full-information) benchmark. When the benchmark is chosen to be the (full-information) optimal base-stock policy, Huh et al. [Huh WT, Janakiraman G, Muckstadt JA, Rusmevichientong P (2009a) An adaptive algorithm for finding the optimal base-stock policy in lost sales inventory systems with censored demand. Math. Oper. Res. 34(2):397-416.] developed a nonparametric learning algorithm with a cubic-root convergence rate on regret. An important open question is whether there exists a nonparametric learning algorithm whose regret rate matches the theoretical lower bound for any learning algorithm. In this work, we provide an affirmative answer to this question. More precisely, we propose a new nonparametric algorithm, termed the simulated cycle-update policy, and establish a square-root convergence rate on regret, which is proven to be the lower bound for any learning algorithm. Our algorithm uses a random cycle-updating rule based on an auxiliary simulated system running in parallel and involves two new concepts, namely the withheld on-hand inventory and the double-phase cycle gradient estimation. The techniques developed are effective for learning a stochastic system with complex dynamics and a lasting impact of decisions. This paper was accepted by Yinyu Ye, optimization.
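
To make the system dynamics concrete, the following is a minimal Python sketch of the model the summary describes: a periodic-review, lost-sales inventory system with a positive lead time, operated under a fixed base-stock policy, where only sales (censored demand) are observed. It illustrates the simulation environment only and is not the paper's simulated cycle-update policy; all names and parameters (simulate_lost_sales, base_stock, lead_time, h, p, demand_sampler) are assumptions chosen for this example.

import random


def simulate_lost_sales(base_stock, lead_time, horizon, h, p, demand_sampler, seed=0):
    """Simulate a periodic-review, lost-sales system under a fixed base-stock policy.

    Only sales (censored demand) are observed by the firm. h is the per-unit
    holding cost and p the per-unit lost-sales penalty. Illustrative sketch,
    not the paper's simulated cycle-update policy.
    """
    rng = random.Random(seed)
    on_hand = base_stock            # inventory physically on hand
    pipeline = [0] * lead_time      # outstanding orders, oldest first (lead_time >= 1)
    total_cost = 0.0
    sales_history = []              # censored demand observations

    for _ in range(horizon):
        # 1. Receive the order placed lead_time periods ago.
        on_hand += pipeline.pop(0)

        # 2. Base-stock policy: order up to base_stock over the inventory position.
        position = on_hand + sum(pipeline)
        pipeline.append(max(base_stock - position, 0))

        # 3. Demand realizes; only sales = min(demand, on_hand) are observed.
        demand = demand_sampler(rng)
        sales = min(demand, on_hand)
        lost = demand - sales       # unobservable to the firm under censoring
        on_hand -= sales
        sales_history.append(sales)

        # 4. Per-period cost: holding on leftover stock plus lost-sales penalty.
        total_cost += h * on_hand + p * lost

    return total_cost / horizon, sales_history


if __name__ == "__main__":
    avg_cost, _ = simulate_lost_sales(
        base_stock=12, lead_time=2, horizon=10_000,
        h=1.0, p=9.0, demand_sampler=lambda rng: rng.randint(0, 10),
    )
    print(f"average per-period cost: {avg_cost:.2f}")

A learning algorithm in this setting, such as the adaptive base-stock method of Huh et al. or the simulated cycle-update policy proposed in the paper, would adjust the base-stock level over time using only sales_history; the regret is the cumulative cost gap relative to running the system under the full-information optimal base-stock level.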