An online algorithm for the risk-aware restless bandit

Bibliographic Details
Published in: European Journal of Operational Research, Vol. 290, No. 2, pp. 622-639
Main Authors: Xu, Jianyu; Chen, Lujie; Tang, Ou
Format: Journal Article
Language: English
Published: Elsevier B.V., 16.04.2021
ISSN: 0377-2217, 1872-6860
DOI: 10.1016/j.ejor.2020.08.028

Summary:
•We introduce a general risk measure for evaluating the performance of both each arm and any observation sequence; it is required to satisfy a single condition, stated formally as an assumption that is consistent with related research. We then define a proxy regret function and formulate a risk-aware restless bandit model, whose optimal policy plays the single best arm, i.e., the arm with the lowest risk.
•We construct an index-based algorithm for this model, the risk-aware regeneration cycle algorithm (RARCA). It combines the classical lower confidence bound (LCB) algorithm with the regeneration cycle method of an existing benchmark study and extends both to the risk-aware case (an illustrative sketch of the index idea follows this summary). We then prove a logarithmic upper bound on the regret, so the algorithm attains a logarithmic regret rate.
•We analyze several specific and popular risk measures and show that a wide family of risk measures fits into our framework, including mean-deviation, shortfall and a general family of risks admitting a discrete Kusuoka representation.
The multi-armed bandit (MAB) is a classical model for the exploration vs. exploitation trade-off. Among existing MAB models, the restless bandit model is of increasing interest because of its dynamic nature, which makes it highly applicable in practice. Like other MAB models, the traditional (risk-neutral) restless bandit model searches for the arm with the lowest mean cost and does not consider risk aversion, which is critical in settings such as clinical trials and financial investment. This limitation hinders the application of the traditional restless bandit. Motivated by these concerns, we introduce a general risk measure satisfying a mild restriction and use it to formulate a risk-aware restless bandit model; in particular, we take a risk measure, rather than the expectation as in the traditional case, as the criterion for the performance of each arm. Compared with classical MAB models, our model settings accommodate risk-aware researchers and decision makers. We present an index-based online algorithm for the problem and derive an upper bound on its regret. We conclude that the algorithm retains an instance-based regret of order O(log T / T), which is consistent with the classical MAB model. Further, some specific risk measures, namely mean-deviation, shortfall and the discrete Kusuoka risk measure, are used to demonstrate the details of our framework.
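The record above describes RARCA only at a high level; neither its exact index nor its regeneration-cycle estimator is given here. The following minimal Python sketch merely illustrates the general shape of an index-based, risk-aware selection rule of the kind the summary alludes to: each arm's empirical risk (here a mean-deviation estimate with a hypothetical weight lam) is reduced by a generic LCB-style confidence radius, and the arm with the lowest index is pulled. The function names, the confidence term and the mean-deviation weight are illustrative assumptions, not the paper's algorithm.

import numpy as np

def mean_deviation(samples, lam=0.5):
    # Empirical mean-deviation risk of a cost sample: mean plus lam times the
    # mean absolute deviation (lam is an illustrative risk-aversion weight).
    m = np.mean(samples)
    return m + lam * np.mean(np.abs(samples - m))

def risk_aware_lcb(arms, horizon, lam=0.5):
    # Toy risk-aware lower-confidence-bound loop over cost-generating arms.
    # Each element of `arms` is a callable returning one cost sample per pull.
    # The arm with the smallest (empirical risk - confidence radius) index is
    # pulled next; the sqrt(2 log t / n) radius is only a generic placeholder,
    # not the regeneration-cycle analysis used by RARCA.
    history = [[arm()] for arm in arms]              # pull every arm once
    for t in range(len(arms) + 1, horizon + 1):
        indices = []
        for samples in history:
            risk = mean_deviation(np.asarray(samples), lam)
            radius = np.sqrt(2.0 * np.log(t) / len(samples))
            indices.append(risk - radius)            # optimistic (low) risk index
        k = int(np.argmin(indices))
        history[k].append(arms[k]())
    return [len(h) for h in history]                 # pull counts per arm

# Two cost arms with the same mean but different spread: a risk-aware learner
# should concentrate its pulls on the low-variance arm.
rng = np.random.default_rng(1)
arms = [lambda: rng.normal(1.0, 0.1), lambda: rng.normal(1.0, 1.0)]
print(risk_aware_lcb(arms, horizon=2000))

In the restless setting studied by the paper, observations within an arm are Markovian rather than i.i.d., which is presumably why the regeneration cycle method (not reproduced in this sketch) is needed to obtain valid confidence estimates.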