Exp3.P-based Autonomous Decision Algorithm against Non-stationary Opponents with Partially Known Policies

Bibliographic Details
Published in: IEEE Transactions on Games, pp. 1-18
Main Authors: Zhu, Jin; Du, Chunhui; Chen, Jiacheng; Huang, Lei; Dullerud, Geir E.
Format: Journal Article
Language: English
Published: IEEE, 2025
ISSN: 2475-1502, 2475-1510
DOI: 10.1109/TG.2025.3579719

Summary: This paper considers multi-agent games in which the opponents can change their policies and the opponents' policy sets are only partially known. Our goal is to generate an effective policy so that our agent obtains a higher reward while guaranteeing bounded regret. For such games against non-stationary opponents with partially known policies, we propose the Exp3.P-based Autonomous Decision (EAD) algorithm, which consists of three steps. First, we learn an embedding of the opponent's policy via a conditional encoder-decoder and employ conditional reinforcement learning to generate the targeted policy. Second, we estimate the opponent's policy through online Bayesian belief updates. Finally, we select between the adversarial and targeted policies via a multi-armed bandit algorithm. We analyze the EAD algorithm theoretically: we derive a lower bound on the expected reward when using the targeted policy and prove that the EAD algorithm has bounded regret. Experimental results on Kuhn poker, Grid-world Predator-Prey, and Grid world demonstrate the effectiveness of the proposed EAD algorithm.
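The second and third steps of the summary can be illustrated with two standard components: an online Bayesian belief update over a finite set of candidate opponent policies, and the classical Exp3.P bandit update choosing among candidate response policies. This is a minimal sketch of those generic building blocks, not the paper's implementation; the names `bayes_update` and `Exp3P` and all parameter values are illustrative assumptions.

```python
import math
import random

def bayes_update(belief, likelihoods):
    """One online Bayesian belief update over K candidate opponent policies.

    belief: prior probability of each candidate policy.
    likelihoods: P(observed opponent action | policy k) for each k.
    """
    posterior = [b * l for b, l in zip(belief, likelihoods)]
    z = sum(posterior)
    if z == 0.0:                      # observation impossible under every candidate
        return list(belief)
    return [p / z for p in posterior]

class Exp3P:
    """Exp3.P over K arms (here: candidate policies), rewards in [0, 1]."""

    def __init__(self, n_arms, horizon, gamma=0.1, alpha=2.0):
        self.k, self.n = n_arms, horizon
        self.gamma, self.alpha = gamma, alpha
        # Standard Exp3.P weight initialisation.
        init = math.exp((alpha * gamma / 3.0) * math.sqrt(horizon / n_arms))
        self.w = [init] * n_arms

    def probs(self):
        """Mix the weight distribution with uniform exploration."""
        total = sum(self.w)
        return [(1.0 - self.gamma) * wi / total + self.gamma / self.k
                for wi in self.w]

    def select(self, rng=random):
        p = self.probs()
        arm = rng.choices(range(self.k), weights=p)[0]
        return arm, p

    def update(self, arm, reward, p):
        for j in range(self.k):
            # Importance-weighted reward estimate for the pulled arm,
            # plus the Exp3.P optimism bonus for every arm.
            xhat = reward / p[j] if j == arm else 0.0
            bonus = self.alpha / (p[j] * math.sqrt(self.n * self.k))
            self.w[j] *= math.exp((self.gamma / (3.0 * self.k)) * (xhat + bonus))
        m = max(self.w)               # rescale to avoid overflow; probs unchanged
        self.w = [wi / m for wi in self.w]
```

Used together, the belief tracks which known opponent policy is active, while repeated `select`/`update` calls shift the selection probability toward whichever response policy (adversarial or targeted) is currently earning more reward.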