Exp3.P-based Autonomous Decision Algorithm against Non-stationary Opponents with Partially Known Policies

Bibliographic Details
Published in: IEEE Transactions on Games, pp. 1-18
Main Authors: Zhu, Jin; Du, Chunhui; Chen, Jiacheng; Huang, Lei; Dullerud, Geir E.
Format: Journal Article
Language: English
Published: IEEE, 2025
ISSN: 2475-1502, 2475-1510
DOI: 10.1109/TG.2025.3579719

Summary: This paper considers multi-agent games in which the opponents can change their policies and the opponents' policy sets are only partially known. Our goal is to generate an effective policy so that our agent obtains a higher reward while guaranteeing bounded regret. For such games against non-stationary opponents with partially known policies, we propose the Exp3.P-based Autonomous Decision (EAD) algorithm, which consists of three steps. First, we learn an embedding of the opponent's policy via a conditional encoder-decoder and employ conditional reinforcement learning to generate the targeted policy. Second, we estimate the opponent's policy through online Bayesian belief updates. Finally, we select between the adversarial and targeted policies via a multi-armed bandit algorithm. We analyze the EAD algorithm theoretically: we derive a lower bound on the expected reward when using the targeted policy and prove that the EAD algorithm has bounded regret. Experimental results on Kuhn poker, Grid-world Predator-Prey, and Grid world demonstrate the effectiveness of the proposed EAD algorithm.
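The second and third steps of the summary can be illustrated with two standard components: an online Bayesian belief update over a finite set of candidate opponent policies, and the classical Exp3.P bandit update choosing among candidate response policies. This is a minimal sketch of those generic building blocks, not the paper's implementation; the names `bayes_update` and `Exp3P` and all parameter values are illustrative assumptions.

```python
import math
import random

def bayes_update(belief, likelihoods):
    """One online Bayesian belief update over K candidate opponent policies.

    belief: prior probability of each candidate policy.
    likelihoods: P(observed opponent action | policy k) for each k.
    """
    posterior = [b * l for b, l in zip(belief, likelihoods)]
    z = sum(posterior)
    if z == 0.0:                      # observation impossible under every candidate
        return list(belief)
    return [p / z for p in posterior]

class Exp3P:
    """Exp3.P over K arms (here: candidate policies), rewards in [0, 1]."""

    def __init__(self, n_arms, horizon, gamma=0.1, alpha=2.0):
        self.k, self.n = n_arms, horizon
        self.gamma, self.alpha = gamma, alpha
        # Standard Exp3.P weight initialisation.
        init = math.exp((alpha * gamma / 3.0) * math.sqrt(horizon / n_arms))
        self.w = [init] * n_arms

    def probs(self):
        """Mix the weight distribution with uniform exploration."""
        total = sum(self.w)
        return [(1.0 - self.gamma) * wi / total + self.gamma / self.k
                for wi in self.w]

    def select(self, rng=random):
        p = self.probs()
        arm = rng.choices(range(self.k), weights=p)[0]
        return arm, p

    def update(self, arm, reward, p):
        for j in range(self.k):
            # Importance-weighted reward estimate for the pulled arm,
            # plus the Exp3.P optimism bonus for every arm.
            xhat = reward / p[j] if j == arm else 0.0
            bonus = self.alpha / (p[j] * math.sqrt(self.n * self.k))
            self.w[j] *= math.exp((self.gamma / (3.0 * self.k)) * (xhat + bonus))
        m = max(self.w)               # rescale to avoid overflow; probs unchanged
        self.w = [wi / m for wi in self.w]
```

Used together, the belief tracks which known opponent policy is active, while repeated `select`/`update` calls shift the selection probability toward whichever response policy (adversarial or targeted) is currently earning more reward.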