Reward Function Design Method for Long Episode Pursuit Tasks Under Polar Coordinate in Multi-Agent Reinforcement Learning
| Published in | Shanghai Jiao Tong Da Xue Xue Bao (Journal of Shanghai Jiao Tong University (Science)), Vol. 29, No. 4, pp. 646-655 |
|---|---|
| Main Authors | , , , , , |
| Format | Journal Article |
| Language | English |
| Published | Shanghai: Shanghai Jiaotong University Press; Springer Nature B.V., 01.08.2024 |
| ISSN | 1007-1172; 1995-8188 |
| DOI | 10.1007/s12204-024-2713-4 |
| Summary | Multi-agent reinforcement learning has recently been applied to pursuit problems. However, it suffers from a large number of time steps per training episode, and therefore often struggles to converge, resulting in low rewards and agents that fail to learn effective strategies. This paper proposes a deep reinforcement learning (DRL) training method that employs an ensemble segmented multi-reward function design to address this convergence problem. The ensemble reward function combines the advantages of two reward functions, which enhances the training of agents over long episodes. We also eliminate the non-monotonic behavior that the trigonometric functions in the traditional 2D polar coordinate observation representation introduce into the reward function. Experimental results demonstrate that this method outperforms the traditional single-reward-function mechanism in the pursuit scenario by improving the agents' policy scores on the task. These ideas offer a solution to the convergence challenges faced by DRL models in long episode pursuit problems, leading to improved model training performance. |
|---|---|
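
The summary only outlines the method at a high level. As a rough, hypothetical illustration of the two ideas it names (an ensemble of dense and sparse reward segments for long episodes, and a polar-coordinate bearing term that stays monotonic rather than passing through a trigonometric function), the Python sketch below is one minimal reading of the abstract, not the authors' implementation; all names, weights, and thresholds (`ensemble_segmented_reward`, `capture_radius`, the 0.1 alignment weight) are assumptions.

```python
import numpy as np

def polar_observation(pursuer_xy, evader_xy, pursuer_heading):
    # Relative observation in 2D polar coordinates: distance and bearing.
    dx, dy = np.asarray(evader_xy, dtype=float) - np.asarray(pursuer_xy, dtype=float)
    distance = float(np.hypot(dx, dy))
    # Wrap the bearing to [-pi, pi). Feeding the wrapped angle itself into the
    # reward (rather than cos/sin of it) keeps a term like -|bearing| monotone
    # in the heading error, which is one reading of the abstract's claim that
    # trigonometric functions make the reward non-monotonic.
    bearing = (np.arctan2(dy, dx) - pursuer_heading + np.pi) % (2.0 * np.pi) - np.pi
    return distance, bearing

def ensemble_segmented_reward(distance, bearing, prev_distance,
                              capture_radius=1.0, capture_bonus=10.0):
    # Dense segment: per-step progress toward the evader plus heading
    # alignment, providing a learning signal throughout a long episode.
    dense = (prev_distance - distance) - 0.1 * abs(bearing) / np.pi
    # Sparse segment: a large terminal bonus awarded only on capture, so the
    # true objective still dominates the shaping terms at episode end.
    sparse = capture_bonus if distance < capture_radius else 0.0
    return dense + sparse

if __name__ == "__main__":
    d, b = polar_observation((0.0, 0.0), (3.0, 4.0), pursuer_heading=0.0)
    prev_d = 5.5  # distance at the previous step
    print(ensemble_segmented_reward(d, b, prev_d))
```

The design choice sketched here is that the dense segment rewards incremental progress at every step while the sparse segment anchors the final objective, which is one common way to keep returns informative over long pursuit episodes.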