Hierarchical Reinforcement Learning Based on Planning Operators

Bibliographic Details
Published in: IEEE International Conference on Automation Science and Engineering (CASE), pp. 2006 - 2012
Main Authors: Zhang, Jing; Dean, Emmanuel; Ramirez-Amaro, Karinne
Format: Conference Proceeding
Language: English
Published: IEEE, 28.08.2024
ISSN: 2161-8089, 2161-8070
DOI: 10.1109/CASE59546.2024.10711595

Summary: Learning long-horizon manipulation tasks, such as stacking, presents a longstanding challenge in the field of robotic manipulation, particularly when using Reinforcement Learning (RL) methods. RL algorithms focus on learning a policy for executing the entire task instead of learning the correct sequence of actions required to achieve complex goals. While RL aims to find a sequence of actions that maximises the total reward of the task, the main challenge arises when there are infinite possibilities of chaining these actions (e.g. reach, grasp, etc.) to achieve the same task (stacking). In these cases, RL methods may struggle to find the optimal policy. This paper introduces a novel framework that integrates operator concepts from the symbolic planning domain with hierarchical RL methods. We propose to change the way complex tasks are trained by learning independent policies for the actions defined by high-level operators instead of learning a single policy for the complete complex task. Our contribution integrates planning operators (e.g. preconditions and effects) as part of the hierarchical RL algorithm based on the Scheduled Auxiliary Control (SAC-X) method. We developed a dual-purpose high-level operator, which can be used both in holistic planning and as an independent, reusable policy. Our approach offers a flexible solution for long-horizon tasks, e.g. stacking and inserting a cube. The experimental results show that our proposed method achieved an average success rate of 97.2% for learning and executing the whole stacking task. Furthermore, we obtain high success rates when learning independent policies, e.g. reach (98.9%), lift (99.7%), and move (97.4%). The training time is also reduced by 68% when using our proposed approach.
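
To make the operator idea concrete, below is a minimal Python sketch of how a high-level planning operator with symbolic preconditions and effects could wrap an independently learned sub-policy, and how such operators could be chained to reach a stacking goal. Everything here (the Operator dataclass, greedy_schedule, the predicate names, and the stub policies) is an illustrative assumption; it is not the authors' code and does not reflect the actual SAC-X implementation.

# Illustrative sketch only (hypothetical names, not from the paper):
# a high-level operator couples symbolic preconditions/effects with a
# low-level learned sub-policy, so it can serve both planning and control.
from dataclasses import dataclass
from typing import Callable, Dict, Set, List

State = Dict[str, bool]  # symbolic predicates, e.g. {"holding_cube": True}

@dataclass
class Operator:
    name: str
    preconditions: Set[str]          # predicates that must hold before execution
    effects: Dict[str, bool]         # predicates asserted after successful execution
    policy: Callable[[State], str]   # stand-in for the learned sub-policy

    def applicable(self, state: State) -> bool:
        return all(state.get(p, False) for p in self.preconditions)

    def execute(self, state: State) -> State:
        # A real system would roll out the RL sub-policy here; this sketch
        # only applies the symbolic effects to advance the plan.
        assert self.applicable(state), f"preconditions of {self.name} not met"
        self.policy(state)
        return {**state, **self.effects}

# Hypothetical operators for a cube-stacking task.
reach = Operator("reach", set(), {"at_cube": True}, lambda s: "reach_cmd")
grasp = Operator("grasp", {"at_cube"}, {"holding_cube": True}, lambda s: "grasp_cmd")
lift  = Operator("lift", {"holding_cube"}, {"cube_lifted": True}, lambda s: "lift_cmd")
move  = Operator("move", {"cube_lifted"}, {"above_target": True}, lambda s: "move_cmd")
place = Operator("place", {"above_target", "holding_cube"},
                 {"stacked": True, "holding_cube": False}, lambda s: "place_cmd")

def greedy_schedule(state: State, operators: List[Operator], goal: str) -> List[str]:
    """Chain operators by checking preconditions and effects until the goal holds."""
    plan: List[str] = []
    while not state.get(goal, False):
        op = next(o for o in operators if o.applicable(state) and o.name not in plan)
        state = op.execute(state)
        plan.append(op.name)
    return plan

if __name__ == "__main__":
    # Prints ['reach', 'grasp', 'lift', 'move', 'place'] for the stacking goal.
    print(greedy_schedule({}, [reach, grasp, lift, move, place], "stacked"))

In this reading, the preconditions and effects give the planner a symbolic interface to each learned skill, which is why the same operator can be reused across tasks such as stacking and inserting.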