A Generic Markov Decision Process Model and Reinforcement Learning Method for Scheduling Agile Earth Observation Satellites


Bibliographic Details
Published in: IEEE Transactions on Systems, Man, and Cybernetics: Systems, Vol. 52, no. 3, pp. 1463–1474
Main Authors: He, Yongming; Xing, Lining; Chen, Yingwu; Pedrycz, Witold; Wang, Ling; Wu, Guohua
Format: Journal Article
Language: English
Published: New York: IEEE, 01.03.2022
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 2168-2216, 2168-2232
DOI: 10.1109/TSMC.2020.3020732

More Information
Summary: We investigate a general solution based on reinforcement learning for the agile satellite scheduling problem. The core idea of this method is to learn, from experience, a value function that evaluates the long-term benefit of a given state, and then to apply this value function to guide decisions in unseen situations. First, the agile satellite scheduling process is modeled as a finite Markov decision process with a continuous state space and a discrete action space. The two subproblems of the agile Earth observation satellite scheduling problem, i.e., the sequencing problem and the timing problem, are solved by the agent and the environment parts of the model, respectively. A satisfactory solution to the timing problem can be produced quickly by a constructive heuristic algorithm. The objective is to maximize the total reward of the entire scheduling process. Based on this design, we demonstrate that a Q-network has advantages in fitting the long-term benefit of such problems, and we train the Q-network by Q-learning. The experimental results show that the trained Q-network copes efficiently with unseen data and can generate high total profit in a short time. The method has good scalability and can be applied to different types of satellite scheduling problems by customizing only the constraint-checking process and the reward signals.
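The abstract describes sequencing decisions made by a Q-learning agent over a continuous state space and a discrete action space. As a rough illustration of that update rule only, the sketch below uses linear function approximation (one weight vector per action) with a TD(0) Q-learning step; the state features, action meanings, reward signal, and all dimensions are hypothetical placeholders, not the paper's actual Q-network architecture or scheduling environment.

```python
import numpy as np

# Hypothetical dimensions: 4 continuous state features (e.g., remaining
# time windows, resource levels) and 3 discrete candidate actions.
STATE_DIM = 4
N_ACTIONS = 3
ALPHA, GAMMA = 0.1, 0.95  # learning rate and discount factor

rng = np.random.default_rng(0)
W = np.zeros((N_ACTIONS, STATE_DIM))  # one linear weight vector per action


def q_values(state):
    """Q(s, a) for every action under the linear approximator: W @ s."""
    return W @ state


def q_learning_step(state, action, reward, next_state, done):
    """One TD(0) Q-learning update: w_a += alpha * td_error * s."""
    target = reward + (0.0 if done else GAMMA * q_values(next_state).max())
    td_error = target - q_values(state)[action]
    W[action] += ALPHA * td_error * state
    return td_error


# Toy interaction loop standing in for the scheduling environment
# (epsilon-greedy exploration omitted for brevity; actions are random).
for _ in range(200):
    s = rng.random(STATE_DIM)
    a = int(rng.integers(N_ACTIONS))
    r = float(s.sum()) if a == 0 else 0.0  # fake reward signal
    s_next = rng.random(STATE_DIM)
    q_learning_step(s, a, r, s_next, done=False)

print(W.shape)  # (3, 4)
```

In the paper's setting the linear approximator would be replaced by the trained Q-network, and the reward would come from the timing subproblem solved by the environment's constructive heuristic; this sketch only shows the shape of the value-update loop.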