Deep Q‐learning: A robust control approach
| Published in | International Journal of Robust and Nonlinear Control, Vol. 33, No. 1, pp. 526–544 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | Bognor Regis: Wiley Subscription Services, Inc, 10.01.2023 |
| ISSN | 1049-8923, 1099-1239 |
| DOI | 10.1002/rnc.6457 |
| Summary: | This work aims at constructing a bridge between robust control theory and reinforcement learning. Although reinforcement learning has shown admirable results in complex control tasks, the agent's learning behavior is opaque. Meanwhile, system theory has several tools for analyzing and controlling dynamical systems. This article places deep Q‐learning into a control‐oriented perspective to study its learning dynamics with well‐established techniques from robust control. An uncertain linear time‐invariant model is formulated by means of the neural tangent kernel to describe learning. This novel approach makes it possible to give conditions for stability (convergence) of the learning and enables analysis of the agent's behavior in the frequency domain. The control‐oriented approach makes it possible to formulate robust controllers that inject dynamical rewards as control input in the loss function to achieve better convergence properties. Three output‐feedback controllers are synthesized: a gain‐scheduled $\mathscr{H}_2$ controller, a dynamical $\mathscr{H}_\infty$ controller, and a fixed‐structure $\mathscr{H}_\infty$ controller. Compared to traditional deep Q‐learning techniques, which involve several heuristics, setting up the learning agent with a control‐oriented tuning methodology is more transparent and rests on well‐established literature. The proposed approach does not use a target network or a randomized replay memory. The role of the target network is taken over by the control input, which also exploits the temporal dependency of samples (as opposed to a randomized memory buffer). Numerical simulations in different OpenAI Gym environments suggest that $\mathscr{H}_\infty$‐controlled learning can converge faster and achieve higher scores (depending on the environment) than the benchmark double deep Q‐learning. |
|---|---|
| Bibliography: | Funding information: Chalmers Tekniska Högskola |
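The summary above describes modeling the dynamics of deep Q‐learning, linearized through the neural tangent kernel, as an uncertain linear time‐invariant system whose input is a dynamical reward injected into the loss. The Python sketch below is a hypothetical, heavily simplified illustration of that idea and is not the paper's implementation: a linearized Q‐function yields LTI error dynamics driven by the NTK Gram matrix, and a plain proportional controller stands in for the synthesized $\mathscr{H}_2$/$\mathscr{H}_\infty$ output‐feedback controllers. All variable names and the toy fixed target `q_star` are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical sketch (not the article's implementation): the evolution of
# Q-values under gradient Q-learning, linearized via the neural tangent
# kernel (NTK), is modeled as a discrete-time linear system, and an
# auxiliary "control reward" u_t is added to the TD target to shape
# convergence.

rng = np.random.default_rng(0)

n = 8          # number of (state, action) pairs in a toy finite problem
p = 64         # number of network parameters in the linearized model
alpha = 0.05   # learning rate

# NTK features: rows of Phi play the role of dQ(s,a)/dtheta at initialization.
Phi = rng.normal(size=(n, p)) / np.sqrt(p)
K = Phi @ Phi.T                      # empirical NTK Gram matrix

# Toy fixed target the TD updates point towards (a stand-in for the Bellman
# backup; in the article the target is time-varying and uncertain).
q_star = rng.normal(size=n)

def td_step(q, u):
    """One linearized Q-learning step with control input u added to the target.

    With Q = Phi @ theta, gradient descent on the squared TD error moves the
    Q-values as
        q_{t+1} = q_t + alpha * K @ (q_star + u - q_t),
    so the error e_t = q_t - q_star follows the LTI dynamics
        e_{t+1} = (I - alpha*K) e_t + alpha*K u_t.
    """
    return q + alpha * K @ (q_star + u - q)

def controller(q, gain=0.5):
    """Proportional controller used purely for illustration.

    It uses q_star directly, which a real agent does not know; the article's
    output-feedback H2 / H-infinity controllers avoid this assumption.
    """
    return -gain * (q - q_star)

q_plain = np.zeros(n)
q_ctrl = np.zeros(n)
for t in range(200):
    q_plain = td_step(q_plain, u=np.zeros(n))
    q_ctrl = td_step(q_ctrl, u=controller(q_ctrl))

print("error without control:", np.linalg.norm(q_plain - q_star))
print("error with control   :", np.linalg.norm(q_ctrl - q_star))
```

In this toy model the controlled error obeys $e_{t+1} = (I - \alpha(1+k)K)\,e_t$, which contracts faster than the uncontrolled $e_{t+1} = (I - \alpha K)\,e_t$ as long as the gain keeps the spectral radius below one, loosely mirroring the kind of stability (convergence) condition the article derives for the learning loop.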