Finite-time error bounds for Greedy-GQ

Bibliographic Details
Published in: Machine Learning, Vol. 113, No. 9, pp. 5981–6018
Main Authors: Wang, Yue; Zhou, Yi; Zou, Shaofeng
Format: Journal Article
Language: English
Published: New York: Springer US, 01.09.2024 (Springer Nature B.V.)
ISSN: 0885-6125, 1573-0565
DOI: 10.1007/s10994-024-06542-x

Summary: Greedy-GQ with linear function approximation, originally proposed in Maei et al. (in: Proceedings of the International Conference on Machine Learning (ICML), 2010), is a value-based off-policy algorithm for optimal control in reinforcement learning; it has a non-linear two-timescale structure with a non-convex objective function. This paper develops its tightest finite-time error bounds. We show that the Greedy-GQ algorithm converges as fast as $\mathcal{O}(1/\sqrt{T})$ under the i.i.d. setting and $\mathcal{O}(\log T/\sqrt{T})$ under the Markovian setting. We further design a variant of the vanilla Greedy-GQ algorithm using the nested-loop approach, and show that its sample complexity is $\mathcal{O}(\log(1/\epsilon)\,\epsilon^{-2})$, which matches that of the vanilla Greedy-GQ. Our finite-time error bounds match those of the stochastic gradient descent algorithm for general smooth non-convex optimization problems, despite the additional challenge posed by the two-timescale updates. Our finite-sample analysis provides theoretical guidance on choosing step sizes for faster convergence in practice, and suggests a trade-off between the convergence rate and the quality of the obtained policy. Our techniques provide a general approach for the finite-sample analysis of non-convex two-timescale value-based reinforcement learning algorithms.
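For readers unfamiliar with the algorithm the abstract analyzes: below is a minimal sketch of the two-timescale Greedy-GQ update with linear function approximation, in the style of Maei et al. (2010). It is illustrative only; the step sizes, random features, and the toy driver loop are assumptions for demonstration, not the paper's analysis setting.

```python
import numpy as np

def greedy_gq_step(theta, omega, phi_sa, phi_next_greedy, reward,
                   gamma=0.99, alpha=0.01, beta=0.1):
    """One Greedy-GQ update (two timescales, linear features).

    theta           -- main parameter vector (slow timescale, step size alpha)
    omega           -- auxiliary parameter vector (fast timescale, step size beta)
    phi_sa          -- features of the current state-action pair
    phi_next_greedy -- features of the greedy action at the next state,
                       i.e. argmax_a theta @ phi(s', a)
    """
    # TD error with respect to the greedy target policy.
    delta = reward + gamma * theta @ phi_next_greedy - theta @ phi_sa
    # Slow update of theta: stochastic gradient step on the projected
    # Bellman-error objective, corrected through omega.
    theta = theta + alpha * (delta * phi_sa
                             - gamma * (omega @ phi_sa) * phi_next_greedy)
    # Fast update of omega: tracks the least-squares solution
    # E[phi phi^T]^{-1} E[delta phi] for the current theta.
    omega = omega + beta * (delta - omega @ phi_sa) * phi_sa
    return theta, omega

# Toy driver with random features, purely to show the call pattern.
rng = np.random.default_rng(0)
d = 8
theta, omega = np.zeros(d), np.zeros(d)
for t in range(1000):
    phi_sa = rng.standard_normal(d)
    phi_next_greedy = rng.standard_normal(d)  # from a real env: the greedy action's features
    reward = rng.standard_normal()
    theta, omega = greedy_gq_step(theta, omega, phi_sa, phi_next_greedy, reward)
```

Here beta > alpha encodes the two-timescale structure: omega is driven on the faster timescale so that theta effectively sees a nearly converged correction term. The step-size trade-off the abstract refers to concerns how alpha and beta (and their ratio) are scheduled as functions of the horizon T.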