Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes
| Main Authors | Cassel, Asaf; Rosenberg, Aviv |
|---|---|
| Format | Journal Article (preprint) |
| Language | English |
| Published | 03.07.2024 |
| Subjects | Computer Science - Learning; Statistics - Machine Learning |
| Online Access | https://arxiv.org/abs/2407.03065 |
| DOI | 10.48550/arxiv.2407.03065 |
| Abstract | Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure exploration warm-up phase that is hard to implement in practice. This paper eliminates this undesired warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback. |
| Copyright | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
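For context, the linear MDP model and the regret criterion named in the abstract are typically defined as follows. This is the standard formulation from the linear function approximation literature, not a statement of this paper's algorithm; the notation (feature map φ, measures μ_h, loss vectors θ_h, dimension d, horizon H, number of episodes K) is standard usage and is not taken from this record.

```latex
% Standard linear MDP definition (episodic, horizon H, feature dimension d).
% An MDP (S, A, H, {P_h}, {ell_h}) is a linear MDP with known feature map
% phi : S x A -> R^d if, for every step h, there exist d signed measures
% mu_h over S and a vector theta_h in R^d such that
\[
  P_h(s' \mid s, a) = \langle \phi(s,a), \mu_h(s') \rangle,
  \qquad
  \ell_h(s, a) = \langle \phi(s,a), \theta_h \rangle .
\]
% Regret over K episodes compares the learner's policies pi_1, ..., pi_K
% with the best fixed policy in hindsight (for adversarial losses; for
% stochastic losses the comparator is the optimal policy):
\[
  \mathrm{Regret}(K)
  = \sum_{k=1}^{K} V_1^{\pi_k}\!\bigl(s_1; \ell^{k}\bigr)
  - \min_{\pi} \sum_{k=1}^{K} V_1^{\pi}\!\bigl(s_1; \ell^{k}\bigr),
\]
% where V_1^{pi}(s_1; ell^k) denotes the expected cumulative loss of
% policy pi in episode k, starting from the initial state s_1.
```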