Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes

Bibliographic Details
Main Authors: Cassel, Asaf; Rosenberg, Aviv
Format: Journal Article
Language: English
Published: 03.07.2024
Subjects: Computer Science - Learning; Statistics - Machine Learning
Online Access: https://arxiv.org/abs/2407.03065
DOI: 10.48550/arxiv.2407.03065
Copyright: http://arxiv.org/licenses/nonexclusive-distrib/1.0

Abstract: Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure exploration warm-up phase that is hard to implement in practice. This paper eliminates this undesired warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback.
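For readers unfamiliar with the setting, the following is a minimal, purely illustrative Python sketch of the generic softmax (mirror-descent) policy-optimization step that this line of work analyzes, with Q-values estimated linearly in the features as in the linear MDP model. All names (softmax_po_step, phi, theta_hat, eta) are hypothetical, and the sketch deliberately omits the exploration bonuses and the contraction mechanism that distinguish the paper's algorithm; it is not the authors' method.

    import numpy as np

    def softmax_po_step(policy_logits, phi, theta_hat, eta):
        """One generic mirror-descent (exponential-weights) PO step.

        policy_logits : (S, A) array, log of the current softmax policy
        phi           : (S, A, d) array of linear-MDP features
        theta_hat     : (d,) estimated parameters, so q_hat = phi @ theta_hat
        eta           : step size of the policy-optimization update
        """
        q_hat = phi @ theta_hat                      # estimated Q-values (losses), shape (S, A)
        logits = policy_logits - eta * q_hat         # shift probability mass away from high-loss actions
        logits -= logits.max(axis=1, keepdims=True)  # numerical stabilization before exponentiating
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)    # renormalize to a valid policy per state
        return np.log(probs), probs

    # Tiny usage example with random features (purely illustrative).
    rng = np.random.default_rng(0)
    S, A, d = 4, 3, 5
    phi = rng.normal(size=(S, A, d))
    theta_hat = rng.normal(size=d)
    logits = np.zeros((S, A))                        # start from the uniform policy
    logits, policy = softmax_po_step(logits, phi, theta_hat, eta=0.1)
    print(policy.sum(axis=1))                        # each row sums to 1

The update multiplies the current policy by exp(-eta * q_hat) and renormalizes, which is the standard exponential-weights form of policy optimization; the paper's contribution concerns how the Q-estimates are built and truncated so that this scheme achieves rate-optimal regret without a warm-up phase.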