Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes
| Main Authors | Cassel, Asaf; Rosenberg, Aviv |
|---|---|
| Format | Journal Article (preprint) |
| Language | English |
| Published | 03.07.2024 |
| Subjects | Computer Science - Learning; Statistics - Machine Learning |
| Online Access | https://arxiv.org/abs/2407.03065 |
| DOI | 10.48550/arxiv.2407.03065 |
| Abstract | Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure exploration warm-up phase that is hard to implement in practice. This paper eliminates this undesired warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback. |
| Copyright | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
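For context, the linear MDP model and the regret criterion named in the abstract are typically defined as follows. This is the standard formulation from the linear function approximation literature, not a statement of this paper's algorithm; the notation (feature map φ, measures μ_h, loss vectors θ_h, dimension d, horizon H, number of episodes K) is standard usage and is not taken from this record.

```latex
% Standard linear MDP definition (episodic, horizon H, feature dimension d).
% An MDP (S, A, H, {P_h}, {ell_h}) is a linear MDP with known feature map
% phi : S x A -> R^d if, for every step h, there exist d signed measures
% mu_h over S and a vector theta_h in R^d such that
\[
  P_h(s' \mid s, a) = \langle \phi(s,a), \mu_h(s') \rangle,
  \qquad
  \ell_h(s, a) = \langle \phi(s,a), \theta_h \rangle .
\]
% Regret over K episodes compares the learner's policies pi_1, ..., pi_K
% with the best fixed policy in hindsight (for adversarial losses; for
% stochastic losses the comparator is the optimal policy):
\[
  \mathrm{Regret}(K)
  = \sum_{k=1}^{K} V_1^{\pi_k}\!\bigl(s_1; \ell^{k}\bigr)
  - \min_{\pi} \sum_{k=1}^{K} V_1^{\pi}\!\bigl(s_1; \ell^{k}\bigr),
\]
% where V_1^{pi}(s_1; ell^k) denotes the expected cumulative loss of
% policy pi in episode k, starting from the initial state s_1.
```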