Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

Bibliographic Details
Main Authors: Mai, Xinji; Xu, Haotian; Li, Zhong-Zhi; W, Xing; Wang, Weinong; Hu, Jian; Zhang, Yingying; Zhang, Wenqiang
Format: Journal Article (preprint)
Language: English
Published: 20.08.2025
Subjects: Computer Science - Artificial Intelligence
Online Access: https://arxiv.org/abs/2505.07773
DOI: 10.48550/arxiv.2505.07773
Copyright: http://arxiv.org/licenses/nonexclusive-distrib/1.0

Abstract: Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is demonstrating that as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations where increased training steps lead to increases in the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at https://github.com/yyht/openrlhf_async_pipline.
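
The mechanics the abstract describes (the model spontaneously emits a Python block, a decoupled environment executes it out of process, and training is driven purely by an outcome-based reward on the final answer) can be sketched as follows. This is a minimal illustration under assumed conventions, not the released ZeroTIR code: the function names (run_generated_code, outcome_reward), the ```python fence for tool calls, and the \boxed{} answer format are all assumptions for illustration.

import re
import subprocess

# Assumed conventions: the model marks code with a ```python fence and its
# final answer with \boxed{...}; neither is confirmed by the record itself.
CODE_BLOCK_RE = re.compile(r"```python\n(.*?)```", re.DOTALL)
ANSWER_RE = re.compile(r"\\boxed\{([^}]*)\}")


def run_generated_code(response: str, timeout_s: float = 5.0) -> str:
    """Run the first Python block in a model response in a separate process
    and return its stdout. A subprocess with a timeout only approximates the
    paper's decoupled execution environment; it is not a real sandbox."""
    match = CODE_BLOCK_RE.search(response)
    if match is None:
        return ""
    try:
        proc = subprocess.run(
            ["python", "-c", match.group(1)],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.stdout.strip()
    except subprocess.TimeoutExpired:
        return ""


def outcome_reward(response: str, ground_truth: str) -> float:
    """Binary outcome-based reward: 1.0 only if the final boxed answer matches
    the reference; intermediate reasoning and tool calls earn no credit."""
    match = ANSWER_RE.search(response)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == ground_truth.strip() else 0.0


# Toy usage: a response that computes with code and then reports the answer.
demo = (
    "Let me compute the sum.\n"
    "```python\nprint(sum(range(1, 101)))\n```\n"
    "The answer is \\boxed{5050}."
)
print(run_generated_code(demo))      # -> 5050
print(outcome_reward(demo, "5050"))  # -> 1.0

During RL training, rewards like this would be attached to whole rollouts, so the policy is free to discover when emitting and executing code pays off; the paper reports that spontaneous code execution frequency, response length, and final accuracy all rise together as training steps accumulate.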