Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

Bibliographic Details
Main Authors: Mai, Xinji; Xu, Haotian; Li, Zhong-Zhi; W, Xing; Wang, Weinong; Hu, Jian; Zhang, Yingying; Zhang, Wenqiang
Format: Journal Article (preprint)
Language: English
Published: 20.08.2025
Subjects: Computer Science - Artificial Intelligence
Online Access: https://arxiv.org/abs/2505.07773
DOI: 10.48550/arxiv.2505.07773
Copyright: http://arxiv.org/licenses/nonexclusive-distrib/1.0

Abstract: Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is demonstrating that as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations where increased training steps lead to increases in the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at https://github.com/yyht/openrlhf_async_pipline.
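
The mechanics the abstract describes (the model spontaneously emits a Python block, a decoupled environment executes it out of process, and training is driven purely by an outcome-based reward on the final answer) can be sketched as follows. This is a minimal illustration under assumed conventions, not the released ZeroTIR code: the function names (run_generated_code, outcome_reward), the ```python fence for tool calls, and the \boxed{} answer format are all assumptions for illustration.

import re
import subprocess

# Assumed conventions: the model marks code with a ```python fence and its
# final answer with \boxed{...}; neither is confirmed by the record itself.
CODE_BLOCK_RE = re.compile(r"```python\n(.*?)```", re.DOTALL)
ANSWER_RE = re.compile(r"\\boxed\{([^}]*)\}")


def run_generated_code(response: str, timeout_s: float = 5.0) -> str:
    """Run the first Python block in a model response in a separate process
    and return its stdout. A subprocess with a timeout only approximates the
    paper's decoupled execution environment; it is not a real sandbox."""
    match = CODE_BLOCK_RE.search(response)
    if match is None:
        return ""
    try:
        proc = subprocess.run(
            ["python", "-c", match.group(1)],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.stdout.strip()
    except subprocess.TimeoutExpired:
        return ""


def outcome_reward(response: str, ground_truth: str) -> float:
    """Binary outcome-based reward: 1.0 only if the final boxed answer matches
    the reference; intermediate reasoning and tool calls earn no credit."""
    match = ANSWER_RE.search(response)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == ground_truth.strip() else 0.0


# Toy usage: a response that computes with code and then reports the answer.
demo = (
    "Let me compute the sum.\n"
    "```python\nprint(sum(range(1, 101)))\n```\n"
    "The answer is \\boxed{5050}."
)
print(run_generated_code(demo))      # -> 5050
print(outcome_reward(demo, "5050"))  # -> 1.0

During RL training, rewards like this would be attached to whole rollouts, so the policy is free to discover when emitting and executing code pays off; the paper reports that spontaneous code execution frequency, response length, and final accuracy all rise together as training steps accumulate.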