ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning
| Main Authors | , , , , , , |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | 02.10.2024 |
| Subjects | |
| DOI | 10.48550/arxiv.2410.02052 |
| Summary: | Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon tasks. To address these limitations, we present ExACT, an approach that combines test-time search and self-learning to build o1-like models for agentic applications. We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance AI agents' ability to explore the decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate for reliable state evaluation. Next, we introduce Exploratory Learning, a novel learning strategy that teaches agents to search at inference time without relying on any external search algorithms. On the challenging VisualWebArena benchmark, our GPT-4o based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge and experience gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. After Exploratory Learning, GPT-4o 1) demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success, and 2) matches 87% of R-MCTS's performance while using significantly less compute. Notably, our work demonstrates compute scaling properties at both training time (data collection with R-MCTS) and test time. These results suggest a promising research direction to enhance VLMs' capabilities for agentic applications via test-time search and self-learning. |
|---|---|
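
The summary describes R-MCTS as an extension of traditional MCTS. As a point of reference only, the following is a minimal sketch of plain MCTS with UCT selection on a toy integer problem; it is not the authors' implementation. It assumes a hypothetical `evaluate` function as a stand-in for state evaluation (the paper uses multi-agent debate with a VLM for this), and it does not implement contrastive reflection. The names `Node`, `search`, `ACTIONS`, `TARGET`, and `MAX_DEPTH` are all invented for illustration.

```python
import math
from dataclasses import dataclass, field

ACTIONS = (1, 2, 3)  # toy actions: add this amount to the state (assumption)
TARGET = 7           # toy goal state (assumption)
MAX_DEPTH = 4        # toy horizon (assumption)


def evaluate(state: int) -> float:
    """Hypothetical value estimate in [0, 1]; a stub, not the paper's evaluator."""
    return max(0.0, 1.0 - abs(TARGET - state) / TARGET)


@dataclass
class Node:
    state: int
    depth: int = 0
    parent: "Node | None" = None
    children: dict = field(default_factory=dict)  # action -> Node
    visits: int = 0
    total_value: float = 0.0

    def mean_value(self) -> float:
        return self.total_value / self.visits if self.visits else 0.0


def is_terminal(node: Node) -> bool:
    return node.depth >= MAX_DEPTH or node.state >= TARGET


def uct_score(parent: Node, child: Node, c: float = 1.0) -> float:
    # Mean value plus an exploration bonus that shrinks as the child is visited.
    return child.mean_value() + c * math.sqrt(math.log(parent.visits) / child.visits)


def search(root_state: int, iterations: int = 200):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded, non-terminal nodes.
        while not is_terminal(node) and len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=lambda ch: uct_score(node, ch))
        # Expansion: add one untried action, if any remain.
        if not is_terminal(node):
            action = next(a for a in ACTIONS if a not in node.children)
            child = Node(node.state + action, node.depth + 1, parent=node)
            node.children[action] = child
            node = child
        # Evaluation: score the reached state with the stub value function.
        value = evaluate(node.state)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_value += value
            node = node.parent
    # Return the most-visited root action together with the search tree.
    best_action = max(root.children, key=lambda a: root.children[a].visits)
    return best_action, root


best_action, root = search(0)
print("most-visited first action:", best_action)
```

In this sketch the value function replaces random rollouts, so the search is deterministic. The two extensions named in the summary would plug in at the evaluation step (a debate-based evaluator in place of `evaluate`) and across search runs (reflections carried over from earlier interactions); neither is modeled here.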