Abstract
Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents' excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UIs).
The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experiences, whereas operations necessitate low-level, precise instructions closely tied to specific app UIs. Motivated by this insight, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that integrates dual-level retrieval augmentation.
At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.
Methodology
Hierarchical Multi-Agent Architecture
Mobile-Agent-RAG utilizes a hierarchical structure separating high-level planning from low-level execution. The framework consists of two core agents:
- Manager Agent (empowered by Manager-RAG): Responsible for high-level strategic planning and subtask decomposition. It retrieves human-validated task plans (DMR) to guide long-term strategies, effectively reducing strategic hallucinations.
- Operator Agent (empowered by Operator-RAG): Translates subtasks into concrete atomic actions. It interacts with the Operator-RAG knowledge base to retrieve app-specific, UI-grounded examples (DOR) for precise action execution.
The system is further supported by auxiliary modules: a Perceptor for fine-grained visual perception, an Action Reflector for outcome evaluation, and a Notetaker for information aggregation.
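To make the planning/execution split concrete, the following is a minimal illustrative sketch. All class names, the toy knowledge-base entries, and the word-overlap retriever are our own assumptions for exposition, not the paper's actual implementation (a real pipeline would use MLLM-based planning and embedding retrieval).

```python
# K_MR: (task instruction, human-validated plan) pairs used by Manager-RAG.
K_MR = [
    ("share a photo via email",
     ["open Gallery", "select the photo", "tap share", "choose Mail", "send"]),
]

# K_OR: app-specific (subtask, screenshot description, atomic action) triples
# used by Operator-RAG.
K_OR = {
    "Gallery": [("select the photo", "grid of thumbnails", "tap(thumbnail_0)")],
}

def retrieve(query, corpus, key=lambda item: item[0]):
    # Toy retriever: rank entries by word overlap with the query,
    # a stand-in for the embedding similarity a real RAG system would use.
    q = set(query.lower().split())
    return max(corpus, key=lambda item: len(q & set(key(item).lower().split())))

class Manager:
    def plan(self, instruction):
        # Manager-RAG: ground subtask decomposition in a retrieved validated plan.
        _, plan = retrieve(instruction, K_MR)
        return plan

class Operator:
    def act(self, subtask, app):
        # Operator-RAG: retrieve a UI-grounded example scoped to the current app.
        _, _, action = retrieve(subtask, K_OR[app])
        return action

subtasks = Manager().plan("share a photo via email")
first_action = Operator().act(subtasks[1], "Gallery")  # subtasks[1] = "select the photo"
```

The key design point mirrored here is the division of labor: the Manager consults only high-level plans, while the Operator consults only app-scoped, UI-grounded action examples.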
Retrieval-Oriented Knowledge Base Collection
To support the dual-level RAG mechanism, we construct two specialized knowledge bases:
- Manager-RAG Knowledge Base (KMR): Contains pairs of task instructions and human-annotated operation steps derived from the Mobile-Eval-RAG dataset. This supports high-level planning.
- Operator-RAG Knowledge Base (KORapp): Stores app-specific triples of (subtask, screenshot, action). These are collected via semi-automated logging and human verification to ensure precise atomic action generation.
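A minimal sketch of how an app-partitioned triple store like KORapp might be indexed and queried. The schema, the bag-of-words vectors, and the cosine ranking are our assumptions for illustration; the paper does not prescribe this particular similarity function.

```python
import math
from collections import Counter

def bow_vector(text):
    # Bag-of-words term counts (a stand-in for learned embeddings).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class OperatorKB:
    """App-partitioned store of (subtask, screenshot_caption, action) triples."""
    def __init__(self):
        self.by_app = {}

    def add(self, app, subtask, screenshot_caption, action):
        self.by_app.setdefault(app, []).append((subtask, screenshot_caption, action))

    def query(self, app, subtask, top_k=1):
        # Restrict candidates to the current app, then rank triples by
        # similarity between the query subtask and the stored subtask text.
        q = bow_vector(subtask)
        cands = self.by_app.get(app, [])
        ranked = sorted(cands, key=lambda t: cosine(q, bow_vector(t[0])), reverse=True)
        return ranked[:top_k]

kb = OperatorKB()
kb.add("Clock", "set an alarm for 7 am", "alarm list screen", "tap(add_alarm)")
kb.add("Clock", "start a timer", "timer screen", "tap(start)")
hits = kb.query("Clock", "set alarm 7am", top_k=1)
```

Partitioning by app before ranking reflects the requirement that Operator-RAG guidance be aligned with the current app and subtask rather than retrieved from a single global pool.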
Experimental Results
Comparison with SoTA Baselines
We evaluated Mobile-Agent-RAG on the proposed Mobile-Eval-RAG benchmark, which features 50 challenging multi-app, long-horizon tasks.
On this benchmark, Mobile-Agent-RAG significantly outperforms baselines such as Mobile-Agent-E and AppAgent. Specifically, it improves the Task Completion Rate (CR) by 11.0% and Step Efficiency by 10.2% on complex multi-app tasks.
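For reference, the completion-rate metric is a simple fraction; the step-efficiency definition below is only one plausible formulation (our assumption, not necessarily the benchmark's exact definition):

```python
def completion_rate(results):
    # CR: fraction of tasks fully completed (results are 0/1 per task).
    return sum(results) / len(results)

def step_efficiency(reference_steps, agent_steps):
    # One plausible definition (an assumption for illustration): the ratio of
    # human reference steps to steps the agent actually took, capped at 1.0
    # per task, averaged over all tasks.
    ratios = [min(r / a, 1.0) for r, a in zip(reference_steps, agent_steps)]
    return sum(ratios) / len(ratios)

cr = completion_rate([1, 1, 0, 1])          # 3 of 4 tasks completed -> 0.75
se = step_efficiency([5, 4], [10, 4])       # (0.5 + 1.0) / 2 -> 0.75
```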
Generalization Across MLLMs
Our framework demonstrates robust generalization across different backbone models (Gemini-1.5-Pro, GPT-4o, Claude-3.5-Sonnet). Notably, RAG provides greater compensation for weaker models while continuing to boost stronger ones.
Ablation Analysis
Ablation studies confirm the necessity of both modules. Removing Operator-RAG significantly lowers execution accuracy (OA), while removing Manager-RAG limits the maximum achievable success rate in long-horizon planning.
BibTeX
@article{zhou2025mobileagentrag,
title={Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation},
author={Zhou, Yuxiang and Li, Jichang and Zhang, Yanhao and Lu, Haonan and Li, Guanbin},
journal={arXiv preprint arXiv:2511.12254},
year={2025}
}