Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation

1 School of Computer Science and Engineering, Sun Yat-sen University; 2 Pengcheng Laboratory; 3 OPPO AI Center, OPPO Inc., China; 4 Guangdong Key Laboratory of Big Data Analysis and Processing
*Indicates Equal Contribution    Corresponding Author
[Figure: Mobile-Agent-RAG concept overview]

Mobile-Agent-RAG is a novel hierarchical multi-agent framework integrating dual-level retrieval augmentation. It employs Manager-RAG to reduce strategic hallucinations in high-level planning and Operator-RAG to improve execution accuracy in low-level UI operations.

Abstract

Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents' excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UI).

The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experience, whereas operations require low-level, precise instructions closely tied to specific app UIs. Motivated by this insight, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that integrates dual-level retrieval augmentation.

At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.

Methodology

[Figure: Mobile-Agent-RAG framework architecture]

Hierarchical Multi-Agent Architecture

Mobile-Agent-RAG utilizes a hierarchical structure separating high-level planning from low-level execution. The framework consists of two core agents:

  • Manager Agent (empowered by Manager-RAG): Responsible for high-level strategic planning and subtask decomposition. It retrieves human-validated task plans (D_MR) to guide long-term strategy, reducing strategic hallucinations.
  • Operator Agent (empowered by Operator-RAG): Translates subtasks into concrete atomic actions. It queries the Operator-RAG knowledge base to retrieve app-specific, UI-grounded examples (D_OR) for precise action execution.

The system is further supported by auxiliary modules: a Perceptor for fine-grained visual perception, an Action Reflector for outcome evaluation, and a Notetaker for information aggregation.
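The interplay of these components can be sketched as a simple control loop. The following is a minimal illustration only, not the authors' implementation: every function name and signature (`run_task`, `retrieve_plans`, `choose_action`, and so on) is hypothetical, the agents and retrievers are passed in as plain callables so that no model API is assumed, and the sketch simplifies by treating a subtask as finished after a single action that the reflector accepts.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    """Record of one run: executed actions, aggregated notes, final status."""
    actions: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)
    completed: bool = False


def run_task(
    task: str,
    retrieve_plans: Callable[[str], list[str]],              # Manager-RAG lookup (hypothetical)
    plan: Callable[[str, list[str]], list[str]],             # Manager Agent
    retrieve_examples: Callable[[str, str], list[str]],      # Operator-RAG lookup (hypothetical)
    choose_action: Callable[[str, str, list[str]], str],     # Operator Agent
    perceive: Callable[[], str],                             # Perceptor
    execute: Callable[[str], None],                          # device interaction
    reflect: Callable[[str, str, str], bool],                # Action Reflector
    max_steps: int = 20,
) -> Trace:
    trace = Trace()
    # Manager: retrieve reference plans, then decompose the task into subtasks.
    subtasks = plan(task, retrieve_plans(task))
    steps = 0
    for subtask in subtasks:
        done = False
        while not done and steps < max_steps:
            screen = perceive()
            examples = retrieve_examples(subtask, screen)
            action = choose_action(subtask, screen, examples)
            execute(action)
            steps += 1
            trace.actions.append(action)
            # Reflector compares the screen before and after the action.
            done = reflect(screen, perceive(), action)
            # Notetaker: aggregate a short note for each step.
            trace.notes.append(f"{subtask}: {action} -> {'ok' if done else 'retry'}")
        if not done:
            return trace  # step budget exhausted before the subtask succeeded
    trace.completed = True
    return trace


# Toy environment: the "screen" is just the name of the foreground app.
screen_state = {"current": "home"}


def toy_perceive() -> str:
    return screen_state["current"]


def toy_execute(action: str) -> None:
    if action.startswith("open:"):
        screen_state["current"] = action.split(":", 1)[1]


trace = run_task(
    task="open notes then maps",
    retrieve_plans=lambda task: ["open notes", "open maps"],
    plan=lambda task, refs: refs or [task],
    retrieve_examples=lambda subtask, screen: ["open:" + subtask.split()[-1]],
    choose_action=lambda subtask, screen, examples: examples[0],
    perceive=toy_perceive,
    execute=toy_execute,
    reflect=lambda before, after, action: before != after,
)
```

Passing the agents in as callables keeps the loop independent of any particular MLLM backbone, which mirrors the separation between planning and execution described above.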

Retrieval-Oriented Knowledge Base Collection

[Figure: Knowledge base collection process]

To support the dual-level RAG mechanism, we construct two specialized knowledge bases:

  • Manager-RAG Knowledge Base (K_MR): Contains pairs of task instructions and human-annotated operation steps derived from the Mobile-Eval-RAG dataset. This supports high-level planning.
  • Operator-RAG Knowledge Base (K_OR^app): Stores app-specific triples of (subtask, screenshot, action). These are collected via semi-automated logging and human verification to ensure precise atomic action generation.
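The two record shapes described above can be illustrated with a small sketch. This is an assumption-laden example rather than the paper's implementation: the class and function names are hypothetical, screenshots are represented by a file path, and token-overlap (Jaccard) similarity is used purely as a stand-in for whatever retriever the actual system employs.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagerEntry:
    """One Manager-RAG record: a task instruction with its human-annotated steps."""
    instruction: str
    steps: tuple[str, ...]


@dataclass(frozen=True)
class OperatorEntry:
    """One Operator-RAG record: a (subtask, screenshot, action) triple."""
    subtask: str
    screenshot_path: str  # hypothetical: a path standing in for the stored screenshot
    action: str


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for the real retriever."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_plans(kb: list[ManagerEntry], task: str, k: int = 1) -> list[ManagerEntry]:
    """Return the k stored plans whose instruction best matches the task."""
    return sorted(kb, key=lambda e: jaccard(task, e.instruction), reverse=True)[:k]


def retrieve_action_examples(
    kb_by_app: dict[str, list[OperatorEntry]], app: str, subtask: str, k: int = 1
) -> list[OperatorEntry]:
    """Search only the current app's entries, mirroring the per-app knowledge base."""
    entries = kb_by_app.get(app, [])
    return sorted(entries, key=lambda e: jaccard(subtask, e.subtask), reverse=True)[:k]


# Made-up sample records for illustration; not taken from Mobile-Eval-RAG.
manager_kb = [
    ManagerEntry(
        "Find a pasta recipe and save it to Notes",
        ("Open browser", "Search pasta recipe", "Open Notes", "Write recipe"),
    ),
    ManagerEntry(
        "Check tomorrow's weather and text it to a contact",
        ("Open Weather", "Read forecast", "Open Messages", "Send text"),
    ),
]
operator_kb = {
    "Notes": [
        OperatorEntry("Create a new note", "notes_home.png", "tap(new_note_button)"),
        OperatorEntry("Type the note body", "notes_editor.png", "type(text)"),
    ],
}

best_plan = retrieve_plans(manager_kb, "Search for a pasta recipe and write it in Notes")[0]
best_example = retrieve_action_examples(operator_kb, "Notes", "create new note")[0]
```

Keying the Operator-RAG store by app restricts each lookup to examples grounded in the UI currently on screen, which is the property the per-app knowledge base is meant to provide.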

Experimental Results

Comparison with SoTA Baselines

We evaluated Mobile-Agent-RAG on the proposed Mobile-Eval-RAG benchmark, which features 50 challenging multi-app, long-horizon tasks.

[Table: Main results comparison]

As shown above, Mobile-Agent-RAG significantly outperforms baselines such as Mobile-Agent-E and AppAgent. Specifically, it improves the Task Completion Rate (CR) by 11.0% and Step Efficiency by 10.2% on complex multi-app tasks.

Generalization Across MLLMs

[Figure: Performance across MLLMs]

Our framework generalizes robustly across different backbone models (Gemini-1.5-Pro, GPT-4o, Claude-3.5-Sonnet). Notably, retrieval augmentation compensates more for weaker models while still improving stronger ones.

Ablation Analysis

[Figure: Ablation study]

Ablation studies confirm the necessity of both modules. Removing Operator-RAG significantly lowers execution accuracy (OA), while removing Manager-RAG limits the maximum achievable success rate in long-horizon planning.

BibTeX

@article{zhou2025mobileagentrag,
  title={Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation},
  author={Zhou, Yuxiang and Li, Jichang and Zhang, Yanhao and Lu, Haonan and Li, Guanbin},
  journal={arXiv preprint arXiv:2511.12254},
  year={2025}
}