Title: An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

Aug 24, 2025

Zihan Liang, Jiahao Sun, Haoran Ma

Abstract: Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent leverages the powerful planning capabilities of Large Language Models (LLMs) and the advanced visual understanding and evaluation prowess of Vision-Language Large Models (LVLMs) within a closed-loop system. Our framework comprises an LVLM-driven instruction parser and scene understanding module, a multi-level LLM-driven editing planner for goal decomposition, tool selection, and sequence generation, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop. To rigorously evaluate RefineEdit-Agent, we propose LongBench-T2I-Edit, a new benchmark featuring 500 initial images with complex, multi-turn editing instructions across nine visual dimensions. Extensive experiments demonstrate that RefineEdit-Agent significantly outperforms state-of-the-art baselines, achieving an average score of 3.67 on LongBench-T2I-Edit, compared to 2.29 for Direct Re-Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN-based Edit, and 3.39 for ControlNet-XL. Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the efficacy of our agentic design in delivering superior edit fidelity and context preservation.
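
The closed loop described in the abstract (parse, plan, edit, evaluate, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`EditStep`, `Feedback`, `run_edit_loop`, the `toy_*` stand-ins, the 0-to-5 score scale, the `threshold` and `max_rounds` parameters) is a hypothetical choice made for illustration, and the "image" is just a list of applied edit strings so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EditStep:
    tool: str         # hypothetical tool name chosen by the planner
    instruction: str  # the sub-goal this step should achieve


@dataclass
class Feedback:
    score: float   # assumed 0-5 quality score from the evaluator
    critique: str  # what is still missing; empty when nothing is


def run_edit_loop(
    image: List[str],
    instruction: str,
    parse: Callable[[List[str], str], List[str]],
    plan: Callable[[List[str], str], List[EditStep]],
    apply_step: Callable[[List[str], EditStep], List[str]],
    evaluate: Callable[[List[str], List[str]], Feedback],
    threshold: float = 4.0,  # assumed stopping score
    max_rounds: int = 3,     # assumed iteration budget
) -> Tuple[List[str], List[Feedback]]:
    """Parse once, then plan -> edit -> evaluate until the score is good enough."""
    goals = parse(image, instruction)
    history: List[Feedback] = []
    critique = ""
    for _ in range(max_rounds):
        for step in plan(goals, critique):
            image = apply_step(image, step)
        feedback = evaluate(image, goals)
        history.append(feedback)
        if feedback.score >= threshold:
            break
        critique = feedback.critique  # feed the critique back into planning
    return image, history


# Toy stand-ins for the LVLM parser, LLM planner, editing tool, and LVLM evaluator.
def toy_parse(image: List[str], instruction: str) -> List[str]:
    return [part.strip() for part in instruction.split(",") if part.strip()]


def toy_plan(goals: List[str], critique: str) -> List[EditStep]:
    # Deliberately handles one goal per round so the loop has to iterate.
    target = critique if critique else goals[0]
    return [EditStep(tool="toy_editor", instruction=target)]


def toy_apply(image: List[str], step: EditStep) -> List[str]:
    return image + [step.instruction]


def toy_evaluate(image: List[str], goals: List[str]) -> Feedback:
    missing = [goal for goal in goals if goal not in image]
    score = 5.0 * (len(goals) - len(missing)) / len(goals)
    return Feedback(score=score, critique=missing[0] if missing else "")


final_image, history = run_edit_loop(
    [], "make the sky orange, add a red kite",
    toy_parse, toy_plan, toy_apply, toy_evaluate,
)
```

In this toy run the first round satisfies only one of the two goals, so the evaluator's critique is passed back to the planner and a second round completes the edit. In a real system the four callables would wrap LLM and LVLM calls and an image-editing tool, whose interfaces the abstract does not specify.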
