
This repository accompanies our survey paper:
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Welcome to Awesome-Think-With-Images! The field of multimodal AI is undergoing a fundamental evolution, moving beyond static visual perception towards a new paradigm where vision becomes a dynamic, manipulable cognitive workspace. This repository is the first comprehensive resource that systematically curates the pivotal research enabling this shift.
We structure this collection along a trajectory of increasing cognitive autonomy, as detailed in our survey. This journey unfolds across three key stages:
- Stage 1: Tool-Driven Visual Exploration, where models act as "Commanders" orchestrating external visual tools.
- Stage 2: Programmatic Visual Manipulation, where models act as "Visual Programmers" creating bespoke analyses.
- Stage 3: Intrinsic Visual Imagination, where models act as "Visual Thinkers" generating internal mental imagery.

The paradigm shift from "Thinking about Images" to "Thinking with Images", an evolution that transforms vision from a static input into a dynamic and manipulable cognitive workspace.
As detailed in our survey, this paradigm shift unlocks three key capabilities: Dynamic Perceptual Exploration, Structured Visual Reasoning, and Goal-Oriented Generative Planning. This collection is for researchers, developers, and enthusiasts eager to explore the forefront of AI that can truly see, reason, and imagine.
These three stages form the taxonomy of our work:

The taxonomy of "Thinking with Images", organizing the field into core methodologies (across three stages), evaluation benchmarks, and key applications.
- [2025-07] We have released "Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers", the first comprehensive survey dedicated to the emerging paradigm of "Think with Images".
- [2025-06] We created this repository to maintain a paper list on Awesome-Think-With-Images. Contributions are welcome!
- [2025-05] We are excited to release OpenThinkIMG, the first dedicated end-to-end open-source framework designed to empower LVLMs to truly think with images! For ease of use, we've configured a Docker environment. We warmly invite the community to explore, use, and contribute.
- Stage 1: Tool-Driven Visual Exploration
- Stage 2: Programmatic Visual Manipulation
- Stage 3: Intrinsic Visual Imagination
- Evaluation & Benchmarks
- Contributing & Citation
This section provides a conceptual map to navigate the paper list. The following papers are organized according to the primary mechanism they employ, aligning with the three-stage framework from our survey.
In this stage, the model acts as a planner, orchestrating a predefined suite of external visual tools. Intelligence is demonstrated by selecting the right tool for the right sub-task.
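
To ground this stage in a concrete picture, the following is a minimal, hypothetical sketch of such a tool-orchestration loop in Python: the model only plans the next action, a small registry of external tools performs the visual work, and each observation is fed back into the next planning step. The `vlm` callable, the tool registry, and the `TOOL`/`FINAL` reply protocol are illustrative assumptions, not the interface of any paper listed below.

```python
from typing import Callable
from PIL import Image

def run_tool_loop(question: str, image: Image.Image,
                  vlm: Callable[[str, Image.Image], str],  # any chat-style LVLM call
                  max_steps: int = 5) -> str:
    """Toy Stage-1 loop: the model selects tools; external tools do the perception."""
    tools = {
        "zoom": lambda spec: image.crop(tuple(int(v) for v in spec.split(","))),  # spec = "x0,y0,x1,y1"
        "caption": lambda _: vlm("Describe this image in detail.", image),
    }
    history: list[str] = []
    for _ in range(max_steps):
        prompt = (f"Question: {question}\n"
                  f"Observations so far: {history}\n"
                  "Reply with either 'TOOL <name> <args>' or 'FINAL <answer>'.")
        plan = vlm(prompt, image).strip()
        if plan.startswith("FINAL"):
            return plan[len("FINAL"):].strip()
        parts = plan.split(" ", 2)                       # e.g. "TOOL zoom 10,10,200,200"
        name, args = parts[1], parts[2] if len(parts) > 2 else ""
        result = tools[name](args)                       # the external tool performs the visual step
        if isinstance(result, Image.Image):              # a zoomed crop becomes the new working image
            image, result = result, "<zoomed crop attached>"
        history.append(f"{name}({args}) -> {result}")
    return vlm(f"Answer the question directly: {question}", image)
```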
Leveraging in-context learning to guide tool use without parameter updates.
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
- PromptCap: Prompt-Guided Task-Aware Image Captioning
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
- What does CLIP know about a red circle? Visual prompt engineering for VLMs
- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
- DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models
- Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
- DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
- VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
- Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
- Visual Abstract Thinking Empowers Multimodal Reasoning
- MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Fine-tuning models on data demonstrating how to invoke tools and integrate their outputs.
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
- CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
- Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models
- Instruction-Guided Visual Masking
- From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
- TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
- CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
- Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
- VGR: Visual Grounded Reasoning
- Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification
- WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Using rewards to train agents to discover optimal tool-use strategies.
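
The papers in this subsection each define their own reward schemes; purely as an illustration of the underlying idea, a trajectory-level reward for tool use might combine final-answer correctness with light shaping terms, and a policy-gradient algorithm such as PPO or GRPO would then maximize it over sampled rollouts. The function below is a toy sketch, not the reward of any specific work:

```python
def trajectory_reward(final_answer: str, gold_answer: str,
                      tool_calls: list[str], max_tool_calls: int = 4) -> float:
    """Toy reward over a tool-use rollout (illustrative; the cited papers differ)."""
    correct = float(final_answer.strip().lower() == gold_answer.strip().lower())
    used_tools = float(len(tool_calls) > 0)              # encourage grounding in the image
    over_budget = max(0, len(tool_calls) - max_tool_calls)
    return correct + 0.1 * used_tools - 0.05 * over_budget
```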
- Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
- GRIT: Teaching MLLMs to Think with Images
- Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
- OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
- VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
- Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
- Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
- One RL to See Them All: Visual Triple Unified Reinforcement Learning
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
- VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
- Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
- WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Here, models evolve into "visual programmers," generating executable code (e.g., Python) to create custom visual analyses. This unlocks compositional flexibility and interpretability.
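
As a generic sketch of this pattern (in the spirit of neuro-symbolic systems such as Visual Programming and ViperGPT, but not the API of any specific paper), the model emits a short Python program over a handful of vision primitives, and the framework executes it so every intermediate result stays inspectable. Here `detect`, `color_of`, and the `solve` entry point are hypothetical placeholders:

```python
from PIL import Image

# A program the model might generate for "How many red cars are in the image?"
GENERATED_CODE = '''
def solve(image):
    boxes = detect(image, "car")                               # primitive: open-vocabulary detector
    return sum(color_of(image.crop(b)) == "red" for b in boxes)
'''

def execute_program(code: str, image: Image.Image, primitives: dict):
    """Run model-generated code in a namespace that exposes only the allowed
    vision primitives. A production system would sandbox this execution."""
    namespace = dict(primitives)
    exec(code, namespace)                 # defines solve() inside the restricted namespace
    return namespace["solve"](image)

# Usage, with real detector / color-classifier implementations plugged in:
# count = execute_program(GENERATED_CODE, img, {"detect": my_detector, "color_of": my_color_classifier})
```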
Guiding models to generate code as a transparent, intermediate reasoning step.
- Visual programming: Compositional visual reasoning without training
- ViperGPT: Visual Inference via Python Execution for Reasoning
- Visual sketchpad: Sketching as a visual chain of thought for multimodal language models
- VipAct: Visual-perception enhancement via specialized vlm agent collaboration and tool-use
- SketchAgent: Language-Driven Sequential Sketch Generation
- CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers?
- MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
- ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
- Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving
Distilling programmatic logic into models or using code to bootstrap high-quality training data.
- Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
- Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
- Advancing vision-language models in front-end development via data synthesis
- COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
- MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
- PyVision: Agentic Vision with Dynamic Tooling
- Thyme: Think Beyond Images
Optimizing code generation policies using feedback from execution results.
- Visual Agentic Reinforcement Fine-Tuning
- ProgRM: Build Better GUI Agents with Progress Rewards
- Thyme: Think Beyond Images
The most advanced stage, where models achieve full cognitive autonomy. They generate new images or visual representations internally as integral steps in a closed-loop thought process.
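
At a very high level, this closed loop can be pictured as interleaved decoding in which some generated segments are image tokens that get rendered and fed back as intermediate visual thoughts. The sketch below is purely conceptual; `model`, `image_decoder`, and the segment interface are hypothetical stand-ins, and the unified models listed in this section each realize the loop differently:

```python
def think_with_generated_images(model, image_decoder, prompt_tokens, max_segments: int = 6):
    """Conceptual sketch of interleaved visual thinking in a unified model.

    `model.generate` is assumed to return a segment of either text or image
    tokens; image tokens are decoded into a picture that the model then
    conditions on in later steps.
    """
    context = list(prompt_tokens)
    transcript = []                                    # interleaved (kind, content) steps
    for _ in range(max_segments):
        segment = model.generate(context)              # stops at an end-of-text or end-of-image marker
        context += segment.tokens                      # the model reasons over its own output
        if segment.is_image:
            transcript.append(("image", image_decoder(segment.tokens)))  # the "mental image"
        else:
            transcript.append(("text", segment.text))
            if segment.is_final_answer:
                break
    return transcript
```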
Training on interleaved text-image data to teach models the grammar of multimodal thought.
- Generating images with multimodal language models
- NExT-GPT: Any-to-Any Multimodal LLM
- Minigpt-5: Interleaved vision-and-language generation via generative vokens
- Generative multimodal models are in-context learners
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
- Show-o: One single transformer to unify multimodal understanding and generation
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
- Emu3: Next-Token Prediction is All You Need
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
- Metamorph: Multimodal understanding and generation via instruction tuning
- LMFusion: Adapting Pretrained Language Models for Multimodal Generation
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
- Dual Diffusion for Unified Image Generation and Understanding
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
- Cot-vla: Visual chain-of-thought reasoning for vision-language-action models
- Transfer between Modalities with MetaQueries
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
- Emerging properties in unified multimodal pretraining
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
- Thinking with Generated Images
- UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
- Show-o2: Improved Native Unified Multimodal Models
- Qwen-Image Technical Report
Empowering models to discover generative reasoning strategies through trial, error, and reward.
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
- Visual Planning: Let's Think Only with Images
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
- MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
- Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO
- Robotic Control via Embodied Chain-of-Thought Reasoning
- FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
- ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
- Qwen-Image Technical Report
Essential resources for measuring progress. These benchmarks are specifically designed to test the multi-step, constructive, and simulative reasoning capabilities required for "Thinking with Images".
- A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
- m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
- Vgbench: Evaluating large language models on vector graphics understanding and generation
- ARC Prize 2024: Technical Report
- CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
- WorldScore: A Unified Evaluation Benchmark for World Generation
- MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
- PointArena: Probing Multimodal Grounding Through Language-Guided Pointing
- ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
- Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
- PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
- OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
- VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
- GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning
- ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations
We welcome contributions! If you have a paper that fits into this framework, please open a pull request. Let's build this resource together.
If you find our survey and this repository useful for your research, please consider citing our work:
@article{su2025thinking,
  title={Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers},
  author={Su, Zhaochen and Xia, Peng and Guo, Hangyu and Liu, Zhenhua and Ma, Yan and Qu, Xiaoye and Liu, Jiaqi and Li, Yanshu and Zeng, Kaide and Yang, Zhengyuan and others},
  journal={arXiv preprint arXiv:2506.23918},
  year={2025}
}