
This repository accompanies our survey paper:
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Welcome to Awesome-Think-With-Images! The field of multimodal AI is undergoing a fundamental evolution, moving beyond static visual perception towards a new paradigm where vision becomes a dynamic, manipulable cognitive workspace. This repository is the first comprehensive resource that systematically curates the pivotal research enabling this shift.
We structure this collection along a trajectory of increasing cognitive autonomy, as detailed in our survey. This journey unfolds across three key stages:
- Stage 1: Tool-Driven Visual Exploration, where models act as "Commanders" orchestrating external visual tools.
- Stage 2: Programmatic Visual Manipulation, where models act as "Visual Programmers" creating bespoke analyses.
- Stage 3: Intrinsic Visual Imagination, where models act as "Visual Thinkers" generating internal mental imagery.

The paradigm shift from "Thinking about Images" to "Thinking with Images", an evolution that transforms vision from a static input into a dynamic and manipulable cognitive workspace.
As detailed in our survey, this paradigm shift unlocks three key capabilities: Dynamic Perceptual Exploration, Structured Visual Reasoning, and Goal-Oriented Generative Planning. This collection is for researchers, developers, and enthusiasts eager to explore the forefront of AI that can truly see, reason, and imagine.
These three stages form the taxonomy of our work:

The taxonomy of "Thinking with Images", organizing the field into core methodologies (across three stages), evaluation benchmarks, and key applications.
- [2025-07] We have released "Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers", the first comprehensive survey dedicated to the emerging paradigm of "Think with Images".
- [2025-06] We created this repository to maintain a paper list on Awesome-Think-With-Images. Contributions are welcome!
- [2025-05] We are excited to release OpenThinkIMG, the first dedicated end-to-end open-source framework designed to empower LVLMs to truly think with images! For ease of use, we've configured a Docker environment. We warmly invite the community to explore, use, and contribute.
- Stage 1: Tool-Driven Visual Exploration
- Stage 2: Programmatic Visual Manipulation
- Stage 3: Intrinsic Visual Imagination
- Evaluation & Benchmarks
- Contributing & Citation
This section provides a conceptual map to navigate the paper list. The following papers are organized according to the primary mechanism they employ, aligning with the three-stage framework from our survey.
In this stage, the model acts as a planner, orchestrating a predefined suite of external visual tools. Intelligence is demonstrated by selecting the right tool for the right sub-task.
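
To ground this stage in a concrete picture, the following is a minimal, hypothetical sketch of such a tool-orchestration loop in Python: the model only plans the next action, a small registry of external tools performs the visual work, and each observation is fed back into the next planning step. The `vlm` callable, the tool registry, and the `TOOL`/`FINAL` reply protocol are illustrative assumptions, not the interface of any paper listed below.

```python
from typing import Callable
from PIL import Image

def run_tool_loop(question: str, image: Image.Image,
                  vlm: Callable[[str, Image.Image], str],  # any chat-style LVLM call
                  max_steps: int = 5) -> str:
    """Toy Stage-1 loop: the model selects tools; external tools do the perception."""
    tools = {
        "zoom": lambda spec: image.crop(tuple(int(v) for v in spec.split(","))),  # spec = "x0,y0,x1,y1"
        "caption": lambda _: vlm("Describe this image in detail.", image),
    }
    history: list[str] = []
    for _ in range(max_steps):
        prompt = (f"Question: {question}\n"
                  f"Observations so far: {history}\n"
                  "Reply with either 'TOOL <name> <args>' or 'FINAL <answer>'.")
        plan = vlm(prompt, image).strip()
        if plan.startswith("FINAL"):
            return plan[len("FINAL"):].strip()
        parts = plan.split(" ", 2)                       # e.g. "TOOL zoom 10,10,200,200"
        name, args = parts[1], parts[2] if len(parts) > 2 else ""
        result = tools[name](args)                       # the external tool performs the visual step
        if isinstance(result, Image.Image):              # a zoomed crop becomes the new working image
            image, result = result, "<zoomed crop attached>"
        history.append(f"{name}({args}) -> {result}")
    return vlm(f"Answer the question directly: {question}", image)
```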
Leveraging in-context learning to guide tool use without parameter updates.
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
- PromptCap: Prompt-Guided Task-Aware Image Captioning
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
- What does CLIP know about a red circle? Visual prompt engineering for VLMs
- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
- DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models
- Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
- DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
- VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
- Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
- Visual Abstract Thinking Empowers Multimodal Reasoning
- MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Fine-tuning models on data demonstrating how to invoke tools and integrate their outputs.
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
- CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
- Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models
- Instruction-Guided Visual Masking
- From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
- TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
- CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
- Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
- VGR: Visual Grounded Reasoning
- Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification
- WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Using rewards to train agents to discover optimal tool-use strategies.
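
The papers in this subsection each define their own reward schemes; purely as an illustration of the underlying idea, a trajectory-level reward for tool use might combine final-answer correctness with light shaping terms, and a policy-gradient algorithm such as PPO or GRPO would then maximize it over sampled rollouts. The function below is a toy sketch, not the reward of any specific work:

```python
def trajectory_reward(final_answer: str, gold_answer: str,
                      tool_calls: list[str], max_tool_calls: int = 4) -> float:
    """Toy reward over a tool-use rollout (illustrative; the cited papers differ)."""
    correct = float(final_answer.strip().lower() == gold_answer.strip().lower())
    used_tools = float(len(tool_calls) > 0)              # encourage grounding in the image
    over_budget = max(0, len(tool_calls) - max_tool_calls)
    return correct + 0.1 * used_tools - 0.05 * over_budget
```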
- Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
- GRIT: Teaching MLLMs to Think with Images
- Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
- OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
- VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
- Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
- Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
- One RL to See Them All: Visual Triple Unified Reinforcement Learning
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
- VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
- Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
- WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Here, models evolve into "visual programmers," generating executable code (e.g., Python) to create custom visual analyses. This unlocks compositional flexibility and interpretability.
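
As a generic sketch of this pattern (in the spirit of neuro-symbolic systems such as Visual Programming and ViperGPT, but not the API of any specific paper), the model emits a short Python program over a handful of vision primitives, and the framework executes it so every intermediate result stays inspectable. Here `detect`, `color_of`, and the `solve` entry point are hypothetical placeholders:

```python
from PIL import Image

# A program the model might generate for "How many red cars are in the image?"
GENERATED_CODE = '''
def solve(image):
    boxes = detect(image, "car")                               # primitive: open-vocabulary detector
    return sum(color_of(image.crop(b)) == "red" for b in boxes)
'''

def execute_program(code: str, image: Image.Image, primitives: dict):
    """Run model-generated code in a namespace that exposes only the allowed
    vision primitives. A production system would sandbox this execution."""
    namespace = dict(primitives)
    exec(code, namespace)                 # defines solve() inside the restricted namespace
    return namespace["solve"](image)

# Usage, with real detector / color-classifier implementations plugged in:
# count = execute_program(GENERATED_CODE, img, {"detect": my_detector, "color_of": my_color_classifier})
```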
Guiding models to generate code as a transparent, intermediate reasoning step.
- Visual programming: Compositional visual reasoning without training
- ViperGPT: Visual Inference via Python Execution for Reasoning
- Visual sketchpad: Sketching as a visual chain of thought for multimodal language models
- VipAct: Visual-perception enhancement via specialized vlm agent collaboration and tool-use
- SketchAgent: Language-Driven Sequential Sketch Generation
- CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers?
- MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
- ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
- Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving
Distilling programmatic logic into models or using code to bootstrap high-quality training data.
- Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
- Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
- Advancing vision-language models in front-end development via data synthesis
- COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
- MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
- PyVision: Agentic Vision with Dynamic Tooling
- Thyme: Think Beyond Images
Optimizing code generation policies using feedback from execution results.
- Visual Agentic Reinforcement Fine-Tuning
- ProgRM: Build Better GUI Agents with Progress Rewards
- Thyme: Think Beyond Images
The most advanced stage, where models achieve full cognitive autonomy. They generate new images or visual representations internally as integral steps in a closed-loop thought process.
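
At a very high level, this closed loop can be pictured as interleaved decoding in which some generated segments are image tokens that get rendered and fed back as intermediate visual thoughts. The sketch below is purely conceptual; `model`, `image_decoder`, and the segment interface are hypothetical stand-ins, and the unified models listed in this section each realize the loop differently:

```python
def think_with_generated_images(model, image_decoder, prompt_tokens, max_segments: int = 6):
    """Conceptual sketch of interleaved visual thinking in a unified model.

    `model.generate` is assumed to return a segment of either text or image
    tokens; image tokens are decoded into a picture that the model then
    conditions on in later steps.
    """
    context = list(prompt_tokens)
    transcript = []                                    # interleaved (kind, content) steps
    for _ in range(max_segments):
        segment = model.generate(context)              # stops at an end-of-text or end-of-image marker
        context += segment.tokens                      # the model reasons over its own output
        if segment.is_image:
            transcript.append(("image", image_decoder(segment.tokens)))  # the "mental image"
        else:
            transcript.append(("text", segment.text))
            if segment.is_final_answer:
                break
    return transcript
```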
Training on interleaved text-image data to teach models the grammar of multimodal thought.
- Generating images with multimodal language models
- NExT-GPT: Any-to-Any Multimodal LLM
- Minigpt-5: Interleaved vision-and-language generation via generative vokens
- Generative multimodal models are in-context learners
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
- Show-o: One single transformer to unify multimodal understanding and generation
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
- Emu3: Next-Token Prediction is All You Need
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
- Metamorph: Multimodal understanding and generation via instruction tuning
- LMFusion: Adapting Pretrained Language Models for Multimodal Generation
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
- Dual Diffusion for Unified Image Generation and Understanding
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
- Cot-vla: Visual chain-of-thought reasoning for vision-language-action models
- Transfer between Modalities with MetaQueries
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
- Emerging properties in unified multimodal pretraining
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
- Thinking with Generated Images
- UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
- Show-o2: Improved Native Unified Multimodal Models
- Qwen-Image Technical Report
Empowering models to discover generative reasoning strategies through trial, error, and reward.
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
- Visual Planning: Let's Think Only with Images
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
- MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
- Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO
- Robotic Control via Embodied Chain-of-Thought Reasoning
- FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
- ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
- Qwen-Image Technical Report
Essential resources for measuring progress. These benchmarks are specifically designed to test the multi-step, constructive, and simulative reasoning capabilities required for "Thinking with Images".
- A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
- m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
- Vgbench: Evaluating large language models on vector graphics understanding and generation
- ARC Prize 2024: Technical Report
- CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
- WorldScore: A Unified Evaluation Benchmark for World Generation
- MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
- PointArena: Probing Multimodal Grounding Through Language-Guided Pointing
- ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
- Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
- PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
- OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
- VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
- GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning
- ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations
We welcome contributions! If you have a paper that fits into this framework, please open a pull request. Let's build this resource together.
If you find our survey and this repository useful for your research, please consider citing our work:
@article{su2025thinking,
  title={Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers},
  author={Su, Zhaochen and Xia, Peng and Guo, Hangyu and Liu, Zhenhua and Ma, Yan and Qu, Xiaoye and Liu, Jiaqi and Li, Yanshu and Zeng, Kaide and Yang, Zhengyuan and others},
  journal={arXiv preprint arXiv:2506.23918},
  year={2025}
}