Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.

🧠🤖 Awesome-Think-With-Images

Thinking with Images: Next Frontier in Multimodal AI

This repository accompanies our survey paper:
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

License: MIT

Introduction

Welcome to Awesome-Think-With-Images! The field of multimodal AI is undergoing a fundamental evolution, moving beyond static visual perception towards a new paradigm where vision becomes a dynamic, manipulable cognitive workspace. This repository is the first comprehensive resource that systematically curates the pivotal research enabling this shift.

We structure this collection along a trajectory of increasing cognitive autonomy, as detailed in our survey. This journey unfolds across three key stages:

  1. Stage 1: Tool-Driven Visual Exploration β€” Models as "Commanders" orchestrating external visual tools.
  2. Stage 2: Programmatic Visual Manipulation β€” Models as "Visual Programmers" creating bespoke analyses.
  3. Stage 3: Intrinsic Visual Imagination β€” Models as "Visual Thinkers" generating internal mental imagery.

The Paradigm Shift from Think about Images to Think with Images

The paradigm shift from "Thinking about Images" to "Thinking with Images": an evolution that transforms vision from a static input into a dynamic, manipulable cognitive workspace.

As detailed in our survey, this paradigm shift unlocks three key capabilities: Dynamic Perceptual Exploration, Structured Visual Reasoning, and Goal-Oriented Generative Planning. This collection is for researchers, developers, and enthusiasts eager to explore the forefront of AI that can truly see, reason, and imagine.

Paradigm Comparison

Conceptual comparison of "Thinking about Images" versus "Thinking with Images".

These three stages, ordered by increasing cognitive autonomy, form the taxonomy of our work:

Taxonomy of Thinking with Images

The taxonomy of "Thinking with Images", organizing the field into core methodologies (across the three stages), evaluation benchmarks, and key applications.



🔔 News

  • [2025-07] We have released "Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers", the first comprehensive survey dedicated to the emerging paradigm of "Think with Images".
  • [2025-06] We created this repository to maintain a paper list on Awesome-Think-With-Images. Contributions are welcome!
  • [2025-05] We are excited to release OpenThinkIMG, the first dedicated end-to-end open-source framework designed to empower LVLMs to truly think with images! For ease of use, we've configured a Docker environment. We warmly invite the community to explore, use, and contribute.

📜 Table of Contents

  • Introduction
  • 🔔 News
  • 🧭 The Three-Stage Evolution of Thinking with Images
  • 🛠️ Stage 1: Tool-Driven Visual Exploration
  • 💻 Stage 2: Programmatic Visual Manipulation
  • 🎨 Stage 3: Intrinsic Visual Imagination
  • 📊 Evaluation & Benchmarks
  • 🙏 Contributing & Citation
  • Star History

🧭 The Three-Stage Evolution of Thinking with Images

This section provides a conceptual map to navigate the paper list. The following papers are organized according to the primary mechanism they employ, aligning with the three-stage framework from our survey.


🛠️ Stage 1: Tool-Driven Visual Exploration

In this stage, the model acts as a planner, orchestrating a predefined suite of external visual tools. Intelligence is demonstrated by selecting the right tool for the right sub-task.
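
As a rough illustration, the sketch below shows what such a "commander" loop might look like, assuming a toy tool set (crop and zoom) and a hypothetical call_lvlm planner stub; it is not the API of any specific framework collected here.

# Illustrative Stage 1 loop: the LVLM plans, external tools execute.
# call_lvlm, the action format, and the tool set are hypothetical placeholders.
from PIL import Image

def crop(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """External tool: return the sub-region (left, top, right, bottom)."""
    return image.crop(box)

def zoom(image: Image.Image, factor: float) -> Image.Image:
    """External tool: rescale the image so fine details become legible."""
    w, h = image.size
    return image.resize((int(w * factor), int(h * factor)))

TOOLS = {"crop": crop, "zoom": zoom}

def call_lvlm(question: str, observations: list) -> dict:
    """Placeholder for the model: returns {"tool": ..., "args": {...}} or {"answer": ...}."""
    raise NotImplementedError

def solve(question: str, image: Image.Image, max_steps: int = 5) -> str:
    observations = [image]
    for _ in range(max_steps):
        action = call_lvlm(question, observations)
        if "answer" in action:                    # the planner decides it has seen enough
            return action["answer"]
        tool = TOOLS[action["tool"]]              # select the right tool for the sub-task
        observations.append(tool(observations[-1], **action["args"]))
    return "no answer within the step budget"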

➀ Prompt-Based Approaches

Leveraging in-context learning to guide tool use without parameter updates.

➀ SFT-Based Approaches

Fine-tuning models on data demonstrating how to invoke tools and integrate their outputs.
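
For intuition, one such training example might look like the purely hypothetical record below, where the model's turns interleave a short thought, a tool invocation, and the integration of the tool's output into a final answer; actual datasets use their own schemas.

# A hypothetical SFT record for tool-augmented visual reasoning (schema is illustrative only).
sft_example = {
    "image": "example_chart.png",
    "question": "What is the value of the tallest bar?",
    "trajectory": [
        {"role": "assistant",
         "content": "The labels are too small to read; I will crop the top of the tallest bar.",
         "tool_call": {"name": "crop", "args": {"box": [120, 0, 480, 200]}}},
        {"role": "tool", "content": "<returned image crop>"},
        {"role": "assistant", "content": "The cropped label reads 37.", "answer": "37"},
    ],
}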

➀ RL-Based Approaches

Using rewards to train agents to discover optimal tool-use strategies.


💻 Stage 2: Programmatic Visual Manipulation

Here, models evolve into "visual programmers," generating executable code (e.g., Python) to create custom visual analyses. This unlocks compositional flexibility and interpretability.
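
The snippet below is a hedged example of the kind of program such a "visual programmer" might emit as an intermediate reasoning step; the task (measuring the mean brightness of a zoomed-in region) and the coordinates are invented for illustration.

# Example of a model-emitted visual analysis program (task and coordinates are made up).
from PIL import Image, ImageOps

def analyze(image_path: str) -> float:
    image = Image.open(image_path)
    region = image.crop((100, 50, 300, 200))                 # isolate the region of interest
    region = ImageOps.grayscale(region.resize((400, 300)))   # zoom in, then convert to grayscale
    pixels = list(region.getdata())
    return sum(pixels) / len(pixels)                         # report mean brightness as evidence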

➀ Prompt-Based Approaches

Guiding models to generate code as a transparent, intermediate reasoning step.

➀ SFT-Based Approaches

Distilling programmatic logic into models or using code to bootstrap high-quality training data.

➀ RL-Based Approaches

Optimizing code generation policies using feedback from execution results.
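
A minimal sketch of such an execution-based reward is given below, assuming the policy emits a standalone Python program and the reward is exact match against a gold answer; production systems sandbox the execution and use richer reward shaping.

# Minimal execution-feedback reward sketch (assumes a "python" interpreter on PATH;
# real pipelines sandbox execution and combine multiple reward signals).
import subprocess

def execution_reward(generated_code: str, gold_answer: str, timeout: float = 5.0) -> float:
    try:
        result = subprocess.run(["python", "-c", generated_code],
                                capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0                                   # non-terminating programs earn nothing
    if result.returncode != 0:
        return 0.0                                   # crashing programs earn nothing
    return 1.0 if result.stdout.strip() == gold_answer.strip() else 0.0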


🎨 Stage 3: Intrinsic Visual Imagination

In this most advanced stage, models achieve full cognitive autonomy: they generate new images or visual representations internally as integral steps in a closed-loop thought process.
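
To make the closed loop concrete, the sketch below assumes a hypothetical unified model that can emit either a text thought or an imagined image at each step, with both fed back into its own context; no existing system is implied to expose exactly this interface.

# Conceptual Stage 3 loop: one model interleaves text thoughts and imagined images.
# The UnifiedModel interface and the "ANSWER:" convention are hypothetical.
from typing import Protocol, Union
from PIL import Image

class UnifiedModel(Protocol):
    def step(self, context: list) -> Union[str, Image.Image]:
        """Emit the next reasoning step: a text thought, a final answer, or an imagined image."""
        ...

def reason(model: UnifiedModel, question: str, image: Image.Image, max_steps: int = 8) -> str:
    context: list = [question, image]
    for _ in range(max_steps):
        step = model.step(context)
        context.append(step)                         # imagined images re-enter the context
        if isinstance(step, str) and step.startswith("ANSWER:"):
            return step[len("ANSWER:"):].strip()
    return "no answer within the step budget"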

➀ SFT-Based Approaches

Training on interleaved text-image data to teach models the grammar of multimodal thought.

➀ RL-Based Approaches

Empowering models to discover generative reasoning strategies through trial, error, and reward.


📊 Evaluation & Benchmarks

Essential resources for measuring progress. These benchmarks are specifically designed to test the multi-step, constructive, and simulative reasoning capabilities required for "Thinking with Images".

➀ Benchmarks for Thinking with Images


🙏 Contributing & Citation

We welcome contributions! If you have a paper that fits into this framework, please open a pull request. Let's build this resource together.

If you find our survey and this repository useful for your research, please consider citing our work:

@article{su2025thinking,
  title={Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers},
  author={Su, Zhaochen and Xia, Peng and Guo, Hangyu and Liu, Zhenhua and Ma, Yan and Qu, Xiaoye and Liu, Jiaqi and Li, Yanshu and Zeng, Kaide and Yang, Zhengyuan and others},
  journal={arXiv preprint arXiv:2506.23918},
  year={2025}
}

Star History

Star History Chart
