
Conversation

shkarupa-alex
Contributor

This is the third and, I hope, the last PR to support models like OlmOcr (1: #1802, 2: #1808).

VLM models may generate more than plain text or markdown.
E.g.:

  • OlmOcr generates JSON containing the page language, rotation, and recognized text.
  • Nanonets-OCR-s generates recognized text as markdown, but renders tables and some other elements as HTML.

For such models we need a way to decode the VLM response (i.e. fully convert it to markdown).
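For the OlmOcr case, such a decoding step could look like the sketch below. This is illustrative only: the `natural_text` field name is an assumption, not a confirmed part of the OlmOcr output schema.

```python
import json

def decode_olmocr_response(raw: str) -> str:
    """Unwrap an OlmOcr-style JSON reply into plain markdown text.

    Illustrative sketch: the "natural_text" field name is an assumption,
    not a confirmed part of the OlmOcr schema.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # the model returned plain text after all
        return raw
    # the reply also carries metadata such as page language and rotation,
    # but only the recognized text is needed for markdown export
    return payload.get("natural_text", raw)
```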

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
Contributor

github-actions bot commented Jul 8, 2025

DCO Check Passed

Thanks @shkarupa-alex, all your commits are properly signed off. 🎉


mergify bot commented Jul 8, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@shkarupa-alex shkarupa-alex changed the title (feat: vlm): Ability to preprocess VLM response (feat): Ability to preprocess VLM response Jul 8, 2025
@shkarupa-alex shkarupa-alex changed the title (feat): Ability to preprocess VLM response feat(vlm): Ability to preprocess VLM response Jul 8, 2025

codecov bot commented Jul 8, 2025

Codecov Report

Attention: Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling/models/api_vlm_model.py | 0.00% | 2 Missing ⚠️ |
| .../models/vlm_models_inline/hf_transformers_model.py | 0.00% | 2 Missing ⚠️ |
| docling/models/vlm_models_inline/mlx_model.py | 0.00% | 2 Missing ⚠️ |


@dolfim-ibm
Contributor

@shkarupa-alex we would like to propose an alternative solution.

First, let's add a method decode_response() to ApiVlmOptions (and similarly for HuggingFaceTransformersVlmModel) which simply returns the input text.

Then, in your OlmOcr example, you can make a derived class from ApiVlmOptions which overrides it with the JSON parsing.

Are you willing to adapt your PR in this direction? (sorry for the late response)
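The proposal above could be sketched roughly as follows. The base class here is a stand-in for docling's ApiVlmOptions reduced to the proposed hook, and the `natural_text` field name is an assumed example, not a confirmed schema detail.

```python
import json

class ApiVlmOptionsSketch:
    """Stand-in for docling's ApiVlmOptions, reduced to the proposed hook."""

    def decode_response(self, text: str) -> str:
        # default behaviour: the response is already markdown, pass through
        return text

class OlmOcrVlmOptions(ApiVlmOptionsSketch):
    """Derived options that unwrap an OlmOcr-style JSON reply."""

    def decode_response(self, text: str) -> str:
        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            return text  # not JSON, pass through unchanged
        # "natural_text" is an assumed field name for illustration
        return payload.get("natural_text", text)
```

The pipeline would then call options.decode_response(raw) on every model reply, so users of the base class see no behavior change.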

@shkarupa-alex
Contributor Author

I will apply your recommendations in a few days.

@shkarupa-alex
Contributor Author

@dolfim-ibm I moved decode_response to the VLM options as you proposed.
But to keep the API consistent I also moved the per-page prompt formulation (introduced in #1808) to the VLM options.

…de). Per-page prompt formulation also moved to vlm options to keep api consistent.

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
@shkarupa-alex shkarupa-alex force-pushed the vlm-preprocess-response branch from 21932ac to 713612e on August 10, 2025 09:35
@dolfim-ibm
Contributor

Thanks @shkarupa-alex, I just noticed I had a big typo in my message above. What we planned is to put the functions in the model classes, not the options. Sorry about the message being misleading.

@shkarupa-alex
Contributor Author

shkarupa-alex commented Aug 11, 2025

@dolfim-ibm could you please clarify what you mean by a model?
If you are talking about ApiVlmModel, HuggingFaceMlxModel and HuggingFaceTransformersVlmModel, that would be a good place, but users currently cannot override them: https://github.com/docling-project/docling/blob/main/docling/pipeline/vlm_pipeline.py#L78

Can you propose how to pass an overridden model class (or instance) to the pipeline?

Contributor

@dolfim-ibm dolfim-ibm left a comment


Great, thanks for following up on all iterations!

@dolfim-ibm
Contributor

@shkarupa-alex we discussed the approach a bit more and we think it covers the short-term needs. Again, thanks a lot for the contribution and for following up on the discussions.

We are actually rethinking the whole stage and model-runtime design, so we might have to do a few iterations on this topic as well in the coming days.

@dolfim-ibm dolfim-ibm merged commit 5f050f9 into docling-project:main Aug 12, 2025
11 checks passed