@dolfim-ibm dolfim-ibm commented Sep 11, 2025

This PR refactors the pipelines to allow enrichment on "standard items" in all pipelines. This makes it possible to run picture classification and description on images embedded in MS Word and HTML documents.

Actual changes:

  1. Move artifacts_path to the base pipeline_options and the base pipeline class, which removes redundant code from the other pipelines.
  2. Add ConvertPipeline, which has options to enable the common enrichment steps.
  3. Fix the enrichment model base class to handle embedded images in PictureItem when no page images are available.
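
As a rough illustration of the three changes above, here is a minimal, self-contained sketch. It does not import docling, and every class and function name in it (`BaseOptionsSketch`, `ConvertOptionsSketch`, `PictureSketch`, `enabled_steps`, `image_for_enrichment`) is hypothetical. The fallback order (page image first, embedded image second) is also an assumption, not something stated in this PR.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class BaseOptionsSketch:
    # Hypothetical base options: artifacts_path is declared once here
    # so that derived option classes do not have to repeat it.
    artifacts_path: Optional[Path] = None


@dataclass
class ConvertOptionsSketch(BaseOptionsSketch):
    # Hypothetical flags for the common enrichment steps.
    do_picture_classification: bool = False
    do_picture_description: bool = False


@dataclass
class PictureSketch:
    # Hypothetical picture item. Images are plain strings to keep this runnable.
    page_no: Optional[int] = None
    embedded_image: Optional[str] = None


def enabled_steps(options: ConvertOptionsSketch) -> List[str]:
    """Return the names of the enrichment steps switched on in the options."""
    steps: List[str] = []
    if options.do_picture_classification:
        steps.append("picture_classification")
    if options.do_picture_description:
        steps.append("picture_description")
    return steps


def image_for_enrichment(
    picture: PictureSketch, page_images: Dict[int, str]
) -> Optional[str]:
    """Pick the image an enrichment model would receive for this picture."""
    # Assumed order: use the page image when one exists for the picture's page,
    # otherwise fall back to the image embedded in the item (DOCX/HTML case).
    if picture.page_no is not None and picture.page_no in page_images:
        return page_images[picture.page_no]
    return picture.embedded_image
```

With this shape, a format without rendered pages can still feed pictures to an enrichment step: `image_for_enrichment(PictureSketch(embedded_image="logo"), {})` returns the embedded image instead of failing on the missing page.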

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

github-actions bot commented Sep 11, 2025

DCO Check Passed

Thanks @dolfim-ibm, all your commits are properly signed off. 🎉

dosubot bot commented Sep 11, 2025

Related Documentation

Checked 2 published document(s). No updates required.

mergify bot commented Sep 11, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
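
For reference, the title rule above can be tried locally with Python's `re` module. This is only a sketch: the helper name `is_conventional` and the sample titles are made up for illustration, and it assumes the `~=` operator behaves like a regex search against the PR title.

```python
import re

# The pattern is copied verbatim from the merge protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


# Hypothetical titles, not taken from this PR.
for title in ["feat: add enrichment", "fix(cli)!: drop flag", "Add enrichment"]:
    print(title, "->", is_conventional(title))
```

The optional `(scope)` and `!` parts mean that both `feat:` and `feat(scope)!:` pass, while a title without one of the listed type prefixes does not.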

PeterStaar-IBM previously approved these changes Sep 11, 2025

@PeterStaar-IBM PeterStaar-IBM left a comment

lgtm!

codecov bot commented Sep 11, 2025

Codecov Report

❌ Patch coverage is 72.72727% with 21 lines in your changes missing coverage. Please review.

Files with missing lines                     | Patch % | Lines
docling/models/base_model.py                 | 33.33%  | 8 Missing ⚠️
docling/pipeline/base_extraction_pipeline.py | 41.66%  | 7 Missing ⚠️
docling/pipeline/base_pipeline.py            | 82.75%  | 5 Missing ⚠️
docling/cli/main.py                          | 66.66%  | 1 Missing ⚠️
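
The report gives only percentages and missing-line counts, but the number of changed lines per file can be estimated from them. The sketch below assumes that patch coverage is covered lines divided by changed lines and that the displayed percentage is truncated rather than rounded. Both are assumptions about the report format, and `infer_patch_lines` is a hypothetical helper that returns the smallest line count consistent with the figures.

```python
from decimal import Decimal
from typing import Optional


def infer_patch_lines(
    missing: int, reported_pct: str, max_total: int = 10_000
) -> Optional[int]:
    """Smallest number of changed lines consistent with the reported figures."""
    # Work in hundredths of a percent, truncated, to avoid float rounding issues.
    target = int(Decimal(reported_pct) * 100)
    for total in range(max(missing, 1), max_total + 1):
        covered = total - missing
        if (10_000 * covered) // total == target:
            return total
    return None


# Figures copied from the report above: (missing lines, patch %).
report = {
    "docling/models/base_model.py": (8, "33.33"),
    "docling/pipeline/base_extraction_pipeline.py": (7, "41.66"),
    "docling/pipeline/base_pipeline.py": (5, "82.75"),
    "docling/cli/main.py": (1, "66.66"),
    "overall": (21, "72.72727"),
}

for name, (missing, pct) in report.items():
    print(name, infer_patch_lines(missing, pct))
```

Under these assumptions the four listed files would account for 12 + 12 + 29 + 3 = 56 changed lines and all 21 missing ones, while the overall figure is consistent with 77 changed lines (56 covered). The remaining changed lines would then sit in files with full patch coverage.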

@dolfim-ibm dolfim-ibm requested a review from vagenas September 11, 2025 11:25
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

@cau-git cau-git left a comment

LGTM

@ceberam ceberam left a comment

🚀

@dolfim-ibm dolfim-ibm merged commit 2c91234 into main Sep 11, 2025
13 of 14 checks passed
@dolfim-ibm dolfim-ibm deleted the feat-enrich-docx-html branch September 11, 2025 13:09

dosubot bot commented Sep 11, 2025

Documentation Updates

Checked 2 published document(s). No updates required.
