feat: Persist the SitemapRequestLoader state #1347

Open: 21 commits to be merged into base branch master

Conversation

Mantisus
Collaborator

Description

  • Persist the SitemapRequestLoader state

Issues

@Mantisus Mantisus self-assigned this Aug 11, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds persistence functionality to the SitemapRequestLoader by implementing state management through a new SitemapRequestLoaderState model and RecoverableState integration. The changes enable the loader to save and restore its internal state, allowing it to resume sitemap processing after interruptions.

  • Added state persistence model with queue, progress tracking, and completion status
  • Refactored internal data structures to use deques and sets that can be serialized
  • Added context manager support for proper resource cleanup
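The save-and-restore idea summarized above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the PR uses a Pydantic `BaseModel` together with `RecoverableState`, whereas this sketch substitutes stdlib `dataclasses` and `json`, and the class name `LoaderStateSketch` and every field name are assumptions made for the example.

```python
import json
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LoaderStateSketch:
    """Hypothetical stand-in for the PR's state model; all field names are assumptions."""

    url_queue: deque[str] = field(default_factory=deque)  # URLs waiting to be handed out
    in_progress: set[str] = field(default_factory=set)  # URLs handed out, not yet marked handled
    pending_sitemap_urls: deque[str] = field(default_factory=deque)  # sitemaps not yet parsed
    handled_count: int = 0  # progress counter
    completed: bool = False  # True once every sitemap has been fully read

    def to_json(self) -> str:
        # Deques and sets are not JSON types, so they are stored as lists.
        return json.dumps(
            {
                'url_queue': list(self.url_queue),
                'in_progress': sorted(self.in_progress),
                'pending_sitemap_urls': list(self.pending_sitemap_urls),
                'handled_count': self.handled_count,
                'completed': self.completed,
            }
        )

    @classmethod
    def from_json(cls, raw: str) -> 'LoaderStateSketch':
        # Rebuild the deques and the set from their list representation.
        data = json.loads(raw)
        return cls(
            url_queue=deque(data['url_queue']),
            in_progress=set(data['in_progress']),
            pending_sitemap_urls=deque(data['pending_sitemap_urls']),
            handled_count=data['handled_count'],
            completed=data['completed'],
        )


state = LoaderStateSketch(url_queue=deque(['https://example.com/a', 'https://example.com/b']))
state.in_progress.add(state.url_queue.popleft())  # hand out the first URL

restored = LoaderStateSketch.from_json(state.to_json())  # simulate a restart
```

After the round trip, `restored` holds the same queue and in-progress set as `state`, which is what lets a loader continue from where it stopped instead of re-reading the sitemap from the start.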

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/crawlee/request_loaders/_sitemap_request_loader.py | Core implementation of state persistence with new state model and async context manager |
| src/crawlee/_utils/sitemap.py | Added exception handling for SAXParseException during parser cleanup |
| tests/unit/request_loaders/test_sitemap_request_loader.py | Added asyncio.sleep calls to allow background loading time in tests |
| docs/guides/code_examples/request_loaders/sitemap_example.py | Updated example to include sleep for proper loader initialization |

@Mantisus Mantisus force-pushed the persist-sitemap-loader branch 2 times, most recently from 62b743d to 146e805 Compare August 11, 2025 20:17
@Mantisus Mantisus force-pushed the persist-sitemap-loader branch from 146e805 to c3f0f68 Compare August 11, 2025 20:51
@Mantisus Mantisus requested a review from vdusek August 12, 2025 14:13
vdusek

This comment was marked as resolved.

@Mantisus Mantisus requested a review from vdusek August 12, 2025 23:59
@@ -23,12 +29,45 @@
logger = getLogger(__name__)


class SitemapRequestLoaderState(BaseModel):
Collaborator

Now that I see this from the outside, it would be great if you could write down how the persistence mechanism works in the docblock of this class.

Also, I don't see processed sitemap URLs being tracked in any way. Is that intentional?

Collaborator Author

I don't see processed sitemap URLs being tracked in any way, is that intentional?

Yes, it is intentional. I may be wrong, but I don't think cyclic links are expected in sitemaps, so we don't need to store the URLs of processed sitemaps.

The JS implementation behaves similarly: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L108

Collaborator

I wouldn't be surprised to encounter a cyclic sitemap somewhere, but I don't have a real-world example 🤷
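
For illustration only, a guard against such a cycle could look roughly like the sketch below. It is not part of this PR or of Crawlee: `walk_sitemaps`, `children_of`, and the in-memory `links` mapping are hypothetical names, and real code would fetch and parse each sitemap instead of reading a dict.

```python
from collections import deque


def walk_sitemaps(start_urls, children_of):
    """Visit each sitemap URL at most once, even if sitemaps reference each other."""
    pending = deque(start_urls)
    seen = set(start_urls)  # every sitemap URL that has ever been queued
    order = []
    while pending:
        current = pending.popleft()
        order.append(current)
        for nested in children_of(current):
            if nested not in seen:  # skip sitemaps that were already queued
                seen.add(nested)
                pending.append(nested)
    return order


# Hypothetical sitemap graph in which b.xml links back to a.xml.
links = {'a.xml': ['b.xml'], 'b.xml': ['a.xml', 'c.xml'], 'c.xml': []}
visited = walk_sitemaps(['a.xml'], lambda url: links.get(url, []))
```

The cost is that the `seen` set would have to be persisted along with the rest of the state, which is the storage the current design avoids.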

Collaborator

@vdusek vdusek left a comment

LGTM. However, let's wait for @Pijukatel and/or @janbuchar as well.

@Mantisus Mantisus force-pushed the persist-sitemap-loader branch from 0524cfc to 0600583 Compare August 21, 2025 20:04
Development

Successfully merging this pull request may close these issues.

Persist the SitemapRequestLoader state
3 participants