feat: Persist the SitemapRequestLoader state #1347
base: master
Pull Request Overview
This PR adds persistence functionality to the SitemapRequestLoader by implementing state management through a new SitemapRequestLoaderState model and RecoverableState integration. The changes enable the loader to save and restore its internal state, allowing it to resume sitemap processing after interruptions.
- Added state persistence model with queue, progress tracking, and completion status
- Refactored internal data structures to use deques and sets that can be serialized
- Added context manager support for proper resource cleanup
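To illustrate why the switch to serializable deques and sets matters, here is a minimal standard-library sketch of a state object that round-trips through JSON. It is not the PR's SitemapRequestLoaderState: the class name LoaderStateSketch and all field names are hypothetical, and the actual model in the PR is a BaseModel subclass, not a dataclass.

```python
from __future__ import annotations

import json
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LoaderStateSketch:
    """Hypothetical stand-in for a persisted loader state (field names are assumptions)."""

    url_queue: deque[str] = field(default_factory=deque)  # URLs waiting to be handed out
    in_progress: set[str] = field(default_factory=set)  # URLs handed out, not yet marked handled
    pending_sitemaps: deque[str] = field(default_factory=deque)  # sitemaps not yet parsed
    completed: bool = False  # True once every sitemap has been fully read
    handled_count: int = 0  # simple progress counter

    def to_json(self) -> str:
        # deque and set are not JSON types, so both are stored as lists.
        return json.dumps(
            {
                "url_queue": list(self.url_queue),
                "in_progress": sorted(self.in_progress),
                "pending_sitemaps": list(self.pending_sitemaps),
                "completed": self.completed,
                "handled_count": self.handled_count,
            }
        )

    @classmethod
    def from_json(cls, raw: str) -> LoaderStateSketch:
        data = json.loads(raw)
        return cls(
            url_queue=deque(data["url_queue"]),
            in_progress=set(data["in_progress"]),
            pending_sitemaps=deque(data["pending_sitemaps"]),
            completed=data["completed"],
            handled_count=data["handled_count"],
        )


# Simulate some progress, then a save/restore cycle.
state = LoaderStateSketch(pending_sitemaps=deque(["https://example.com/sitemap.xml"]))
state.url_queue.extend(["https://example.com/a", "https://example.com/b"])
state.in_progress.add(state.url_queue.popleft())
restored = LoaderStateSketch.from_json(state.to_json())
```

After the round trip, restored carries the same queue order and the same in-progress set as state, which is the information a loader needs in order to continue where it stopped.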
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
File | Description |
---|---|
src/crawlee/request_loaders/_sitemap_request_loader.py | Core implementation of state persistence with new state model and async context manager |
src/crawlee/_utils/sitemap.py | Added exception handling for SAXParseException during parser cleanup |
tests/unit/request_loaders/test_sitemap_request_loader.py | Added asyncio.sleep calls to allow background loading time in tests |
docs/guides/code_examples/request_loaders/sitemap_example.py | Updated example to include sleep for proper loader initialization |
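The SAXParseException entry can be illustrated with the standard library alone. The sketch below is an assumption about the general pattern, not the code in src/crawlee/_utils/sitemap.py: an incremental xml.sax parser that is closed before the document is complete raises SAXParseException from close(), so cleanup code has to tolerate it. The names LocCollector and parse_chunks are hypothetical.

```python
from __future__ import annotations

from xml.sax import SAXParseException, make_parser
from xml.sax.handler import ContentHandler


class LocCollector(ContentHandler):
    """Collects the text content of every <loc> element."""

    def __init__(self) -> None:
        super().__init__()
        self.locs: list[str] = []
        self._buffer: list[str] | None = None

    def startElement(self, name, attrs):
        if name == "loc":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate until the element ends.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "loc" and self._buffer is not None:
            self.locs.append("".join(self._buffer).strip())
            self._buffer = None


def parse_chunks(chunks: list[bytes]) -> list[str]:
    """Feed chunks incrementally and return the <loc> values seen so far."""
    handler = LocCollector()
    parser = make_parser()
    parser.setContentHandler(handler)
    try:
        for chunk in chunks:
            parser.feed(chunk)
    finally:
        try:
            parser.close()
        except SAXParseException:
            # The stream ended mid-document (for example, processing was
            # interrupted); the URLs already collected are still usable.
            pass
    return handler.locs


# A truncated sitemap: the closing </urlset> never arrives.
truncated = [
    b"<urlset><url><loc>https://a.example/1</loc></url>",
    b"<url><loc>https://a.example/2</loc></url>",
]
found = parse_chunks(truncated)
```

Swallowing the exception only in the cleanup path keeps genuine parse errors raised by feed() visible to the caller.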
62b743d to 146e805
146e805 to c3f0f68
docs/guides/code_examples/request_loaders/sitemap_example_with_persist.py (outdated, resolved)
@@ -23,12 +29,45 @@
logger = getLogger(__name__)


class SitemapRequestLoaderState(BaseModel):
Now that I see this from the outside, it would be great if you could write down how the persistence mechanism works in the docblock of this class.
Also, I don't see processed sitemap URLs being tracked in any way. Is that intentional?
> I don't see processed sitemap URLs being tracked in any way. Is that intentional?
Yes, I may be wrong, but I think that cyclic links are not expected in sitemaps. Thanks to this, we don't need to store links to processed sitemaps.
JS uses similar behavior - https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L108
I wouldn't be surprised to encounter a cyclic sitemap somewhere, but I don't have a real-world example 🤷
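For reference, the tracking discussed in this thread would amount to something like the sketch below. It is purely illustrative and not part of the PR: the SITEMAP_INDEX graph and the walk_sitemaps function are hypothetical, and the PR as written keeps no record of processed sitemaps.

```python
from __future__ import annotations

from collections import deque

# Hypothetical nested-sitemap graph containing a cycle (b.xml points back to root.xml).
SITEMAP_INDEX = {
    "https://example.com/root.xml": ["https://example.com/a.xml", "https://example.com/b.xml"],
    "https://example.com/a.xml": ["https://example.com/b.xml"],
    "https://example.com/b.xml": ["https://example.com/root.xml"],
}


def walk_sitemaps(start: str, children: dict[str, list[str]]) -> list[str]:
    """Visit each sitemap at most once, in breadth-first order."""
    pending = deque([start])
    processed: set[str] = set()  # this set is the extra state that would need persisting
    order: list[str] = []
    while pending:
        url = pending.popleft()
        if url in processed:
            continue  # already parsed; skipping it breaks the cycle
        processed.add(url)
        order.append(url)
        pending.extend(children.get(url, []))
    return order


visited = walk_sitemaps("https://example.com/root.xml", SITEMAP_INDEX)
```

Without the processed set, the loop above would never terminate on this input. The cost of the guard is one more collection in the persisted state.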
LGTM. However, let's wait for @Pijukatel and/or @janbuchar as well.
0524cfc to 0600583
Description
SitemapRequestLoader state
Issues
SitemapRequestLoader state #1269