
feat: Implement SQLStorageClient based on sqlalchemy v2+ #1339

Open · wants to merge 24 commits into base: master

Conversation

Mantisus
Collaborator

@Mantisus Mantisus commented Aug 1, 2025

Description

  • Add SQLStorageClient, which accepts a database connection string or a pre-configured AsyncEngine, or creates a default crawlee.db database in Configuration.storage_dir (see the sketch below).
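
A minimal sketch of the three construction options described above (assuming the connection_string and engine parameter names; defaults may differ from the final implementation):

from sqlalchemy.ext.asyncio import create_async_engine

from crawlee.storage_clients import SQLStorageClient

# Default: creates a crawlee.db SQLite database in Configuration.storage_dir.
client = SQLStorageClient()

# From a connection string (any async driver supported by SQLAlchemy).
client = SQLStorageClient(connection_string='sqlite+aiosqlite:///crawlee.db')

# From a pre-configured AsyncEngine (requires the asyncpg driver in this example).
engine = create_async_engine('postgresql+asyncpg://user:password@localhost:5432/postgres')
client = SQLStorageClient(engine=engine)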

Issues

@Mantisus Mantisus self-assigned this Aug 1, 2025
@Mantisus Mantisus added this to the 1.0 milestone Aug 1, 2025
@Mantisus Mantisus requested a review from Copilot August 1, 2025 21:23
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements a new SQL-based storage client (SQLStorageClient) that provides persistent data storage using SQLAlchemy v2+ for datasets, key-value stores, and request queues.

Key changes:

  • Adds SQLStorageClient with support for connection strings, pre-configured engines, or default SQLite database
  • Implements SQL-based clients for all three storage types with database schema management and transaction handling
  • Updates storage model configurations to support SQLAlchemy ORM mapping with from_attributes=True

Reviewed Changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 1 comment.

File - Description

src/crawlee/storage_clients/_sql/ - New SQL storage implementation with database models, clients, and schema management
tests/unit/storage_clients/_sql/ - Comprehensive test suite for SQL storage functionality
tests/unit/storages/ - Updates to test fixtures to include SQL storage client testing
src/crawlee/storage_clients/models.py - Adds from_attributes=True to model configs for SQLAlchemy ORM compatibility
pyproject.toml - Adds new sql optional dependency group
src/crawlee/storage_clients/__init__.py - Adds conditional import for SQLStorageClient
Comments suppressed due to low confidence (1)

tests/unit/storages/test_request_queue.py:23

  • The test fixture only tests the 'sql' storage client, but the removed 'memory' and 'file_system' parameters suggest this may have unintentionally reduced test coverage. Consider including all storage client types to ensure comprehensive testing.
@pytest.fixture(params=['sql'])

Mantisus and others added 2 commits August 2, 2025 00:25
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@Mantisus
Collaborator Author

Mantisus commented Aug 1, 2025

When implementing, I opted out of SQLModel for several reasons:

  • Poor library support. As of today, SQLModel has a huge number of open PRs and update requests, some of which are several years old. The latest releases have been mostly cosmetic (updating dependencies, documentation, builds, checks, etc.).
  • Model hierarchy issue: if we use SQLModel, it's expected that we'll inherit the existing Pydantic models from it. This greatly increases the base dependencies (SQLModel, SQLAlchemy, aiosqlite). I don't think we should do this (see the last point).
  • It doesn't support optimization constraints for database tables, such as string length limits.
  • Poor typing when using anything other than select - see "Add an overload to the exec method with _Executable statement for update and delete statements" fastapi/sqlmodel#909.
  • Overall, we can achieve the same behavior using only SQLAlchemy v2+: https://docs.sqlalchemy.org/en/20/orm/dataclasses.html#integrating-with-alternate-dataclass-providers-such-as-pydantic (see the sketch after this list). However, this retains the inheritance hierarchy and dependency issue.
  • I think that data models for SQL can be simpler while being better adapted for SQL than the models used in the framework. This way, we can optimize each data model for its task.
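
For reference, a minimal sketch of the pattern from that SQLAlchemy docs page (not code from this PR; it uses the dataclass_callable hook described there):

import pydantic.dataclasses
from sqlalchemy.orm import DeclarativeBase, Mapped, MappedAsDataclass, mapped_column


# Mapped subclasses of this base are built with pydantic's dataclass decorator
# instead of the stdlib one.
class Base(
    MappedAsDataclass,
    DeclarativeBase,
    dataclass_callable=pydantic.dataclasses.dataclass,
):
    pass


class User(Base):
    __tablename__ = 'user'

    id: Mapped[int] = mapped_column(primary_key=True, init=False)
    name: Mapped[str]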

@Mantisus
Collaborator Author

Mantisus commented Aug 1, 2025

The storage client has been repeatedly tested with SQLite and a local PostgreSQL instance (a simple container installation without fine-tuning).
Code for testing:

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.storage_clients import SQLStorageClient
from crawlee.storages import RequestQueue, KeyValueStore
from crawlee import service_locator
from crawlee import ConcurrencySettings


LOCAL_POSTGRE = None  # 'postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres'
USE_STATE = True
KVS = True
DATASET = True
CRAWLERS = 1
REQUESTS = 10000
DROP_STORAGES = True


async def main() -> None:
    service_locator.set_storage_client(
        SQLStorageClient(
            connection_string=LOCAL_POSTGRE if LOCAL_POSTGRE else None,
        )
    )

    kvs = await KeyValueStore.open()
    queue_1 = await RequestQueue.open(name='test_queue_1')
    queue_2 = await RequestQueue.open(name='test_queue_2')
    queue_3 = await RequestQueue.open(name='test_queue_3')

    urls = [f'https://crawlee.dev/page/{i}' for i in range(REQUESTS)]

    await queue_1.add_requests(urls)
    await queue_2.add_requests(urls)
    await queue_3.add_requests(urls)

    crawler_1 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_1)
    crawler_2 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_2)
    crawler_3 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_3)

    # Define the default request handler
    @crawler_1.router.default_handler
    @crawler_2.router.default_handler
    @crawler_3.router.default_handler
    async def request_handler(context: BasicCrawlingContext) -> None:
        if USE_STATE:
            # Use state to store data
            state_data = await context.use_state()
            state_data['a'] = context.request.url

        if KVS:
            # Use KeyValueStore to store data
            await kvs.set_value(context.request.url, {'url': context.request.url, 'title': 'Example Title'})
        if DATASET:
            await context.push_data({'url': context.request.url, 'title': 'Example Title'})

    crawlers = [crawler_1]
    if CRAWLERS > 1:
        crawlers.append(crawler_2)
    if CRAWLERS > 2:
        crawlers.append(crawler_3)

    # Run the crawler
    data = await asyncio.gather(*[crawler.run() for crawler in crawlers])

    print(data)

    if DROP_STORAGES:
        # Drop all storages
        await queue_1.drop()
        await queue_2.drop()
        await queue_3.drop()
        await kvs.drop()


if __name__ == '__main__':
    asyncio.run(main())

This allows you to put a heavy load on the storage without making real requests.

@Mantisus
Collaborator Author

Mantisus commented Aug 1, 2025

The use of accessed_modified_update_interval is an optimization. Frequent metadata updates just to change the access time can overload the database.
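
A hypothetical illustration of that throttling idea (the names here are illustrative, not the actual attributes in the PR): the metadata timestamps are only written back if the configured interval has elapsed since the previous write.

from datetime import datetime, timedelta, timezone


class TimestampThrottle:
    """Illustrative only: skip metadata timestamp writes that arrive too soon after the previous one."""

    def __init__(self, update_interval: timedelta) -> None:
        self._update_interval = update_interval
        self._last_write: datetime | None = None

    def should_write(self) -> bool:
        """Return True (and record the write) only if the interval has elapsed."""
        now = datetime.now(timezone.utc)
        if self._last_write is None or now - self._last_write >= self._update_interval:
            self._last_write = now
            return True
        return False


throttle = TimestampThrottle(update_interval=timedelta(seconds=1))
if throttle.should_write():
    pass  # issue the UPDATE of accessed_at / modified_at here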

@Mantisus Mantisus removed this from the 1.0 milestone Aug 4, 2025
Collaborator

@Pijukatel Pijukatel left a comment

First part of the review. I will do the RQ and tests in the second part.

I have only minor comments. My main suggestion is to extract more of the code that is shared by all 3 clients. It is easier to understand all the clients once the reader knows which parts of the code are exactly the same in all clients and which parts are unique and specific to each client. It also makes the code easier to maintain.

The drawback would be that understanding just one class in isolation would be a little bit harder. But who wants to understand just one client?

flatten: list[str] | None = None,
view: str | None = None,
) -> DatasetItemsListPage:
# Check for unsupported arguments and log a warning if found.
Collaborator

Is this unsupported just in this initial commit, or is there no plan to support these arguments in the future?

Collaborator Author

I think this would complicate database queries quite a bit. I don't plan to support it, but we could reconsider this in the future.

Since SQLite now supports JSON operations, this is possible - https://sqlite.org/json1.html
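
A tiny illustration of what that could look like against the dataset_item table described later in this thread (the column names are assumptions; this query is not part of the implementation):

from sqlalchemy import text

# Illustrative only: pull a nested field out of the serialized JSON 'data' column with SQLite's JSON1 functions.
stmt = text("SELECT json_extract(data, '$.title') AS title FROM dataset_item")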

)

@override
async def iterate_items(
Collaborator

It shares the same code with get_data; maybe extract it into an internal function reused in both places?

Just a note:
I guess there could also be room for some optimization in the future to make iterate_items lazier when it comes to getting data from the db, or lazy + buffered, instead of getting all the data from the db now and iterating over it.
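
A rough sketch of what a lazier variant could look like with SQLAlchemy's async streaming (illustrative only, not the implementation in this PR):

from collections.abc import AsyncIterator
from typing import Any

from sqlalchemy import Select
from sqlalchemy.ext.asyncio import AsyncSession


async def iterate_rows_lazily(
    session: AsyncSession,
    stmt: Select[Any],
    buffer_size: int = 100,
) -> AsyncIterator[Any]:
    """Stream rows from the database in chunks instead of loading everything up front."""
    result = await session.stream(stmt.execution_options(yield_per=buffer_size))
    async for row in result:
        yield row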

impl = DateTime(timezone=True)
cache_ok = True

def process_result_value(self, value: datetime | None, _dialect: Dialect) -> datetime | None:
Collaborator

@Pijukatel Pijukatel Aug 13, 2025

Why does it allow None at input/output?

Collaborator Author

TypeDecorator expects that value in process_result_value can be None, since this is a generic type definition and the column can have nullable=True.

"""Convert Python value to database value."""
return 'default' if value is None else value

def process_result_value(self, value: str | None, _dialect: Dialect) -> str | None:
Collaborator

Shouldn't the input value be str and not None?

Collaborator Author

This will cause a conflict in mypy, since value in the signature of TypeDecorator.process_result_value is Any | None.

self._accessed_modified_update_interval = storage_client.get_accessed_modified_update_interval()

@override
async def get_metadata(self) -> DatasetMetadata:
Collaborator

These three methods are nearly identical in all clients:

get_session
get_autocommit_session

and maybe

get_metadata

Maybe we could reuse them. Maybe define them in a standalone class and use it as a mixin in all three clients.
The class could be generic over the metadata type.

Or maybe push them down to SQLStorageClient, as they are indeed specific to it.

Collaborator

@Pijukatel Pijukatel Aug 13, 2025

Could this also be further expanded by adding a method safely_open to the mixin (or down to SQLStorageClient) which would be a wrapper for the client-specific open?

async with storage_client.create_session() as session:
    # client-specific open
    try:
        # Commit the insert or update of metadata to the database
        await session.commit()
    except SQLAlchemyError:
        ...
        client = cls(
            id=orm_metadata.id,
            storage_client=storage_client,
        )

This could be done in many different ways. The point is to extract and centralize the SQL-specific code shared by all clients and keep the clients clean, with only their specific, unique code.

Maybe a decorator in SQLStorageClient?

Collaborator

purge and drop also show a significant degree of similarity.

Collaborator

Also, the portion of the metadata update related to the timestamps could be extracted.

self._last_accessed_at and self._last_modified_at have the same optimization mechanics shared by all SQL clients.

Collaborator Author

Thank you, great ideas for code optimization.

I moved the duplicate code to a mixin.
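
A rough sketch of the shape such a mixin can take (class and attribute names here are illustrative, not the actual code in this PR):

from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker


class SQLClientMixin:
    """Illustrative mixin centralizing the session helpers shared by the dataset, KVS, and RQ clients."""

    # Assumed to be provided by the concrete client (e.g. obtained from SQLStorageClient).
    _session_maker: async_sessionmaker[AsyncSession]

    def get_session(self) -> AsyncSession:
        """Create a new session bound to the shared engine."""
        return self._session_maker()

    @asynccontextmanager
    async def get_autocommit_session(self) -> AsyncIterator[AsyncSession]:
        """Provide a session that commits on success and rolls back on error."""
        session = self.get_session()
        try:
            yield session
            await session.commit()
        except Exception:
            await session.rollback()
            raise
        finally:
            await session.close()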

@Pijukatel
Collaborator

It would also be good to mention it in the docs and maybe show an example of its use.

Collaborator

@Pijukatel Pijukatel left a comment

I will continue with the review later. There are many ways to approach the RQ client implementation. I guess I have some different expectations in mind (I am not saying those are correct :D ). Maybe we should define the expectations first, so that I can base the review on them.

My initial expectations for the RQ client:

  • Can be used on the Apify platform and outside it as well
  • Supports any persistence
  • Supports parallel consumers/producers (the use case being speeding up crawlers on the Apify platform with multiprocessing to fully utilize the available resources -> for example, a Parsel-based actor could have multiple ParselCrawlers under the hood, all of them working on the same RQ, while reducing costs by avoiding the ApifyRQClient)

Most typical use cases:

  • Crawlee outside of the Apify platform
  • Crawlee on the Apify platform, but avoiding the expensive ApifyRQClient

Collaborator

@vdusek vdusek left a comment

Great job Max!

I haven't read the whole source code yet, but for the beginning...

Could you please share a bit more detail about the solution? For example, how it was tested, which tables it creates, how we store dataset records, KVS records, and request records, and any other relevant context...

Since the dependencies include only sqlalchemy and aiosqlite, am I right in assuming this storage client currently supports only SQLite? If so, it might make sense to name it SqliteStorageClient and also update the name of the extra. On the other hand, if it's intended to be a general SQL client, could you clarify what steps are needed to connect to a different database (e.g. Postgres)?

Also, the documentation is missing. As a first step, we should extend the existing guide (https://crawlee.dev/python/docs/guides/storage-clients) with a section describing this storage client. If the functionality turns out to be more complex (e.g. supporting multiple databases with configuration), we should create a dedicated guide as well.

Since this feature involves more complexity and potential edge cases, I'd suggest marking it with an experimental flag for the v1.0 release.

@Mantisus
Collaborator Author

@vdusek thanks. Sure.

Could you please share a bit more detail about the solution? For example, how it was tested, which tables it creates, how we store dataset records, KVS records, and request records, and any other relevant context...

For testing, I used the code provided in this comment: #1339 (comment)

I used BasicCrawler to create an intensive load on the storage, as there is no need to execute real requests.

At the same time, I think it exercises the storage quite diversely, using different storage mechanisms: direct writes to the KVS, updating a key in the KVS, writing to the Dataset, using RecoverableState with use_state, and running up to 3 parallel queues.

About the tables:

The metadata tables duplicate the Pydantic models, but they are described in accordance with SQLAlchemy. (A rough sketch of the dataset tables is shown after the list below.)

Dataset:

  • dataset_metadata - one record per Dataset
  • dataset_item - for Dataset records.
    • PK is order_id with auto-increment. This will allow us to retrieve records in the order they were recorded.
    • FK is metadata_id
    • data contains serialized JSON.

KVS:

  • kvs_metadata - one record per KVS
  • kvs_record - for KVS records
    • PK is metadata_id and key. This simultaneously ensures the uniqueness of the key for the KVS and provides an index for Select.
    • FK is metadata_id
    • value is stored as BigBinary
    • And 2 columns with record metadata - size and content_type. This allows us to avoid reading value from the database when executing iterate_keys.

RequestQueue: (this will probably still be updated, since I am making some optimizations after @Pijukatel's review).

  • request_queue_metadata - one record per queue
  • request - for queue records
    • PK is request_id (created deterministically from unique_key after the PR "refactor!: Remove Request.id field" #1366) and metadata_id. I use a BigInteger request_id because unique_key can be quite a long string, for example in POST requests.
    • FK is metadata_id
    • sequence_number - for sorting in the required order. For forefront requests, sequence_number is negative; for regular requests, it is positive. This simplifies the Select.
    • is_handled - bool
    • data - simply serialized JSON
  • request_queue_state - contains the counter data, one record per queue. Thanks, @Pijukatel, for this idea.
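
As mentioned above, a rough sketch of how the dataset tables can be declared with SQLAlchemy 2.0 typed ORM models (class names, column types, and lengths are assumptions based on the description, not the actual code):

from datetime import datetime

from sqlalchemy import ForeignKey, String, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class DatasetMetadataDB(Base):
    __tablename__ = 'dataset_metadata'

    # One record per Dataset.
    id: Mapped[str] = mapped_column(String(20), primary_key=True)
    name: Mapped[str | None] = mapped_column(String(100))
    item_count: Mapped[int] = mapped_column(default=0)
    accessed_at: Mapped[datetime]
    modified_at: Mapped[datetime]


class DatasetItemDB(Base):
    __tablename__ = 'dataset_item'

    # Auto-incrementing PK preserves insertion order for ordered retrieval.
    order_id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
    metadata_id: Mapped[str] = mapped_column(ForeignKey('dataset_metadata.id'), index=True)
    # The dataset item serialized as JSON.
    data: Mapped[str] = mapped_column(Text)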

As @Pijukatel correctly pointed out in his review, several independent clients cannot currently work with the queue. I am working on optimizations for this.

Any ideas for improvements and optimizations for tables are welcome. I also think this is a good time to discuss how we will support this in the future.

For example, will we write migrations in case of table changes?

@Mantisus
Collaborator Author

this storage client currently supports only SQLite?

SQLite is only used as the default database. However, the client supports SQLite, PostgreSQL, and MySQL. By using dialects, we can get more optimized queries for these three databases. Without dialects, I am not yet sure about the queue; this is not so critical for the Dataset and KVS.

To use a different database, you need to install the appropriate asynchronous driver supported by SQLAlchemy, for example asyncpg, and pass either the URI in connection_string for SQLStorageClient or a preconfigured AsyncEngine in engine.

For the default SQLite database, some optimization settings are applied in SQLStorageClient.
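
For example, a minimal sketch under those assumptions (the connection string is the one from the test snippet above):

# Requires an async driver supported by SQLAlchemy, e.g.: pip install asyncpg
from crawlee.storage_clients import SQLStorageClient

client = SQLStorageClient(
    connection_string='postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres',
)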

@Mantisus
Collaborator Author

Also the documentation is missing. As a first step, we should extend the existing guide

Yes, I put off working on the documentation. I assumed that there might be critical updates during the review process.

Successfully merging this pull request may close these issues:

  • Add support for SQLite storage client