Initial commit

rustyconover · rustyconover · commit dfcf102479e2 · 2025-07-06T23:20:19.000-04:00
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,26 @@
+name: CI
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+
+    steps:
+    - name: Checkout repository
+      uses: actions/checkout@v4
+
+    - name: Install the latest version of rye
+      uses: eifinger/setup-rye@v4
+
+    - name: Install project dependencies
+      run: rye sync
+
+    - name: Run tests
+      run: rye test
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,10 @@
+# python generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+
+# venv
+.venv
diff --git a/.python-version b/.python-version
@@ -0,0 +1 @@
+3.12.4
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,7 @@
+Copyright 2024-2025 Rusty Conover <rusty@query.farm> - https://query.farm
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,233 @@
+# Query Farm SQL Scan Planning
+
+A Python library for intelligent file filtering using SQL expressions and metadata-based scan planning. This library enables efficient data lake query optimization by determining which files need to be scanned based on their statistical metadata.
+
+## Overview
+
+Query Farm SQL Scan Planning provides predicate pushdown capabilities for file-based data storage systems. By maintaining metadata about file contents (min/max values, value sets, null presence), the library can quickly determine which files contain data that could satisfy a given SQL `WHERE` clause, so a query engine can skip the files that cannot match.
+
+## Features
+
+- **SQL Expression Parsing**: Parse and evaluate complex SQL `WHERE` clauses using [SQLGlot](https://github.com/tobymao/sqlglot)
+- **Metadata-Based Filtering**: Support for both range-based (min/max) and set-based field metadata
+- **Null Handling**: Comprehensive support for NULL value semantics in SQL expressions
+- **Complex Predicates**: Handle AND, OR, XOR, NOT, IN, BETWEEN, CASE statements, and more
+- **Multiple Data Types**: Support for integers, floats, strings, decimals, and NULL values
+- **Dialect Support**: Configurable SQL dialect support (default: DuckDB)
+
+## Installation
+
+```bash
+pip install query-farm-sql-scan-planning
+```
+
+Or using rye:
+
+```bash
+rye add query-farm-sql-scan-planning
+```
+
+## Quick Start
+
+```python
+from query_farm_sql_scan_planning import Planner, RangeFieldInfo, SetFieldInfo
+
+# Define file metadata
+files = [
+    (
+        "data_2023_q1.parquet",
+        {
+            "sales_amount": RangeFieldInfo[int](
+                min_value=100, max_value=50000,
+                has_nulls=False, has_non_nulls=True
+            ),
+            "region": SetFieldInfo[str](
+                values={"US", "CA", "MX"},
+                has_nulls=False, has_non_nulls=True
+            ),
+        }
+    ),
+    (
+        "data_2023_q2.parquet",
+        {
+            "sales_amount": RangeFieldInfo[int](
+                min_value=200, max_value=75000,
+                has_nulls=False, has_non_nulls=True
+            ),
+            "region": SetFieldInfo[str](
+                values={"US", "EU", "UK"},
+                has_nulls=False, has_non_nulls=True
+            ),
+        }
+    ),
+]
+
+# Create planner
+planner = Planner(files)
+
+# Filter files based on SQL expressions
+matching_files = planner.get_matching_files("sales_amount > 40000 AND region = 'US'")
+print(matching_files)  # {'data_2023_q1.parquet', 'data_2023_q2.parquet'}
+
+# More complex queries
+matching_files = planner.get_matching_files("region IN ('EU', 'UK')")
+print(matching_files)  # {'data_2023_q2.parquet'}
+```
+
+## Field Information Types
+
+### `RangeFieldInfo`
+
+For fields with known minimum and maximum values:
+
+```python
+RangeFieldInfo[int](
+    min_value=0,
+    max_value=100,
+    has_nulls=False,      # Whether the field contains NULL values
+    has_non_nulls=True    # Whether the field contains non-NULL values
+)
+```
+
+### `SetFieldInfo`
+
+For fields with a known set of possible values (useful for categorical data):
+
+```python
+SetFieldInfo[str](
+    values={"apple", "banana", "cherry"},
+    has_nulls=False,
+    has_non_nulls=True
+)
+```
+
+**Note**: `SetFieldInfo` can produce false positives - if a value is in the set, the file *might* contain it, but the file could contain additional values not in the set.
+
+## Supported SQL Operations
+
+### Comparison Operators
+- `=`, `!=`, `<>` (equality and inequality)
+- `<`, `<=`, `>`, `>=` (range comparisons)
+- `IS NULL`, `IS NOT NULL` (null checks)
+- `IS DISTINCT FROM`, `IS NOT DISTINCT FROM` (null-safe comparisons)
+
+### Logical Operators
+- `AND`, `OR`, `XOR` (logical connectors)
+- `NOT` (negation)
+
+### Set Operations
+- `IN`, `NOT IN` (membership tests)
+- `BETWEEN`, `NOT BETWEEN` (range tests)
+
+### Control Flow
+- `CASE WHEN ... THEN ... ELSE ... END` (conditional expressions)
+
+### Literals
+- Numeric literals: `123`, `45.67`
+- String literals: `'hello'`
+- Boolean literals: `TRUE`, `FALSE`
+- NULL literal: `NULL`
+
+## Examples
+
+### Range Queries
+```python
+# Files with sales between 1000 and 5000
+planner.get_matching_files("sales_amount BETWEEN 1000 AND 5000")
+
+# Files with any sales over 10000
+planner.get_matching_files("sales_amount > 10000")
+```
+
+### Set Membership
+```python
+# Files containing specific regions
+planner.get_matching_files("region IN ('US', 'CA')")
+
+# Files not containing specific regions
+planner.get_matching_files("region NOT IN ('UNKNOWN', 'TEST')")
+```
+
+### Complex Conditions
+```python
+# Combination of range and set conditions
+planner.get_matching_files(
+    "sales_amount > 5000 AND region IN ('US', 'EU') AND customer_id IS NOT NULL"
+)
+
+# Case expressions
+planner.get_matching_files(
+    "CASE WHEN region = 'US' THEN sales_amount > 1000 ELSE sales_amount > 500 END"
+)
+```
+
+### Null Handling
+```python
+# Files that might contain null values in sales_amount
+planner.get_matching_files("sales_amount IS NULL")
+
+# Files with non-null sales amounts over 1000
+planner.get_matching_files("sales_amount IS NOT NULL AND sales_amount > 1000")
+```
+
+## Performance Considerations
+
+- **Metadata Quality**: More accurate metadata (tighter ranges, complete value sets) leads to better filtering
+- **Expression Complexity**: Simple expressions evaluate faster than complex nested conditions
+- **False Positives**: The library errs on the side of including files that might match rather than risk excluding files that do match
+
+## Use Cases
+
+- **Data Lake Query Optimization**: Skip irrelevant files in distributed query engines
+- **ETL Pipeline Optimization**: Process only files containing relevant data
+- **Data Catalog Integration**: Enhance metadata catalogs with query planning capabilities
+- **Columnar Storage**: Optimize scans of Parquet, ORC, or similar formats
+
+## Development
+
+### Setup
+```bash
+git clone https://github.com/query-farm/python-sql-scan-planning.git
+cd python-sql-scan-planning
+rye sync
+```
+
+### Running Tests
+```bash
+rye run pytest
+```
+
+### Code Quality
+```bash
+rye run ruff check
+rye run pytest --mypy
+```
+
+## Dependencies
+
+- **sqlglot**: SQL parsing and AST manipulation
+- **Python 3.12+**: Required for modern type hints and pattern matching
+
+## Contributing
+
+1. Fork the repository
+2. Create a feature branch
+3. Add tests for new functionality
+4. Ensure all tests pass
+5. Submit a pull request
+
+## License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+## Related Projects
+
+- [SQLGlot](https://github.com/tobymao/sqlglot) - SQL parser and transpiler
+
+## Author
+
+This Python module was created by [Query.Farm](https://query.farm).
+
+# License
+
+MIT Licensed.
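
As a reading aid for the README above, the pruning idea it describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation (the `planner.py` diff is not rendered in this commit view). `ToyRange`, `ToySet`, `may_match_gt`, and `may_match_eq` are hypothetical names invented for this sketch; only the field names `min_value`, `max_value`, `values`, `has_nulls`, and `has_non_nulls` mirror the README. The sketch also assumes that a value absent from the set rules a file out, which is what the Quick Start output for `region IN ('EU', 'UK')` implies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRange:
    """Hypothetical stand-in for min/max column statistics."""
    min_value: int
    max_value: int
    has_nulls: bool
    has_non_nulls: bool


@dataclass(frozen=True)
class ToySet:
    """Hypothetical stand-in for a set of values a column may hold."""
    values: frozenset
    has_nulls: bool
    has_non_nulls: bool


def may_match_gt(info: ToyRange, threshold: int) -> bool:
    # `col > threshold` can only be TRUE for a non-NULL value, and only
    # if the largest value in the file exceeds the threshold.
    return info.has_non_nulls and info.max_value > threshold


def may_match_eq(info: ToySet, value: str) -> bool:
    # `col = value` can only be TRUE if the value is listed for the file.
    return info.has_non_nulls and value in info.values


# Metadata copied from the Quick Start example.
files = {
    "data_2023_q1.parquet": (
        ToyRange(100, 50000, False, True),
        ToySet(frozenset({"US", "CA", "MX"}), False, True),
    ),
    "data_2023_q2.parquet": (
        ToyRange(200, 75000, False, True),
        ToySet(frozenset({"US", "EU", "UK"}), False, True),
    ),
}

# sales_amount > 60000 AND region = 'US': a file survives only if
# both conjuncts may match.
survivors = {
    name
    for name, (sales, region) in files.items()
    if may_match_gt(sales, 60000) and may_match_eq(region, "US")
}
print(survivors)  # {'data_2023_q2.parquet'}
```

The first file is pruned because its maximum `sales_amount` (50000) cannot exceed 60000. A surviving file is only a candidate: the statistics do not guarantee that a single row satisfies both conditions at once, which is the false-positive behavior the README mentions.
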
diff --git a/pyproject.toml b/pyproject.toml
@@ -0,0 +1,45 @@
+[project]
+name = "query-farm-sql-scan-planning"
+version = "0.1.0"
+description = "Intelligent file filtering using SQL expressions and metadata-based scan planning"
+authors = [
+    { name = "Rusty Conover", email = "rusty@conover.me" }
+]
+dependencies = [
+    "sqlglot>=26.33.0",
+]
+requires-python = ">= 3.12"
+keywords = ["sql", "predicate pushdown", "sql scan planning", "scan planning"]
+classifiers = [
+    "Development Status :: 4 - Beta",
+    "Intended Audience :: Developers",
+    "Topic :: Database",
+    "Topic :: Database :: Database Engines/Servers",
+    "Programming Language :: Python :: 3.12"
+]
+
+
+
+[project.urls]
+Repository = "https://github.com/query-farm/python-sql-scan-planning.git"
+Issues = "https://github.com/query-farm/python-sql-scan-planning/issues"
+
+[build-system]
+requires = ["hatchling==1.26.3", "hatch-vcs"]
+build-backend = "hatchling.build"
+
+[tool.rye]
+managed = true
+dev-dependencies = [
+    "pytest>=8.3.2",
+    "pytest-mypy>=0.10.3",
+    "pytest-env>=1.1.3",
+    "pytest-cov>=5.0.0",
+    "ruff>=0.6.2",
+]
+
+[tool.hatch.metadata]
+allow-direct-references = true
+
+[tool.hatch.build.targets.wheel]
+packages = ["src/query_farm_sql_scan_planning"]
diff --git a/requirements-dev.lock b/requirements-dev.lock
@@ -0,0 +1,43 @@
+# generated by rye
+# use `rye lock` or `rye sync` to update this lockfile
+#
+# last locked with the following flags:
+#   pre: false
+#   features: []
+#   all-features: false
+#   with-sources: false
+#   generate-hashes: false
+#   universal: false
+
+-e file:.
+coverage==7.9.2
+    # via pytest-cov
+filelock==3.18.0
+    # via pytest-mypy
+iniconfig==2.1.0
+    # via pytest
+mypy==1.16.1
+    # via pytest-mypy
+mypy-extensions==1.1.0
+    # via mypy
+packaging==25.0
+    # via pytest
+pathspec==0.12.1
+    # via mypy
+pluggy==1.6.0
+    # via pytest
+    # via pytest-cov
+pygments==2.19.2
+    # via pytest
+pytest==8.4.1
+    # via pytest-cov
+    # via pytest-env
+    # via pytest-mypy
+pytest-cov==6.2.1
+pytest-env==1.1.5
+pytest-mypy==1.0.1
+ruff==0.12.2
+sqlglot==26.33.0
+    # via query-farm-sql-scan-planning
+typing-extensions==4.14.1
+    # via mypy
diff --git a/requirements.lock b/requirements.lock
@@ -0,0 +1,14 @@
+# generated by rye
+# use `rye lock` or `rye sync` to update this lockfile
+#
+# last locked with the following flags:
+#   pre: false
+#   features: []
+#   all-features: false
+#   with-sources: false
+#   generate-hashes: false
+#   universal: false
+
+-e file:.
+sqlglot==26.33.0
+    # via query-farm-sql-scan-planning
diff --git a/src/query_farm_sql_scan_planning/__init__.py b/src/query_farm_sql_scan_planning/__init__.py
@@ -0,0 +1,2 @@
+def hello() -> str:
+    return "Hello from query-farm-sql-scan-planning!"
diff --git a/src/query_farm_sql_scan_planning/planner.py b/src/query_farm_sql_scan_planning/planner.py
diff --git a/src/query_farm_sql_scan_planning/test_planner.py b/src/query_farm_sql_scan_planning/test_planner.py
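
The diffs for `planner.py` and `test_planner.py` are not rendered above, so the following is not the author's code. It is a hedged sketch of one way a planner could combine predicates with `AND`, `OR`, and `NOT` while staying conservative, as the README's "False Positives" note requires. The `Outcome` class, the `greater_than` helper, and the file names are all hypothetical. The key point it illustrates is that `NOT` cannot be handled by negating a "may match" flag: the planner also has to know whether the predicate may be FALSE for some row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical summary of what a predicate may evaluate to in a file."""

    may_be_true: bool   # some row might evaluate to TRUE
    may_be_false: bool  # some row might evaluate to FALSE (NULL is neither)

    def __and__(self, other: "Outcome") -> "Outcome":
        # TRUE needs both sides TRUE; FALSE needs either side FALSE.
        return Outcome(
            self.may_be_true and other.may_be_true,
            self.may_be_false or other.may_be_false,
        )

    def __or__(self, other: "Outcome") -> "Outcome":
        # TRUE needs either side TRUE; FALSE needs both sides FALSE.
        return Outcome(
            self.may_be_true or other.may_be_true,
            self.may_be_false and other.may_be_false,
        )

    def __invert__(self) -> "Outcome":
        # NOT swaps TRUE and FALSE; NULL rows stay NULL and match nothing.
        return Outcome(self.may_be_false, self.may_be_true)


def greater_than(min_value: int, max_value: int, threshold: int,
                 has_non_nulls: bool = True) -> Outcome:
    """Outcome of `col > threshold` given min/max statistics."""
    return Outcome(
        may_be_true=has_non_nulls and max_value > threshold,
        may_be_false=has_non_nulls and min_value <= threshold,
    )


# Hypothetical files described by (min, max) of sales_amount.
files = {
    "wide.parquet": (100, 50000),    # rows on both sides of 40000
    "high.parquet": (45000, 60000),  # every row is above 40000
}

# NOT (sales_amount > 40000) AND sales_amount > 50
kept = {
    name
    for name, (lo, hi) in files.items()
    if (~greater_than(lo, hi, 40000) & greater_than(lo, hi, 50)).may_be_true
}
print(kept)  # {'wide.parquet'}
```

Negating a plain boolean "may match" flag would have dropped `wide.parquet`, because `sales_amount > 40000` may match it, even though it also holds rows at or below 40000. Tracking both possibilities keeps that file and still prunes `high.parquet`.
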
