Commit dfcf102

Initial commit
0 parents  commit dfcf102

11 files changed

+1049
-0
lines changed

.github/workflows/ci.yml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
name: CI

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Install the latest version of rye
        uses: eifinger/setup-rye@v4

      - name: Install project dependencies
        run: rye sync

      - name: Run tests
        run: rye test

.gitignore

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# python generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# venv
.venv

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
3.12.4

LICENSE

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
Copyright 2024-2025 Rusty Conover <rusty@query.farm> - https://query.farm

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md

Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
# Query Farm SQL Scan Planning

A Python library for intelligent file filtering using SQL expressions and metadata-based scan planning. This library enables efficient data lake query optimization by determining which files need to be scanned based on their statistical metadata.

## Overview

Query Farm SQL Scan Planning provides predicate pushdown capabilities for file-based data storage systems. By maintaining metadata about file contents (min/max values, value sets, null presence), the library can quickly determine which files contain data that could satisfy a given SQL `WHERE` clause, so a query engine can skip the remaining files and scan less data.
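
To make the min/max idea concrete, here is a minimal sketch in plain Python. It does not use this library; `ColumnRange`, `may_contain_greater_than`, and `may_contain_equal` are hypothetical names invented for illustration, and the real implementation may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRange:
    """Hypothetical per-file statistics for one column (not the library's class)."""

    min_value: int
    max_value: int


def may_contain_greater_than(stats: ColumnRange, threshold: int) -> bool:
    # A row with `column > threshold` can exist only if the file's maximum exceeds it.
    return stats.max_value > threshold


def may_contain_equal(stats: ColumnRange, value: int) -> bool:
    # Equality is possible only when the value lies inside [min, max].
    return stats.min_value <= value <= stats.max_value


# Hypothetical file names and statistics.
file_stats = {
    "a.parquet": ColumnRange(min_value=100, max_value=50_000),
    "b.parquet": ColumnRange(min_value=200, max_value=75_000),
}

# Keep only the files that could hold a row with a value above 60000.
candidates = sorted(
    name
    for name, stats in file_stats.items()
    if may_contain_greater_than(stats, 60_000)
)
print(candidates)  # ['b.parquet']
```

The check never reads file data: it only compares the predicate's constant with the stored bounds, which is why it is cheap enough to run over many files.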

## Features

- **SQL Expression Parsing**: Parse and evaluate complex SQL `WHERE` clauses using [SQLGlot](https://github.com/tobymao/sqlglot)
- **Metadata-Based Filtering**: Support for both range-based (min/max) and set-based field metadata
- **Null Handling**: Comprehensive support for NULL value semantics in SQL expressions
- **Complex Predicates**: Handle AND, OR, XOR, NOT, IN, BETWEEN, CASE statements, and more
- **Multiple Data Types**: Support for integers, floats, strings, decimals, and NULL values
- **Dialect Support**: Configurable SQL dialect support (default: DuckDB)

## Installation

```bash
pip install query-farm-sql-scan-planning
```

Or using rye:

```bash
rye add query-farm-sql-scan-planning
```

## Quick Start

```python
from query_farm_sql_scan_planning import Planner, RangeFieldInfo, SetFieldInfo

# Define file metadata
files = [
    (
        "data_2023_q1.parquet",
        {
            "sales_amount": RangeFieldInfo[int](
                min_value=100, max_value=50000,
                has_nulls=False, has_non_nulls=True
            ),
            "region": SetFieldInfo[str](
                values={"US", "CA", "MX"},
                has_nulls=False, has_non_nulls=True
            ),
        }
    ),
    (
        "data_2023_q2.parquet",
        {
            "sales_amount": RangeFieldInfo[int](
                min_value=200, max_value=75000,
                has_nulls=False, has_non_nulls=True
            ),
            "region": SetFieldInfo[str](
                values={"US", "EU", "UK"},
                has_nulls=False, has_non_nulls=True
            ),
        }
    ),
]

# Create planner
planner = Planner(files)

# Filter files based on SQL expressions
matching_files = planner.get_matching_files("sales_amount > 40000 AND region = 'US'")
print(matching_files)  # {'data_2023_q1.parquet', 'data_2023_q2.parquet'}

# More complex queries
matching_files = planner.get_matching_files("region IN ('EU', 'UK')")
print(matching_files)  # {'data_2023_q2.parquet'}
```

## Field Information Types

### `RangeFieldInfo`

For fields with known minimum and maximum values:

```python
RangeFieldInfo[int](
    min_value=0,
    max_value=100,
    has_nulls=False,    # Whether the field contains NULL values
    has_non_nulls=True  # Whether the field contains non-NULL values
)
```

### `SetFieldInfo`

For fields with a known set of possible values (useful for categorical data):

```python
SetFieldInfo[str](
    values={"apple", "banana", "cherry"},
    has_nulls=False,
    has_non_nulls=True
)
```

**Note**: `SetFieldInfo` can produce false positives - if a value is in the set, the file *might* contain it, but the file could contain additional values not in the set.
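
As a rough illustration of set-based filtering, the sketch below uses plain Python sets rather than this library. It assumes the recorded set covers every value the file may hold, which is an assumption made for the example and not a guarantee documented here; `may_match_in` and `may_match_not_in` are hypothetical helper names.

```python
def may_match_in(file_values: set[str], wanted: set[str]) -> bool:
    # `column IN (...)` can match only if the two sets share at least one value.
    return not file_values.isdisjoint(wanted)


def may_match_not_in(file_values: set[str], excluded: set[str]) -> bool:
    # `column NOT IN (...)` can match only if some recorded value is outside the list.
    return bool(file_values - excluded)


# Value sets taken from the Quick Start metadata.
q1_regions = {"US", "CA", "MX"}
q2_regions = {"US", "EU", "UK"}

print(may_match_in(q1_regions, {"EU", "UK"}))  # False
print(may_match_in(q2_regions, {"EU", "UK"}))  # True
print(may_match_not_in(q1_regions, {"US", "CA", "MX"}))  # False
```

A `True` result only means the file has to be scanned; it does not prove that a matching row exists.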

## Supported SQL Operations

### Comparison Operators
- `=`, `!=`, `<>` (equality and inequality)
- `<`, `<=`, `>`, `>=` (range comparisons)
- `IS NULL`, `IS NOT NULL` (null checks)
- `IS DISTINCT FROM`, `IS NOT DISTINCT FROM` (null-safe comparisons)

### Logical Operators
- `AND`, `OR`, `XOR` (logical connectors)
- `NOT` (negation)

### Set Operations
- `IN`, `NOT IN` (membership tests)
- `BETWEEN`, `NOT BETWEEN` (range tests)

### Control Flow
- `CASE WHEN ... THEN ... ELSE ... END` (conditional expressions)

### Literals
- Numeric literals: `123`, `45.67`
- String literals: `'hello'`
- Boolean literals: `TRUE`, `FALSE`
- NULL literal: `NULL`

## Examples

### Range Queries
```python
# Files with sales between 1000 and 5000
planner.get_matching_files("sales_amount BETWEEN 1000 AND 5000")

# Files with any sales over 10000
planner.get_matching_files("sales_amount > 10000")
```
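
A `BETWEEN` test against min/max metadata reduces to an interval-overlap check. The sketch below shows that reasoning in plain Python; the function names are hypothetical and this is not necessarily how the library evaluates the predicate.

```python
def between_may_match(min_value: int, max_value: int, low: int, high: int) -> bool:
    # [min_value, max_value] and [low, high] overlap when each starts before the other ends.
    return min_value <= high and max_value >= low


def not_between_may_match(min_value: int, max_value: int, low: int, high: int) -> bool:
    # Some value can fall outside [low, high] only if the file's range sticks out on either side.
    return min_value < low or max_value > high


# File range 100..50000 against BETWEEN 1000 AND 5000: the ranges overlap.
print(between_may_match(100, 50_000, 1_000, 5_000))  # True
# File range 6000..9000 lies entirely above the requested interval.
print(between_may_match(6_000, 9_000, 1_000, 5_000))  # False
# File range 2000..3000 lies entirely inside, so NOT BETWEEN cannot match.
print(not_between_may_match(2_000, 3_000, 1_000, 5_000))  # False
```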

### Set Membership
```python
# Files containing specific regions
planner.get_matching_files("region IN ('US', 'CA')")

# Files not containing specific regions
planner.get_matching_files("region NOT IN ('UNKNOWN', 'TEST')")
```

### Complex Conditions
```python
# Combination of range and set conditions
planner.get_matching_files(
    "sales_amount > 5000 AND region IN ('US', 'EU') AND customer_id IS NOT NULL"
)

# Case expressions
planner.get_matching_files(
    "CASE WHEN region = 'US' THEN sales_amount > 1000 ELSE sales_amount > 500 END"
)
```

### Null Handling
```python
# Files that might contain null values in sales_amount
planner.get_matching_files("sales_amount IS NULL")

# Files with non-null sales amounts over 1000
planner.get_matching_files("sales_amount IS NOT NULL AND sales_amount > 1000")
```
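
The `has_nulls` and `has_non_nulls` flags are enough to answer null checks without looking at values. The following sketch shows one plausible way to use them; `NullFlags` and the helper functions are hypothetical stand-ins, not the library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NullFlags:
    """Hypothetical stand-in for the null-related fields of the metadata classes."""

    has_nulls: bool
    has_non_nulls: bool


def is_null_may_match(flags: NullFlags) -> bool:
    # `column IS NULL` needs at least one NULL in the file.
    return flags.has_nulls


def is_not_null_may_match(flags: NullFlags) -> bool:
    # `column IS NOT NULL` needs at least one non-NULL value.
    return flags.has_non_nulls


def comparison_may_match(flags: NullFlags, range_overlaps: bool) -> bool:
    # In SQL a comparison with NULL is never TRUE, so an all-NULL column
    # cannot satisfy `column > x` whatever the stored bounds say.
    return flags.has_non_nulls and range_overlaps


all_null = NullFlags(has_nulls=True, has_non_nulls=False)
no_null = NullFlags(has_nulls=False, has_non_nulls=True)

print(is_null_may_match(all_null))  # True
print(is_null_may_match(no_null))  # False
print(comparison_may_match(all_null, range_overlaps=True))  # False
```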

## Performance Considerations

- **Metadata Quality**: More accurate metadata (tighter ranges, complete value sets) leads to better filtering
- **Expression Complexity**: Simple expressions evaluate faster than complex nested conditions
- **False Positives**: The library errs on the side of including files that might match rather than risk excluding files that do match
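
The false-positive behavior follows from the fact that per-column statistics say nothing about which values occur together in a row. The sketch below, written in plain Python with the hypothetical names `Range` and `gt`, shows a file being kept for an `AND` predicate even though the metadata cannot confirm that any single row satisfies both sides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Hypothetical min/max statistics for one column of one file."""

    min_value: int
    max_value: int


def gt(stats: Range, threshold: int) -> bool:
    # `column > threshold` is possible only if the maximum exceeds the threshold.
    return stats.max_value > threshold


# One file with two columns, both ranging over 0..100.
file_columns = {"a": Range(0, 100), "b": Range(0, 100)}

# a > 90 AND b > 90: each side is possible on its own, so the file is kept.
keep_and = gt(file_columns["a"], 90) and gt(file_columns["b"], 90)

# a > 200 OR b > 200: neither side is possible, so the file can be skipped.
keep_or = gt(file_columns["a"], 200) or gt(file_columns["b"], 200)

print(keep_and)  # True
print(keep_or)  # False
```

Keeping the file in the `AND` case is the safe choice: scanning an unneeded file costs time, while skipping a needed one would return wrong results.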

## Use Cases

- **Data Lake Query Optimization**: Skip irrelevant files in distributed query engines
- **ETL Pipeline Optimization**: Process only files containing relevant data
- **Data Catalog Integration**: Enhance metadata catalogs with query planning capabilities
- **Columnar Storage**: Optimize scans of Parquet, ORC, or similar formats

## Development

### Setup
```bash
git clone https://github.com/query-farm/python-sql-scan-planning.git
cd python-sql-scan-planning
rye sync
```

### Running Tests
```bash
rye run pytest
```

### Code Quality
```bash
rye run ruff check
rye run pytest --mypy
```

## Dependencies

- **sqlglot**: SQL parsing and AST manipulation
- **Python 3.12+**: Required for modern type hints and pattern matching

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Related Projects

- [SQLGlot](https://github.com/tobymao/sqlglot) - SQL parser and transpiler

## Author

This Python module was created by [Query.Farm](https://query.farm).

# License

MIT Licensed.

pyproject.toml

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
[project]
name = "query-farm-sql-scan-planning"
version = "0.1.0"
description = "Add your description here"
authors = [
    { name = "Rusty Conover", email = "rusty@conover.me" }
]
dependencies = [
    "sqlglot>=26.33.0",
]
requires-python = ">= 3.12"
keywords = ["sql", "predicate pushdown", "sql scan planning", "scan planning"]
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "Topic :: Database",
    "Topic :: Database :: Database Engines/Servers",
    "Programming Language :: Python :: 3.12"
]



[project.urls]
Repository = "https://github.com/query-farm/python-sql-scan-planning.git"
Issues = "https://github.com/query-farm/python-sql-scan-planning/issues"

[build-system]
requires = ["hatchling==1.26.3", "hatch-vcs"]
build-backend = "hatchling.build"

[tool.rye]
managed = true
dev-dependencies = [
    "pytest>=8.3.2",
    "pytest-mypy>=0.10.3",
    "pytest-env>=1.1.3",
    "pytest-cov>=5.0.0",
    "ruff>=0.6.2",
]

[tool.hatch.metadata]
allow-direct-references = true

[tool.hatch.build.targets.wheel]
packages = ["src/query_farm_sql_scan_planning"]

requirements-dev.lock

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# generated by rye
# use `rye lock` or `rye sync` to update this lockfile
#
# last locked with the following flags:
#   pre: false
#   features: []
#   all-features: false
#   with-sources: false
#   generate-hashes: false
#   universal: false

-e file:.
coverage==7.9.2
    # via pytest-cov
filelock==3.18.0
    # via pytest-mypy
iniconfig==2.1.0
    # via pytest
mypy==1.16.1
    # via pytest-mypy
mypy-extensions==1.1.0
    # via mypy
packaging==25.0
    # via pytest
pathspec==0.12.1
    # via mypy
pluggy==1.6.0
    # via pytest
    # via pytest-cov
pygments==2.19.2
    # via pytest
pytest==8.4.1
    # via pytest-cov
    # via pytest-env
    # via pytest-mypy
pytest-cov==6.2.1
pytest-env==1.1.5
pytest-mypy==1.0.1
ruff==0.12.2
sqlglot==26.33.0
    # via query-farm-sql-scan-planning
typing-extensions==4.14.1
    # via mypy

requirements.lock

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# generated by rye
# use `rye lock` or `rye sync` to update this lockfile
#
# last locked with the following flags:
#   pre: false
#   features: []
#   all-features: false
#   with-sources: false
#   generate-hashes: false
#   universal: false

-e file:.
sqlglot==26.33.0
    # via query-farm-sql-scan-planning
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
def hello() -> str:
    return "Hello from query-farm-sql-scan-planning!"
