
Commit 7b9ba30

Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update
2 parents: 02f3127 + 0c8bb74

File tree: 320 files changed (+115,071 / −514 lines)


.claude/settings.local.json

Lines changed: 28 additions & 0 deletions (new file)

```json
{
  "permissions": {
    "allow": [
      "Bash(cd:*)",
      "Bash(python3:*)",
      "Bash(python:*)",
      "Bash(grep:*)",
      "Bash(mkdir:*)",
      "Bash(cp:*)",
      "Bash(rm:*)",
      "Bash(true)",
      "Bash(./package-extension.sh:*)",
      "Bash(find:*)",
      "Bash(chmod:*)",
      "Bash(rg:*)",
      "Bash(/Users/unclecode/.npm-global/lib/node_modules/@anthropic-ai/claude-code/vendor/ripgrep/arm64-darwin/rg -A 5 -B 5 \"Script Builder\" docs/md_v2/apps/crawl4ai-assistant/)",
      "Bash(/Users/unclecode/.npm-global/lib/node_modules/@anthropic-ai/claude-code/vendor/ripgrep/arm64-darwin/rg -A 30 \"generateCode\\(events, format\\)\" docs/md_v2/apps/crawl4ai-assistant/content/content.js)",
      "Bash(/Users/unclecode/.npm-global/lib/node_modules/@anthropic-ai/claude-code/vendor/ripgrep/arm64-darwin/rg \"<style>\" docs/md_v2/apps/crawl4ai-assistant/index.html -A 5)",
      "Bash(git checkout:*)",
      "Bash(docker logs:*)",
      "Bash(curl:*)",
      "Bash(docker compose:*)",
      "Bash(./test-final-integration.sh:*)",
      "Bash(mv:*)"
    ]
  },
  "enableAllProjectMcpServers": false
}
```

.gitignore

Lines changed: 6 additions & 0 deletions

```diff
@@ -1,3 +1,6 @@
+# Scripts folder (private tools)
+.scripts/
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
@@ -265,3 +268,6 @@ tests/**/benchmark_reports
 
 docs/**/data
 .codecat/
+
+docs/apps/linkdin/debug*/
+docs/apps/linkdin/samples/insights/*
```

CHANGELOG.md

Lines changed: 50 additions & 0 deletions

```diff
@@ -5,6 +5,56 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.7.x] - 2025-06-29
+
+### Added
+- **Virtual Scroll Support**: New `VirtualScrollConfig` for handling virtualized scrolling on modern websites
+  - Automatically detects and handles three scrolling scenarios:
+    - Content unchanged (continue scrolling)
+    - Content appended (traditional infinite scroll)
+    - Content replaced (true virtual scroll - Twitter/Instagram style)
+  - Captures ALL content from pages that replace DOM elements during scroll
+  - Intelligent deduplication based on normalized text content
+  - Configurable scroll amount, count, and wait times
+  - Seamless integration with existing extraction strategies
+  - Comprehensive examples including Twitter timeline, Instagram grid, and mixed content scenarios
+
```
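The "intelligent deduplication based on normalized text content" entry can be illustrated with a minimal pure-Python sketch: chunks whose whitespace-collapsed, case-folded text has already been seen are dropped, keeping the first occurrence. This is an illustration of the idea only, not Crawl4AI's actual implementation, and the function names here are hypothetical.

```python
def normalize_text(text: str) -> str:
    """Collapse runs of whitespace and case-fold so cosmetic differences don't defeat dedup."""
    return " ".join(text.split()).lower()


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized chunk, preserving original order."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        key = normalize_text(chunk)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

During virtual scrolling, where the page replaces DOM elements, capturing after each scroll step yields overlapping snapshots; a pass like this merges them into one deduplicated content stream.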
```diff
+## [Unreleased]
+
+### Added
+- **AsyncUrlSeeder**: High-performance URL discovery system for intelligent crawling at scale
+  - Discover URLs from sitemaps and Common Crawl index
+  - Extract and analyze page metadata without full crawling
+  - BM25 relevance scoring for query-based URL filtering
+  - Multi-domain parallel discovery with `many_urls()` method
+  - Automatic caching with TTL for discovered URLs
+  - Rate limiting and concurrent request management
+  - Live URL validation with HEAD requests
+  - JSON-LD and Open Graph metadata extraction
+- **SeedingConfig**: Configuration class for URL seeding operations
+  - Support for multiple discovery sources (`sitemap`, `cc`, `sitemap+cc`)
+  - Pattern-based URL filtering with wildcards
+  - Configurable concurrency and rate limiting
+  - Query-based relevance scoring with BM25
+  - Score threshold filtering for quality control
+- Comprehensive documentation for URL seeding feature
+  - Detailed comparison with deep crawling approaches
+  - Complete API reference with examples
+  - Integration guide with AsyncWebCrawler
+  - Performance benchmarks and best practices
+- Example scripts demonstrating URL seeding:
+  - `url_seeder_demo.py`: Interactive Rich-based demonstration
+  - `url_seeder_quick_demo.py`: Screenshot-friendly examples
+- Test suite for URL seeding with BM25 scoring
+
```
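The `SeedingConfig` entries above (wildcard URL patterns plus a BM25 score threshold) amount to a two-stage filter over discovered URLs. A rough stdlib-only sketch of that pipeline, assuming seeder results are dicts carrying `url` and `relevance_score` fields (hypothetical field names; the real BM25 scoring itself is omitted):

```python
from fnmatch import fnmatch


def filter_candidates(results: list[dict], pattern: str, score_threshold: float) -> list[str]:
    """Keep URLs that match the wildcard pattern AND clear the relevance threshold."""
    return [
        r["url"]
        for r in results
        if fnmatch(r["url"], pattern) and r["relevance_score"] >= score_threshold
    ]


# Placeholder data standing in for seeder output.
candidates = [
    {"url": "https://example.com/blog/virtual-scroll", "relevance_score": 0.72},
    {"url": "https://example.com/about", "relevance_score": 0.90},
    {"url": "https://example.com/blog/old-news", "relevance_score": 0.05},
]

# "/about" fails the pattern; "old-news" fails the threshold.
selected = filter_candidates(candidates, "*/blog/*", score_threshold=0.3)
```

The point of filtering at discovery time is that low-value URLs are discarded before any page is fetched, which is what distinguishes seeding from deep crawling.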
```diff
+### Changed
+- Updated `__init__.py` to export AsyncUrlSeeder and SeedingConfig
+- Enhanced documentation with URL seeding integration examples
+
+### Fixed
+- Corrected examples to properly extract URLs from seeder results before passing to `arun_many()`
+- Fixed logger color compatibility issue (changed `lightblack` to `bright_black`)
+
 ## [0.6.2] - 2025-05-02
 
 ### Added
```
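The first "Fixed" entry above describes an easy mistake: the seeder returns result records with metadata rather than bare URL strings, so callers must pull out the URL field before handing the list to `arun_many()`. A minimal sketch, with hypothetical field names and placeholder data:

```python
# Seeder-style results: records with metadata, not plain strings (field names assumed).
seeder_results = [
    {"url": "https://example.com/blog/a", "status": "valid", "relevance_score": 0.82},
    {"url": "https://example.com/blog/b", "status": "valid", "relevance_score": 0.41},
]

# Extract plain URL strings before crawling.
urls = [r["url"] for r in seeder_results]

# Then, inside an async context with an AsyncWebCrawler instance:
#     results = await crawler.arun_many(urls=urls, config=run_config)
```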

Dockerfile

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,7 +1,7 @@
 FROM python:3.12-slim-bookworm AS build
 
 # C4ai version
-ARG C4AI_VER=0.6.0
+ARG C4AI_VER=0.7.0-r1
 ENV C4AI_VERSION=$C4AI_VER
 LABEL c4ai.version=$C4AI_VER
 
```