Releases: unclecode/crawl4ai
v0.7.1: Update
🛠️ Crawl4AI v0.7.1: Minor Cleanup Update
July 17, 2025 • 2 min read
A small maintenance release that removes unused code and improves documentation.
🎯 What's Changed
- Removed unused StealthConfig from crawl4ai/browser_manager.py
- Updated documentation with better examples and parameter explanations
- Fixed virtual scroll configuration examples in docs
🧹 Code Cleanup
Removed the unused StealthConfig import and configuration, which were not referenced anywhere in the codebase. The project uses its own custom stealth implementation through JavaScript injection instead.
# Removed unused code:
from playwright_stealth import StealthConfig
stealth_config = StealthConfig(...) # This was never used
📖 Documentation Updates
- Fixed adaptive crawling parameter examples
- Updated session management documentation
- Corrected virtual scroll configuration examples
🚀 Installation
pip install crawl4ai==0.7.1
No breaking changes - upgrade directly from v0.7.0.
Questions? Issues?
- GitHub: github.com/unclecode/crawl4ai
- Discord: discord.gg/crawl4ai
v0.7.0: The Adaptive Intelligence Update
🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update
January 28, 2025 • 10 min read
Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.
🎯 What's New at a Glance
- Adaptive Crawling: Your crawler now learns and adapts to website patterns
- Virtual Scroll Support: Complete content extraction from infinite scroll pages
- Link Preview with 3-Layer Scoring: Intelligent link analysis and prioritization
- Async URL Seeder: Discover thousands of URLs in seconds with intelligent filtering
- PDF Parsing: Extract data from PDF documents
- Performance Optimizations: Significant speed and memory improvements
🧠 Adaptive Crawling: Intelligence Through Pattern Learning
The Problem: Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.
My Solution: I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.
Technical Deep-Dive
The Adaptive Crawler maintains a persistent state for each domain, tracking:
- Pattern success rates
- Selector stability over time
- Content structure variations
- Extraction confidence scores
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, AdaptiveCrawler, AdaptiveConfig, CrawlState

# Initialize with custom learning parameters
config = AdaptiveConfig(
    confidence_threshold=0.7,   # Min confidence to use learned patterns
    max_history=100,            # Remember last 100 crawls per domain
    learning_rate=0.2,          # How quickly to adapt to changes
    patterns_per_page=3,        # Patterns to learn per page type
    extraction_strategy='css'   # 'css' or 'xpath'
)

adaptive_crawler = AdaptiveCrawler(config)

# First crawl - crawler learns the structure
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://news.example.com/article/12345",
        config=CrawlerRunConfig(
            adaptive_config=config,
            extraction_hints={  # Optional hints to speed up learning
                "title": "article h1",
                "content": "article .body-content"
            }
        )
    )

    # Crawler identifies and stores patterns
    if result.success:
        state = adaptive_crawler.get_state("news.example.com")
        print(f"Learned {len(state.patterns)} patterns")
        print(f"Confidence: {state.avg_confidence:.2%}")

    # Subsequent crawls - uses learned patterns
    result2 = await crawler.arun(
        "https://news.example.com/article/67890",
        config=CrawlerRunConfig(adaptive_config=config)
    )
    # Automatically extracts using learned patterns!
Expected Real-World Impact:
- News Aggregation: Maintain 95%+ extraction accuracy even as news sites update their templates
- E-commerce Monitoring: Track product changes across hundreds of stores without constant maintenance
- Research Data Collection: Build robust academic datasets that survive website redesigns
- Reduced Maintenance: Cut selector update time by 80% for frequently-changing sites
🌊 Virtual Scroll: Complete Content Capture
The Problem: Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.
My Solution: I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.
Implementation Details
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig
from crawl4ai import JsonCssExtractionStrategy

# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
    container_selector="[data-testid='primaryColumn']",
    scroll_count=20,                # Number of scrolls
    scroll_by="container_height",   # Smart scrolling by container size
    wait_after_scroll=1.0,          # Let content load
    capture_method="incremental",   # Capture new content on each scroll
    deduplicate=True                # Remove duplicate elements
)

# For e-commerce product grids (Instagram style)
grid_config = VirtualScrollConfig(
    container_selector="main .product-grid",
    scroll_count=30,
    scroll_by=800,                  # Fixed pixel scrolling
    wait_after_scroll=1.5,          # Images need time
    stop_on_no_change=True          # Smart stopping
)

# For news feeds with lazy loading
news_config = VirtualScrollConfig(
    container_selector=".article-feed",
    scroll_count=50,
    scroll_by="page_height",        # Viewport-based scrolling
    wait_after_scroll=0.5,
    wait_for_selector=".article-card",  # Wait for specific elements
    timeout=30000                   # Max 30 seconds total
)

# Use it in your crawl
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://twitter.com/trending",
        config=CrawlerRunConfig(
            virtual_scroll_config=twitter_config,
            # Combine with other features
            extraction_strategy=JsonCssExtractionStrategy({
                "tweets": {
                    "selector": "[data-testid='tweet']",
                    "fields": {
                        "text": {"selector": "[data-testid='tweetText']", "type": "text"},
                        "likes": {"selector": "[data-testid='like']", "type": "text"}
                    }
                }
            })
        )
    )

print(f"Captured {len(result.extracted_content['tweets'])} tweets")
Key Capabilities:
- DOM Recycling Awareness: Detects and handles virtual DOM element recycling
- Smart Scroll Physics: Three modes - container height, page height, or fixed pixels
- Content Preservation: Captures content before it's destroyed
- Intelligent Stopping: Stops when no new content appears
- Memory Efficient: Streams content instead of holding everything in memory
Expected Real-World Impact:
- Social Media Analysis: Capture entire Twitter threads with hundreds of replies, not just top 10
- E-commerce Scraping: Extract 500+ products from infinite scroll catalogs vs. 20-50 with traditional methods
- News Aggregation: Get all articles from modern news sites, not just above-the-fold content
- Research Applications: Complete data extraction from academic databases using virtual pagination
🔗 Link Preview: Intelligent Link Analysis and Scoring
The Problem: You crawl a page and get 200 links. Which ones matter? Which lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.
My Solution: I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.
The Three-Layer Scoring System
from crawl4ai import CrawlerRunConfig, LinkPreviewConfig

# Configure intelligent link analysis
link_config = LinkPreviewConfig(
    # What to analyze
    include_internal=True,
    include_external=True,
    max_links=100,               # Analyze top 100 links

    # Relevance scoring
    query="machine learning tutorials",  # Your interest
    score_threshold=0.3,         # Minimum relevance score

    # Performance
    concurrent_requests=10,      # Parallel processing
    timeout_per_link=5000,       # 5s per link

    # Advanced scoring weights
    scoring_weights={
        "intrinsic": 0.3,        # Link quality indicators
        "contextual": 0.5,       # Relevance to query
        "popularity": 0.2        # Link prominence
    }
)

# Use in your crawl (continues the crawler session from the earlier examples)
result = await crawler.arun(
    "https://tech-blog.example.com",
    config=CrawlerRunConfig(
        link_preview_config=link_config,
        score_links=True
    )
)

# Access scored and sorted links
for link in result.links["internal"][:10]:  # Top 10 internal links
    print(f"Score: {link['total_score']:.3f}")
    print(f"  Intrinsic: {link['intrinsic_score']:.1f}/10")   # Position, attributes
    print(f"  Contextual: {link['contextual_score']:.1f}/1")  # Relevance to query
    print(f"  URL: {link['href']}")
    print(f"  Title: {link['head_data']['title']}")
    print(f"  Description: {link['head_data']['meta']['description'][:100]}...")
Scoring Components:
- Intrinsic Score (0-10): Based on link quality indicators
  - Position on page (navigation, content, footer)
  - Link attributes (rel, title, class names)
  - Anchor text quality and length
  - URL structure and depth
- Contextual Score (0-1): Relevance to your query
  - Semantic similarity using embeddings
  - Keyword matching in link text and title
  - Meta description analysis
  - Content preview scoring
- Total Score: Weighted combination for final ranking
Expected Real-World Impact:
- Research Efficiency: Find relevant papers 10x faster by following only high-score links
- Competitive Analysis: Automatically identify important pages on competitor sites
- Content Discovery: Build topic-focused crawlers that stay on track
- SEO Audits: Identify and prioritize high-value internal linking opportunities
🎣 Async URL Seeder: Automated URL Discovery at Scale
The Problem: You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.
My Solution: I built Async URL Seeder—a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.
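For orientation, here is a minimal usage sketch. The AsyncUrlSeeder / SeedingConfig names and parameters (source, extract_head, query, scoring_method, score_threshold, max_urls) are taken from the current project docs rather than from this post, so treat them as assumptions:

import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def discover():
    # Discover candidate URLs from sitemaps and Common Crawl, then keep
    # only those relevant to the query (interface assumed from the docs)
    seeder = AsyncUrlSeeder()
    config = SeedingConfig(
        source="sitemap+cc",        # combine sitemap and Common Crawl discovery
        extract_head=True,          # fetch <head> metadata for relevance scoring
        query="machine learning tutorials",
        scoring_method="bm25",
        score_threshold=0.3,        # drop low-relevance URLs
        max_urls=500
    )
    urls = await seeder.urls("example.com", config)
    for item in urls[:10]:
        print(item["url"], item.get("relevance_score"))

asyncio.run(discover())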
Technical Architecture
...
v0.6.3
Release 0.6.3 (unreleased)
Features
- extraction: add RegexExtractionStrategy for pattern-based extraction, including built-in patterns for emails, URLs, phones, dates, support for custom regexes, an LLM-assisted pattern generator, optimized HTML preprocessing via fit_html, and enhanced network response body capture (9b5ccac). A usage sketch follows this list.
- docker-api: introduce job-based polling endpoints—POST /crawl/job & GET /crawl/job/{task_id} for crawls, POST /llm/job & GET /llm/job/{task_id} for LLM tasks—backed by Redis task management with configurable TTL, moved schemas to schemas.py, and added demo_docker_polling.py example (94e9959)
- browser: improve profile management and cleanup—add process cleanup for existing Chromium instances on Windows/Unix, fix profile creation by passing full browser config, ship detailed browser/CLI docs and initial profile-creation test, bump version to 0.6.3 (9499164)
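As referenced above, a usage sketch for the new strategy. The built-in pattern flags (Email, Url) and the JSON shape of extracted_content follow the current docs and are assumptions, not part of this changelog:

import asyncio, json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, RegexExtractionStrategy

async def extract_contacts():
    # Combine built-in patterns as flags (assumed names: Email, Url)
    strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com/contact",
            config=CrawlerRunConfig(extraction_strategy=strategy)
        )
        # extracted_content is a JSON string of match records (assumed fields)
        for match in json.loads(result.extracted_content):
            print(match)

asyncio.run(extract_contacts())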
Fixes
- crawler: remove automatic page closure in take_screenshot and take_screenshot_naive, preventing premature teardown; callers now must explicitly close pages (BREAKING CHANGE) (a3e9ef9)
Documentation
- format bash scripts in docs/apps/linkdin/README.md so examples copy & paste cleanly (87d4b0f)
- update the same README with full litellm argument details for correct script usage (bd5a9ac)
Refactoring
- logger: centralize color codes behind an Enum in async_logger, browser_profiler, content_filter_strategy and related modules for cleaner, type-safe formatting (cd2b490)
Experimental
- start migration of logging stack to rich (work in progress) (b2f3cb0)
Crawl4AI 0.6.0
🚀 0.6.0 — 22 Apr 2025
Highlights
- World‑aware crawlers:
  crun_cfg = CrawlerRunConfig(
      url="https://browserleaks.com/geo",  # test page that shows your location
      locale="en-US",                      # Accept-Language & UI locale
      timezone_id="America/Los_Angeles",   # JS Date()/Intl timezone
      geolocation=GeolocationConfig(       # override GPS coords
          latitude=34.0522,
          longitude=-118.2437,
          accuracy=10.0,
      )
  )
- Table‑to‑DataFrame extraction, flip
  df = pd.DataFrame(result.media["tables"][0]["rows"], columns=result.media["tables"][0]["headers"])
  and get CSV or pandas without extra parsing.
- Crawler pool with pre‑warm, pages launch hot, lower P90 latency, lower memory.
- Network and console capture, full traffic log plus MHTML snapshot for audits and debugging.
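A short sketch of how the capture flags fit together; capture_network_requests, capture_console_messages, capture_mhtml and the matching result fields follow the current docs and are assumptions here:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def audit_page():
    cfg = CrawlerRunConfig(
        capture_network_requests=True,   # full request/response log
        capture_console_messages=True,   # browser console output
        capture_mhtml=True               # single-file page snapshot
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=cfg)
        print(len(result.network_requests or []), "network events")
        print(len(result.console_messages or []), "console messages")
        if result.mhtml:
            with open("snapshot.mhtml", "w") as f:
                f.write(result.mhtml)

asyncio.run(audit_page())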
Added
- Geolocation, locale, and timezone flags for every crawl.
- Browser pooling with page pre‑warming.
- Table extractor that exports to CSV or pandas.
- Crawler pool manager in SDK and Docker API.
- Network & console log capture, plus MHTML snapshot.
- MCP socket and SSE endpoints with playground UI.
- Stress‑test framework (tests/memory) for 1k+ URL runs.
- Docs v2: TOC, GitHub badge, copy‑code buttons, Docker API demo.
- “Ask AI” helper button, work in progress, shipping soon.
- New examples: geo location, network/console capture, Docker API, markdown source selection, crypto analysis.
Changed
- Browser strategy consolidation, legacy docker modules removed.
- ProxyConfig moved to async_configs.
- Server migrated to pool‑based crawler management.
- FastAPI validators replace custom query validation.
- Docker build now uses a Chromium base image.
- Repo cleanup, ≈36 k insertions, ≈5 k deletions across 121 files.
Fixed
- Session leaks, duplicate visits, URL normalisation.
- Target‑element regressions in scraping strategies.
- Logged URL readability, encoded URL decoding, middle truncation.
- Closed issues: #701 #733 #756 #774 #804 #822 #839 #841 #842 #843 #867 #902 #911.
Removed
- Obsolete modules in crawl4ai/browser/*.
Deprecated
- Old markdown generator names now alias DefaultMarkdownGenerator and warn.
Upgrade notes
- Update any imports from crawl4ai/browser/* to the new pooled browser modules.
- If you override AsyncPlaywrightCrawlerStrategy.get_page, adopt the new signature.
- Rebuild Docker images to pick up the Chromium layer.
- Switch to DefaultMarkdownGenerator to silence deprecation warnings (a sketch follows this list).
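For the last point, the change is typically just the import and the generator passed to the run config; a minimal sketch (markdown_generator as a CrawlerRunConfig field is assumed from the current docs):

from crawl4ai import CrawlerRunConfig, DefaultMarkdownGenerator

# Replace any deprecated generator alias with DefaultMarkdownGenerator
config = CrawlerRunConfig(markdown_generator=DefaultMarkdownGenerator())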
121 files changed, ≈36 223 insertions, ≈4 975 deletions
Crawl4AI v0.5.0.post1
Crawl4AI v0.5.0.post1 Release
Release Theme: Power, Flexibility, and Scalability
Crawl4AI v0.5.0 is a major release focused on significantly enhancing the library's power, flexibility, and scalability.
Key Features
- Deep Crawling System - Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies, with page limiting and scoring capabilities (a configuration sketch follows this list)
- Memory-Adaptive Dispatcher - Scale to thousands of URLs with intelligent memory monitoring and concurrency control
- Multiple Crawling Strategies - Choose between browser-based (Playwright) or lightweight HTTP-only crawling
- Docker Deployment - Easy deployment with FastAPI server, JWT authentication, and streaming/non-streaming endpoints
- Command-Line Interface - New crwl CLI provides convenient access to all features with intuitive commands
- Browser Profiler - Create and manage persistent browser profiles to save authentication states for protected content
- Crawl4AI Coding Assistant - Interactive chat interface for asking questions about Crawl4AI and generating Python code examples
- LXML Scraping Mode - Fast HTML parsing using the lxml library for 10-20x speedup with complex pages
- Proxy Rotation - Built-in support for dynamic proxy switching with authentication and session persistence
- PDF Processing - Extract and process data from PDF files (both local and remote)
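As noted in the Deep Crawling item above, a configuration sketch; BFSDeepCrawlStrategy and its parameters follow the 0.5.x docs and should be read as assumptions rather than a definitive API:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl():
    # Breadth-first crawl from the seed URL, capped by depth and page count
    strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=50, include_external=False)
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://docs.crawl4ai.com", config=config)
        for result in results:   # one CrawlResult per crawled page (batch mode)
            print(result.url, result.success)

asyncio.run(deep_crawl())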
Additional Improvements
- LLM Content Filter for intelligent markdown generation
- URL redirection tracking
- LLM-powered schema generation utility for extraction templates
- robots.txt compliance support (see the sketch after this list)
- Enhanced browser context management
- Improved serialization and config handling
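A small sketch covering two of these items, robots.txt compliance and redirect tracking; check_robots_txt and redirected_url follow the documented config/result fields and are assumptions here:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def polite_crawl():
    config = CrawlerRunConfig(check_robots_txt=True)  # skip URLs disallowed by robots.txt
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("http://example.com", config=config)
        if result.success and result.redirected_url != result.url:
            print("Redirected to", result.redirected_url)

asyncio.run(polite_crawl())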
Breaking Changes
This release contains several breaking changes. Please review the full release notes for migration guidance.
For complete details, visit: https://docs.crawl4ai.com/blog/releases/0.5.0/