A web content extraction toolkit for YouTube transcripts, URL metadata, and Reddit posts, with template-based file naming and bulk processing.
- High Success Rate: 95%+ extraction success using youtube-transcript-api
- Multiple Formats: Raw, stitched, and formatted transcript views
- Bulk Processing: Extract from multiple videos simultaneously
- Smart Export: Markdown, JSON, CSV, and ZIP archives
- Universal Support: Works with YouTube, Reddit, GitHub, Twitter/X, and generic websites
- Intelligent Fallbacks: Multiple extraction strategies for maximum reliability
- Bulk Processing: Process hundreds of URLs with progress tracking
- Rich Metadata: Titles, descriptions, thumbnails, Open Graph data
- Complete Extraction: Posts, comments, metadata, author info, and images
- Nested Comments: Hierarchical comment threads with proper indentation
- Bulk Processing: Download multiple Reddit posts simultaneously
- Flexible Options: Configurable comment limits, metadata inclusion
- Multiple Formats: Markdown, JSON, and organized ZIP archives
- Template-Based: Customizable filename templates with metadata variables
- Auto-Sanitization: Clean, OS-compatible filenames
- Content-Aware: Different templates for different content types
YouTube: {title} - {channel} - {date}
Reddit: reddit_{subreddit}_{title}_{date}
URL: {domain} - {title} - {date}
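The templates above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the names `TEMPLATES`, `sanitize`, and `render_filename` are hypothetical, and the sanitization rule (anything other than letters, digits, and hyphens becomes an underscore) is an assumption inferred from the example filenames in this README.

```python
import re
from datetime import date

# Hypothetical defaults mirroring the three templates listed above.
TEMPLATES = {
    "youtube": "{title} - {channel} - {date}",
    "reddit": "reddit_{subreddit}_{title}_{date}",
    "url": "{domain} - {title} - {date}",
}


def sanitize(value) -> str:
    """Replace runs of filename-unsafe characters with a single underscore."""
    cleaned = re.sub(r"[^A-Za-z0-9-]+", "_", str(value))
    return cleaned.strip("_")


def render_filename(content_type: str, metadata: dict, ext: str = "md") -> str:
    """Fill the template for content_type with sanitized metadata values."""
    values = {key: sanitize(val) for key, val in metadata.items()}
    # Assumption: {date} falls back to today's date when not supplied.
    values.setdefault("date", date.today().isoformat())
    stem = TEMPLATES[content_type].format(**values)
    # Spaces that belong to the template itself (around " - ") become underscores.
    return stem.replace(" ", "_") + "." + ext
```

A call such as `render_filename("url", {"domain": "github.com", "title": "Awesome Python Project", "date": "2025-06-01"})` produces a name in the same shape as the URL example shown later in this README.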
- Python 3.9+
- Flask
- youtube-transcript-api
# Clone the repository
git clone https://github.com/yourusername/content-extractor-pro.git
cd content-extractor-pro
# Install dependencies
pip install -r backend/requirements.txt
# Start the application
./start.sh
- Main App: http://localhost:8000/frontend/
- Simple URL Extractor: http://localhost:8000/frontend/simple_extractor.html
- Reddit Downloader: http://localhost:8000/frontend/reddit_downloader.html
- Debug Tools: http://localhost:8000/frontend/debug_frontend.html
- Single Video: Paste a YouTube URL and click "Extract Transcript"
- Bulk Processing: Add multiple URLs (one per line) for batch processing
- Export Options: Download as Markdown, TXT, JSON, or ZIP archive
# Example filename output
How_to_Learn_Python_Fast_-_FreeCodeCamp_-_2025-06-01.md
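To make the "stitched" and "formatted" transcript views concrete, here is a rough sketch of how timed segments could be turned into each view. It assumes segments arrive as dictionaries with `text` and `start` (seconds) keys, which is a common shape for transcript data but is not confirmed by this README. The function names are hypothetical, and the URL parser only covers standard watch links and `youtu.be` short links.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def video_id_from_url(url: str) -> Optional[str]:
    """Return the video ID from a watch URL or youtu.be link, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS and parsed.path == "/watch":
        return parse_qs(parsed.query).get("v", [None])[0]
    return None


def _clean(text: str) -> str:
    """Collapse internal whitespace and newlines to single spaces."""
    return " ".join(text.split())


def stitch(segments: list) -> str:
    """Stitched view: join all non-empty segment texts into one paragraph."""
    return " ".join(_clean(seg["text"]) for seg in segments if seg["text"].strip())


def formatted(segments: list) -> str:
    """Formatted view: one line per segment with an [MM:SS] prefix."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {_clean(seg['text'])}")
    return "\n".join(lines)
```

The raw view would simply be the unmodified segment list.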
- Single URL: Extract metadata from any website
- Bulk URLs: Process multiple URLs with progress tracking
- Text Extraction: Auto-detect URLs from pasted text
# Example filename output
github_com_-_Awesome_Python_Project_-_2025-06-01.md
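URL auto-detection from pasted text can be approximated with a regular expression. The sketch below is an assumption about how such a feature might work, not the extractor's real logic: `URL_PATTERN` and `extract_urls` are hypothetical names, and the pattern deliberately stays simple (http/https only, trailing sentence punctuation trimmed).

```python
import re

# Match http(s) URLs up to whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> list:
    """Return unique URLs from text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Deduplicating before bulk processing avoids fetching the same page twice when a URL is pasted more than once.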
- Single Post: Download Reddit posts with comments and metadata
- Bulk Processing: Extract from multiple Reddit URLs simultaneously
- Customization: Configure comment limits, author info, timestamps
# Example filename output
reddit_programming_How_to_Learn_Python_2025-06-01.md
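Nested comment threads map naturally onto a recursive walk. The following sketch renders a comment tree as an indented Markdown list. The input shape (`kind`, `data.author`, `data.body`, `data.replies`) is modeled on Reddit's JSON listings as commonly documented; treat the field names as assumptions, and `render_comments` as a hypothetical helper rather than the downloader's own code.

```python
def render_comments(children: list, depth: int = 0) -> list:
    """Return Markdown lines for a comment listing, indenting replies by depth."""
    lines = []
    for child in children:
        # Skip "more" placeholders and anything that is not a comment.
        if child.get("kind") != "t1":
            continue
        data = child["data"]
        indent = "  " * depth
        body = " ".join(data.get("body", "").split())
        author = data.get("author", "[deleted]")
        lines.append(f"{indent}- **{author}**: {body}")
        replies = data.get("replies")
        # An empty string here means the comment has no replies.
        if isinstance(replies, dict):
            lines.extend(render_comments(replies["data"]["children"], depth + 1))
    return lines
```

A configurable comment limit could be layered on top by truncating the returned list or by stopping the walk at a maximum depth.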
Variable | Description | Example |
---|---|---|
{title} | Content title | How_to_Learn_Python |
{channel} | YouTube channel | FreeCodeCamp |
{domain} | Website domain | github_com |
{subreddit} | Reddit subreddit | programming |
{author} | Content author | john_doe |
{date} | Current date | 2025-06-01 |
{time} | Current time | 14-30-25 |
{views} | View count | 1000000 |
{score} | Reddit score | 500 |
{comments} | Comment count | 25 |
content-extractor-pro/
├── frontend/   # Web interface (HTML/CSS/JS)
├── backend/    # Flask API server
├── assets/     # Static assets
├── docs/       # Documentation
├── tests/      # Test files
├── scripts/    # Utility scripts
├── config/     # Configuration files
└── start.sh    # Quick start script
- GET /api/health - Health check
- POST /api/extract - YouTube transcript extraction
- POST /api/extract-url-metadata - URL metadata extraction
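For scripting against these endpoints, a small client might look like the sketch below. Only the endpoint paths come from this README. The base URL (port 5002, taken from the sample configuration), the JSON body shape `{"url": ...}`, and the function names are all assumptions, so check the backend source before relying on them.

```python
import json
import urllib.request

# Assumption: the backend listens on the PORT value from the sample config.
BASE_URL = "http://localhost:5002"


def build_extract_request(video_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /api/extract; the {"url": ...} body is a guess."""
    payload = json.dumps({"url": video_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/extract",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_health(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Return True if GET /api/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors and HTTP error responses
        return False
```

Calling `check_health()` before submitting a batch gives a quick signal that the server is reachable.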
For higher rate limits and better reliability:
- Create a Reddit app at https://www.reddit.com/prefs/apps
- Configure Client ID and Secret in the Reddit Downloader interface
- Enjoy improved rate limits and reliability
FLASK_ENV=development
FLASK_DEBUG=True
PORT=5002
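These settings are plain environment variables, so a backend can read them with fallbacks. The sketch below shows one way to do that. `load_config` is a hypothetical helper, and using the values above as defaults is an assumption rather than documented behavior.

```python
import os


def load_config(env=None) -> dict:
    """Read the three settings above, falling back to the values shown."""
    env = os.environ if env is None else env
    return {
        "env": env.get("FLASK_ENV", "development"),
        # Treat "1", "true", and "yes" (any case) as enabling debug mode.
        "debug": env.get("FLASK_DEBUG", "True").strip().lower() in {"1", "true", "yes"},
        "port": int(env.get("PORT", "5002")),
    }
```

Passing a dictionary instead of relying on `os.environ` makes the function easy to exercise in tests.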
Platform | Success Rate | Notes |
---|---|---|
YouTube | 95%+ | Using youtube-transcript-api |
Reddit | 100% | Public JSON API |
GitHub | 100% | GitHub API integration |
Generic URLs | 80%+ | Multiple fallback strategies |
Twitter/X | 60%+ | JavaScript-heavy, best effort |
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- youtube-transcript-api for YouTube transcript extraction
- Flask for the backend API
- JSZip for client-side ZIP generation
- Instagram post extraction
- TikTok transcript support
- Podcast transcript extraction
- Custom template editor
- Scheduled bulk processing
- API rate limiting dashboard
- 🐛 Bug Reports: Open an issue
- 💡 Feature Requests: Start a discussion
- 📧 Email: your.email@example.com
Made with ❤️ for content creators, researchers, and data enthusiasts