🚀 Content Extractor Pro

A web content extraction toolkit covering YouTube transcripts, URL metadata, and Reddit content downloads, with intelligent file naming and bulk processing throughout.

✨ Features

🎬 YouTube Transcript Extraction

High Success Rate: 95%+ extraction success using youtube-transcript-api
Multiple Formats: Raw, stitched, and formatted transcript views
Bulk Processing: Extract from multiple videos simultaneously
Smart Export: Markdown, JSON, CSV, and ZIP archives
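The difference between the transcript views can be sketched in a few lines. This is an illustration only, not the project's code: it assumes segments arrive as dicts with `text`, `start`, and `duration` keys (the raw shape youtube-transcript-api has commonly exposed), and it assumes "stitched" means one joined paragraph and "formatted" means one timestamped line per segment. The function names are hypothetical.

```python
from typing import Dict, List, Union

# Assumed segment shape: {"text": str, "start": float, "duration": float}
Segment = Dict[str, Union[str, float]]


def clean(text: object) -> str:
    """Collapse newlines and repeated spaces inside a segment."""
    return " ".join(str(text).split())


def stitch(segments: List[Segment]) -> str:
    """Stitched view (assumed meaning): all segment text as one paragraph."""
    return " ".join(clean(s["text"]) for s in segments)


def timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, or H:MM:SS once past an hour."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def formatted(segments: List[Segment]) -> str:
    """Formatted view (assumed meaning): one line per segment with its start time."""
    return "\n".join(
        f"[{timestamp(float(s['start']))}] {clean(s['text'])}" for s in segments
    )


sample = [
    {"text": "hello and welcome", "start": 0.0, "duration": 2.5},
    {"text": "to the\ncourse", "start": 2.5, "duration": 1.8},
]
print(stitch(sample))
print(formatted(sample))
```

Keeping the raw segments around and deriving the other two views from them means no information is lost when switching formats.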

🔗 URL Metadata Extraction

Universal Support: Works with YouTube, Reddit, GitHub, Twitter/X, and generic websites
Intelligent Fallbacks: Multiple extraction strategies for maximum reliability
Bulk Processing: Process hundreds of URLs with progress tracking
Rich Metadata: Titles, descriptions, thumbnails, Open Graph data
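As a rough picture of what a fallback chain for metadata can look like, the sketch below prefers Open Graph tags, then Twitter card tags, then the plain `<title>` and `<meta name="description">`. It uses only the standard library; the class and function names and the exact fallback order are assumptions for illustration, not the project's implementation.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class MetaParser(HTMLParser):
    """Collects <meta> property/name pairs and the <title> text."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.title_parts: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            key = attributes.get("property") or attributes.get("name")
            content = attributes.get("content")
            if key and content:
                # Keep the first occurrence of each key.
                self.meta.setdefault(key.lower(), content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_metadata(html: str) -> Dict[str, Optional[str]]:
    """Return title, description, and thumbnail using a simple fallback order."""
    parser = MetaParser()
    parser.feed(html)
    parser.close()
    meta = parser.meta
    title_tag = "".join(parser.title_parts).strip() or None
    return {
        "title": meta.get("og:title") or meta.get("twitter:title") or title_tag,
        "description": meta.get("og:description") or meta.get("description"),
        "thumbnail": meta.get("og:image") or meta.get("twitter:image"),
    }
```

Pages that render their metadata with JavaScript will return little or nothing through this route, which is consistent with the lower success rate reported for Twitter/X.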

🔴 Reddit Content Downloader

Complete Extraction: Posts, comments, metadata, author info, and images
Nested Comments: Hierarchical comment threads with proper indentation
Bulk Processing: Download multiple Reddit posts simultaneously
Flexible Options: Configurable comment limits, metadata inclusion
Multiple Formats: Markdown, JSON, and organized ZIP archives
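Nested comment rendering can be sketched as a short recursive walk. The example assumes the structure Reddit's public JSON uses for comment trees: a Listing whose `data.children` holds `t1` comment objects, each with a `replies` field that is either another Listing or an empty string. The function name, the Markdown layout, and the `max_depth` option are hypothetical and stand in for whatever limits the downloader actually exposes.

```python
from typing import Any, List


def render_comments(listing: Any, depth: int = 0, max_depth: int = 5) -> List[str]:
    """Flatten a comment Listing into indented Markdown bullet lines."""
    lines: List[str] = []
    # An empty string means "no replies"; anything non-dict is skipped.
    if not isinstance(listing, dict) or depth > max_depth:
        return lines
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":  # skip "more" placeholders and other kinds
            continue
        data = child["data"]
        body = " ".join(str(data.get("body", "")).split())
        author = data.get("author", "[deleted]")
        score = data.get("score", 0)
        lines.append(f"{'  ' * depth}- **u/{author}** ({score} points): {body}")
        lines.extend(render_comments(data.get("replies"), depth + 1, max_depth))
    return lines


sample = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {
            "author": "alice", "score": 12, "body": "Top level",
            "replies": {"kind": "Listing", "data": {"children": [
                {"kind": "t1", "data": {"author": "bob", "score": 3,
                                        "body": "A reply", "replies": ""}},
                {"kind": "more", "data": {"count": 4}},
            ]}},
        }},
    ]},
}
print("\n".join(render_comments(sample)))
```

Two spaces per level keep the output valid as a nested Markdown list, so the hierarchy survives in any Markdown viewer.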

📋 Smart File Naming System

Template-Based: Customizable filename templates with metadata variables
Auto-Sanitization: Clean, OS-compatible filenames
Content-Aware: Different templates for different content types

YouTube: {title} - {channel} - {date}
Reddit:  reddit_{subreddit}_{title}_{date}
URL:     {domain} - {title} - {date}
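A minimal sketch of how templates like these could be expanded and sanitized is shown below. The sanitization rules (replace anything outside letters, digits, and hyphens with an underscore) are an assumption chosen to reproduce the example filenames in the Usage Guide; `sanitize` and `build_filename` are hypothetical names, not the project's API.

```python
import re
from datetime import date
from typing import Mapping

TEMPLATES = {
    "youtube": "{title} - {channel} - {date}",
    "reddit": "reddit_{subreddit}_{title}_{date}",
    "url": "{domain} - {title} - {date}",
}


def sanitize(value: object) -> str:
    """Replace runs of characters outside [A-Za-z0-9-] with one underscore."""
    return re.sub(r"[^A-Za-z0-9-]+", "_", str(value)).strip("_")


def build_filename(kind: str, metadata: Mapping[str, object], ext: str = "md") -> str:
    """Fill the template for `kind` with sanitized metadata values."""
    fields = {key: sanitize(value) for key, value in metadata.items()}
    fields.setdefault("date", date.today().isoformat())  # default to today
    stem = TEMPLATES[kind].format(**fields)  # raises KeyError if a variable is missing
    return f"{stem.replace(' ', '_')}.{ext}"


print(build_filename("youtube", {
    "title": "How to Learn Python Fast",
    "channel": "FreeCodeCamp",
    "date": "2025-06-01",
}))
```

This ASCII-only rule is deliberately conservative: it is safe on every common filesystem, but it also flattens non-Latin titles to underscores, so a real implementation may want a more permissive character set.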

🚀 Quick Start

Prerequisites

Python 3.9+
Flask
youtube-transcript-api

Installation

# Clone the repository
git clone https://github.com/Paranjayy/content-extractor.git
cd content-extractor

# Install dependencies
pip install -r backend/requirements.txt

# Start the application
./start.sh

Access the Tools

Main App: http://localhost:8000/frontend/
Simple URL Extractor: http://localhost:8000/frontend/simple_extractor.html
Reddit Downloader: http://localhost:8000/frontend/reddit_downloader.html
Debug Tools: http://localhost:8000/frontend/debug_frontend.html

📖 Usage Guide

YouTube Transcript Extraction

Single Video: Paste a YouTube URL and click "Extract Transcript"
Bulk Processing: Add multiple URLs (one per line) for batch processing
Export Options: Download as Markdown, TXT, JSON, or ZIP archive

# Example filename output
How_to_Learn_Python_Fast_-_FreeCodeCamp_-_2025-06-01.md

URL Metadata Extraction

Single URL: Extract metadata from any website
Bulk URLs: Process multiple URLs with progress tracking
Text Extraction: Auto-detect URLs from pasted text

# Example filename output
github_com_-_Awesome_Python_Project_-_2025-06-01.md
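Auto-detecting URLs in pasted text can be approximated with a regular expression plus a little cleanup. The sketch below is one possible approach, not the extractor's actual logic; the pattern and the `find_urls` name are assumptions. It trims trailing punctuation and removes duplicates while preserving order.

```python
import re
from typing import List

# Match http(s) URLs up to whitespace, quotes, angle brackets, or a closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def find_urls(text: str) -> List[str]:
    """Return unique URLs in order of first appearance."""
    seen = set()
    urls: List[str] = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop sentence punctuation stuck to the URL
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


pasted = (
    "See https://github.com/user/repo, then https://example.com/page. "
    "Video (https://youtu.be/abc123) and again https://github.com/user/repo"
)
print(find_urls(pasted))
```

The pattern stops at a closing parenthesis, which handles URLs wrapped in brackets but truncates the rare URL that legitimately contains one.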

Reddit Content Downloading

Single Post: Download Reddit posts with comments and metadata
Bulk Processing: Extract from multiple Reddit URLs simultaneously
Customization: Configure comment limits, author info, timestamps

# Example filename output
reddit_programming_How_to_Learn_Python_2025-06-01.md

🛠️ Template Variables

Variable	Description	Example
`{title}`	Content title	`How_to_Learn_Python`
`{channel}`	YouTube channel	`FreeCodeCamp`
`{domain}`	Website domain	`github_com`
`{subreddit}`	Reddit subreddit	`programming`
`{author}`	Content author	`john_doe`
`{date}`	Current date	`2025-06-01`
`{time}`	Current time	`14-30-25`
`{views}`	View count	`1000000`
`{score}`	Reddit score	`500`
`{comments}`	Comment count	`25`

🏗️ Architecture

content-extractor-pro/
├── frontend/           # Web interface (HTML/CSS/JS)
├── backend/           # Flask API server
├── assets/            # Static assets
├── docs/              # Documentation
├── tests/             # Test files
├── scripts/           # Utility scripts
├── config/            # Configuration files
└── start.sh           # Quick start script

Backend API Endpoints

GET /api/health - Health check
POST /api/extract - YouTube transcript extraction
POST /api/extract-url-metadata - URL metadata extraction
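For scripted use, the transcript endpoint could be called along these lines. Only the paths above come from this README. The JSON field name `url`, the shape of the response, and the base URL (guessed from the `PORT=5002` value under Environment Variables) are all assumptions, so check the backend source before relying on them.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:5002"  # assumption: backend port taken from PORT=5002


def build_extract_request(video_url: str, base_url: str = BASE_URL) -> Request:
    """Build a POST /api/extract request; the 'url' field name is an assumption."""
    payload = json.dumps({"url": video_url}).encode("utf-8")
    return Request(
        f"{base_url}/api/extract",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: Request, timeout: float = 30.0) -> dict:
    """Send the request and decode a JSON response body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_extract_request("https://www.youtube.com/watch?v=VIDEO_ID")
print(request.get_method(), request.full_url)
# With the server running: result = send(request)
```

Separating request construction from sending makes the payload easy to inspect or test without a running server.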

🔧 Configuration

Reddit API (Optional)

For higher rate limits and better reliability:

Create a Reddit app at https://www.reddit.com/prefs/apps
Configure Client ID and Secret in the Reddit Downloader interface
Enjoy improved rate limits and reliability

Environment Variables

FLASK_ENV=development
FLASK_DEBUG=True
PORT=5002
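A small sketch of reading these variables in Python follows. The defaults and the `load_settings` helper are hypothetical; the backend may parse its configuration differently.

```python
import os
from typing import Any, Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Read the variables listed above, with assumed fallback defaults."""
    debug_raw = env.get("FLASK_DEBUG", "False")
    return {
        "env": env.get("FLASK_ENV", "production"),
        # Environment values are strings, so "True" must be parsed explicitly.
        "debug": debug_raw.strip().lower() in {"1", "true", "yes", "on"},
        "port": int(env.get("PORT", "5002")),
    }


settings = load_settings({"FLASK_ENV": "development", "FLASK_DEBUG": "True", "PORT": "5002"})
print(settings)
```

The resulting values could then be passed to Flask, for example `app.run(port=settings["port"], debug=settings["debug"])`.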

📊 Success Rates

Platform	Success Rate	Notes
YouTube	95%+	Using the unofficial youtube-transcript-api library
Reddit	100%	Public JSON API
GitHub	100%	GitHub API integration
Generic URLs	80%+	Multiple fallback strategies
Twitter/X	60%+	JavaScript-heavy, best effort

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

youtube-transcript-api for YouTube transcript extraction
Flask for the backend API
JSZip for client-side ZIP generation

🔮 Roadmap

📞 Support

🐛 Bug Reports: Open an issue
💡 Feature Requests: Start a discussion
📧 Email: your.email@example.com

Made with ❤️ for content creators, researchers, and data enthusiasts

Paranjayy/content-extractor
