A production-ready system for transcribing entire YouTube channels using GPU-accelerated Whisper AI with faster-whisper for maximum performance.
- GPU Accelerated - 35-40x realtime speed with faster-whisper on CUDA
- Modal Cloud Support - 70-200x realtime with parallel A10G GPUs
- Bulk Processing - Transcribe entire channels (1000+ videos)
- Multi-Channel - Process unlimited channels with automatic organization
- Resumable - Interrupt and resume anytime without losing progress
- Parallel Downloads - Download up to 10 videos simultaneously
- Smart Storage - Auto-cleanup of audio files after transcription
- Progress Tracking - SQLite database tracks every video
- Voice Activity Detection - Automatically skip silence to save compute
- Error Handling - Robust retry logic and error recovery
Windows:
# Option 1: Double-click setup.bat
# Option 2: Run in terminal
python setup.py
macOS/Linux:
python3 setup.py
The interactive wizard will:
- ✅ Check your system requirements
- ✅ Install dependencies (Local GPU / Modal Cloud / Both)
- ✅ Test GPU availability
- ✅ Set up Modal authentication
- ✅ Create the config file
- ✅ Guide you through everything
Other options:
- Advanced users: `python quick-setup.py` for minimal setup
- Manual setup: See INSTALL.md
- Complete guide: docs/GETTING_STARTED.md
Local GPU (All-in-one):
python scripts/run_transcriber.py
# Downloads + transcribes on your GPU
Modal Cloud (Hybrid - 2 steps):
# Step 1: Download audio locally
python scripts/prepare_for_modal.py
# Step 2: Transcribe on Modal GPUs (parallel, 70-200x realtime)
modal run scripts/modal_hybrid.py --max-files 10
Check Progress:
python scripts/utils/check_status.py

| Scenario | Recommended | Why |
|---|---|---|
| Have NVIDIA GPU (4GB+ VRAM) | Local GPU | Free, fast enough (35-40x realtime) |
| Large channel (500+ videos) | Modal Cloud | Fastest (70-200x realtime), costs $30-40/1000hrs |
| No NVIDIA GPU | Modal Cloud | Only option for GPU acceleration |
| Urgent deadline | Modal Cloud | Parallel processing = done in minutes |
Note: Modal requires a hybrid approach (download locally, transcribe in the cloud) because YouTube's bot detection blocks downloads from cloud IPs. See docs/MODAL_QUICKSTART.md.
- QUICKSTART.md - TL;DR for impatient users
- INSTALL.md - Complete installation guide
- docs/GETTING_STARTED.md - Comprehensive beginner guide
- docs/MODAL_QUICKSTART.md - Modal cloud setup
- docs/MULTI_CHANNEL_GUIDE.md - Managing multiple channels
- WORKFLOW.md - Configuration workflow
- SETUP_GUIDE.md - Setup methods comparison
- CHANGELOG.md - Version history
For Local GPU:
- GPU: NVIDIA GPU with 4GB+ VRAM (8GB recommended)
  - RTX 4060: Tested, works great (8GB VRAM)
  - RTX 3060+: Recommended
  - GTX 1660+: Minimum
- RAM: 16GB+ recommended for large channels
- Storage: 50-100GB free space for temp audio files
For Modal Cloud:
- Storage: 10-50GB for audio downloads
- Internet: Fast connection recommended
- Python: 3.9 or higher
- CUDA: 12.x runtime (for local GPU only)
- OS: Windows 10/11, Linux, macOS
Edit config/config.py to customize:
# Channel to transcribe
CHANNEL_URL = "https://www.youtube.com/@YourChannel"
# Whisper model (tiny/base/small/medium/large)
MODEL_SIZE = "base" # Recommended for most users
# Recommended by GPU VRAM:
# 4GB: tiny or base
# 6GB: small
# 8GB: small or medium
# 12GB+: medium or large
# Performance settings
DOWNLOAD_WORKERS = 10 # Parallel downloads
TRANSCRIBE_WORKERS = 1 # Must stay at 1 (GPU limitation)
BATCH_SIZE = 20 # Videos per batch
# Output settings
DELETE_AUDIO_AFTER_TRANSCRIPTION = True # Save disk space
LANGUAGE = "en" # Language or None for auto-detect
DEVICE = "cuda" # "cuda" or "cpu"
Model: base
- Speed: 35-40x realtime
- Example: 60-minute video → 90-120 seconds
- Throughput: ~50-60 videos/hour (12-min average)
- 1000 hours of content: ~25-30 hours processing
| Model | Speed (RTX 4060) | Quality | VRAM | Use Case |
|---|---|---|---|---|
| tiny | 60-80x realtime | Good | 1GB | Quick drafts |
| base | 35-40x realtime | Better | 1-2GB | Recommended |
| small | 20-25x realtime | Great | 2-3GB | Higher accuracy |
| medium | 10-15x realtime | Excellent | 5GB | Professional use |
| large | 5-8x realtime | Best | 10GB | Maximum accuracy |
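The VRAM guidance in the config.py comments can be written as a small lookup. This is a minimal sketch: `recommend_models` is a hypothetical helper, not part of the project, and the thresholds are copied from those comments as guidance rather than hard limits.

```python
def recommend_models(vram_gb: float) -> list[str]:
    """Return the Whisper model sizes suggested for a given amount of GPU VRAM.

    Hypothetical helper: thresholds mirror the "Recommended by GPU VRAM"
    comments in config.py and are guidance only.
    """
    if vram_gb >= 12:
        return ["medium", "large"]
    if vram_gb >= 8:
        return ["small", "medium"]
    if vram_gb >= 6:
        return ["small"]
    if vram_gb >= 4:
        return ["tiny", "base"]
    return []  # below the documented 4GB minimum


print(recommend_models(8))
```

The first entry of the returned list is the safer choice; the second trades speed and memory headroom for accuracy.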
- Speed: 70-200x realtime per GPU
- Parallelization: 100+ GPUs simultaneously
- Example: 1000 hours → ~8-15 minutes with 100 GPUs
- Cost: ~$30-40 per 1000 hours
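The speed figures above reduce to simple arithmetic: processing time is the audio duration divided by the realtime factor and by the number of GPUs working in parallel. The sketch below is a hypothetical estimator (the name `estimate_minutes` is an assumption) that ignores download, upload, and startup overhead.

```python
def estimate_minutes(audio_hours: float, realtime_factor: float, gpus: int = 1) -> float:
    """Estimate wall-clock minutes of pure transcription compute.

    Ignores downloads, uploads, and GPU startup time, so real runs take longer.
    """
    if realtime_factor <= 0 or gpus < 1:
        raise ValueError("realtime_factor must be positive and gpus at least 1")
    return audio_hours * 60 / (realtime_factor * gpus)


# Local GPU, base model at 35-40x realtime, 1000 hours of audio
local_fast = estimate_minutes(1000, 40) / 60  # hours
local_slow = estimate_minutes(1000, 35) / 60  # hours
print(f"Local: {local_fast:.1f}-{local_slow:.1f} hours")

# Modal Cloud, 100 GPUs at 70-200x realtime each
modal_fast = estimate_minutes(1000, 200, gpus=100)
modal_slow = estimate_minutes(1000, 70, gpus=100)
print(f"Modal: {modal_fast:.1f}-{modal_slow:.1f} minutes")
```

For 1000 hours of audio this gives 25.0-28.6 hours locally and 3.0-8.6 minutes of pure compute on 100 GPUs. The gap between that and the ~8-15 minutes quoted above is likely overhead that this estimate leaves out.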
YT Transcribe/
├── setup.py                       # Interactive setup wizard
├── quick-setup.py                 # Quick setup for advanced users
├── setup.bat                      # Windows launcher
│
├── config/
│   ├── config.example.py          # Configuration template
│   └── config.py                  # Your configuration (gitignored)
│
├── scripts/
│   ├── run_transcriber.py         # Local GPU (all-in-one)
│   ├── prepare_for_modal.py       # Modal step 1: Download audio
│   ├── modal_hybrid.py            # Modal step 2: Transcribe
│   └── utils/
│       ├── check_status.py        # Check progress
│       ├── reset_channel.py       # Manage channels
│       └── ...
│
├── src/
│   └── channel_transcriber.py     # Core transcription engine
│
└── data/                          # Auto-created, gitignored
    ├── temp_audio/
    │   └── {Channel Name}/        # Downloaded audio
    ├── transcripts/
    │   └── {Channel Name}/        # Output transcripts
    ├── transcription.log          # Detailed logs
    └── transcription_progress.db  # Progress database
1. SCRAPE CHANNEL
   ├─ Fetch all video metadata from YouTube
   ├─ Store in SQLite database
   └─ Calculate total duration
2. DOWNLOAD BATCH (20 videos)
   ├─ Parallel download with 10 workers
   ├─ Extract audio only (saves time/bandwidth)
   └─ Store in data/temp_audio/{Channel}/
3. TRANSCRIBE BATCH
   ├─ Load faster-whisper model on GPU
   ├─ Process each video sequentially
   ├─ Apply Voice Activity Detection (VAD)
   ├─ Generate timestamped transcript
   ├─ Save to data/transcripts/{Channel}/
   └─ Delete audio file (configurable)
4. REPEAT
   └─ Continue until all videos processed
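The loop above can be sketched with the standard library alone. This is an illustration of the batch-and-resume idea, not the project's actual code: the `videos` table schema, the function names, and the injected `download` and `transcribe` callables are all assumptions. Errors are not retried here; an exception simply stops the run, and the next run resumes from the database.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def init_db(conn):
    # Hypothetical schema: one row per video, marked 'pending' until transcribed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS videos ("
        "video_id TEXT PRIMARY KEY, status TEXT NOT NULL DEFAULT 'pending')"
    )


def add_videos(conn, video_ids):
    # INSERT OR IGNORE preserves the status of videos recorded by earlier runs.
    conn.executemany(
        "INSERT OR IGNORE INTO videos (video_id) VALUES (?)",
        [(video_id,) for video_id in video_ids],
    )
    conn.commit()


def next_batch(conn, batch_size):
    rows = conn.execute(
        "SELECT video_id FROM videos WHERE status = 'pending' ORDER BY rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    return [row[0] for row in rows]


def run(conn, download, transcribe, batch_size=20, download_workers=10):
    while True:
        batch = next_batch(conn, batch_size)
        if not batch:
            break  # nothing pending: all videos processed
        # Download the batch in parallel; pool.map keeps results in input order.
        with ThreadPoolExecutor(max_workers=download_workers) as pool:
            audio_paths = list(pool.map(download, batch))
        # Transcribe sequentially and commit after each video so an
        # interrupted run loses at most the video in progress.
        for video_id, audio_path in zip(batch, audio_paths):
            transcribe(audio_path)
            conn.execute("UPDATE videos SET status = 'done' WHERE video_id = ?", (video_id,))
            conn.commit()


# Demo with stand-in callables instead of real downloads and transcription.
conn = sqlite3.connect(":memory:")
init_db(conn)
add_videos(conn, ["abc123", "def456", "ghi789"])
done = []
run(conn, download=lambda vid: f"{vid}.m4a", transcribe=done.append,
    batch_size=2, download_workers=2)
print(done)
```

Committing per video is what makes the run resumable: a restart queries only the rows still marked pending.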
Process unlimited channels with automatic organization:
data/
├── temp_audio/
│   ├── My First Million/
│   ├── Lex Fridman/
│   └── Your Channel/
└── transcripts/
    ├── My First Million/
    │   ├── video1_abc123.txt
    │   └── video2_def456.txt
    ├── Lex Fridman/
    └── Your Channel/
Switch channels by editing CHANNEL_URL in config/config.py. The database tracks each channel separately.
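A per-channel layout like this can be built with `pathlib`. The sketch below is illustrative only: the helper names are hypothetical, reading the example filenames as `{title}_{video_id}.txt` is an assumption, and the character replacement is one possible way to keep names valid on Windows.

```python
import re
from pathlib import Path


def sanitize(name: str) -> str:
    # Replace characters that are invalid in Windows file names (assumed policy).
    return re.sub(r'[<>:"/\\|?*]', "_", name)


def transcript_path(data_dir: str, channel_name: str, title: str, video_id: str) -> Path:
    # Assumed naming: data/transcripts/{Channel Name}/{title}_{video_id}.txt
    filename = f"{sanitize(title)}_{video_id}.txt"
    return Path(data_dir) / "transcripts" / sanitize(channel_name) / filename


path = transcript_path("data", "Lex Fridman", "video1", "abc123")
print(path.as_posix())
```

Keeping the video ID in the filename keeps names unique even when two videos share a title.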
# Check CUDA version
nvidia-smi
# Verify PyTorch sees GPU
python -c "import torch; print(torch.cuda.is_available())"
# Test faster-whisper
python -c "from faster_whisper import WhisperModel; model = WhisperModel('base', device='cuda')"
If you get CUDA out of memory errors:
- Use a smaller model (`tiny` or `base`)
- Close other GPU applications
- Reduce batch size in config.py
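The first suggestion can be automated by trying model sizes from largest to smallest. This is a hedged sketch: `load_with_fallback` is a hypothetical helper, and it assumes an out-of-memory failure surfaces as a `RuntimeError` when the model is loaded, which may not hold for every failure mode.

```python
def load_with_fallback(sizes, load):
    """Try each model size in order and return (size, model) for the first that loads.

    `load` is any callable taking a size name. Assumption: running out of
    GPU memory raises RuntimeError inside `load`.
    """
    last_error = None
    for size in sizes:
        try:
            return size, load(size)
        except RuntimeError as exc:
            last_error = exc  # remember the failure and try the next, smaller size
    raise RuntimeError(f"none of {list(sizes)} could be loaded") from last_error


# Demo with a stand-in loader that pretends only 'base' and 'tiny' fit in memory.
def fake_load(size):
    if size not in ("base", "tiny"):
        raise RuntimeError("CUDA out of memory (simulated)")
    return f"<model {size}>"


size, model = load_with_fallback(["medium", "small", "base", "tiny"], fake_load)
print(size, model)

# With faster-whisper installed, the loader could be:
# from faster_whisper import WhisperModel
# size, model = load_with_fallback(
#     ["small", "base", "tiny"], lambda s: WhisperModel(s, device="cuda"))
```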
Install CUDA 12.x runtime from: https://developer.nvidia.com/cuda-downloads
The script automatically adds CUDA to PATH on Windows.
# Re-run Modal setup
modal setup
# Verify authentication
modal token list
For more help, see:
- INSTALL.md - Installation troubleshooting
- docs/GETTING_STARTED.md - Complete guide
- Up to 4x faster than openai-whisper
- Uses CTranslate2 (optimized inference)
- Same accuracy as original Whisper
- Lower memory usage
- Better batching
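For reference, a minimal sketch of timestamped transcription with faster-whisper and its built-in VAD filter. `WhisperModel` and `transcribe(..., vad_filter=True)` are the library's API; the helper names, the timestamp format, and the example audio path are assumptions for illustration and not necessarily what this project uses.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def transcribe_file(model, audio_path, language="en"):
    """Return one timestamped line per segment.

    `model` is expected to behave like faster_whisper.WhisperModel.
    vad_filter=True skips stretches without speech.
    """
    segments, _info = model.transcribe(audio_path, language=language, vad_filter=True)
    return [
        f"[{format_timestamp(seg.start)} -> {format_timestamp(seg.end)}] {seg.text.strip()}"
        for seg in segments
    ]


print(format_timestamp(3661.5))

# With faster-whisper installed and a CUDA GPU available:
# from faster_whisper import WhisperModel
# model = WhisperModel("base", device="cuda")  # load once, reuse for every file
# for line in transcribe_file(model, "audio.m4a"):  # hypothetical file
#     print(line)
```

Loading the model once and passing it in avoids paying the load cost for every video.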
Why download locally?
YouTube blocks cloud IPs (AWS, GCP, Modal) with bot detection. Solution:
✅ Hybrid Approach (works):
Your Computer → YouTube (downloads from home IP)
↓
Your Computer → Modal (uploads audio only)
↓
Modal GPUs → Transcribe in parallel (70-200x realtime)
↓
Your Computer ← Results stream back
Performance:
- Download locally: ~5-10 min for 100 videos
- Transcribe on Modal: ~3-5 min for 100 videos (parallel)
- Total: ~8-15 min vs 25+ hours on local GPU
Proven: 1000+ videos successfully transcribed using this approach.
| Environment | Library | Speed | Reason |
|---|---|---|---|
| Local GPU | faster-whisper | 35-40x | 4x faster, worth complex setup for sustained use |
| Modal Cloud | openai-whisper | 70-200x | Simpler, faster-whisper has cuDNN issues on Modal |
With 100 parallel GPUs on Modal, individual GPU speed matters less.
- Use `base` or `small` model
- Local GPU is perfect
- Default settings work great

- Use Modal Cloud for speed
- Or use local GPU with `base` model overnight
- Enable `DELETE_AUDIO_AFTER_TRANSCRIPTION = True`
- Monitor disk space

- Use `medium` or `large` model
- Accept slower processing
- Ensure sufficient VRAM

- Use Modal Cloud with 100+ parallel GPUs
- Or use `tiny` model on local GPU
- Accept lower accuracy for drafts
MIT License - Free to use and modify
- Whisper: OpenAI
- faster-whisper: Systran (CTranslate2 implementation)
- yt-dlp: yt-dlp contributors
- PyTorch: Facebook AI Research
- Modal: Modal Labs
Happy Transcribing!
Current System Status:
- ✅ GPU acceleration with faster-whisper (35-40x realtime)
- ✅ Modal Cloud support (70-200x realtime, parallel)
- ✅ Multi-channel support with automatic organization
- ✅ Interactive setup wizard
- ✅ Resumable processing
- ✅ Production-ready and battle-tested on 1000+ videos