# NVIDIA Parakeet Migration

## Summary

Replaced Faster-Whisper with NVIDIA Parakeet TDT (Token-and-Duration Transducer) for real-time speech transcription.

## Changes Made

### 1. New Transcriber: `parakeet_transcriber.py`

- **Model**: `nvidia/parakeet-tdt-0.6b-v3` (600M parameters)
- **Features**:
  - Real-time streaming transcription
  - Word-level timestamps for LLM pre-computation
  - GPU-accelerated (CUDA)
  - Lower latency than Faster-Whisper
  - Native PyTorch (no CTranslate2 dependency)

### 2. Requirements Updated

**Removed**:
- `faster-whisper==1.2.1`
- `ctranslate2==4.5.0`

**Added**:
- `transformers==4.47.1` - HuggingFace model loading
- `accelerate==1.2.1` - GPU optimization
- `sentencepiece==0.2.0` - Tokenization

**Kept**:
- `torch==2.9.1` & `torchaudio==2.9.1` - Core ML framework
- `silero-vad==5.1.2` - VAD still uses Silero (CPU)

### 3. Server Updates: `stt_server.py`

**Changes**:
- Import `ParakeetTranscriber` instead of `WhisperTranscriber`
- Partial transcripts now include a `words` array with timestamps
- Final transcripts include a `words` array for LLM pre-computation
- Startup logs show "Loading NVIDIA Parakeet TDT model"

**Word-level Token Format**:

```json
{
  "type": "partial",
  "text": "hello world",
  "words": [
    {"word": "hello", "start_time": 0.0, "end_time": 0.5},
    {"word": "world", "start_time": 0.5, "end_time": 1.0}
  ],
  "user_id": "123",
  "timestamp": 1234.56
}
```

Sketches of producing and consuming this format appear at the end of this document.

## Advantages Over Faster-Whisper

1. **Real-time Performance**: the TDT architecture is designed for streaming
2. **No cuDNN Issues**: native PyTorch, no CTranslate2 library-loading problems
3. **Word-level Tokens**: enables LLM prompt pre-computation during speech
4. **Lower Latency**: optimized for real-time use cases
5. **Better GPU Utilization**: uses standard PyTorch CUDA
6. **Simpler Dependencies**: no external compiled libraries

## Deployment

1. **Build Container**:
   ```bash
   docker-compose build miku-stt
   ```

2. **First Run** (downloads the ~600MB model):
   ```bash
   docker-compose up miku-stt
   ```
   The model is cached in the `/models` volume for subsequent runs.

3. **Verify GPU Usage**:
   ```bash
   docker exec miku-stt nvidia-smi
   ```
   You should see a `python3` process using VRAM (~1.5GB for the model plus inference).

## Testing

Same test procedure as before:

1. Join a voice channel
2. `!miku listen`
3. Speak clearly
4. Check the logs for "Parakeet model loaded"
5. Verify that transcripts appear faster than before

## Bot-Side Compatibility

No changes are needed to the bot code - the STT WebSocket protocol is identical. The bot automatically receives word-level tokens in partial/final transcript messages.

### Future Enhancement: LLM Pre-computation

The `words` array can be used to start LLM inference before the full transcript completes:

- Send partial words to the LLM as they arrive
- The LLM begins processing prompt tokens
- Faster response time when the user finishes speaking

A sketch of this flow appears at the end of this document.

## Rollback (if needed)

To revert to Faster-Whisper:

1. Restore `requirements.txt` from git
2. Restore `stt_server.py` from git
3. Delete `parakeet_transcriber.py`
4. Rebuild the container

## Performance Expectations

- **Model Load Time**: ~5-10 seconds (the first run downloads the model from HuggingFace)
- **VRAM Usage**: ~1.5GB (vs ~800MB for Whisper small)
- **Latency**: ~200-500ms for 2-second audio chunks
- **GPU Utilization**: 30-60% during active transcription
- **Accuracy**: similar to Whisper small (designed for English)
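## Sketch: Emitting Word-Level Tokens (Server Side)

The internal structure of `stt_server.py` is not shown above, so the following is only a minimal sketch of how a partial-transcript message could be assembled. The JSON field names follow the "Word-level Token Format" documented earlier; the `Word` dataclass and `build_partial_message` function are hypothetical names, not the actual server API.

```python
import json
import time
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with timestamps (hypothetical shape of the transcriber output)."""
    word: str
    start_time: float
    end_time: float


def build_partial_message(text: str, words: list[Word], user_id: str) -> str:
    """Serialize a partial transcript in the documented WebSocket format."""
    payload = {
        "type": "partial",
        "text": text,
        "words": [
            {"word": w.word, "start_time": w.start_time, "end_time": w.end_time}
            for w in words
        ],
        "user_id": user_id,
        "timestamp": time.time(),
    }
    return json.dumps(payload)


if __name__ == "__main__":
    words = [Word("hello", 0.0, 0.5), Word("world", 0.5, 1.0)]
    print(build_partial_message("hello world", words, user_id="123"))
```

A final transcript would use the same shape with `"type": "final"`.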
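## Sketch: Consuming Word-Level Tokens (Bot Side)

The bot code is unchanged, but since the messages now carry a `words` array, here is a minimal sketch of how a received message could be inspected. The field names come from the documented format; the `handle_stt_message` function name is hypothetical, and the snippet deliberately omits any WebSocket client details.

```python
import json


def handle_stt_message(raw: str) -> None:
    """Hypothetical bot-side handler for one message from the STT WebSocket."""
    msg = json.loads(raw)
    msg_type = msg.get("type")      # "partial" or "final"
    text = msg.get("text", "")
    words = msg.get("words", [])    # word-level tokens; may be empty

    if msg_type == "partial":
        # Word-level timestamps arrive while the user is still speaking.
        for w in words:
            print(f'{w["word"]}: {w["start_time"]:.2f}s - {w["end_time"]:.2f}s')
    elif msg_type == "final":
        print(f'[{msg.get("user_id")}] final transcript: {text}')


if __name__ == "__main__":
    handle_stt_message(
        '{"type": "partial", "text": "hello world", '
        '"words": [{"word": "hello", "start_time": 0.0, "end_time": 0.5}, '
        '{"word": "world", "start_time": 0.5, "end_time": 1.0}], '
        '"user_id": "123", "timestamp": 1234.56}'
    )
```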
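## Sketch: LLM Prompt Pre-computation

The pre-computation described under "Future Enhancement" is not implemented yet; this sketch only illustrates the idea of forwarding newly arrived words to an LLM client while the user is still speaking. `PartialPromptFeeder` and `send_to_llm` are placeholder names, and the sketch assumes partial transcripts are append-only (real partials may revise earlier words).

```python
class PartialPromptFeeder:
    """Accumulates words from partial transcripts and forwards only the new ones,
    so a (hypothetical) LLM client can pre-compute prompt tokens during speech."""

    def __init__(self, send_to_llm):
        # send_to_llm: callable forwarding text to whatever prefill/streaming API the bot uses.
        self._send_to_llm = send_to_llm
        self._sent_count = 0

    def on_partial(self, words: list[dict]) -> None:
        """Called for each partial transcript; forwards words not yet sent."""
        # Assumes append-only partials; a revision-aware version would diff instead.
        new_words = words[self._sent_count:]
        if new_words:
            self._send_to_llm(" ".join(w["word"] for w in new_words))
            self._sent_count = len(words)

    def on_final(self, text: str) -> str:
        """Called on the final transcript; resets state and returns the full prompt."""
        self._sent_count = 0
        return text


if __name__ == "__main__":
    feeder = PartialPromptFeeder(send_to_llm=lambda chunk: print("prefill:", chunk))
    feeder.on_partial([{"word": "hello"}])
    feeder.on_partial([{"word": "hello"}, {"word": "world"}])
    print("final prompt:", feeder.on_final("hello world"))
```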