# NVIDIA Parakeet Migration

## Summary

Replaced Faster-Whisper with NVIDIA Parakeet TDT (Token-and-Duration Transducer) for real-time speech transcription.

## Changes Made

### 1. New Transcriber: `parakeet_transcriber.py`

- **Model**: `nvidia/parakeet-tdt-0.6b-v3` (600M parameters)
- **Features**:
  - Real-time streaming transcription
  - Word-level timestamps for LLM pre-computation
  - GPU-accelerated (CUDA)
  - Lower latency than Faster-Whisper
  - Native PyTorch (no CTranslate2 dependency)
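
For orientation, here is a minimal sketch of the audio path such a transcriber has to implement, assuming 16-bit mono PCM input and `torchaudio` for resampling. The function name and sample rates are illustrative assumptions, not the actual contents of `parakeet_transcriber.py`:

```python
# Hypothetical sketch: convert raw 16-bit PCM into the 16 kHz float32
# waveform an ASR model expects. Names and rates are assumptions.
import torch
import torchaudio.functional as F

MODEL_SAMPLE_RATE = 16_000  # Parakeet models consume 16 kHz mono audio

def prepare_chunk(pcm: bytes, source_rate: int = 48_000) -> torch.Tensor:
    """Convert int16 PCM bytes to a float32 waveform at 16 kHz."""
    # int16 -> float32 in [-1.0, 1.0]; bytearray avoids a read-only buffer warning
    samples = torch.frombuffer(bytearray(pcm), dtype=torch.int16).float() / 32768.0
    # Downsample from the capture rate (e.g. 48 kHz voice audio) to 16 kHz
    return F.resample(samples, orig_freq=source_rate, new_freq=MODEL_SAMPLE_RATE)
```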

### 2. Requirements Updated

**Removed**:

- `faster-whisper==1.2.1`
- `ctranslate2==4.5.0`

**Added**:

- `transformers==4.47.1` - HuggingFace model loading
- `accelerate==1.2.1` - GPU optimization
- `sentencepiece==0.2.0` - Tokenization

**Kept**:

- `torch==2.9.1` & `torchaudio==2.9.1` - Core ML framework
- `silero-vad==5.1.2` - VAD still uses Silero (CPU)

### 3. Server Updates: `stt_server.py`

**Changes**:

- Import `ParakeetTranscriber` instead of `WhisperTranscriber`
- Partial transcripts now include `words` array with timestamps
- Final transcripts include `words` array for LLM pre-computation
- Startup logs show "Loading NVIDIA Parakeet TDT model"

**Word-level Token Format**:

```json
{
  "type": "partial",
  "text": "hello world",
  "words": [
    {"word": "hello", "start_time": 0.0, "end_time": 0.5},
    {"word": "world", "start_time": 0.5, "end_time": 1.0}
  ],
  "user_id": "123",
  "timestamp": 1234.56
}
```
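
On the server side, emitting a message in this shape might look like the following sketch. It assumes the server uses the `websockets` package; the helper function itself is illustrative, not the actual `stt_server.py` code:

```python
# Sketch: push a partial transcript to a connected client.
# Assumes a `websockets` connection object; helper name is hypothetical.
import json

async def send_partial(ws, text, words, user_id, timestamp):
    # `words` is a list of {"word", "start_time", "end_time"} dicts,
    # matching the format shown above.
    await ws.send(json.dumps({
        "type": "partial",
        "text": text,
        "words": words,
        "user_id": user_id,
        "timestamp": timestamp,
    }))
```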

## Advantages Over Faster-Whisper

1. **Real-time Performance**: TDT architecture designed for streaming
2. **No cuDNN Issues**: Native PyTorch, no CTranslate2 library loading problems
3. **Word-level Tokens**: Enables LLM prompt pre-computation during speech
4. **Lower Latency**: Optimized for real-time use cases
5. **Better GPU Utilization**: Uses standard PyTorch CUDA
6. **Simpler Dependencies**: No external compiled libraries

## Deployment

1. **Build Container**:

   ```bash
   docker-compose build miku-stt
   ```

2. **First Run** (downloads model, ~600MB):

   ```bash
   docker-compose up miku-stt
   ```

   The model will be cached in the `/models` volume for subsequent runs.

3. **Verify GPU Usage**:

   ```bash
   docker exec miku-stt nvidia-smi
   ```

   You should see a `python3` process using VRAM (~1.5GB for model + inference).

## Testing

Same test procedure as before:

1. Join voice channel
2. `!miku listen`
3. Speak clearly
4. Check logs for "Parakeet model loaded"
5. Verify transcripts appear faster than before

## Bot-Side Compatibility

No changes needed to bot code - STT WebSocket protocol is identical. The bot will automatically receive word-level tokens in partial/final transcript messages.
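
As an illustration of that compatibility, a bot-side handler can treat the `words` array as optional, so the same code works against both the old and new servers. This is a sketch; the handler name is hypothetical:

```python
# Sketch of bot-side message handling -- tolerant of servers that
# do or do not send the new "words" array. Name is hypothetical.
import json

def handle_transcript(raw: str) -> None:
    msg = json.loads(raw)
    if msg.get("type") not in ("partial", "final"):
        return
    # Older servers omit "words"; default to an empty list.
    words = msg.get("words", [])
    print(f"[{msg['type']}] {msg['text']} ({len(words)} timed words)")
```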

### Future Enhancement: LLM Pre-computation

The `words` array can be used to start LLM inference before the full transcript completes:

- Send partial words to the LLM as they arrive
- The LLM begins processing prompt tokens
- Faster response time when the user finishes speaking
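
A rough sketch of that flow, assuming each partial message carries the full word list so far; `llm.prefill()` and `llm.complete()` are hypothetical stand-ins for whatever LLM client the bot uses:

```python
# Illustrative only: feed words to the LLM as they arrive so most of
# the prompt is already processed when the final transcript lands.
async def precompute_prompt(stt_messages, llm):
    sent = 0  # number of words already forwarded to the LLM
    async for msg in stt_messages:
        words = [w["word"] for w in msg.get("words", [])]
        if msg["type"] == "partial" and len(words) > sent:
            # Assumes each partial contains all words so far (hypothetical)
            await llm.prefill(" ".join(words[sent:]))
            sent = len(words)
        elif msg["type"] == "final":
            return await llm.complete()  # prompt tokens mostly pre-computed
```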

## Rollback (if needed)

To revert to Faster-Whisper:

1. Restore `requirements.txt` from git
2. Restore `stt_server.py` from git
3. Delete `parakeet_transcriber.py`
4. Rebuild container

## Performance Expectations

- **Model Load Time**: ~5-10 seconds (the first run also downloads the model from HuggingFace)
- **VRAM Usage**: ~1.5GB (vs ~800MB for Whisper small)
- **Latency**: ~200-500ms for 2-second audio chunks
- **GPU Utilization**: 30-60% during active transcription
- **Accuracy**: Similar to Whisper small (designed for English)