# NVIDIA Parakeet Migration

## Summary

Replaced Faster-Whisper with NVIDIA Parakeet TDT (Token-and-Duration Transducer) for real-time speech transcription.

## Changes Made

### 1. New Transcriber: `parakeet_transcriber.py`

- **Model**: `nvidia/parakeet-tdt-0.6b-v3` (600M parameters)
- **Features**:
  - Real-time streaming transcription
  - Word-level timestamps for LLM pre-computation
  - GPU-accelerated (CUDA)
  - Lower latency than Faster-Whisper
  - Native PyTorch (no CTranslate2 dependency)
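
For orientation, here is a minimal sketch of the audio path such a transcriber has to implement, assuming 16-bit mono PCM input and `torchaudio` for resampling. The function name and sample rates are illustrative assumptions, not the actual contents of `parakeet_transcriber.py`:

```python
# Hypothetical sketch: convert raw 16-bit PCM into the 16 kHz float32
# waveform an ASR model expects. Names and rates are assumptions.
import torch
import torchaudio.functional as F

MODEL_SAMPLE_RATE = 16_000  # Parakeet models consume 16 kHz mono audio

def prepare_chunk(pcm: bytes, source_rate: int = 48_000) -> torch.Tensor:
    """Convert int16 PCM bytes to a float32 waveform at 16 kHz."""
    # int16 -> float32 in [-1.0, 1.0]; bytearray avoids a read-only buffer warning
    samples = torch.frombuffer(bytearray(pcm), dtype=torch.int16).float() / 32768.0
    # Downsample from the capture rate (e.g. 48 kHz voice audio) to 16 kHz
    return F.resample(samples, orig_freq=source_rate, new_freq=MODEL_SAMPLE_RATE)
```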

### 2. Requirements Updated

**Removed**:

- `faster-whisper==1.2.1`
- `ctranslate2==4.5.0`

**Added**:

- `transformers==4.47.1` - HuggingFace model loading
- `accelerate==1.2.1` - GPU optimization
- `sentencepiece==0.2.0` - Tokenization

**Kept**:

- `torch==2.9.1` & `torchaudio==2.9.1` - Core ML framework
- `silero-vad==5.1.2` - VAD still uses Silero (CPU)

### 3. Server Updates: `stt_server.py`

**Changes**:

- Import `ParakeetTranscriber` instead of `WhisperTranscriber`
- Partial transcripts now include `words` array with timestamps
- Final transcripts include `words` array for LLM pre-computation
- Startup logs show "Loading NVIDIA Parakeet TDT model"

**Word-level Token Format**:

```json
{
  "type": "partial",
  "text": "hello world",
  "words": [
    {"word": "hello", "start_time": 0.0, "end_time": 0.5},
    {"word": "world", "start_time": 0.5, "end_time": 1.0}
  ],
  "user_id": "123",
  "timestamp": 1234.56
}
```
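
On the server side, emitting a message in this shape might look like the following sketch. It assumes the server uses the `websockets` package; the helper function itself is illustrative, not the actual `stt_server.py` code:

```python
# Sketch: push a partial transcript to a connected client.
# Assumes a `websockets` connection object; helper name is hypothetical.
import json

async def send_partial(ws, text, words, user_id, timestamp):
    # `words` is a list of {"word", "start_time", "end_time"} dicts,
    # matching the format shown above.
    await ws.send(json.dumps({
        "type": "partial",
        "text": text,
        "words": words,
        "user_id": user_id,
        "timestamp": timestamp,
    }))
```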

## Advantages Over Faster-Whisper

1. **Real-time Performance**: TDT architecture designed for streaming
2. **No cuDNN Issues**: Native PyTorch, no CTranslate2 library loading problems
3. **Word-level Tokens**: Enables LLM prompt pre-computation during speech
4. **Lower Latency**: Optimized for real-time use cases
5. **Better GPU Utilization**: Uses standard PyTorch CUDA
6. **Simpler Dependencies**: No external compiled libraries

## Deployment

1. **Build Container**:

   ```bash
   docker-compose build miku-stt
   ```

2. **First Run** (downloads model, ~600MB):

   ```bash
   docker-compose up miku-stt
   ```

   The model will be cached in the `/models` volume for subsequent runs.

3. **Verify GPU Usage**:

   ```bash
   docker exec miku-stt nvidia-smi
   ```

   You should see a `python3` process using VRAM (~1.5GB for model + inference).

## Testing

Same test procedure as before:

1. Join voice channel
2. `!miku listen`
3. Speak clearly
4. Check logs for "Parakeet model loaded"
5. Verify transcripts appear faster than before

## Bot-Side Compatibility

No changes needed to bot code - STT WebSocket protocol is identical. The bot will automatically receive word-level tokens in partial/final transcript messages.
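
As an illustration of that compatibility, a bot-side handler can treat the `words` array as optional, so the same code works against both the old and new servers. This is a sketch; the handler name is hypothetical:

```python
# Sketch of bot-side message handling -- tolerant of servers that
# do or do not send the new "words" array. Name is hypothetical.
import json

def handle_transcript(raw: str) -> None:
    msg = json.loads(raw)
    if msg.get("type") not in ("partial", "final"):
        return
    # Older servers omit "words"; default to an empty list.
    words = msg.get("words", [])
    print(f"[{msg['type']}] {msg['text']} ({len(words)} timed words)")
```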

### Future Enhancement: LLM Pre-computation

The `words` array can be used to start LLM inference before the full transcript completes:

- Send partial words to the LLM as they arrive
- The LLM begins processing prompt tokens
- Faster response time when the user finishes speaking
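
A rough sketch of that flow, assuming each partial message carries the full word list so far; `llm.prefill()` and `llm.complete()` are hypothetical stand-ins for whatever LLM client the bot uses:

```python
# Illustrative only: feed words to the LLM as they arrive so most of
# the prompt is already processed when the final transcript lands.
async def precompute_prompt(stt_messages, llm):
    sent = 0  # number of words already forwarded to the LLM
    async for msg in stt_messages:
        words = [w["word"] for w in msg.get("words", [])]
        if msg["type"] == "partial" and len(words) > sent:
            # Assumes each partial contains all words so far (hypothetical)
            await llm.prefill(" ".join(words[sent:]))
            sent = len(words)
        elif msg["type"] == "final":
            return await llm.complete()  # prompt tokens mostly pre-computed
```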

## Rollback (if needed)

To revert to Faster-Whisper:

1. Restore `requirements.txt` from git
2. Restore `stt_server.py` from git
3. Delete `parakeet_transcriber.py`
4. Rebuild container

## Performance Expectations

- **Model Load Time**: ~5-10 seconds (the first run also downloads the model from HuggingFace)
- **VRAM Usage**: ~1.5GB (vs ~800MB for Whisper small)
- **Latency**: ~200-500ms for 2-second audio chunks
- **GPU Utilization**: 30-60% during active transcription
- **Accuracy**: Similar to Whisper small (designed for English)