# Miku STT (Speech-to-Text) Server

Real-time speech-to-text service for Miku voice chat using Silero VAD (CPU) and Faster-Whisper (GPU).

## Architecture

- **Silero VAD** (CPU): Lightweight voice activity detection, runs continuously
- **Faster-Whisper** (GPU, GTX 1660): Efficient speech transcription using CTranslate2
- **FastAPI WebSocket**: Real-time bidirectional communication

## Features

- ✅ Real-time voice activity detection with conservative settings
- ✅ Streaming partial transcripts during speech
- ✅ Final transcript on speech completion
- ✅ Interruption detection (user speaking over Miku)
- ✅ Multi-user support with isolated sessions
- ✅ KV cache optimization ready (partial text available for LLM precomputation)

## API Endpoints

### WebSocket: `/ws/stt/{user_id}`

Real-time STT session for a specific user.

**Client sends:** Raw PCM audio (int16, 16kHz mono, 20ms chunks = 320 samples)

**Server sends:** JSON events:

```json
// VAD events
{"type": "vad", "event": "speech_start", "speaking": true, "probability": 0.85, "timestamp": 1250.5}
{"type": "vad", "event": "speaking", "speaking": true, "probability": 0.92, "timestamp": 1270.5}
{"type": "vad", "event": "speech_end", "speaking": false, "probability": 0.35, "timestamp": 3500.0}

// Transcription events
{"type": "partial", "text": "Hello how are", "user_id": "123", "timestamp": 2000.0}
{"type": "final", "text": "Hello how are you?", "user_id": "123", "timestamp": 3500.0}

// Interruption detection
{"type": "interruption", "probability": 0.92, "timestamp": 1500.0}
```

### HTTP GET: `/health`

Health check with model status.

**Response:**

```json
{
  "status": "healthy",
  "models": {
    "vad": {"loaded": true, "device": "cpu"},
    "whisper": {"loaded": true, "model": "small", "device": "cuda"}
  },
  "sessions": {
    "active": 2,
    "users": ["user123", "user456"]
  }
}
```

## Configuration

### VAD Parameters (Conservative)

- **Threshold**: 0.5 (speech probability)
- **Min speech duration**: 250ms (avoid false triggers)
- **Min silence duration**: 500ms (don't cut off mid-sentence)
- **Speech padding**: 30ms (context around speech)

### Whisper Parameters

- **Model**: small (balanced speed/quality, ~500MB VRAM)
- **Compute**: float16 (GPU optimization)
- **Language**: en (English)
- **Beam size**: 5 (quality/speed balance)

## Usage Example

```python
import asyncio

import numpy as np
import websockets


async def stream_audio():
    uri = "ws://localhost:8001/ws/stt/user123"
    async with websockets.connect(uri) as websocket:
        # Wait for the server's ready message
        ready = await websocket.recv()
        print(ready)

        # audio_stream: iterable of 20ms, 16kHz mono chunks (320 samples each),
        # supplied by your audio capture code
        for audio_chunk in audio_stream:
            # Convert to raw int16 PCM bytes
            audio_bytes = audio_chunk.astype(np.int16).tobytes()
            await websocket.send(audio_bytes)

            # Receive VAD/transcription events
            event = await websocket.recv()
            print(event)

asyncio.run(stream_audio())
```

## Docker Setup

### Build

```bash
docker-compose build miku-stt
```

### Run

```bash
docker-compose up -d miku-stt
```

### Logs

```bash
docker-compose logs -f miku-stt
```

### Test

```bash
curl http://localhost:8001/health
```

## GPU Sharing with Soprano

Both STT (Whisper) and TTS (Soprano) run on the GTX 1660, but at different times:

1. **User speaking** → Whisper active, Soprano idle
2. **LLM processing** → Both idle
3. **Miku speaking** → Soprano active, Whisper idle (VAD monitoring only)

Interruption detection runs VAD continuously but doesn't use the GPU.
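
As a sketch of how a client might dispatch the events from `/ws/stt/{user_id}`, the handler below parses each JSON message and pauses Miku's playback when an `interruption` event arrives. `pause_miku_playback()` is a hypothetical hook into the TTS/playback side and is not part of this service; in the usage example above, this handler would replace the `print(event)` call inside the receive loop.

```python
import json


async def pause_miku_playback():
    """Hypothetical hook into the TTS/playback side; not part of this service."""
    ...


async def handle_event(message: str):
    """Dispatch one JSON event received from the /ws/stt/{user_id} socket."""
    event = json.loads(message)

    if event["type"] == "vad":
        # speech_start / speaking / speech_end, with probability and timestamp
        print(f"VAD: {event['event']} (p={event['probability']:.2f})")
    elif event["type"] == "partial":
        # Streaming partial transcript; could be fed to the LLM for KV cache warm-up
        print(f"partial: {event['text']}")
    elif event["type"] == "final":
        print(f"final: {event['text']}")
    elif event["type"] == "interruption":
        # User is speaking over Miku -> stop TTS playback
        await pause_miku_playback()
```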
## Performance

- **VAD latency**: 10-20ms per chunk (CPU)
- **Whisper latency**: ~1-2s for 2s of audio (GPU)
- **Memory usage**:
  - Silero VAD: ~100MB (CPU)
  - Faster-Whisper small: ~500MB (GPU VRAM)

## Future Improvements

- [ ] Multi-language support (auto-detect)
- [ ] Word-level timestamps for better sync
- [ ] Custom vocabulary/prompt tuning
- [ ] Speaker diarization (multiple speakers)
- [ ] Noise suppression preprocessing