# Miku STT (Speech-to-Text) Server
Real-time speech-to-text service for Miku voice chat using Silero VAD (CPU) and Faster-Whisper (GPU).

## Architecture

- Silero VAD (CPU): Lightweight voice activity detection, runs continuously
- Faster-Whisper (GPU GTX 1660): Efficient speech transcription using CTranslate2
- FastAPI WebSocket: Real-time bidirectional communication
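
A minimal sketch of how these pieces could sit behind the WebSocket endpoint (illustrative only, not the server's actual code; the shape of the ready message and the in-handler comments are assumptions):

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/stt/{user_id}")
async def stt_session(websocket: WebSocket, user_id: str):
    await websocket.accept()
    # Tell the client the session is ready (message shape is illustrative)
    await websocket.send_json({"type": "ready", "user_id": user_id})
    try:
        while True:
            # 20 ms of int16 PCM at 16 kHz = 320 samples = 640 bytes
            pcm = await websocket.receive_bytes()
            # Feed the chunk to Silero VAD (CPU); while speech is active, buffer
            # audio and run Faster-Whisper (GPU) to emit partial/final events.
    except WebSocketDisconnect:
        pass  # drop this user's session state
```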

## Features

- ✅ Real-time voice activity detection with conservative settings
- ✅ Streaming partial transcripts during speech
- ✅ Final transcript on speech completion
- ✅ Interruption detection (user speaking over Miku)
- ✅ Multi-user support with isolated sessions
- ✅ KV cache optimization ready (partial text for LLM precomputation)

## API Endpoints

### WebSocket: `/ws/stt/{user_id}`

Real-time STT session for a specific user.

**Client sends:** raw PCM audio (int16, 16kHz mono, 20ms chunks = 320 samples)

**Server sends:** JSON events:

```jsonc
// VAD events
{"type": "vad", "event": "speech_start", "speaking": true, "probability": 0.85, "timestamp": 1250.5}
{"type": "vad", "event": "speaking", "speaking": true, "probability": 0.92, "timestamp": 1270.5}
{"type": "vad", "event": "speech_end", "speaking": false, "probability": 0.35, "timestamp": 3500.0}

// Transcription events
{"type": "partial", "text": "Hello how are", "user_id": "123", "timestamp": 2000.0}
{"type": "final", "text": "Hello how are you?", "user_id": "123", "timestamp": 3500.0}

// Interruption detection
{"type": "interruption", "probability": 0.92, "timestamp": 1500.0}
```

### HTTP GET: `/health`

Health check with model status.

**Response:**

```json
{
  "status": "healthy",
  "models": {
    "vad": {"loaded": true, "device": "cpu"},
    "whisper": {"loaded": true, "model": "small", "device": "cuda"}
  },
  "sessions": {
    "active": 2,
    "users": ["user123", "user456"]
  }
}
```

## Configuration

### VAD Parameters (Conservative)

- Threshold: 0.5 (speech probability)
- Min speech duration: 250ms (avoid false triggers)
- Min silence duration: 500ms (don't cut off mid-sentence)
- Speech padding: 30ms (context around speech)
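
A sketch of how these settings could map onto Silero VAD (the settings dict and helper below are illustrative, not the server's actual config schema; recent Silero releases expect 512-sample windows at 16 kHz, so the 20 ms client chunks would be buffered first):

```python
import torch

# Illustrative mirror of the conservative settings listed above
VAD_SETTINGS = {
    "threshold": 0.5,
    "min_speech_duration_ms": 250,
    "min_silence_duration_ms": 500,
    "speech_pad_ms": 30,
}

# Silero VAD runs on CPU and returns a per-window speech probability
vad_model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

def is_speech(window: torch.Tensor) -> bool:
    # window: float32 samples in [-1, 1] at 16 kHz
    probability = vad_model(window, 16000).item()
    return probability >= VAD_SETTINGS["threshold"]
```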

### Whisper Parameters

- Model: small (balanced speed/quality, ~500MB VRAM)
- Compute: float16 (GPU optimization)
- Language: en (English)
- Beam size: 5 (quality/speed balance)
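
These parameters map directly onto faster-whisper's `WhisperModel`; a minimal sketch:

```python
from faster_whisper import WhisperModel

# Loaded once at startup; "small" in float16 fits comfortably in GTX 1660 VRAM
whisper = WhisperModel("small", device="cuda", compute_type="float16")

def transcribe(audio_f32):
    # audio_f32: 1-D float32 numpy array at 16 kHz, values in [-1, 1]
    segments, _info = whisper.transcribe(audio_f32, language="en", beam_size=5)
    return " ".join(segment.text.strip() for segment in segments)
```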

## Usage Example

```python
import asyncio

import numpy as np
import websockets

async def stream_audio(audio_stream):
    """Stream 20 ms int16 chunks (16 kHz mono) and print server events."""
    uri = "ws://localhost:8001/ws/stt/user123"
    async with websockets.connect(uri) as websocket:
        # Wait for the ready message
        ready = await websocket.recv()
        print(ready)

        # Stream audio chunks (16 kHz, 20 ms each)
        for audio_chunk in audio_stream:
            # Convert to raw bytes (int16)
            audio_bytes = audio_chunk.astype(np.int16).tobytes()
            await websocket.send(audio_bytes)

            # Print any events that have arrived, without blocking the audio loop
            # (the server only sends events on VAD changes and transcripts)
            try:
                while True:
                    event = await asyncio.wait_for(websocket.recv(), timeout=0.001)
                    print(event)
            except asyncio.TimeoutError:
                pass

def silence_chunks(seconds=2.0):
    # Placeholder audio source: yields 20 ms of silence per chunk;
    # replace with a real microphone/capture stream.
    for _ in range(int(seconds * 50)):
        yield np.zeros(320, dtype=np.int16)

asyncio.run(stream_audio(silence_chunks()))
```

## Docker Setup

### Build

```bash
docker-compose build miku-stt
```

### Run

```bash
docker-compose up -d miku-stt
```

### Logs

```bash
docker-compose logs -f miku-stt
```

### Test

```bash
curl http://localhost:8001/health
```

## GPU Sharing with Soprano

Both STT (Whisper) and TTS (Soprano) run on the GTX 1660, but at different times:
- User speaking → Whisper active, Soprano idle
- LLM processing → Both idle
- Miku speaking → Soprano active, Whisper idle (VAD monitoring only)
Interruption detection runs VAD continuously, but VAD does not use the GPU.
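
One simple way to enforce this time-sharing is a single async lock around the GPU-bound calls; a sketch under that assumption (not necessarily the server's actual mechanism; `transcribe` is the helper sketched above and `synthesize_and_play` is a hypothetical TTS call):

```python
import asyncio

# One lock serializes GPU use between Whisper (STT) and Soprano (TTS)
gpu_lock = asyncio.Lock()

async def transcribe_on_gpu(audio):
    async with gpu_lock:  # Whisper holds the GPU
        # Run the blocking faster-whisper call off the event loop
        return await asyncio.to_thread(transcribe, audio)

async def speak_on_gpu(text):
    async with gpu_lock:  # Soprano holds the GPU
        await synthesize_and_play(text)  # hypothetical TTS call
```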

## Performance

- VAD latency: 10-20ms per chunk (CPU)
- Whisper latency: ~1-2s for 2s audio (GPU)
- Memory usage:
  - Silero VAD: ~100MB (CPU)
  - Faster-Whisper small: ~500MB (GPU VRAM)

## Future Improvements

- Multi-language support (auto-detect)
- Word-level timestamps for better sync
- Custom vocabulary/prompt tuning
- Speaker diarization (multiple speakers)
- Noise suppression preprocessing