
Miku STT (Speech-to-Text) Server

Real-time speech-to-text service for Miku voice chat using Silero VAD (CPU) and Faster-Whisper (GPU).

Architecture

  • Silero VAD (CPU): Lightweight voice activity detection, runs continuously
  • Faster-Whisper (GPU, GTX 1660): Efficient speech transcription using CTranslate2
  • FastAPI WebSocket: Real-time bidirectional communication

Features

  • Real-time voice activity detection with conservative settings
  • Streaming partial transcripts during speech
  • Final transcript on speech completion
  • Interruption detection (user speaking over Miku)
  • Multi-user support with isolated sessions
  • KV-cache optimization ready (partial transcripts can be sent to the LLM for context precomputation)

API Endpoints

WebSocket: /ws/stt/{user_id}

Real-time STT session for a specific user.

Client sends: Raw PCM audio (int16, 16kHz mono, 20ms chunks = 320 samples)

Server sends: JSON events such as the following:

// VAD events
{"type": "vad", "event": "speech_start", "speaking": true, "probability": 0.85, "timestamp": 1250.5}
{"type": "vad", "event": "speaking", "speaking": true, "probability": 0.92, "timestamp": 1270.5}
{"type": "vad", "event": "speech_end", "speaking": false, "probability": 0.35, "timestamp": 3500.0}

// Transcription events
{"type": "partial", "text": "Hello how are", "user_id": "123", "timestamp": 2000.0}
{"type": "final", "text": "Hello how are you?", "user_id": "123", "timestamp": 3500.0}

// Interruption detection
{"type": "interruption", "probability": 0.92, "timestamp": 1500.0}

HTTP GET: /health

Health check with model status.

Response:

{
  "status": "healthy",
  "models": {
    "vad": {"loaded": true, "device": "cpu"},
    "whisper": {"loaded": true, "model": "small", "device": "cuda"}
  },
  "sessions": {
    "active": 2,
    "users": ["user123", "user456"]
  }
}
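
For a programmatic readiness check before opening STT sessions, the same endpoint can be polled and parsed; a minimal sketch using only the standard library:

import json
import urllib.request

def stt_ready(base_url: str = "http://localhost:8001") -> bool:
    # True once the service reports healthy and both models are loaded
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        health = json.load(resp)
    models = health.get("models", {})
    return (health.get("status") == "healthy"
            and models.get("vad", {}).get("loaded", False)
            and models.get("whisper", {}).get("loaded", False))

print(stt_ready())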

Configuration

VAD Parameters (Conservative)

  • Threshold: 0.5 (speech probability)
  • Min speech duration: 250ms (avoid false triggers)
  • Min silence duration: 500ms (don't cut off mid-sentence)
  • Speech padding: 30ms (context around speech); see the sketch below
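
These are the standard Silero VAD knobs. A minimal offline sketch of how they map onto the snakers4/silero-vad utilities (the server applies equivalent settings to its streaming VAD; this is illustrative, not the server's code, and the file name is a placeholder):

import torch

# Load Silero VAD on CPU (lightweight, ~100MB)
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

audio = read_audio('sample.wav', sampling_rate=16000)

# Conservative settings matching the values above
speech_segments = get_speech_timestamps(
    audio, model,
    threshold=0.5,                 # speech probability threshold
    min_speech_duration_ms=250,    # avoid false triggers
    min_silence_duration_ms=500,   # don't cut off mid-sentence
    speech_pad_ms=30,              # context around speech
    sampling_rate=16000,
)
print(speech_segments)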

Whisper Parameters

  • Model: small (balanced speed/quality, ~500MB VRAM)
  • Compute: float16 (GPU optimization)
  • Language: en (English)
  • Beam size: 5 (quality/speed balance); see the sketch below
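
These settings map directly onto faster-whisper's WhisperModel arguments; a minimal sketch (the audio file name is a placeholder):

from faster_whisper import WhisperModel

# "small" model on the GTX 1660, ~500MB VRAM in float16
model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "utterance.wav",
    language="en",   # skip language auto-detection
    beam_size=5,     # quality/speed balance
)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")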

Usage Example

import asyncio

import numpy as np
import websockets

async def stream_audio():
    uri = "ws://localhost:8001/ws/stt/user123"

    async with websockets.connect(uri) as websocket:
        # Wait for the ready message before streaming
        ready = await websocket.recv()
        print(ready)

        async def receive_events():
            # Print VAD / transcription events as they arrive
            async for event in websocket:
                print(event)

        receiver = asyncio.create_task(receive_events())

        # Stream audio chunks (16kHz mono, 20ms = 320 samples each).
        # Replace this silent placeholder with real microphone audio.
        audio_stream = [np.zeros(320, dtype=np.int16) for _ in range(250)]

        for audio_chunk in audio_stream:
            # Convert to raw int16 bytes as expected by the server
            audio_bytes = audio_chunk.astype(np.int16).tobytes()
            await websocket.send(audio_bytes)
            await asyncio.sleep(0.02)  # pace sends in real time (20ms per chunk)

        # Give the final transcript a moment to arrive before closing
        await asyncio.sleep(1.0)
        receiver.cancel()

asyncio.run(stream_audio())

Docker Setup

Build

docker-compose build miku-stt

Run

docker-compose up -d miku-stt

Logs

docker-compose logs -f miku-stt

Test

curl http://localhost:8001/health

GPU Sharing with Soprano

Both STT (Whisper) and TTS (Soprano) run on the GTX 1660, but at different times:

  1. User speaking → Whisper active, Soprano idle
  2. LLM processing → Both idle
  3. Miku speaking → Soprano active, Whisper idle (VAD monitoring only)

Interruption detection keeps VAD running continuously, but VAD stays on the CPU and does not use the GPU.

Performance

  • VAD latency: 10-20ms per chunk (CPU)
  • Whisper latency: ~1-2s for 2s audio (GPU)
  • Memory usage:
    • Silero VAD: ~100MB (CPU)
    • Faster-Whisper small: ~500MB (GPU VRAM)

Future Improvements

  • Multi-language support (auto-detect)
  • Word-level timestamps for better sync
  • Custom vocabulary/prompt tuning
  • Speaker diarization (multiple speakers)
  • Noise suppression preprocessing