refactor: Implement low-latency STT pipeline with speculative transcription

Major architectural overhaul of the speech-to-text pipeline for real-time voice chat:

STT Server Rewrite:
- Replaced RealtimeSTT dependency with direct Silero VAD + Faster-Whisper integration
- Achieved sub-second latency by eliminating unnecessary abstractions
- Uses the small.en Whisper model for fast transcription (~850ms); a loading sketch follows below
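
For reference, loading and invoking the shared Faster-Whisper model could look like the minimal sketch below. The device/compute_type choices and the transcribe() helper name are illustrative assumptions, not lifted from stt_server.py:

    from faster_whisper import WhisperModel

    # Loaded once at startup and shared across sessions (see Architecture below).
    # device/compute_type are assumptions; int8 on CPU is a common fallback.
    model = WhisperModel("small.en", device="cuda", compute_type="float16")

    def transcribe(audio_f32):
        # audio_f32: mono 16 kHz float32 numpy array
        segments, _info = model.transcribe(audio_f32, language="en", beam_size=1)
        return "".join(seg.text for seg in segments).strip()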

Speculative Transcription (NEW):
- Start transcribing at 150ms of silence (speculative) while still listening
- If speech resumes, discard the speculative result and keep buffering
- If 400ms of silence is confirmed, use the pre-computed speculative result immediately
- Reduces latency by ~250-850ms for typical utterances with clear pauses; see the sketch after this list
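
A minimal sketch of that state machine, assuming 32ms VAD chunks and hypothetical submit()/emit() helpers (submit() hands audio to the background worker and returns a Future; emit() delivers the final text):

    SPEC_MS, FINAL_MS, CHUNK_MS = 150, 400, 32   # thresholds from this commit

    silence_ms = 0
    spec_future = None   # in-flight speculative transcription, if any

    def on_chunk(chunk, is_speech, buffer, submit, emit):
        global silence_ms, spec_future
        buffer.extend(chunk)
        if is_speech:
            silence_ms = 0
            spec_future = None        # speech resumed -> speculative result is stale
            return
        silence_ms += CHUNK_MS
        if silence_ms >= SPEC_MS and spec_future is None:
            spec_future = submit(bytes(buffer))    # start transcribing early
        if silence_ms >= FINAL_MS:
            # spec_future is still set only if no speech arrived after it started,
            # so the speculative text is valid and usually already computed.
            final = spec_future or submit(bytes(buffer))
            emit(final.result())
            buffer.clear(); silence_ms = 0; spec_future = None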

VAD Implementation:
- Silero VAD with ONNX (CPU-efficient) processing 32ms (512-sample) chunks
- Direct speech-boundary detection without RealtimeSTT overhead
- Configurable silence-detection thresholds (400ms final, 150ms speculative); VAD loop sketched below
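
A sketch of the per-chunk VAD gate, assuming the standard torch.hub entry point for Silero VAD (the 0.5 probability threshold is an assumed default, not taken from the server):

    import torch

    # onnx=True selects the ONNX build of Silero VAD, which is efficient on CPU.
    vad_model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad", onnx=True)

    SAMPLE_RATE = 16000
    CHUNK_SAMPLES = 512              # 512 samples @ 16 kHz = 32 ms

    def is_speech(chunk_f32, threshold=0.5):
        # chunk_f32: float32 numpy array of exactly CHUNK_SAMPLES samples
        prob = vad_model(torch.from_numpy(chunk_f32), SAMPLE_RATE).item()
        return prob >= threshold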

Architecture:
- Single Whisper model loaded once, shared across sessions
- VAD runs on every 512-sample chunk for immediate speech detection
- Background transcription worker thread for non-blocking processing; a queue-based sketch follows this list
- Greedy decoding (beam_size=1) for maximum speed
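
One plausible shape for that worker, wiring up the submit() helper assumed in the speculative sketch above (the queue/Future plumbing and the pcm_to_float32() converter are illustrative, not the actual stt_server.py code):

    import queue
    import threading
    from concurrent.futures import Future

    _jobs = queue.Queue()            # (audio_bytes, Future) pairs

    def submit(audio_bytes):
        fut = Future()
        _jobs.put((audio_bytes, fut))
        return fut

    def _worker():
        while True:
            audio_bytes, fut = _jobs.get()
            # pcm_to_float32() is a hypothetical int16-PCM -> float32 converter;
            # transcribe() is the shared-model helper sketched earlier.
            fut.set_result(transcribe(pcm_to_float32(audio_bytes)))

    threading.Thread(target=_worker, daemon=True).start()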

Performance:
- Previous: 400ms silence wait + ~850ms transcription = ~1.25s total latency
- Current: 400ms silence wait + ~0ms extra (speculative result already computed) = ~400ms total (best case)
- With transcription time T, latency drops from 400ms + T to roughly max(400ms, 150ms + T): ~250ms saved at T ≈ 850ms, up to ~850ms saved when T fits inside the 250ms confirmation window, matching the ~250-850ms range above
- Single shared model reduces VRAM usage and prevents OOM on the GTX 1660

Container Manager Updates:
- Updated health check logic to work with new response format
- Changed from checking the 'warmed_up' flag to checking only 'status': 'ready'
- Improved terminology from 'warmup' to 'models loading'
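
For context, the response-shape change the health check now keys on (the old server reported a separate warmed_up flag; the exact not-ready payload of the new server is an assumption here):

    Old: {"status": "ready", "warmed_up": true}
    New: {"status": "ready"}    # anything else (e.g. {"status": "loading"}) counts as not ready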

Files Changed:
- stt-realtime/stt_server.py: Complete rewrite with Silero VAD + speculative transcription
- stt-realtime/requirements.txt: Removed RealtimeSTT, using torch.hub for Silero VAD
- bot/utils/container_manager.py: Updated health check for new STT response format
- bot/api.py: Updated docstring to reflect new architecture
- backups/: Archived old RealtimeSTT-based implementation

This addresses the low-latency requirements of real-time voice chat while maintaining
accuracy through configurable speech-detection thresholds.
2026-01-22 22:08:07 +02:00
parent 2934efba22
commit eb03dfce4d
5 changed files with 850 additions and 400 deletions

bot/utils/container_manager.py

@@ -1,7 +1,7 @@
 # container_manager.py
 """
 Manages Docker containers for STT and TTS services.
-Handles startup, shutdown, and warmup detection.
+Handles startup, shutdown, and readiness detection.
 """
 import asyncio
 
@@ -18,12 +18,12 @@ class ContainerManager:
     STT_CONTAINER = "miku-stt"
     TTS_CONTAINER = "miku-rvc-api"
 
-    # Warmup check endpoints
+    # Health check endpoints
     STT_HEALTH_URL = "http://miku-stt:8767/health"  # HTTP health check endpoint
     TTS_HEALTH_URL = "http://miku-rvc-api:8765/health"
 
-    # Warmup timeouts
-    STT_WARMUP_TIMEOUT = 30  # seconds
+    # Startup timeouts (time to load models and become ready)
+    STT_WARMUP_TIMEOUT = 30  # seconds (Whisper model loading)
     TTS_WARMUP_TIMEOUT = 60  # seconds (RVC takes longer)
 
     @classmethod
@@ -65,17 +65,17 @@ class ContainerManager:
         logger.info(f"{cls.TTS_CONTAINER} started")
 
-        # Wait for warmup
-        logger.info("⏳ Waiting for containers to warm up...")
+        # Wait for models to load and become ready
+        logger.info("⏳ Waiting for models to load...")
 
         stt_ready = await cls._wait_for_stt_warmup()
         if not stt_ready:
-            logger.error("STT failed to warm up")
+            logger.error("STT failed to become ready")
             return False
 
         tts_ready = await cls._wait_for_tts_warmup()
         if not tts_ready:
-            logger.error("TTS failed to warm up")
+            logger.error("TTS failed to become ready")
             return False
 
         logger.info("✅ All voice containers ready!")
 
@@ -130,7 +130,8 @@ class ContainerManager:
             async with session.get(cls.STT_HEALTH_URL, timeout=aiohttp.ClientTimeout(total=2)) as resp:
                 if resp.status == 200:
                     data = await resp.json()
-                    if data.get("status") == "ready" and data.get("warmed_up"):
+                    # New STT server returns {"status": "ready"} when models are loaded
+                    if data.get("status") == "ready":
                         logger.info("✓ STT is ready")
                         return True
         except Exception: