refactor: Implement low-latency STT pipeline with speculative transcription
Major architectural overhaul of the speech-to-text pipeline for real-time voice chat.

STT Server Rewrite:
- Replaced RealtimeSTT dependency with direct Silero VAD + Faster-Whisper integration
- Achieved sub-second latency by eliminating unnecessary abstractions
- Uses the small.en Whisper model for fast transcription (~850ms)

Speculative Transcription (NEW):
- Start transcribing at 150ms of silence (speculative) while still listening
- If speech continues, discard the speculative result and keep buffering
- If 400ms of silence is confirmed, use the pre-computed speculative result immediately
- Reduces latency by ~250-850ms for typical utterances with clear pauses

VAD Implementation:
- Silero VAD with ONNX (CPU-efficient) for 32ms chunk processing
- Direct speech boundary detection without RealtimeSTT overhead
- Configurable silence-detection thresholds (400ms final, 150ms speculative)

Architecture:
- Single Whisper model loaded once, shared across sessions
- VAD runs on every 512-sample chunk for immediate speech detection
- Background transcription worker thread for non-blocking processing
- Greedy decoding (beam_size=1) for maximum speed

Performance:
- Previous: 400ms silence wait + ~850ms transcription = ~1.25s total latency
- Current: 400ms silence wait + 0ms (speculative result already ready) = ~400ms (best case)
- Single shared model reduces VRAM usage and prevents OOM on a GTX 1660

Container Manager Updates:
- Updated health check logic to work with the new response format
- Changed from checking the 'warmed_up' flag to just 'status: ready'
- Improved terminology from 'warmup' to 'models loading'

Files Changed:
- stt-realtime/stt_server.py: Complete rewrite with Silero VAD + speculative transcription
- stt-realtime/requirements.txt: Removed RealtimeSTT; Silero VAD now loads via torch.hub
- bot/utils/container_manager.py: Updated health check for the new STT response format
- bot/api.py: Updated docstring to reflect the new architecture
- backups/: Archived the old RealtimeSTT-based implementation

This addresses the low-latency requirements while maintaining accuracy via configurable speech detection thresholds.
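The speculative-transcription flow described above can be sketched as a small state machine. This is a hypothetical simplification, not the server's actual code: `transcribe` stands in for the Faster-Whisper call (which the real server runs on a background worker thread), and input arrives as one boolean VAD decision per 32ms chunk.

```python
# Sketch of the speculative-transcription flow (illustrative, not the real server).
CHUNK_MS = 32        # one Silero VAD chunk: 512 samples at 16 kHz
SPECULATIVE_MS = 150 # start a speculative transcription after this much silence
FINAL_MS = 400       # confirm end of utterance after this much silence

def run_pipeline(vad_decisions, transcribe):
    """Feed per-chunk VAD decisions (True = speech); return the transcript
    once 400ms of silence is confirmed, or None if the utterance never ends."""
    buffer = []        # stand-in for buffered audio chunks
    silence_ms = 0
    speculative = None # pre-computed transcript, if any
    for i, is_speech in enumerate(vad_decisions):
        buffer.append(i)
        if is_speech:
            # Speech resumed: the speculative result is stale, discard it.
            silence_ms = 0
            speculative = None
        else:
            silence_ms += CHUNK_MS
            if silence_ms >= SPECULATIVE_MS and speculative is None:
                # Transcribe early while still listening (background in reality).
                speculative = transcribe(list(buffer))
            if silence_ms >= FINAL_MS:
                # Best case: the result is already ready, zero extra wait.
                if speculative is not None:
                    return speculative
                return transcribe(list(buffer))
    return None
```

In the best case the ~850ms transcription overlaps the remaining 250ms of the silence window, which is where the commit's "0ms (speculative ready)" figure comes from.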
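As a sanity check on the timing figures above (assuming a 16 kHz sample rate, which the 512-sample / 32ms chunk size implies but the commit does not state explicitly):

```python
# Verify the commit's chunk-size and threshold arithmetic (assumes 16 kHz audio).
SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 512

chunk_ms = CHUNK_SAMPLES / SAMPLE_RATE * 1000  # 32.0 ms per VAD chunk
speculative_chunks = 150 / chunk_ms  # ~4.7 -> speculation fires on the 5th silent chunk
final_chunks = 400 / chunk_ms        # 12.5 -> finalization fires on the 13th silent chunk
```

So the speculative pass gets a head start of roughly 8 chunks (~250ms) before the 400ms threshold confirms the utterance, consistent with the stated ~250-850ms saving.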
@@ -1,19 +1,16 @@
-# RealtimeSTT dependencies
-RealtimeSTT>=0.3.104
+# Low-latency STT dependencies
 websockets>=12.0
 numpy>=1.24.0
 
-# For faster-whisper backend (GPU accelerated)
+# Faster-whisper backend (GPU accelerated)
 faster-whisper>=1.0.0
 ctranslate2>=4.4.0
 
 # Audio processing
 soundfile>=0.12.0
 librosa>=0.10.0
 
-# VAD dependencies (included with RealtimeSTT but explicit)
-webrtcvad>=2.0.10
-silero-vad>=5.1
-
+# VAD - Silero (loaded via torch.hub)
+# No explicit package needed, comes with torch
 # Utilities
 aiohttp>=3.9.0