refactor: Implement low-latency STT pipeline with speculative transcription
Major architectural overhaul of the speech-to-text pipeline for real-time voice chat.

STT Server Rewrite:
- Replaced RealtimeSTT dependency with direct Silero VAD + Faster-Whisper integration
- Achieved sub-second latency by eliminating unnecessary abstractions
- Uses the small.en Whisper model for fast transcription (~850ms)

Speculative Transcription (NEW):
- Start transcribing at 150ms of silence (speculative) while still listening
- If speech continues, discard the speculative result and keep buffering
- If 400ms of silence is confirmed, use the pre-computed speculative result immediately
- Reduces latency by ~250-850ms for typical utterances with clear pauses

VAD Implementation:
- Silero VAD with ONNX (CPU-efficient) for 32ms chunk processing
- Direct speech boundary detection without RealtimeSTT overhead
- Configurable silence-detection thresholds (400ms final, 150ms speculative)

Architecture:
- Single Whisper model loaded once, shared across sessions
- VAD runs on every 512-sample chunk for immediate speech detection
- Background transcription worker thread for non-blocking processing
- Greedy decoding (beam_size=1) for maximum speed

Performance:
- Previous: 400ms silence wait + ~850ms transcription = ~1.25s total latency
- Current: 400ms silence wait + 0ms (speculative result already ready) = ~400ms (best case)
- Single shared model reduces VRAM usage and prevents OOM on a GTX 1660

Container Manager Updates:
- Updated health check logic to work with the new response format
- Changed from checking the 'warmed_up' flag to just 'status: ready'
- Improved terminology from 'warmup' to 'models loading'

Files Changed:
- stt-realtime/stt_server.py: Complete rewrite with Silero VAD + speculative transcription
- stt-realtime/requirements.txt: Removed RealtimeSTT; Silero VAD now loads via torch.hub
- bot/utils/container_manager.py: Updated health check for the new STT response format
- bot/api.py: Updated docstring to reflect the new architecture
- backups/: Archived the old RealtimeSTT-based implementation

This addresses the low-latency requirements while maintaining accuracy via configurable speech detection thresholds.
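The speculative-transcription flow described above can be sketched as a small state machine. This is a hypothetical simplification, not the server's actual code: `transcribe` stands in for the Faster-Whisper call (which the real server runs on a background worker thread), and input arrives as one boolean VAD decision per 32ms chunk.

```python
# Sketch of the speculative-transcription flow (illustrative, not the real server).
CHUNK_MS = 32        # one Silero VAD chunk: 512 samples at 16 kHz
SPECULATIVE_MS = 150 # start a speculative transcription after this much silence
FINAL_MS = 400       # confirm end of utterance after this much silence

def run_pipeline(vad_decisions, transcribe):
    """Feed per-chunk VAD decisions (True = speech); return the transcript
    once 400ms of silence is confirmed, or None if the utterance never ends."""
    buffer = []        # stand-in for buffered audio chunks
    silence_ms = 0
    speculative = None # pre-computed transcript, if any
    for i, is_speech in enumerate(vad_decisions):
        buffer.append(i)
        if is_speech:
            # Speech resumed: the speculative result is stale, discard it.
            silence_ms = 0
            speculative = None
        else:
            silence_ms += CHUNK_MS
            if silence_ms >= SPECULATIVE_MS and speculative is None:
                # Transcribe early while still listening (background in reality).
                speculative = transcribe(list(buffer))
            if silence_ms >= FINAL_MS:
                # Best case: the result is already ready, zero extra wait.
                if speculative is not None:
                    return speculative
                return transcribe(list(buffer))
    return None
```

In the best case the ~850ms transcription overlaps the remaining 250ms of the silence window, which is where the commit's "0ms (speculative ready)" figure comes from.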
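As a sanity check on the timing figures above (assuming a 16 kHz sample rate, which the 512-sample / 32ms chunk size implies but the commit does not state explicitly):

```python
# Verify the commit's chunk-size and threshold arithmetic (assumes 16 kHz audio).
SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 512

chunk_ms = CHUNK_SAMPLES / SAMPLE_RATE * 1000  # 32.0 ms per VAD chunk
speculative_chunks = 150 / chunk_ms  # ~4.7 -> speculation fires on the 5th silent chunk
final_chunks = 400 / chunk_ms        # 12.5 -> finalization fires on the 13th silent chunk
```

So the speculative pass gets a head start of roughly 8 chunks (~250ms) before the 400ms threshold confirms the utterance, consistent with the stated ~250-850ms saving.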
@@ -1,19 +1,16 @@
-# RealtimeSTT dependencies
-RealtimeSTT>=0.3.104
+# Low-latency STT dependencies
 websockets>=12.0
 numpy>=1.24.0
 
-# For faster-whisper backend (GPU accelerated)
+# Faster-whisper backend (GPU accelerated)
 faster-whisper>=1.0.0
 ctranslate2>=4.4.0
 
 # Audio processing
 soundfile>=0.12.0
 librosa>=0.10.0
 
-# VAD dependencies (included with RealtimeSTT but explicit)
-webrtcvad>=2.0.10
-silero-vad>=5.1
-
+# VAD - Silero (loaded via torch.hub)
+# No explicit package needed, comes with torch
 # Utilities
 aiohttp>=3.9.0