Commit Graph

3 Commits

Author SHA1 Message Date
9b74acd03b Fix missing sklearn module in miku-bot; upgrade miku-stt to CUDA 12.8.1 + PyTorch 2.7.1
- miku-bot: Re-add scikit-learn to requirements.txt (needed for vision color extraction)
- miku-stt: Upgrade from CUDA 12.6.2 to 12.8.1, PyTorch 2.5.1 to 2.7.1 per RealtimeSTT PR #295
- miku-stt: Use Ubuntu 24.04 with Python 3.12 (single installation, no dual Python)
- miku-stt: Add requirements-gpu-torch.txt for separate PyTorch installation
- miku-stt: Use --break-system-packages flag for Ubuntu 24.04 pip compatibility
2026-02-23 14:31:48 +02:00
eb03dfce4d refactor: Implement low-latency STT pipeline with speculative transcription
Major architectural overhaul of the speech-to-text pipeline for real-time voice chat:

STT Server Rewrite:
- Replaced RealtimeSTT dependency with direct Silero VAD + Faster-Whisper integration
- Achieved sub-second latency by eliminating unnecessary abstractions
- Uses small.en Whisper model for fast transcription (~850ms)

Speculative Transcription (NEW):
- Start transcribing at 150ms silence (speculative) while still listening
- If speech continues, discard speculative result and keep buffering
- If 400ms silence confirmed, use pre-computed speculative result immediately
- Reduces latency by ~250-850ms for typical utterances with clear pauses

VAD Implementation:
- Silero VAD with ONNX (CPU-efficient) for 32ms chunk processing
- Direct speech boundary detection without RealtimeSTT overhead
- Configurable thresholds for silence detection (400ms final, 150ms speculative)

Architecture:
- Single Whisper model loaded once, shared across sessions
- VAD runs on every 512-sample chunk for immediate speech detection
- Background transcription worker thread for non-blocking processing
- Greedy decoding (beam_size=1) for maximum speed

Performance:
- Previous: 400ms silence wait + ~850ms transcription = ~1.25s total latency
- Current: 400ms silence wait + 0ms (speculative ready) = ~400ms (best case)
- Single model reduces VRAM usage, prevents OOM on GTX 1660

Container Manager Updates:
- Updated health check logic to work with new response format
- Changed from checking 'warmed_up' flag to just 'status: ready'
- Improved terminology from 'warmup' to 'models loading'

Files Changed:
- stt-realtime/stt_server.py: Complete rewrite with Silero VAD + speculative transcription
- stt-realtime/requirements.txt: Removed RealtimeSTT, using torch.hub for Silero VAD
- bot/utils/container_manager.py: Updated health check for new STT response format
- bot/api.py: Updated docstring to reflect new architecture
- backups/: Archived old RealtimeSTT-based implementation

This addresses low latency requirements while maintaining accuracy with configurable
speech detection thresholds.
2026-01-22 22:08:07 +02:00
2934efba22 Implemented experimental real production ready voice chat, relegated old flow to voice debug mode. New Web UI panel for Voice Chat. 2026-01-20 23:06:17 +02:00