# STT Voice Testing Guide

## Phase 4B: Bot-Side STT Integration - COMPLETE ✅

All code has been deployed to containers. Ready for testing!

## Architecture Overview

```
Discord Voice (User) → Opus 48kHz stereo
        ↓
VoiceReceiver.write()
        ↓
Opus decode → Stereo-to-mono → Resample to 16kHz
        ↓
STTClient.send_audio() → WebSocket
        ↓
miku-stt:8001 (Silero VAD + Faster-Whisper)
        ↓
JSON events (vad, partial, final, interruption)
        ↓
VoiceReceiver callbacks → voice_manager
        ↓
on_final_transcript() → _generate_voice_response()
        ↓
LLM streaming → TTS tokens → Audio playback
```

## New Voice Commands

### 1. Start Listening

```
!miku listen
```
- Starts listening to **your** voice in the current voice channel
- You must be in the same channel as Miku
- Miku will transcribe your speech and respond with voice

```
!miku listen @username
```
- Start listening to a specific user's voice
- Useful for moderators or testing with multiple users

### 2. Stop Listening

```
!miku stop-listening
```
- Stop listening to your voice
- Miku will no longer transcribe or respond to your speech

```
!miku stop-listening @username
```
- Stop listening to a specific user

## Testing Procedure

### Test 1: Basic STT Connection

1. Join a voice channel
2. `!miku join` - Miku joins your channel
3. `!miku listen` - Start listening to your voice
4. Check bot logs for "Started listening to user"
5. Check STT logs: `docker logs miku-stt --tail 50`
   - Should show: "WebSocket connection from user {user_id}"
   - Should show: "Session started for user {user_id}"

### Test 2: VAD Detection

1. After `!miku listen`, speak into your microphone
2. Say something like: "Hello Miku, can you hear me?"
3. Check STT logs for VAD events:
   ```
   [DEBUG] VAD: speech_start probability=0.85
   [DEBUG] VAD: speaking probability=0.92
   [DEBUG] VAD: speech_end probability=0.15
   ```
4. Bot logs should show: "VAD event for user {id}: speech_start/speaking/speech_end"

### Test 3: Transcription

1. Speak clearly into microphone: "Hey Miku, tell me a joke"
2. Watch bot logs for:
   - "Partial transcript from user {id}: Hey Miku..."
   - "Final transcript from user {id}: Hey Miku, tell me a joke"
3. Miku should respond with LLM-generated speech
4. Check channel for: "🎤 Miku: *[her response]*"

### Test 4: Interruption Detection

1. `!miku listen`
2. `!miku say Tell me a very long story about your favorite song`
3. While Miku is speaking, start talking yourself
4. Speak loudly enough to trigger VAD (probability > 0.7)
5. Expected behavior:
   - Miku's audio should stop immediately
   - Bot logs: "User {id} interrupted Miku (probability={prob})"
   - STT logs: "Interruption detected during TTS playback"
   - RVC logs: "Interrupted: Flushed {N} ZMQ chunks"

### Test 5: Multi-User (if available)

1. Have two users join voice channel
2. `!miku listen @user1` - Listen to first user
3. `!miku listen @user2` - Listen to second user
4. Both users speak separately
5. Verify Miku responds to each user individually
6. Check STT logs for multiple active sessions
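
The tests above all go through Discord. To separate STT problems from Discord audio capture, you can also stream a WAV file straight at the STT WebSocket and watch the events come back. The sketch below is an assumption-heavy convenience, not part of the deployed code: the binary int16 framing, the 20 ms chunk size, and the `type` field on events are inferred from the endpoint table and log lines in this guide, so adjust it to whatever the miku-stt handler actually expects.

```python
"""
Hypothetical standalone probe for the STT WebSocket endpoint (not part of the
deployed code). Assumes the server accepts raw little-endian int16 PCM at
16 kHz mono as binary frames and pushes JSON events (vad / partial / final /
interruption) as text frames -- verify against the actual miku-stt handler.
"""
import asyncio
import json
import math
import sys
import wave

import numpy as np
import websockets                        # pip install websockets
from scipy.signal import resample_poly   # pip install scipy

STT_URL = "ws://localhost:8001/ws/stt/test_user"  # {user_id} slot from the endpoint table
FRAME_SAMPLES = 320                                # 20 ms at 16 kHz ("Received 320 audio samples")


def load_as_16k_mono(path: str) -> np.ndarray:
    """Mimic the bot-side pipeline: stereo -> mono -> resample to 16 kHz int16."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        pcm = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    if channels == 2:
        pcm = pcm.reshape(-1, 2).mean(axis=1)              # stereo -> mono
    if rate != 16000:
        g = math.gcd(16000, rate)
        pcm = resample_poly(pcm, 16000 // g, rate // g)    # e.g. 48 kHz -> 16 kHz
    return pcm.astype(np.int16)


async def main(path: str) -> None:
    audio = load_as_16k_mono(path)
    # Trailing silence so the VAD sees a speech_end and a final transcript is emitted.
    audio = np.concatenate([audio, np.zeros(16000, dtype=np.int16)])

    async with websockets.connect(STT_URL) as ws:

        async def sender() -> None:
            # Stream 20 ms frames in roughly real time so VAD timing is realistic.
            for start in range(0, len(audio), FRAME_SAMPLES):
                await ws.send(audio[start:start + FRAME_SAMPLES].tobytes())
                await asyncio.sleep(0.02)

        send_task = asyncio.create_task(sender())
        try:
            while True:
                event = json.loads(await asyncio.wait_for(ws.recv(), timeout=15))
                print(event)  # expect vad / partial / final events as described above
                if event.get("type") == "final":
                    break
        finally:
            send_task.cancel()


if __name__ == "__main__":
    asyncio.run(main(sys.argv[1]))
```

The stereo-to-mono and 48 kHz → 16 kHz steps mirror what `VoiceReceiver` does before calling `STTClient.send_audio()`, so a clean result here points at Discord capture rather than the STT container.
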
## Logs to Monitor

### Bot Logs

```bash
docker logs -f miku-bot | grep -E "(listen|STT|transcript|interrupt)"
```

Expected output:
```
[INFO] Started listening to user 123456789 (username)
[DEBUG] VAD event for user 123456789: speech_start
[DEBUG] Partial transcript from user 123456789: Hello Miku...
[INFO] Final transcript from user 123456789: Hello Miku, how are you?
[INFO] User 123456789 interrupted Miku (probability=0.82)
```

### STT Logs

```bash
docker logs -f miku-stt
```

Expected output:
```
[INFO] WebSocket connection from user_123456789
[INFO] Session started for user 123456789
[DEBUG] Received 320 audio samples from user_123456789
[DEBUG] VAD speech_start: probability=0.87
[INFO] Transcribing audio segment (duration=2.5s)
[INFO] Final transcript: "Hello Miku, how are you?"
```

### RVC Logs (for interruption)

```bash
docker logs -f miku-rvc-api | grep -i interrupt
```

Expected output:
```
[INFO] Interrupted: Flushed 15 ZMQ chunks, cleared 48000 RVC buffer samples
```

## Component Status

### ✅ Completed

- [x] STT container running (miku-stt:8001)
- [x] Silero VAD on CPU with chunk buffering
- [x] Faster-Whisper on GTX 1660 (1.3GB VRAM)
- [x] STTClient WebSocket client
- [x] VoiceReceiver Discord audio sink
- [x] VoiceSession STT integration
- [x] listen/stop-listening commands
- [x] /interrupt endpoint in RVC API
- [x] LLM response generation from transcripts
- [x] Interruption detection and cancellation

### ⏳ Pending Testing

- [ ] Basic STT connection test
- [ ] VAD speech detection test
- [ ] End-to-end transcription test
- [ ] LLM voice response test
- [ ] Interruption cancellation test
- [ ] Multi-user testing (if available)

### 🔧 Configuration Tuning (after testing)

- VAD sensitivity (currently threshold=0.5)
- VAD timing (min_speech=250ms, min_silence=500ms)
- Interruption threshold (currently 0.7)
- Whisper beam size and patience
- LLM streaming chunk size

## API Endpoints

### STT Container (port 8001)

- WebSocket: `ws://localhost:8001/ws/stt/{user_id}`
- Health: `http://localhost:8001/health`

### RVC Container (port 8765)

- WebSocket: `ws://localhost:8765/ws/stream`
- Interrupt: `http://localhost:8765/interrupt` (POST)
- Health: `http://localhost:8765/health`
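
These endpoints can be exercised directly from the host, which helps with the interruption checks in the troubleshooting section below. A minimal sketch using `requests`; note that sending an empty POST to `/interrupt` and reading a plain-text body are assumptions, so check the RVC API handler for the actual request/response shape.

```python
"""Quick manual checks against the STT and RVC HTTP endpoints listed above."""
import requests  # pip install requests

# Health checks: both containers should answer while idle.
for name, url in [
    ("miku-stt", "http://localhost:8001/health"),
    ("miku-rvc-api", "http://localhost:8765/health"),
]:
    resp = requests.get(url, timeout=5)
    print(f"{name}: {resp.status_code} {resp.text.strip()}")

# Trigger the interruption path by hand while Miku is speaking (e.g. during
# Test 4). An empty POST body is an assumption; the endpoint may expect JSON
# such as a user id.
resp = requests.post("http://localhost:8765/interrupt", timeout=5)
print("interrupt:", resp.status_code, resp.text.strip())
```

A success response from `/interrupt` together with the "Interrupted: Flushed ... ZMQ chunks" line in the RVC logs confirms the cancellation path works independently of VAD detection.
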
## Troubleshooting

### No audio received from Discord
- Check bot logs for "write() called with data"
- Verify user is in same voice channel as Miku
- Check Discord permissions (View Channel, Connect, Speak)

### VAD not detecting speech
- Check chunk buffer accumulation in STT logs
- Verify audio format: PCM int16, 16kHz mono
- Try speaking louder or more clearly
- Check VAD threshold (may need adjustment)

### Transcription empty or gibberish
- Verify Whisper model loaded (check STT startup logs)
- Check GPU VRAM usage: `nvidia-smi`
- Ensure audio segments are at least 1-2 seconds long
- Try speaking more clearly with less background noise

### Interruption not working
- Verify Miku is actually speaking (check miku_speaking flag)
- Check VAD probability in logs (must be > 0.7)
- Verify /interrupt endpoint returns success
- Check RVC logs for flushed chunks

### Multiple users causing issues
- Check STT logs for per-user session management
- Verify each user has separate STTClient instance
- Check for resource contention on GTX 1660

## Next Steps After Testing

### Phase 4C: LLM KV Cache Precomputation
- Use partial transcripts to start LLM generation early
- Precompute KV cache for common phrases
- Reduce latency between speech end and response start

### Phase 4D: Multi-User Refinement
- Queue management for multiple simultaneous speakers
- Priority system for interruptions
- Resource allocation for multiple Whisper requests

### Phase 4E: Latency Optimization
- Profile each stage of the pipeline
- Optimize audio chunk sizes
- Reduce WebSocket message overhead
- Tune Whisper beam search parameters
- Implement VAD lookahead for quicker detection

## Hardware Utilization

### Current Allocation
- **AMD RX 6800**: LLaMA text models (idle during listen/speak)
- **GTX 1660**:
  - Listen phase: Faster-Whisper (1.3GB VRAM)
  - Speak phase: Soprano TTS + RVC (time-multiplexed)
- **CPU**: Silero VAD, audio preprocessing

### Expected Performance
- VAD latency: <50ms (CPU processing)
- Transcription latency: 200-500ms (Whisper inference)
- LLM streaming: 20-30 tokens/sec (RX 6800)
- TTS synthesis: Real-time (GTX 1660)
- Total latency (speech → response): 1-2 seconds

## Testing Checklist

Before marking Phase 4B as complete:

- [ ] Test basic STT connection with `!miku listen`
- [ ] Verify VAD detects speech start/end correctly
- [ ] Confirm transcripts are accurate and complete
- [ ] Test LLM voice response generation works
- [ ] Verify interruption cancels TTS playback
- [ ] Check multi-user handling (if possible)
- [ ] Verify resource cleanup on `!miku stop-listening`
- [ ] Test edge cases (silence, background noise, overlapping speech)
- [ ] Profile latencies at each stage
- [ ] Document any configuration tuning needed

---

**Status**: Code deployed, ready for user testing! 🎤🤖
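
A note on the "Profile latencies at each stage" checklist item: first numbers can come from a simple shared stopwatch marked inside the existing callbacks. The helper below is hypothetical (not part of the deployed code), and the call sites in the comments are only suggestions for where it could be wired into VoiceReceiver / voice_manager.

```python
"""Hypothetical stage-latency stopwatch for profiling; not part of the deployed code."""
import time


class StageTimer:
    """Collect monotonic timestamps per pipeline stage and print the deltas."""

    def __init__(self) -> None:
        self.marks: list[tuple[str, float]] = []

    def mark(self, stage: str) -> None:
        self.marks.append((stage, time.monotonic()))

    def report(self) -> None:
        for (prev, t0), (cur, t1) in zip(self.marks, self.marks[1:]):
            print(f"{prev} -> {cur}: {(t1 - t0) * 1000:.0f} ms")


# Suggested call sites (one mark per stage, then report() once audio starts playing),
# to compare against the Expected Performance numbers above:
#   timer.mark("speech_end")        - VAD speech_end event in VoiceReceiver
#   timer.mark("final_transcript")  - on_final_transcript()
#   timer.mark("first_llm_token")   - first streamed LLM token
#   timer.mark("first_audio")       - first TTS chunk handed to playback
timer = StageTimer()
```
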