STT Voice Testing Guide

Phase 4B: Bot-Side STT Integration - COMPLETE

All code has been deployed to containers. Ready for testing!

Architecture Overview

Discord Voice (User) → Opus 48kHz stereo
                ↓
        VoiceReceiver.write()
                ↓
        Opus decode → Stereo-to-mono → Resample to 16kHz
                ↓
        STTClient.send_audio() → WebSocket
                ↓
        miku-stt:8001 (Silero VAD + Faster-Whisper)
                ↓
        JSON events (vad, partial, final, interruption)
                ↓
        VoiceReceiver callbacks → voice_manager
                ↓
        on_final_transcript() → _generate_voice_response()
                ↓
        LLM streaming → TTS tokens → Audio playback
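
The decode/downmix/resample stage is the only audio math the bot performs before shipping samples to the STT service. A minimal sketch of that conversion, assuming NumPy and SciPy are available and that each decoded Discord frame is 20 ms of 48 kHz, 16-bit stereo PCM (the actual VoiceReceiver internals are not shown):

import numpy as np
from scipy.signal import resample_poly

def discord_frame_to_stt(frame: bytes) -> bytes:
    """Convert one decoded Discord frame (48 kHz, int16 stereo PCM)
    into 16 kHz mono int16 PCM for STTClient.send_audio()."""
    pcm = np.frombuffer(frame, dtype=np.int16)
    stereo = pcm.reshape(-1, 2)                     # [samples, channels]
    mono = stereo.astype(np.float32).mean(axis=1)   # downmix L/R
    mono_16k = resample_poly(mono, up=1, down=3)    # 48000 Hz -> 16000 Hz
    return np.clip(mono_16k, -32768, 32767).astype(np.int16).tobytes()

Under these assumptions, a 20 ms frame (960 samples per channel) comes out as 320 mono samples at 16 kHz, which matches the "Received 320 audio samples" lines in the STT logs quoted later in this guide.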

New Voice Commands

1. Start Listening

!miku listen
  • Starts listening to your voice in the current voice channel
  • You must be in the same channel as Miku
  • Miku will transcribe your speech and respond with voice
!miku listen @username
  • Starts listening to a specific user's voice
  • Useful for moderators or for testing with multiple users

2. Stop Listening

!miku stop-listening
  • Stops listening to your voice
  • Miku will no longer transcribe or respond to your speech
!miku stop-listening @username
  • Stops listening to a specific user
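
For orientation, this is roughly how a listen command can be wired with discord.py's commands extension. It is a sketch, not the bot's actual code: voice_manager.start_listening() is a hypothetical stand-in for the project's real plumbing (attaching the VoiceReceiver sink and opening the STTClient WebSocket), and registering the command with the bot is omitted.

import discord
from discord.ext import commands

@commands.command(name="listen")
async def listen(ctx: commands.Context, member: discord.Member = None):
    """Start transcribing a user's voice (defaults to the caller)."""
    target = member or ctx.author
    in_same_channel = (
        ctx.voice_client is not None
        and target.voice is not None
        and target.voice.channel == ctx.voice_client.channel
    )
    if not in_same_channel:
        await ctx.send("You need to be in the same voice channel as Miku.")
        return
    # Hypothetical call: attach a VoiceReceiver and open the STT WebSocket.
    await voice_manager.start_listening(ctx.guild.id, target.id)
    await ctx.send(f"Now listening to {target.display_name}.")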

Testing Procedure

Test 1: Basic STT Connection

  1. Join a voice channel
  2. !miku join - Miku joins your channel
  3. !miku listen - Start listening to your voice
  4. Check bot logs for "Started listening to user"
  5. Check STT logs: docker logs miku-stt --tail 50
    • Should show: "WebSocket connection from user {user_id}"
    • Should show: "Session started for user {user_id}"

Test 2: VAD Detection

  1. After !miku listen, speak into your microphone
  2. Say something like: "Hello Miku, can you hear me?"
  3. Check STT logs for VAD events:
    [DEBUG] VAD: speech_start probability=0.85
    [DEBUG] VAD: speaking probability=0.92
    [DEBUG] VAD: speech_end probability=0.15
    
  4. Bot logs should show: "VAD event for user {id}: speech_start/speaking/speech_end"
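
The log lines above come from JSON events arriving on the STT WebSocket (vad, partial, final, interruption, per the architecture overview). A minimal dispatcher for those events might look like the following; the exact field names are assumptions rather than the service's real schema, and the prints simply mirror the bot log lines quoted in this guide.

import json

async def handle_stt_event(raw: str, user_id: int) -> None:
    """Route one JSON message from the STT WebSocket to the appropriate handler."""
    event = json.loads(raw)
    etype = event.get("type")
    if etype == "vad":
        # e.g. {"type": "vad", "state": "speech_start", "probability": 0.85}
        print(f"VAD event for user {user_id}: {event['state']}")
    elif etype == "partial":
        print(f"Partial transcript from user {user_id}: {event['text']}")
    elif etype == "final":
        print(f"Final transcript from user {user_id}: {event['text']}")
    elif etype == "interruption":
        print(f"User {user_id} interrupted Miku (probability={event['probability']})")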

Test 3: Transcription

  1. Speak clearly into your microphone: "Hey Miku, tell me a joke"
  2. Watch bot logs for:
    • "Partial transcript from user {id}: Hey Miku..."
    • "Final transcript from user {id}: Hey Miku, tell me a joke"
  3. Miku should respond with LLM-generated speech
  4. Check channel for: "🎤 Miku: [her response]"

Test 4: Interruption Detection

  1. !miku listen
  2. !miku say Tell me a very long story about your favorite song
  3. While Miku is speaking, start talking yourself
  4. Speak loudly enough to cross the interruption threshold (VAD probability > 0.7)
  5. Expected behavior:
    • Miku's audio should stop immediately
    • Bot logs: "User {id} interrupted Miku (probability={prob})"
    • STT logs: "Interruption detected during TTS playback"
    • RVC logs: "Interrupted: Flushed {N} ZMQ chunks"
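
Interruption handling has two halves: stop Discord playback locally, then tell the RVC container to flush whatever audio is still queued via the documented POST /interrupt endpoint. A minimal sketch, assuming aiohttp and a discord.py VoiceClient (bookkeeping such as the miku_speaking flag is project-specific and omitted):

import aiohttp

async def interrupt_miku(voice_client, probability: float) -> None:
    """Cut Miku off mid-sentence when a listened user starts speaking."""
    # 1. Stop whatever is currently playing in the Discord voice channel.
    if voice_client is not None and voice_client.is_playing():
        voice_client.stop()
    # 2. Ask the RVC container to drop any audio still in its pipeline.
    async with aiohttp.ClientSession() as session:
        async with session.post("http://localhost:8765/interrupt") as resp:
            print(f"Interrupted (p={probability:.2f}): HTTP {resp.status}")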

Test 5: Multi-User (if available)

  1. Have two users join the voice channel
  2. !miku listen @user1 - Listen to first user
  3. !miku listen @user2 - Listen to second user
  4. Both users speak separately
  5. Verify Miku responds to each user individually
  6. Check STT logs for multiple active sessions
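
Each listened user should map to an independent STT session, so transcripts and VAD state never mix between speakers. A minimal sketch of the per-user bookkeeping this test exercises; the STTClient constructor, connect(), and close() calls below are assumptions standing in for the project's actual client class.

from stt_client import STTClient   # project module; import path assumed

# One STTClient (and one VoiceReceiver feed) per listened user.
active_sessions: dict[int, STTClient] = {}

async def start_listening(user_id: int) -> None:
    if user_id in active_sessions:
        return                              # already listening to this user
    client = STTClient(user_id)             # assumed constructor
    await client.connect()                  # opens ws://.../ws/stt/{user_id}
    active_sessions[user_id] = client

async def stop_listening(user_id: int) -> None:
    client = active_sessions.pop(user_id, None)
    if client is not None:
        await client.close()                # assumed cleanup method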

Logs to Monitor

Bot Logs

docker logs -f miku-bot | grep -E "(listen|STT|transcript|interrupt)"

Expected output:

[INFO] Started listening to user 123456789 (username)
[DEBUG] VAD event for user 123456789: speech_start
[DEBUG] Partial transcript from user 123456789: Hello Miku...
[INFO] Final transcript from user 123456789: Hello Miku, how are you?
[INFO] User 123456789 interrupted Miku (probability=0.82)

STT Logs

docker logs -f miku-stt

Expected output:

[INFO] WebSocket connection from user_123456789
[INFO] Session started for user 123456789
[DEBUG] Received 320 audio samples from user_123456789
[DEBUG] VAD speech_start: probability=0.87
[INFO] Transcribing audio segment (duration=2.5s)
[INFO] Final transcript: "Hello Miku, how are you?"

RVC Logs (for interruption)

docker logs -f miku-rvc-api | grep -i interrupt

Expected output:

[INFO] Interrupted: Flushed 15 ZMQ chunks, cleared 48000 RVC buffer samples

Component Status

Completed

  • STT container running (miku-stt:8001)
  • Silero VAD on CPU with chunk buffering
  • Faster-Whisper on GTX 1660 (1.3GB VRAM)
  • STTClient WebSocket client
  • VoiceReceiver Discord audio sink
  • VoiceSession STT integration
  • listen/stop-listening commands
  • /interrupt endpoint in RVC API
  • LLM response generation from transcripts
  • Interruption detection and cancellation

Pending Testing

  • Basic STT connection test
  • VAD speech detection test
  • End-to-end transcription test
  • LLM voice response test
  • Interruption cancellation test
  • Multi-user testing (if available)

🔧 Configuration Tuning (after testing)

  • VAD sensitivity (currently threshold=0.5)
  • VAD timing (min_speech=250ms, min_silence=500ms)
  • Interruption threshold (currently 0.7)
  • Whisper beam size and patience
  • LLM streaming chunk size
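
For reference, the current defaults collected in one place. The constant names are illustrative only; the real settings live wherever miku-stt and the bot read their configuration.

VAD_THRESHOLD = 0.5          # speech probability needed to open a segment
VAD_MIN_SPEECH_MS = 250      # ignore bursts shorter than this
VAD_MIN_SILENCE_MS = 500     # close the segment after this much silence
INTERRUPT_THRESHOLD = 0.7    # VAD probability that cancels TTS playback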

API Endpoints

STT Container (port 8001)

  • WebSocket: ws://localhost:8001/ws/stt/{user_id}
  • Health: http://localhost:8001/health

RVC Container (port 8765)

  • WebSocket: ws://localhost:8765/ws/stream
  • Interrupt: http://localhost:8765/interrupt (POST)
  • Health: http://localhost:8765/health
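
A quick connectivity check against the documented endpoints, before involving Discord at all. This assumes the STT socket accepts raw 16 kHz mono int16 PCM frames (as described in the architecture overview) and that aiohttp is installed; one second of silence should be accepted without producing a transcript.

import asyncio
import aiohttp

async def smoke_test(user_id: int = 123456789) -> None:
    async with aiohttp.ClientSession() as session:
        # Health checks for both services.
        for url in ("http://localhost:8001/health", "http://localhost:8765/health"):
            async with session.get(url) as resp:
                print(url, "->", resp.status)
        # Open the STT WebSocket and push ~1 s of silence in 20 ms frames.
        async with session.ws_connect(f"ws://localhost:8001/ws/stt/{user_id}") as ws:
            silence = b"\x00\x00" * 320          # 320 samples @ 16 kHz = 20 ms
            for _ in range(50):
                await ws.send_bytes(silence)
            try:
                msg = await asyncio.wait_for(ws.receive(), timeout=2.0)
                print("STT event:", msg.data)
            except asyncio.TimeoutError:
                print("No events received (expected for pure silence).")

asyncio.run(smoke_test())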

Troubleshooting

No audio received from Discord

  • Check bot logs for "write() called with data"
  • Verify user is in same voice channel as Miku
  • Check Discord permissions (View Channel, Connect, Speak)

VAD not detecting speech

  • Check chunk buffer accumulation in STT logs
  • Verify audio format: PCM int16, 16kHz mono
  • Try speaking louder or more clearly
  • Check VAD threshold (may need adjustment)
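
If VAD never fires, it is also worth confirming that the audio reaching the preprocessor is neither silent nor clipped. A self-contained level check for a chunk of 16 kHz mono int16 PCM (NumPy assumed):

import numpy as np

def audio_level(pcm_bytes: bytes) -> tuple[float, float]:
    """Return (rms, peak) of an int16 PCM chunk, normalised to 0..1.
    Values near 0.0 mean the mic is effectively silent; a peak pinned
    at 1.0 suggests clipping, which hurts both VAD and Whisper."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    if samples.size == 0:
        return 0.0, 0.0
    rms = float(np.sqrt(np.mean(samples ** 2)))
    peak = float(np.max(np.abs(samples)))
    return rms, peak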

Transcription empty or gibberish

  • Verify Whisper model loaded (check STT startup logs)
  • Check GPU VRAM usage: nvidia-smi
  • Ensure audio segments are at least 1-2 seconds long
  • Try speaking more clearly with less background noise

Interruption not working

  • Verify Miku is actually speaking (check miku_speaking flag)
  • Check VAD probability in logs (must be > 0.7)
  • Verify /interrupt endpoint returns success
  • Check RVC logs for flushed chunks

Multiple users causing issues

  • Check STT logs for per-user session management
  • Verify each user has a separate STTClient instance
  • Check for resource contention on GTX 1660

Next Steps After Testing

Phase 4C: LLM KV Cache Precomputation

  • Use partial transcripts to start LLM generation early
  • Precompute KV cache for common phrases
  • Reduce latency between speech end and response start

Phase 4D: Multi-User Refinement

  • Queue management for multiple simultaneous speakers
  • Priority system for interruptions
  • Resource allocation for multiple Whisper requests

Phase 4E: Latency Optimization

  • Profile each stage of the pipeline
  • Optimize audio chunk sizes
  • Reduce WebSocket message overhead
  • Tune Whisper beam search parameters
  • Implement VAD lookahead for quicker detection

Hardware Utilization

Current Allocation

  • AMD RX 6800: LLaMA text models (idle during listen/speak)
  • GTX 1660:
    • Listen phase: Faster-Whisper (1.3GB VRAM)
    • Speak phase: Soprano TTS + RVC (time-multiplexed)
  • CPU: Silero VAD, audio preprocessing

Expected Performance

  • VAD latency: <50ms (CPU processing)
  • Transcription latency: 200-500ms (Whisper inference)
  • LLM streaming: 20-30 tokens/sec (RX 6800)
  • TTS synthesis: Real-time (GTX 1660)
  • Total latency (speech → response): 1-2 seconds

Testing Checklist

Before marking Phase 4B as complete:

  • Test basic STT connection with !miku listen
  • Verify VAD detects speech start/end correctly
  • Confirm transcripts are accurate and complete
  • Test LLM voice response generation works
  • Verify interruption cancels TTS playback
  • Check multi-user handling (if possible)
  • Verify resource cleanup on !miku stop-listening
  • Test edge cases (silence, background noise, overlapping speech)
  • Profile latencies at each stage
  • Document any configuration tuning needed

Status: Code deployed, ready for user testing! 🎤🤖