# STT Voice Testing Guide
## Phase 4B: Bot-Side STT Integration - COMPLETE ✅
All code has been deployed to containers. Ready for testing!
## Architecture Overview
```
Discord Voice (User) → Opus 48kHz stereo
        ↓
VoiceReceiver.write()
        ↓
Opus decode → Stereo-to-mono → Resample to 16kHz
        ↓
STTClient.send_audio() → WebSocket
        ↓
miku-stt:8001 (Silero VAD + Faster-Whisper)
        ↓
JSON events (vad, partial, final, interruption)
        ↓
VoiceReceiver callbacks → voice_manager
        ↓
on_final_transcript() → _generate_voice_response()
        ↓
LLM streaming → TTS tokens → Audio playback
```
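The resampling step in the diagram can be sketched in a few lines. This is a minimal illustration, assuming decoded Discord audio arrives as interleaved 48 kHz stereo int16 PCM; the function name and the use of `scipy` are illustrative, not necessarily what `VoiceReceiver` does internally:
```python
import numpy as np
from scipy.signal import resample_poly

def to_stt_format(frame: bytes) -> bytes:
    """Convert one decoded Discord frame (48 kHz stereo int16 PCM)
    into the 16 kHz mono int16 PCM the STT service expects."""
    # Interleaved stereo int16 -> shape (n_samples, 2)
    pcm = np.frombuffer(frame, dtype=np.int16).reshape(-1, 2)
    # Downmix to mono by averaging the two channels
    mono = pcm.astype(np.float32).mean(axis=1)
    # Resample 48 kHz -> 16 kHz (exact factor of 3)
    mono_16k = resample_poly(mono, up=1, down=3)
    return np.clip(mono_16k, -32768, 32767).astype(np.int16).tobytes()
```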
## New Voice Commands
### 1. Start Listening
```
!miku listen
```
- Starts listening to **your** voice in the current voice channel
- You must be in the same channel as Miku
- Miku will transcribe your speech and respond with voice
```
!miku listen @username
```
- Starts listening to a specific user's voice
- Useful for moderators or for testing with multiple users
### 2. Stop Listening
```
!miku stop-listening
```
- Stops listening to your voice
- Miku will no longer transcribe or respond to your speech
```
!miku stop-listening @username
```
- Stops listening to a specific user
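A rough sketch of what the `listen` command has to check before starting a session. The structure is a discord.py-style command; `voice_manager.start_listening()` is a hypothetical helper standing in for the bot's actual session setup:
```python
import discord
from discord.ext import commands

bot = commands.Bot(command_prefix="!miku ", intents=discord.Intents.all())

@bot.command(name="listen")
async def listen(ctx: commands.Context, member: discord.Member = None):
    """Start transcribing a user's voice (defaults to the caller)."""
    target = member or ctx.author
    # Miku must already be in a voice channel (via !miku join)
    if ctx.voice_client is None:
        await ctx.send("I'm not in a voice channel. Use `!miku join` first.")
        return
    # The target user must be in the same channel as Miku
    if target.voice is None or target.voice.channel != ctx.voice_client.channel:
        await ctx.send("You need to be in my voice channel first.")
        return
    await voice_manager.start_listening(target)  # hypothetical helper
    await ctx.send(f"🎤 Now listening to {target.display_name}.")
```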
## Testing Procedure
### Test 1: Basic STT Connection
1. Join a voice channel
2. `!miku join` - Miku joins your channel
3. `!miku listen` - Start listening to your voice
4. Check bot logs for "Started listening to user"
5. Check STT logs: `docker logs miku-stt --tail 50`
- Should show: "WebSocket connection from user {user_id}"
- Should show: "Session started for user {user_id}"
### Test 2: VAD Detection
1. After `!miku listen`, speak into your microphone
2. Say something like: "Hello Miku, can you hear me?"
3. Check STT logs for VAD events:
```
[DEBUG] VAD: speech_start probability=0.85
[DEBUG] VAD: speaking probability=0.92
[DEBUG] VAD: speech_end probability=0.15
```
4. Bot logs should show: "VAD event for user {id}: speech_start/speaking/speech_end"
### Test 3: Transcription
1. Speak clearly into microphone: "Hey Miku, tell me a joke"
2. Watch bot logs for:
- "Partial transcript from user {id}: Hey Miku..."
- "Final transcript from user {id}: Hey Miku, tell me a joke"
3. Miku should respond with LLM-generated speech
4. Check channel for: "🎤 Miku: *[her response]*"
### Test 4: Interruption Detection
1. `!miku listen`
2. `!miku say Tell me a very long story about your favorite song`
3. While Miku is speaking, start talking yourself
4. Speak loudly enough to trigger VAD (probability > 0.7)
5. Expected behavior:
- Miku's audio should stop immediately
- Bot logs: "User {id} interrupted Miku (probability={prob})"
- STT logs: "Interruption detected during TTS playback"
- RVC logs: "Interrupted: Flushed {N} ZMQ chunks"
### Test 5: Multi-User (if available)
1. Have two users join the voice channel
2. `!miku listen @user1` - Listen to first user
3. `!miku listen @user2` - Listen to second user
4. Both users speak separately
5. Verify Miku responds to each user individually
6. Check STT logs for multiple active sessions
## Logs to Monitor
### Bot Logs
```bash
docker logs -f miku-bot | grep -E "(listen|STT|transcript|interrupt)"
```
Expected output:
```
[INFO] Started listening to user 123456789 (username)
[DEBUG] VAD event for user 123456789: speech_start
[DEBUG] Partial transcript from user 123456789: Hello Miku...
[INFO] Final transcript from user 123456789: Hello Miku, how are you?
[INFO] User 123456789 interrupted Miku (probability=0.82)
```
### STT Logs
```bash
docker logs -f miku-stt
```
Expected output:
```
[INFO] WebSocket connection from user_123456789
[INFO] Session started for user 123456789
[DEBUG] Received 320 audio samples from user_123456789
[DEBUG] VAD speech_start: probability=0.87
[INFO] Transcribing audio segment (duration=2.5s)
[INFO] Final transcript: "Hello Miku, how are you?"
```
### RVC Logs (for interruption)
```bash
docker logs -f miku-rvc-api | grep -i interrupt
```
Expected output:
```
[INFO] Interrupted: Flushed 15 ZMQ chunks, cleared 48000 RVC buffer samples
```
## Component Status
### ✅ Completed
- [x] STT container running (miku-stt:8001)
- [x] Silero VAD on CPU with chunk buffering
- [x] Faster-Whisper on GTX 1660 (1.3GB VRAM)
- [x] STTClient WebSocket client
- [x] VoiceReceiver Discord audio sink
- [x] VoiceSession STT integration
- [x] listen/stop-listening commands
- [x] /interrupt endpoint in RVC API
- [x] LLM response generation from transcripts
- [x] Interruption detection and cancellation
### ⏳ Pending Testing
- [ ] Basic STT connection test
- [ ] VAD speech detection test
- [ ] End-to-end transcription test
- [ ] LLM voice response test
- [ ] Interruption cancellation test
- [ ] Multi-user testing (if available)
### 🔧 Configuration Tuning (after testing)
- VAD sensitivity (currently threshold=0.5)
- VAD timing (min_speech=250ms, min_silence=500ms)
- Interruption threshold (currently 0.7)
- Whisper beam size and patience
- LLM streaming chunk size
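For reference, the tunables above could live in a single settings object. A sketch with hypothetical names; the defaults mirror the current values listed above, except the Whisper fields, which are placeholders:
```python
from dataclasses import dataclass

@dataclass
class STTTuning:
    # Silero VAD
    vad_threshold: float = 0.5        # speech probability cutoff
    min_speech_ms: int = 250          # ignore bursts shorter than this
    min_silence_ms: int = 500         # silence that ends an utterance
    # Interruption handling
    interrupt_threshold: float = 0.7  # VAD probability needed to cut off TTS
    # Faster-Whisper decoding (placeholder values)
    beam_size: int = 5
    patience: float = 1.0
```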
## API Endpoints
### STT Container (port 8001)
- WebSocket: `ws://localhost:8001/ws/stt/{user_id}`
- Health: `http://localhost:8001/health`
### RVC Container (port 8765)
- WebSocket: `ws://localhost:8765/ws/stream`
- Interrupt: `http://localhost:8765/interrupt` (POST)
- Health: `http://localhost:8765/health`
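A quick pre-flight check that both services are reachable, assuming the health endpoints return HTTP 200 when healthy:
```python
import requests

for name, url in [
    ("STT", "http://localhost:8001/health"),
    ("RVC", "http://localhost:8765/health"),
]:
    try:
        r = requests.get(url, timeout=5)
        print(f"{name}: {r.status_code} {r.text[:80]}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")
```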
## Troubleshooting
### No audio received from Discord
- Check bot logs for "write() called with data"
- Verify the user is in the same voice channel as Miku
- Check Discord permissions (View Channel, Connect, Speak)
### VAD not detecting speech
- Check chunk buffer accumulation in STT logs
- Verify audio format: PCM int16, 16kHz mono
- Try speaking louder or more clearly
- Check VAD threshold (may need adjustment)
### Transcription empty or gibberish
- Verify Whisper model loaded (check STT startup logs)
- Check GPU VRAM usage: `nvidia-smi`
- Ensure audio segments are at least 1-2 seconds long
- Try speaking more clearly with less background noise
### Interruption not working
- Verify Miku is actually speaking (check miku_speaking flag)
- Check VAD probability in logs (must be > 0.7)
- Verify /interrupt endpoint returns success
- Check RVC logs for flushed chunks
### Multiple users causing issues
- Check STT logs for per-user session management
- Verify each user has separate STTClient instance
- Check for resource contention on GTX 1660
## Next Steps After Testing
### Phase 4C: LLM KV Cache Precomputation
- Use partial transcripts to start LLM generation early
- Precompute KV cache for common phrases
- Reduce latency between speech end and response start
### Phase 4D: Multi-User Refinement
- Queue management for multiple simultaneous speakers
- Priority system for interruptions
- Resource allocation for multiple Whisper requests
### Phase 4E: Latency Optimization
- Profile each stage of the pipeline
- Optimize audio chunk sizes
- Reduce WebSocket message overhead
- Tune Whisper beam search parameters
- Implement VAD lookahead for quicker detection
## Hardware Utilization
### Current Allocation
- **AMD RX 6800**: LLaMA text models (idle during listen/speak)
- **GTX 1660**:
- Listen phase: Faster-Whisper (1.3GB VRAM)
- Speak phase: Soprano TTS + RVC (time-multiplexed)
- **CPU**: Silero VAD, audio preprocessing
### Expected Performance
- VAD latency: <50ms (CPU processing)
- Transcription latency: 200-500ms (Whisper inference)
- LLM streaming: 20-30 tokens/sec (RX 6800)
- TTS synthesis: Real-time (GTX 1660)
- Total latency (speech → response): 1-2 seconds
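To verify these numbers during testing (and for the latency item in the checklist below), a simple stopwatch helper can be wrapped around each stage. Where the timers actually go in the bot is an implementation detail; this is only a sketch:
```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    """Record wall-clock time for one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

# Example placement in the response path:
#   with stage("transcription"): ...
#   with stage("llm_first_token"): ...
#   with stage("tts_first_audio"): ...
# then log `timings` after each exchange
```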
## Testing Checklist
Before marking Phase 4B as complete:
- [ ] Test basic STT connection with `!miku listen`
- [ ] Verify VAD detects speech start/end correctly
- [ ] Confirm transcripts are accurate and complete
- [ ] Test LLM voice response generation works
- [ ] Verify interruption cancels TTS playback
- [ ] Check multi-user handling (if possible)
- [ ] Verify resource cleanup on `!miku stop-listening`
- [ ] Test edge cases (silence, background noise, overlapping speech)
- [ ] Profile latencies at each stage
- [ ] Document any configuration tuning needed
---
**Status**: Code deployed, ready for user testing! 🎤🤖