moved AI generated readmes to readme folder (may delete)
readmes/VOICE_TO_VOICE_REFERENCE.md
# Voice-to-Voice Quick Reference

## Complete Pipeline Status ✅

All phases complete and deployed!

## Phase Completion Status

### ✅ Phase 1: Voice Connection (COMPLETE)
- Discord voice channel connection
- Audio playback via discord.py
- Resource management and cleanup

### ✅ Phase 2: Audio Streaming (COMPLETE)
- Soprano TTS server (GTX 1660)
- RVC voice conversion
- Real-time streaming via WebSocket
- Token-by-token synthesis

### ✅ Phase 3: Text-to-Voice (COMPLETE)
- LLaMA text generation (AMD RX 6800)
- Streaming token pipeline
- TTS integration with `!miku say`
- Natural conversation flow

### ✅ Phase 4A: STT Container (COMPLETE)
- Silero VAD on CPU
- Faster-Whisper on GTX 1660
- WebSocket server at port 8001
- Per-user session management
- Chunk buffering for VAD

### ✅ Phase 4B: Bot STT Integration (COMPLETE - READY FOR TESTING)
- Discord audio capture
- Opus decode + resampling
- STT client WebSocket integration
- Voice commands: `!miku listen`, `!miku stop-listening`
- LLM voice response generation
- Interruption detection and cancellation
- `/interrupt` endpoint in RVC API

## Quick Start Commands

### Setup
```bash
!miku join     # Join your voice channel
!miku listen   # Start listening to your voice
```

### Usage
- **Speak** into your microphone
- Miku will **transcribe** your speech
- Miku will **respond** with voice
- **Interrupt** her by speaking while she's talking

### Teardown
```bash
!miku stop-listening   # Stop listening to your voice
!miku leave            # Leave voice channel
```

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                           USER INPUT                            │
└─────────────────────────────────────────────────────────────────┘
                    │
                    │ Discord Voice (Opus 48kHz)
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ VoiceReceiver (discord.sinks.Sink)                          │ │
│ │ - Opus decode → PCM                                         │ │
│ │ - Stereo → Mono                                             │ │
│ │ - Resample 48kHz → 16kHz                                    │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
│                   │ PCM int16, 16kHz, 20ms chunks               │
│ ┌─────────────────▼───────────────────────────────────────────┐ │
│ │ STTClient (WebSocket)                                       │ │
│ │ - Sends audio to miku-stt                                   │ │
│ │ - Receives VAD events, transcripts                          │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────────────┘
                    │ ws://miku-stt:8001/ws/stt/{user_id}
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-stt Container                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ VADProcessor (Silero VAD 5.1.2) [CPU]                       │ │
│ │ - Chunk buffering (512 samples min)                         │ │
│ │ - Speech detection (threshold=0.5)                          │ │
│ │ - Events: speech_start, speaking, speech_end                │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
│                   │ Audio segments                              │
│ ┌─────────────────▼───────────────────────────────────────────┐ │
│ │ WhisperTranscriber (Faster-Whisper 1.2.1) [GTX 1660]        │ │
│ │ - Model: small (1.3GB VRAM)                                 │ │
│ │ - Transcribes speech segments                               │ │
│ │ - Returns: partial & final transcripts                      │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────────────┘
                    │ JSON events via WebSocket
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ voice_manager.py Callbacks                                  │ │
│ │ - on_vad_event() → Log VAD states                           │ │
│ │ - on_partial_transcript() → Show typing indicator           │ │
│ │ - on_final_transcript() → Generate LLM response             │ │
│ │ - on_interruption() → Cancel TTS playback                   │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
│                   │ Final transcript text                       │
│ ┌─────────────────▼───────────────────────────────────────────┐ │
│ │ _generate_voice_response()                                  │ │
│ │ - Build LLM prompt with conversation history                │ │
│ │ - Stream LLM response                                       │ │
│ │ - Send tokens to TTS                                        │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────────────┘
                    │ HTTP streaming to LLaMA server
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                 llama-cpp-server (AMD RX 6800)                  │
│ - Streaming text generation                                     │
│ - 20-30 tokens/sec                                              │
│ - Returns: {"delta": {"content": "token"}}                      │
└───────────────────┬─────────────────────────────────────────────┘
                    │ Token stream
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ audio_source.send_token()                                   │ │
│ │ - Buffers tokens                                            │ │
│ │ - Sends to RVC WebSocket                                    │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────────────┘
                    │ ws://miku-rvc-api:8765/ws/stream
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                     miku-rvc-api Container                      │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Soprano TTS Server (miku-soprano-tts) [GTX 1660]            │ │
│ │ - Text → Audio synthesis                                    │ │
│ │ - 32kHz output                                              │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
│                   │ Raw audio via ZMQ                           │
│ ┌─────────────────▼───────────────────────────────────────────┐ │
│ │ RVC Voice Conversion [GTX 1660]                             │ │
│ │ - Voice cloning & pitch shifting                            │ │
│ │ - 48kHz output                                              │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────────────┘
                    │ PCM float32, 48kHz
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ discord.VoiceClient                                         │ │
│ │ - Plays audio in voice channel                              │ │
│ │ - Can be interrupted by user speech                         │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                           USER OUTPUT                           │
│                     (Miku's voice response)                     │
└─────────────────────────────────────────────────────────────────┘
```
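
The first hop in the diagram, the VoiceReceiver stage, reduces Discord's 48kHz stereo PCM to the 16kHz mono that the STT container expects. A minimal sketch of that conversion (`prepare_for_stt` is a hypothetical helper; the real sink likely uses a proper resampler rather than this naive every-third-sample decimation):

```python
import struct

def prepare_for_stt(pcm_48k_stereo: bytes) -> bytes:
    """Downmix 48kHz stereo int16 PCM to 16kHz mono int16 PCM."""
    n = len(pcm_48k_stereo) // 2
    samples = struct.unpack(f"<{n}h", pcm_48k_stereo)
    # Average left/right channels into one mono stream
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, n, 2)]
    # Keep every third sample: 48kHz -> 16kHz (no anti-alias filter)
    mono_16k = mono[::3]
    return struct.pack(f"<{len(mono_16k)}h", *mono_16k)

# One 20ms Discord frame: 960 frames x 2 channels x 2 bytes = 3840 bytes
chunk = bytes(3840)
print(len(prepare_for_stt(chunk)))  # 640 bytes = 320 samples at 16kHz
```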

## Interruption Flow

```
User speaks during Miku's TTS
              │
              ▼
VAD detects speech (probability > 0.7)
              │
              ▼
STT sends interruption event
              │
              ▼
on_user_interruption() callback
              │
              ▼
_cancel_tts() → voice_client.stop()
              │
              ▼
POST http://miku-rvc-api:8765/interrupt
              │
              ▼
Flush ZMQ socket + clear RVC buffers
              │
              ▼
Miku stops speaking, ready for new input
```
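
The gate in this flow uses a higher VAD bar (0.7) while Miku is speaking than the normal speech threshold (0.5). A sketch of that decision logic (`classify_vad` is a hypothetical helper; the constant names mirror the VAD settings later in this doc):

```python
SPEECH_THRESHOLD = 0.5     # normal speech detection (see VAD Settings)
INTERRUPT_THRESHOLD = 0.7  # stricter bar while TTS is playing

def classify_vad(probability: float, miku_speaking: bool) -> str:
    """Map a Silero VAD probability to an event, per the flow above."""
    if miku_speaking and probability > INTERRUPT_THRESHOLD:
        return "interruption"  # triggers _cancel_tts() + POST /interrupt
    if probability > SPEECH_THRESHOLD:
        return "speech"
    return "silence"

print(classify_vad(0.9, miku_speaking=True))   # interruption
print(classify_vad(0.6, miku_speaking=True))   # speech, not enough to interrupt
print(classify_vad(0.3, miku_speaking=False))  # silence
```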

## Hardware Utilization

### Listen Phase (User Speaking)
- **CPU**: Silero VAD processing
- **GTX 1660**: Faster-Whisper transcription (1.3GB VRAM)
- **AMD RX 6800**: Idle

### Think Phase (LLM Generation)
- **CPU**: Idle
- **GTX 1660**: Idle
- **AMD RX 6800**: LLaMA inference (20-30 tokens/sec)

### Speak Phase (Miku Responding)
- **CPU**: Silero VAD monitoring for interruption
- **GTX 1660**: Soprano TTS + RVC synthesis
- **AMD RX 6800**: Idle
## Performance Metrics
|
||||
|
||||
### Expected Latencies
|
||||
| Stage | Latency |
|
||||
|--------------------------|--------------|
|
||||
| Discord audio capture | ~20ms |
|
||||
| Opus decode + resample | <10ms |
|
||||
| VAD processing | <50ms |
|
||||
| Whisper transcription | 200-500ms |
|
||||
| LLM token generation | 33-50ms/tok |
|
||||
| TTS synthesis | Real-time |
|
||||
| **Total (speech → response)** | **1-2s** |
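
The 1-2s total can be sanity-checked by summing the per-stage figures. A rough budget, assuming (hypothetically) that TTS starts after the first ~5 tokens at ~40ms/token, the midpoint of the 33-50ms range:

```python
# Rough time-to-first-audio budget (ms) from the table above.
capture = 20           # Discord audio capture
decode = 10            # Opus decode + resample (upper bound)
vad = 50               # VAD processing (upper bound)
whisper = 500          # Whisper transcription (upper bound)
first_tokens = 5 * 40  # ~5 tokens buffered before TTS starts
total_ms = capture + decode + vad + whisper + first_tokens
print(total_ms)  # 780 -> time to first audio; full replies land in 1-2s
```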

### VRAM Usage

| GPU         | Component     | VRAM   |
|-------------|---------------|--------|
| AMD RX 6800 | LLaMA 8B Q4   | ~5.5GB |
| GTX 1660    | Whisper small | 1.3GB  |
| GTX 1660    | Soprano + RVC | ~3GB   |
## Key Files
|
||||
|
||||
### Bot Container
|
||||
- `bot/utils/stt_client.py` - WebSocket client for STT
|
||||
- `bot/utils/voice_receiver.py` - Discord audio sink
|
||||
- `bot/utils/voice_manager.py` - Voice session with STT integration
|
||||
- `bot/commands/voice.py` - Voice commands including listen/stop-listening
|
||||
|
||||
### STT Container
|
||||
- `stt/vad_processor.py` - Silero VAD with chunk buffering
|
||||
- `stt/whisper_transcriber.py` - Faster-Whisper transcription
|
||||
- `stt/stt_server.py` - FastAPI WebSocket server
|
||||
|
||||
### RVC Container
|
||||
- `soprano_to_rvc/soprano_rvc_api.py` - TTS + RVC pipeline with /interrupt endpoint
|
||||
|
||||
## Configuration Files
|
||||
|
||||
### docker-compose.yml
|
||||
- Network: `miku-network` (all containers)
|
||||
- Ports:
|
||||
- miku-bot: 8081 (API)
|
||||
- miku-rvc-api: 8765 (TTS)
|
||||
- miku-stt: 8001 (STT)
|
||||
- llama-cpp-server: 8080 (LLM)
|
||||
|
||||
### VAD Settings (stt/vad_processor.py)
|
||||
```python
|
||||
threshold = 0.5 # Speech detection sensitivity
|
||||
min_speech = 250 # Minimum speech duration (ms)
|
||||
min_silence = 500 # Silence before speech_end (ms)
|
||||
interruption_threshold = 0.7 # Probability for interruption
|
||||
```
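
Silero VAD consumes fixed 512-sample chunks at 16kHz (~32ms), while Discord delivers 20ms packets (320 samples), hence the chunk buffering noted in Phase 4A. A sketch of the buffering idea (hypothetical `ChunkBuffer`; the real `stt/vad_processor.py` may structure this differently):

```python
class ChunkBuffer:
    """Accumulate PCM samples and emit fixed 512-sample chunks for VAD."""
    CHUNK = 512  # minimum chunk size Silero VAD accepts at 16kHz

    def __init__(self):
        self._buf = []

    def feed(self, samples):
        """Add samples; return a list of complete 512-sample chunks."""
        self._buf.extend(samples)
        chunks = []
        while len(self._buf) >= self.CHUNK:
            chunks.append(self._buf[:self.CHUNK])
            del self._buf[:self.CHUNK]
        return chunks

buf = ChunkBuffer()
print(len(buf.feed([0] * 320)))  # 0 - one 20ms packet is not enough
print(len(buf.feed([0] * 320)))  # 1 - 640 buffered, one chunk emitted
```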

### Whisper Settings (stt/whisper_transcriber.py)
```python
model = "small"           # 1.3GB VRAM
device = "cuda"
compute_type = "float16"
beam_size = 5
patience = 1.0
```

## Testing Commands

```bash
# Check all container health
curl http://localhost:8001/health   # STT
curl http://localhost:8765/health   # RVC
curl http://localhost:8080/health   # LLM

# Monitor logs
docker logs -f miku-bot | grep -E "(listen|transcript|interrupt)"
docker logs -f miku-stt
docker logs -f miku-rvc-api | grep interrupt

# Test interrupt endpoint
curl -X POST http://localhost:8765/interrupt

# Check GPU usage
nvidia-smi
```

## Troubleshooting

| Issue | Solution |
|-------|----------|
| No audio from Discord | Check bot has Connect and Speak permissions |
| VAD not detecting | Speak louder, check microphone, lower threshold |
| Empty transcripts | Speak for at least 1-2 seconds, check Whisper model |
| Interruption not working | Verify `miku_speaking=true`, check VAD probability |
| High latency | Profile each stage, check GPU utilization |

## Next Features (Phase 4C+)

- [ ] KV cache precomputation from partial transcripts
- [ ] Multi-user simultaneous conversation
- [ ] Latency optimization (<1s total)
- [ ] Voice activity history and analytics
- [ ] Emotion detection from speech patterns
- [ ] Context-aware interruption handling

---

**Ready to test!** Use `!miku join` → `!miku listen` → speak to Miku 🎤