add: absorb soprano_to_rvc as regular subdirectory
Voice conversion pipeline (Soprano TTS → RVC) with Docker support. Previously tracked as a bare gitlink; removed the nested .git/ directories and absorbed the tree into the main repo for unified tracking. Includes Soprano TTS, RVC WebUI integration, Docker configs, the WebSocket API, and benchmark scripts. Updated .gitignore to exclude large model weights (*.pth, *.pt, *.onnx, *.index): 287 files, with 3.1GB of ML weights properly excluded.
soprano_to_rvc/WEBSOCKET_API.md (new file, 429 lines)
# WebSocket Streaming API Documentation

## Overview

The WebSocket endpoint (`/ws/stream`) enables real-time, token-by-token TTS streaming, well suited to Discord voice chat driven by streaming LLM responses.

## Endpoint

```
ws://localhost:8765/ws/stream
```

## Protocol

### Client → Server Messages

**Send Token:**
```json
{
  "token": "Hello",
  "pitch_shift": 0
}
```

**Flush Buffer:**
```json
{
  "flush": true
}
```
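For quick experimentation, the two client message shapes above can be wrapped in tiny helpers. A minimal sketch; the function names here are illustrative, not part of the API:

```python
import json

def token_message(token: str, pitch_shift: int = 0) -> str:
    """Build a 'send token' message as a JSON string."""
    return json.dumps({"token": token, "pitch_shift": pitch_shift})

def flush_message() -> str:
    """Build an explicit 'flush buffer' message."""
    return json.dumps({"flush": True})
```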
### Server → Client Messages

**Audio Data (Binary):**
- Format: PCM float32
- Sample Rate: 48kHz
- Channels: Mono
- Byte order: Native (little-endian on x86)

**Error (JSON):**
```json
{
  "error": "Error message"
}
```
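Because the audio format is fixed, the bandwidth of one stream is easy to work out from the numbers above:

```python
# Mono float32 at 48 kHz is a modest, fixed bandwidth
sample_rate = 48_000   # samples per second
bytes_per_sample = 4   # float32
channels = 1           # mono
bytes_per_second = sample_rate * bytes_per_sample * channels
print(bytes_per_second)  # 192000 bytes/s, i.e. 187.5 KiB/s per stream
```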
## Behavior

### Automatic Synthesis Triggers

The server automatically synthesizes and sends audio when:

1. **Sentence boundary**: `.` `!` `?` `。` `!` `?`
2. **Pause boundary**: `,` `;` `,` `、`
3. **Buffer limit**: more than 200 characters accumulated
4. **Explicit flush**: the client sends `{"flush": true}`
### Example Flow

```
Client: {"token": "Hello"}
Client: {"token": " "}
Client: {"token": "world"}
Client: {"token": "!"}
        ↓ Server detects '!' (sentence boundary)
Server: [binary audio data for "Hello world!"]

Client: {"token": " "}
Client: {"token": "How"}
Client: {"token": " "}
Client: {"token": "are"}
Client: {"token": " "}
Client: {"token": "you"}
Client: {"token": "?"}
        ↓ Server detects '?' (sentence boundary)
Server: [binary audio data for "How are you?"]
```
## Integration Examples

### Python with websockets

```python
import asyncio
import json

import numpy as np
import websockets

async def speak(text_stream):
    async with websockets.connect('ws://localhost:8765/ws/stream') as ws:
        async for token in text_stream:
            # Send token
            await ws.send(json.dumps({"token": token, "pitch_shift": 0}))

            # Receive audio (non-blocking)
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=0.5)
                audio = np.frombuffer(audio_bytes, dtype=np.float32)
                # Play audio...
            except asyncio.TimeoutError:
                continue  # No audio yet

        # Flush remaining
        await ws.send(json.dumps({"flush": True}))
        while True:
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=1.0)
                # Process final chunks...
            except asyncio.TimeoutError:
                break
```
### Discord.py Integration

```python
import asyncio
import io
import json

import discord
from discord.ext import commands
import websockets

class MikuVoice(commands.Cog):
    def __init__(self, bot):
        self.bot = bot
        self.ws_url = 'ws://localhost:8765/ws/stream'

    async def stream_to_discord(self, voice_client, text_stream):
        """
        Stream TTS audio to a Discord voice channel as LLM tokens arrive.

        Args:
            voice_client: discord.VoiceClient
            text_stream: Async generator yielding text tokens
        """
        async with websockets.connect(self.ws_url) as ws:
            # Queue for audio chunks
            audio_queue = asyncio.Queue()

            # Task to play audio from the queue
            async def player():
                while True:
                    audio_bytes = await audio_queue.get()
                    if audio_bytes is None:  # Sentinel
                        break

                    # Create a Discord audio source; the raw-PCM format flags
                    # must go in before_options so ffmpeg applies them to the
                    # piped input rather than the output
                    audio_source = discord.FFmpegPCMAudio(
                        io.BytesIO(audio_bytes),
                        pipe=True,
                        before_options='-f f32le -ar 48000 -ac 1'
                    )

                    # Play (wait for the previous chunk to finish)
                    while voice_client.is_playing():
                        await asyncio.sleep(0.1)

                    voice_client.play(audio_source)

            # Start player task
            player_task = asyncio.create_task(player())

            # Stream tokens
            try:
                async for token in text_stream:
                    # Send token to TTS
                    await ws.send(json.dumps({
                        "token": token,
                        "pitch_shift": 0
                    }))

                    # Receive audio (non-blocking)
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(),
                            timeout=0.5
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        continue

                # Flush remaining buffer
                await ws.send(json.dumps({"flush": True}))

                # Get remaining audio
                while True:
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(),
                            timeout=1.0
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        break

            finally:
                # Signal the player to stop
                await audio_queue.put(None)
                await player_task

    @commands.command()
    async def speak(self, ctx, *, prompt: str):
        """Make Miku speak in the voice channel with streaming TTS"""

        # Connect to voice if needed
        if not ctx.voice_client:
            if not ctx.author.voice:
                await ctx.send("You need to be in a voice channel!")
                return
            await ctx.author.voice.channel.connect()

        # Get the LLM response stream (example with llamacpp)
        async def llm_stream():
            # Replace with your actual LLM streaming code
            response = await your_llm_client.stream(prompt)
            async for token in response:
                yield token

        # Stream to Discord
        await self.stream_to_discord(ctx.voice_client, llm_stream())
        await ctx.send("✓ Done speaking!")

async def setup(bot):
    await bot.add_cog(MikuVoice(bot))
```
### JavaScript/Node.js

```javascript
const WebSocket = require('ws');

async function streamTTS(tokens) {
    const ws = new WebSocket('ws://localhost:8765/ws/stream');

    ws.on('open', () => {
        // Send tokens
        for (const token of tokens) {
            ws.send(JSON.stringify({
                token: token,
                pitch_shift: 0
            }));
        }

        // Flush
        ws.send(JSON.stringify({ flush: true }));
    });

    ws.on('message', (data, isBinary) => {
        if (!isBinary) {
            // Text frames carry JSON error messages
            console.error(JSON.parse(data.toString()).error);
            return;
        }

        // data is a Buffer containing PCM float32 audio
        const samples = new Float32Array(
            data.buffer,
            data.byteOffset,
            data.length / 4
        );

        // Play audio...
        playAudio(samples);
    });
}
```
## Performance Characteristics

### Latency Breakdown

**Token-by-token (recommended):**
```
LLM token → Bot (5ms) → WebSocket (5ms) → Soprano (80ms) → RVC (100ms) → Discord (20ms)
Total: ~210ms from token to sound
```

**Sentence-by-sentence:**
```
Full sentence (1000ms) → WebSocket (5ms) → Soprano (200ms) → RVC (300ms) → Discord (20ms)
Total: ~1525ms from start to sound
```

### Throughput

- **Audio generation**: ~0.95x realtime (GPU accelerated)
- **Network overhead**: <1% (binary protocol)
- **Concurrent connections**: 10+ supported
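As a sanity check, the per-stage numbers in the breakdowns above do sum to the quoted totals:

```python
# Stage latencies in milliseconds, taken from the breakdowns above
token_path = [5, 5, 80, 100, 20]         # token-by-token
sentence_path = [1000, 5, 200, 300, 20]  # sentence-by-sentence
print(sum(token_path))     # 210
print(sum(sentence_path))  # 1525
```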
## Audio Format Details

### Raw PCM Format

The WebSocket sends raw PCM audio data:

```python
import numpy as np

# Receiving and converting to numpy
audio_bytes = await websocket.recv()
audio = np.frombuffer(audio_bytes, dtype=np.float32)

# Audio properties
sample_rate = 48000        # Hz
channels = 1               # Mono
dtype = np.float32         # 32-bit float
value_range = [-1.0, 1.0]  # Normalized
```
### Converting to Other Formats

**To WAV:**
```python
import wave

import numpy as np

# The wave module writes integer PCM, so convert float32 → int16 first
audio = np.frombuffer(audio_bytes, dtype=np.float32)
pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)

with wave.open('output.wav', 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes = int16
    wav.setframerate(48000)
    wav.writeframes(pcm16.tobytes())
```
**To Discord Opus:**
```python
import io

import discord
import numpy as np

# Discord expects PCM s16le (16-bit signed integer) at 48kHz stereo,
# so convert and duplicate the mono channel
audio_float = np.frombuffer(audio_bytes, dtype=np.float32)
audio_int16 = (np.clip(audio_float, -1.0, 1.0) * 32767).astype(np.int16)
audio_stereo = np.repeat(audio_int16, 2)  # mono → interleaved stereo
audio_source = discord.PCMAudio(io.BytesIO(audio_stereo.tobytes()))
```
## Best Practices

### 1. Token Buffering

Don't send every character individually; send word-by-word or phrase-by-phrase:

```python
# ✗ Bad: Too granular
for char in text:
    await ws.send(json.dumps({"token": char}))

# ✓ Good: Word-by-word
for word in text.split():
    await ws.send(json.dumps({"token": " " + word}))
```
### 2. Error Handling

Always handle disconnections gracefully:

```python
import logging

logger = logging.getLogger(__name__)

try:
    async with websockets.connect(url) as ws:
        ...  # streaming code
except websockets.exceptions.ConnectionClosed:
    logger.error("Connection lost, reconnecting...")
    # Retry logic...
```
### 3. Backpressure

If Discord's audio buffer fills, slow down token sending:

```python
if voice_client.is_playing():
    await asyncio.sleep(0.1)  # Wait for buffer space
```
### 4. Flush at End

Always flush to ensure all buffered audio is synthesized and sent:

```python
# After sending all tokens
await ws.send(json.dumps({"flush": True}))

# Wait for remaining audio
try:
    while True:
        audio_bytes = await asyncio.wait_for(ws.recv(), timeout=1.0)
        # Process each remaining chunk...
except asyncio.TimeoutError:
    pass  # All audio received
```
## Troubleshooting

### No Audio Received

**Problem**: Tokens are sent but no audio comes back.

**Solutions**:
1. Check whether you're hitting sentence boundaries (`.` `!` `?`)
2. Try sending `{"flush": true}` manually
3. Verify the token format: `{"token": "text", "pitch_shift": 0}`

### Audio Choppy/Gaps

**Problem**: Audio plays but with interruptions.

**Solutions**:
1. Increase the buffer size on the Discord side
2. Send tokens in larger chunks (word-by-word, not char-by-char)
3. Check network latency: `ping localhost`

### Connection Drops

**Problem**: WebSocket disconnects unexpectedly.

**Solutions**:
1. Implement reconnection logic with exponential backoff
2. Send periodic ping/pong frames
3. Check the Docker container logs: `docker logs miku-rvc-api`
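The exponential-backoff advice above can be sketched as a delay schedule; the base, cap, and jitter values here are illustrative assumptions, not server configuration:

```python
import random

def backoff_delays(base=0.5, cap=30.0, factor=2.0):
    """Yield an unbounded sequence of capped, jittered reconnect delays."""
    delay = base
    while True:
        # Small jitter avoids synchronized reconnect storms
        yield delay + random.uniform(0, delay * 0.1)
        delay = min(delay * factor, cap)
```

A reconnect loop would `await asyncio.sleep(next(delays))` before each connection attempt and recreate the generator after a successful connect.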
### High Latency

**Problem**: Long delay between token and audio.

**Solutions**:
1. Verify the GPU is being used: check the logs for "Found GPU AMD Radeon RX 6800"
2. Reduce the sentence buffer triggers (adjust in code)
3. Use smaller chunks: `chunk_size=5` in the Soprano config

## Comparison with HTTP API

| Feature | WebSocket (`/ws/stream`) | HTTP (`/api/speak`) |
|---------|--------------------------|---------------------|
| Latency | ~200ms | ~1500ms |
| Streaming | ✓ Token-by-token | ✗ Request-response |
| Overhead | 5ms per message | 100ms per request |
| Connection | Persistent | Per-request |
| Backpressure | ✓ Bidirectional | ✗ One-way |
| Complexity | Medium | Low |
| Use Case | Real-time voice chat | Simple TTS requests |

**Recommendation**: Use the WebSocket API for the Discord bot; keep HTTP for testing and debugging.