add: absorb soprano_to_rvc as regular subdirectory
Voice conversion pipeline (Soprano TTS → RVC) with Docker support. Previously tracked as a bare gitlink; removed the nested .git/ directories and absorbed the tree into the main repo for unified tracking. Includes Soprano TTS, RVC WebUI integration, Docker configs, the WebSocket API, and benchmark scripts. Updated .gitignore to exclude large model weights (*.pth, *.pt, *.onnx, *.index): 287 files, with 3.1GB of ML weights properly excluded.
soprano_to_rvc/WEBSOCKET_API.md (new file, 429 lines)
# WebSocket Streaming API Documentation

## Overview

The WebSocket endpoint (`/ws/stream`) enables real-time, token-by-token TTS streaming, well suited to Discord voice chat driven by streaming LLM responses.

## Endpoint

```
ws://localhost:8765/ws/stream
```

## Protocol

### Client → Server Messages

**Send Token:**
```json
{
  "token": "Hello",
  "pitch_shift": 0
}
```

**Flush Buffer:**
```json
{
  "flush": true
}
```
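For quick experimentation, the two client message shapes above can be wrapped in tiny helpers. A minimal sketch; the function names here are illustrative, not part of the API:

```python
import json

def token_message(token: str, pitch_shift: int = 0) -> str:
    """Build a 'send token' message as a JSON string."""
    return json.dumps({"token": token, "pitch_shift": pitch_shift})

def flush_message() -> str:
    """Build an explicit 'flush buffer' message."""
    return json.dumps({"flush": True})
```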
### Server → Client Messages

**Audio Data (Binary):**
- Format: PCM float32
- Sample Rate: 48kHz
- Channels: Mono
- Byte order: Native (little-endian on x86)

**Error (JSON):**
```json
{
  "error": "Error message"
}
```
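Because the audio format is fixed, the bandwidth of one stream is easy to work out from the numbers above:

```python
# Mono float32 at 48 kHz is a modest, fixed bandwidth
sample_rate = 48_000   # samples per second
bytes_per_sample = 4   # float32
channels = 1           # mono
bytes_per_second = sample_rate * bytes_per_sample * channels
print(bytes_per_second)  # 192000 bytes/s, i.e. 187.5 KiB/s per stream
```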
## Behavior

### Automatic Synthesis Triggers

The server automatically synthesizes and sends audio when:

1. **Sentence boundary**: `.` `!` `?` `。` `!` `?`
2. **Pause boundary**: `,` `;` `,` `、`
3. **Buffer limit**: more than 200 characters accumulated
4. **Explicit flush**: the client sends `{"flush": true}`
### Example Flow

```
Client: {"token": "Hello"}
Client: {"token": " "}
Client: {"token": "world"}
Client: {"token": "!"}
        ↓ Server detects '!' (sentence boundary)
Server: [binary audio data for "Hello world!"]

Client: {"token": " "}
Client: {"token": "How"}
Client: {"token": " "}
Client: {"token": "are"}
Client: {"token": " "}
Client: {"token": "you"}
Client: {"token": "?"}
        ↓ Server detects '?' (sentence boundary)
Server: [binary audio data for "How are you?"]
```
## Integration Examples

### Python with websockets

```python
import asyncio
import json

import numpy as np
import websockets

async def speak(text_stream):
    async with websockets.connect('ws://localhost:8765/ws/stream') as ws:
        async for token in text_stream:
            # Send token
            await ws.send(json.dumps({"token": token, "pitch_shift": 0}))

            # Receive audio (non-blocking)
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=0.5)
                audio = np.frombuffer(audio_bytes, dtype=np.float32)
                # Play audio...
            except asyncio.TimeoutError:
                continue  # No audio yet

        # Flush remaining
        await ws.send(json.dumps({"flush": True}))
        while True:
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=1.0)
                # Process final chunks...
            except asyncio.TimeoutError:
                break
```
### Discord.py Integration

```python
import asyncio
import io
import json

import discord
from discord.ext import commands
import websockets

class MikuVoice(commands.Cog):
    def __init__(self, bot):
        self.bot = bot
        self.ws_url = 'ws://localhost:8765/ws/stream'

    async def stream_to_discord(self, voice_client, text_stream):
        """
        Stream TTS audio to a Discord voice channel as LLM tokens arrive.

        Args:
            voice_client: discord.VoiceClient
            text_stream: Async generator yielding text tokens
        """
        async with websockets.connect(self.ws_url) as ws:
            # Queue for audio chunks
            audio_queue = asyncio.Queue()

            # Task to play audio from the queue
            async def player():
                while True:
                    audio_bytes = await audio_queue.get()
                    if audio_bytes is None:  # Sentinel
                        break

                    # Create a Discord audio source; the raw-PCM format flags
                    # must go in before_options so ffmpeg applies them to the
                    # piped input rather than the output
                    audio_source = discord.FFmpegPCMAudio(
                        io.BytesIO(audio_bytes),
                        pipe=True,
                        before_options='-f f32le -ar 48000 -ac 1'
                    )

                    # Play (wait for the previous chunk to finish)
                    while voice_client.is_playing():
                        await asyncio.sleep(0.1)

                    voice_client.play(audio_source)

            # Start player task
            player_task = asyncio.create_task(player())

            # Stream tokens
            try:
                async for token in text_stream:
                    # Send token to TTS
                    await ws.send(json.dumps({
                        "token": token,
                        "pitch_shift": 0
                    }))

                    # Receive audio (non-blocking)
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(),
                            timeout=0.5
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        continue

                # Flush remaining buffer
                await ws.send(json.dumps({"flush": True}))

                # Get remaining audio
                while True:
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(),
                            timeout=1.0
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        break

            finally:
                # Signal the player to stop
                await audio_queue.put(None)
                await player_task

    @commands.command()
    async def speak(self, ctx, *, prompt: str):
        """Make Miku speak in the voice channel with streaming TTS"""

        # Connect to voice if needed
        if not ctx.voice_client:
            if not ctx.author.voice:
                await ctx.send("You need to be in a voice channel!")
                return
            await ctx.author.voice.channel.connect()

        # Get the LLM response stream (example with llamacpp)
        async def llm_stream():
            # Replace with your actual LLM streaming code
            response = await your_llm_client.stream(prompt)
            async for token in response:
                yield token

        # Stream to Discord
        await self.stream_to_discord(ctx.voice_client, llm_stream())
        await ctx.send("✓ Done speaking!")

async def setup(bot):
    await bot.add_cog(MikuVoice(bot))
```
### JavaScript/Node.js

```javascript
const WebSocket = require('ws');

async function streamTTS(tokens) {
    const ws = new WebSocket('ws://localhost:8765/ws/stream');

    ws.on('open', () => {
        // Send tokens
        for (const token of tokens) {
            ws.send(JSON.stringify({
                token: token,
                pitch_shift: 0
            }));
        }

        // Flush
        ws.send(JSON.stringify({ flush: true }));
    });

    ws.on('message', (data, isBinary) => {
        if (!isBinary) {
            // Text frames carry JSON error messages
            console.error(JSON.parse(data.toString()).error);
            return;
        }

        // data is a Buffer containing PCM float32 audio
        const samples = new Float32Array(
            data.buffer,
            data.byteOffset,
            data.length / 4
        );

        // Play audio...
        playAudio(samples);
    });
}
```
## Performance Characteristics

### Latency Breakdown

**Token-by-token (recommended):**
```
LLM token → Bot (5ms) → WebSocket (5ms) → Soprano (80ms) → RVC (100ms) → Discord (20ms)
Total: ~210ms from token to sound
```

**Sentence-by-sentence:**
```
Full sentence (1000ms) → WebSocket (5ms) → Soprano (200ms) → RVC (300ms) → Discord (20ms)
Total: ~1525ms from start to sound
```

### Throughput

- **Audio generation**: ~0.95x realtime (GPU accelerated)
- **Network overhead**: <1% (binary protocol)
- **Concurrent connections**: 10+ supported
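As a sanity check, the per-stage numbers in the breakdowns above do sum to the quoted totals:

```python
# Stage latencies in milliseconds, taken from the breakdowns above
token_path = [5, 5, 80, 100, 20]         # token-by-token
sentence_path = [1000, 5, 200, 300, 20]  # sentence-by-sentence
print(sum(token_path))     # 210
print(sum(sentence_path))  # 1525
```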
## Audio Format Details

### Raw PCM Format

The WebSocket sends raw PCM audio data:

```python
import numpy as np

# Receiving and converting to numpy
audio_bytes = await websocket.recv()
audio = np.frombuffer(audio_bytes, dtype=np.float32)

# Audio properties
sample_rate = 48000        # Hz
channels = 1               # Mono
dtype = np.float32         # 32-bit float
value_range = [-1.0, 1.0]  # Normalized
```
### Converting to Other Formats

**To WAV:**
```python
import wave

import numpy as np

# The wave module writes integer PCM, so convert float32 → int16 first
audio = np.frombuffer(audio_bytes, dtype=np.float32)
pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)

with wave.open('output.wav', 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes = int16
    wav.setframerate(48000)
    wav.writeframes(pcm16.tobytes())
```
**To Discord Opus:**
```python
import io

import discord
import numpy as np

# Discord expects PCM s16le (16-bit signed integer) at 48kHz stereo,
# so convert and duplicate the mono channel
audio_float = np.frombuffer(audio_bytes, dtype=np.float32)
audio_int16 = (np.clip(audio_float, -1.0, 1.0) * 32767).astype(np.int16)
audio_stereo = np.repeat(audio_int16, 2)  # mono → interleaved stereo
audio_source = discord.PCMAudio(io.BytesIO(audio_stereo.tobytes()))
```
## Best Practices

### 1. Token Buffering

Don't send every character individually; send word-by-word or phrase-by-phrase:

```python
# ✗ Bad: Too granular
for char in text:
    await ws.send(json.dumps({"token": char}))

# ✓ Good: Word-by-word
for word in text.split():
    await ws.send(json.dumps({"token": " " + word}))
```
### 2. Error Handling

Always handle disconnections gracefully:

```python
import logging

logger = logging.getLogger(__name__)

try:
    async with websockets.connect(url) as ws:
        ...  # streaming code
except websockets.exceptions.ConnectionClosed:
    logger.error("Connection lost, reconnecting...")
    # Retry logic...
```
### 3. Backpressure

If Discord's audio buffer fills, slow down token sending:

```python
if voice_client.is_playing():
    await asyncio.sleep(0.1)  # Wait for buffer space
```
### 4. Flush at End

Always flush to ensure all buffered audio is synthesized and sent:

```python
# After sending all tokens
await ws.send(json.dumps({"flush": True}))

# Wait for remaining audio
try:
    while True:
        audio_bytes = await asyncio.wait_for(ws.recv(), timeout=1.0)
        # Process each remaining chunk...
except asyncio.TimeoutError:
    pass  # All audio received
```
## Troubleshooting

### No Audio Received

**Problem**: Tokens are sent but no audio comes back.

**Solutions**:
1. Check whether you're hitting sentence boundaries (`.` `!` `?`)
2. Try sending `{"flush": true}` manually
3. Verify the token format: `{"token": "text", "pitch_shift": 0}`

### Audio Choppy/Gaps

**Problem**: Audio plays but with interruptions.

**Solutions**:
1. Increase the buffer size on the Discord side
2. Send tokens in larger chunks (word-by-word, not char-by-char)
3. Check network latency: `ping localhost`

### Connection Drops

**Problem**: WebSocket disconnects unexpectedly.

**Solutions**:
1. Implement reconnection logic with exponential backoff
2. Send periodic ping/pong frames
3. Check the Docker container logs: `docker logs miku-rvc-api`
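The exponential-backoff advice above can be sketched as a delay schedule; the base, cap, and jitter values here are illustrative assumptions, not server configuration:

```python
import random

def backoff_delays(base=0.5, cap=30.0, factor=2.0):
    """Yield an unbounded sequence of capped, jittered reconnect delays."""
    delay = base
    while True:
        # Small jitter avoids synchronized reconnect storms
        yield delay + random.uniform(0, delay * 0.1)
        delay = min(delay * factor, cap)
```

A reconnect loop would `await asyncio.sleep(next(delays))` before each connection attempt and recreate the generator after a successful connect.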
### High Latency

**Problem**: Long delay between token and audio.

**Solutions**:
1. Verify the GPU is being used: check the logs for "Found GPU AMD Radeon RX 6800"
2. Reduce the sentence buffer triggers (adjust in code)
3. Use smaller chunks: `chunk_size=5` in the Soprano config

## Comparison with HTTP API

| Feature | WebSocket (`/ws/stream`) | HTTP (`/api/speak`) |
|---------|--------------------------|---------------------|
| Latency | ~200ms | ~1500ms |
| Streaming | ✓ Token-by-token | ✗ Request-response |
| Overhead | 5ms per message | 100ms per request |
| Connection | Persistent | Per-request |
| Backpressure | ✓ Bidirectional | ✗ One-way |
| Complexity | Medium | Low |
| Use Case | Real-time voice chat | Simple TTS requests |

**Recommendation**: Use the WebSocket API for the Discord bot; keep HTTP for testing and debugging.