WebSocket Streaming API Documentation
Overview
The WebSocket endpoint (/ws/stream) enables real-time, token-by-token TTS streaming, well suited to Discord voice chat driven by streaming LLM responses.
Endpoint
ws://localhost:8765/ws/stream
Protocol
Client → Server Messages
Send Token:
{
  "token": "Hello",
  "pitch_shift": 0
}
Flush Buffer:
{
  "flush": true
}
Server → Client Messages
Audio Data (Binary):
- Format: PCM float32
- Sample Rate: 48kHz
- Channels: Mono
- Byte order: Native (little-endian on x86)
Error (JSON):
{
  "error": "Error message"
}
Behavior
Automatic Synthesis Triggers
The server will automatically synthesize and send audio when:
- Sentence boundary: `.!?。!?`
- Pause boundary: `,;,、`
- Buffer limit: more than 200 characters accumulated
- Explicit flush: client sends `{"flush": true}`
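The trigger rules above can be sketched as a small predicate over the accumulated buffer. This is a client-side mirror of the documented behavior; the server's actual implementation may differ in detail:

```python
SENTENCE_BOUNDARIES = set(".!?。!?")
PAUSE_BOUNDARIES = set(",;,、")
BUFFER_LIMIT = 200  # characters

def should_synthesize(buffer: str) -> bool:
    """True when the server would synthesize: the buffer ends on a
    sentence or pause boundary, or has grown past the 200-char limit."""
    if not buffer:
        return False
    last = buffer[-1]
    if last in SENTENCE_BOUNDARIES or last in PAUSE_BOUNDARIES:
        return True
    return len(buffer) > BUFFER_LIMIT
```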
Example Flow
Client: {"token": "Hello"}
Client: {"token": " "}
Client: {"token": "world"}
Client: {"token": "!"}
↓ Server detects '!' (sentence boundary)
Server: [binary audio data for "Hello world!"]
Client: {"token": " "}
Client: {"token": "How"}
Client: {"token": " "}
Client: {"token": "are"}
Client: {"token": " "}
Client: {"token": "you"}
Client: {"token": "?"}
↓ Server detects '?' (sentence boundary)
Server: [binary audio data for "How are you?"]
Integration Examples
Python with websockets
import asyncio
import json

import numpy as np
import websockets

async def speak(text_stream):
    async with websockets.connect('ws://localhost:8765/ws/stream') as ws:
        async for token in text_stream:
            # Send token
            await ws.send(json.dumps({"token": token, "pitch_shift": 0}))
            # Receive audio if a chunk is ready (short timeout, non-blocking)
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=0.5)
                audio = np.frombuffer(audio_bytes, dtype=np.float32)
                # Play audio...
            except asyncio.TimeoutError:
                continue  # No audio yet

        # Flush remaining
        await ws.send(json.dumps({"flush": True}))
        while True:
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=1.0)
                # Process final chunks...
            except asyncio.TimeoutError:
                break
Discord.py Integration
import asyncio
import io
import json

import discord
import websockets
from discord.ext import commands

class MikuVoice(commands.Cog):
    def __init__(self, bot):
        self.bot = bot
        self.ws_url = 'ws://localhost:8765/ws/stream'

    async def stream_to_discord(self, voice_client, text_stream):
        """
        Stream TTS audio to Discord voice channel as LLM tokens arrive.

        Args:
            voice_client: discord.VoiceClient
            text_stream: Async generator yielding text tokens
        """
        async with websockets.connect(self.ws_url) as ws:
            # Queue for audio chunks
            audio_queue = asyncio.Queue()

            # Task to play audio from queue
            async def player():
                while True:
                    audio_bytes = await audio_queue.get()
                    if audio_bytes is None:  # Sentinel
                        break
                    # Create Discord audio source; input-format flags
                    # must precede ffmpeg's -i, hence before_options
                    audio_source = discord.FFmpegPCMAudio(
                        io.BytesIO(audio_bytes),
                        pipe=True,
                        before_options='-f f32le -ar 48000 -ac 1'
                    )
                    # Play (wait for previous to finish)
                    while voice_client.is_playing():
                        await asyncio.sleep(0.1)
                    voice_client.play(audio_source)

            # Start player task
            player_task = asyncio.create_task(player())

            # Stream tokens
            try:
                async for token in text_stream:
                    # Send token to TTS
                    await ws.send(json.dumps({
                        "token": token,
                        "pitch_shift": 0
                    }))
                    # Receive audio (non-blocking)
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(),
                            timeout=0.5
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        continue

                # Flush remaining buffer
                await ws.send(json.dumps({"flush": True}))

                # Get remaining audio
                while True:
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(),
                            timeout=1.0
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        break
            finally:
                # Signal player to stop
                await audio_queue.put(None)
                await player_task

    @commands.command()
    async def speak(self, ctx, *, prompt: str):
        """Make Miku speak in voice channel with streaming TTS"""
        # Connect to voice if needed
        if not ctx.voice_client:
            if not ctx.author.voice:
                await ctx.send("You need to be in a voice channel!")
                return
            await ctx.author.voice.channel.connect()

        # Get LLM response stream (example with llamacpp)
        async def llm_stream():
            # Replace with your actual LLM streaming code
            response = await your_llm_client.stream(prompt)
            async for token in response:
                yield token

        # Stream to Discord
        await self.stream_to_discord(ctx.voice_client, llm_stream())
        await ctx.send("✓ Done speaking!")

async def setup(bot):
    await bot.add_cog(MikuVoice(bot))
JavaScript/Node.js
const WebSocket = require('ws');

async function streamTTS(tokens) {
  const ws = new WebSocket('ws://localhost:8765/ws/stream');

  ws.on('open', () => {
    // Send tokens
    for (const token of tokens) {
      ws.send(JSON.stringify({
        token: token,
        pitch_shift: 0
      }));
    }
    // Flush
    ws.send(JSON.stringify({ flush: true }));
  });

  ws.on('message', (data) => {
    // data is a Buffer containing PCM float32 audio
    const samples = new Float32Array(
      data.buffer,
      data.byteOffset,
      data.length / 4
    );
    // Play audio...
    playAudio(samples);
  });
}
Performance Characteristics
Latency Breakdown
Token-by-token (recommended):
LLM token → Bot (5ms) → WebSocket (5ms) → Soprano (80ms) → RVC (100ms) → Discord (20ms)
Total: ~210ms from token to sound
Sentence-by-sentence:
Full sentence (1000ms) → WebSocket (5ms) → Soprano (200ms) → RVC (300ms) → Discord (20ms)
Total: ~1525ms from start to sound
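The totals are simply the sum of the per-stage figures:

```python
# Per-stage latencies (ms) from the breakdowns above
token_path = {"bot": 5, "websocket": 5, "soprano": 80, "rvc": 100, "discord": 20}
sentence_path = {"sentence": 1000, "websocket": 5, "soprano": 200, "rvc": 300, "discord": 20}

print(sum(token_path.values()))     # 210 ms
print(sum(sentence_path.values()))  # 1525 ms
```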
Throughput
- Audio generation: ~0.95x realtime (GPU accelerated)
- Network overhead: <1% (binary protocol)
- Concurrent connections: 10+ supported
Audio Format Details
Raw PCM Format
The WebSocket sends raw PCM audio data:
# Receiving and converting to numpy
audio_bytes = await websocket.recv()
audio = np.frombuffer(audio_bytes, dtype=np.float32)
# Audio properties
sample_rate = 48000 # Hz
channels = 1 # Mono
dtype = np.float32 # 32-bit float
value_range = [-1.0, 1.0] # Normalized
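With mono float32 at 48 kHz, each sample occupies 4 bytes, so a chunk's playback duration follows directly from its byte length:

```python
SAMPLE_RATE = 48000    # Hz
BYTES_PER_SAMPLE = 4   # float32, mono

def chunk_duration(audio_bytes: bytes) -> float:
    """Playback duration in seconds of one received PCM chunk."""
    return (len(audio_bytes) // BYTES_PER_SAMPLE) / SAMPLE_RATE

# One second of audio is 48000 * 4 = 192000 bytes.
```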
Converting to Other Formats
To WAV:
import wave

import numpy as np

# The stdlib wave module writes integer PCM, so convert float32 to int16 first
audio = np.frombuffer(audio_bytes, dtype=np.float32)
audio_int16 = (audio * 32767).astype(np.int16)

with wave.open('output.wav', 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes = int16
    wav.setframerate(48000)
    wav.writeframes(audio_int16.tobytes())
To Discord Opus:
import io

import discord
import numpy as np

# discord.PCMAudio expects 48kHz s16le stereo: convert to 16-bit signed
# integers and duplicate the mono channel
audio_float = np.frombuffer(audio_bytes, dtype=np.float32)
audio_int16 = (audio_float * 32767).astype(np.int16)
audio_stereo = np.repeat(audio_int16, 2)  # interleave mono -> stereo
audio_source = discord.PCMAudio(io.BytesIO(audio_stereo.tobytes()))
Best Practices
1. Token Buffering
Don't send every character individually; send word-by-word or phrase-by-phrase:
# ✗ Bad: too granular
for char in text:
    await ws.send(json.dumps({"token": char}))

# ✓ Good: word-by-word
for word in text.split():
    await ws.send(json.dumps({"token": " " + word}))
2. Error Handling
Always handle disconnections gracefully:
try:
    async with websockets.connect(url) as ws:
        # ... streaming code ...
except websockets.exceptions.ConnectionClosed:
    logger.error("Connection lost, reconnecting...")
    # Retry logic...
3. Backpressure
If Discord's audio buffer fills, slow down token sending:
if voice_client.is_playing():
    await asyncio.sleep(0.1)  # Wait for buffer space
4. Flush at End
Always flush to ensure all audio is sent:
# After sending all tokens
await ws.send(json.dumps({"flush": True}))

# Wait for remaining audio
try:
    while True:
        await asyncio.wait_for(ws.recv(), timeout=1.0)
except asyncio.TimeoutError:
    pass  # All audio received
Troubleshooting
No Audio Received
Problem: Sending tokens but no audio comes back
Solutions:
- Check whether you're hitting sentence boundaries (`.!?`)
- Try sending `{"flush": true}` manually
- Verify the token format: `{"token": "text", "pitch_shift": 0}`
Audio Choppy/Gaps
Problem: Audio plays but with interruptions
Solutions:
- Increase buffer size on Discord side
- Send tokens in larger chunks (word-by-word, not char-by-char)
- Check network latency: `ping localhost`
Connection Drops
Problem: WebSocket disconnects unexpectedly
Solutions:
- Implement reconnection logic with exponential backoff
- Send periodic ping/pong frames
- Check Docker container logs: `docker logs miku-rvc-api`
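A reconnection sketch combining the first two suggestions: the `websockets` library's built-in `ping_interval`/`ping_timeout` keepalive plus an exponential-backoff schedule (the helper names here are illustrative, not part of this project):

```python
import asyncio

def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=6):
    """Exponential backoff schedule: 0.5s, 1s, 2s, ... capped at `cap`."""
    delay, schedule = base, []
    for _ in range(attempts):
        schedule.append(delay)
        delay = min(delay * factor, cap)
    return schedule

async def connect_with_retry(url):
    """Retry the WebSocket connection with backoff; keepalive pings
    are handled by the websockets library itself."""
    import websockets  # third-party: pip install websockets
    for delay in backoff_delays():
        try:
            return await websockets.connect(url, ping_interval=20, ping_timeout=10)
        except OSError:
            await asyncio.sleep(delay)
    raise ConnectionError(f"could not reach {url}")
```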
High Latency
Problem: Long delay between token and audio
Solutions:
- Verify GPU is being used: Check logs for "Found GPU AMD Radeon RX 6800"
- Reduce sentence buffer triggers (adjust in code)
- Use smaller chunks: `chunk_size=5` in the Soprano config
Comparison with HTTP API
| Feature | WebSocket (/ws/stream) | HTTP (/api/speak) |
|---|---|---|
| Latency | ~200ms | ~1500ms |
| Streaming | ✓ Token-by-token | ✗ Request-response |
| Overhead | 5ms per message | 100ms per request |
| Connection | Persistent | Per-request |
| Backpressure | ✓ Bidirectional | ✗ One-way |
| Complexity | Medium | Low |
| Use Case | Real-time voice chat | Simple TTS requests |
Recommendation: Use WebSocket for Discord bot, keep HTTP for testing/debugging.