WebSocket Streaming API Documentation

Overview

The WebSocket endpoint (/ws/stream) enables real-time, token-by-token TTS streaming, making it well suited to Discord voice chat driven by streaming LLM responses.

Endpoint

ws://localhost:8765/ws/stream

Protocol

Client → Server Messages

Send Token:

{
  "token": "Hello",
  "pitch_shift": 0
}

Flush Buffer:

{
  "flush": true
}

Server → Client Messages

Audio Data (Binary):

  • Format: PCM float32
  • Sample Rate: 48kHz
  • Channels: Mono
  • Byte order: Native (little-endian on x86)

Error (JSON):

{
  "error": "Error message"
}
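Audio and errors arrive as different frame types, so clients should dispatch on the type of each received message. With the Python websockets library, recv() returns bytes for binary frames and str for text frames; a minimal sketch (the classify_frame helper is illustrative, not part of the API):

```python
import json

def classify_frame(message):
    """Dispatch a received WebSocket frame.

    Binary frames carry raw PCM audio; text frames carry JSON,
    which this API uses for error reports.
    """
    if isinstance(message, (bytes, bytearray)):
        return ("audio", message)
    payload = json.loads(message)
    if "error" in payload:
        return ("error", payload["error"])
    return ("unknown", payload)
```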

Behavior

Automatic Synthesis Triggers

The server will automatically synthesize and send audio when:

  1. Sentence boundary: . ! ?
  2. Pause boundary: , ;
  3. Buffer limit: More than 200 characters accumulated
  4. Explicit flush: Client sends {"flush": true}
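The trigger rules above can be sketched as a small buffering function. This is a simplified illustration of the server-side logic, not the actual implementation; the boundary characters and the 200-character limit are taken from the list above:

```python
SENTENCE_BOUNDARIES = set(".!?")
PAUSE_BOUNDARIES = set(",;")
MAX_BUFFER_CHARS = 200

def feed(buffer: str, token: str):
    """Append a token to the buffer; return (new_buffer, text_to_synthesize).

    text_to_synthesize is None until one of the triggers fires.
    """
    buffer += token
    stripped = buffer.rstrip()
    last = stripped[-1] if stripped else ""
    if last in SENTENCE_BOUNDARIES or last in PAUSE_BOUNDARIES:
        return "", buffer          # boundary reached: synthesize now
    if len(buffer) > MAX_BUFFER_CHARS:
        return "", buffer          # buffer limit exceeded
    return buffer, None            # keep accumulating
```

An explicit {"flush": true} simply forces the same "synthesize whatever is buffered" path regardless of boundaries.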

Example Flow

Client: {"token": "Hello"}
Client: {"token": " "}
Client: {"token": "world"}
Client: {"token": "!"}
  ↓ Server detects '!' (sentence boundary)
Server: [binary audio data for "Hello world!"]

Client: {"token": " "}
Client: {"token": "How"}
Client: {"token": " "}
Client: {"token": "are"}
Client: {"token": " "}
Client: {"token": "you"}
Client: {"token": "?"}
  ↓ Server detects '?' (sentence boundary)
Server: [binary audio data for "How are you?"]

Integration Examples

Python with websockets

import asyncio
import json
import numpy as np
import websockets

async def speak(text_stream):
    async with websockets.connect('ws://localhost:8765/ws/stream') as ws:
        async for token in text_stream:
            # Send token
            await ws.send(json.dumps({"token": token, "pitch_shift": 0}))
            
            # Receive audio (non-blocking)
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=0.5)
                audio = np.frombuffer(audio_bytes, dtype=np.float32)
                # Play audio...
            except asyncio.TimeoutError:
                continue  # No audio yet
        
        # Flush remaining
        await ws.send(json.dumps({"flush": True}))
        while True:
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=1.0)
                # Process final chunks...
            except asyncio.TimeoutError:
                break

Discord.py Integration

import discord
from discord.ext import commands
import websockets
import json
import io
import asyncio

class MikuVoice(commands.Cog):
    def __init__(self, bot):
        self.bot = bot
        self.ws_url = 'ws://localhost:8765/ws/stream'
    
    async def stream_to_discord(self, voice_client, text_stream):
        """
        Stream TTS audio to Discord voice channel as LLM tokens arrive.
        
        Args:
            voice_client: discord.VoiceClient
            text_stream: Async generator yielding text tokens
        """
        async with websockets.connect(self.ws_url) as ws:
            # Queue for audio chunks
            audio_queue = asyncio.Queue()
            
            # Task to play audio from queue
            async def player():
                while True:
                    audio_bytes = await audio_queue.get()
                    if audio_bytes is None:  # Sentinel
                        break
                    
                    # Create Discord audio source; the input-format flags
                    # must go in before_options so FFmpeg applies them to
                    # the piped input rather than the output
                    audio_source = discord.FFmpegPCMAudio(
                        io.BytesIO(audio_bytes),
                        pipe=True,
                        before_options='-f f32le -ar 48000 -ac 1'
                    )
                    
                    # Play (wait for previous to finish)
                    while voice_client.is_playing():
                        await asyncio.sleep(0.1)
                    
                    voice_client.play(audio_source)
            
            # Start player task
            player_task = asyncio.create_task(player())
            
            # Stream tokens
            try:
                async for token in text_stream:
                    # Send token to TTS
                    await ws.send(json.dumps({
                        "token": token,
                        "pitch_shift": 0
                    }))
                    
                    # Receive audio (non-blocking)
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(),
                            timeout=0.5
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        continue
                
                # Flush remaining buffer
                await ws.send(json.dumps({"flush": True}))
                
                # Get remaining audio
                while True:
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(),
                            timeout=1.0
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        break
            
            finally:
                # Signal player to stop
                await audio_queue.put(None)
                await player_task
    
    @commands.command()
    async def speak(self, ctx, *, prompt: str):
        """Make Miku speak in voice channel with streaming TTS"""
        
        # Connect to voice if needed
        if not ctx.voice_client:
            if not ctx.author.voice:
                await ctx.send("You need to be in a voice channel!")
                return
            await ctx.author.voice.channel.connect()
        
        # Get LLM response stream (example with llamacpp)
        async def llm_stream():
            # Replace with your actual LLM streaming code
            response = await your_llm_client.stream(prompt)
            async for token in response:
                yield token
        
        # Stream to Discord
        await self.stream_to_discord(ctx.voice_client, llm_stream())
        await ctx.send("✓ Done speaking!")

async def setup(bot):
    await bot.add_cog(MikuVoice(bot))

JavaScript/Node.js

const WebSocket = require('ws');

async function streamTTS(tokens) {
    const ws = new WebSocket('ws://localhost:8765/ws/stream');
    
    ws.on('open', () => {
        // Send tokens
        for (const token of tokens) {
            ws.send(JSON.stringify({
                token: token,
                pitch_shift: 0
            }));
        }
        
        // Flush
        ws.send(JSON.stringify({ flush: true }));
    });
    
    ws.on('message', (data) => {
        // data is Buffer containing PCM float32 audio
        const samples = new Float32Array(
            data.buffer,
            data.byteOffset,
            data.length / 4
        );
        
        // Play audio...
        playAudio(samples);
    });
}

Performance Characteristics

Latency Breakdown

Token-by-token (recommended):

LLM token → Bot (5ms) → WebSocket (5ms) → Soprano (80ms) → RVC (100ms) → Discord (20ms)
Total: ~210ms from token to sound

Sentence-by-sentence:

Full sentence (1000ms) → WebSocket (5ms) → Soprano (200ms) → RVC (300ms) → Discord (20ms)
Total: ~1525ms from start to sound

Throughput

  • Audio generation: ~0.95x realtime (GPU accelerated)
  • Network overhead: <1% (binary protocol)
  • Concurrent connections: 10+ supported

Audio Format Details

Raw PCM Format

The WebSocket sends raw PCM audio data:

# Receiving and converting to numpy
audio_bytes = await websocket.recv()
audio = np.frombuffer(audio_bytes, dtype=np.float32)

# Audio properties
sample_rate = 48000  # Hz
channels = 1         # Mono
dtype = np.float32   # 32-bit float
value_range = [-1.0, 1.0]  # Normalized
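Since each mono sample is 4 bytes of float32 at 48 kHz, a chunk's playback duration follows directly from its byte count; a quick sanity-check helper:

```python
SAMPLE_RATE = 48000   # Hz
BYTES_PER_SAMPLE = 4  # float32
CHANNELS = 1          # mono

def chunk_duration_seconds(num_bytes: int) -> float:
    """Playback duration of a raw PCM chunk, in seconds."""
    samples = num_bytes // (BYTES_PER_SAMPLE * CHANNELS)
    return samples / SAMPLE_RATE

# e.g. a 192,000-byte chunk is exactly one second of audio
```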

Converting to Other Formats

To WAV:

import wave

import numpy as np

# The wave module writes integer PCM, so convert float32 to int16 first
audio_float = np.frombuffer(audio_bytes, dtype=np.float32)
audio_int16 = (np.clip(audio_float, -1.0, 1.0) * 32767).astype(np.int16)

with wave.open('output.wav', 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes = int16
    wav.setframerate(48000)
    wav.writeframes(audio_int16.tobytes())

To Discord Opus:

import discord
import io

import numpy as np

# Discord expects PCM s16le (16-bit signed integer) at 48kHz stereo
audio_float = np.frombuffer(audio_bytes, dtype=np.float32)
audio_int16 = (np.clip(audio_float, -1.0, 1.0) * 32767).astype(np.int16)
audio_stereo = np.repeat(audio_int16, 2)  # duplicate mono samples into L/R
audio_source = discord.PCMAudio(io.BytesIO(audio_stereo.tobytes()))

Best Practices

1. Token Buffering

Don't send every character individually; send word-by-word or phrase-by-phrase:

# ✗ Bad: Too granular
for char in text:
    await ws.send(json.dumps({"token": char}))

# ✓ Good: Word-by-word
for word in text.split():
    await ws.send(json.dumps({"token": " " + word}))

2. Error Handling

Always handle disconnections gracefully:

try:
    async with websockets.connect(url) as ws:
        # ... streaming code ...
except websockets.exceptions.ConnectionClosed:
    logger.error("Connection lost, reconnecting...")
    # Retry logic...

3. Backpressure

If Discord's audio buffer fills, slow down token sending:

if voice_client.is_playing():
    await asyncio.sleep(0.1)  # Wait for buffer space

4. Flush at End

Always flush to ensure all audio is sent:

# After sending all tokens
await ws.send(json.dumps({"flush": True}))

# Wait for remaining audio
try:
    while True:
        await asyncio.wait_for(ws.recv(), timeout=1.0)
except asyncio.TimeoutError:
    pass  # All audio received

Troubleshooting

No Audio Received

Problem: Sending tokens but no audio comes back

Solutions:

  1. Check if you're hitting sentence boundaries (. ! ?)
  2. Try sending {"flush": true} manually
  3. Verify token format: {"token": "text", "pitch_shift": 0}

Audio Choppy/Gaps

Problem: Audio plays but with interruptions

Solutions:

  1. Increase buffer size on Discord side
  2. Send tokens in larger chunks (word-by-word, not char-by-char)
  3. Check network latency: ping localhost

Connection Drops

Problem: WebSocket disconnects unexpectedly

Solutions:

  1. Implement reconnection logic with exponential backoff
  2. Send periodic ping/pong frames
  3. Check Docker container logs: docker logs miku-rvc-api
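A minimal sketch of reconnection with exponential backoff (function names and parameters here are illustrative; plug your own websockets.connect(...) call in as the connect coroutine):

```python
import asyncio
import random

def backoff_delays(base=0.5, cap=30.0, retries=6):
    """Yield exponentially growing delays with jitter, capped at `cap` seconds."""
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay * random.uniform(0.5, 1.0)  # jitter avoids thundering herds

async def connect_with_retry(connect, max_attempts=6, base=0.5):
    """Call `connect()` until it succeeds, sleeping between failed attempts."""
    last_exc = None
    for delay in backoff_delays(base=base, retries=max_attempts):
        try:
            return await connect()
        except OSError as exc:  # connection refused, network down, etc.
            last_exc = exc
            await asyncio.sleep(delay)
    raise last_exc
```

Usage, assuming the endpoint above: ws = await connect_with_retry(lambda: websockets.connect('ws://localhost:8765/ws/stream')).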

High Latency

Problem: Long delay between token and audio

Solutions:

  1. Verify GPU is being used: Check logs for "Found GPU AMD Radeon RX 6800"
  2. Reduce sentence buffer triggers (adjust in code)
  3. Use smaller chunks: chunk_size=5 in Soprano config

Comparison with HTTP API

Feature        WebSocket (/ws/stream)    HTTP (/api/speak)
Latency        ~200ms                    ~1500ms
Streaming      ✓ Token-by-token          ✗ Request-response
Overhead       5ms per message           100ms per request
Connection     Persistent                Per-request
Backpressure   ✓ Bidirectional           ✗ One-way
Complexity     Medium                    Low
Use Case       Real-time voice chat      Simple TTS requests

Recommendation: Use WebSocket for Discord bot, keep HTTP for testing/debugging.