WebSocket Streaming API Documentation

Overview

The WebSocket endpoint (/ws/stream) enables real-time, token-by-token TTS streaming, making it well suited to Discord voice chat driven by streaming LLM responses.

Endpoint

ws://localhost:8765/ws/stream

Protocol

Client → Server Messages

Send Token:

{
  "token": "Hello",
  "pitch_shift": 0
}

Flush Buffer:

{
  "flush": true
}

Server → Client Messages

Audio Data (Binary):

  • Format: PCM float32
  • Sample Rate: 48kHz
  • Channels: Mono
  • Byte order: Native (little-endian on x86)

Error (JSON):

{
  "error": "Error message"
}
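Audio and errors arrive as different frame types, so clients should dispatch on the type of each received message. With the Python websockets library, recv() returns bytes for binary frames and str for text frames; a minimal sketch (the classify_frame helper is illustrative, not part of the API):

```python
import json

def classify_frame(message):
    """Dispatch a received WebSocket frame.

    Binary frames carry raw PCM audio; text frames carry JSON,
    which this API uses for error reports.
    """
    if isinstance(message, (bytes, bytearray)):
        return ("audio", message)
    payload = json.loads(message)
    if "error" in payload:
        return ("error", payload["error"])
    return ("unknown", payload)
```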

Behavior

Automatic Synthesis Triggers

The server will automatically synthesize and send audio when:

  1. Sentence boundary: . ! ?
  2. Pause boundary: , ;
  3. Buffer limit: More than 200 characters accumulated
  4. Explicit flush: Client sends {"flush": true}
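The trigger rules above can be sketched as a small buffering function. This is a simplified illustration of the server-side logic, not the actual implementation; the boundary characters and the 200-character limit are taken from the list above:

```python
SENTENCE_BOUNDARIES = set(".!?")
PAUSE_BOUNDARIES = set(",;")
MAX_BUFFER_CHARS = 200

def feed(buffer: str, token: str):
    """Append a token to the buffer; return (new_buffer, text_to_synthesize).

    text_to_synthesize is None until one of the triggers fires.
    """
    buffer += token
    stripped = buffer.rstrip()
    last = stripped[-1] if stripped else ""
    if last in SENTENCE_BOUNDARIES or last in PAUSE_BOUNDARIES:
        return "", buffer          # boundary reached: synthesize now
    if len(buffer) > MAX_BUFFER_CHARS:
        return "", buffer          # buffer limit exceeded
    return buffer, None            # keep accumulating
```

An explicit {"flush": true} simply forces the same "synthesize whatever is buffered" path regardless of boundaries.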

Example Flow

Client: {"token": "Hello"}
Client: {"token": " "}
Client: {"token": "world"}
Client: {"token": "!"}
  ↓ Server detects '!' (sentence boundary)
Server: [binary audio data for "Hello world!"]

Client: {"token": " "}
Client: {"token": "How"}
Client: {"token": " "}
Client: {"token": "are"}
Client: {"token": " "}
Client: {"token": "you"}
Client: {"token": "?"}
  ↓ Server detects '?' (sentence boundary)
Server: [binary audio data for "How are you?"]

Integration Examples

Python with websockets

import asyncio
import json
import numpy as np
import websockets

async def speak(text_stream):
    async with websockets.connect('ws://localhost:8765/ws/stream') as ws:
        async for token in text_stream:
            # Send token
            await ws.send(json.dumps({"token": token, "pitch_shift": 0}))
            
            # Receive audio (non-blocking)
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=0.5)
                audio = np.frombuffer(audio_bytes, dtype=np.float32)
                # Play audio...
            except asyncio.TimeoutError:
                continue  # No audio yet
        
        # Flush remaining
        await ws.send(json.dumps({"flush": True}))
        while True:
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=1.0)
                # Process final chunks...
            except asyncio.TimeoutError:
                break

Discord.py Integration

import discord
from discord.ext import commands
import websockets
import json
import io
import asyncio

class MikuVoice(commands.Cog):
    def __init__(self, bot):
        self.bot = bot
        self.ws_url = 'ws://localhost:8765/ws/stream'
    
    async def stream_to_discord(self, voice_client, text_stream):
        """
        Stream TTS audio to Discord voice channel as LLM tokens arrive.
        
        Args:
            voice_client: discord.VoiceClient
            text_stream: Async generator yielding text tokens
        """
        async with websockets.connect(self.ws_url) as ws:
            # Queue for audio chunks
            audio_queue = asyncio.Queue()
            
            # Task to play audio from queue
            async def player():
                while True:
                    audio_bytes = await audio_queue.get()
                    if audio_bytes is None:  # Sentinel
                        break
                    
                    # Create Discord audio source; the input-format flags
                    # must go in before_options so FFmpeg applies them to
                    # the piped input rather than the output
                    audio_source = discord.FFmpegPCMAudio(
                        io.BytesIO(audio_bytes),
                        pipe=True,
                        before_options='-f f32le -ar 48000 -ac 1'
                    )
                    
                    # Play (wait for previous to finish)
                    while voice_client.is_playing():
                        await asyncio.sleep(0.1)
                    
                    voice_client.play(audio_source)
            
            # Start player task
            player_task = asyncio.create_task(player())
            
            # Stream tokens
            try:
                async for token in text_stream:
                    # Send token to TTS
                    await ws.send(json.dumps({
                        "token": token,
                        "pitch_shift": 0
                    }))
                    
                    # Receive audio (non-blocking)
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(),
                            timeout=0.5
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        continue
                
                # Flush remaining buffer
                await ws.send(json.dumps({"flush": True}))
                
                # Get remaining audio
                while True:
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(),
                            timeout=1.0
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        break
            
            finally:
                # Signal player to stop
                await audio_queue.put(None)
                await player_task
    
    @commands.command()
    async def speak(self, ctx, *, prompt: str):
        """Make Miku speak in voice channel with streaming TTS"""
        
        # Connect to voice if needed
        if not ctx.voice_client:
            if not ctx.author.voice:
                await ctx.send("You need to be in a voice channel!")
                return
            await ctx.author.voice.channel.connect()
        
        # Get LLM response stream (example with llamacpp)
        async def llm_stream():
            # Replace with your actual LLM streaming code
            response = await your_llm_client.stream(prompt)
            async for token in response:
                yield token
        
        # Stream to Discord
        await self.stream_to_discord(ctx.voice_client, llm_stream())
        await ctx.send("✓ Done speaking!")

async def setup(bot):
    await bot.add_cog(MikuVoice(bot))

JavaScript/Node.js

const WebSocket = require('ws');

async function streamTTS(tokens) {
    const ws = new WebSocket('ws://localhost:8765/ws/stream');
    
    ws.on('open', () => {
        // Send tokens
        for (const token of tokens) {
            ws.send(JSON.stringify({
                token: token,
                pitch_shift: 0
            }));
        }
        
        // Flush
        ws.send(JSON.stringify({ flush: true }));
    });
    
    ws.on('message', (data) => {
        // data is Buffer containing PCM float32 audio
        const samples = new Float32Array(
            data.buffer,
            data.byteOffset,
            data.length / 4
        );
        
        // Play audio...
        playAudio(samples);
    });
}

Performance Characteristics

Latency Breakdown

Token-by-token (recommended):

LLM token → Bot (5ms) → WebSocket (5ms) → Soprano (80ms) → RVC (100ms) → Discord (20ms)
Total: ~210ms from token to sound

Sentence-by-sentence:

Full sentence (1000ms) → WebSocket (5ms) → Soprano (200ms) → RVC (300ms) → Discord (20ms)
Total: ~1525ms from start to sound

Throughput

  • Audio generation: ~0.95x realtime (GPU accelerated)
  • Network overhead: <1% (binary protocol)
  • Concurrent connections: 10+ supported

Audio Format Details

Raw PCM Format

The WebSocket sends raw PCM audio data:

# Receiving and converting to numpy
audio_bytes = await websocket.recv()
audio = np.frombuffer(audio_bytes, dtype=np.float32)

# Audio properties
sample_rate = 48000  # Hz
channels = 1         # Mono
dtype = np.float32   # 32-bit float
value_range = [-1.0, 1.0]  # Normalized
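Since each mono sample is 4 bytes of float32 at 48 kHz, a chunk's playback duration follows directly from its byte count; a quick sanity-check helper:

```python
SAMPLE_RATE = 48000   # Hz
BYTES_PER_SAMPLE = 4  # float32
CHANNELS = 1          # mono

def chunk_duration_seconds(num_bytes: int) -> float:
    """Playback duration of a raw PCM chunk, in seconds."""
    samples = num_bytes // (BYTES_PER_SAMPLE * CHANNELS)
    return samples / SAMPLE_RATE

# e.g. a 192,000-byte chunk is exactly one second of audio
```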

Converting to Other Formats

To WAV:

import wave

import numpy as np

# The wave module writes integer PCM, so convert float32 to int16 first
audio_float = np.frombuffer(audio_bytes, dtype=np.float32)
audio_int16 = (np.clip(audio_float, -1.0, 1.0) * 32767).astype(np.int16)

with wave.open('output.wav', 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes = int16
    wav.setframerate(48000)
    wav.writeframes(audio_int16.tobytes())

To Discord Opus:

import discord
import io

import numpy as np

# Discord expects PCM s16le (16-bit signed integer) at 48kHz stereo
audio_float = np.frombuffer(audio_bytes, dtype=np.float32)
audio_int16 = (np.clip(audio_float, -1.0, 1.0) * 32767).astype(np.int16)
audio_stereo = np.repeat(audio_int16, 2)  # duplicate mono samples into L/R
audio_source = discord.PCMAudio(io.BytesIO(audio_stereo.tobytes()))

Best Practices

1. Token Buffering

Don't send every character individually; send word-by-word or phrase-by-phrase:

# ✗ Bad: Too granular
for char in text:
    await ws.send(json.dumps({"token": char}))

# ✓ Good: Word-by-word
for word in text.split():
    await ws.send(json.dumps({"token": " " + word}))

2. Error Handling

Always handle disconnections gracefully:

try:
    async with websockets.connect(url) as ws:
        # ... streaming code ...
except websockets.exceptions.ConnectionClosed:
    logger.error("Connection lost, reconnecting...")
    # Retry logic...

3. Backpressure

If Discord's audio buffer fills, slow down token sending:

if voice_client.is_playing():
    await asyncio.sleep(0.1)  # Wait for buffer space

4. Flush at End

Always flush to ensure all audio is sent:

# After sending all tokens
await ws.send(json.dumps({"flush": True}))

# Wait for remaining audio
try:
    while True:
        await asyncio.wait_for(ws.recv(), timeout=1.0)
except asyncio.TimeoutError:
    pass  # All audio received

Troubleshooting

No Audio Received

Problem: Sending tokens but no audio comes back

Solutions:

  1. Check if you're hitting sentence boundaries (. ! ?)
  2. Try sending {"flush": true} manually
  3. Verify token format: {"token": "text", "pitch_shift": 0}

Audio Choppy/Gaps

Problem: Audio plays but with interruptions

Solutions:

  1. Increase buffer size on Discord side
  2. Send tokens in larger chunks (word-by-word, not char-by-char)
  3. Check network latency: ping localhost

Connection Drops

Problem: WebSocket disconnects unexpectedly

Solutions:

  1. Implement reconnection logic with exponential backoff
  2. Send periodic ping/pong frames
  3. Check Docker container logs: docker logs miku-rvc-api
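A minimal sketch of reconnection with exponential backoff (function names and parameters here are illustrative; plug your own websockets.connect(...) call in as the connect coroutine):

```python
import asyncio
import random

def backoff_delays(base=0.5, cap=30.0, retries=6):
    """Yield exponentially growing delays with jitter, capped at `cap` seconds."""
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay * random.uniform(0.5, 1.0)  # jitter avoids thundering herds

async def connect_with_retry(connect, max_attempts=6, base=0.5):
    """Call `connect()` until it succeeds, sleeping between failed attempts."""
    last_exc = None
    for delay in backoff_delays(base=base, retries=max_attempts):
        try:
            return await connect()
        except OSError as exc:  # connection refused, network down, etc.
            last_exc = exc
            await asyncio.sleep(delay)
    raise last_exc
```

Usage, assuming the endpoint above: ws = await connect_with_retry(lambda: websockets.connect('ws://localhost:8765/ws/stream')).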

High Latency

Problem: Long delay between token and audio

Solutions:

  1. Verify GPU is being used: Check logs for "Found GPU AMD Radeon RX 6800"
  2. Reduce sentence buffer triggers (adjust in code)
  3. Use smaller chunks: chunk_size=5 in Soprano config

Comparison with HTTP API

Feature        WebSocket (/ws/stream)    HTTP (/api/speak)
Latency        ~200ms                    ~1500ms
Streaming      ✓ Token-by-token          ✗ Request-response
Overhead       5ms per message           100ms per request
Connection     Persistent                Per-request
Backpressure   ✓ Bidirectional           ✗ One-way
Complexity     Medium                    Low
Use Case       Real-time voice chat      Simple TTS requests

Recommendation: Use WebSocket for Discord bot, keep HTTP for testing/debugging.