# WebSocket Streaming API Documentation

## Overview

The WebSocket endpoint (`/ws/stream`) enables real-time, token-by-token TTS streaming, making it well suited for Discord voice chat driven by streaming LLM responses.

## Endpoint

```
ws://localhost:8765/ws/stream
```

## Protocol

### Client → Server Messages

**Send Token:**

```json
{
  "token": "Hello",
  "pitch_shift": 0
}
```

**Flush Buffer:**

```json
{
  "flush": true
}
```

### Server → Client Messages

**Audio Data (Binary):**

- Format: PCM float32
- Sample rate: 48 kHz
- Channels: mono
- Byte order: native (little-endian on x86)

**Error (JSON):**

```json
{
  "error": "Error message"
}
```

## Behavior

### Automatic Synthesis Triggers

The server automatically synthesizes and sends audio when:

1. **Sentence boundary**: `.` `!` `?` `。` `!` `?`
2. **Pause boundary**: `,` `;` `,` `、`
3. **Buffer limit**: more than 200 characters have accumulated
4. **Explicit flush**: the client sends `{"flush": true}`

### Example Flow

```
Client: {"token": "Hello"}
Client: {"token": " "}
Client: {"token": "world"}
Client: {"token": "!"}
        ↓ Server detects '!' (sentence boundary)
Server: [binary audio data for "Hello world!"]

Client: {"token": " "}
Client: {"token": "How"}
Client: {"token": " "}
Client: {"token": "are"}
Client: {"token": " "}
Client: {"token": "you"}
Client: {"token": "?"}
        ↓ Server detects '?' (sentence boundary)
Server: [binary audio data for "How are you?"]
```

## Integration Examples

### Python with websockets

```python
import asyncio
import json

import numpy as np
import websockets

async def speak(text_stream):
    async with websockets.connect('ws://localhost:8765/ws/stream') as ws:
        async for token in text_stream:
            # Send token
            await ws.send(json.dumps({"token": token, "pitch_shift": 0}))

            # Receive audio (non-blocking)
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=0.5)
                audio = np.frombuffer(audio_bytes, dtype=np.float32)
                # Play audio...
            except asyncio.TimeoutError:
                continue  # No audio yet

        # Flush the remaining buffer
        await ws.send(json.dumps({"flush": True}))
        while True:
            try:
                audio_bytes = await asyncio.wait_for(ws.recv(), timeout=1.0)
                # Process final chunks...
            except asyncio.TimeoutError:
                break
```

### Discord.py Integration

```python
import asyncio
import io
import json

import discord
import websockets
from discord.ext import commands

class MikuVoice(commands.Cog):
    def __init__(self, bot):
        self.bot = bot
        self.ws_url = 'ws://localhost:8765/ws/stream'

    async def stream_to_discord(self, voice_client, text_stream):
        """
        Stream TTS audio to a Discord voice channel as LLM tokens arrive.

        Args:
            voice_client: discord.VoiceClient
            text_stream: Async generator yielding text tokens
        """
        async with websockets.connect(self.ws_url) as ws:
            # Queue for audio chunks
            audio_queue = asyncio.Queue()

            # Task that plays audio from the queue
            async def player():
                while True:
                    audio_bytes = await audio_queue.get()
                    if audio_bytes is None:  # Sentinel
                        break

                    # Create a Discord audio source. The input-format flags
                    # go in before_options so FFmpeg applies them to the
                    # piped input rather than the output.
                    audio_source = discord.FFmpegPCMAudio(
                        io.BytesIO(audio_bytes),
                        pipe=True,
                        before_options='-f f32le -ar 48000 -ac 1'
                    )

                    # Play (wait for the previous chunk to finish)
                    while voice_client.is_playing():
                        await asyncio.sleep(0.1)
                    voice_client.play(audio_source)

            # Start the player task
            player_task = asyncio.create_task(player())

            # Stream tokens
            try:
                async for token in text_stream:
                    # Send the token to TTS
                    await ws.send(json.dumps({
                        "token": token,
                        "pitch_shift": 0
                    }))

                    # Receive audio (non-blocking)
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(), timeout=0.5
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        continue

                # Flush the remaining buffer
                await ws.send(json.dumps({"flush": True}))

                # Collect the remaining audio
                while True:
                    try:
                        audio_bytes = await asyncio.wait_for(
                            ws.recv(), timeout=1.0
                        )
                        await audio_queue.put(audio_bytes)
                    except asyncio.TimeoutError:
                        break
            finally:
                # Signal the player to stop
                await audio_queue.put(None)
                await player_task

    @commands.command()
    async def speak(self, ctx, *, prompt: str):
        """Make Miku speak in the voice channel with streaming TTS."""
        # Connect to voice if needed
        if not ctx.voice_client:
            if not ctx.author.voice:
                await ctx.send("You need to be in a voice channel!")
                return
            await ctx.author.voice.channel.connect()

        # Get the LLM response stream (example with llamacpp)
        async def llm_stream():
            # Replace with your actual LLM streaming code
            response = await your_llm_client.stream(prompt)
            async for token in response:
                yield token

        # Stream to Discord
        await self.stream_to_discord(ctx.voice_client, llm_stream())
        await ctx.send("✓ Done speaking!")

async def setup(bot):
    await bot.add_cog(MikuVoice(bot))
```

### JavaScript/Node.js

```javascript
const WebSocket = require('ws');

async function streamTTS(tokens) {
  const ws = new WebSocket('ws://localhost:8765/ws/stream');

  ws.on('open', () => {
    // Send tokens
    for (const token of tokens) {
      ws.send(JSON.stringify({ token: token, pitch_shift: 0 }));
    }
    // Flush
    ws.send(JSON.stringify({ flush: true }));
  });

  ws.on('message', (data) => {
    // data is a Buffer containing PCM float32 audio. byteOffset and
    // length are needed because a Node Buffer may be a view into a
    // larger shared ArrayBuffer.
    const samples = new Float32Array(
      data.buffer,
      data.byteOffset,
      data.length / 4
    );
    // Play audio...
    playAudio(samples);
  });
}
```

## Performance Characteristics

### Latency Breakdown

**Token-by-token (recommended):**

```
LLM token → Bot (5ms) → WebSocket (5ms) → Soprano (80ms) → RVC (100ms) → Discord (20ms)
Total: ~210ms from token to sound
```

**Sentence-by-sentence:**

```
Full sentence (1000ms) → WebSocket (5ms) → Soprano (200ms) → RVC (300ms) → Discord (20ms)
Total: ~1525ms from start to sound
```

### Throughput

- **Audio generation**: ~0.95x realtime (GPU accelerated)
- **Network overhead**: <1% (binary protocol)
- **Concurrent connections**: 10+ supported

## Audio Format Details

### Raw PCM Format

The WebSocket sends raw PCM audio data:

```python
# Receive and convert to numpy
audio_bytes = await websocket.recv()
audio = np.frombuffer(audio_bytes, dtype=np.float32)

# Audio properties
sample_rate = 48000        # Hz
channels = 1               # Mono
dtype = np.float32         # 32-bit float
value_range = [-1.0, 1.0]  # Normalized
```

### Converting to Other Formats

**To WAV:**

The `wave` module writes integer PCM, so convert the float32 samples to int16 first:

```python
import wave

import numpy as np

audio = np.frombuffer(audio_bytes, dtype=np.float32)
audio_int16 = (audio * 32767).astype(np.int16)

with wave.open('output.wav', 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes = int16
    wav.setframerate(48000)
    wav.writeframes(audio_int16.tobytes())
```

**To Discord Opus:**

```python
import io

import discord

# discord.PCMAudio expects 16-bit signed PCM at 48kHz, stereo
audio_float = np.frombuffer(audio_bytes, dtype=np.float32)
audio_int16 = (audio_float * 32767).astype(np.int16)
audio_stereo = np.repeat(audio_int16, 2)  # Duplicate mono into both channels
audio_source = discord.PCMAudio(io.BytesIO(audio_stereo.tobytes()))
```

## Best Practices

### 1. Token Buffering

Don't send every character individually; send word-by-word or phrase-by-phrase:

```python
# ✗ Bad: too granular
for char in text:
    await ws.send(json.dumps({"token": char}))

# ✓ Good: word-by-word
for word in text.split():
    await ws.send(json.dumps({"token": " " + word}))
```

### 2. Error Handling

Always handle disconnections gracefully:

```python
try:
    async with websockets.connect(url) as ws:
        # ... streaming code ...
except websockets.exceptions.ConnectionClosed:
    logger.error("Connection lost, reconnecting...")
    # Retry logic...
```

### 3. Backpressure

If Discord's audio buffer fills up, slow down token sending:

```python
if voice_client.is_playing():
    await asyncio.sleep(0.1)  # Wait for buffer space
```

### 4. Flush at End

Always flush to ensure all audio is sent:

```python
# After sending all tokens
await ws.send(json.dumps({"flush": True}))

# Wait for the remaining audio
try:
    while True:
        await asyncio.wait_for(ws.recv(), timeout=1.0)
except asyncio.TimeoutError:
    pass  # All audio received
```

## Troubleshooting

### No Audio Received

**Problem**: Tokens are being sent but no audio comes back.

**Solutions**:
1. Check whether you're hitting sentence boundaries (`.` `!` `?`)
2. Try sending `{"flush": true}` manually
3. Verify the token format: `{"token": "text", "pitch_shift": 0}`

### Choppy Audio / Gaps

**Problem**: Audio plays but with interruptions.

**Solutions**:
1. Increase the buffer size on the Discord side
2. Send tokens in larger chunks (word-by-word, not char-by-char)
3. Check network latency: `ping localhost`

### Connection Drops

**Problem**: The WebSocket disconnects unexpectedly.

**Solutions**:
1. Implement reconnection logic with exponential backoff
2. Send periodic ping/pong frames
3. Check the Docker container logs: `docker logs miku-rvc-api`

### High Latency

**Problem**: Long delay between a token and its audio.

**Solutions**:
1. Verify the GPU is being used: check the logs for "Found GPU AMD Radeon RX 6800"
2. Reduce the sentence buffer triggers (adjust in code)
3. Use smaller chunks: `chunk_size=5` in the Soprano config

## Comparison with HTTP API

| Feature      | WebSocket (`/ws/stream`) | HTTP (`/api/speak`)  |
|--------------|--------------------------|----------------------|
| Latency      | ~200ms                   | ~1500ms              |
| Streaming    | ✓ Token-by-token         | ✗ Request-response   |
| Overhead     | 5ms per message          | 100ms per request    |
| Connection   | Persistent               | Per-request          |
| Backpressure | ✓ Bidirectional          | ✗ One-way            |
| Complexity   | Medium                   | Low                  |
| Use Case     | Real-time voice chat     | Simple TTS requests  |

**Recommendation**: Use the WebSocket API for the Discord bot; keep the HTTP API for testing and debugging.
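## Appendix: Reconnection Sketch

The Troubleshooting section recommends reconnection with exponential backoff but does not show it. Below is a minimal sketch; `backoff_delays`, `connect_with_retry`, and their parameters are illustrative helpers (not part of this API), and the `connect` argument is any zero-argument callable returning an awaitable connection, e.g. `lambda: websockets.connect('ws://localhost:8765/ws/stream')`:

```python
import asyncio
import random

def backoff_delays(base=0.5, cap=8.0, retries=5):
    """Exponential backoff schedule in seconds: base, 2*base, ..., capped at cap."""
    delays, delay = [], base
    for _ in range(retries):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays

async def connect_with_retry(connect, retries=5, base=0.5):
    """Retry a connection factory with exponential backoff plus jitter."""
    last_exc = None
    for delay in backoff_delays(base=base, retries=retries):
        try:
            return await connect()
        except OSError as exc:  # connection refused, reset, etc.
            last_exc = exc
            # Small random jitter avoids reconnect stampedes
            await asyncio.sleep(delay + random.uniform(0, 0.1 * delay))
    raise ConnectionError(f"could not connect after {retries} attempts") from last_exc
```

The same wrapper pairs with any of the `websockets` examples above: replace the initial `websockets.connect(...)` call with `await connect_with_retry(lambda: websockets.connect(url))`.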