# Server & Client Usage Guide
## ✅ Server is Working!
The WebSocket server is running on port **8766** with GPU acceleration.
## Quick Start
### 1. Start the Server
```bash
./run.sh server/ws_server.py
```
The server listens on `ws://localhost:8766`
### 2. Test with Simple Client
```bash
./run.sh test_client.py test.wav
```
### 3. Use Microphone Client
```bash
# List audio devices first
./run.sh client/mic_stream.py --list-devices
# Start streaming from microphone
./run.sh client/mic_stream.py
# Or specify device
./run.sh client/mic_stream.py --device 0
```
## Available Clients
### 1. **test_client.py** - Simple File Testing
```bash
./run.sh test_client.py your_audio.wav
```
- Sends audio file to server
- Shows real-time transcription
- Good for testing
### 2. **client/mic_stream.py** - Live Microphone
```bash
./run.sh client/mic_stream.py
```
- Captures from microphone
- Streams to server
- Real-time transcription display
### 3. **Custom Client** - Your Own Script
```python
import asyncio
import json

import websockets

async def connect():
    async with websockets.connect("ws://localhost:8766") as ws:
        # Send audio as int16 PCM bytes
        # (your_audio_data: a NumPy array already scaled to the int16 range,
        #  see "Audio Format Requirements" below)
        audio_bytes = your_audio_data.astype('int16').tobytes()
        await ws.send(audio_bytes)

        # Receive transcription
        response = await ws.recv()
        result = json.loads(response)
        print(result['text'])

asyncio.run(connect())
```
## Server Options
```bash
# Custom host/port
./run.sh server/ws_server.py --host 0.0.0.0 --port 9000
# Enable VAD (for long audio)
./run.sh server/ws_server.py --use-vad
# Different model
./run.sh server/ws_server.py --model nemo-parakeet-tdt-0.6b-v3
# Change sample rate
./run.sh server/ws_server.py --sample-rate 16000
```
## Client Options
### Microphone Client
```bash
# List devices
./run.sh client/mic_stream.py --list-devices
# Use specific device
./run.sh client/mic_stream.py --device 2
# Custom server URL
./run.sh client/mic_stream.py --url ws://192.168.1.100:8766
# Adjust chunk duration (lower = lower latency)
./run.sh client/mic_stream.py --chunk-duration 0.05
```
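A quick way to see what a given `--chunk-duration` means in raw samples and bytes (simple arithmetic from the fixed 16 kHz mono int16 format):
```python
# How big is each audio chunk? 16 kHz mono int16 = 2 bytes per sample.
sample_rate = 16000
for chunk_duration in (0.05, 0.1):
    samples = int(sample_rate * chunk_duration)
    print(f"{chunk_duration}s: {samples} samples, {samples * 2} bytes")
# 0.05s: 800 samples, 1600 bytes
# 0.1s: 1600 samples, 3200 bytes
```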
## Protocol
The server uses a simple JSON-based protocol:
### Server → Client Messages
```json
{
  "type": "info",
  "message": "Connected to ASR server",
  "sample_rate": 16000
}
```
```json
{
  "type": "transcript",
  "text": "transcribed text here",
  "is_final": false
}
```
```json
{
  "type": "error",
  "message": "error description"
}
```
### Client → Server Messages
**Send audio:**
- Binary data (int16 PCM, little-endian)
- Sample rate: 16000 Hz
- Mono channel
**Send commands:**
```json
{"type": "final"} // Process remaining buffer
{"type": "reset"} // Reset audio buffer
```
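Putting the protocol together: the sketch below sends one binary audio frame, flushes with a `final` command, and waits for the result. It assumes the server answers the `final` command with a `transcript` message, per the message types above.
```python
import asyncio
import json

import numpy as np
import websockets

async def transcribe(audio_int16: np.ndarray) -> str:
    async with websockets.connect("ws://localhost:8766") as ws:
        await ws.send(audio_int16.tobytes())          # binary audio frame
        await ws.send(json.dumps({"type": "final"}))  # flush remaining buffer
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "transcript":
                return msg["text"]
            if msg["type"] == "error":
                raise RuntimeError(msg["message"])
            # anything else (e.g. "info") is skipped

# One second of silence, purely to exercise the round trip
print(asyncio.run(transcribe(np.zeros(16000, dtype=np.int16))))
```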
## Audio Format Requirements
- **Format**: int16 PCM (bytes)
- **Sample Rate**: 16000 Hz
- **Channels**: Mono (1)
- **Byte Order**: Little-endian
### Convert Audio in Python
```python
import numpy as np
import soundfile as sf

# Load audio
audio, sr = sf.read("file.wav", dtype='float32')

# Convert to mono
if audio.ndim > 1:
    audio = audio[:, 0]

# Resample if needed (install resampy)
if sr != 16000:
    import resampy
    audio = resampy.resample(audio, sr, 16000)

# Convert to int16 for sending
audio_int16 = (audio * 32767).astype(np.int16)
audio_bytes = audio_int16.tobytes()
```
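To stream a converted file rather than send it in one shot, slice `audio_bytes` into fixed-size chunks. A minimal sketch (3200 bytes is 0.1 s of audio in this format; the sleep pacing is optional):
```python
import asyncio
import websockets

CHUNK_BYTES = 3200  # 0.1 s of 16 kHz mono int16 audio

async def send_file(audio_bytes: bytes):
    async with websockets.connect("ws://localhost:8766") as ws:
        for i in range(0, len(audio_bytes), CHUNK_BYTES):
            await ws.send(audio_bytes[i:i + CHUNK_BYTES])
            await asyncio.sleep(0.1)  # pace roughly at realtime

# asyncio.run(send_file(audio_bytes))  # audio_bytes from the snippet above
```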
## Examples
### Browser Client (JavaScript)
```javascript
const ws = new WebSocket('ws://localhost:8766');

ws.onopen = () => {
  console.log('Connected!');

  // Capture from microphone
  navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
      // Note: some browsers treat sampleRate as a hint and may ignore it
      const audioContext = new AudioContext({ sampleRate: 16000 });
      const source = audioContext.createMediaStreamSource(stream);
      // ScriptProcessorNode is deprecated but still widely supported
      const processor = audioContext.createScriptProcessor(4096, 1, 1);

      processor.onaudioprocess = (e) => {
        const audioData = e.inputBuffer.getChannelData(0);

        // Convert float32 to int16
        const int16Data = new Int16Array(audioData.length);
        for (let i = 0; i < audioData.length; i++) {
          int16Data[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
        }
        ws.send(int16Data.buffer);
      };

      source.connect(processor);
      processor.connect(audioContext.destination);
    });
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'transcript') {
    console.log('Transcription:', data.text);
  }
};
```
### Python Script Client
```python
#!/usr/bin/env python3
import asyncio
import json

import numpy as np
import sounddevice as sd
import websockets

async def stream_microphone():
    uri = "ws://localhost:8766"
    loop = asyncio.get_running_loop()
    audio_queue = asyncio.Queue()

    def audio_callback(indata, frames, time, status):
        # Runs on PortAudio's thread, so hand the bytes to the event
        # loop thread-safely instead of calling create_task here
        audio = (indata[:, 0] * 32767).astype(np.int16)
        loop.call_soon_threadsafe(audio_queue.put_nowait, audio.tobytes())

    async with websockets.connect(uri) as ws:
        print("Connected!")

        async def send_audio():
            while True:
                await ws.send(await audio_queue.get())

        sender = asyncio.create_task(send_audio())

        # Start recording in 0.1-second chunks (1600 samples at 16 kHz)
        with sd.InputStream(callback=audio_callback,
                            channels=1,
                            samplerate=16000,
                            blocksize=1600):
            try:
                while True:
                    data = json.loads(await ws.recv())
                    if data.get('type') == 'transcript':
                        print(f"→ {data['text']}")
            finally:
                sender.cancel()

asyncio.run(stream_microphone())
```
## Performance
With GPU (GTX 1660):
- **Latency**: <100ms per chunk
- **Throughput**: ~50-100x realtime
- **GPU Memory**: ~1.3GB
- **Languages**: 25+ (auto-detected)
## Troubleshooting
### Server won't start
```bash
# Check if port is in use
lsof -i:8766
# Kill existing server
pkill -f ws_server.py
# Restart
./run.sh server/ws_server.py
```
### Client can't connect
```bash
# Check server is running
ps aux | grep ws_server
# Check firewall
sudo ufw allow 8766
```
### No transcription output
- Check audio format (must be int16 PCM, 16 kHz, mono; a quick check is sketched below)
- Check chunk size (not too small)
- Check server logs for errors
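A quick way to verify a file before sending it (a sketch using `soundfile`, which the conversion example above already uses; the filename is a placeholder):
```python
import soundfile as sf

info = sf.info("your_audio.wav")
print(info.samplerate, info.channels, info.subtype)
# Want: 16000 1 PCM_16; anything else needs the conversion shown above
```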
### GPU not working
- Server will fall back to CPU automatically
- Check `nvidia-smi` for GPU status
- Verify CUDA libraries are loaded (should be automatic with `./run.sh`)
## Next Steps
1. **Test the server**: `./run.sh test_client.py test.wav`
2. **Try microphone**: `./run.sh client/mic_stream.py`
3. **Build your own client** using the examples above
Happy transcribing! 🎤