# Server & Client Usage Guide
## ✅ Server is Working!
The WebSocket server is running on port **8766** with GPU acceleration.
## Quick Start
### 1. Start the Server
```bash
./run.sh server/ws_server.py
```
The server listens on `ws://localhost:8766`
### 2. Test with Simple Client
```bash
./run.sh test_client.py test.wav
```
### 3. Use Microphone Client
```bash
# List audio devices first
./run.sh client/mic_stream.py --list-devices
# Start streaming from microphone
./run.sh client/mic_stream.py
# Or specify device
./run.sh client/mic_stream.py --device 0
```
## Available Clients
### 1. **test_client.py** - Simple File Testing
```bash
./run.sh test_client.py your_audio.wav
```
- Sends audio file to server
- Shows real-time transcription
- Good for testing
### 2. **client/mic_stream.py** - Live Microphone
```bash
./run.sh client/mic_stream.py
```
- Captures from microphone
- Streams to server
- Real-time transcription display
### 3. **Custom Client** - Your Own Script
```python
import asyncio
import json

import websockets

async def connect():
    async with websockets.connect("ws://localhost:8766") as ws:
        # Send audio as int16 PCM bytes
        # (your_audio_data: a NumPy array already scaled to the int16 range,
        #  see "Audio Format Requirements" below)
        audio_bytes = your_audio_data.astype('int16').tobytes()
        await ws.send(audio_bytes)

        # Receive transcription
        response = await ws.recv()
        result = json.loads(response)
        print(result['text'])

asyncio.run(connect())
```
## Server Options
```bash
# Custom host/port
./run.sh server/ws_server.py --host 0.0.0.0 --port 9000
# Enable VAD (for long audio)
./run.sh server/ws_server.py --use-vad
# Different model
./run.sh server/ws_server.py --model nemo-parakeet-tdt-0.6b-v3
# Change sample rate
./run.sh server/ws_server.py --sample-rate 16000
```
## Client Options
### Microphone Client
```bash
# List devices
./run.sh client/mic_stream.py --list-devices
# Use specific device
./run.sh client/mic_stream.py --device 2
# Custom server URL
./run.sh client/mic_stream.py --url ws://192.168.1.100:8766
# Adjust chunk duration (lower = lower latency)
./run.sh client/mic_stream.py --chunk-duration 0.05
```
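A quick way to see what a given `--chunk-duration` means in raw samples and bytes (simple arithmetic from the fixed 16 kHz mono int16 format):
```python
# How big is each audio chunk? 16 kHz mono int16 = 2 bytes per sample.
sample_rate = 16000
for chunk_duration in (0.05, 0.1):
    samples = int(sample_rate * chunk_duration)
    print(f"{chunk_duration}s: {samples} samples, {samples * 2} bytes")
# 0.05s: 800 samples, 1600 bytes
# 0.1s: 1600 samples, 3200 bytes
```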
## Protocol
The server uses a simple JSON-based protocol:
### Server → Client Messages
```json
{
  "type": "info",
  "message": "Connected to ASR server",
  "sample_rate": 16000
}
```
```json
{
  "type": "transcript",
  "text": "transcribed text here",
  "is_final": false
}
```
```json
{
  "type": "error",
  "message": "error description"
}
```
### Client → Server Messages
**Send audio:**
- Binary data (int16 PCM, little-endian)
- Sample rate: 16000 Hz
- Mono channel
**Send commands:**
```json
{"type": "final"} // Process remaining buffer
{"type": "reset"} // Reset audio buffer
```
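Putting the protocol together: the sketch below sends one binary audio frame, flushes with a `final` command, and waits for the result. It assumes the server answers the `final` command with a `transcript` message, per the message types above.
```python
import asyncio
import json

import numpy as np
import websockets

async def transcribe(audio_int16: np.ndarray) -> str:
    async with websockets.connect("ws://localhost:8766") as ws:
        await ws.send(audio_int16.tobytes())          # binary audio frame
        await ws.send(json.dumps({"type": "final"}))  # flush remaining buffer
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "transcript":
                return msg["text"]
            if msg["type"] == "error":
                raise RuntimeError(msg["message"])
            # anything else (e.g. "info") is skipped

# One second of silence, purely to exercise the round trip
print(asyncio.run(transcribe(np.zeros(16000, dtype=np.int16))))
```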
## Audio Format Requirements
- **Format**: int16 PCM (bytes)
- **Sample Rate**: 16000 Hz
- **Channels**: Mono (1)
- **Byte Order**: Little-endian
### Convert Audio in Python
```python
import numpy as np
import soundfile as sf

# Load audio
audio, sr = sf.read("file.wav", dtype='float32')

# Convert to mono
if audio.ndim > 1:
    audio = audio[:, 0]

# Resample if needed (install resampy)
if sr != 16000:
    import resampy
    audio = resampy.resample(audio, sr, 16000)

# Convert to int16 for sending
audio_int16 = (audio * 32767).astype(np.int16)
audio_bytes = audio_int16.tobytes()
```
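To stream a converted file rather than send it in one shot, slice `audio_bytes` into fixed-size chunks. A minimal sketch (3200 bytes is 0.1 s of audio in this format; the sleep pacing is optional):
```python
import asyncio
import websockets

CHUNK_BYTES = 3200  # 0.1 s of 16 kHz mono int16 audio

async def send_file(audio_bytes: bytes):
    async with websockets.connect("ws://localhost:8766") as ws:
        for i in range(0, len(audio_bytes), CHUNK_BYTES):
            await ws.send(audio_bytes[i:i + CHUNK_BYTES])
            await asyncio.sleep(0.1)  # pace roughly at realtime

# asyncio.run(send_file(audio_bytes))  # audio_bytes from the snippet above
```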
## Examples
### Browser Client (JavaScript)
```javascript
const ws = new WebSocket('ws://localhost:8766');

ws.onopen = () => {
  console.log('Connected!');

  // Capture from microphone
  navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
      // Note: some browsers treat sampleRate as a hint and may ignore it
      const audioContext = new AudioContext({ sampleRate: 16000 });
      const source = audioContext.createMediaStreamSource(stream);
      // ScriptProcessorNode is deprecated but still widely supported
      const processor = audioContext.createScriptProcessor(4096, 1, 1);

      processor.onaudioprocess = (e) => {
        const audioData = e.inputBuffer.getChannelData(0);

        // Convert float32 to int16
        const int16Data = new Int16Array(audioData.length);
        for (let i = 0; i < audioData.length; i++) {
          int16Data[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
        }
        ws.send(int16Data.buffer);
      };

      source.connect(processor);
      processor.connect(audioContext.destination);
    });
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'transcript') {
    console.log('Transcription:', data.text);
  }
};
```
### Python Script Client
```python
#!/usr/bin/env python3
import asyncio
import json

import numpy as np
import sounddevice as sd
import websockets

async def stream_microphone():
    uri = "ws://localhost:8766"
    loop = asyncio.get_running_loop()
    audio_queue = asyncio.Queue()

    def audio_callback(indata, frames, time, status):
        # Runs on PortAudio's thread, so hand the bytes to the event
        # loop thread-safely instead of calling create_task here
        audio = (indata[:, 0] * 32767).astype(np.int16)
        loop.call_soon_threadsafe(audio_queue.put_nowait, audio.tobytes())

    async with websockets.connect(uri) as ws:
        print("Connected!")

        async def send_audio():
            while True:
                await ws.send(await audio_queue.get())

        sender = asyncio.create_task(send_audio())

        # Start recording in 0.1-second chunks (1600 samples at 16 kHz)
        with sd.InputStream(callback=audio_callback,
                            channels=1,
                            samplerate=16000,
                            blocksize=1600):
            try:
                while True:
                    data = json.loads(await ws.recv())
                    if data.get('type') == 'transcript':
                        print(f"→ {data['text']}")
            finally:
                sender.cancel()

asyncio.run(stream_microphone())
```
## Performance
With GPU (GTX 1660):
- **Latency**: <100ms per chunk
- **Throughput**: ~50-100x realtime
- **GPU Memory**: ~1.3GB
- **Languages**: 25+ (auto-detected)
## Troubleshooting
### Server won't start
```bash
# Check if port is in use
lsof -i:8766
# Kill existing server
pkill -f ws_server.py
# Restart
./run.sh server/ws_server.py
```
### Client can't connect
```bash
# Check server is running
ps aux | grep ws_server
# Check firewall
sudo ufw allow 8766
```
### No transcription output
- Check audio format (must be int16 PCM, 16 kHz, mono; a quick check is sketched below)
- Check chunk size (not too small)
- Check server logs for errors
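A quick way to verify a file before sending it (a sketch using `soundfile`, which the conversion example above already uses; the filename is a placeholder):
```python
import soundfile as sf

info = sf.info("your_audio.wav")
print(info.samplerate, info.channels, info.subtype)
# Want: 16000 1 PCM_16; anything else needs the conversion shown above
```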
### GPU not working
- Server will fall back to CPU automatically
- Check `nvidia-smi` for GPU status
- Verify CUDA libraries are loaded (should be automatic with `./run.sh`)
## Next Steps
1. **Test the server**: `./run.sh test_client.py test.wav`
2. **Try microphone**: `./run.sh client/mic_stream.py`
3. **Build your own client** using the examples above
Happy transcribing! 🎤