Decided on Parakeet with ONNX Runtime. Works well; real-time voice chat is now possible, though the UX is still lacking.
# Parakeet ASR with ONNX Runtime

Real-time Automatic Speech Recognition (ASR) system using NVIDIA's Parakeet TDT 0.6B V3 model via the `onnx-asr` library, optimized for NVIDIA GPUs (GTX 1660 and newer).

## Features

- ✅ **ONNX Runtime with GPU acceleration** (CUDA/TensorRT support)
- ✅ **Parakeet TDT 0.6B V3** multilingual model from Hugging Face
- ✅ **Real-time streaming** via WebSocket server
- ✅ **Voice Activity Detection** (Silero VAD)
- ✅ **Microphone client** for live transcription
- ✅ **Offline transcription** from audio files
- ✅ **Quantization support** (int8, fp16) for faster inference

## Model Information

This implementation uses:

- **Model**: `nemo-parakeet-tdt-0.6b-v3` (multilingual)
- **Source**: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
- **Library**: https://github.com/istupakov/onnx-asr
- **Original model**: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3

## System Requirements

- **GPU**: NVIDIA GPU with CUDA support (tested on a GTX 1660)
- **CUDA**: version 11.8 or 12.x
- **Python**: 3.10 or higher
- **Memory**: at least 4 GB of GPU memory recommended

## Installation

### 1. Navigate to the project directory

```bash
cd /home/koko210Serve/parakeet-test
```

### 2. Create virtual environment

```bash
python3 -m venv venv
source venv/bin/activate
```

### 3. Install CUDA dependencies

Make sure CUDA is installed. On Ubuntu:

```bash
# Check the installed CUDA version
nvcc --version

# If you need to install CUDA, follow NVIDIA's instructions:
# https://developer.nvidia.com/cuda-downloads
```

### 4. Install Python dependencies

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

Or manually:

```bash
# With GPU support (recommended)
pip install "onnx-asr[gpu,hub]"

# Additional dependencies (quote "numpy<2.0" so the shell doesn't treat < as a redirect)
pip install "numpy<2.0" websockets sounddevice soundfile
```

### 5. Verify CUDA availability

```bash
python3 -c "import onnxruntime as ort; print('Available providers:', ort.get_available_providers())"
```

You should see `CUDAExecutionProvider` in the list.

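If `CUDAExecutionProvider` is missing, ONNX Runtime silently falls back to the CPU. A small helper like the following (hypothetical, not part of this repo) makes the fallback explicit by picking the first available provider from a preference order:

```python
def pick_provider(available,
                  preferred=("TensorrtExecutionProvider",
                             "CUDAExecutionProvider",
                             "CPUExecutionProvider")):
    """Return the first provider from `preferred` that ONNX Runtime reports as available."""
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError("No usable ONNX Runtime execution provider found")

# Example: feed it the output of ort.get_available_providers()
print(pick_provider(["CUDAExecutionProvider", "CPUExecutionProvider"]))  # CUDAExecutionProvider
```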
## Usage

### Test Offline Transcription

Transcribe an audio file:

```bash
python3 tools/test_offline.py test.wav
```

With VAD (recommended for long audio files):

```bash
python3 tools/test_offline.py test.wav --use-vad
```

With quantization (faster, lower memory use):

```bash
python3 tools/test_offline.py test.wav --quantization int8
```

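Parakeet models expect 16 kHz mono 16-bit PCM input. If you don't have a `test.wav` handy, a compatible sine-tone file can be generated with the standard library (a convenience sketch; any 16 kHz mono WAV works):

```python
import math
import struct
import wave

def write_test_wav(path: str, seconds: float = 1.0,
                   rate: int = 16000, freq: float = 440.0) -> None:
    """Write a mono 16-bit PCM sine tone at the 16 kHz rate Parakeet expects."""
    n_frames = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        w.writeframes(frames)

write_test_wav("test.wav")
```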
### Start WebSocket Server

Start the ASR server:

```bash
python3 server/ws_server.py
```

With options:

```bash
python3 server/ws_server.py --host 0.0.0.0 --port 8765 --use-vad
```

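The wire format is defined by `server/ws_server.py`; conceptually, a streaming client slices the microphone signal into fixed-size chunks before sending them over the socket. A minimal chunking helper (an illustrative sketch, not the repo's actual code):

```python
import numpy as np

def chunk_audio(samples: np.ndarray, chunk_size: int = 1600) -> np.ndarray:
    """Split a 1-D signal into equal chunks (1600 samples = 100 ms at 16 kHz),
    zero-padding the tail so the last chunk is full length."""
    pad = (-len(samples)) % chunk_size
    padded = np.pad(samples, (0, pad))
    return padded.reshape(-1, chunk_size)

chunks = chunk_audio(np.zeros(4000, dtype=np.float32))
print(chunks.shape)  # (3, 1600)
```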
### Start Microphone Client

In a separate terminal, start the microphone client:

```bash
python3 client/mic_stream.py
```

List available audio devices:

```bash
python3 client/mic_stream.py --list-devices
```

Connect to a specific device:

```bash
python3 client/mic_stream.py --device 0
```

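Libraries like `sounddevice` typically capture int16 PCM, while ASR models consume float32 in [-1, 1]. The usual normalization looks like this (a sketch; `client/mic_stream.py` may differ in detail):

```python
import numpy as np

def int16_to_float32(pcm: np.ndarray) -> np.ndarray:
    """Scale int16 PCM samples to float32 in [-1.0, 1.0)."""
    return pcm.astype(np.float32) / 32768.0

samples = int16_to_float32(np.array([0, 16384, -32768], dtype=np.int16))
# 0 -> 0.0, 16384 -> 0.5, -32768 -> -1.0
```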
## Project Structure

```
parakeet-test/
├── asr/
│   ├── __init__.py
│   └── asr_pipeline.py       # Main ASR pipeline using onnx-asr
├── client/
│   ├── __init__.py
│   └── mic_stream.py         # Microphone streaming client
├── server/
│   ├── __init__.py
│   └── ws_server.py          # WebSocket server for streaming ASR
├── vad/
│   ├── __init__.py
│   └── silero_vad.py         # VAD wrapper using onnx-asr
├── tools/
│   ├── test_offline.py       # Test offline transcription
│   └── diagnose.py           # System diagnostics
├── models/
│   └── parakeet/             # Model files (auto-downloaded)
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```

## Model Files

The model files are automatically downloaded from Hugging Face on the first run to:

```
models/parakeet/
├── config.json
├── encoder-parakeet-tdt-0.6b-v3.onnx
├── decoder_joint-parakeet-tdt-0.6b-v3.onnx
└── vocab.txt
```

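A quick way to confirm the download succeeded is to check for the four files above. A small helper (hypothetical, not part of the repo; the filenames come from the listing above):

```python
from pathlib import Path

EXPECTED_FILES = [
    "config.json",
    "encoder-parakeet-tdt-0.6b-v3.onnx",
    "decoder_joint-parakeet-tdt-0.6b-v3.onnx",
    "vocab.txt",
]

def missing_model_files(model_dir: str = "models/parakeet") -> list:
    """Return the expected model files that are not yet present on disk."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

print(missing_model_files())  # lists the missing files; empty once all are downloaded
```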
## Configuration

### GPU Settings

The ASR pipeline uses CUDA by default. You can customize the execution providers in `asr/asr_pipeline.py`:

```python
providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "gpu_mem_limit": 6 * 1024 * 1024 * 1024,  # 6 GB
            "cudnn_conv_algo_search": "EXHAUSTIVE",
            "do_copy_in_default_stream": True,
        },
    ),
    "CPUExecutionProvider",
]
```

### TensorRT (Optional - Faster Inference)

For even better performance, you can use TensorRT:

```bash
pip install tensorrt tensorrt-cu12-libs
```

Then modify the providers:

```python
providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_max_workspace_size": 6 * 1024**3,
            "trt_fp16_enable": True,
        },
    ),
]
```

## Troubleshooting

### CUDA Not Available

If CUDA is not detected:

1. Check the CUDA installation: `nvcc --version`
2. Verify the GPU is visible: `nvidia-smi`
3. Reinstall onnxruntime-gpu:
   ```bash
   pip uninstall onnxruntime onnxruntime-gpu
   pip install onnxruntime-gpu
   ```

### Memory Issues

If you run out of GPU memory:

1. Use quantization: `--quantization int8`
2. Reduce `gpu_mem_limit` in the configuration
3. Close other applications that use the GPU

### Audio Issues

If the microphone is not working:

1. List devices: `python3 client/mic_stream.py --list-devices`
2. Select the correct device: `--device <id>`
3. Check permissions: `sudo usermod -a -G audio $USER` (then log out and back in)

### Slow Performance

1. Ensure the GPU is being used (check the logs for "CUDAExecutionProvider")
2. Try quantization for faster inference
3. Consider the TensorRT provider
4. Check GPU utilization: `nvidia-smi`

## Performance

Expected performance on a GTX 1660 (6 GB):

- **Offline transcription**: ~50-100x realtime, depending on audio length
- **Streaming**: <100 ms latency
- **Memory usage**: ~2-3 GB of GPU memory
- **Quantized (int8)**: ~30% faster, ~50% less memory

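"50-100x realtime" is the inverse real-time factor: audio duration divided by processing time. Measuring it is a one-liner (hypothetical helper, not part of the repo):

```python
def realtime_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """How many times faster than realtime the transcription ran."""
    return audio_seconds / processing_seconds

# e.g. 60 s of audio transcribed in 0.8 s:
print(realtime_speedup(60.0, 0.8))  # 75.0
```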
## License

This project uses:

- `onnx-asr`: MIT License
- Parakeet model: CC-BY-4.0 License

## References

- [onnx-asr GitHub](https://github.com/istupakov/onnx-asr)
- [Parakeet TDT 0.6B V3 ONNX](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
- [NVIDIA Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- [ONNX Runtime](https://onnxruntime.ai/)

## Credits

- Model conversion by [istupakov](https://github.com/istupakov)
- Original Parakeet model by NVIDIA