Decided on Parakeet ONNX Runtime. Works pretty great. Realtime voice chat possible now. UX lacking.

2026-01-19 00:29:44 +02:00
parent 0a8910fff8
commit 362108f4b0
34 changed files with 4593 additions and 73 deletions
--- a/stt-parakeet/REFACTORING.md
+++ b/stt-parakeet/REFACTORING.md
@@ -0,0 +1,244 @@
+# Refactoring Summary
+
+## Overview
+
+Successfully refactored the Parakeet ASR codebase to use the `onnx-asr` library with ONNX Runtime GPU support, targeting an NVIDIA GTX 1660.
+
+## Changes Made
+
+### 1. Dependencies (`requirements.txt`)
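+- **Removed** (as direct dependencies): `onnxruntime-gpu`, `silero-vad`; ONNX Runtime GPU support is now installed through the `onnx-asr[gpu]` extra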
+- **Added**: `onnx-asr[gpu,hub]`, `soundfile`
+- **Kept**: `numpy<2.0`, `websockets`, `sounddevice`
+
+### 2. ASR Pipeline (`asr/asr_pipeline.py`)
+- Completely refactored to use `onnx_asr.load_model()`
+- Added support for:
+  - GPU acceleration via CUDA/TensorRT
+  - Model quantization (int8, fp16)
+  - Voice Activity Detection (VAD)
+  - Batch processing
+  - Streaming audio chunks
+- Configurable execution providers for GPU optimization
+- Automatic model download from Hugging Face
+
+### 3. VAD Module (`vad/silero_vad.py`)
+- Refactored to use `onnx_asr.load_vad()`
+- Integrated Silero VAD via onnx-asr
+- Simplified API for VAD operations
+- Note: VAD is best used via the `model.with_vad()` method
+
+### 4. WebSocket Server (`server/ws_server.py`)
+- Created from scratch for streaming ASR
+- Features:
+  - Real-time audio streaming
+  - JSON-based protocol
+  - Support for multiple concurrent connections
+  - Buffer management for audio chunks
+  - Error handling and logging
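
The "buffer management for audio chunks" item can be pictured with a small sketch. This is not the code in `server/ws_server.py`; the class name `ChunkBuffer` and the fixed-window policy are assumptions made for illustration. It collects incoming PCM bytes of arbitrary size and releases fixed-size windows for recognition.

```python
from typing import List


class ChunkBuffer:
    """Hypothetical helper: accumulate raw PCM bytes, release fixed-size windows."""

    def __init__(self, window_bytes: int) -> None:
        if window_bytes <= 0:
            raise ValueError("window_bytes must be positive")
        self.window_bytes = window_bytes
        self._pending = bytearray()

    def push(self, chunk: bytes) -> List[bytes]:
        """Append a chunk and return every complete window now available."""
        self._pending.extend(chunk)
        windows = []
        while len(self._pending) >= self.window_bytes:
            windows.append(bytes(self._pending[: self.window_bytes]))
            del self._pending[: self.window_bytes]
        return windows

    def flush(self) -> bytes:
        """Return the leftover bytes (for example at end of stream) and reset."""
        rest = bytes(self._pending)
        self._pending.clear()
        return rest
```

With 16 kHz mono int16 audio, a one-second window would be `ChunkBuffer(window_bytes=16000 * 2)`; each WebSocket message is passed to `push()`, and whatever `flush()` returns is handled when the client disconnects.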
+
+### 5. Microphone Client (`client/mic_stream.py`)
+- Created streaming client using `sounddevice`
+- Features:
+  - Real-time microphone capture
+  - WebSocket streaming to server
+  - Audio device selection
+  - Automatic format conversion (float32 to int16)
+  - Async communication
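
The float32 to int16 conversion mentioned above can be sketched as follows. This is a standard-library illustration, not the code in `client/mic_stream.py`; the function name and the scale factor of 32767 are assumptions. Samples are clipped to [-1.0, 1.0], scaled, and packed as little-endian 16-bit PCM.

```python
import array
import sys
from typing import Iterable


def float32_to_int16_bytes(samples: Iterable[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian int16 PCM bytes."""
    pcm = array.array(
        "h",
        # Clip first so out-of-range input cannot overflow the int16 range.
        (int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples),
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # keep the wire format little-endian on any host
    return pcm.tobytes()
```

A real client would more likely run the same clip, scale, and cast steps on the NumPy blocks delivered by `sounddevice`, which avoids the per-sample Python loop.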
+
+### 6. Test Script (`tools/test_offline.py`)
+- Completely rewritten for onnx-asr
+- Features:
+  - Command-line interface
+  - Support for WAV files
+  - Optional VAD and quantization
+  - Audio statistics and diagnostics
+
+### 7. Diagnostics Tool (`tools/diagnose.py`)
+- New comprehensive system check tool
+- Checks:
+  - Python version
+  - Installed packages
+  - CUDA availability
+  - ONNX Runtime providers
+  - Audio devices
+  - Model files
+
+### 8. Setup Script (`setup_env.sh`)
+- Automated setup script
+- Features:
+  - Virtual environment creation
+  - Dependency installation
+  - CUDA/GPU detection
+  - System diagnostics
+  - Optional model download
+
+### 9. Documentation
+- **README.md**: Comprehensive documentation with:
+  - Installation instructions
+  - Usage examples
+  - Configuration options
+  - Troubleshooting guide
+  - Performance tips
+
+- **QUICKSTART.md**: Quick start guide with:
+  - 5-minute setup
+  - Common commands
+  - Troubleshooting
+  - Performance optimization
+  
+- **example.py**: Simple usage example
+
+## Key Benefits
+
+### 1. GPU Optimization
+- Native CUDA support via ONNX Runtime
+- Configurable GPU memory limits
+- Optional TensorRT for even faster inference
+- Automatic fallback to CPU if GPU unavailable
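
One way to picture the GPU-to-CPU fallback is an ordered provider list filtered against what the machine offers. The sketch below is an assumption about how that selection could be written, not the logic in `asr/asr_pipeline.py`; `pick_providers` and `PREFERRED` are hypothetical names, while the provider strings are the standard ONNX Runtime identifiers.

```python
from typing import List, Sequence

# Fastest first; the CPU provider is the last resort.
PREFERRED = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(
    available: Sequence[str], preferred: Sequence[str] = PREFERRED
) -> List[str]:
    """Keep the preferred providers that are available, always ending on CPU."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen
```

In practice `available` would come from `onnxruntime.get_available_providers()`, and the result would be passed wherever the pipeline accepts an execution-provider list, such as the `providers=` argument of `ort.InferenceSession`.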
+
+### 2. Simplified Model Management
+- Automatic model download from Hugging Face
+- No manual ONNX export needed
+- Pre-converted models ready to use
+- Support for quantized versions
+
+### 3. Better Performance
+- Optimized ONNX inference
+- GPU acceleration on GTX 1660
+- Transcription at roughly 50-100x realtime speed on GPU
+- Reduced memory usage with quantization
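
To make the realtime figure concrete: a speedup of N means that N seconds of audio are transcribed per second of compute. The helpers below are only arithmetic for reading that claim (the function names are illustrative); they do not benchmark anything.

```python
def realtime_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / processing_seconds


def expected_processing_time(audio_seconds: float, speedup: float) -> float:
    """Compute time implied by a given speedup for a clip of this length."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup
```

At the quoted 50-100x range, a 60-second clip would take between 0.6 and 1.2 seconds of inference time, not counting model loading.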
+
+### 4. Improved Usability
+- Simpler API
+- Better error handling
+- Comprehensive logging
+- Easy configuration
+
+### 5. Modern Features
+- WebSocket streaming
+- Real-time transcription
+- VAD integration
+- Batch processing
+
+## Model Information
+
+- **Model**: Parakeet TDT 0.6B V3 (Multilingual)
+- **Source**: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
+- **Size**: ~600MB
+- **Languages**: 25 European languages
+- **Location**: `models/parakeet/` (auto-downloaded)
+
+## File Structure
+
+```
+stt-parakeet/
+├── asr/
+│   ├── __init__.py              ✓ Updated
+│   └── asr_pipeline.py          ✓ Refactored
+├── client/
+│   ├── __init__.py              ✓ Updated
+│   └── mic_stream.py            ✓ New
+├── server/
+│   ├── __init__.py              ✓ Updated
+│   └── ws_server.py             ✓ New
+├── vad/
+│   ├── __init__.py              ✓ Updated
+│   └── silero_vad.py            ✓ Refactored
+├── tools/
+│   ├── diagnose.py              ✓ New
+│   └── test_offline.py          ✓ Refactored
+├── models/
+│   └── parakeet/                ✓ Auto-created
+├── requirements.txt             ✓ Updated
+├── setup_env.sh                 ✓ New
+├── README.md                    ✓ New
+├── QUICKSTART.md                ✓ New
+├── example.py                   ✓ New
+├── .gitignore                   ✓ New
+└── REFACTORING.md               ✓ This file
+```
+
+## Migration from Old Code
+
+### Old Code Pattern:
+```python
+# Manual ONNX session creation
+import onnxruntime as ort
+session = ort.InferenceSession("encoder.onnx", providers=["CUDAExecutionProvider"])
+# Manual preprocessing and decoding
+```
+
+### New Code Pattern:
+```python
+# Simple onnx-asr interface
+import onnx_asr
+model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
+text = model.recognize("audio.wav")
+```
+
+## Testing Instructions
+
+### 1. Setup
+```bash
+./setup_env.sh
+source venv/bin/activate
+```
+
+### 2. Run Diagnostics
+```bash
+python3 tools/diagnose.py
+```
+
+### 3. Test Offline
+```bash
+python3 tools/test_offline.py test.wav
+```
+
+### 4. Test Streaming
+```bash
+# Terminal 1
+python3 server/ws_server.py
+
+# Terminal 2
+python3 client/mic_stream.py
+```
+
+## Known Limitations
+
+1. **Audio Format**: Only PCM-encoded WAV files are supported directly
+2. **Segment Length**: Models work best with segments shorter than 30 seconds
+3. **GPU Memory**: Requires at least 2-3 GB of GPU memory
+4. **Sample Rate**: 16 kHz is recommended for best results
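
The format, length, and sample-rate limits above can be checked before a file reaches the model. The sketch below uses only the standard-library `wave` module; `check_wav` and `silent_wav` are hypothetical helpers for illustration and are not part of `tools/test_offline.py`.

```python
import io
import wave
from typing import BinaryIO, List


def check_wav(
    stream: BinaryIO, expected_rate: int = 16000, max_seconds: float = 30.0
) -> List[str]:
    """Return a list of problems found; an empty list means no issues."""
    try:
        with wave.open(stream, "rb") as wav:
            rate = wav.getframerate()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        # The wave module rejects data it cannot parse as a WAV file.
        return [f"not a readable PCM WAV file: {exc}"]
    if rate <= 0:
        return ["invalid sample rate in header"]
    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if frames / rate >= max_seconds:
        problems.append(f"duration {frames / rate:.1f} s exceeds {max_seconds} s")
    return problems


def silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, useful for trying the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf
```

For a file on disk, `check_wav(open("test.wav", "rb"))` applies the same checks. Long recordings are better split into segments, for example with VAD, than rejected outright.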
+
+## Future Enhancements
+
+Possible improvements:
+- [ ] Add support for other audio formats (MP3, FLAC, etc.)
+- [ ] Implement beam search decoding
+- [ ] Add language selection option
+- [ ] Support for speaker diarization
+- [ ] REST API in addition to WebSocket
+- [ ] Docker containerization
+- [ ] Batch file processing script
+- [ ] Real-time visualization of transcription
+
+## References
+
+- [onnx-asr GitHub](https://github.com/istupakov/onnx-asr)
+- [onnx-asr Documentation](https://istupakov.github.io/onnx-asr/)
+- [Parakeet ONNX Model](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
+- [Original Parakeet Model](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
+- [ONNX Runtime](https://onnxruntime.ai/)
+
+## Support
+
+For issues related to:
+- **onnx-asr library**: https://github.com/istupakov/onnx-asr/issues
+- **This implementation**: Check the logs and run `tools/diagnose.py`
+- **GPU/CUDA issues**: Verify that `nvidia-smi` works and that CUDA is installed
+
+---
+
+**Refactoring completed on**: January 18, 2026
+**Primary changes**: Migration to onnx-asr library for simplified ONNX inference with GPU support