koko210Serve 362108f4b0 Decided on Parakeet ONNX Runtime. Works pretty great. Realtime voice chat possible now. UX lacking.

2026-01-19 00:29:44 +02:00


Refactoring Summary

Overview

The Parakeet ASR codebase has been refactored to use the onnx-asr library, running on ONNX Runtime with GPU support and targeting an NVIDIA GTX 1660.

Changes Made

1. Dependencies (`requirements.txt`)

Removed: onnxruntime-gpu, silero-vad
Added: onnx-asr[gpu,hub], soundfile
Kept: numpy<2.0, websockets, sounddevice

2. ASR Pipeline (`asr/asr_pipeline.py`)

Completely refactored to use onnx_asr.load_model()
Added support for:
- GPU acceleration via CUDA/TensorRT
- Model quantization (int8, fp16)
- Voice Activity Detection (VAD)
- Batch processing
- Streaming audio chunks
Configurable execution providers for GPU optimization
Automatic model download from Hugging Face
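
The provider configuration described above can be sketched as follows. This is a minimal illustration, not the contents of `asr/asr_pipeline.py`: the helper names `pick_providers` and `load_asr_model` are hypothetical, and passing `providers` and `quantization` as keyword arguments to `onnx_asr.load_model()` is assumed from the onnx-asr README rather than confirmed by this document.

```python
from typing import Optional, Sequence

# Preference order: TensorRT (opt-in), then CUDA, then CPU as the last resort.
_PREFERRED = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available: Sequence[str], use_tensorrt: bool = False) -> list:
    """Return the preferred providers that are actually available, in priority order."""
    wanted = [p for p in _PREFERRED if use_tensorrt or p != "TensorrtExecutionProvider"]
    chosen = [p for p in wanted if p in available]
    # Always end with the CPU provider so a missing GPU does not abort loading.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def load_asr_model(
    name: str = "nemo-parakeet-tdt-0.6b-v3",
    quantization: Optional[str] = None,  # e.g. "int8"; assumed keyword
    use_tensorrt: bool = False,
):
    """Hypothetical loader: imports are local so pick_providers() works without the GPU stack."""
    import onnxruntime as ort
    import onnx_asr

    providers = pick_providers(ort.get_available_providers(), use_tensorrt)
    return onnx_asr.load_model(name, quantization=quantization, providers=providers)
```

Keeping the selection logic in a pure function makes the fallback order easy to check on a machine without a GPU.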

3. VAD Module (`vad/silero_vad.py`)

Refactored to use onnx_asr.load_vad()
Integrated Silero VAD via onnx-asr
Simplified API for VAD operations
Note: VAD is best applied through the model.with_vad() method rather than by calling the VAD module directly
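
A minimal sketch of the `with_vad()` usage mentioned in the note, assuming the onnx-asr calls `load_vad("silero")` and `with_vad()` behave as in the library's README. The function names and the `start`, `end`, and `text` attributes on each result are assumptions, not something this document confirms.

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one recognized segment as a timestamped line."""
    return f"[{start:.2f}s - {end:.2f}s] {text.strip()}"


def transcribe_with_vad(wav_path: str, model_name: str = "nemo-parakeet-tdt-0.6b-v3") -> list:
    """Hypothetical helper: split speech with Silero VAD, then recognize each segment."""
    import onnx_asr  # local import so format_segment() stays usable on its own

    vad = onnx_asr.load_vad("silero")
    model = onnx_asr.load_model(model_name).with_vad(vad)
    # Assumption: with VAD attached, recognize() yields per-segment results
    # that expose start, end, and text.
    return [format_segment(seg.start, seg.end, seg.text) for seg in model.recognize(wav_path)]
```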

4. WebSocket Server (`server/ws_server.py`)

Created from scratch for streaming ASR
Features:
- Real-time audio streaming
- JSON-based protocol
- Support for multiple concurrent connections
- Buffer management for audio chunks
- Error handling and logging
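
The buffer management and JSON protocol listed above could look roughly like this. It is a sketch only: the `AudioBuffer` class, the `make_message` helper, the 2-second chunk size, and the `type`/`text` message fields are all hypothetical, since this summary does not specify the wire format used by `server/ws_server.py`.

```python
import json


class AudioBuffer:
    """Accumulates raw 16-bit mono PCM bytes and hands out fixed-size chunks."""

    def __init__(self, sample_rate: int = 16000, chunk_seconds: float = 2.0):
        # 2 bytes per sample for int16 audio.
        self.chunk_bytes = int(sample_rate * chunk_seconds) * 2
        self._data = bytearray()

    def push(self, pcm: bytes) -> list:
        """Append incoming bytes and return every complete chunk now available."""
        self._data.extend(pcm)
        chunks = []
        while len(self._data) >= self.chunk_bytes:
            chunks.append(bytes(self._data[: self.chunk_bytes]))
            del self._data[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return whatever is left (for example when the client disconnects)."""
        rest = bytes(self._data)
        self._data.clear()
        return rest


def make_message(kind: str, **fields) -> str:
    """Serialize one protocol message; the field names are illustrative."""
    return json.dumps({"type": kind, **fields})
```

One buffer per connection keeps concurrent clients isolated from each other; each complete chunk would then be handed to the recognizer and the result sent back with something like `make_message("transcript", text=...)`.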

5. Microphone Client (`client/mic_stream.py`)

Created streaming client using sounddevice
Features:
- Real-time microphone capture
- WebSocket streaming to server
- Audio device selection
- Automatic format conversion (float32 to int16)
- Async communication
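
The float32 to int16 conversion can be illustrated with the standard library alone. The function name is hypothetical, and the real client presumably does this with NumPy on the arrays that sounddevice delivers; this dependency-free version only shows the clamping and scaling.

```python
import struct
from typing import Iterable


def float32_to_pcm16(samples: Iterable[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    # Clamp first so out-of-range input cannot overflow the int16 range.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

Scaling by 32767 rather than 32768 keeps the mapping symmetric, so +1.0 and -1.0 land on +32767 and -32767.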

6. Test Script (`tools/test_offline.py`)

Completely rewritten for onnx-asr
Features:
- Command-line interface
- Support for WAV files
- Optional VAD and quantization
- Audio statistics and diagnostics
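
As an illustration of the audio statistics step, the sketch below reads a 16-bit PCM WAV file with the standard library and reports duration, peak, and RMS level. The function name and the returned keys are hypothetical; `tools/test_offline.py` may compute different figures (the dependency list suggests it reads audio with soundfile).

```python
import math
import struct
import wave


def wav_stats(path: str) -> dict:
    """Return basic statistics for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM WAV is handled in this sketch")
        rate = wf.getframerate()
        channels = wf.getnchannels()
        n_frames = wf.getnframes()
        raw = wf.readframes(n_frames)

    # WAV data is little-endian; interleaved channels are treated as one stream.
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    peak = max((abs(s) for s in samples), default=0) / 32768.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0 if samples else 0.0
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": n_frames / rate,
        "peak": peak,
        "rms": rms,
    }
```

A peak near 1.0 suggests clipping, while a very low RMS usually means the recording is too quiet or silent.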

7. Diagnostics Tool (`tools/diagnose.py`)

New comprehensive system check tool
Checks:
- Python version
- Installed packages
- CUDA availability
- ONNX Runtime providers
- Audio devices
- Model files
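
A few of these checks can be sketched in a handful of lines. This is not the content of `tools/diagnose.py`: the function names and the minimum Python version are assumptions, and only `onnxruntime.get_available_providers()` is a library call.

```python
import importlib.util
import sys


def check_python(min_version: tuple = (3, 9)) -> bool:
    """True if the interpreter is at least min_version (the default is an assumed minimum)."""
    return sys.version_info >= min_version


def check_packages(names) -> dict:
    """Map each top-level package name to whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def onnx_providers() -> list:
    """List ONNX Runtime execution providers, or [] if onnxruntime is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return []
    return list(ort.get_available_providers())


if __name__ == "__main__":
    print("Python OK:", check_python())
    print("Packages:", check_packages(["onnx_asr", "onnxruntime", "sounddevice", "soundfile"]))
    providers = onnx_providers()
    print("Providers:", providers)
    print("CUDA available:", "CUDAExecutionProvider" in providers)
```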

8. Setup Script (`setup_env.sh`)

Automated setup script
Features:
- Virtual environment creation
- Dependency installation
- CUDA/GPU detection
- System diagnostics
- Optional model download

9. Documentation

README.md: Comprehensive documentation with:
- Installation instructions
- Usage examples
- Configuration options
- Troubleshooting guide
- Performance tips
QUICKSTART.md: Quick start guide with:
- 5-minute setup
- Common commands
- Troubleshooting
- Performance optimization
example.py: Simple usage example

Key Benefits

1. GPU Optimization

Native CUDA support via ONNX Runtime
Configurable GPU memory limits
Optional TensorRT for even faster inference
Automatic fallback to CPU if GPU unavailable
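
One way to express a GPU memory limit is through ONNX Runtime's provider options, where a provider may be given as a `(name, options)` tuple and the CUDA provider accepts `gpu_mem_limit` in bytes. The helper names below are hypothetical, and whether the pipeline forwards such a list to `onnx_asr.load_model(..., providers=...)` unchanged is an assumption.

```python
def cuda_provider(mem_limit_gb: float, device_id: int = 0) -> tuple:
    """Build a CUDA provider entry with a cap on the memory arena, in bytes."""
    options = {
        "device_id": device_id,
        "gpu_mem_limit": int(mem_limit_gb * 1024**3),
    }
    return ("CUDAExecutionProvider", options)


def providers_with_limit(mem_limit_gb: float, device_id: int = 0) -> list:
    """CUDA with a memory cap first, then CPU as the fallback."""
    return [cuda_provider(mem_limit_gb, device_id), "CPUExecutionProvider"]
```

On a 6 GB card such as the GTX 1660, a cap of 2 to 3 GB would leave room for other processes sharing the GPU.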

2. Simplified Model Management

Automatic model download from Hugging Face
No manual ONNX export needed
Pre-converted models ready to use
Support for quantized versions

3. Better Performance

Optimized ONNX inference
GPU acceleration on GTX 1660
Roughly 50-100x faster than real time on GPU (for example, 60 s of audio transcribed in about 0.6-1.2 s)
Reduced memory usage with quantization
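
The "x realtime" figure is the ratio of audio duration to processing time. The sketch below shows how it could be measured; `realtime_factor` and `measure` are hypothetical helpers, and `recognize` stands for any callable that transcribes a file, such as a loaded model's `recognize` method.

```python
import time


def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def measure(recognize, wav_path: str, audio_seconds: float) -> float:
    """Time one call to recognize(wav_path) and return the realtime factor."""
    start = time.perf_counter()
    recognize(wav_path)
    return realtime_factor(audio_seconds, time.perf_counter() - start)
```

The first call on a GPU usually includes model and CUDA initialization, so a warm-up run before measuring gives a more representative number.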

4. Improved Usability

Simpler API
Better error handling
Comprehensive logging
Easy configuration

5. Modern Features

WebSocket streaming
Real-time transcription
VAD integration
Batch processing

Model Information

Model: Parakeet TDT 0.6B V3 (Multilingual)
Source: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
Size: ~600MB
Languages: 25+
Location: models/parakeet/ (auto-downloaded)

File Structure

parakeet-test/
├── asr/
│   ├── __init__.py              ✓ Updated
│   └── asr_pipeline.py          ✓ Refactored
├── client/
│   ├── __init__.py              ✓ Updated
│   └── mic_stream.py            ✓ New
├── server/
│   ├── __init__.py              ✓ Updated
│   └── ws_server.py             ✓ New
├── vad/
│   ├── __init__.py              ✓ Updated
│   └── silero_vad.py            ✓ Refactored
├── tools/
│   ├── diagnose.py              ✓ New
│   └── test_offline.py          ✓ Refactored
├── models/
│   └── parakeet/                ✓ Auto-created
├── requirements.txt             ✓ Updated
├── setup_env.sh                 ✓ New
├── README.md                    ✓ New
├── QUICKSTART.md                ✓ New
├── example.py                   ✓ New
├── .gitignore                   ✓ New
└── REFACTORING.md               ✓ This file

Migration from Old Code

Old Code Pattern:

# Manual ONNX session creation
import onnxruntime as ort
session = ort.InferenceSession("encoder.onnx", providers=["CUDAExecutionProvider"])
# Manual preprocessing and decoding

New Code Pattern:

# Simple onnx-asr interface
import onnx_asr
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
text = model.recognize("audio.wav")

Testing Instructions

1. Setup

./setup_env.sh
source venv/bin/activate

2. Run Diagnostics

python3 tools/diagnose.py

3. Test Offline

python3 tools/test_offline.py test.wav

4. Test Streaming

# Terminal 1
python3 server/ws_server.py

# Terminal 2
python3 client/mic_stream.py

Known Limitations

Audio Format: Only PCM-encoded WAV files are supported directly
Segment Length: Models work best with segments shorter than 30 seconds
GPU Memory: Requires at least 2-3 GB of GPU memory
Sample Rate: 16 kHz is recommended for best results
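
To stay under the segment-length guideline without VAD, long recordings can be cut into fixed windows. The helper below is a hypothetical sketch that only computes sample index ranges; fixed cuts can split words, so VAD-based segmentation is generally the better choice where it is available.

```python
def split_segments(n_samples: int, sample_rate: int = 16000, max_seconds: float = 30.0) -> list:
    """Return (start, end) sample index pairs, each covering at most max_seconds."""
    step = int(sample_rate * max_seconds)
    if step <= 0:
        raise ValueError("sample_rate * max_seconds must be at least one sample")
    return [(start, min(start + step, n_samples)) for start in range(0, n_samples, step)]
```

For a 70-second recording at 16 kHz this yields two 30-second windows followed by a 10-second remainder.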

Future Enhancements

Possible improvements:

Add support for other audio formats (MP3, FLAC, etc.)
Implement beam search decoding
Add language selection option
Support for speaker diarization
REST API in addition to WebSocket
Docker containerization
Batch file processing script
Real-time visualization of transcription

References

Support

For issues related to:

onnx-asr library: https://github.com/istupakov/onnx-asr/issues
This implementation: Check the logs and run tools/diagnose.py
GPU/CUDA issues: Check the output of nvidia-smi and verify the CUDA installation

Refactoring completed on: January 18, 2026
Primary changes: Migration to the onnx-asr library for simplified ONNX inference with GPU support