# Parakeet ASR with ONNX Runtime

Real-time Automatic Speech Recognition (ASR) system using NVIDIA's Parakeet TDT 0.6B V3 model via the onnx-asr library, optimized for NVIDIA GPUs (GTX 1660 and better).
## Features

- ✅ ONNX Runtime with GPU acceleration (CUDA/TensorRT support)
- ✅ Parakeet TDT 0.6B V3 multilingual model from Hugging Face
- ✅ Real-time streaming via WebSocket server
- ✅ Voice Activity Detection (Silero VAD)
- ✅ Microphone client for live transcription
- ✅ Offline transcription from audio files
- ✅ Quantization support (int8, fp16) for faster inference
## Model Information

This implementation uses:

- Model: `nemo-parakeet-tdt-0.6b-v3` (multilingual)
- Source: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
- Library: https://github.com/istupakov/onnx-asr
- Original model: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
## System Requirements

- GPU: NVIDIA GPU with CUDA support (tested on a GTX 1660)
- CUDA: version 11.8 or 12.x
- Python: 3.10 or higher
- Memory: at least 4 GB of GPU memory recommended
## Installation

### 1. Clone the repository

```bash
cd /home/koko210Serve/parakeet-test
```

### 2. Create a virtual environment

```bash
python3 -m venv venv
source venv/bin/activate
```
### 3. Install CUDA dependencies

Make sure CUDA is installed. On Ubuntu:

```bash
# Check the CUDA version
nvcc --version
```

If you need to install CUDA, follow NVIDIA's instructions: https://developer.nvidia.com/cuda-downloads
### 4. Install Python dependencies

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

Or manually:

```bash
# With GPU support (recommended)
pip install "onnx-asr[gpu,hub]"

# Additional dependencies (quote "numpy<2.0" so the shell
# does not treat < as a redirection)
pip install "numpy<2.0" websockets sounddevice soundfile
```
### 5. Verify CUDA availability

```bash
python3 -c "import onnxruntime as ort; print('Available providers:', ort.get_available_providers())"
```

You should see `CUDAExecutionProvider` in the list.
## Usage

### Test Offline Transcription

Transcribe an audio file:

```bash
python3 tools/test_offline.py test.wav
```

With VAD (for long audio files):

```bash
python3 tools/test_offline.py test.wav --use-vad
```

With quantization (faster, uses less memory):

```bash
python3 tools/test_offline.py test.wav --quantization int8
```
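The same transcription can be done directly from Python. A minimal sketch, assuming the onnx-asr API as described in its README (`load_model` with `quantization` and `providers` keywords, and `recognize` accepting a file path); the import is kept inside the helper so the snippet loads even before the library is installed:

```python
def transcribe(audio_path, quantization=None, use_gpu=True):
    """Transcribe an audio file with Parakeet TDT via onnx-asr (sketch)."""
    import onnx_asr  # lazy import; requires `pip install onnx-asr[gpu,hub]`

    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"] if use_gpu \
        else ["CPUExecutionProvider"]
    model = onnx_asr.load_model(
        "nemo-parakeet-tdt-0.6b-v3",  # downloaded from Hugging Face on first use
        quantization=quantization,    # e.g. "int8" for the quantized weights
        providers=providers,
    )
    return model.recognize(audio_path)

# Usage (triggers the model download on first run):
# print(transcribe("test.wav", quantization="int8"))
```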
### Start WebSocket Server

Start the ASR server:

```bash
python3 server/ws_server.py
```

With options:

```bash
python3 server/ws_server.py --host 0.0.0.0 --port 8765 --use-vad
```
### Start Microphone Client

In a separate terminal, start the microphone client:

```bash
python3 client/mic_stream.py
```

List available audio devices:

```bash
python3 client/mic_stream.py --list-devices
```

Connect to a specific device:

```bash
python3 client/mic_stream.py --device 0
```
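The exact wire format is defined by `server/ws_server.py`; a common convention for streaming ASR, assumed here, is raw little-endian 16-bit PCM mono at 16 kHz. A sketch of how a client would prepare a microphone block (as captured by sounddevice, which delivers float32 samples in [-1, 1]) for the socket:

```python
import numpy as np

SAMPLE_RATE = 16000   # Parakeet models expect 16 kHz mono input
CHUNK_SECONDS = 0.5   # how much audio each WebSocket message carries (assumption)

def float_to_pcm16(block: np.ndarray) -> bytes:
    """Convert float32 samples in [-1, 1] to raw little-endian
    int16 PCM bytes for sending over the socket."""
    clipped = np.clip(block, -1.0, 1.0)
    return (clipped * 32767).astype("<i2").tobytes()

# A 0.5 s chunk of silence -> 8000 samples * 2 bytes = 16000 bytes on the wire
chunk = np.zeros(int(SAMPLE_RATE * CHUNK_SECONDS), dtype=np.float32)
payload = float_to_pcm16(chunk)
```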
## Project Structure

```
parakeet-test/
├── asr/
│   ├── __init__.py
│   └── asr_pipeline.py      # Main ASR pipeline using onnx-asr
├── client/
│   ├── __init__.py
│   └── mic_stream.py        # Microphone streaming client
├── server/
│   ├── __init__.py
│   └── ws_server.py         # WebSocket server for streaming ASR
├── vad/
│   ├── __init__.py
│   └── silero_vad.py        # VAD wrapper using onnx-asr
├── tools/
│   ├── test_offline.py      # Test offline transcription
│   └── diagnose.py          # System diagnostics
├── models/
│   └── parakeet/            # Model files (auto-downloaded)
├── requirements.txt         # Python dependencies
└── README.md                # This file
```
## Model Files

The model files are automatically downloaded from Hugging Face on first run to:

```
models/parakeet/
├── config.json
├── encoder-parakeet-tdt-0.6b-v3.onnx
├── decoder_joint-parakeet-tdt-0.6b-v3.onnx
└── vocab.txt
```
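Once the files are present, the model can be loaded from the local directory instead of the Hub; a sketch assuming onnx-asr's `path` argument to `load_model` (the keyword name is taken from its README and should be verified against the installed version):

```python
def load_local_model(model_dir="models/parakeet"):
    """Load the already-downloaded ONNX files instead of fetching
    them from Hugging Face (sketch; `path` keyword assumed)."""
    import onnx_asr  # lazy import; requires `pip install onnx-asr`
    return onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", path=model_dir)
```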
## Configuration

### GPU Settings

The ASR pipeline uses CUDA by default. You can customize the execution providers in `asr/asr_pipeline.py`:

```python
providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "gpu_mem_limit": 6 * 1024 * 1024 * 1024,  # 6 GB
            "cudnn_conv_algo_search": "EXHAUSTIVE",
            "do_copy_in_default_stream": True,
        },
    ),
    "CPUExecutionProvider",
]
```
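Whatever list you configure, it is worth degrading gracefully when CUDA is not present on a machine. A small helper (a sketch, not part of `asr/asr_pipeline.py`) that filters a preferred provider list against what onnxruntime actually reports via `get_available_providers()`:

```python
def pick_providers(preferred, available):
    """Return the preferred execution providers that are actually available,
    always keeping CPUExecutionProvider as a last resort.

    `preferred` entries may be plain names or (name, options) tuples,
    matching the format onnxruntime accepts."""
    def name_of(entry):
        return entry[0] if isinstance(entry, tuple) else entry

    chosen = [entry for entry in preferred if name_of(entry) in available]
    if "CPUExecutionProvider" not in [name_of(e) for e in chosen]:
        chosen.append("CPUExecutionProvider")  # guaranteed fallback
    return chosen

# In real use: available = onnxruntime.get_available_providers()
preferred = [("CUDAExecutionProvider", {"device_id": 0}), "CPUExecutionProvider"]
cpu_only = pick_providers(preferred, ["CPUExecutionProvider"])
# On a CPU-only machine the CUDA entry is dropped automatically.
```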
### TensorRT (Optional: Faster Inference)

For even better performance, you can use TensorRT:

```bash
pip install tensorrt tensorrt-cu12-libs
```

Then modify the providers:

```python
providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_max_workspace_size": 6 * 1024**3,
            "trt_fp16_enable": True,
        },
    ),
]
```
## Troubleshooting

### CUDA Not Available

If CUDA is not detected:

- Check the CUDA installation: `nvcc --version`
- Verify the GPU is visible: `nvidia-smi`
- Reinstall onnxruntime-gpu:

  ```bash
  pip uninstall onnxruntime onnxruntime-gpu
  pip install onnxruntime-gpu
  ```
### Memory Issues

If you run out of GPU memory:

- Use quantization: `--quantization int8`
- Reduce `gpu_mem_limit` in the configuration
- Close other applications that use the GPU
### Audio Issues

If the microphone is not working:

- List devices: `python3 client/mic_stream.py --list-devices`
- Select the correct device: `--device <id>`
- Check permissions: `sudo usermod -a -G audio $USER` (then log out and back in)
### Slow Performance

- Ensure the GPU is being used (check the logs for `CUDAExecutionProvider`)
- Try quantization for faster inference
- Consider the TensorRT provider
- Check GPU utilization: `nvidia-smi`
## Performance

Expected performance on a GTX 1660 (6 GB):

- Offline transcription: ~50-100x realtime (depending on audio length)
- Streaming: <100 ms latency
- Memory usage: ~2-3 GB GPU memory
- Quantized (int8): ~30% faster, ~50% less memory
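"50-100x realtime" is the realtime factor: seconds of audio processed per second of wall-clock time. A quick way to measure it around any transcription call (the numbers above are expectations on this hardware, not guarantees; `transcribe_fn` stands in for whatever function you are timing):

```python
import time

def realtime_factor(audio_seconds, transcribe_fn, *args):
    """Run transcribe_fn(*args) and return (result, rtf), where rtf is
    audio duration divided by wall-clock time; rtf > 1 is faster than realtime."""
    start = time.perf_counter()
    result = transcribe_fn(*args)
    elapsed = time.perf_counter() - start
    return result, audio_seconds / elapsed

# Example with a stand-in for the real transcription call:
text, rtf = realtime_factor(60.0, lambda: "hello")
```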
## License

This project uses:

- onnx-asr: MIT License
- Parakeet model: CC-BY-4.0 License

## Credits

- Model conversion by istupakov
- Original Parakeet model by NVIDIA