# Parakeet ASR with ONNX Runtime

Real-time Automatic Speech Recognition (ASR) system using NVIDIA's Parakeet TDT 0.6B V3 model via the `onnx-asr` library, optimized for NVIDIA GPUs (GTX 1660 and better).

## Features

- ✅ **ONNX Runtime with GPU acceleration** (CUDA/TensorRT support)
- ✅ **Parakeet TDT 0.6B V3** multilingual model from Hugging Face
- ✅ **Real-time streaming** via WebSocket server
- ✅ **Voice Activity Detection** (Silero VAD)
- ✅ **Microphone client** for live transcription
- ✅ **Offline transcription** from audio files
- ✅ **Quantization support** (int8, fp16) for faster inference

## Model Information

This implementation uses:

- **Model**: `nemo-parakeet-tdt-0.6b-v3` (multilingual)
- **Source**: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
- **Library**: https://github.com/istupakov/onnx-asr
- **Original Model**: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3

## System Requirements

- **GPU**: NVIDIA GPU with CUDA support (tested on a GTX 1660)
- **CUDA**: Version 11.8 or 12.x
- **Python**: 3.10 or higher
- **Memory**: At least 4 GB of GPU memory recommended

## Installation

### 1. Enter the project directory

```bash
cd /home/koko210Serve/parakeet-test
```

### 2. Create a virtual environment

```bash
python3 -m venv venv
source venv/bin/activate
```

### 3. Install CUDA dependencies

Make sure you have CUDA installed. For Ubuntu:

```bash
# Check CUDA version
nvcc --version

# If you need to install CUDA, follow NVIDIA's instructions:
# https://developer.nvidia.com/cuda-downloads
```

### 4. Install Python dependencies

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

Or manually:

```bash
# With GPU support (recommended)
pip install "onnx-asr[gpu,hub]"

# Additional dependencies (quotes keep the shell from treating < as a redirect)
pip install "numpy<2.0" websockets sounddevice soundfile
```

### 5. Verify CUDA availability

```bash
python3 -c "import onnxruntime as ort; print('Available providers:', ort.get_available_providers())"
```

You should see `CUDAExecutionProvider` in the list.
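If you want to turn that verification into a reusable check, a small helper along these lines (the `pick_provider` name is ours, not part of onnx-asr) can choose the best execution provider from whatever ONNX Runtime reports, falling back to CPU:

```python
def pick_provider(available):
    """Return the most preferred execution provider present in `available`.

    `available` is the list returned by onnxruntime.get_available_providers().
    """
    preference = [
        "TensorrtExecutionProvider",  # fastest, if TensorRT is installed
        "CUDAExecutionProvider",      # standard GPU path
        "CPUExecutionProvider",       # always-available fallback
    ]
    for provider in preference:
        if provider in available:
            return provider
    raise RuntimeError(f"No usable execution provider in: {available}")


# On a CPU-only machine ONNX Runtime typically reports only the CPU provider:
print(pick_provider(["CPUExecutionProvider"]))
# With onnxruntime-gpu installed, CUDA is preferred over CPU:
print(pick_provider(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

Logging the chosen provider at startup makes it obvious when a broken CUDA install has silently dropped you to CPU inference.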
## Usage

### Test Offline Transcription

Transcribe an audio file:

```bash
python3 tools/test_offline.py test.wav
```

With VAD (for long audio files):

```bash
python3 tools/test_offline.py test.wav --use-vad
```

With quantization (faster, lower memory use):

```bash
python3 tools/test_offline.py test.wav --quantization int8
```

### Start WebSocket Server

Start the ASR server:

```bash
python3 server/ws_server.py
```

With options:

```bash
python3 server/ws_server.py --host 0.0.0.0 --port 8765 --use-vad
```

### Start Microphone Client

In a separate terminal, start the microphone client:

```bash
python3 client/mic_stream.py
```

List available audio devices:

```bash
python3 client/mic_stream.py --list-devices
```

Connect to a specific device:

```bash
python3 client/mic_stream.py --device 0
```

## Project Structure

```
parakeet-test/
├── asr/
│   ├── __init__.py
│   └── asr_pipeline.py       # Main ASR pipeline using onnx-asr
├── client/
│   ├── __init__.py
│   └── mic_stream.py         # Microphone streaming client
├── server/
│   ├── __init__.py
│   └── ws_server.py          # WebSocket server for streaming ASR
├── vad/
│   ├── __init__.py
│   └── silero_vad.py         # VAD wrapper using onnx-asr
├── tools/
│   ├── test_offline.py       # Test offline transcription
│   └── diagnose.py           # System diagnostics
├── models/
│   └── parakeet/             # Model files (auto-downloaded)
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```

## Model Files

The model files are automatically downloaded from Hugging Face on first run to:

```
models/parakeet/
├── config.json
├── encoder-parakeet-tdt-0.6b-v3.onnx
├── decoder_joint-parakeet-tdt-0.6b-v3.onnx
└── vocab.txt
```

## Configuration

### GPU Settings

The ASR pipeline is configured to use CUDA by default.
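ONNX Runtime's `gpu_mem_limit` option is specified in bytes, which makes memory budgets easy to mistype; a tiny helper (ours, not part of the project or of onnx-asr) keeps the conversion in one place when building the CUDA provider entry:

```python
def cuda_provider(device_id=0, mem_limit_gb=6):
    """Build a (provider, options) tuple for ONNX Runtime's CUDA backend.

    gpu_mem_limit is expressed in bytes, so the gigabyte budget is
    converted with 1 GB = 1024**3 bytes.
    """
    return (
        "CUDAExecutionProvider",
        {
            "device_id": device_id,
            "gpu_mem_limit": mem_limit_gb * 1024**3,
        },
    )


# A 4 GB budget for GPUs with less free memory:
provider, options = cuda_provider(mem_limit_gb=4)
print(options["gpu_mem_limit"])  # 4294967296
```

Shrinking `mem_limit_gb` is the first knob to turn if transcription competes with other GPU workloads on a 6 GB card like the GTX 1660.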
You can customize the execution providers in `asr/asr_pipeline.py`:

```python
providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "gpu_mem_limit": 6 * 1024 * 1024 * 1024,  # 6 GB
            "cudnn_conv_algo_search": "EXHAUSTIVE",
            "do_copy_in_default_stream": True,
        },
    ),
    "CPUExecutionProvider",
]
```

### TensorRT (Optional, Faster Inference)

For even better performance, you can use TensorRT:

```bash
pip install tensorrt tensorrt-cu12-libs
```

Then modify the providers:

```python
providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_max_workspace_size": 6 * 1024**3,
            "trt_fp16_enable": True,
        },
    )
]
```

## Troubleshooting

### CUDA Not Available

If CUDA is not detected:

1. Check the CUDA installation: `nvcc --version`
2. Verify the GPU: `nvidia-smi`
3. Reinstall onnxruntime-gpu:

```bash
pip uninstall onnxruntime onnxruntime-gpu
pip install onnxruntime-gpu
```

### Memory Issues

If you run out of GPU memory:

1. Use quantization: `--quantization int8`
2. Reduce `gpu_mem_limit` in the configuration
3. Close other applications that use the GPU

### Audio Issues

If the microphone is not working:

1. List devices: `python3 client/mic_stream.py --list-devices`
2. Select the correct device: `--device <id>`
3. Check permissions: `sudo usermod -a -G audio $USER` (then log out and back in)

### Slow Performance

1. Ensure the GPU is being used (check the logs for "CUDAExecutionProvider")
2. Try quantization for faster inference
3. Consider using the TensorRT provider
4. Check GPU utilization: `nvidia-smi`

## Performance

Expected performance on a GTX 1660 (6 GB):

- **Offline transcription**: ~50-100x realtime (depending on audio length)
- **Streaming**: <100 ms latency
- **Memory usage**: ~2-3 GB GPU memory
- **Quantized (int8)**: ~30% faster, ~50% less memory

## License

This project uses:

- `onnx-asr`: MIT License
- Parakeet model: CC-BY-4.0 License

## References

- [onnx-asr GitHub](https://github.com/istupakov/onnx-asr)
- [Parakeet TDT 0.6B V3 ONNX](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
- [NVIDIA Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- [ONNX Runtime](https://onnxruntime.ai/)

## Credits

- Model conversion by [istupakov](https://github.com/istupakov)
- Original Parakeet model by NVIDIA