
Parakeet ASR with ONNX Runtime

Real-time Automatic Speech Recognition (ASR) system using NVIDIA's Parakeet TDT 0.6B V3 model via the onnx-asr library, optimized for NVIDIA GPUs (GTX 1660 and better).

Features

  • ONNX Runtime with GPU acceleration (CUDA/TensorRT support)
  • Parakeet TDT 0.6B V3 multilingual model from Hugging Face
  • Real-time streaming via WebSocket server
  • Voice Activity Detection (Silero VAD)
  • Microphone client for live transcription
  • Offline transcription from audio files
  • Quantization support (int8, fp16) for faster inference

Model Information

This implementation uses:

  • NVIDIA Parakeet TDT 0.6B V3, a 0.6B-parameter multilingual transducer (TDT) model
  • istupakov's ONNX conversion of the model, hosted on Hugging Face
  • The onnx-asr library, which loads the model and runs it with ONNX Runtime
  • Silero VAD (also loaded through onnx-asr) for voice activity detection
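
A minimal sketch of loading the model through onnx-asr (the exact model identifier string is an assumption; check the onnx-asr documentation for the name matching this conversion):

import onnx_asr

# Model identifier assumed; verify against the onnx-asr model list.
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")  # downloads on first use
print(model.recognize("test.wav"))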

System Requirements

  • GPU: NVIDIA GPU with CUDA support (tested on GTX 1660)
  • CUDA: Version 11.8 or 12.x
  • Python: 3.10 or higher
  • Memory: At least 4GB GPU memory recommended

Installation

1. Navigate to the project directory

cd /home/koko210Serve/parakeet-test

2. Create virtual environment

python3 -m venv venv
source venv/bin/activate

3. Install CUDA dependencies

Make sure you have CUDA installed. For Ubuntu:

# Check CUDA version
nvcc --version

# If you need to install CUDA, follow NVIDIA's instructions:
# https://developer.nvidia.com/cuda-downloads

4. Install Python dependencies

pip install --upgrade pip
pip install -r requirements.txt

Or manually:

# With GPU support (recommended)
pip install "onnx-asr[gpu,hub]"

# Additional dependencies
pip install "numpy<2.0" websockets sounddevice soundfile

5. Verify CUDA availability

python3 -c "import onnxruntime as ort; print('Available providers:', ort.get_available_providers())"

You should see CUDAExecutionProvider in the list.

Usage

Test Offline Transcription

Transcribe an audio file:

python3 tools/test_offline.py test.wav

With VAD (for long audio files):

python3 tools/test_offline.py test.wav --use-vad

With quantization (faster, less memory):

python3 tools/test_offline.py test.wav --quantization int8
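
Under the hood, tools/test_offline.py wraps onnx-asr. A minimal sketch of the same flow is shown below; the model identifier, the load_vad name, and the quantization keyword are assumptions based on the onnx-asr API, so check them against your installed version:

import onnx_asr

# Quantized load (keyword assumed; see onnx-asr docs for supported values).
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", quantization="int8")
print(model.recognize("test.wav"))

# With Silero VAD: long audio is split into speech segments first.
vad = onnx_asr.load_vad("silero")
for segment in model.with_vad(vad).recognize("test.wav"):
    print(segment)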

Start WebSocket Server

Start the ASR server:

python3 server/ws_server.py

With options:

python3 server/ws_server.py --host 0.0.0.0 --port 8765 --use-vad
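
The sketch below shows the general shape of such a server. It is not the actual server/ws_server.py; the wire format (raw 16 kHz mono int16 PCM frames in, JSON text out), the model identifier, and numpy input to recognize() are assumptions for illustration:

import asyncio
import json

import numpy as np
import websockets

import onnx_asr

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")  # identifier assumed

async def handle(ws):
    buffer = np.zeros(0, dtype=np.float32)
    async for message in ws:
        # Incoming frames: raw 16 kHz mono int16 PCM, converted to float32 in [-1, 1].
        chunk = np.frombuffer(message, dtype=np.int16).astype(np.float32) / 32768.0
        buffer = np.concatenate([buffer, chunk])
        if len(buffer) >= 16000:  # transcribe roughly once per second of audio
            text = model.recognize(buffer, sample_rate=16000)  # numpy input assumed supported
            await ws.send(json.dumps({"text": text}))
            buffer = np.zeros(0, dtype=np.float32)

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())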

Start Microphone Client

In a separate terminal, start the microphone client:

python3 client/mic_stream.py

List available audio devices:

python3 client/mic_stream.py --list-devices

Connect to a specific device:

python3 client/mic_stream.py --device 0
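
A minimal client along the same lines, again as a sketch assuming the raw-PCM protocol from the server example above (client/mic_stream.py is the real implementation):

import asyncio

import sounddevice as sd
import websockets

SAMPLE_RATE = 16000
BLOCK = 1600  # 100 ms of audio per frame

async def stream(device=None):
    async with websockets.connect("ws://localhost:8765") as ws:
        loop = asyncio.get_running_loop()
        queue = asyncio.Queue()

        def callback(indata, frames, time_info, status):
            # Called from the PortAudio thread; hand bytes to the event loop.
            loop.call_soon_threadsafe(queue.put_nowait, bytes(indata))

        with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=BLOCK,
                               channels=1, dtype="int16",
                               device=device, callback=callback):
            while True:
                await ws.send(await queue.get())

asyncio.run(stream())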

Project Structure

parakeet-test/
├── asr/
│   ├── __init__.py
│   └── asr_pipeline.py       # Main ASR pipeline using onnx-asr
├── client/
│   ├── __init__.py
│   └── mic_stream.py          # Microphone streaming client
├── server/
│   ├── __init__.py
│   └── ws_server.py           # WebSocket server for streaming ASR
├── vad/
│   ├── __init__.py
│   └── silero_vad.py          # VAD wrapper using onnx-asr
├── tools/
│   ├── test_offline.py        # Test offline transcription
│   └── diagnose.py            # System diagnostics
├── models/
│   └── parakeet/              # Model files (auto-downloaded)
├── requirements.txt           # Python dependencies
└── README.md                  # This file

Model Files

The model files will be automatically downloaded from Hugging Face on first run to:

models/parakeet/
├── config.json
├── encoder-parakeet-tdt-0.6b-v3.onnx
├── decoder_joint-parakeet-tdt-0.6b-v3.onnx
└── vocab.txt

Configuration

GPU Settings

The ASR pipeline is configured to use CUDA by default. You can customize the execution providers in asr/asr_pipeline.py:

providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "gpu_mem_limit": 6 * 1024 * 1024 * 1024,  # 6GB
            "cudnn_conv_algo_search": "EXHAUSTIVE",
            "do_copy_in_default_stream": True,
        }
    ),
    "CPUExecutionProvider",
]
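
If you load the model through onnx-asr directly, the provider list can be passed at load time. The providers keyword is an assumption about the onnx-asr API; verify it for your installed version:

import onnx_asr

# Providers keyword assumed to be forwarded to ONNX Runtime; check onnx-asr docs.
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", providers=providers)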

TensorRT (Optional - Faster Inference)

For even better performance, you can use TensorRT:

pip install tensorrt tensorrt-cu12-libs

Then modify the providers:

providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_max_workspace_size": 6 * 1024**3,
            "trt_fp16_enable": True,
        },
    )
]
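
Note that TensorRT compiles an engine for the model on first run, which can take several minutes. ONNX Runtime's standard TensorRT engine-cache options persist the compiled engine so later runs start quickly; keeping a CPU fallback is also a good idea:

providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_max_workspace_size": 6 * 1024**3,
            "trt_fp16_enable": True,
            # Persist the compiled engine so the slow build happens only once.
            "trt_engine_cache_enable": True,
            "trt_engine_cache_path": "./trt_cache",
        },
    ),
    "CPUExecutionProvider",
]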

Troubleshooting

CUDA Not Available

If CUDA is not detected:

  1. Check CUDA installation: nvcc --version
  2. Verify GPU: nvidia-smi
  3. Reinstall onnxruntime-gpu:
    pip uninstall onnxruntime onnxruntime-gpu
    pip install onnxruntime-gpu

Memory Issues

If you run out of GPU memory:

  1. Use quantization: --quantization int8
  2. Reduce gpu_mem_limit in the configuration
  3. Close other GPU-using applications

Audio Issues

If microphone is not working:

  1. List devices: python3 client/mic_stream.py --list-devices
  2. Select the correct device: --device <id>
  3. Check permissions: sudo usermod -a -G audio $USER (then logout/login)

Slow Performance

  1. Ensure GPU is being used (check logs for "CUDAExecutionProvider")
  2. Try quantization for faster inference
  3. Consider using TensorRT provider
  4. Check GPU utilization: nvidia-smi

Performance

Expected performance on GTX 1660 (6GB):

  • Offline transcription: ~50-100x realtime (depending on audio length)
  • Streaming: <100ms latency
  • Memory usage: ~2-3GB GPU memory
  • Quantized (int8): ~30% faster, ~50% less memory
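
To measure the realtime factor on your own hardware, time a transcription against the audio duration (assumes soundfile is installed; the model identifier is assumed, as in the examples above):

import time

import onnx_asr
import soundfile as sf

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
duration = sf.info("test.wav").duration  # audio length in seconds

start = time.perf_counter()
model.recognize("test.wav")
elapsed = time.perf_counter() - start
print(f"{duration / elapsed:.1f}x realtime")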

License

This project uses:

  • onnx-asr: MIT License
  • Parakeet model: CC-BY-4.0 License

References

  • onnx-asr: https://github.com/istupakov/onnx-asr
  • Parakeet TDT 0.6B V3 model card: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
  • Silero VAD: https://github.com/snakers4/silero-vad

Credits

  • Model conversion by istupakov
  • Original Parakeet model by NVIDIA