# Parakeet ASR with ONNX Runtime

Real-time Automatic Speech Recognition (ASR) system using NVIDIA's Parakeet TDT 0.6B V3 model via the `onnx-asr` library, optimized for NVIDIA GPUs (GTX 1660 and better).

## Features

- ✅ **ONNX Runtime with GPU acceleration** (CUDA/TensorRT support)
- ✅ **Parakeet TDT 0.6B V3** multilingual model from Hugging Face
- ✅ **Real-time streaming** via WebSocket server
- ✅ **Voice Activity Detection** (Silero VAD)
- ✅ **Microphone client** for live transcription
- ✅ **Offline transcription** from audio files
- ✅ **Quantization support** (int8, fp16) for faster inference

## Model Information

This implementation uses:

- **Model**: `nemo-parakeet-tdt-0.6b-v3` (multilingual)
- **Source**: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
- **Library**: https://github.com/istupakov/onnx-asr
- **Original model**: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3

## System Requirements

- **GPU**: NVIDIA GPU with CUDA support (tested on a GTX 1660)
- **CUDA**: version 11.8 or 12.x
- **Python**: 3.10 or higher
- **Memory**: at least 4 GB of GPU memory recommended

## Installation

### 1. Go to the project directory

```bash
cd /home/koko210Serve/parakeet-test
```

### 2. Create virtual environment

```bash
python3 -m venv venv
source venv/bin/activate
```

### 3. Install CUDA dependencies

Make sure you have CUDA installed. For Ubuntu:

```bash
# Check CUDA version
nvcc --version

# If you need to install CUDA, follow NVIDIA's instructions:
# https://developer.nvidia.com/cuda-downloads
```

### 4. Install Python dependencies

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

Or manually:

```bash
# With GPU support (recommended); the quotes keep the shell from
# globbing the [gpu,hub] extras
pip install "onnx-asr[gpu,hub]"

# Additional dependencies; the quotes keep the shell from treating
# < as an output redirect
pip install "numpy<2.0" websockets sounddevice soundfile
```

### 5. Verify CUDA availability

```bash
python3 -c "import onnxruntime as ort; print('Available providers:', ort.get_available_providers())"
```

You should see `CUDAExecutionProvider` in the list.

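The same check can feed directly into provider selection, so one script runs on both GPU and CPU-only machines. A minimal, dependency-free sketch (the `pick_providers` helper is illustrative, not part of this repo):

```python
def pick_providers(available: list[str]) -> list[str]:
    """Prefer CUDA when available; always keep the CPU fallback last."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

# Feed it the result of ort.get_available_providers():
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
print(pick_providers([]))  # ['CPUExecutionProvider']
```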
## Usage

### Test Offline Transcription

Transcribe an audio file:

```bash
python3 tools/test_offline.py test.wav
```

With VAD (for long audio files):

```bash
python3 tools/test_offline.py test.wav --use-vad
```

With quantization (faster, less memory):

```bash
python3 tools/test_offline.py test.wav --quantization int8
```

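Under the hood, `test_offline.py` drives the `onnx-asr` API. A minimal sketch following the usage shown in the onnx-asr README (exact keyword arguments may differ between library versions; see `asr/asr_pipeline.py` for what this project actually does):

```python
def transcribe(path: str) -> str:
    """Load Parakeet via onnx-asr and transcribe one audio file."""
    import onnx_asr  # lazy import: pip install "onnx-asr[gpu,hub]"

    # load_model() fetches the ONNX weights from Hugging Face on first use.
    model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
    return model.recognize(path)
```

Usage: `print(transcribe("test.wav"))`.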
### Start WebSocket Server

Start the ASR server:

```bash
python3 server/ws_server.py
```

With options:

```bash
python3 server/ws_server.py --host 0.0.0.0 --port 8765 --use-vad
```

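Clients other than `mic_stream.py` can talk to the server as well. A hedged sketch of a file-streaming client using the `websockets` package; the framing here (raw binary audio frames, text replies, an empty frame as end-of-stream) is an assumption for illustration only, so check `server/ws_server.py` for the actual protocol:

```python
import asyncio

async def stream_file(uri: str, pcm_bytes: bytes, chunk: int = 3200) -> None:
    """Send 16-bit PCM audio to the ASR server and print transcripts.

    Assumes the server accepts raw binary audio frames and replies with
    text messages; verify against server/ws_server.py before relying on it.
    """
    import websockets  # lazy import: pip install websockets

    async with websockets.connect(uri) as ws:
        for i in range(0, len(pcm_bytes), chunk):
            await ws.send(pcm_bytes[i:i + chunk])
        await ws.send(b"")  # hypothetical end-of-stream marker
        async for message in ws:
            print(message)
```

Usage: `asyncio.run(stream_file("ws://localhost:8765", audio_bytes))`.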
### Start Microphone Client

In a separate terminal, start the microphone client:

```bash
python3 client/mic_stream.py
```

List available audio devices:

```bash
python3 client/mic_stream.py --list-devices
```

Connect to a specific device:

```bash
python3 client/mic_stream.py --device 0
```

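A streaming client like `mic_stream.py` typically ships fixed-size 16 kHz mono chunks over the socket. How a float32 capture buffer (the format `sounddevice` delivers by default) might be converted to 16-bit PCM bytes; a generic sketch, not necessarily this repo's wire format:

```python
import numpy as np

CHUNK_SAMPLES = 1600  # 100 ms at 16 kHz

def float32_to_pcm16(samples: np.ndarray) -> bytes:
    """Clip float32 samples to [-1, 1] and pack as little-endian int16."""
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767.0).astype("<i2").tobytes()

# One 100 ms chunk is 3200 bytes (2 bytes per sample).
chunk = float32_to_pcm16(np.zeros(CHUNK_SAMPLES, dtype=np.float32))
print(len(chunk))  # 3200
```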
## Project Structure

```
parakeet-test/
├── asr/
│   ├── __init__.py
│   └── asr_pipeline.py      # Main ASR pipeline using onnx-asr
├── client/
│   ├── __init__.py
│   └── mic_stream.py        # Microphone streaming client
├── server/
│   ├── __init__.py
│   └── ws_server.py         # WebSocket server for streaming ASR
├── vad/
│   ├── __init__.py
│   └── silero_vad.py        # VAD wrapper using onnx-asr
├── tools/
│   ├── test_offline.py      # Offline transcription test
│   └── diagnose.py          # System diagnostics
├── models/
│   └── parakeet/            # Model files (auto-downloaded)
├── requirements.txt         # Python dependencies
└── README.md                # This file
```

## Model Files

The model files are automatically downloaded from Hugging Face on first run to:

```
models/parakeet/
├── config.json
├── encoder-parakeet-tdt-0.6b-v3.onnx
├── decoder_joint-parakeet-tdt-0.6b-v3.onnx
└── vocab.txt
```

## Configuration

### GPU Settings

The ASR pipeline uses CUDA by default. You can customize the execution providers in `asr/asr_pipeline.py`:

```python
providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "gpu_mem_limit": 6 * 1024 * 1024 * 1024,  # 6 GB
            "cudnn_conv_algo_search": "EXHAUSTIVE",
            "do_copy_in_default_stream": True,
        },
    ),
    "CPUExecutionProvider",
]
```

### TensorRT (Optional - Faster Inference)

For even better performance, you can use TensorRT:

```bash
pip install tensorrt tensorrt-cu12-libs
```

Then modify the providers:

```python
providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_max_workspace_size": 6 * 1024**3,
            "trt_fp16_enable": True,
        },
    )
]
```

## Troubleshooting

### CUDA Not Available

If CUDA is not detected:

1. Check the CUDA installation: `nvcc --version`
2. Verify the GPU is visible: `nvidia-smi`
3. Reinstall onnxruntime-gpu:

   ```bash
   pip uninstall onnxruntime onnxruntime-gpu
   pip install onnxruntime-gpu
   ```

### Memory Issues

If you run out of GPU memory:

1. Use quantization: `--quantization int8`
2. Reduce `gpu_mem_limit` in the configuration
3. Close other applications that use the GPU

### Audio Issues

If the microphone is not working:

1. List devices: `python3 client/mic_stream.py --list-devices`
2. Select the correct device: `--device <id>`
3. Check permissions: `sudo usermod -a -G audio $USER` (then log out and back in)

### Slow Performance

1. Ensure the GPU is being used (check the logs for `CUDAExecutionProvider`)
2. Try quantization for faster inference
3. Consider the TensorRT provider
4. Check GPU utilization: `nvidia-smi`

## Performance

Expected performance on a GTX 1660 (6 GB):

- **Offline transcription**: ~50-100× realtime (depending on audio length)
- **Streaming**: <100 ms latency
- **Memory usage**: ~2-3 GB GPU memory
- **Quantized (int8)**: ~30% faster, ~50% less memory

## License

This project uses:

- `onnx-asr`: MIT License
- Parakeet model: CC-BY-4.0 License

## References

- [onnx-asr GitHub](https://github.com/istupakov/onnx-asr)
- [Parakeet TDT 0.6B V3 ONNX](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
- [NVIDIA Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- [ONNX Runtime](https://onnxruntime.ai/)

## Credits

- Model conversion by [istupakov](https://github.com/istupakov)
- Original Parakeet model by NVIDIA