# Parakeet ASR with ONNX Runtime

Real-time Automatic Speech Recognition (ASR) system using NVIDIA's Parakeet TDT 0.6B V3 model via the `onnx-asr` library, optimized for NVIDIA GPUs (GTX 1660 and better).

## Features

- ✅ **ONNX Runtime with GPU acceleration** (CUDA/TensorRT support)
- ✅ **Parakeet TDT 0.6B V3** multilingual model from Hugging Face
- ✅ **Real-time streaming** via WebSocket server
- ✅ **Voice Activity Detection** (Silero VAD)
- ✅ **Microphone client** for live transcription
- ✅ **Offline transcription** from audio files
- ✅ **Quantization support** (int8, fp16) for faster inference

## Model Information

This implementation uses:

- **Model**: `nemo-parakeet-tdt-0.6b-v3` (multilingual)
- **Source**: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
- **Library**: https://github.com/istupakov/onnx-asr
- **Original model**: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3

## System Requirements

- **GPU**: NVIDIA GPU with CUDA support (tested on a GTX 1660)
- **CUDA**: version 11.8 or 12.x
- **Python**: 3.10 or higher
- **Memory**: at least 4 GB of GPU memory recommended

## Installation

### 1. Go to the project directory

```bash
cd /home/koko210Serve/parakeet-test
```

### 2. Create virtual environment

```bash
python3 -m venv venv
source venv/bin/activate
```

### 3. Install CUDA dependencies

Make sure you have CUDA installed. For Ubuntu:

```bash
# Check CUDA version
nvcc --version

# If you need to install CUDA, follow NVIDIA's instructions:
# https://developer.nvidia.com/cuda-downloads
```

### 4. Install Python dependencies

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

Or manually:

```bash
# With GPU support (recommended); the quotes keep the shell from
# globbing the [gpu,hub] extras
pip install "onnx-asr[gpu,hub]"

# Additional dependencies; the quotes keep the shell from treating
# < as an output redirect
pip install "numpy<2.0" websockets sounddevice soundfile
```

### 5. Verify CUDA availability

```bash
python3 -c "import onnxruntime as ort; print('Available providers:', ort.get_available_providers())"
```

You should see `CUDAExecutionProvider` in the list.

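The same check can feed directly into provider selection, so one script runs on both GPU and CPU-only machines. A minimal, dependency-free sketch (the `pick_providers` helper is illustrative, not part of this repo):

```python
def pick_providers(available: list[str]) -> list[str]:
    """Prefer CUDA when available; always keep the CPU fallback last."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

# Feed it the result of ort.get_available_providers():
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
print(pick_providers([]))  # ['CPUExecutionProvider']
```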
## Usage

### Test Offline Transcription

Transcribe an audio file:

```bash
python3 tools/test_offline.py test.wav
```

With VAD (for long audio files):

```bash
python3 tools/test_offline.py test.wav --use-vad
```

With quantization (faster, less memory):

```bash
python3 tools/test_offline.py test.wav --quantization int8
```

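Under the hood, `test_offline.py` drives the `onnx-asr` API. A minimal sketch following the usage shown in the onnx-asr README (exact keyword arguments may differ between library versions; see `asr/asr_pipeline.py` for what this project actually does):

```python
def transcribe(path: str) -> str:
    """Load Parakeet via onnx-asr and transcribe one audio file."""
    import onnx_asr  # lazy import: pip install "onnx-asr[gpu,hub]"

    # load_model() fetches the ONNX weights from Hugging Face on first use.
    model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
    return model.recognize(path)
```

Usage: `print(transcribe("test.wav"))`.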
### Start WebSocket Server

Start the ASR server:

```bash
python3 server/ws_server.py
```

With options:

```bash
python3 server/ws_server.py --host 0.0.0.0 --port 8765 --use-vad
```

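Clients other than `mic_stream.py` can talk to the server as well. A hedged sketch of a file-streaming client using the `websockets` package; the framing here (raw binary audio frames, text replies, an empty frame as end-of-stream) is an assumption for illustration only, so check `server/ws_server.py` for the actual protocol:

```python
import asyncio

async def stream_file(uri: str, pcm_bytes: bytes, chunk: int = 3200) -> None:
    """Send 16-bit PCM audio to the ASR server and print transcripts.

    Assumes the server accepts raw binary audio frames and replies with
    text messages; verify against server/ws_server.py before relying on it.
    """
    import websockets  # lazy import: pip install websockets

    async with websockets.connect(uri) as ws:
        for i in range(0, len(pcm_bytes), chunk):
            await ws.send(pcm_bytes[i:i + chunk])
        await ws.send(b"")  # hypothetical end-of-stream marker
        async for message in ws:
            print(message)
```

Usage: `asyncio.run(stream_file("ws://localhost:8765", audio_bytes))`.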
### Start Microphone Client

In a separate terminal, start the microphone client:

```bash
python3 client/mic_stream.py
```

List available audio devices:

```bash
python3 client/mic_stream.py --list-devices
```

Connect to a specific device:

```bash
python3 client/mic_stream.py --device 0
```

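A streaming client like `mic_stream.py` typically ships fixed-size 16 kHz mono chunks over the socket. How a float32 capture buffer (the format `sounddevice` delivers by default) might be converted to 16-bit PCM bytes; a generic sketch, not necessarily this repo's wire format:

```python
import numpy as np

CHUNK_SAMPLES = 1600  # 100 ms at 16 kHz

def float32_to_pcm16(samples: np.ndarray) -> bytes:
    """Clip float32 samples to [-1, 1] and pack as little-endian int16."""
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767.0).astype("<i2").tobytes()

# One 100 ms chunk is 3200 bytes (2 bytes per sample).
chunk = float32_to_pcm16(np.zeros(CHUNK_SAMPLES, dtype=np.float32))
print(len(chunk))  # 3200
```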
## Project Structure

```
parakeet-test/
├── asr/
│   ├── __init__.py
│   └── asr_pipeline.py      # Main ASR pipeline using onnx-asr
├── client/
│   ├── __init__.py
│   └── mic_stream.py        # Microphone streaming client
├── server/
│   ├── __init__.py
│   └── ws_server.py         # WebSocket server for streaming ASR
├── vad/
│   ├── __init__.py
│   └── silero_vad.py        # VAD wrapper using onnx-asr
├── tools/
│   ├── test_offline.py      # Offline transcription test
│   └── diagnose.py          # System diagnostics
├── models/
│   └── parakeet/            # Model files (auto-downloaded)
├── requirements.txt         # Python dependencies
└── README.md                # This file
```

## Model Files

The model files are automatically downloaded from Hugging Face on first run to:

```
models/parakeet/
├── config.json
├── encoder-parakeet-tdt-0.6b-v3.onnx
├── decoder_joint-parakeet-tdt-0.6b-v3.onnx
└── vocab.txt
```

## Configuration

### GPU Settings

The ASR pipeline uses CUDA by default. You can customize the execution providers in `asr/asr_pipeline.py`:

```python
providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "gpu_mem_limit": 6 * 1024 * 1024 * 1024,  # 6 GB
            "cudnn_conv_algo_search": "EXHAUSTIVE",
            "do_copy_in_default_stream": True,
        },
    ),
    "CPUExecutionProvider",
]
```

### TensorRT (Optional - Faster Inference)

For even better performance, you can use TensorRT:

```bash
pip install tensorrt tensorrt-cu12-libs
```

Then modify the providers:

```python
providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_max_workspace_size": 6 * 1024**3,
            "trt_fp16_enable": True,
        },
    )
]
```

## Troubleshooting

### CUDA Not Available

If CUDA is not detected:

1. Check the CUDA installation: `nvcc --version`
2. Verify the GPU is visible: `nvidia-smi`
3. Reinstall onnxruntime-gpu:

   ```bash
   pip uninstall onnxruntime onnxruntime-gpu
   pip install onnxruntime-gpu
   ```

### Memory Issues

If you run out of GPU memory:

1. Use quantization: `--quantization int8`
2. Reduce `gpu_mem_limit` in the configuration
3. Close other applications that use the GPU

### Audio Issues

If the microphone is not working:

1. List devices: `python3 client/mic_stream.py --list-devices`
2. Select the correct device: `--device <id>`
3. Check permissions: `sudo usermod -a -G audio $USER` (then log out and back in)

### Slow Performance

1. Ensure the GPU is being used (check the logs for `CUDAExecutionProvider`)
2. Try quantization for faster inference
3. Consider the TensorRT provider
4. Check GPU utilization: `nvidia-smi`

## Performance

Expected performance on a GTX 1660 (6 GB):

- **Offline transcription**: ~50-100× realtime (depending on audio length)
- **Streaming**: <100 ms latency
- **Memory usage**: ~2-3 GB GPU memory
- **Quantized (int8)**: ~30% faster, ~50% less memory

## License

This project uses:

- `onnx-asr`: MIT License
- Parakeet model: CC-BY-4.0 License

## References

- [onnx-asr GitHub](https://github.com/istupakov/onnx-asr)
- [Parakeet TDT 0.6B V3 ONNX](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
- [NVIDIA Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- [ONNX Runtime](https://onnxruntime.ai/)

## Credits

- Model conversion by [istupakov](https://github.com/istupakov)
- Original Parakeet model by NVIDIA