Decided on Parakeet ONNX Runtime. Works pretty great. Realtime voice chat possible now. UX lacking.
This commit is contained in:
244
stt-parakeet/REFACTORING.md
Normal file
244
stt-parakeet/REFACTORING.md
Normal file
@@ -0,0 +1,244 @@
|
||||
# Refactoring Summary
|
||||
|
||||
## Overview
|
||||
|
||||
Successfully refactored the Parakeet ASR codebase to use the `onnx-asr` library with ONNX Runtime GPU support for NVIDIA GTX 1660.
|
||||
|
||||
## Changes Made
|
||||
|
||||
### 1. Dependencies (`requirements.txt`)
|
||||
- **Removed**: `onnxruntime-gpu`, `silero-vad`
|
||||
- **Added**: `onnx-asr[gpu,hub]`, `soundfile`
|
||||
- **Kept**: `numpy<2.0`, `websockets`, `sounddevice`
|
||||
|
||||
### 2. ASR Pipeline (`asr/asr_pipeline.py`)
|
||||
- Completely refactored to use `onnx_asr.load_model()`
|
||||
- Added support for:
|
||||
- GPU acceleration via CUDA/TensorRT
|
||||
- Model quantization (int8, fp16)
|
||||
- Voice Activity Detection (VAD)
|
||||
- Batch processing
|
||||
- Streaming audio chunks
|
||||
- Configurable execution providers for GPU optimization
|
||||
- Automatic model download from Hugging Face
|
||||
|
||||
### 3. VAD Module (`vad/silero_vad.py`)
|
||||
- Refactored to use `onnx_asr.load_vad()`
|
||||
- Integrated Silero VAD via onnx-asr
|
||||
- Simplified API for VAD operations
|
||||
- Note: VAD is best used via `model.with_vad()` method
|
||||
|
||||
### 4. WebSocket Server (`server/ws_server.py`)
|
||||
- Created from scratch for streaming ASR
|
||||
- Features:
|
||||
- Real-time audio streaming
|
||||
- JSON-based protocol
|
||||
- Support for multiple concurrent connections
|
||||
- Buffer management for audio chunks
|
||||
- Error handling and logging
|
||||
|
||||
### 5. Microphone Client (`client/mic_stream.py`)
|
||||
- Created streaming client using `sounddevice`
|
||||
- Features:
|
||||
- Real-time microphone capture
|
||||
- WebSocket streaming to server
|
||||
- Audio device selection
|
||||
- Automatic format conversion (float32 to int16)
|
||||
- Async communication
|
||||
|
||||
### 6. Test Script (`tools/test_offline.py`)
|
||||
- Completely rewritten for onnx-asr
|
||||
- Features:
|
||||
- Command-line interface
|
||||
- Support for WAV files
|
||||
- Optional VAD and quantization
|
||||
- Audio statistics and diagnostics
|
||||
|
||||
### 7. Diagnostics Tool (`tools/diagnose.py`)
|
||||
- New comprehensive system check tool
|
||||
- Checks:
|
||||
- Python version
|
||||
- Installed packages
|
||||
- CUDA availability
|
||||
- ONNX Runtime providers
|
||||
- Audio devices
|
||||
- Model files
|
||||
|
||||
### 8. Setup Script (`setup_env.sh`)
|
||||
- Automated setup script
|
||||
- Features:
|
||||
- Virtual environment creation
|
||||
- Dependency installation
|
||||
- CUDA/GPU detection
|
||||
- System diagnostics
|
||||
- Optional model download
|
||||
|
||||
### 9. Documentation
|
||||
- **README.md**: Comprehensive documentation with:
|
||||
- Installation instructions
|
||||
- Usage examples
|
||||
- Configuration options
|
||||
- Troubleshooting guide
|
||||
- Performance tips
|
||||
|
||||
- **QUICKSTART.md**: Quick start guide with:
|
||||
- 5-minute setup
|
||||
- Common commands
|
||||
- Troubleshooting
|
||||
- Performance optimization
|
||||
|
||||
- **example.py**: Simple usage example
|
||||
|
||||
## Key Benefits
|
||||
|
||||
### 1. GPU Optimization
|
||||
- Native CUDA support via ONNX Runtime
|
||||
- Configurable GPU memory limits
|
||||
- Optional TensorRT for even faster inference
|
||||
- Automatic fallback to CPU if GPU unavailable
|
||||
|
||||
### 2. Simplified Model Management
|
||||
- Automatic model download from Hugging Face
|
||||
- No manual ONNX export needed
|
||||
- Pre-converted models ready to use
|
||||
- Support for quantized versions
|
||||
|
||||
### 3. Better Performance
|
||||
- Optimized ONNX inference
|
||||
- GPU acceleration on GTX 1660
|
||||
- ~50-100x realtime on GPU
|
||||
- Reduced memory usage with quantization
|
||||
|
||||
### 4. Improved Usability
|
||||
- Simpler API
|
||||
- Better error handling
|
||||
- Comprehensive logging
|
||||
- Easy configuration
|
||||
|
||||
### 5. Modern Features
|
||||
- WebSocket streaming
|
||||
- Real-time transcription
|
||||
- VAD integration
|
||||
- Batch processing
|
||||
|
||||
## Model Information
|
||||
|
||||
- **Model**: Parakeet TDT 0.6B V3 (Multilingual)
|
||||
- **Source**: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
|
||||
- **Size**: ~600MB
|
||||
- **Languages**: 25+ languages
|
||||
- **Location**: `models/parakeet/` (auto-downloaded)
|
||||
|
||||
## File Structure
|
||||
|
||||
```
|
||||
parakeet-test/
|
||||
├── asr/
|
||||
│ ├── __init__.py ✓ Updated
|
||||
│ └── asr_pipeline.py ✓ Refactored
|
||||
├── client/
|
||||
│ ├── __init__.py ✓ Updated
|
||||
│ └── mic_stream.py ✓ New
|
||||
├── server/
|
||||
│ ├── __init__.py ✓ Updated
|
||||
│ └── ws_server.py ✓ New
|
||||
├── vad/
|
||||
│ ├── __init__.py ✓ Updated
|
||||
│ └── silero_vad.py ✓ Refactored
|
||||
├── tools/
|
||||
│ ├── diagnose.py ✓ New
|
||||
│ └── test_offline.py ✓ Refactored
|
||||
├── models/
|
||||
│ └── parakeet/ ✓ Auto-created
|
||||
├── requirements.txt ✓ Updated
|
||||
├── setup_env.sh ✓ New
|
||||
├── README.md ✓ New
|
||||
├── QUICKSTART.md ✓ New
|
||||
├── example.py ✓ New
|
||||
├── .gitignore ✓ New
|
||||
└── REFACTORING.md ✓ This file
|
||||
```
|
||||
|
||||
## Migration from Old Code
|
||||
|
||||
### Old Code Pattern:
|
||||
```python
|
||||
# Manual ONNX session creation
|
||||
import onnxruntime as ort
|
||||
session = ort.InferenceSession("encoder.onnx", providers=["CUDAExecutionProvider"])
|
||||
# Manual preprocessing and decoding
|
||||
```
|
||||
|
||||
### New Code Pattern:
|
||||
```python
|
||||
# Simple onnx-asr interface
|
||||
import onnx_asr
|
||||
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
|
||||
text = model.recognize("audio.wav")
|
||||
```
|
||||
|
||||
## Testing Instructions
|
||||
|
||||
### 1. Setup
|
||||
```bash
|
||||
./setup_env.sh
|
||||
source venv/bin/activate
|
||||
```
|
||||
|
||||
### 2. Run Diagnostics
|
||||
```bash
|
||||
python3 tools/diagnose.py
|
||||
```
|
||||
|
||||
### 3. Test Offline
|
||||
```bash
|
||||
python3 tools/test_offline.py test.wav
|
||||
```
|
||||
|
||||
### 4. Test Streaming
|
||||
```bash
|
||||
# Terminal 1
|
||||
python3 server/ws_server.py
|
||||
|
||||
# Terminal 2
|
||||
python3 client/mic_stream.py
|
||||
```
|
||||
|
||||
## Known Limitations
|
||||
|
||||
1. **Audio Format**: Only WAV files with PCM encoding supported directly
|
||||
2. **Segment Length**: Models work best with <30 second segments
|
||||
3. **GPU Memory**: Requires at least 2-3GB GPU memory
|
||||
4. **Sample Rate**: 16kHz recommended for best results
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
Possible improvements:
|
||||
- [ ] Add support for other audio formats (MP3, FLAC, etc.)
|
||||
- [ ] Implement beam search decoding
|
||||
- [ ] Add language selection option
|
||||
- [ ] Support for speaker diarization
|
||||
- [ ] REST API in addition to WebSocket
|
||||
- [ ] Docker containerization
|
||||
- [ ] Batch file processing script
|
||||
- [ ] Real-time visualization of transcription
|
||||
|
||||
## References
|
||||
|
||||
- [onnx-asr GitHub](https://github.com/istupakov/onnx-asr)
|
||||
- [onnx-asr Documentation](https://istupakov.github.io/onnx-asr/)
|
||||
- [Parakeet ONNX Model](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
|
||||
- [Original Parakeet Model](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
|
||||
- [ONNX Runtime](https://onnxruntime.ai/)
|
||||
|
||||
## Support
|
||||
|
||||
For issues related to:
|
||||
- **onnx-asr library**: https://github.com/istupakov/onnx-asr/issues
|
||||
- **This implementation**: Check logs and run diagnose.py
|
||||
- **GPU/CUDA issues**: Verify nvidia-smi and CUDA installation
|
||||
|
||||
---
|
||||
|
||||
**Refactoring completed on**: January 18, 2026
|
||||
**Primary changes**: Migration to onnx-asr library for simplified ONNX inference with GPU support
|
||||
Reference in New Issue
Block a user