# Quick Start Guide

## 🚀 Getting Started in 5 Minutes

### 1. Setup Environment

```bash
# Make setup script executable and run it
chmod +x setup_env.sh
./setup_env.sh
```
The setup script will:

- Create a virtual environment
- Install all dependencies, including `onnx-asr`
- Check CUDA/GPU availability
- Run system diagnostics
- Optionally download the Parakeet model
### 2. Activate Virtual Environment

```bash
source venv/bin/activate
```
### 3. Test Your Setup

Run diagnostics to verify everything is working:

```bash
python3 tools/diagnose.py
```
Expected output should show:
- ✓ Python 3.10+
- ✓ onnx-asr installed
- ✓ CUDAExecutionProvider available
- ✓ GPU detected
### 4. Test Offline Transcription

Create a test audio file or use an existing WAV file:

```bash
python3 tools/test_offline.py test.wav
```
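Under the hood, `test_offline.py` presumably drives `onnx-asr`; a minimal standalone sketch following the `onnx-asr` README API (the exact model identifier string is an assumption and may differ from what `test_offline.py` uses):

```python
# Minimal offline transcription sketch with onnx-asr; the model name
# string is an assumption -- check the onnx-asr docs for the exact ID.
import onnx_asr

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")  # downloads on first use
print(model.recognize("test.wav"))  # prints the transcript as a string
```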
### 5. Start Real-Time Streaming

**Terminal 1 - Start Server:**

```bash
python3 server/ws_server.py
```

**Terminal 2 - Start Client:**

```bash
# List audio devices first
python3 client/mic_stream.py --list-devices

# Start streaming with your microphone
python3 client/mic_stream.py --device 0
```
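If you want to drive the server from your own code instead of `mic_stream.py`, here is a rough client sketch. It assumes the server accepts raw 16-bit mono PCM binary frames and replies with text transcripts; the actual wire protocol is defined in `server/ws_server.py`, so adjust to match.

```python
# Hypothetical streaming client: assumes ws_server.py takes raw 16-bit mono
# PCM frames and sends back text transcripts. Check server/ws_server.py for
# the real protocol before relying on this.
import asyncio
import websockets

async def stream_wav(path: str, url: str = "ws://localhost:8765") -> None:
    async with websockets.connect(url) as ws:
        with open(path, "rb") as f:
            f.seek(44)  # skip a canonical 44-byte WAV header
            while chunk := f.read(3200):  # 0.1 s at 16 kHz, 16-bit mono
                await ws.send(chunk)
                try:
                    # Print any transcript the server has ready, without blocking
                    print(await asyncio.wait_for(ws.recv(), timeout=0.01))
                except asyncio.TimeoutError:
                    pass

asyncio.run(stream_wav("test.wav"))
```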
## 🎯 Common Commands

### Offline Transcription

```bash
# Basic transcription
python3 tools/test_offline.py audio.wav

# With Voice Activity Detection (for long files)
python3 tools/test_offline.py audio.wav --use-vad

# With quantization (faster, uses less memory)
python3 tools/test_offline.py audio.wav --quantization int8
```
### WebSocket Server

```bash
# Start server on default port (8765)
python3 server/ws_server.py

# Custom host and port
python3 server/ws_server.py --host 0.0.0.0 --port 9000

# With VAD enabled
python3 server/ws_server.py --use-vad
```
### Microphone Client

```bash
# List available audio devices
python3 client/mic_stream.py --list-devices

# Connect to server
python3 client/mic_stream.py --url ws://localhost:8765

# Use specific device
python3 client/mic_stream.py --device 2

# Custom sample rate
python3 client/mic_stream.py --sample-rate 16000
```
## 🔧 Troubleshooting

### GPU Not Detected

- Check the NVIDIA driver:

  ```bash
  nvidia-smi
  ```

- Check the CUDA version:

  ```bash
  nvcc --version
  ```

- Verify ONNX Runtime can see the GPU:

  ```bash
  python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
  ```

  The output should include `CUDAExecutionProvider`.
### Out of Memory

If you get CUDA out-of-memory errors:

- Use quantization:

  ```bash
  python3 tools/test_offline.py audio.wav --quantization int8
  ```

- Close other GPU applications
- Reduce the GPU memory limit in `asr/asr_pipeline.py` (see the sketch below):

  ```python
  "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4 GB instead of 6 GB
  ```
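For context, `gpu_mem_limit` is a standard `CUDAExecutionProvider` option in ONNX Runtime; here is a sketch of how such a provider configuration is typically passed to a session (whether `asr/asr_pipeline.py` is structured exactly this way is an assumption):

```python
# Sketch of a CUDA provider configuration with a memory cap. The option
# itself is standard ONNX Runtime; the surrounding structure is illustrative.
import onnxruntime as ort

providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # cap GPU memory at 4 GB
    }),
    "CPUExecutionProvider",  # fallback if CUDA cannot initialize
]
session = ort.InferenceSession("model.onnx", providers=providers)
```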
### Microphone Not Working

- Check permissions:

  ```bash
  sudo usermod -a -G audio $USER  # Then log out and log back in
  ```

- Test with the system audio recorder first
- List and test devices:

  ```bash
  python3 client/mic_stream.py --list-devices
  ```
### Model Download Fails

If Hugging Face is slow or blocked:

- Set an HF token (optional, for faster downloads):

  ```bash
  export HF_TOKEN="your_huggingface_token"
  ```

- Manual download (or use the scripted alternative below):

  ```bash
  # Download from: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
  # Extract to: models/parakeet/
  ```
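A scripted alternative, assuming `huggingface_hub` is installed in your environment (otherwise `pip install huggingface_hub`):

```python
# Download the ONNX model repo into models/parakeet/ using huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="istupakov/parakeet-tdt-0.6b-v3-onnx",
    local_dir="models/parakeet",
)
```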
## 📊 Performance Tips

### For Best GPU Performance

- Use the TensorRT provider (faster than CUDA):

  ```bash
  pip install tensorrt tensorrt-cu12-libs
  ```

  Then edit `asr/asr_pipeline.py` to use the TensorRT provider (see the sketch after this list).

- Use FP16 quantization (on TensorRT):

  ```python
  providers = [
      ("TensorrtExecutionProvider", {
          "trt_fp16_enable": True,
      }),
  ]
  ```

- Enable quantization:

  ```bash
  --quantization int8  # Good balance
  --quantization fp16  # Better quality
  ```
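As referenced in the first item above, the provider list is what ultimately reaches ONNX Runtime; a sketch with CUDA and CPU fallbacks (the exact structure of `asr/asr_pipeline.py` is an assumption):

```python
# TensorRT first, with CUDA and CPU fallbacks; ONNX Runtime picks the first
# provider it can actually initialize.
import onnxruntime as ort

providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)
print(session.get_providers())  # shows which providers were actually enabled
```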
### For Lower Latency Streaming

- Reduce the chunk duration in the client (see the sizing example below):

  ```bash
  python3 client/mic_stream.py --chunk-duration 0.05
  ```

- Disable VAD for shorter responses
- Use a quantized model for faster processing
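To see what a chunk duration means on the wire, here is the arithmetic for 16 kHz, 16-bit mono audio:

```python
# Bytes per chunk at 16 kHz, 16-bit (2-byte) mono PCM for a few durations.
sample_rate = 16000
bytes_per_sample = 2
for duration in (0.05, 0.1, 0.5):
    size = int(sample_rate * duration) * bytes_per_sample
    print(f"{duration} s -> {size} bytes per chunk")
# Smaller chunks lower latency but add per-message overhead.
```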
## 🎤 Audio File Requirements

### Supported Formats
- Format: WAV (PCM_16, PCM_24, PCM_32, PCM_U8)
- Sample Rate: 16000 Hz (recommended)
- Channels: Mono (stereo will be converted to mono)
### Convert Audio Files

```bash
# Using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Using sox
sox input.mp3 -r 16000 -c 1 output.wav
```
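If you prefer to stay in Python and your input is already a WAV file, a sketch using `soundfile` and `scipy` (both are assumptions about your environment; ffmpeg/sox remain the right tools for compressed inputs like MP3):

```python
# Downmix to mono and resample to 16 kHz, writing 16-bit PCM WAV.
import soundfile as sf
from scipy.signal import resample_poly

data, rate = sf.read("input.wav")
if data.ndim > 1:
    data = data.mean(axis=1)                 # stereo -> mono
if rate != 16000:
    data = resample_poly(data, 16000, rate)  # resample to 16 kHz
sf.write("output.wav", data, 16000, subtype="PCM_16")
```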
## 📝 Example Workflow

A complete example for transcribing a meeting recording:

```bash
# 1. Activate environment
source venv/bin/activate

# 2. Convert audio to correct format
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 meeting.wav

# 3. Transcribe with VAD (for long recordings)
python3 tools/test_offline.py meeting.wav --use-vad

# Output will show transcription with automatic segmentation
```
## 🌐 Supported Languages

The Parakeet TDT 0.6B V3 model supports 25 European languages, including:

- English
- Spanish
- French
- German
- Italian
- Portuguese
- Russian
- Dutch
- Polish
- Ukrainian
- And more...

The model automatically detects the language.
## 💡 Tips

- For short audio clips (<30 seconds): Don't use VAD
- For long audio files: Use the `--use-vad` flag
- For real-time streaming: Keep chunks small (0.1-0.5 seconds)
- For best accuracy: Use 16 kHz mono WAV files
- For faster inference: Use `--quantization int8`
## 📚 More Information

- See `README.md` for detailed documentation
- Run `python3 tools/diagnose.py` for a system check
- Check logs for debugging information
## 🆘 Getting Help

If you encounter issues:

- Run diagnostics:

  ```bash
  python3 tools/diagnose.py
  ```

- Check the logs in the terminal output
- Verify your audio format and sample rate
- Review the troubleshooting section above