# Quick Start Guide
## 🚀 Getting Started in 5 Minutes
### 1. Setup Environment
```bash
# Make setup script executable and run it
chmod +x setup_env.sh
./setup_env.sh
```
The setup script will:
- Create a virtual environment
- Install all dependencies including `onnx-asr`
- Check CUDA/GPU availability
- Run system diagnostics
- Optionally download the Parakeet model
### 2. Activate Virtual Environment
```bash
source venv/bin/activate
```
### 3. Test Your Setup
Run diagnostics to verify everything is working:
```bash
python3 tools/diagnose.py
```
Expected output should show:
- ✓ Python 3.10+
- ✓ onnx-asr installed
- ✓ CUDAExecutionProvider available
- ✓ GPU detected
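If you want to script these checks yourself, the first two can be reproduced with the standard library alone (a minimal sketch; `tools/diagnose.py` remains the authoritative check, and the helper names here are illustrative, not part of this repo):

```python
import importlib.util
import sys

def python_ok(version_info=sys.version_info):
    """Return True if the interpreter meets the 3.10+ requirement."""
    return version_info >= (3, 10)

def module_installed(name):
    """Return True if a module can be located without importing it."""
    return importlib.util.find_spec(name) is not None

if __name__ == "__main__":
    print("Python 3.10+:        ", python_ok())
    print("onnx-asr installed:  ", module_installed("onnx_asr"))
    print("onnxruntime found:   ", module_installed("onnxruntime"))
```

Checking `find_spec` instead of importing avoids paying the import cost just to test availability.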
### 4. Test Offline Transcription
Create a test audio file or use an existing WAV file:
```bash
python3 tools/test_offline.py test.wav
```
### 5. Start Real-Time Streaming
**Terminal 1 - Start Server:**
```bash
python3 server/ws_server.py
```
**Terminal 2 - Start Client:**
```bash
# List audio devices first
python3 client/mic_stream.py --list-devices
# Start streaming with your microphone
python3 client/mic_stream.py --device 0
```
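Under the hood, a streaming client like this sends fixed-duration slices of raw PCM over the socket. The framing step can be sketched as follows (illustrative only, assuming 16 kHz mono int16 audio; `client/mic_stream.py` is the real implementation):

```python
SAMPLE_RATE = 16_000   # Hz, the rate the server expects
BYTES_PER_SAMPLE = 2   # int16 PCM

def chunk_pcm(pcm: bytes, chunk_duration: float = 0.1):
    """Yield successive slices of raw PCM, each covering chunk_duration seconds."""
    chunk_bytes = int(SAMPLE_RATE * chunk_duration) * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]

# One second of silence splits into ten 0.1 s chunks of 3200 bytes each
chunks = list(chunk_pcm(b"\x00" * SAMPLE_RATE * BYTES_PER_SAMPLE))
```

Smaller chunks lower latency but increase per-message overhead, which is why the streaming tips below suggest keeping chunks in the 0.1–0.5 s range.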
## 🎯 Common Commands
### Offline Transcription
```bash
# Basic transcription
python3 tools/test_offline.py audio.wav
# With Voice Activity Detection (for long files)
python3 tools/test_offline.py audio.wav --use-vad
# With quantization (faster, uses less memory)
python3 tools/test_offline.py audio.wav --quantization int8
```
### WebSocket Server
```bash
# Start server on default port (8765)
python3 server/ws_server.py
# Custom host and port
python3 server/ws_server.py --host 0.0.0.0 --port 9000
# With VAD enabled
python3 server/ws_server.py --use-vad
```
### Microphone Client
```bash
# List available audio devices
python3 client/mic_stream.py --list-devices
# Connect to server
python3 client/mic_stream.py --url ws://localhost:8765
# Use specific device
python3 client/mic_stream.py --device 2
# Custom sample rate
python3 client/mic_stream.py --sample-rate 16000
```
## 🔧 Troubleshooting
### GPU Not Detected
1. Check NVIDIA driver:
```bash
nvidia-smi
```
2. Check CUDA version:
```bash
nvcc --version
```
3. Verify ONNX Runtime can see GPU:
```bash
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
```
The output should include `CUDAExecutionProvider`.
### Out of Memory
If you get CUDA out of memory errors:
1. **Use quantization:**
```bash
python3 tools/test_offline.py audio.wav --quantization int8
```
2. **Close other GPU applications**
3. **Reduce GPU memory limit** in `asr/asr_pipeline.py`:
```python
"gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4 GB instead of 6 GB
```
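For context, the memory limit is one of several options passed to the CUDA execution provider. A fuller provider configuration might look like this (a sketch; the exact placement depends on how `asr/asr_pipeline.py` builds its session):

```python
# Reduced-memory ONNX Runtime provider configuration
providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,      # cap the arena at 4 GiB
        "arena_extend_strategy": "kSameAsRequested",  # grow only as much as needed
    }),
    "CPUExecutionProvider",  # fallback when CUDA is unavailable
]
```

Listing `CPUExecutionProvider` last means inference still works, just slower, if the GPU cannot be used.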
### Microphone Not Working
1. Check permissions:
```bash
sudo usermod -a -G audio $USER
# Then logout and login again
```
2. Test with system audio recorder first
3. List and test devices:
```bash
python3 client/mic_stream.py --list-devices
```
### Model Download Fails
If Hugging Face is slow or blocked:
1. **Set an HF token** (optional; helps avoid rate limits):
```bash
export HF_TOKEN="your_huggingface_token"
```
2. **Manual download:**
```bash
# Download from: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
# Extract to: models/parakeet/
```
## 📊 Performance Tips
### For Best GPU Performance
1. **Use the TensorRT provider** (often faster than the CUDA provider):
```bash
pip install tensorrt tensorrt-cu12-libs
```
Then edit `asr/asr_pipeline.py` to use the TensorRT provider.
2. **Use FP16 quantization** (on TensorRT):
```python
providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,
    })
]
```
3. **Enable quantization:**
```bash
--quantization int8 # Good balance
--quantization fp16 # Better quality
```
### For Lower Latency Streaming
1. **Reduce chunk duration** in client:
```bash
python3 client/mic_stream.py --chunk-duration 0.05
```
2. **Disable VAD** if you don't need automatic segmentation; it adds processing latency
3. **Use quantized model** for faster processing
## 🎤 Audio File Requirements
### Supported Formats
- **Format**: WAV (PCM_16, PCM_24, PCM_32, PCM_U8)
- **Sample Rate**: 16000 Hz (recommended)
- **Channels**: Mono (stereo will be converted to mono)
### Convert Audio Files
```bash
# Using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
# Using sox
sox input.mp3 -r 16000 -c 1 output.wav
```
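If neither ffmpeg nor sox is available, the stereo-to-mono half of the conversion can be done with the standard library alone. This sketch does not resample, so the input must already be 16 kHz, and it only handles PCM_16 input; the `stereo_to_mono` helper is illustrative, not part of this repo:

```python
import struct
import wave

def stereo_to_mono(src_path: str, dst_path: str) -> None:
    """Downmix a PCM_16 WAV file to mono by averaging left/right samples."""
    with wave.open(src_path, "rb") as src:
        assert src.getsampwidth() == 2, "only PCM_16 input is handled here"
        params = src.getparams()
        frames = src.readframes(params.nframes)
    if params.nchannels == 2:
        samples = struct.unpack(f"<{params.nframes * 2}h", frames)
        # Average each left/right pair into a single mono sample
        mono = [(l + r) // 2 for l, r in zip(samples[::2], samples[1::2])]
        frames = struct.pack(f"<{len(mono)}h", *mono)
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(params.framerate)
        dst.writeframes(frames)
```

For anything beyond this (resampling, MP3 input, other bit depths), stick with the ffmpeg command above.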
## 📝 Example Workflow
Complete example for transcribing a meeting recording:
```bash
# 1. Activate environment
source venv/bin/activate
# 2. Convert audio to correct format
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 meeting.wav
# 3. Transcribe with VAD (for long recordings)
python3 tools/test_offline.py meeting.wav --use-vad
# Output will show transcription with automatic segmentation
```
## 🌐 Supported Languages
The Parakeet TDT 0.6B V3 model supports **25+ languages** including:
- English
- Spanish
- French
- German
- Italian
- Portuguese
- Russian
- Chinese
- Japanese
- Korean
- And more...
The model automatically detects the language.
## 💡 Tips
1. **For short audio clips** (<30 seconds): Don't use VAD
2. **For long audio files**: Use `--use-vad` flag
3. **For real-time streaming**: Keep chunks small (0.1-0.5 seconds)
4. **For best accuracy**: Use 16kHz mono WAV files
5. **For faster inference**: Use `--quantization int8`
## 📚 More Information
- See `README.md` for detailed documentation
- Run `python3 tools/diagnose.py` for system check
- Check logs for debugging information
## 🆘 Getting Help
If you encounter issues:
1. Run diagnostics:
```bash
python3 tools/diagnose.py
```
2. Check the logs in the terminal output
3. Verify your audio format and sample rate
4. Review the troubleshooting section above