# Quick Start Guide
## 🚀 Getting Started in 5 Minutes
### 1. Setup Environment
```bash
# Make setup script executable and run it
chmod +x setup_env.sh
./setup_env.sh
```
The setup script will:
- Create a virtual environment
- Install all dependencies including `onnx-asr`
- Check CUDA/GPU availability
- Run system diagnostics
- Optionally download the Parakeet model
### 2. Activate Virtual Environment
```bash
source venv/bin/activate
```
### 3. Test Your Setup
Run diagnostics to verify everything is working:
```bash
python3 tools/diagnose.py
```
Expected output should show:
- ✓ Python 3.10+
- ✓ onnx-asr installed
- ✓ CUDAExecutionProvider available
- ✓ GPU detected
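If you want to script these checks yourself, the first two can be reproduced with the standard library alone (a minimal sketch; `tools/diagnose.py` remains the authoritative check, and the helper names here are illustrative, not part of this repo):

```python
import importlib.util
import sys

def python_ok(version_info=sys.version_info):
    """Return True if the interpreter meets the 3.10+ requirement."""
    return version_info >= (3, 10)

def module_installed(name):
    """Return True if a module can be located without importing it."""
    return importlib.util.find_spec(name) is not None

if __name__ == "__main__":
    print("Python 3.10+:        ", python_ok())
    print("onnx-asr installed:  ", module_installed("onnx_asr"))
    print("onnxruntime found:   ", module_installed("onnxruntime"))
```

Checking `find_spec` instead of importing avoids paying the import cost just to test availability.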
### 4. Test Offline Transcription
Create a test audio file or use an existing WAV file:
```bash
python3 tools/test_offline.py test.wav
```
### 5. Start Real-Time Streaming
**Terminal 1 - Start Server:**
```bash
python3 server/ws_server.py
```
**Terminal 2 - Start Client:**
```bash
# List audio devices first
python3 client/mic_stream.py --list-devices
# Start streaming with your microphone
python3 client/mic_stream.py --device 0
```
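Under the hood, a streaming client like this sends fixed-duration slices of raw PCM over the socket. The framing step can be sketched as follows (illustrative only, assuming 16 kHz mono int16 audio; `client/mic_stream.py` is the real implementation):

```python
SAMPLE_RATE = 16_000   # Hz, the rate the server expects
BYTES_PER_SAMPLE = 2   # int16 PCM

def chunk_pcm(pcm: bytes, chunk_duration: float = 0.1):
    """Yield successive slices of raw PCM, each covering chunk_duration seconds."""
    chunk_bytes = int(SAMPLE_RATE * chunk_duration) * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]

# One second of silence splits into ten 0.1 s chunks of 3200 bytes each
chunks = list(chunk_pcm(b"\x00" * SAMPLE_RATE * BYTES_PER_SAMPLE))
```

Smaller chunks lower latency but increase per-message overhead, which is why the streaming tips below suggest keeping chunks in the 0.1–0.5 s range.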
## 🎯 Common Commands
### Offline Transcription
```bash
# Basic transcription
python3 tools/test_offline.py audio.wav
# With Voice Activity Detection (for long files)
python3 tools/test_offline.py audio.wav --use-vad
# With quantization (faster, uses less memory)
python3 tools/test_offline.py audio.wav --quantization int8
```
### WebSocket Server
```bash
# Start server on default port (8765)
python3 server/ws_server.py
# Custom host and port
python3 server/ws_server.py --host 0.0.0.0 --port 9000
# With VAD enabled
python3 server/ws_server.py --use-vad
```
### Microphone Client
```bash
# List available audio devices
python3 client/mic_stream.py --list-devices
# Connect to server
python3 client/mic_stream.py --url ws://localhost:8765
# Use specific device
python3 client/mic_stream.py --device 2
# Custom sample rate
python3 client/mic_stream.py --sample-rate 16000
```
## 🔧 Troubleshooting
### GPU Not Detected
1. Check NVIDIA driver:
```bash
nvidia-smi
```
2. Check CUDA version:
```bash
nvcc --version
```
3. Verify ONNX Runtime can see GPU:
```bash
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
```
The output should include `CUDAExecutionProvider`.
### Out of Memory
If you get CUDA out of memory errors:
1. **Use quantization:**
```bash
python3 tools/test_offline.py audio.wav --quantization int8
```
2. **Close other GPU applications**
3. **Reduce GPU memory limit** in `asr/asr_pipeline.py`:
```python
"gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4 GB instead of 6 GB
```
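For context, the memory limit is one of several options passed to the CUDA execution provider. A fuller provider configuration might look like this (a sketch; the exact placement depends on how `asr/asr_pipeline.py` builds its session):

```python
# Reduced-memory ONNX Runtime provider configuration
providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,      # cap the arena at 4 GiB
        "arena_extend_strategy": "kSameAsRequested",  # grow only as much as needed
    }),
    "CPUExecutionProvider",  # fallback when CUDA is unavailable
]
```

Listing `CPUExecutionProvider` last means inference still works, just slower, if the GPU cannot be used.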
### Microphone Not Working
1. Check permissions:
```bash
sudo usermod -a -G audio $USER
# Then logout and login again
```
2. Test with system audio recorder first
3. List and test devices:
```bash
python3 client/mic_stream.py --list-devices
```
### Model Download Fails
If Hugging Face is slow or blocked:
1. **Set an HF token** (optional; helps avoid rate limits):
```bash
export HF_TOKEN="your_huggingface_token"
```
2. **Manual download:**
```bash
# Download from: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
# Extract to: models/parakeet/
```
## 📊 Performance Tips
### For Best GPU Performance
1. **Use the TensorRT provider** (often faster than the CUDA provider):
```bash
pip install tensorrt tensorrt-cu12-libs
```
Then edit `asr/asr_pipeline.py` to use the TensorRT provider.
2. **Use FP16 quantization** (on TensorRT):
```python
providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,
    })
]
```
3. **Enable quantization:**
```bash
--quantization int8 # Good balance
--quantization fp16 # Better quality
```
### For Lower Latency Streaming
1. **Reduce chunk duration** in client:
```bash
python3 client/mic_stream.py --chunk-duration 0.05
```
2. **Disable VAD** if you don't need automatic segmentation; it adds processing latency
3. **Use quantized model** for faster processing
## 🎤 Audio File Requirements
### Supported Formats
- **Format**: WAV (PCM_16, PCM_24, PCM_32, PCM_U8)
- **Sample Rate**: 16000 Hz (recommended)
- **Channels**: Mono (stereo will be converted to mono)
### Convert Audio Files
```bash
# Using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
# Using sox
sox input.mp3 -r 16000 -c 1 output.wav
```
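If neither ffmpeg nor sox is available, the stereo-to-mono half of the conversion can be done with the standard library alone. This sketch does not resample, so the input must already be 16 kHz, and it only handles PCM_16 input; the `stereo_to_mono` helper is illustrative, not part of this repo:

```python
import struct
import wave

def stereo_to_mono(src_path: str, dst_path: str) -> None:
    """Downmix a PCM_16 WAV file to mono by averaging left/right samples."""
    with wave.open(src_path, "rb") as src:
        assert src.getsampwidth() == 2, "only PCM_16 input is handled here"
        params = src.getparams()
        frames = src.readframes(params.nframes)
    if params.nchannels == 2:
        samples = struct.unpack(f"<{params.nframes * 2}h", frames)
        # Average each left/right pair into a single mono sample
        mono = [(l + r) // 2 for l, r in zip(samples[::2], samples[1::2])]
        frames = struct.pack(f"<{len(mono)}h", *mono)
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(params.framerate)
        dst.writeframes(frames)
```

For anything beyond this (resampling, MP3 input, other bit depths), stick with the ffmpeg command above.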
## 📝 Example Workflow
Complete example for transcribing a meeting recording:
```bash
# 1. Activate environment
source venv/bin/activate
# 2. Convert audio to correct format
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 meeting.wav
# 3. Transcribe with VAD (for long recordings)
python3 tools/test_offline.py meeting.wav --use-vad
# Output will show transcription with automatic segmentation
```
## 🌐 Supported Languages
The Parakeet TDT 0.6B V3 model supports **25+ languages** including:
- English
- Spanish
- French
- German
- Italian
- Portuguese
- Russian
- Chinese
- Japanese
- Korean
- And more...
The model automatically detects the language.
## 💡 Tips
1. **For short audio clips** (<30 seconds): Don't use VAD
2. **For long audio files**: Use `--use-vad` flag
3. **For real-time streaming**: Keep chunks small (0.1-0.5 seconds)
4. **For best accuracy**: Use 16kHz mono WAV files
5. **For faster inference**: Use `--quantization int8`
## 📚 More Information
- See `README.md` for detailed documentation
- Run `python3 tools/diagnose.py` for system check
- Check logs for debugging information
## 🆘 Getting Help
If you encounter issues:
1. Run diagnostics:
```bash
python3 tools/diagnose.py
```
2. Check the logs in the terminal output
3. Verify your audio format and sample rate
4. Review the troubleshooting section above