# Dual GPU Quick Reference

## Quick Start

```bash
# 1. Run setup check
./setup-dual-gpu.sh

# 2. Build AMD container
docker compose build llama-swap-amd

# 3. Start both GPUs
docker compose up -d llama-swap llama-swap-amd

# 4. Verify
curl http://localhost:8090/health   # NVIDIA
curl http://localhost:8091/health   # AMD RX 6800
```

## Endpoints

| GPU | Container | Port | Internal URL |
|-----|-----------|------|--------------|
| NVIDIA | llama-swap | 8090 | http://llama-swap:8080 |
| AMD RX 6800 | llama-swap-amd | 8091 | http://llama-swap-amd:8080 |

## Models

### NVIDIA GPU (Primary)

- `llama3.1` - Llama 3.1 8B Instruct
- `darkidol` - DarkIdol Uncensored 8B
- `vision` - MiniCPM-V-4.5 (4K context)

### AMD RX 6800 (Secondary)

- `llama3.1-amd` - Llama 3.1 8B Instruct
- `darkidol-amd` - DarkIdol Uncensored 8B
- `moondream-amd` - Moondream2 Vision (2K context)

## Commands

### Start/Stop

```bash
# Start both
docker compose up -d llama-swap llama-swap-amd

# Start only AMD
docker compose up -d llama-swap-amd

# Stop AMD
docker compose stop llama-swap-amd

# Restart AMD with logs
docker compose restart llama-swap-amd && docker compose logs -f llama-swap-amd
```

### Monitoring

```bash
# Container status
docker compose ps

# Logs
docker compose logs -f llama-swap-amd

# GPU usage
watch -n 1 nvidia-smi   # NVIDIA
watch -n 1 rocm-smi     # AMD

# Resource usage
docker stats llama-swap llama-swap-amd
```

### Testing

```bash
# List available models
curl http://localhost:8091/v1/models | jq

# Test text generation (AMD)
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 20
  }' | jq

# Test vision model (AMD)
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moondream-amd",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }],
    "max_tokens": 100
  }' | jq
```

## Bot Integration

### Using GPU Router

```python
from bot.utils.gpu_router import get_llama_url_with_load_balancing, get_endpoint_for_model

# Load balanced text generation
url, model = get_llama_url_with_load_balancing(task_type="text")

# Specific model
url = get_endpoint_for_model("darkidol-amd")

# Vision on AMD
url, model = get_llama_url_with_load_balancing(task_type="vision", prefer_amd=True)
```

### Direct Access

```python
import globals

# AMD GPU
amd_url = globals.LLAMA_AMD_URL  # http://llama-swap-amd:8080

# NVIDIA GPU
nvidia_url = globals.LLAMA_URL  # http://llama-swap:8080
```
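If you want the routing idea without importing the bot helpers, the sketch below shows one way to choose between the two internal URLs from the Endpoints table using a simple health-check fallback. This is a minimal illustration, not the actual `gpu_router.py` implementation; `pick_endpoint` and `_healthy` are hypothetical names, and `PREFER_AMD_GPU` is the variable described under Environment Variables below.

```python
import os
import urllib.request
from typing import Optional

# Internal URLs from the Endpoints table; adjust if your service names differ.
ENDPOINTS = {
    "amd": "http://llama-swap-amd:8080",
    "nvidia": "http://llama-swap:8080",
}


def _healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if llama-swap's /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def pick_endpoint(prefer_amd: Optional[bool] = None) -> str:
    """Return the preferred healthy endpoint, falling back to the other GPU."""
    if prefer_amd is None:
        prefer_amd = os.getenv("PREFER_AMD_GPU", "false").lower() == "true"
    order = ("amd", "nvidia") if prefer_amd else ("nvidia", "amd")
    for name in order:
        if _healthy(ENDPOINTS[name]):
            return ENDPOINTS[name]
    raise RuntimeError("No llama-swap endpoint is healthy")
```

Inside the bot, prefer the `gpu_router` helpers shown above; this standalone version is mainly useful for scripts and debugging.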
## Troubleshooting

### AMD Container Won't Start

```bash
# Check ROCm
rocm-smi

# Check permissions
ls -l /dev/kfd /dev/dri

# Check logs
docker compose logs llama-swap-amd

# Rebuild
docker compose build --no-cache llama-swap-amd
```

### Model Won't Load

```bash
# Check VRAM
rocm-smi --showmeminfo vram

# Lower GPU layers in llama-swap-rocm-config.yaml
# Change: -ngl 99
# To:     -ngl 50
```

### GFX Version Error

```bash
# RX 6800 is gfx1030
# Ensure docker-compose.yml sets:
HSA_OVERRIDE_GFX_VERSION=10.3.0
```

## Environment Variables

Add to `docker-compose.yml` under the `miku-bot` service:

```yaml
environment:
  - PREFER_AMD_GPU=true        # Prefer AMD for load balancing
  - AMD_MODELS_ENABLED=true    # Enable AMD models
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
```

## Files

- `Dockerfile.llamaswap-rocm` - ROCm container
- `llama-swap-rocm-config.yaml` - AMD model config
- `bot/utils/gpu_router.py` - Load balancing utility
- `DUAL_GPU_SETUP.md` - Full documentation
- `setup-dual-gpu.sh` - Setup verification script

## Performance Tips

1. **Model Selection**: Use Q4_K quantization for the best size/quality balance
2. **VRAM**: The RX 6800 has 16 GB - enough to run 2-3 Q4 models
3. **TTL**: Adjust in the config files (default 1800 s = 30 min)
4. **Context**: Lower the context size (`-c 8192`) to save VRAM
5. **GPU Layers**: `-ngl 99` offloads all layers to the GPU; lower it if VRAM runs short

## Support

- ROCm Docs: https://rocmdocs.amd.com/
- llama.cpp: https://github.com/ggml-org/llama.cpp
- llama-swap: https://github.com/mostlygeek/llama-swap
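## Example: Vision Request from Python (AMD)

As a companion to the vision curl test in the Testing section, here is a small Python sketch that builds the base64 data URL from a local image and sends it to the AMD endpoint on port 8091. It uses only the standard library; the image path is a placeholder you will need to replace, and the request body mirrors the OpenAI-compatible payload shown above.

```python
import base64
import json
import urllib.request

IMAGE_PATH = "test.jpg"  # placeholder; point this at a real JPEG

# Encode the image as a base64 data URL, as expected by the chat completions API
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "moondream-amd",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 100,
}

req = urllib.request.Request(
    "http://localhost:8091/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```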