Add dual GPU support with web UI selector

Features:
- Built custom ROCm container for AMD RX 6800 GPU
- Added GPU selection toggle in web UI (NVIDIA/AMD)
- Unified model names across both GPUs for seamless switching
- Vision model always uses NVIDIA GPU (optimal performance)
- Text models (llama3.1, darkidol) can use either GPU
- Added /gpu-status and /gpu-select API endpoints
- Implemented GPU state persistence in memory/gpu_state.json

Technical details:
- Multi-stage Dockerfile.llamaswap-rocm with ROCm 6.2.4
- llama.cpp compiled with GGML_HIP=ON for gfx1030 (RX 6800)
- Proper GPU permissions without root (groups 187/989)
- AMD container on port 8091, NVIDIA on port 8090
- Updated bot/utils/llm.py with get_current_gpu_url() and get_vision_gpu_url()
- Modified bot/utils/image_handling.py to always use NVIDIA for vision
- Enhanced web UI with GPU selector button (blue=NVIDIA, red=AMD)

Files modified:
- docker-compose.yml (added llama-swap-amd service)
- bot/globals.py (added LLAMA_AMD_URL)
- bot/api.py (added GPU selection endpoints and helper function)
- bot/utils/llm.py (GPU routing for text models)
- bot/utils/image_handling.py (GPU routing for vision models)
- bot/static/index.html (GPU selector UI)
- llama-swap-rocm-config.yaml (unified model names)

New files:
- Dockerfile.llamaswap-rocm
- bot/memory/gpu_state.json
- bot/utils/gpu_router.py (load balancing utility)
- setup-dual-gpu.sh (setup verification script)
- DUAL_GPU_*.md (documentation files)
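
Illustration of the routing rule described above (persisted GPU selection for text models, NVIDIA always for vision). This is a minimal sketch and not the code in bot/utils/llm.py: the function names, the `current_gpu` JSON key, and the fallback to NVIDIA on a missing or unreadable state file are all assumptions made for the example.

```python
import json
from pathlib import Path

# Internal container URLs as listed in the quick reference.
GPU_URLS = {
    "nvidia": "http://llama-swap:8080",
    "amd": "http://llama-swap-amd:8080",
}


def load_selected_gpu(state_file: Path, default: str = "nvidia") -> str:
    """Read the persisted selection; fall back to `default` if it is missing or invalid."""
    try:
        state = json.loads(state_file.read_text())
    except (OSError, ValueError):  # file absent, unreadable, or not valid JSON
        return default
    # "current_gpu" is a hypothetical key; the real schema is not shown in the commit.
    gpu = state.get("current_gpu") if isinstance(state, dict) else None
    return gpu if gpu in GPU_URLS else default


def save_selected_gpu(state_file: Path, gpu: str) -> None:
    """Persist the selection so it survives a bot restart."""
    if gpu not in GPU_URLS:
        raise ValueError(f"unknown GPU: {gpu!r}")
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps({"current_gpu": gpu}))


def text_model_url(state_file: Path) -> str:
    """Text models follow the persisted selection."""
    return GPU_URLS[load_selected_gpu(state_file)]


def vision_model_url() -> str:
    """Vision requests ignore the selection and always go to NVIDIA."""
    return GPU_URLS["nvidia"]
```

Falling back to NVIDIA when the state file is absent keeps a fresh checkout working before anyone has used the selector.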
2026-01-09 00:03:59 +02:00
parent ed5994ec78
commit 1fc3d74a5b
21 changed files with 2836 additions and 13 deletions
--- a/DUAL_GPU_QUICK_REF.md
+++ b/DUAL_GPU_QUICK_REF.md
@@ -0,0 +1,194 @@
+# Dual GPU Quick Reference
+
+## Quick Start
+
+```bash
+# 1. Run setup check
+./setup-dual-gpu.sh
+
+# 2. Build AMD container
+docker compose build llama-swap-amd
+
+# 3. Start both GPU containers
+docker compose up -d llama-swap llama-swap-amd
+
+# 4. Verify
+curl http://localhost:8090/health  # NVIDIA
+curl http://localhost:8091/health  # AMD RX 6800
+```
+
+## Endpoints
+
+| GPU | Container | Port | Internal URL |
+|-----|-----------|------|--------------|
+| NVIDIA | llama-swap | 8090 | http://llama-swap:8080 |
+| AMD RX 6800 | llama-swap-amd | 8091 | http://llama-swap-amd:8080 |
+
+## Models
+
+### NVIDIA GPU (Primary)
+- `llama3.1` - Llama 3.1 8B Instruct
+- `darkidol` - DarkIdol Uncensored 8B
+- `vision` - MiniCPM-V-4.5 (4K context)
+
+### AMD RX 6800 (Secondary)
+- `llama3.1-amd` - Llama 3.1 8B Instruct
+- `darkidol-amd` - DarkIdol Uncensored 8B
+- `moondream-amd` - Moondream2 Vision (2K context)
+
+## Commands
+
+### Start/Stop
+```bash
+# Start both
+docker compose up -d llama-swap llama-swap-amd
+
+# Start only AMD
+docker compose up -d llama-swap-amd
+
+# Stop AMD
+docker compose stop llama-swap-amd
+
+# Restart AMD with logs
+docker compose restart llama-swap-amd && docker compose logs -f llama-swap-amd
+```
+
+### Monitoring
+```bash
+# Container status
+docker compose ps
+
+# Logs
+docker compose logs -f llama-swap-amd
+
+# GPU usage
+watch -n 1 nvidia-smi  # NVIDIA
+watch -n 1 rocm-smi    # AMD
+
+# Resource usage
+docker stats llama-swap llama-swap-amd
+```
+
+### Testing
+```bash
+# List available models
+curl http://localhost:8091/v1/models | jq
+
+# Test text generation (AMD)
+curl -X POST http://localhost:8091/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama3.1-amd",
+    "messages": [{"role": "user", "content": "Say hello!"}],
+    "max_tokens": 20
+  }' | jq
+
+# Test vision model (AMD)
+curl -X POST http://localhost:8091/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "moondream-amd",
+    "messages": [{
+      "role": "user",
+      "content": [
+        {"type": "text", "text": "Describe this image"},
+        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
+      ]
+    }],
+    "max_tokens": 100
+  }' | jq
+```
+
+## Bot Integration
+
+### Using GPU Router
+```python
+from bot.utils.gpu_router import get_llama_url_with_load_balancing, get_endpoint_for_model
+
+# Load balanced text generation
+url, model = get_llama_url_with_load_balancing(task_type="text")
+
+# Specific model
+url = get_endpoint_for_model("darkidol-amd")
+
+# Vision on AMD
+url, model = get_llama_url_with_load_balancing(task_type="vision", prefer_amd=True)
+```
+
+### Direct Access
+```python
+import globals
+
+# AMD GPU
+amd_url = globals.LLAMA_AMD_URL  # http://llama-swap-amd:8080
+
+# NVIDIA GPU
+nvidia_url = globals.LLAMA_URL   # http://llama-swap:8080
+```
+
+## Troubleshooting
+
+### AMD Container Won't Start
+```bash
+# Check ROCm
+rocm-smi
+
+# Check permissions
+ls -l /dev/kfd /dev/dri
+
+# Check logs
+docker compose logs llama-swap-amd
+
+# Rebuild
+docker compose build --no-cache llama-swap-amd
+```
+
+### Model Won't Load
+```bash
+# Check VRAM
+rocm-smi --showmeminfo vram
+
+# Lower GPU layers in llama-swap-rocm-config.yaml
+# Change: -ngl 99
+# To:     -ngl 50
+```
+
+### GFX Version Error
+```bash
+# The RX 6800 is gfx1030 (GFX version 10.3.0).
+# Ensure this variable is set in docker-compose.yml:
+HSA_OVERRIDE_GFX_VERSION=10.3.0
+```
+
+## Environment Variables
+
+Add these to `docker-compose.yml` under the `miku-bot` service:
+
+```yaml
+environment:
+  - PREFER_AMD_GPU=true          # Prefer AMD for load balancing
+  - AMD_MODELS_ENABLED=true      # Enable AMD models
+  - LLAMA_AMD_URL=http://llama-swap-amd:8080
+```
+
+## Files
+
+- `Dockerfile.llamaswap-rocm` - ROCm container
+- `llama-swap-rocm-config.yaml` - AMD model config
+- `bot/utils/gpu_router.py` - Load balancing utility
+- `DUAL_GPU_SETUP.md` - Full documentation
+- `setup-dual-gpu.sh` - Setup verification script
+
+## Performance Tips
+
+1. **Model Selection**: Q4_K quantization gives a good balance of size and quality
+2. **VRAM**: The RX 6800 has 16 GB, enough for roughly 2-3 8B Q4 models at once
+3. **TTL**: Adjust the model unload timeout in the config files (default 1800 s = 30 min)
+4. **Context**: Lower the context size (e.g. `-c 8192`) to save VRAM
+5. **GPU Layers**: `-ngl 99` offloads all layers to the GPU; lower it if the model does not fit in VRAM
+
+## Support
+
+- ROCm Docs: https://rocmdocs.amd.com/
+- llama.cpp: https://github.com/ggml-org/llama.cpp
+- llama-swap: https://github.com/mostlygeek/llama-swap
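
Note on the Testing commands in DUAL_GPU_QUICK_REF.md: the same checks can be scripted. The following is a minimal sketch using only the Python standard library, assuming the host ports from the Endpoints table (8090 for NVIDIA, 8091 for AMD) and the `/health` and `/v1/chat/completions` paths used by the curl examples. The helper names `build_chat_request` and `is_healthy` are hypothetical and do not exist in the repository.

```python
import json
import urllib.request

# Host-side ports from the Endpoints table in the quick reference.
ENDPOINTS = {
    "nvidia": "http://localhost:8090",
    "amd": "http://localhost:8091",
}


def build_chat_request(base_url, model, prompt, max_tokens=20):
    """Build (but do not send) the same POST that the curl text test issues."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_healthy(base_url, timeout=3.0):
    """Return True if GET /health answers HTTP 200, False on connection errors."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        print(f"{name}: {'up' if is_healthy(url) else 'down'}")
```

To send a request, pass the result of `build_chat_request(ENDPOINTS["amd"], "llama3.1-amd", "Say hello!")` to `urllib.request.urlopen` and decode the JSON body of the response.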