Add dual GPU support with web UI selector

Features:
- Built custom ROCm container for AMD RX 6800 GPU
- Added GPU selection toggle in web UI (NVIDIA/AMD)
- Unified model names across both GPUs for seamless switching
- Vision model always uses NVIDIA GPU (optimal performance)
- Text models (llama3.1, darkidol) can use either GPU
- Added /gpu-status and /gpu-select API endpoints
- Implemented GPU state persistence in memory/gpu_state.json

Technical details:
- Multi-stage Dockerfile.llamaswap-rocm with ROCm 6.2.4
- llama.cpp compiled with GGML_HIP=ON for gfx1030 (RX 6800)
- Proper GPU permissions without root (groups 187/989)
- AMD container on port 8091, NVIDIA on port 8090
- Updated bot/utils/llm.py with get_current_gpu_url() and get_vision_gpu_url()
- Modified bot/utils/image_handling.py to always use NVIDIA for vision
- Enhanced web UI with GPU selector button (blue=NVIDIA, red=AMD)

Files modified:
- docker-compose.yml (added llama-swap-amd service)
- bot/globals.py (added LLAMA_AMD_URL)
- bot/api.py (added GPU selection endpoints and helper function)
- bot/utils/llm.py (GPU routing for text models)
- bot/utils/image_handling.py (GPU routing for vision models)
- bot/static/index.html (GPU selector UI)
- llama-swap-rocm-config.yaml (unified model names)

New files:
- Dockerfile.llamaswap-rocm
- bot/memory/gpu_state.json
- bot/utils/gpu_router.py (load balancing utility)
- setup-dual-gpu.sh (setup verification script)
- DUAL_GPU_*.md (documentation files)
Commit 1fc3d74a5b (parent ed5994ec78), 2026-01-09 00:03:59 +02:00
21 changed files with 2836 additions and 13 deletions

New file: DUAL_GPU_SETUP.md (321 lines)

# Dual GPU Setup - NVIDIA + AMD RX 6800
This document describes the dual-GPU configuration for running two llama-swap instances simultaneously:
- **Primary GPU (NVIDIA)**: Runs main models via CUDA
- **Secondary GPU (AMD RX 6800)**: Runs additional models via ROCm
## Architecture Overview
```
┌────────────────────────────────────────────────────────────┐
│                          Miku Bot                          │
│                                                            │
│  LLAMA_URL=http://llama-swap:8080          (NVIDIA)        │
│  LLAMA_AMD_URL=http://llama-swap-amd:8080  (AMD RX 6800)   │
└────────────────────────────────────────────────────────────┘
              │                             │
              │                             │
              ▼                             ▼
    ┌──────────────────┐          ┌──────────────────┐
    │    llama-swap    │          │  llama-swap-amd  │
    │      (CUDA)      │          │      (ROCm)      │
    │    Port: 8090    │          │    Port: 8091    │
    └──────────────────┘          └──────────────────┘
              │                             │
              ▼                             ▼
    ┌──────────────────┐          ┌──────────────────┐
    │    NVIDIA GPU    │          │   AMD RX 6800    │
    │  - llama3.1      │          │  - llama3.1-amd  │
    │  - darkidol      │          │  - darkidol-amd  │
    │  - vision        │          │  - moondream-amd │
    └──────────────────┘          └──────────────────┘
```
## Files Created
1. **Dockerfile.llamaswap-rocm** - ROCm-enabled Docker image for AMD GPU
2. **llama-swap-rocm-config.yaml** - Model configuration for AMD models
3. **docker-compose.yml** - Updated with `llama-swap-amd` service
## Configuration Details
### llama-swap-amd Service
```yaml
llama-swap-amd:
  build:
    context: .
    dockerfile: Dockerfile.llamaswap-rocm
  container_name: llama-swap-amd
  ports:
    - "8091:8080"  # External access on port 8091
  volumes:
    - ./models:/models
    - ./llama-swap-rocm-config.yaml:/app/config.yaml
  devices:
    - /dev/kfd:/dev/kfd  # AMD GPU kernel driver
    - /dev/dri:/dev/dri  # Direct Rendering Infrastructure
  group_add:
    - video
    - render
  environment:
    - HSA_OVERRIDE_GFX_VERSION=10.3.0  # RX 6800 (Navi 21) compatibility
```
### Available Models on AMD GPU
From `llama-swap-rocm-config.yaml`:
- **llama3.1-amd** - Llama 3.1 8B text model
- **darkidol-amd** - DarkIdol uncensored model
- **moondream-amd** - Moondream2 vision model (smaller, AMD-optimized)
### Model Aliases
You can access AMD models using these aliases:
- `llama3.1-amd`, `text-model-amd`, `amd-text`
- `darkidol-amd`, `evil-model-amd`, `uncensored-amd`
- `moondream-amd`, `vision-amd`, `moondream`
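Any of these aliases can be passed as the `model` field of a request. A minimal sketch against the AMD instance on its external port 8091 (the prompt and `max_tokens` value are arbitrary):
```python
import requests

# Any alias from the list above resolves to the same underlying model,
# e.g. "amd-text" routes to llama3.1-amd on the AMD instance.
response = requests.post(
    "http://localhost:8091/v1/chat/completions",
    json={
        "model": "amd-text",  # alias for llama3.1-amd
        "messages": [{"role": "user", "content": "Hello from the AMD GPU!"}],
        "max_tokens": 50,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```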
## Usage
### Building and Starting Services
```bash
# Build the AMD ROCm container
docker compose build llama-swap-amd
# Start both GPU services
docker compose up -d llama-swap llama-swap-amd
# Check logs
docker compose logs -f llama-swap-amd
```
### Accessing AMD Models from Bot Code
In your bot code, you can now use either endpoint:
```python
import requests

import globals

# Use NVIDIA GPU (primary)
nvidia_response = requests.post(
    f"{globals.LLAMA_URL}/v1/chat/completions",
    json={"model": "llama3.1", ...},
)

# Use AMD GPU (secondary)
amd_response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={"model": "llama3.1-amd", ...},
)
```
### Load Balancing Strategy
You can implement load balancing in several ways:
1. **Round-robin**: Alternate between GPUs for text generation
2. **Task-specific**:
- NVIDIA: Primary text + MiniCPM vision (heavy)
- AMD: Secondary text + Moondream vision (lighter)
3. **Failover**: Use AMD as backup if NVIDIA is busy
Example load balancing function:
```python
import random

import globals

def get_llama_url(prefer_amd=False):
    """Get a llama-swap URL, with optional load balancing."""
    if prefer_amd:
        return globals.LLAMA_AMD_URL
    # Random load balancing for text models
    return random.choice([globals.LLAMA_URL, globals.LLAMA_AMD_URL])
```
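For the failover strategy (option 3 above), a hedged sketch that probes the `/health` endpoint used in the Testing section below; the function name and the 2-second timeout are illustrative choices, not existing bot code:
```python
import requests

import globals

def get_llama_url_with_failover():
    """Prefer the NVIDIA instance; fall back to AMD if it is unreachable."""
    try:
        # Both llama-swap instances expose /health (see Testing below).
        if requests.get(f"{globals.LLAMA_URL}/health", timeout=2).ok:
            return globals.LLAMA_URL
    except requests.RequestException:
        pass
    return globals.LLAMA_AMD_URL
```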
## Testing
### Test NVIDIA GPU (Port 8090)
```bash
curl http://localhost:8090/health
curl http://localhost:8090/v1/models
```
### Test AMD GPU (Port 8091)
```bash
curl http://localhost:8091/health
curl http://localhost:8091/v1/models
```
### Test Model Loading (AMD)
```bash
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello from AMD GPU!"}],
    "max_tokens": 50
  }'
```
## Monitoring
### Check GPU Usage
**AMD GPU:**
```bash
# ROCm monitoring
rocm-smi
# Or from host
watch -n 1 rocm-smi
```
**NVIDIA GPU:**
```bash
nvidia-smi
watch -n 1 nvidia-smi
```
### Check Container Resource Usage
```bash
docker stats llama-swap llama-swap-amd
```
## Troubleshooting
### AMD GPU Not Detected
1. Verify ROCm is installed on host:
```bash
rocm-smi --version
```
2. Check device permissions:
```bash
ls -l /dev/kfd /dev/dri
```
3. Verify RX 6800 compatibility:
```bash
rocminfo | grep "Name:"
```
### Model Loading Issues
If models fail to load on AMD:
1. Check VRAM availability:
```bash
rocm-smi --showmeminfo vram
```
2. Adjust `-ngl` (GPU layers) in config if needed:
```yaml
# Reduce GPU layers for smaller VRAM
cmd: /app/llama-server ... -ngl 50 ... # Instead of 99
```
3. Check container logs:
```bash
docker compose logs llama-swap-amd
```
### GFX Version Mismatch
RX 6800 is Navi 21 (gfx1030). If you see GFX errors:
```bash
# Set in docker-compose.yml environment:
HSA_OVERRIDE_GFX_VERSION=10.3.0
```
### llama-swap Build Issues
If the ROCm container fails to build:
1. Note that the Dockerfile builds llama-swap from source, so the build can take a while and is the most likely step to fail
2. As an alternative, use a pre-built llama-swap binary or a simpler proxy setup
3. Rebuild with full logs: `docker compose build --no-cache llama-swap-amd`
## Performance Considerations
### Memory Usage
- **RX 6800**: 16GB VRAM
- Q4_K_M/Q4_K_XL models: ~5-6GB each
- Can run 2 models simultaneously or 1 with long context (two ~6GB models leave roughly 4GB of the 16GB for KV cache/context)
### Model Selection
**Best for AMD RX 6800:**
- ✅ Q4_K_M/Q4_K_S quantized models (5-6GB)
- ✅ Moondream2 vision (smaller, efficient)
- ⚠️ MiniCPM-V-4.5 (possible but may be tight on VRAM)
### TTL Configuration
Adjust model TTL in `llama-swap-rocm-config.yaml`:
- Lower TTL = more aggressive unloading = more VRAM available
- Higher TTL = less model swapping = faster response times
## Advanced: Model-Specific Routing
Create a helper function to route models automatically:
```python
# bot/utils/gpu_router.py
import globals

MODEL_TO_GPU = {
    # NVIDIA models
    "llama3.1": globals.LLAMA_URL,
    "darkidol": globals.LLAMA_URL,
    "vision": globals.LLAMA_URL,
    # AMD models
    "llama3.1-amd": globals.LLAMA_AMD_URL,
    "darkidol-amd": globals.LLAMA_AMD_URL,
    "moondream-amd": globals.LLAMA_AMD_URL,
}

def get_endpoint_for_model(model_name):
    """Get the correct llama-swap endpoint for a model."""
    return MODEL_TO_GPU.get(model_name, globals.LLAMA_URL)

def is_amd_model(model_name):
    """Check if a model runs on the AMD GPU."""
    return model_name.endswith("-amd")
```
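A caller such as `bot/utils/llm.py` could then resolve the endpoint per request. This is an illustrative sketch only; the `chat()` helper and its import path are assumptions, not part of the actual bot code:
```python
import requests

# Import path assumes bot/ is on sys.path, matching `import globals` above.
from utils.gpu_router import get_endpoint_for_model

def chat(model_name, messages):
    """Send a chat completion to whichever GPU hosts the given model."""
    base_url = get_endpoint_for_model(model_name)
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={"model": model_name, "messages": messages},
        timeout=300,
    )
    return resp.json()["choices"][0]["message"]["content"]
```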
## Environment Variables
Add these to control GPU selection:
```yaml
# In docker-compose.yml
environment:
  - LLAMA_URL=http://llama-swap:8080
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
  - PREFER_AMD_GPU=false      # Set to true to prefer AMD for general tasks
  - AMD_MODELS_ENABLED=true   # Enable/disable AMD models
```
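On the Python side, these flags could be honoured with something like the sketch below; the `_env_flag()` helper and `default_text_url()` are illustrative names, not existing bot functions:
```python
import os

import globals

def _env_flag(name, default="false"):
    """Interpret an environment variable as a boolean flag."""
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes")

PREFER_AMD_GPU = _env_flag("PREFER_AMD_GPU")
AMD_MODELS_ENABLED = _env_flag("AMD_MODELS_ENABLED", default="true")

def default_text_url():
    """Pick the default text-model endpoint from the flags above."""
    if AMD_MODELS_ENABLED and PREFER_AMD_GPU:
        return globals.LLAMA_AMD_URL
    return globals.LLAMA_URL
```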
## Future Enhancements
1. **Automatic load balancing**: Monitor GPU utilization and route requests
2. **Health checks**: Fallback to primary GPU if AMD fails
3. **Model distribution**: Automatically assign models to GPUs based on VRAM
4. **Performance metrics**: Track response times per GPU
5. **Dynamic routing**: Use least-busy GPU for new requests
## References
- [ROCm Documentation](https://rocmdocs.amd.com/)
- [llama.cpp ROCm Support](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#rocm)
- [llama-swap GitHub](https://github.com/mostlygeek/llama-swap)
- [AMD GPU Compatibility Matrix](https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html)