Add dual GPU support with web UI selector

Features:
- Built custom ROCm container for AMD RX 6800 GPU
- Added GPU selection toggle in web UI (NVIDIA/AMD)
- Unified model names across both GPUs for seamless switching
- Vision model always uses NVIDIA GPU (optimal performance)
- Text models (llama3.1, darkidol) can use either GPU
- Added /gpu-status and /gpu-select API endpoints
- Implemented GPU state persistence in memory/gpu_state.json

Technical details:
- Multi-stage Dockerfile.llamaswap-rocm with ROCm 6.2.4
- llama.cpp compiled with GGML_HIP=ON for gfx1030 (RX 6800)
- Proper GPU permissions without root (groups 187/989)
- AMD container on port 8091, NVIDIA on port 8090
- Updated bot/utils/llm.py with get_current_gpu_url() and get_vision_gpu_url()
- Modified bot/utils/image_handling.py to always use NVIDIA for vision
- Enhanced web UI with GPU selector button (blue=NVIDIA, red=AMD)

Files modified:
- docker-compose.yml (added llama-swap-amd service)
- bot/globals.py (added LLAMA_AMD_URL)
- bot/api.py (added GPU selection endpoints and helper function)
- bot/utils/llm.py (GPU routing for text models)
- bot/utils/image_handling.py (GPU routing for vision models)
- bot/static/index.html (GPU selector UI)
- llama-swap-rocm-config.yaml (unified model names)

New files:
- Dockerfile.llamaswap-rocm
- bot/memory/gpu_state.json
- bot/utils/gpu_router.py (load balancing utility)
- setup-dual-gpu.sh (setup verification script)
- DUAL_GPU_*.md (documentation files)
Commit 1fc3d74a5b (parent ed5994ec78), 2026-01-09 00:03:59 +02:00
21 changed files with 2836 additions and 13 deletions

New file: DUAL_GPU_SETUP.md (321 lines)

# Dual GPU Setup - NVIDIA + AMD RX 6800
This document describes the dual-GPU configuration for running two llama-swap instances simultaneously:
- **Primary GPU (NVIDIA)**: Runs main models via CUDA
- **Secondary GPU (AMD RX 6800)**: Runs additional models via ROCm
## Architecture Overview
```
┌────────────────────────────────────────────────────────────┐
│                          Miku Bot                          │
│                                                            │
│  LLAMA_URL=http://llama-swap:8080          (NVIDIA)        │
│  LLAMA_AMD_URL=http://llama-swap-amd:8080  (AMD RX 6800)   │
└────────────────────────────────────────────────────────────┘
              │                             │
              │                             │
              ▼                             ▼
    ┌──────────────────┐          ┌──────────────────┐
    │    llama-swap    │          │  llama-swap-amd  │
    │      (CUDA)      │          │      (ROCm)      │
    │    Port: 8090    │          │    Port: 8091    │
    └──────────────────┘          └──────────────────┘
              │                             │
              ▼                             ▼
    ┌──────────────────┐          ┌──────────────────┐
    │    NVIDIA GPU    │          │   AMD RX 6800    │
    │  - llama3.1      │          │  - llama3.1-amd  │
    │  - darkidol      │          │  - darkidol-amd  │
    │  - vision        │          │  - moondream-amd │
    └──────────────────┘          └──────────────────┘
```
## Files Created
1. **Dockerfile.llamaswap-rocm** - ROCm-enabled Docker image for AMD GPU
2. **llama-swap-rocm-config.yaml** - Model configuration for AMD models
3. **docker-compose.yml** - Updated with `llama-swap-amd` service
## Configuration Details
### llama-swap-amd Service
```yaml
llama-swap-amd:
  build:
    context: .
    dockerfile: Dockerfile.llamaswap-rocm
  container_name: llama-swap-amd
  ports:
    - "8091:8080"  # External access on port 8091
  volumes:
    - ./models:/models
    - ./llama-swap-rocm-config.yaml:/app/config.yaml
  devices:
    - /dev/kfd:/dev/kfd  # AMD GPU kernel driver
    - /dev/dri:/dev/dri  # Direct Rendering Infrastructure
  group_add:
    - video
    - render
  environment:
    - HSA_OVERRIDE_GFX_VERSION=10.3.0  # RX 6800 (Navi 21) compatibility
```
### Available Models on AMD GPU
From `llama-swap-rocm-config.yaml`:
- **llama3.1-amd** - Llama 3.1 8B text model
- **darkidol-amd** - DarkIdol uncensored model
- **moondream-amd** - Moondream2 vision model (smaller, AMD-optimized)
### Model Aliases
You can access AMD models using these aliases:
- `llama3.1-amd`, `text-model-amd`, `amd-text`
- `darkidol-amd`, `evil-model-amd`, `uncensored-amd`
- `moondream-amd`, `vision-amd`, `moondream`
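Any of these aliases can be passed as the `model` field of a request. A minimal sketch against the AMD instance on its external port 8091 (the prompt and `max_tokens` value are arbitrary):
```python
import requests

# Any alias from the list above resolves to the same underlying model,
# e.g. "amd-text" routes to llama3.1-amd on the AMD instance.
response = requests.post(
    "http://localhost:8091/v1/chat/completions",
    json={
        "model": "amd-text",  # alias for llama3.1-amd
        "messages": [{"role": "user", "content": "Hello from the AMD GPU!"}],
        "max_tokens": 50,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```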
## Usage
### Building and Starting Services
```bash
# Build the AMD ROCm container
docker compose build llama-swap-amd
# Start both GPU services
docker compose up -d llama-swap llama-swap-amd
# Check logs
docker compose logs -f llama-swap-amd
```
### Accessing AMD Models from Bot Code
In your bot code, you can now use either endpoint:
```python
import requests

import globals

# Use NVIDIA GPU (primary)
nvidia_response = requests.post(
    f"{globals.LLAMA_URL}/v1/chat/completions",
    json={"model": "llama3.1", ...},
)

# Use AMD GPU (secondary)
amd_response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={"model": "llama3.1-amd", ...},
)
```
### Load Balancing Strategy
You can implement load balancing in several ways:
1. **Round-robin**: Alternate between GPUs for text generation
2. **Task-specific**:
- NVIDIA: Primary text + MiniCPM vision (heavy)
- AMD: Secondary text + Moondream vision (lighter)
3. **Failover**: Use AMD as backup if NVIDIA is busy
Example load balancing function:
```python
import random

import globals

def get_llama_url(prefer_amd=False):
    """Get a llama-swap URL, with optional load balancing."""
    if prefer_amd:
        return globals.LLAMA_AMD_URL
    # Random load balancing for text models
    return random.choice([globals.LLAMA_URL, globals.LLAMA_AMD_URL])
```
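For the failover strategy (option 3 above), a hedged sketch that probes the `/health` endpoint used in the Testing section below; the function name and the 2-second timeout are illustrative choices, not existing bot code:
```python
import requests

import globals

def get_llama_url_with_failover():
    """Prefer the NVIDIA instance; fall back to AMD if it is unreachable."""
    try:
        # Both llama-swap instances expose /health (see Testing below).
        if requests.get(f"{globals.LLAMA_URL}/health", timeout=2).ok:
            return globals.LLAMA_URL
    except requests.RequestException:
        pass
    return globals.LLAMA_AMD_URL
```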
## Testing
### Test NVIDIA GPU (Port 8090)
```bash
curl http://localhost:8090/health
curl http://localhost:8090/v1/models
```
### Test AMD GPU (Port 8091)
```bash
curl http://localhost:8091/health
curl http://localhost:8091/v1/models
```
### Test Model Loading (AMD)
```bash
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello from AMD GPU!"}],
    "max_tokens": 50
  }'
```
## Monitoring
### Check GPU Usage
**AMD GPU:**
```bash
# ROCm monitoring
rocm-smi
# Or from host
watch -n 1 rocm-smi
```
**NVIDIA GPU:**
```bash
nvidia-smi
watch -n 1 nvidia-smi
```
### Check Container Resource Usage
```bash
docker stats llama-swap llama-swap-amd
```
## Troubleshooting
### AMD GPU Not Detected
1. Verify ROCm is installed on host:
```bash
rocm-smi --version
```
2. Check device permissions:
```bash
ls -l /dev/kfd /dev/dri
```
3. Verify RX 6800 compatibility:
```bash
rocminfo | grep "Name:"
```
### Model Loading Issues
If models fail to load on AMD:
1. Check VRAM availability:
```bash
rocm-smi --showmeminfo vram
```
2. Adjust `-ngl` (GPU layers) in config if needed:
```yaml
# Reduce GPU layers for smaller VRAM
cmd: /app/llama-server ... -ngl 50 ... # Instead of 99
```
3. Check container logs:
```bash
docker compose logs llama-swap-amd
```
### GFX Version Mismatch
RX 6800 is Navi 21 (gfx1030). If you see GFX errors:
```bash
# Set in docker-compose.yml environment:
HSA_OVERRIDE_GFX_VERSION=10.3.0
```
### llama-swap Build Issues
If the ROCm container fails to build:
1. Note that the Dockerfile builds llama-swap from source, so the build can take a while and is the most likely step to fail
2. As an alternative, use a pre-built llama-swap binary or a simpler proxy setup
3. Rebuild with full logs: `docker compose build --no-cache llama-swap-amd`
## Performance Considerations
### Memory Usage
- **RX 6800**: 16GB VRAM
- Q4_K_M/Q4_K_XL models: ~5-6GB each
- Can run 2 models simultaneously or 1 with long context (two ~6GB models leave roughly 4GB of the 16GB for KV cache/context)
### Model Selection
**Best for AMD RX 6800:**
- ✅ Q4_K_M/Q4_K_S quantized models (5-6GB)
- ✅ Moondream2 vision (smaller, efficient)
- ⚠️ MiniCPM-V-4.5 (possible but may be tight on VRAM)
### TTL Configuration
Adjust model TTL in `llama-swap-rocm-config.yaml`:
- Lower TTL = more aggressive unloading = more VRAM available
- Higher TTL = less model swapping = faster response times
## Advanced: Model-Specific Routing
Create a helper function to route models automatically:
```python
# bot/utils/gpu_router.py
import globals

MODEL_TO_GPU = {
    # NVIDIA models
    "llama3.1": globals.LLAMA_URL,
    "darkidol": globals.LLAMA_URL,
    "vision": globals.LLAMA_URL,
    # AMD models
    "llama3.1-amd": globals.LLAMA_AMD_URL,
    "darkidol-amd": globals.LLAMA_AMD_URL,
    "moondream-amd": globals.LLAMA_AMD_URL,
}

def get_endpoint_for_model(model_name):
    """Get the correct llama-swap endpoint for a model."""
    return MODEL_TO_GPU.get(model_name, globals.LLAMA_URL)

def is_amd_model(model_name):
    """Check if a model runs on the AMD GPU."""
    return model_name.endswith("-amd")
```
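A caller such as `bot/utils/llm.py` could then resolve the endpoint per request. This is an illustrative sketch only; the `chat()` helper and its import path are assumptions, not part of the actual bot code:
```python
import requests

# Import path assumes bot/ is on sys.path, matching `import globals` above.
from utils.gpu_router import get_endpoint_for_model

def chat(model_name, messages):
    """Send a chat completion to whichever GPU hosts the given model."""
    base_url = get_endpoint_for_model(model_name)
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={"model": model_name, "messages": messages},
        timeout=300,
    )
    return resp.json()["choices"][0]["message"]["content"]
```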
## Environment Variables
Add these to control GPU selection:
```yaml
# In docker-compose.yml
environment:
  - LLAMA_URL=http://llama-swap:8080
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
  - PREFER_AMD_GPU=false      # Set to true to prefer AMD for general tasks
  - AMD_MODELS_ENABLED=true   # Enable/disable AMD models
```
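On the Python side, these flags could be honoured with something like the sketch below; the `_env_flag()` helper and `default_text_url()` are illustrative names, not existing bot functions:
```python
import os

import globals

def _env_flag(name, default="false"):
    """Interpret an environment variable as a boolean flag."""
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes")

PREFER_AMD_GPU = _env_flag("PREFER_AMD_GPU")
AMD_MODELS_ENABLED = _env_flag("AMD_MODELS_ENABLED", default="true")

def default_text_url():
    """Pick the default text-model endpoint from the flags above."""
    if AMD_MODELS_ENABLED and PREFER_AMD_GPU:
        return globals.LLAMA_AMD_URL
    return globals.LLAMA_URL
```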
## Future Enhancements
1. **Automatic load balancing**: Monitor GPU utilization and route requests
2. **Health checks**: Fallback to primary GPU if AMD fails
3. **Model distribution**: Automatically assign models to GPUs based on VRAM
4. **Performance metrics**: Track response times per GPU
5. **Dynamic routing**: Use least-busy GPU for new requests
## References
- [ROCm Documentation](https://rocmdocs.amd.com/)
- [llama.cpp ROCm Support](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#rocm)
- [llama-swap GitHub](https://github.com/mostlygeek/llama-swap)
- [AMD GPU Compatibility Matrix](https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html)