# Dual GPU Setup - NVIDIA + AMD RX 6800
This document describes the dual-GPU configuration for running two llama-swap instances simultaneously:

- **Primary GPU (NVIDIA)**: Runs main models via CUDA
- **Secondary GPU (AMD RX 6800)**: Runs additional models via ROCm
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│                          Miku Bot                            │
│                                                              │
│  LLAMA_URL=http://llama-swap:8080          (NVIDIA)          │
│  LLAMA_AMD_URL=http://llama-swap-amd:8080  (AMD RX 6800)     │
└─────────────────────────────────────────────────────────────┘
         │                                   │
         ▼                                   ▼
┌──────────────────┐                ┌──────────────────┐
│   llama-swap     │                │  llama-swap-amd  │
│      (CUDA)      │                │      (ROCm)      │
│   Port: 8090     │                │   Port: 8091     │
└──────────────────┘                └──────────────────┘
         │                                   │
         ▼                                   ▼
┌──────────────────┐                ┌──────────────────┐
│   NVIDIA GPU     │                │   AMD RX 6800    │
│  - llama3.1      │                │  - llama3.1-amd  │
│  - darkidol      │                │  - darkidol-amd  │
│  - vision        │                │  - moondream-amd │
└──────────────────┘                └──────────────────┘
```
## Files Created

- `Dockerfile.llamaswap-rocm` - ROCm-enabled Docker image for AMD GPU
- `llama-swap-rocm-config.yaml` - Model configuration for AMD models
- `docker-compose.yml` - Updated with the `llama-swap-amd` service
## Configuration Details

### llama-swap-amd Service
```yaml
llama-swap-amd:
  build:
    context: .
    dockerfile: Dockerfile.llamaswap-rocm
  container_name: llama-swap-amd
  ports:
    - "8091:8080"  # External access on port 8091
  volumes:
    - ./models:/models
    - ./llama-swap-rocm-config.yaml:/app/config.yaml
  devices:
    - /dev/kfd:/dev/kfd  # AMD GPU kernel driver
    - /dev/dri:/dev/dri  # Direct Rendering Infrastructure
  group_add:
    - video
    - render
  environment:
    - HSA_OVERRIDE_GFX_VERSION=10.3.0  # RX 6800 (Navi 21) compatibility
```
### Available Models on AMD GPU

From `llama-swap-rocm-config.yaml`:

- `llama3.1-amd` - Llama 3.1 8B text model
- `darkidol-amd` - DarkIdol uncensored model
- `moondream-amd` - Moondream2 vision model (smaller, AMD-optimized)
### Model Aliases

You can access AMD models using these aliases:

| Model | Aliases |
|-------|---------|
| `llama3.1-amd` | `text-model-amd`, `amd-text` |
| `darkidol-amd` | `evil-model-amd`, `uncensored-amd` |
| `moondream-amd` | `vision-amd`, `moondream` |
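Any alias can be passed directly as the `model` field of a request; a minimal sketch using `requests` (the prompt, `max_tokens`, and `timeout` values are illustrative):

```python
import requests

import globals

# "vision-amd" is an alias, so the AMD llama-swap instance serves it
# with the same underlying model as "moondream-amd"
response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={
        "model": "vision-amd",
        "messages": [{"role": "user", "content": "Hello from the AMD GPU!"}],
        "max_tokens": 50,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```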
## Usage

### Building and Starting Services

```bash
# Build the AMD ROCm container
docker compose build llama-swap-amd

# Start both GPU services
docker compose up -d llama-swap llama-swap-amd

# Check logs
docker compose logs -f llama-swap-amd
```
### Accessing AMD Models from Bot Code

In your bot code, you can now use either endpoint:

```python
import requests

import globals

# Use NVIDIA GPU (primary)
nvidia_response = requests.post(
    f"{globals.LLAMA_URL}/v1/chat/completions",
    json={"model": "llama3.1", ...}
)

# Use AMD GPU (secondary)
amd_response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={"model": "llama3.1-amd", ...}
)
```
### Load Balancing Strategy

You can implement load balancing by:

- **Round-robin**: Alternate between GPUs for text generation
- **Task-specific**:
  - NVIDIA: Primary text + MiniCPM vision (heavy)
  - AMD: Secondary text + Moondream vision (lighter)
- **Failover**: Use AMD as backup if NVIDIA is busy (sketched below)
Example load balancing function:

```python
import random

import globals


def get_llama_url(prefer_amd=False):
    """Get llama URL with optional load balancing"""
    if prefer_amd:
        return globals.LLAMA_AMD_URL
    # Random load balancing for text models
    return random.choice([globals.LLAMA_URL, globals.LLAMA_AMD_URL])
```
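The failover option can be sketched with a simple probe against the `/health` endpoint exercised with curl in the Testing section below; the function name and timeout here are illustrative, not part of the existing bot code:

```python
import requests

import globals


def get_llama_url_with_failover(timeout=2.0):
    """Prefer the NVIDIA instance; fall back to AMD if it is unreachable."""
    try:
        # Same /health endpoint used in the Testing section
        resp = requests.get(f"{globals.LLAMA_URL}/health", timeout=timeout)
        resp.raise_for_status()
        return globals.LLAMA_URL
    except requests.RequestException:
        # NVIDIA llama-swap is down or unresponsive; route to the AMD instance
        return globals.LLAMA_AMD_URL
```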
## Testing

### Test NVIDIA GPU (Port 8090)

```bash
curl http://localhost:8090/health
curl http://localhost:8090/v1/models
```

### Test AMD GPU (Port 8091)

```bash
curl http://localhost:8091/health
curl http://localhost:8091/v1/models
```

### Test Model Loading (AMD)

```bash
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello from AMD GPU!"}],
    "max_tokens": 50
  }'
```
## Monitoring

### Check GPU Usage

AMD GPU:

```bash
# ROCm monitoring
rocm-smi

# Or from host
watch -n 1 rocm-smi
```

NVIDIA GPU:

```bash
nvidia-smi
watch -n 1 nvidia-smi
```

### Check Container Resource Usage

```bash
docker stats llama-swap llama-swap-amd
```
## Troubleshooting

### AMD GPU Not Detected

- Verify ROCm is installed on the host:

  ```bash
  rocm-smi --version
  ```

- Check device permissions:

  ```bash
  ls -l /dev/kfd /dev/dri
  ```

- Verify RX 6800 compatibility:

  ```bash
  rocminfo | grep "Name:"
  ```
### Model Loading Issues

If models fail to load on AMD:

- Check VRAM availability:

  ```bash
  rocm-smi --showmeminfo vram
  ```

- Adjust `-ngl` (GPU layers) in the config if needed:

  ```yaml
  # Reduce GPU layers for smaller VRAM
  cmd: /app/llama-server ... -ngl 50 ...  # Instead of 99
  ```

- Check container logs:

  ```bash
  docker compose logs llama-swap-amd
  ```
### GFX Version Mismatch

RX 6800 is Navi 21 (gfx1030). If you see GFX errors:

```
# Set in docker-compose.yml environment:
HSA_OVERRIDE_GFX_VERSION=10.3.0
```
### llama-swap Build Issues

If the ROCm container fails to build:

- The Dockerfile attempts to build llama-swap from source
- Alternative: Use a pre-built binary or a simpler proxy setup
- Check the build logs:

  ```bash
  docker compose build --no-cache llama-swap-amd
  ```
## Performance Considerations

### Memory Usage

- RX 6800: 16GB VRAM
- Q4_K_M/Q4_K_XL models: ~5-6GB each
- Can run 2 models simultaneously, or 1 with a long context
### Model Selection

Best for the AMD RX 6800:

- ✅ Q4_K_M/Q4_K_S quantized models (5-6GB)
- ✅ Moondream2 vision (smaller, efficient)
- ⚠️ MiniCPM-V-4.5 (possible, but may be tight on VRAM)
### TTL Configuration

Adjust model TTL in `llama-swap-rocm-config.yaml`:

- Lower TTL = more aggressive unloading = more VRAM available
- Higher TTL = less model swapping = faster response times
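A minimal sketch of what a per-model `ttl` entry might look like, assuming llama-swap's per-model `ttl` field (in seconds); the `cmd` line is elided and illustrative, so copy the real command from the existing entry in `llama-swap-rocm-config.yaml`:

```yaml
models:
  "moondream-amd":
    cmd: /app/llama-server ... -ngl 99 ...
    ttl: 60  # unload after 60 seconds of inactivity to free VRAM sooner
```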
## Advanced: Model-Specific Routing

Create a helper function to route models automatically:

```python
# bot/utils/gpu_router.py
import globals

MODEL_TO_GPU = {
    # NVIDIA models
    "llama3.1": globals.LLAMA_URL,
    "darkidol": globals.LLAMA_URL,
    "vision": globals.LLAMA_URL,
    # AMD models
    "llama3.1-amd": globals.LLAMA_AMD_URL,
    "darkidol-amd": globals.LLAMA_AMD_URL,
    "moondream-amd": globals.LLAMA_AMD_URL,
}


def get_endpoint_for_model(model_name):
    """Get the correct llama-swap endpoint for a model"""
    return MODEL_TO_GPU.get(model_name, globals.LLAMA_URL)


def is_amd_model(model_name):
    """Check if model runs on AMD GPU"""
    return model_name.endswith("-amd")
```
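A short usage sketch for the router above, assuming `bot/utils` is importable as a package and reusing the chat-completions payload shape from the Usage section (the prompt text is just an example):

```python
import requests

from bot.utils.gpu_router import get_endpoint_for_model, is_amd_model

model = "moondream-amd"
endpoint = get_endpoint_for_model(model)  # resolves to globals.LLAMA_AMD_URL

response = requests.post(
    f"{endpoint}/v1/chat/completions",
    json={
        "model": model,
        "messages": [{"role": "user", "content": "Hello from the router!"}],
        "max_tokens": 50,
    },
    timeout=120,
)
print(f"AMD model: {is_amd_model(model)}, status: {response.status_code}")
```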
## Environment Variables

Add these to control GPU selection:

```yaml
# In docker-compose.yml
environment:
  - LLAMA_URL=http://llama-swap:8080
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
  - PREFER_AMD_GPU=false      # Set to true to prefer AMD for general tasks
  - AMD_MODELS_ENABLED=true   # Enable/disable AMD models
```
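A minimal sketch of how the bot's `globals` module might read these variables; the defaults and the string-to-bool handling are assumptions, not taken from the actual module:

```python
# globals.py (sketch)
import os

LLAMA_URL = os.environ.get("LLAMA_URL", "http://llama-swap:8080")
LLAMA_AMD_URL = os.environ.get("LLAMA_AMD_URL", "http://llama-swap-amd:8080")

# docker-compose passes booleans as strings, so normalize them here
PREFER_AMD_GPU = os.environ.get("PREFER_AMD_GPU", "false").lower() == "true"
AMD_MODELS_ENABLED = os.environ.get("AMD_MODELS_ENABLED", "true").lower() == "true"
```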
## Future Enhancements

- **Automatic load balancing**: Monitor GPU utilization and route requests accordingly
- **Health checks**: Fall back to the primary GPU if AMD fails
- **Model distribution**: Automatically assign models to GPUs based on VRAM
- **Performance metrics**: Track response times per GPU
- **Dynamic routing**: Use the least-busy GPU for new requests