Add dual GPU support with web UI selector

Features:
- Built custom ROCm container for AMD RX 6800 GPU
- Added GPU selection toggle in web UI (NVIDIA/AMD)
- Unified model names across both GPUs for seamless switching
- Vision model always uses NVIDIA GPU (optimal performance)
- Text models (llama3.1, darkidol) can use either GPU
- Added /gpu-status and /gpu-select API endpoints
- Implemented GPU state persistence in memory/gpu_state.json

Technical details:
- Multi-stage Dockerfile.llamaswap-rocm with ROCm 6.2.4
- llama.cpp compiled with GGML_HIP=ON for gfx1030 (RX 6800)
- Proper GPU permissions without root (groups 187/989)
- AMD container on port 8091, NVIDIA on port 8090
- Updated bot/utils/llm.py with get_current_gpu_url() and get_vision_gpu_url()
- Modified bot/utils/image_handling.py to always use NVIDIA for vision
- Enhanced web UI with GPU selector button (blue=NVIDIA, red=AMD)

Files modified:
- docker-compose.yml (added llama-swap-amd service)
- bot/globals.py (added LLAMA_AMD_URL)
- bot/api.py (added GPU selection endpoints and helper function)
- bot/utils/llm.py (GPU routing for text models)
- bot/utils/image_handling.py (GPU routing for vision models)
- bot/static/index.html (GPU selector UI)
- llama-swap-rocm-config.yaml (unified model names)

New files:
- Dockerfile.llamaswap-rocm
- bot/memory/gpu_state.json
- bot/utils/gpu_router.py (load balancing utility)
- setup-dual-gpu.sh (setup verification script)
- DUAL_GPU_*.md (documentation files)
New file: `DUAL_GPU_SETUP.md` (+321 lines)

# Dual GPU Setup - NVIDIA + AMD RX 6800

This document describes the dual-GPU configuration for running two llama-swap instances simultaneously:

- **Primary GPU (NVIDIA)**: Runs main models via CUDA
- **Secondary GPU (AMD RX 6800)**: Runs additional models via ROCm

## Architecture Overview

```
┌────────────────────────────────────────────────────────────┐
│                          Miku Bot                           │
│                                                             │
│  LLAMA_URL=http://llama-swap:8080 (NVIDIA)                  │
│  LLAMA_AMD_URL=http://llama-swap-amd:8080 (AMD RX 6800)     │
└────────────────────────────────────────────────────────────┘
             │                               │
             ▼                               ▼
    ┌──────────────────┐            ┌──────────────────┐
    │   llama-swap     │            │  llama-swap-amd  │
    │     (CUDA)       │            │      (ROCm)      │
    │   Port: 8090     │            │   Port: 8091     │
    └──────────────────┘            └──────────────────┘
             │                               │
             ▼                               ▼
    ┌──────────────────┐            ┌──────────────────┐
    │   NVIDIA GPU     │            │   AMD RX 6800    │
    │  - llama3.1      │            │  - llama3.1-amd  │
    │  - darkidol      │            │  - darkidol-amd  │
    │  - vision        │            │  - moondream-amd │
    └──────────────────┘            └──────────────────┘
```

## Files Created

1. **Dockerfile.llamaswap-rocm** - ROCm-enabled Docker image for AMD GPU
2. **llama-swap-rocm-config.yaml** - Model configuration for AMD models
3. **docker-compose.yml** - Updated with `llama-swap-amd` service

## Configuration Details

### llama-swap-amd Service

```yaml
llama-swap-amd:
  build:
    context: .
    dockerfile: Dockerfile.llamaswap-rocm
  container_name: llama-swap-amd
  ports:
    - "8091:8080"            # External access on port 8091
  volumes:
    - ./models:/models
    - ./llama-swap-rocm-config.yaml:/app/config.yaml
  devices:
    - /dev/kfd:/dev/kfd      # AMD GPU kernel driver
    - /dev/dri:/dev/dri      # Direct Rendering Infrastructure
  group_add:
    - video
    - render
  environment:
    - HSA_OVERRIDE_GFX_VERSION=10.3.0  # RX 6800 (Navi 21) compatibility
```

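Once the service is up, a quick sanity check is to confirm the container actually sees the RX 6800 (a minimal sketch, assuming the ROCm base image ships `rocm-smi`/`rocminfo`):

```bash
# List GPUs visible inside the AMD container (should report Navi 21 / gfx1030)
docker exec llama-swap-amd rocm-smi

# Alternative if rocm-smi is not on PATH in the image
docker exec llama-swap-amd rocminfo | grep -i gfx
```
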
### Available Models on AMD GPU

From `llama-swap-rocm-config.yaml`:

- **llama3.1-amd** - Llama 3.1 8B text model
- **darkidol-amd** - DarkIdol uncensored model
- **moondream-amd** - Moondream2 vision model (smaller, AMD-optimized)

### Model Aliases

You can access AMD models using these aliases:
- `llama3.1-amd`, `text-model-amd`, `amd-text`
- `darkidol-amd`, `evil-model-amd`, `uncensored-amd`
- `moondream-amd`, `vision-amd`, `moondream`

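Aliases resolve to the same underlying model entry, so any of them can be used as the `model` field in a request. A quick example against the AMD instance on port 8091 (assuming the aliases above match the config):

```bash
# "text-model-amd" is an alias for llama3.1-amd; both load the same model
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "text-model-amd", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 10}'
```
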
## Usage

### Building and Starting Services

```bash
# Build the AMD ROCm container
docker compose build llama-swap-amd

# Start both GPU services
docker compose up -d llama-swap llama-swap-amd

# Check logs
docker compose logs -f llama-swap-amd
```

### Accessing AMD Models from Bot Code

In your bot code, you can now use either endpoint:

```python
import requests

import globals

# Use NVIDIA GPU (primary)
nvidia_response = requests.post(
    f"{globals.LLAMA_URL}/v1/chat/completions",
    json={"model": "llama3.1", "messages": [{"role": "user", "content": "Hello"}]},
)

# Use AMD GPU (secondary)
amd_response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={"model": "llama3.1-amd", "messages": [{"role": "user", "content": "Hello"}]},
)
```

### Load Balancing Strategy

You can implement load balancing by:

1. **Round-robin**: Alternate between GPUs for text generation
2. **Task-specific**:
   - NVIDIA: Primary text + MiniCPM vision (heavy)
   - AMD: Secondary text + Moondream vision (lighter)
3. **Failover**: Use AMD as backup if NVIDIA is busy

Example load balancing function:

```python
import random
import globals

def get_llama_url(prefer_amd=False):
    """Get llama URL with optional load balancing"""
    if prefer_amd:
        return globals.LLAMA_AMD_URL

    # Random load balancing for text models
    return random.choice([globals.LLAMA_URL, globals.LLAMA_AMD_URL])
```

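For the failover strategy (option 3 above), a minimal sketch is shown here; the helper name is hypothetical, and it relies on the `/health` endpoint used in the Testing section below:

```python
import requests

import globals

def get_llama_url_with_failover(timeout=2):
    """Hypothetical helper: prefer the NVIDIA instance, fall back to AMD."""
    try:
        # /health is the same endpoint used in the Testing section
        resp = requests.get(f"{globals.LLAMA_URL}/health", timeout=timeout)
        if resp.ok:
            return globals.LLAMA_URL
    except requests.RequestException:
        pass
    # Primary unreachable or unhealthy: route the request to the AMD instance
    return globals.LLAMA_AMD_URL
```
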
## Testing

### Test NVIDIA GPU (Port 8090)
```bash
curl http://localhost:8090/health
curl http://localhost:8090/v1/models
```

### Test AMD GPU (Port 8091)
```bash
curl http://localhost:8091/health
curl http://localhost:8091/v1/models
```

### Test Model Loading (AMD)
```bash
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello from AMD GPU!"}],
    "max_tokens": 50
  }'
```

## Monitoring

### Check GPU Usage

**AMD GPU:**
```bash
# ROCm monitoring (one-shot)
rocm-smi

# Or watch continuously from the host
watch -n 1 rocm-smi
```

**NVIDIA GPU:**
```bash
nvidia-smi
watch -n 1 nvidia-smi
```

### Check Container Resource Usage
```bash
docker stats llama-swap llama-swap-amd
```

## Troubleshooting

### AMD GPU Not Detected

1. Verify ROCm is installed on the host:
   ```bash
   rocm-smi --version
   ```

2. Check device permissions:
   ```bash
   ls -l /dev/kfd /dev/dri
   ```

3. Verify RX 6800 compatibility:
   ```bash
   rocminfo | grep "Name:"
   ```

### Model Loading Issues

If models fail to load on AMD:

1. Check VRAM availability:
   ```bash
   rocm-smi --showmeminfo vram
   ```

2. Adjust `-ngl` (GPU layers) in the config if needed:
   ```yaml
   # Reduce GPU layers for smaller VRAM
   cmd: /app/llama-server ... -ngl 50 ...  # Instead of 99
   ```

3. Check container logs:
   ```bash
   docker compose logs llama-swap-amd
   ```

### GFX Version Mismatch

The RX 6800 is Navi 21 (gfx1030). If you see GFX errors:

```bash
# Set in docker-compose.yml environment:
HSA_OVERRIDE_GFX_VERSION=10.3.0
```

### llama-swap Build Issues

If the ROCm container fails to build:

1. The Dockerfile attempts to build llama-swap from source; confirm that stage completes (the llama.cpp HIP build step is sketched below)
2. Alternative: use a pre-built llama-swap binary or a simpler proxy setup
3. Re-run the build with full logs: `docker compose build --no-cache llama-swap-amd`

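According to the commit notes, the image compiles llama.cpp with `GGML_HIP=ON` targeting gfx1030. A hedged sketch of that build step, using llama.cpp's documented ROCm CMake flags (the actual Dockerfile may pin different options):

```bash
# Illustrative llama.cpp ROCm build for the RX 6800 (gfx1030)
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```
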
## Performance Considerations

### Memory Usage

- **RX 6800**: 16GB VRAM
- Q4_K_M/Q4_K_XL models: ~5-6GB each
- Can run two models simultaneously (roughly 10-12GB total, leaving headroom for KV cache) or one model with a long context

### Model Selection

**Best for AMD RX 6800:**
- ✅ Q4_K_M/Q4_K_S quantized models (5-6GB)
- ✅ Moondream2 vision (smaller, efficient)
- ⚠️ MiniCPM-V-4.5 (possible but may be tight on VRAM)

### TTL Configuration

Adjust model TTL in `llama-swap-rocm-config.yaml` (see the sketch below):
- Lower TTL = more aggressive unloading = more VRAM available
- Higher TTL = less model swapping = faster response times

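A minimal sketch of a per-model `ttl` entry, assuming llama-swap's config schema (`models:` with `cmd` and `ttl`); the model path here is hypothetical:

```yaml
models:
  llama3.1-amd:
    cmd: /app/llama-server -m /models/llama3.1-8b-q4_k_m.gguf -ngl 99 --port ${PORT}
    ttl: 300   # seconds to keep an idle model loaded before freeing VRAM
```
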
## Advanced: Model-Specific Routing

Create a helper function to route models automatically:

```python
# bot/utils/gpu_router.py
import globals

MODEL_TO_GPU = {
    # NVIDIA models
    "llama3.1": globals.LLAMA_URL,
    "darkidol": globals.LLAMA_URL,
    "vision": globals.LLAMA_URL,

    # AMD models
    "llama3.1-amd": globals.LLAMA_AMD_URL,
    "darkidol-amd": globals.LLAMA_AMD_URL,
    "moondream-amd": globals.LLAMA_AMD_URL,
}

def get_endpoint_for_model(model_name):
    """Get the correct llama-swap endpoint for a model"""
    return MODEL_TO_GPU.get(model_name, globals.LLAMA_URL)

def is_amd_model(model_name):
    """Check if model runs on AMD GPU"""
    return model_name.endswith("-amd")
```

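For example, a caller can resolve the endpoint first and then issue the request (a sketch; the import path assumes the bot's `utils` package layout, and the payload mirrors the earlier examples):

```python
import requests

from utils.gpu_router import get_endpoint_for_model, is_amd_model

model = "darkidol-amd"
base_url = get_endpoint_for_model(model)   # resolves to globals.LLAMA_AMD_URL

response = requests.post(
    f"{base_url}/v1/chat/completions",
    json={"model": model, "messages": [{"role": "user", "content": "Hello"}]},
)
print(is_amd_model(model), response.status_code)
```
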
## Environment Variables

Add these to control GPU selection:

```yaml
# In docker-compose.yml
environment:
  - LLAMA_URL=http://llama-swap:8080
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
  - PREFER_AMD_GPU=false      # Set to true to prefer AMD for general tasks
  - AMD_MODELS_ENABLED=true   # Enable/disable AMD models
```

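On the bot side, `bot/globals.py` can read these variables from the environment; a minimal sketch (the defaults and the two optional flags below are illustrative, not necessarily what the module actually contains):

```python
# bot/globals.py (illustrative excerpt)
import os

LLAMA_URL = os.getenv("LLAMA_URL", "http://llama-swap:8080")
LLAMA_AMD_URL = os.getenv("LLAMA_AMD_URL", "http://llama-swap-amd:8080")

# Optional knobs from docker-compose.yml
PREFER_AMD_GPU = os.getenv("PREFER_AMD_GPU", "false").lower() == "true"
AMD_MODELS_ENABLED = os.getenv("AMD_MODELS_ENABLED", "true").lower() == "true"
```
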
## Future Enhancements

1. **Automatic load balancing**: Monitor GPU utilization and route requests accordingly
2. **Health checks**: Fall back to the primary GPU if AMD fails
3. **Model distribution**: Automatically assign models to GPUs based on VRAM
4. **Performance metrics**: Track response times per GPU
5. **Dynamic routing**: Use the least-busy GPU for new requests

## References

- [ROCm Documentation](https://rocmdocs.amd.com/)
- [llama.cpp ROCm Support](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#rocm)
- [llama-swap GitHub](https://github.com/mostlygeek/llama-swap)
- [AMD GPU Compatibility Matrix](https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html)