# Dual GPU Setup - NVIDIA + AMD RX 6800

This document describes the dual-GPU configuration for running two llama-swap instances simultaneously:

- **Primary GPU (NVIDIA)**: Runs main models via CUDA
- **Secondary GPU (AMD RX 6800)**: Runs additional models via ROCm

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                          Miku Bot                            │
│                                                              │
│   LLAMA_URL=http://llama-swap:8080 (NVIDIA)                  │
│   LLAMA_AMD_URL=http://llama-swap-amd:8080 (AMD RX 6800)     │
└─────────────────────────────────────────────────────────────┘
              │                          │
              │                          │
              ▼                          ▼
    ┌──────────────────┐      ┌──────────────────┐
    │   llama-swap     │      │  llama-swap-amd  │
    │   (CUDA)         │      │     (ROCm)       │
    │   Port: 8090     │      │   Port: 8091     │
    └──────────────────┘      └──────────────────┘
              │                          │
              ▼                          ▼
    ┌──────────────────┐      ┌──────────────────┐
    │   NVIDIA GPU     │      │   AMD RX 6800    │
    │  - llama3.1      │      │  - llama3.1-amd  │
    │  - darkidol      │      │  - darkidol-amd  │
    │  - vision        │      │  - moondream-amd │
    └──────────────────┘      └──────────────────┘
```

## Files Created

1. **Dockerfile.llamaswap-rocm** - ROCm-enabled Docker image for the AMD GPU
2. **llama-swap-rocm-config.yaml** - Model configuration for the AMD models
3. **docker-compose.yml** - Updated with the `llama-swap-amd` service

## Configuration Details

### llama-swap-amd Service

```yaml
llama-swap-amd:
  build:
    context: .
    dockerfile: Dockerfile.llamaswap-rocm
  container_name: llama-swap-amd
  ports:
    - "8091:8080"                        # External access on port 8091
  volumes:
    - ./models:/models
    - ./llama-swap-rocm-config.yaml:/app/config.yaml
  devices:
    - /dev/kfd:/dev/kfd                  # AMD GPU kernel driver
    - /dev/dri:/dev/dri                  # Direct Rendering Infrastructure
  group_add:
    - video
    - render
  environment:
    - HSA_OVERRIDE_GFX_VERSION=10.3.0    # RX 6800 (Navi 21) compatibility
```
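
Once the service is running, a quick sanity check is to confirm the container actually sees the card. This assumes `rocm-smi` is available inside the ROCm image; if it is not, run the checks in the Troubleshooting section on the host instead.

```bash
# List the GPUs visible from inside the llama-swap-amd container
# (assumes rocm-smi is bundled in the ROCm base image).
docker exec llama-swap-amd rocm-smi
```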

### Available Models on AMD GPU

From `llama-swap-rocm-config.yaml`:

- **llama3.1-amd** - Llama 3.1 8B text model
- **darkidol-amd** - DarkIdol uncensored model
- **moondream-amd** - Moondream2 vision model (smaller, AMD-optimized)

### Model Aliases

You can access AMD models using these aliases (an example request follows the list):

- `llama3.1-amd`, `text-model-amd`, `amd-text`
- `darkidol-amd`, `evil-model-amd`, `uncensored-amd`
- `moondream-amd`, `vision-amd`, `moondream`
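
As a minimal sketch (assuming the `requests` library and the `globals.LLAMA_AMD_URL` setting used later in this document), any alias can be passed as the `model` field and llama-swap resolves it to the underlying model entry:

```python
import requests

import globals  # assumed to expose LLAMA_AMD_URL, as configured in docker-compose.yml

# "vision-amd" is an alias for moondream-amd, so this loads Moondream2 on the AMD GPU.
response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={
        "model": "vision-amd",
        "messages": [{"role": "user", "content": "Hello from the AMD GPU!"}],
        "max_tokens": 50,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```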

## Usage

### Building and Starting Services

```bash
# Build the AMD ROCm container
docker compose build llama-swap-amd

# Start both GPU services
docker compose up -d llama-swap llama-swap-amd

# Check logs
docker compose logs -f llama-swap-amd
```

### Accessing AMD Models from Bot Code

In your bot code, you can now use either endpoint:

```python
import requests

import globals

# Use NVIDIA GPU (primary)
nvidia_response = requests.post(
    f"{globals.LLAMA_URL}/v1/chat/completions",
    json={"model": "llama3.1", ...}
)

# Use AMD GPU (secondary)
amd_response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={"model": "llama3.1-amd", ...}
)
```

### Load Balancing Strategy

You can balance work across the two GPUs in several ways:

1. **Round-robin**: Alternate between GPUs for text generation
2. **Task-specific**:
   - NVIDIA: Primary text + MiniCPM vision (heavy)
   - AMD: Secondary text + Moondream vision (lighter)
3. **Failover**: Use AMD as a backup if NVIDIA is busy (see the failover sketch after the example below)

Example load balancing function:

```python
import random

import globals

def get_llama_url(prefer_amd=False):
    """Get llama URL with optional load balancing"""
    if prefer_amd:
        return globals.LLAMA_AMD_URL

    # Random load balancing for text models
    return random.choice([globals.LLAMA_URL, globals.LLAMA_AMD_URL])
```
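
For the failover strategy, a minimal sketch (using the `/health` endpoint exercised in the Testing section below; the helper name `get_llama_url_with_failover` is hypothetical) could look like this:

```python
import requests

import globals

def get_llama_url_with_failover(timeout=2):
    """Prefer the NVIDIA endpoint; fall back to AMD if its health check fails."""
    try:
        # llama-swap exposes /health (see the Testing section)
        if requests.get(f"{globals.LLAMA_URL}/health", timeout=timeout).ok:
            return globals.LLAMA_URL
    except requests.RequestException:
        pass
    return globals.LLAMA_AMD_URL
```

Note that when falling back to the AMD endpoint you must also request the `-amd` model names (e.g. `llama3.1-amd` instead of `llama3.1`), since the two instances serve different model lists.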

## Testing

### Test NVIDIA GPU (Port 8090)

```bash
curl http://localhost:8090/health
curl http://localhost:8090/v1/models
```

### Test AMD GPU (Port 8091)

```bash
curl http://localhost:8091/health
curl http://localhost:8091/v1/models
```

### Test Model Loading (AMD)

```bash
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello from AMD GPU!"}],
    "max_tokens": 50
  }'
```

## Monitoring

### Check GPU Usage

**AMD GPU:**

```bash
# One-shot usage snapshot
rocm-smi

# Continuous monitoring from the host
watch -n 1 rocm-smi
```

**NVIDIA GPU:**

```bash
nvidia-smi
watch -n 1 nvidia-smi
```

### Check Container Resource Usage

```bash
docker stats llama-swap llama-swap-amd
```

## Troubleshooting

### AMD GPU Not Detected

1. Verify ROCm is installed on the host:

   ```bash
   rocm-smi --version
   ```

2. Check device permissions:

   ```bash
   ls -l /dev/kfd /dev/dri
   ```

3. Verify RX 6800 compatibility:

   ```bash
   rocminfo | grep "Name:"
   ```

### Model Loading Issues

If models fail to load on AMD:

1. Check VRAM availability:

   ```bash
   rocm-smi --showmeminfo vram
   ```

2. Adjust `-ngl` (GPU layers) in the config if needed:

   ```yaml
   # Reduce GPU layers for smaller VRAM
   cmd: /app/llama-server ... -ngl 50 ...   # Instead of 99
   ```

3. Check the container logs:

   ```bash
   docker compose logs llama-swap-amd
   ```

### GFX Version Mismatch

The RX 6800 is Navi 21 (gfx1030). If you see GFX version errors, make sure the override is set:

```yaml
# In the docker-compose.yml environment section:
environment:
  - HSA_OVERRIDE_GFX_VERSION=10.3.0
```

### llama-swap Build Issues

If the ROCm container fails to build:

1. The Dockerfile attempts to build llama-swap from source, which is the most likely point of failure
2. Alternative: use a pre-built binary or a simpler proxy setup
3. Check the full build output: `docker compose build --no-cache llama-swap-amd`

## Performance Considerations

### Memory Usage

- **RX 6800**: 16GB VRAM
- Q4_K_M/Q4_K_XL models: ~5-6GB each
- Can run 2 models simultaneously, or 1 with a long context
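
As a rough budget, two ~6GB quantized models consume about 12GB, leaving roughly 4GB of the 16GB for KV cache and context buffers; running a single model frees that remainder for a much longer context.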

### Model Selection

**Best for AMD RX 6800:**

- ✅ Q4_K_M/Q4_K_S quantized models (5-6GB)
- ✅ Moondream2 vision (smaller, efficient)
- ⚠️ MiniCPM-V-4.5 (possible, but may be tight on VRAM)

### TTL Configuration

Adjust model TTL in `llama-swap-rocm-config.yaml` (see the sketch after this list):

- Lower TTL = more aggressive unloading = more VRAM available
- Higher TTL = less model swapping = faster response times
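
A minimal sketch of a per-model entry, assuming llama-swap's per-model `ttl` field (in seconds) and with the server command elided as in the earlier snippet (values are illustrative):

```yaml
models:
  "llama3.1-amd":
    cmd: /app/llama-server ... -ngl 99 ...
    ttl: 300   # unload after 5 minutes of inactivity to free VRAM
```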

## Advanced: Model-Specific Routing

Create a helper function to route models automatically:

```python
# bot/utils/gpu_router.py
import globals

MODEL_TO_GPU = {
    # NVIDIA models
    "llama3.1": globals.LLAMA_URL,
    "darkidol": globals.LLAMA_URL,
    "vision": globals.LLAMA_URL,

    # AMD models
    "llama3.1-amd": globals.LLAMA_AMD_URL,
    "darkidol-amd": globals.LLAMA_AMD_URL,
    "moondream-amd": globals.LLAMA_AMD_URL,
}

def get_endpoint_for_model(model_name):
    """Get the correct llama-swap endpoint for a model"""
    return MODEL_TO_GPU.get(model_name, globals.LLAMA_URL)

def is_amd_model(model_name):
    """Check if model runs on AMD GPU"""
    return model_name.endswith("-amd")
```
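
A short usage sketch (the request shape follows the earlier bot-code examples; the import path assumes the hypothetical `bot/utils/gpu_router.py` module above is importable as a package):

```python
import requests

from bot.utils.gpu_router import get_endpoint_for_model

model = "darkidol-amd"
endpoint = get_endpoint_for_model(model)  # resolves to globals.LLAMA_AMD_URL

response = requests.post(
    f"{endpoint}/v1/chat/completions",
    json={
        "model": model,
        "messages": [{"role": "user", "content": "Which GPU am I running on?"}],
        "max_tokens": 50,
    },
    timeout=120,
)
```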

## Environment Variables

Add these to control GPU selection:

```yaml
# In docker-compose.yml
environment:
  - LLAMA_URL=http://llama-swap:8080
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
  - PREFER_AMD_GPU=false      # Set to true to prefer AMD for general tasks
  - AMD_MODELS_ENABLED=true   # Enable/disable AMD models
```
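
How the bot consumes these is up to the `globals` module; a minimal sketch of reading them (the defaults shown here are assumptions, not values from the actual module) might be:

```python
# globals.py (sketch)
import os

LLAMA_URL = os.getenv("LLAMA_URL", "http://llama-swap:8080")
LLAMA_AMD_URL = os.getenv("LLAMA_AMD_URL", "http://llama-swap-amd:8080")
PREFER_AMD_GPU = os.getenv("PREFER_AMD_GPU", "false").lower() == "true"
AMD_MODELS_ENABLED = os.getenv("AMD_MODELS_ENABLED", "true").lower() == "true"
```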

## Future Enhancements

1. **Automatic load balancing**: Monitor GPU utilization and route requests accordingly
2. **Health checks**: Fall back to the primary GPU if AMD fails
3. **Model distribution**: Automatically assign models to GPUs based on VRAM
4. **Performance metrics**: Track response times per GPU
5. **Dynamic routing**: Use the least-busy GPU for new requests

## References

- [ROCm Documentation](https://rocmdocs.amd.com/)
- [llama.cpp ROCm Support](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#rocm)
- [llama-swap GitHub](https://github.com/mostlygeek/llama-swap)
- [AMD GPU Compatibility Matrix](https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html)