# Dual GPU Setup - NVIDIA + AMD RX 6800
This document describes the dual-GPU configuration for running two llama-swap instances simultaneously:

- **Primary GPU (NVIDIA)**: Runs main models via CUDA
- **Secondary GPU (AMD RX 6800)**: Runs additional models via ROCm
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│                          Miku Bot                            │
│                                                              │
│  LLAMA_URL=http://llama-swap:8080          (NVIDIA)          │
│  LLAMA_AMD_URL=http://llama-swap-amd:8080  (AMD RX 6800)     │
└─────────────────────────────────────────────────────────────┘
         │                                   │
         ▼                                   ▼
┌──────────────────┐                ┌──────────────────┐
│   llama-swap     │                │  llama-swap-amd  │
│      (CUDA)      │                │      (ROCm)      │
│   Port: 8090     │                │   Port: 8091     │
└──────────────────┘                └──────────────────┘
         │                                   │
         ▼                                   ▼
┌──────────────────┐                ┌──────────────────┐
│   NVIDIA GPU     │                │   AMD RX 6800    │
│  - llama3.1      │                │  - llama3.1-amd  │
│  - darkidol      │                │  - darkidol-amd  │
│  - vision        │                │  - moondream-amd │
└──────────────────┘                └──────────────────┘
```
## Files Created

- `Dockerfile.llamaswap-rocm` - ROCm-enabled Docker image for AMD GPU
- `llama-swap-rocm-config.yaml` - Model configuration for AMD models
- `docker-compose.yml` - Updated with the `llama-swap-amd` service
## Configuration Details

### llama-swap-amd Service
```yaml
llama-swap-amd:
  build:
    context: .
    dockerfile: Dockerfile.llamaswap-rocm
  container_name: llama-swap-amd
  ports:
    - "8091:8080"  # External access on port 8091
  volumes:
    - ./models:/models
    - ./llama-swap-rocm-config.yaml:/app/config.yaml
  devices:
    - /dev/kfd:/dev/kfd  # AMD GPU kernel driver
    - /dev/dri:/dev/dri  # Direct Rendering Infrastructure
  group_add:
    - video
    - render
  environment:
    - HSA_OVERRIDE_GFX_VERSION=10.3.0  # RX 6800 (Navi 21) compatibility
```
### Available Models on AMD GPU

From `llama-swap-rocm-config.yaml`:

- `llama3.1-amd` - Llama 3.1 8B text model
- `darkidol-amd` - DarkIdol uncensored model
- `moondream-amd` - Moondream2 vision model (smaller, AMD-optimized)
### Model Aliases

You can access AMD models using these aliases:

| Model | Aliases |
|-------|---------|
| `llama3.1-amd` | `text-model-amd`, `amd-text` |
| `darkidol-amd` | `evil-model-amd`, `uncensored-amd` |
| `moondream-amd` | `vision-amd`, `moondream` |
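Any alias can be passed directly as the `model` field of a request; a minimal sketch using `requests` (the prompt, `max_tokens`, and `timeout` values are illustrative):

```python
import requests

import globals

# "vision-amd" is an alias, so the AMD llama-swap instance serves it
# with the same underlying model as "moondream-amd"
response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={
        "model": "vision-amd",
        "messages": [{"role": "user", "content": "Hello from the AMD GPU!"}],
        "max_tokens": 50,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```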
## Usage

### Building and Starting Services

```bash
# Build the AMD ROCm container
docker compose build llama-swap-amd

# Start both GPU services
docker compose up -d llama-swap llama-swap-amd

# Check logs
docker compose logs -f llama-swap-amd
```
### Accessing AMD Models from Bot Code

In your bot code, you can now use either endpoint:

```python
import requests

import globals

# Use NVIDIA GPU (primary)
nvidia_response = requests.post(
    f"{globals.LLAMA_URL}/v1/chat/completions",
    json={"model": "llama3.1", ...}
)

# Use AMD GPU (secondary)
amd_response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={"model": "llama3.1-amd", ...}
)
```
### Load Balancing Strategy

You can implement load balancing by:

- **Round-robin**: Alternate between GPUs for text generation
- **Task-specific**:
  - NVIDIA: Primary text + MiniCPM vision (heavy)
  - AMD: Secondary text + Moondream vision (lighter)
- **Failover**: Use AMD as backup if NVIDIA is busy (sketched below)
Example load balancing function:

```python
import random

import globals


def get_llama_url(prefer_amd=False):
    """Get llama URL with optional load balancing"""
    if prefer_amd:
        return globals.LLAMA_AMD_URL
    # Random load balancing for text models
    return random.choice([globals.LLAMA_URL, globals.LLAMA_AMD_URL])
```
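The failover option can be sketched with a simple probe against the `/health` endpoint exercised with curl in the Testing section below; the function name and timeout here are illustrative, not part of the existing bot code:

```python
import requests

import globals


def get_llama_url_with_failover(timeout=2.0):
    """Prefer the NVIDIA instance; fall back to AMD if it is unreachable."""
    try:
        # Same /health endpoint used in the Testing section
        resp = requests.get(f"{globals.LLAMA_URL}/health", timeout=timeout)
        resp.raise_for_status()
        return globals.LLAMA_URL
    except requests.RequestException:
        # NVIDIA llama-swap is down or unresponsive; route to the AMD instance
        return globals.LLAMA_AMD_URL
```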
## Testing

### Test NVIDIA GPU (Port 8090)

```bash
curl http://localhost:8090/health
curl http://localhost:8090/v1/models
```

### Test AMD GPU (Port 8091)

```bash
curl http://localhost:8091/health
curl http://localhost:8091/v1/models
```

### Test Model Loading (AMD)

```bash
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello from AMD GPU!"}],
    "max_tokens": 50
  }'
```
## Monitoring

### Check GPU Usage

AMD GPU:

```bash
# ROCm monitoring
rocm-smi

# Or from host
watch -n 1 rocm-smi
```

NVIDIA GPU:

```bash
nvidia-smi
watch -n 1 nvidia-smi
```

### Check Container Resource Usage

```bash
docker stats llama-swap llama-swap-amd
```
## Troubleshooting

### AMD GPU Not Detected

- Verify ROCm is installed on the host:

  ```bash
  rocm-smi --version
  ```

- Check device permissions:

  ```bash
  ls -l /dev/kfd /dev/dri
  ```

- Verify RX 6800 compatibility:

  ```bash
  rocminfo | grep "Name:"
  ```
### Model Loading Issues

If models fail to load on AMD:

- Check VRAM availability:

  ```bash
  rocm-smi --showmeminfo vram
  ```

- Adjust `-ngl` (GPU layers) in the config if needed:

  ```yaml
  # Reduce GPU layers for smaller VRAM
  cmd: /app/llama-server ... -ngl 50 ...  # Instead of 99
  ```

- Check container logs:

  ```bash
  docker compose logs llama-swap-amd
  ```
### GFX Version Mismatch

RX 6800 is Navi 21 (gfx1030). If you see GFX errors:

```
# Set in docker-compose.yml environment:
HSA_OVERRIDE_GFX_VERSION=10.3.0
```
### llama-swap Build Issues

If the ROCm container fails to build:

- The Dockerfile attempts to build llama-swap from source
- Alternative: Use a pre-built binary or a simpler proxy setup
- Check the build logs:

  ```bash
  docker compose build --no-cache llama-swap-amd
  ```
## Performance Considerations

### Memory Usage

- RX 6800: 16GB VRAM
- Q4_K_M/Q4_K_XL models: ~5-6GB each
- Can run 2 models simultaneously, or 1 with a long context
### Model Selection

Best for the AMD RX 6800:

- ✅ Q4_K_M/Q4_K_S quantized models (5-6GB)
- ✅ Moondream2 vision (smaller, efficient)
- ⚠️ MiniCPM-V-4.5 (possible, but may be tight on VRAM)
### TTL Configuration

Adjust model TTL in `llama-swap-rocm-config.yaml`:

- Lower TTL = more aggressive unloading = more VRAM available
- Higher TTL = less model swapping = faster response times
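A minimal sketch of what a per-model `ttl` entry might look like, assuming llama-swap's per-model `ttl` field (in seconds); the `cmd` line is elided and illustrative, so copy the real command from the existing entry in `llama-swap-rocm-config.yaml`:

```yaml
models:
  "moondream-amd":
    cmd: /app/llama-server ... -ngl 99 ...
    ttl: 60  # unload after 60 seconds of inactivity to free VRAM sooner
```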
## Advanced: Model-Specific Routing

Create a helper function to route models automatically:

```python
# bot/utils/gpu_router.py
import globals

MODEL_TO_GPU = {
    # NVIDIA models
    "llama3.1": globals.LLAMA_URL,
    "darkidol": globals.LLAMA_URL,
    "vision": globals.LLAMA_URL,
    # AMD models
    "llama3.1-amd": globals.LLAMA_AMD_URL,
    "darkidol-amd": globals.LLAMA_AMD_URL,
    "moondream-amd": globals.LLAMA_AMD_URL,
}


def get_endpoint_for_model(model_name):
    """Get the correct llama-swap endpoint for a model"""
    return MODEL_TO_GPU.get(model_name, globals.LLAMA_URL)


def is_amd_model(model_name):
    """Check if model runs on AMD GPU"""
    return model_name.endswith("-amd")
```
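A short usage sketch for the router above, assuming `bot/utils` is importable as a package and reusing the chat-completions payload shape from the Usage section (the prompt text is just an example):

```python
import requests

from bot.utils.gpu_router import get_endpoint_for_model, is_amd_model

model = "moondream-amd"
endpoint = get_endpoint_for_model(model)  # resolves to globals.LLAMA_AMD_URL

response = requests.post(
    f"{endpoint}/v1/chat/completions",
    json={
        "model": model,
        "messages": [{"role": "user", "content": "Hello from the router!"}],
        "max_tokens": 50,
    },
    timeout=120,
)
print(f"AMD model: {is_amd_model(model)}, status: {response.status_code}")
```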
## Environment Variables

Add these to control GPU selection:

```yaml
# In docker-compose.yml
environment:
  - LLAMA_URL=http://llama-swap:8080
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
  - PREFER_AMD_GPU=false      # Set to true to prefer AMD for general tasks
  - AMD_MODELS_ENABLED=true   # Enable/disable AMD models
```
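A minimal sketch of how the bot's `globals` module might read these variables; the defaults and the string-to-bool handling are assumptions, not taken from the actual module:

```python
# globals.py (sketch)
import os

LLAMA_URL = os.environ.get("LLAMA_URL", "http://llama-swap:8080")
LLAMA_AMD_URL = os.environ.get("LLAMA_AMD_URL", "http://llama-swap-amd:8080")

# docker-compose passes booleans as strings, so normalize them here
PREFER_AMD_GPU = os.environ.get("PREFER_AMD_GPU", "false").lower() == "true"
AMD_MODELS_ENABLED = os.environ.get("AMD_MODELS_ENABLED", "true").lower() == "true"
```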
## Future Enhancements

- **Automatic load balancing**: Monitor GPU utilization and route requests accordingly
- **Health checks**: Fall back to the primary GPU if AMD fails
- **Model distribution**: Automatically assign models to GPUs based on VRAM
- **Performance metrics**: Track response times per GPU
- **Dynamic routing**: Use the least-busy GPU for new requests