Dual GPU Setup Summary

What We Built

A secondary llama-swap container optimized for your AMD RX 6800 GPU using ROCm.

Architecture

Primary GPU (NVIDIA GTX 1660)     Secondary GPU (AMD RX 6800)
         ↓                                    ↓
   llama-swap (CUDA)                  llama-swap-amd (ROCm)
   Port: 8090                         Port: 8091
         ↓                                    ↓
   NVIDIA models                       AMD models
   - llama3.1                         - llama3.1-amd
   - darkidol                         - darkidol-amd
   - vision (MiniCPM)                 - moondream-amd

Files Created

  1. Dockerfile.llamaswap-rocm - Custom multi-stage build:

    • Stage 1: Builds llama.cpp with ROCm from source
    • Stage 2: Builds llama-swap from source
    • Stage 3: Runtime image with both binaries
  2. llama-swap-rocm-config.yaml - Model configuration for AMD GPU

  3. docker-compose.yml - Updated with llama-swap-amd service

  4. bot/utils/gpu_router.py - Load balancing utility

  5. bot/globals.py - Updated with LLAMA_AMD_URL

  6. setup-dual-gpu.sh - Setup verification script

  7. DUAL_GPU_SETUP.md - Comprehensive documentation

  8. DUAL_GPU_QUICK_REF.md - Quick reference guide

Why Custom Build?

  • llama.cpp doesn't publish ROCm Docker images (yet)
  • llama-swap doesn't provide ROCm variants
  • Building from source ensures compatibility with the latest ROCm release
  • Full control over compilation flags and optimization

Build Time

The initial build takes 15-30 minutes depending on your system:

  • llama.cpp compilation: ~10-20 minutes
  • llama-swap compilation: ~1-2 minutes
  • Image layering: ~2-5 minutes

Subsequent builds are much faster due to Docker layer caching.

Next Steps

Once the build completes:

# 1. Start both GPU services
docker compose up -d llama-swap llama-swap-amd

# 2. Verify both are running
docker compose ps

# 3. Test NVIDIA GPU
curl http://localhost:8090/health

# 4. Test AMD GPU
curl http://localhost:8091/health

# 5. Monitor logs
docker compose logs -f llama-swap-amd

# 6. Test model loading on AMD
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

Device Access

The AMD container has access to:

  • /dev/kfd - AMD GPU kernel driver
  • /dev/dri - Direct Rendering Infrastructure
  • Groups: video, render

Environment Variables

RX 6800-specific settings:

HSA_OVERRIDE_GFX_VERSION=10.3.0  # Navi 21 (gfx1030) compatibility
ROCM_PATH=/opt/rocm
HIP_VISIBLE_DEVICES=0            # Use first AMD GPU

Bot Integration

Your bot now has two endpoints available:

import globals

# NVIDIA GPU (primary)
nvidia_url = globals.LLAMA_URL  # http://llama-swap:8080

# AMD GPU (secondary)
amd_url = globals.LLAMA_AMD_URL  # http://llama-swap-amd:8080
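
For example, a test completion can be sent straight to the AMD endpoint. This is a minimal sketch using the requests library; the URL and model name come from the configuration above, and the bot's own code may wrap this differently:

import requests

import globals

# Send a small test completion to the AMD-backed endpoint.
# llama-swap loads "llama3.1-amd" on demand, so the first call can be slow.
response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={
        "model": "llama3.1-amd",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 50,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])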

Use the gpu_router utility for automatic load balancing:

from bot.utils.gpu_router import get_llama_url_with_load_balancing

# Round-robin between GPUs
url, model = get_llama_url_with_load_balancing(task_type="text")

# Prefer AMD for vision
url, model = get_llama_url_with_load_balancing(
    task_type="vision",
    prefer_amd=True
)
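
Internally, the round-robin case can be as simple as cycling between the two endpoints. The real logic lives in bot/utils/gpu_router.py and may differ; the sketch below only illustrates the idea, with model names taken from the architecture diagram above:

import itertools

import globals

# Alternate between the two endpoints for plain text generation.
# Each entry pairs a llama-swap URL with the model that endpoint serves.
_TEXT_BACKENDS = itertools.cycle([
    (globals.LLAMA_URL, "llama3.1"),
    (globals.LLAMA_AMD_URL, "llama3.1-amd"),
])

def get_llama_url_with_load_balancing(task_type="text", prefer_amd=False):
    # Vision requests can be pinned to the AMD card when prefer_amd is set.
    if task_type == "vision" and prefer_amd:
        return globals.LLAMA_AMD_URL, "moondream-amd"
    return next(_TEXT_BACKENDS)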

Troubleshooting

If the AMD container fails to start:

  1. Rebuild from scratch and inspect the build output:

    docker compose build --no-cache llama-swap-amd
    
  2. Verify GPU access:

    ls -l /dev/kfd /dev/dri
    
  3. Check container logs:

    docker compose logs llama-swap-amd
    
  4. Test GPU from host:

    lspci | grep -i amd
    # Should show: Radeon RX 6800
    

Performance Notes

RX 6800 Specs:

  • VRAM: 16GB
  • Architecture: RDNA 2 (Navi 21)
  • Compute: gfx1030

Recommended Models:

  • Q4_K_M quantization: 5-6GB per model
  • Can load 2-3 models simultaneously
  • Good for: Llama 3.1 8B, DarkIdol 8B, Moondream2

Future Improvements

  1. Automatic failover: Route to AMD if NVIDIA is busy (see the sketch after this list)
  2. Health monitoring: Track GPU utilization
  3. Dynamic routing: Use least-busy GPU
  4. VRAM monitoring: Alert before OOM
  5. Model preloading: Keep common models loaded
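
As a rough illustration of the failover idea in item 1, routing can fall back to the AMD endpoint whenever the NVIDIA endpoint stops answering its /health check. This is a hypothetical sketch, not one of the files listed above:

import requests

import globals

def pick_backend(timeout=2.0):
    # Return the NVIDIA endpoint if it answers /health, otherwise fail over to AMD.
    try:
        if requests.get(f"{globals.LLAMA_URL}/health", timeout=timeout).ok:
            return globals.LLAMA_URL
    except requests.RequestException:
        pass
    return globals.LLAMA_AMD_URL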

Resources