miku-discord/bot/utils/gpu_preload.py
koko210Serve 1fc3d74a5b Add dual GPU support with web UI selector
Features:
- Built custom ROCm container for AMD RX 6800 GPU
- Added GPU selection toggle in web UI (NVIDIA/AMD)
- Unified model names across both GPUs for seamless switching
- Vision model always uses NVIDIA GPU (optimal performance)
- Text models (llama3.1, darkidol) can use either GPU
- Added /gpu-status and /gpu-select API endpoints
- Implemented GPU state persistence in memory/gpu_state.json
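
The persisted state is just a small JSON file; a minimal sketch of the load/save side, where the helper names and the default value are assumptions and only the memory/gpu_state.json path comes from this commit:

    # Sketch only - helper names are hypothetical; the path is from this commit.
    import json
    import os

    GPU_STATE_FILE = "memory/gpu_state.json"

    def load_gpu_state(default="nvidia"):
        """Return the last selected GPU ("nvidia" or "amd"), falling back to the default."""
        try:
            with open(GPU_STATE_FILE) as f:
                return json.load(f).get("selected_gpu", default)
        except (FileNotFoundError, json.JSONDecodeError):
            return default

    def save_gpu_state(selected_gpu):
        """Persist the GPU selection so it survives bot restarts."""
        os.makedirs(os.path.dirname(GPU_STATE_FILE), exist_ok=True)
        with open(GPU_STATE_FILE, "w") as f:
            json.dump({"selected_gpu": selected_gpu}, f)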

Technical details:
- Multi-stage Dockerfile.llamaswap-rocm with ROCm 6.2.4
- llama.cpp compiled with GGML_HIP=ON for gfx1030 (RX 6800)
- Proper GPU permissions without root (groups 187/989)
- AMD container on port 8091, NVIDIA on port 8090
- Updated bot/utils/llm.py with get_current_gpu_url() and get_vision_gpu_url() (sketched after this list)
- Modified bot/utils/image_handling.py to always use NVIDIA for vision
- Enhanced web UI with GPU selector button (blue=NVIDIA, red=AMD)
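
The routing helpers themselves aren't shown on this page; a minimal sketch of how they could behave, assuming a LLAMA_NVIDIA_URL-style variable alongside the LLAMA_AMD_URL added in globals.py (the function names, the 8090/8091 ports, and the always-NVIDIA vision rule are from this commit):

    # Sketch only - the NVIDIA URL variable name is assumed; LLAMA_AMD_URL is from this commit.
    import json
    import globals

    def _selected_gpu():
        # Read the selection persisted by the web UI toggle (memory/gpu_state.json).
        try:
            with open("memory/gpu_state.json") as f:
                return json.load(f).get("selected_gpu", "nvidia")
        except (FileNotFoundError, json.JSONDecodeError):
            return "nvidia"

    def get_current_gpu_url():
        """Base URL for text models (llama3.1, darkidol), honouring the UI selection."""
        if _selected_gpu() == "amd":
            return globals.LLAMA_AMD_URL     # AMD container, port 8091
        return globals.LLAMA_NVIDIA_URL      # NVIDIA container, port 8090 (name assumed)

    def get_vision_gpu_url():
        """Vision requests always go to the NVIDIA backend."""
        return globals.LLAMA_NVIDIA_URL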

Files modified:
- docker-compose.yml (added llama-swap-amd service)
- bot/globals.py (added LLAMA_AMD_URL)
- bot/api.py (added GPU selection endpoints and helper function; see the sketch after this list)
- bot/utils/llm.py (GPU routing for text models)
- bot/utils/image_handling.py (GPU routing for vision models)
- bot/static/index.html (GPU selector UI)
- llama-swap-rocm-config.yaml (unified model names)
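
The new endpoints in bot/api.py might look roughly like the following, assuming an aiohttp.web app (the bot already depends on aiohttp) and reusing the load_gpu_state()/save_gpu_state() helpers sketched under Features; the handler names and payload shapes are guesses:

    # Sketch only - framework and payload shapes are assumptions;
    # the /gpu-status and /gpu-select routes are from this commit.
    from aiohttp import web

    async def gpu_status(request):
        """Report which GPU text models are currently routed to."""
        return web.json_response({"selected_gpu": load_gpu_state()})

    async def gpu_select(request):
        """Switch text-model routing to the requested GPU and persist the choice."""
        body = await request.json()
        gpu = body.get("gpu")
        if gpu not in ("nvidia", "amd"):
            return web.json_response({"error": "gpu must be 'nvidia' or 'amd'"}, status=400)
        save_gpu_state(gpu)
        return web.json_response({"selected_gpu": gpu})

    def register_gpu_routes(app: web.Application):
        app.router.add_get("/gpu-status", gpu_status)
        app.router.add_post("/gpu-select", gpu_select)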

New files:
- Dockerfile.llamaswap-rocm
- bot/memory/gpu_state.json
- bot/utils/gpu_router.py (load balancing utility; see the sketch after this list)
- setup-dual-gpu.sh (setup verification script)
- DUAL_GPU_*.md (documentation files)
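
This commit doesn't show how gpu_router.py balances load; a bare-bones round-robin sketch over the two text-capable backends, with every name and URL here assumed:

    # Sketch only - the real strategy in gpu_router.py is not shown in this commit.
    import itertools

    # NVIDIA on 8090, AMD on 8091 (ports from this commit; hostnames assumed).
    _backends = itertools.cycle(["http://localhost:8090", "http://localhost:8091"])

    def next_text_backend():
        """Return the next backend URL for a text-model request."""
        return next(_backends)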
2026-01-09 00:03:59 +02:00

"""
GPU Model Preloading Utility
Preloads models on AMD GPU to take advantage of 16GB VRAM
"""
import aiohttp
import asyncio
import json
import globals
async def preload_amd_models():
"""
Preload both text and vision models on AMD GPU
Since AMD RX 6800 has 16GB VRAM, we can keep both loaded simultaneously
"""
print("🔧 Preloading models on AMD GPU...")
# Preload text model
try:
async with aiohttp.ClientSession() as session:
payload = {
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hi"}],
"max_tokens": 1
}
async with session.post(
f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
json=payload,
timeout=aiohttp.ClientTimeout(total=60)
) as response:
if response.status == 200:
print("✅ Text model (llama3.1) preloaded on AMD GPU")
else:
print(f"⚠️ Text model preload returned status {response.status}")
except Exception as e:
print(f"⚠️ Failed to preload text model on AMD: {e}")
# Preload vision model
try:
async with aiohttp.ClientSession() as session:
# Create a minimal test image (1x1 white pixel)
import base64
test_image = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8z8DwHwAFBQIAX8jx0gAAAABJRU5ErkJggg=="
payload = {
"model": "vision",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What do you see?"},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{test_image}"}}
]
}
],
"max_tokens": 1
}
async with session.post(
f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
json=payload,
timeout=aiohttp.ClientTimeout(total=120)
) as response:
if response.status == 200:
print("✅ Vision model preloaded on AMD GPU")
else:
print(f"⚠️ Vision model preload returned status {response.status}")
except Exception as e:
print(f"⚠️ Failed to preload vision model on AMD: {e}")
print("✅ AMD GPU preload complete - both models ready")