# Dual GPU Setup - NVIDIA + AMD RX 6800

This document describes the dual-GPU configuration for running two llama-swap instances simultaneously:

- **Primary GPU (NVIDIA)**: Runs main models via CUDA
- **Secondary GPU (AMD RX 6800)**: Runs additional models via ROCm

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                          Miku Bot                            │
│                                                              │
│   LLAMA_URL=http://llama-swap:8080 (NVIDIA)                  │
│   LLAMA_AMD_URL=http://llama-swap-amd:8080 (AMD RX 6800)     │
└─────────────────────────────────────────────────────────────┘
              │                          │
              │                          │
              ▼                          ▼
    ┌──────────────────┐      ┌──────────────────┐
    │   llama-swap     │      │  llama-swap-amd  │
    │   (CUDA)         │      │     (ROCm)       │
    │   Port: 8090     │      │   Port: 8091     │
    └──────────────────┘      └──────────────────┘
              │                          │
              ▼                          ▼
    ┌──────────────────┐      ┌──────────────────┐
    │   NVIDIA GPU     │      │   AMD RX 6800    │
    │  - llama3.1      │      │  - llama3.1-amd  │
    │  - darkidol      │      │  - darkidol-amd  │
    │  - vision        │      │  - moondream-amd │
    └──────────────────┘      └──────────────────┘
```

## Files Created

1. **Dockerfile.llamaswap-rocm** - ROCm-enabled Docker image for the AMD GPU
2. **llama-swap-rocm-config.yaml** - Model configuration for the AMD models
3. **docker-compose.yml** - Updated with the `llama-swap-amd` service

## Configuration Details

### llama-swap-amd Service

```yaml
llama-swap-amd:
  build:
    context: .
    dockerfile: Dockerfile.llamaswap-rocm
  container_name: llama-swap-amd
  ports:
    - "8091:8080"                        # External access on port 8091
  volumes:
    - ./models:/models
    - ./llama-swap-rocm-config.yaml:/app/config.yaml
  devices:
    - /dev/kfd:/dev/kfd                  # AMD GPU kernel driver
    - /dev/dri:/dev/dri                  # Direct Rendering Infrastructure
  group_add:
    - video
    - render
  environment:
    - HSA_OVERRIDE_GFX_VERSION=10.3.0    # RX 6800 (Navi 21) compatibility
```
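
Once the service is running, a quick sanity check is to confirm the container actually sees the card. This assumes `rocm-smi` is available inside the ROCm image; if it is not, run the checks in the Troubleshooting section on the host instead.

```bash
# List the GPUs visible from inside the llama-swap-amd container
# (assumes rocm-smi is bundled in the ROCm base image).
docker exec llama-swap-amd rocm-smi
```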

### Available Models on AMD GPU

From `llama-swap-rocm-config.yaml`:

- **llama3.1-amd** - Llama 3.1 8B text model
- **darkidol-amd** - DarkIdol uncensored model
- **moondream-amd** - Moondream2 vision model (smaller, AMD-optimized)

### Model Aliases

You can access AMD models using these aliases (an example request follows the list):

- `llama3.1-amd`, `text-model-amd`, `amd-text`
- `darkidol-amd`, `evil-model-amd`, `uncensored-amd`
- `moondream-amd`, `vision-amd`, `moondream`
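
As a minimal sketch (assuming the `requests` library and the `globals.LLAMA_AMD_URL` setting used later in this document), any alias can be passed as the `model` field and llama-swap resolves it to the underlying model entry:

```python
import requests

import globals  # assumed to expose LLAMA_AMD_URL, as configured in docker-compose.yml

# "vision-amd" is an alias for moondream-amd, so this loads Moondream2 on the AMD GPU.
response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={
        "model": "vision-amd",
        "messages": [{"role": "user", "content": "Hello from the AMD GPU!"}],
        "max_tokens": 50,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```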

## Usage

### Building and Starting Services

```bash
# Build the AMD ROCm container
docker compose build llama-swap-amd

# Start both GPU services
docker compose up -d llama-swap llama-swap-amd

# Check logs
docker compose logs -f llama-swap-amd
```

### Accessing AMD Models from Bot Code

In your bot code, you can now use either endpoint:

```python
import requests

import globals

# Use NVIDIA GPU (primary)
nvidia_response = requests.post(
    f"{globals.LLAMA_URL}/v1/chat/completions",
    json={"model": "llama3.1", ...}
)

# Use AMD GPU (secondary)
amd_response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={"model": "llama3.1-amd", ...}
)
```

### Load Balancing Strategy

You can balance work across the two GPUs in several ways:

1. **Round-robin**: Alternate between GPUs for text generation
2. **Task-specific**:
   - NVIDIA: Primary text + MiniCPM vision (heavy)
   - AMD: Secondary text + Moondream vision (lighter)
3. **Failover**: Use AMD as a backup if NVIDIA is busy (see the failover sketch after the example below)

Example load balancing function:

```python
import random

import globals

def get_llama_url(prefer_amd=False):
    """Get llama URL with optional load balancing"""
    if prefer_amd:
        return globals.LLAMA_AMD_URL

    # Random load balancing for text models
    return random.choice([globals.LLAMA_URL, globals.LLAMA_AMD_URL])
```
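
For the failover strategy, a minimal sketch (using the `/health` endpoint exercised in the Testing section below; the helper name `get_llama_url_with_failover` is hypothetical) could look like this:

```python
import requests

import globals

def get_llama_url_with_failover(timeout=2):
    """Prefer the NVIDIA endpoint; fall back to AMD if its health check fails."""
    try:
        # llama-swap exposes /health (see the Testing section)
        if requests.get(f"{globals.LLAMA_URL}/health", timeout=timeout).ok:
            return globals.LLAMA_URL
    except requests.RequestException:
        pass
    return globals.LLAMA_AMD_URL
```

Note that when falling back to the AMD endpoint you must also request the `-amd` model names (e.g. `llama3.1-amd` instead of `llama3.1`), since the two instances serve different model lists.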

## Testing

### Test NVIDIA GPU (Port 8090)

```bash
curl http://localhost:8090/health
curl http://localhost:8090/v1/models
```

### Test AMD GPU (Port 8091)

```bash
curl http://localhost:8091/health
curl http://localhost:8091/v1/models
```

### Test Model Loading (AMD)

```bash
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello from AMD GPU!"}],
    "max_tokens": 50
  }'
```

## Monitoring

### Check GPU Usage

**AMD GPU:**

```bash
# One-shot usage snapshot
rocm-smi

# Continuous monitoring from the host
watch -n 1 rocm-smi
```

**NVIDIA GPU:**

```bash
nvidia-smi
watch -n 1 nvidia-smi
```

### Check Container Resource Usage

```bash
docker stats llama-swap llama-swap-amd
```

## Troubleshooting

### AMD GPU Not Detected

1. Verify ROCm is installed on the host:

   ```bash
   rocm-smi --version
   ```

2. Check device permissions:

   ```bash
   ls -l /dev/kfd /dev/dri
   ```

3. Verify RX 6800 compatibility:

   ```bash
   rocminfo | grep "Name:"
   ```

### Model Loading Issues

If models fail to load on AMD:

1. Check VRAM availability:

   ```bash
   rocm-smi --showmeminfo vram
   ```

2. Adjust `-ngl` (GPU layers) in the config if needed:

   ```yaml
   # Reduce GPU layers for smaller VRAM
   cmd: /app/llama-server ... -ngl 50 ...   # Instead of 99
   ```

3. Check the container logs:

   ```bash
   docker compose logs llama-swap-amd
   ```

### GFX Version Mismatch

The RX 6800 is Navi 21 (gfx1030). If you see GFX version errors, make sure the override is set:

```yaml
# In the docker-compose.yml environment section:
environment:
  - HSA_OVERRIDE_GFX_VERSION=10.3.0
```

### llama-swap Build Issues

If the ROCm container fails to build:

1. The Dockerfile attempts to build llama-swap from source, which is the most likely point of failure
2. Alternative: use a pre-built binary or a simpler proxy setup
3. Check the full build output: `docker compose build --no-cache llama-swap-amd`

## Performance Considerations

### Memory Usage

- **RX 6800**: 16GB VRAM
- Q4_K_M/Q4_K_XL models: ~5-6GB each
- Can run 2 models simultaneously, or 1 with a long context
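
As a rough budget, two ~6GB quantized models consume about 12GB, leaving roughly 4GB of the 16GB for KV cache and context buffers; running a single model frees that remainder for a much longer context.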

### Model Selection

**Best for AMD RX 6800:**

- ✅ Q4_K_M/Q4_K_S quantized models (5-6GB)
- ✅ Moondream2 vision (smaller, efficient)
- ⚠️ MiniCPM-V-4.5 (possible, but may be tight on VRAM)

### TTL Configuration

Adjust model TTL in `llama-swap-rocm-config.yaml` (see the sketch after this list):

- Lower TTL = more aggressive unloading = more VRAM available
- Higher TTL = less model swapping = faster response times
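
A minimal sketch of a per-model entry, assuming llama-swap's per-model `ttl` field (in seconds) and with the server command elided as in the earlier snippet (values are illustrative):

```yaml
models:
  "llama3.1-amd":
    cmd: /app/llama-server ... -ngl 99 ...
    ttl: 300   # unload after 5 minutes of inactivity to free VRAM
```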

## Advanced: Model-Specific Routing

Create a helper function to route models automatically:

```python
# bot/utils/gpu_router.py
import globals

MODEL_TO_GPU = {
    # NVIDIA models
    "llama3.1": globals.LLAMA_URL,
    "darkidol": globals.LLAMA_URL,
    "vision": globals.LLAMA_URL,

    # AMD models
    "llama3.1-amd": globals.LLAMA_AMD_URL,
    "darkidol-amd": globals.LLAMA_AMD_URL,
    "moondream-amd": globals.LLAMA_AMD_URL,
}

def get_endpoint_for_model(model_name):
    """Get the correct llama-swap endpoint for a model"""
    return MODEL_TO_GPU.get(model_name, globals.LLAMA_URL)

def is_amd_model(model_name):
    """Check if model runs on AMD GPU"""
    return model_name.endswith("-amd")
```
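
A short usage sketch (the request shape follows the earlier bot-code examples; the import path assumes the hypothetical `bot/utils/gpu_router.py` module above is importable as a package):

```python
import requests

from bot.utils.gpu_router import get_endpoint_for_model

model = "darkidol-amd"
endpoint = get_endpoint_for_model(model)  # resolves to globals.LLAMA_AMD_URL

response = requests.post(
    f"{endpoint}/v1/chat/completions",
    json={
        "model": model,
        "messages": [{"role": "user", "content": "Which GPU am I running on?"}],
        "max_tokens": 50,
    },
    timeout=120,
)
```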

## Environment Variables

Add these to control GPU selection:

```yaml
# In docker-compose.yml
environment:
  - LLAMA_URL=http://llama-swap:8080
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
  - PREFER_AMD_GPU=false      # Set to true to prefer AMD for general tasks
  - AMD_MODELS_ENABLED=true   # Enable/disable AMD models
```
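
How the bot consumes these is up to the `globals` module; a minimal sketch of reading them (the defaults shown here are assumptions, not values from the actual module) might be:

```python
# globals.py (sketch)
import os

LLAMA_URL = os.getenv("LLAMA_URL", "http://llama-swap:8080")
LLAMA_AMD_URL = os.getenv("LLAMA_AMD_URL", "http://llama-swap-amd:8080")
PREFER_AMD_GPU = os.getenv("PREFER_AMD_GPU", "false").lower() == "true"
AMD_MODELS_ENABLED = os.getenv("AMD_MODELS_ENABLED", "true").lower() == "true"
```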

## Future Enhancements

1. **Automatic load balancing**: Monitor GPU utilization and route requests accordingly
2. **Health checks**: Fall back to the primary GPU if AMD fails
3. **Model distribution**: Automatically assign models to GPUs based on VRAM
4. **Performance metrics**: Track response times per GPU
5. **Dynamic routing**: Use the least-busy GPU for new requests

## References

- [ROCm Documentation](https://rocmdocs.amd.com/)
- [llama.cpp ROCm Support](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#rocm)
- [llama-swap GitHub](https://github.com/mostlygeek/llama-swap)
- [AMD GPU Compatibility Matrix](https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html)