# Vision Model Debugging Guide
## Issue Summary
The vision model stops working when the AMD GPU is set as the primary GPU for text inference.
## Root Cause Analysis
The vision model (MiniCPM-V) should **always run on the NVIDIA GPU**, even when AMD is the primary GPU for text models. This is because:
1. **Separate GPU design**: Each GPU has its own llama-swap instance
   - `llama-swap` (NVIDIA) on port 8090 → handles `vision`, `llama3.1`, `darkidol`
   - `llama-swap-amd` (AMD) on port 8091 → handles `llama3.1`, `darkidol` (text models only)
2. **Vision model location**: The vision model is **ONLY configured on NVIDIA**
   - Check: `llama-swap-config.yaml` (has the vision model)
   - Check: `llama-swap-rocm-config.yaml` (does NOT have the vision model); a quick way to confirm both is sketched below
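
To confirm this quickly, a small script along these lines can report which config defines the `vision` model. This is only a sketch: it assumes llama-swap's usual top-level `models:` mapping and that it is run from the directory containing both config files.

```python
# Sketch only: report whether each llama-swap config defines a "vision" model.
# Assumes the configs use a top-level "models:" mapping; adjust if yours differ.
import yaml

def has_vision_model(config_path: str) -> bool:
    with open(config_path) as f:
        config = yaml.safe_load(f) or {}
    return "vision" in config.get("models", {})

for path in ("llama-swap-config.yaml", "llama-swap-rocm-config.yaml"):
    status = "has" if has_vision_model(path) else "does NOT have"
    print(f"{path}: {status} the vision model")
```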
## Fixes Applied
### 1. Improved GPU Routing (`bot/utils/llm.py`)
**Function**: `get_vision_gpu_url()`
- Now explicitly returns NVIDIA URL regardless of primary text GPU
- Added debug logging when text GPU is AMD
- Added clear documentation about the routing strategy
**New Function**: `check_vision_endpoint_health()`
- Pings the NVIDIA vision endpoint before attempting requests
- Provides detailed error messages if endpoint is unreachable
- Logs health status for troubleshooting (both helpers are sketched below)
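
For orientation, the two helpers might look roughly like this. It is not the actual implementation: the `LLAMA_SWAP_URL` environment variable, the default URL, and the synchronous `requests` calls are assumptions; the real code may read different settings and use an async HTTP client.

```python
# Rough sketch of the routing helpers in bot/utils/llm.py -- not the actual code.
# The LLAMA_SWAP_URL env var, the default URL, and `requests` are assumptions.
import logging
import os

import requests

logger = logging.getLogger(__name__)

# Vision always targets the NVIDIA llama-swap instance.
NVIDIA_VISION_URL = os.getenv("LLAMA_SWAP_URL", "http://llama-swap:8080")  # hypothetical env var

def get_vision_gpu_url(current_text_gpu: str = "nvidia") -> str:
    """Return the vision endpoint: always NVIDIA, even when text runs on AMD."""
    if current_text_gpu == "amd":
        logger.debug("Text GPU is AMD; vision still routed to NVIDIA at %s", NVIDIA_VISION_URL)
    return NVIDIA_VISION_URL

def check_vision_endpoint_health(timeout: float = 5.0) -> bool:
    """Ping the NVIDIA vision endpoint before attempting a vision request."""
    try:
        requests.get(f"{NVIDIA_VISION_URL}/health", timeout=timeout).raise_for_status()
        logger.info("Vision endpoint healthy: %s", NVIDIA_VISION_URL)
        return True
    except requests.RequestException as exc:
        logger.error("Vision endpoint unreachable (%s): %s", NVIDIA_VISION_URL, exc)
        return False
```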
### 2. Enhanced Vision Analysis (`bot/utils/image_handling.py`)
**Function**: `analyze_image_with_vision()`
- Added health check before processing
- Increased timeout to 60 seconds (from default)
- Logs endpoint URL, model name, and detailed error messages
- Added exception info logging for better debugging
**Function**: `analyze_video_with_vision()`
- Added health check before processing
- Increased timeout to 120 seconds (from default)
- Logs media type, frame count, and detailed error messages
- Added exception info logging for better debugging (a simplified sketch of the image flow follows)
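
Putting the pieces together, the image path behaves roughly like the sketch below. The endpoint, the OpenAI-compatible payload shape, and the log and error strings follow this guide; the `max_tokens` value and the synchronous `requests` call are illustrative rather than the real implementation.

```python
# Simplified sketch of analyze_image_with_vision() -- not the actual code.
import logging

import requests

logger = logging.getLogger(__name__)
VISION_URL = "http://llama-swap:8080"  # NVIDIA llama-swap, always used for vision

def analyze_image_with_vision(image_data_url: str, prompt: str = "Describe this image.") -> str:
    # 1. Health check before doing any work, so failures surface with a clear message.
    try:
        requests.get(f"{VISION_URL}/health", timeout=5).raise_for_status()
    except requests.RequestException as exc:
        logger.error("Vision endpoint health check failed: %s", exc)
        return "Vision service currently unavailable: Endpoint timeout"

    # 2. Send the vision request with a generous 60 s timeout to cover model load time.
    payload = {
        "model": "vision",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        }],
        "max_tokens": 300,  # illustrative value
    }
    logger.info("Sending vision request to %s", VISION_URL)
    try:
        resp = requests.post(f"{VISION_URL}/v1/chat/completions", json=payload, timeout=60)
        resp.raise_for_status()
    except requests.RequestException:
        logger.exception("Vision analysis failed against %s", VISION_URL)
        return "No description"
    logger.info("Vision analysis completed successfully")
    return resp.json()["choices"][0]["message"]["content"]
```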
## Testing the Fix
### 1. Verify Docker Containers
```bash
# Check both llama-swap services are running
docker compose ps
# Expected output:
# llama-swap (port 8090)
# llama-swap-amd (port 8091)
```
### 2. Test NVIDIA Endpoint Health
```bash
# Check if NVIDIA vision endpoint is responsive
curl -f http://llama-swap:8080/health
# Should return 200 OK
```
### 3. Test Vision Request to NVIDIA
```bash
# Send a simple vision request directly
curl -X POST http://llama-swap:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }],
    "max_tokens": 100
  }'
```
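
The `data:image/jpeg;base64,...` placeholder above is truncated on purpose. To exercise the endpoint with a real image, a standard-library sketch like the following builds the data URL and sends the same request. The image path is a placeholder, and the hostname only resolves inside the Docker network; from the host, substitute whatever port your compose file publishes.

```python
# Sketch: build the base64 data URL the curl example elides ("...") and send the
# same vision request. "test.jpg" is a placeholder image path.
import base64
import json
import urllib.request

def build_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "vision",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": build_data_url("test.jpg")}},
        ],
    }],
    "max_tokens": 100,
}

req = urllib.request.Request(
    "http://llama-swap:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=60) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```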
### 4. Check GPU State File
```bash
# Verify which GPU is primary
cat bot/memory/gpu_state.json
# Should show:
# {"current_gpu": "amd", "reason": "..."} when AMD is primary
# {"current_gpu": "nvidia", "reason": "..."} when NVIDIA is primary
```
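
The same check can be scripted; this sketch just reads the state file and restates the routing rule described in this guide (the `reason` field may be absent depending on how the state was written).

```python
# Sketch: print the primary text GPU from the state file and restate the
# vision routing rule from this guide (vision always goes to NVIDIA).
import json

with open("bot/memory/gpu_state.json") as f:
    state = json.load(f)

print(f"Primary text GPU: {state.get('current_gpu')} ({state.get('reason', 'no reason recorded')})")
print("Vision requests:  always routed to the NVIDIA llama-swap instance")
```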
### 5. Monitor Logs During Vision Request
```bash
# Watch bot logs during image analysis
docker compose logs -f miku-bot 2>&1 | grep -i vision
# Should see:
# "Sending vision request to http://llama-swap:8080"
# "Vision analysis completed successfully"
# OR detailed error messages if something is wrong
```
## Troubleshooting Steps
### Issue: Vision endpoint health check fails
**Symptoms**: "Vision service currently unavailable: Endpoint timeout"
**Solutions**:
1. Verify NVIDIA container is running: `docker compose ps llama-swap`
2. Check NVIDIA GPU memory: `nvidia-smi` (should have free VRAM)
3. Check if vision model is loaded: `docker compose logs llama-swap`
4. Increase timeout if model is loading slowly
### Issue: Vision requests timeout (status 408/504)
**Symptoms**: Requests hang or return timeout errors
**Solutions**:
1. Check NVIDIA GPU is not overloaded: `nvidia-smi`
2. Check if vision model is already running: Look for MiniCPM processes
3. Restart llama-swap if model is stuck: `docker compose restart llama-swap`
4. Check available VRAM: MiniCPM-V needs ~4-6GB
### Issue: Vision model returns "No description"
**Symptoms**: Image analysis returns empty or generic responses
**Solutions**:
1. Check if vision model loaded correctly: `docker compose logs llama-swap`
2. Verify model file exists: `/models/MiniCPM-V-4_5-Q3_K_S.gguf`
3. Check if mmproj loaded: `/models/MiniCPM-V-4_5-mmproj-f16.gguf`
4. Test with direct curl to ensure model works
### Issue: AMD GPU affects vision performance
**Symptoms**: Vision requests are slower when AMD is primary
**Solutions**:
1. Some slowdown is expected: the NVIDIA GPU still handles all vision processing
2. Persistent slowness can also indicate NVIDIA GPU memory pressure
3. Monitor both GPUs: `rocm-smi` (AMD) and `nvidia-smi` (NVIDIA)
## Architecture Diagram
```
┌────────────────────────────────────────────────┐
│                    Miku Bot                    │
│                                                │
│      Discord messages with images/videos       │
└───────────────────────┬────────────────────────┘
                        │
                        ▼
         ┌──────────────────────────────┐
         │   Vision Analysis Handler    │
         │     (image_handling.py)      │
         │                              │
         │  1. Check NVIDIA health      │
         │  2. Send to NVIDIA vision    │
         └──────────────┬───────────────┘
                        │
            ┌───────────┴───────────────────┐
            │                               │
            ▼ (Vision only)                 ▼ (Text only in dual-GPU mode)
┌──────────────────────────┐   ┌──────────────────────────┐
│ NVIDIA GPU (llama-swap)  │   │ AMD GPU (llama-swap-amd) │
│ Port: 8090               │   │ Port: 8091               │
│                          │   │                          │
│ Available Models:        │   │ Available Models:        │
│  • vision (MiniCPM-V)    │   │  • llama3.1              │
│  • llama3.1              │   │  • darkidol              │
│  • darkidol              │   │  (NO vision model)       │
└──────────────────────────┘   └──────────────────────────┘
```
## Key Files Changed
1. **bot/utils/llm.py**
- Enhanced `get_vision_gpu_url()` with documentation
- Added `check_vision_endpoint_health()` function
2. **bot/utils/image_handling.py**
- `analyze_image_with_vision()` - added health check and logging
- `analyze_video_with_vision()` - added health check and logging
## Expected Behavior After Fix
### When NVIDIA is Primary (default)
```
Image received
→ Check NVIDIA health: OK
→ Send to NVIDIA vision model
→ Analysis complete
✓ Works as before
```
### When AMD is Primary (voice session active)
```
Image received
→ Check NVIDIA health: OK
→ Send to NVIDIA vision model (even though text uses AMD)
→ Analysis complete
✓ Vision now works correctly!
```
## Next Steps if Issues Persist
1. Enable debug logging: Set `AUTONOMOUS_DEBUG=true` in docker-compose
2. Check Docker networking: `docker network inspect miku-discord_default`
3. Verify environment variables: `docker compose exec miku-bot env | grep LLAMA`
4. Check model file integrity: `ls -lah models/MiniCPM*`
5. Review llama-swap logs: `docker compose logs llama-swap -n 100`