# Vision Model Troubleshooting Checklist

## Quick Diagnostics

### 1. Verify Both GPU Services Running

```bash
# Check container status
docker compose ps

# Should show both RUNNING:
#   llama-swap      (NVIDIA CUDA)
#   llama-swap-amd  (AMD ROCm)
```

**If llama-swap is not running:**

```bash
docker compose up -d llama-swap
docker compose logs llama-swap
```

**If llama-swap-amd is not running:**

```bash
docker compose up -d llama-swap-amd
docker compose logs llama-swap-amd
```

### 2. Check NVIDIA Vision Endpoint Health

```bash
# Test NVIDIA endpoint directly
curl -v http://llama-swap:8080/health

# Expected: 200 OK
# If it times out (no response for 5+ seconds):
# - NVIDIA GPU might not have enough VRAM
# - Model might be stuck loading
# - Docker network might be misconfigured
```

### 3. Check Current GPU State

```bash
# See which GPU is set as primary
cat bot/memory/gpu_state.json

# Expected output:
# {"current_gpu": "amd", "reason": "voice_session"}
# or
# {"current_gpu": "nvidia", "reason": "auto_switch"}
```

### 4. Verify Model Files Exist

```bash
# Check vision model files on disk
ls -lh models/MiniCPM*

# Should show both:
# -rw-r--r-- ... MiniCPM-V-4_5-Q3_K_S.gguf      (main model, ~3.3GB)
# -rw-r--r-- ... MiniCPM-V-4_5-mmproj-f16.gguf  (projection, ~500MB)
```

## Scenario-Based Troubleshooting

### Scenario 1: Vision Works When NVIDIA is Primary, Fails When AMD is Primary

**Diagnosis:** The vision model on the NVIDIA GPU is being unloaded while AMD is primary

**Root Cause:** llama-swap is configured to unload unused models

**Solution:**

```yaml
# In llama-swap-config.yaml, increase the TTL for the vision model:
vision:
  ttl: 3600  # Increase from 900 to keep the vision model loaded longer
```

**Or:**

```yaml
# Disable TTL for vision to keep it always loaded:
vision:
  ttl: 0  # 0 means never auto-unload
```

### Scenario 2: "Vision service currently unavailable: Endpoint timeout"

**Diagnosis:** NVIDIA endpoint not responding within 5 seconds

**Causes:**
1. NVIDIA GPU out of memory
2. Vision model stuck loading
3. Network latency

**Solutions:**

```bash
# Check NVIDIA GPU memory
nvidia-smi

# If memory is full, restart the NVIDIA container
docker compose restart llama-swap

# Wait for the model to load (check logs)
docker compose logs llama-swap -f
# Should see: "model loaded" message
```

**If persistent:** Increase the health check timeout in `bot/utils/llm.py`:

```python
# Change from 5 to 10 seconds
async with session.get(f"{vision_url}/health", timeout=aiohttp.ClientTimeout(total=10)) as response:
```
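If the endpoint is simply slow to come up (for example, the model is still loading), a retrying health check can also help. Below is a minimal sketch; the function name, defaults, and retry policy are assumptions for illustration rather than the bot's actual code — only the `/health` endpoint and the 10-second `aiohttp.ClientTimeout` come from the snippet above.

```python
# Hypothetical helper: poll the vision /health endpoint with a configurable
# timeout and a few retries before declaring it unavailable.
import asyncio
import aiohttp


async def vision_endpoint_healthy(
    vision_url: str = "http://llama-swap:8080",  # assumed default
    timeout_s: float = 10.0,                     # raised from the original 5s
    retries: int = 2,
    retry_delay_s: float = 2.0,
) -> bool:
    """Return True if the vision endpoint answers /health with 200."""
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    for attempt in range(retries + 1):
        try:
            async with aiohttp.ClientSession(timeout=timeout) as session:
                async with session.get(f"{vision_url}/health") as response:
                    if response.status == 200:
                        return True
        except (aiohttp.ClientError, asyncio.TimeoutError):
            pass  # treat connection errors and timeouts as "not healthy yet"
        if attempt < retries:
            await asyncio.sleep(retry_delay_s)  # give the model time to finish loading
    return False


if __name__ == "__main__":
    print(asyncio.run(vision_endpoint_healthy()))
```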
### Scenario 3: Vision Model Returns Empty Description

**Diagnosis:** Model loaded but not processing correctly

**Causes:**
1. Model corruption
2. Insufficient input validation
3. Model inference error

**Solutions:**

```bash
# Test the vision model directly
curl -X POST http://llama-swap:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is this?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJ..."}}
      ]
    }],
    "max_tokens": 100
  }'

# If it returns an empty description, check the llama-swap logs for errors
docker compose logs llama-swap -n 50
```

### Scenario 4: "Error 503 Service Unavailable"

**Diagnosis:** llama-swap process crashed or the model failed to load

**Solutions:**

```bash
# Check llama-swap container status
docker compose logs llama-swap -n 100
# Look for error messages and stack traces

# Restart the service
docker compose restart llama-swap

# Monitor startup
docker compose logs llama-swap -f
```

### Scenario 5: Slow Vision Analysis When AMD is Primary

**Diagnosis:** Both GPUs under load, NVIDIA performance degraded

**Expected Behavior:** This is normal; both GPUs are working simultaneously.

**If Unacceptably Slow:**
1. Check whether text requests are blocking vision requests
2. Verify GPU memory allocation
3. Consider processing images sequentially instead of in parallel (see the sketch below)
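A minimal sketch of sequential processing, assuming the OpenAI-compatible `/v1/chat/completions` endpoint shown in Scenario 3; the function names, prompt, and semaphore size are illustrative, not the bot's actual code.

```python
# Cap concurrent vision requests with a semaphore so the NVIDIA GPU is not
# handling several image analyses at once while the AMD GPU serves text.
import asyncio
import aiohttp

VISION_URL = "http://llama-swap:8080/v1/chat/completions"
_vision_lock = asyncio.Semaphore(1)  # 1 = strictly sequential


async def analyze_image(session: aiohttp.ClientSession, image_data_url: str) -> str:
    payload = {
        "model": "vision",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        }],
        "max_tokens": 100,
    }
    async with _vision_lock:  # only one vision request in flight at a time
        async with session.post(VISION_URL, json=payload) as response:
            body = await response.json()
            return body["choices"][0]["message"]["content"]


async def analyze_many(image_data_urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(analyze_image(session, url) for url in image_data_urls)
        )
```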
## Log Analysis Tips

### Enable Detailed Vision Logging

```bash
# Watch only vision-related logs
docker compose logs miku-bot -f 2>&1 | grep -i vision

# Watch with timestamps, filtering by log level
docker compose logs miku-bot -f -t 2>&1 | grep -i vision | grep -E "ERROR|WARNING|INFO"
```

### Check GPU Health During Vision Request

In one terminal:

```bash
# Monitor the NVIDIA GPU while processing
watch -n 1 nvidia-smi
```

In another:

```bash
# Send the bot an image that triggers vision,
# then watch GPU usage spike in the first terminal
```

### Monitor Both GPUs Simultaneously

```bash
# Terminal 1: NVIDIA
watch -n 1 nvidia-smi

# Terminal 2: AMD
watch -n 1 rocm-smi

# Terminal 3: Logs
docker compose logs miku-bot -f 2>&1 | grep -E "ERROR|vision"
```

## Emergency Fixes

### If Vision Completely Broken

```bash
# Full restart of all GPU services
docker compose down
docker compose up -d llama-swap llama-swap-amd
docker compose up -d miku-bot

# Wait for services to start (30-60 seconds)
sleep 30

# Test health
curl http://llama-swap:8080/health
curl http://llama-swap-amd:8080/health
```

### Force NVIDIA GPU Vision

If you want vision requests to proceed even when NVIDIA has issues:

```python
# In bot/utils/llm.py, comment out the health check used by image_handling.py
# (Not recommended, but allows requests to continue)
```

### Disable Dual-GPU Mode Temporarily

If the AMD GPU is causing issues:

```bash
# Stop llama-swap-amd and restart the bot; this reverts to
# single-GPU mode (everything on NVIDIA)
docker compose stop llama-swap-amd
docker compose restart miku-bot
```

## Prevention Measures

### 1. Monitor GPU Memory

```bash
# Set up automated monitoring
watch -n 5 "nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader"
watch -n 5 "rocm-smi --showmeminfo vram"
```

### 2. Set Appropriate Model TTLs

In `llama-swap-config.yaml`:

```yaml
vision:
  ttl: 1800  # Keep loaded 30 minutes
llama3.1:
  ttl: 1800  # Keep loaded 30 minutes
```

In `llama-swap-rocm-config.yaml`:

```yaml
llama3.1:
  ttl: 1800  # AMD text model
darkidol:
  ttl: 1800  # AMD evil mode
```

### 3. Monitor Container Logs

```bash
# Periodic log check
docker compose logs llama-swap | tail -20
docker compose logs llama-swap-amd | tail -20
docker compose logs miku-bot | grep vision | tail -20
```

### 4. Regular Health Checks

```bash
#!/bin/bash
# Check both GPU endpoints
echo "NVIDIA Health:"
curl -sf http://llama-swap:8080/health > /dev/null && echo "✓ OK" || echo "✗ FAILED"
echo "AMD Health:"
curl -sf http://llama-swap-amd:8080/health > /dev/null && echo "✓ OK" || echo "✗ FAILED"
```

## Performance Optimization

If vision requests are too slow:

1. **Reduce image quality** before sending to the model (see the sketch at the end of this checklist)
2. **Use smaller frames** for video analysis
3. **Batch process** multiple images
4. **Allocate more VRAM** to NVIDIA if available
5. **Reduce concurrent requests** to NVIDIA during peak load

## Success Indicators

After applying the relevant fix, you should see:

✅ Images analyzed within 5-10 seconds (first load: 20-30 seconds)
✅ No "Vision service unavailable" errors
✅ Log shows `Vision analysis completed successfully`
✅ Works correctly whether AMD or NVIDIA is the primary GPU
✅ No GPU memory errors in nvidia-smi/rocm-smi

## Contact Points for Further Issues

1. Check NVIDIA llama.cpp/llama-swap logs
2. Check AMD ROCm compatibility for your GPU
3. Verify Docker networking (if using custom networks)
4. Check system VRAM (needs ~10GB+ for both models)
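For the image-quality item under Performance Optimization, here is a minimal preprocessing sketch. It assumes Pillow is available and that images are sent as base64 data URLs (as in the Scenario 3 curl test); the size and quality values are illustrative, not the bot's actual settings.

```python
# Downscale and re-encode an image before base64-encoding it for the vision
# endpoint, so less data is transferred and decoded per request.
import base64
import io

from PIL import Image


def to_data_url(path: str, max_side: int = 768, jpeg_quality: int = 80) -> str:
    """Return a data: URL for a downscaled JPEG copy of the image at `path`."""
    with Image.open(path) as img:
        img = img.convert("RGB")             # drop alpha so JPEG encoding works
        img.thumbnail((max_side, max_side))  # shrink in place, keeping aspect ratio
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=jpeg_quality)
    encoded = base64.b64encode(buf.getvalue()).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"
```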