# Vision Model Dual-GPU Fix - Summary

## Problem

The vision model (MiniCPM-V) wasn't working when the AMD GPU was set as the primary GPU for text inference.

## Root Cause

Although `get_vision_gpu_url()` was correctly hardcoded to always use NVIDIA, there was:

1. No health checking before attempting requests
2. No detailed error logging to understand failures
3. No timeout specification (requests could hang indefinitely)
4. No verification that the NVIDIA GPU was actually responsive

When AMD became primary, if the NVIDIA GPU had issues, vision requests would fail silently with poor error reporting.

## Solution Implemented

### 1. Enhanced GPU Routing (`bot/utils/llm.py`)

```python
def get_vision_gpu_url():
    """Always use NVIDIA for vision, even when AMD is primary for text"""
    # Added clear documentation
    # Added debug logging when switching occurs
    # Returns the NVIDIA URL unconditionally
```

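As a rough illustration of the routing rule described above — the constant name and the `get_current_gpu()` helper are assumptions for this sketch, not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)

# Assumed endpoint constant; the real module may define this differently.
NVIDIA_LLAMA_URL = "http://llama-swap:8080"

def get_current_gpu() -> str:
    """Placeholder for however the bot tracks the primary GPU ('amd' or 'nvidia')."""
    return "amd"

def get_vision_gpu_url() -> str:
    """Always use NVIDIA for vision, even when AMD is primary for text."""
    if get_current_gpu() == "amd":
        # Surface the switch in the logs so routing is visible when debugging.
        logger.debug("Primary GPU is AMD for text, but using NVIDIA for vision model")
    return NVIDIA_LLAMA_URL  # unconditional: the AMD backend has no vision model
```

The point of the unconditional return is that no GPU-state change can ever route a vision request to a backend that cannot serve it.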
### 2. Added Health Check (`bot/utils/llm.py`)

```python
async def check_vision_endpoint_health():
    """Verify NVIDIA vision endpoint is responsive before use"""
    # Pings http://llama-swap:8080/health
    # Returns (is_healthy: bool, error_message: Optional[str])
    # Logs status for debugging
```

### 3. Improved Image Analysis (`bot/utils/image_handling.py`)

**Before request:**
- Health check
- Detailed logging of endpoint, model, and image size

**During request:**
- 60-second timeout (previously unlimited)
- Endpoint URL included in error messages

**After an error:**
- Full exception traceback in logs
- Endpoint information in the error response

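The during/after pattern above can be sketched as a request wrapper; the endpoint path and payload shape are assumptions based on an OpenAI-compatible server, not the actual code:

```python
import json
import logging
import urllib.request

logger = logging.getLogger(__name__)

VISION_TIMEOUT_S = 60  # bounded, where the request was previously unlimited

def post_vision_request(endpoint: str, payload: dict,
                        timeout: float = VISION_TIMEOUT_S) -> dict:
    """POST a vision request with a hard timeout and endpoint-tagged errors."""
    logger.info("Sending vision request to %s using model: %s",
                endpoint, payload.get("model"))
    req = urllib.request.Request(
        f"{endpoint}/v1/chat/completions",  # assumed OpenAI-compatible route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except Exception as exc:
        # Log the full traceback and keep the endpoint in the raised error
        # so failures are attributable to a specific backend.
        logger.exception("Vision request to %s failed", endpoint)
        raise RuntimeError(f"Vision request to {endpoint} failed: {exc}") from exc
```

Embedding the endpoint in both the log line and the raised error is what turns a silent failure into an attributable one.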
### 4. Improved Video Analysis (`bot/utils/image_handling.py`)

**Before request:**
- Health check
- Logging of media type and frame count

**During request:**
- 120-second timeout (longer, to allow for multiple frames)
- Endpoint URL included in error messages

**After an error:**
- Full exception traceback in logs
- Endpoint information in the error response

## Key Changes

| File | Function | Changes |
|------|----------|---------|
| `bot/utils/llm.py` | `get_vision_gpu_url()` | Added documentation and debug logging |
| `bot/utils/llm.py` | `check_vision_endpoint_health()` | NEW: health check function |
| `bot/utils/image_handling.py` | `analyze_image_with_vision()` | Added health check, timeout, detailed logging |
| `bot/utils/image_handling.py` | `analyze_video_with_vision()` | Added health check, timeout, detailed logging |

## Testing

Quick test to verify the vision model works when AMD is primary:

```bash
# 1. Check that the GPU state is AMD
cat bot/memory/gpu_state.json
# Should show: {"current_gpu": "amd", ...}

# 2. Send an image to Discord
# (the bot should analyze it with the vision model)

# 3. Check the logs for success
docker compose logs miku-bot 2>&1 | grep -i "vision"
# Should see: "Vision analysis completed successfully"
```

## Expected Log Output

### When Working Correctly

```
[INFO] Primary GPU is AMD for text, but using NVIDIA for vision model
[INFO] Vision endpoint (http://llama-swap:8080) health check: OK
[INFO] Sending vision request to http://llama-swap:8080 using model: vision
[INFO] Vision analysis completed successfully
```

### If the NVIDIA Vision Endpoint Is Down

```
[WARNING] Vision endpoint (http://llama-swap:8080) health check failed: status 503
[WARNING] Vision endpoint unhealthy: Status 503
[ERROR] Vision service currently unavailable: Status 503
```

### If a Network Timeout Occurs

```
[ERROR] Vision endpoint (http://llama-swap:8080) health check: timeout
[WARNING] Vision endpoint unhealthy: Endpoint timeout
[ERROR] Vision service currently unavailable: Endpoint timeout
```

## Architecture Reminder

- **NVIDIA GPU** (port 8090): vision + text models
- **AMD GPU** (port 8091): text models ONLY
- When AMD is primary: text goes to AMD, vision goes to NVIDIA
- When NVIDIA is primary: everything goes to NVIDIA

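The routing rules above can be expressed as a small pure function. The endpoint URLs here are assumptions (the AMD container name in particular is hypothetical); only the routing rule itself comes from the summary:

```python
# Assumed internal endpoints; the AMD container name is hypothetical.
NVIDIA_URL = "http://llama-swap:8080"
AMD_URL = "http://llama-swap-amd:8080"

def route_request(kind: str, current_gpu: str) -> str:
    """Pick a backend: text follows the primary GPU, vision is pinned to NVIDIA."""
    if kind == "vision":
        return NVIDIA_URL  # the AMD backend serves text models only
    return AMD_URL if current_gpu == "amd" else NVIDIA_URL
```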
## Files Modified

1. `/home/koko210Serve/docker/miku-discord/bot/utils/llm.py`
2. `/home/koko210Serve/docker/miku-discord/bot/utils/image_handling.py`

## Files Created

1. `/home/koko210Serve/docker/miku-discord/VISION_MODEL_DEBUG.md` - complete debugging guide

## Deployment Notes

No changes are needed to:

- Docker containers
- Environment variables
- Configuration files
- Database or state files

Just update the code and restart the bot:

```bash
docker compose restart miku-bot
```

## Success Criteria

- ✅ Images are analyzed when the AMD GPU is primary
- ✅ Detailed error messages if the vision endpoint fails
- ✅ Health check prevents hanging requests
- ✅ Logs show that NVIDIA is correctly used for vision
- ✅ No performance degradation compared to before