Fix vision pipeline: route images through Cheshire Cat, pass user question to vision model

- Fix silent None return in analyze_image_with_vision exception handler
- Add None/empty guards after vision analysis in bot.py (image, video, GIF, Tenor)
- Route all image/video/GIF responses through Cheshire Cat pipeline (was
  calling query_llama directly), enabling episodic memory storage for media
  interactions and correct Last Prompt display in Web UI
- Add media_type parameter to cat_adapter.query() and forward as
  discord_media_type in WebSocket payload
- Update discord_bridge plugin to read media_type from payload and inject
  MEDIA NOTE into system prefix in before_agent_starts hook
- Add _extract_vision_question() helper to strip Discord mentions and bot-name
  triggers from user message; pass cleaned question to vision model so specific
  questions (e.g. 'what is the person wearing?') go directly to the vision model
  instead of the generic 'Describe this image in detail.' fallback
- Pass user_prompt to all analyze_image_with_qwen / analyze_video_with_vision
  call sites in bot.py (image, video, GIF, Tenor, embed paths)
- Make autonomous reaction loops skip messages that @mention the bot or have
  media attachments in DMs, preventing duplicate vision model calls for images
  already being processed by the main message handler
- Increase vision max_tokens: images 300->800, video/GIF 400->1000 (no VRAM
  impact; KV cache is pre-allocated at model load time)
This commit is contained in:
2026-03-05 21:59:27 +02:00
parent ae1e0aa144
commit d5b9964ce7
5 changed files with 144 additions and 20 deletions

View File

@@ -277,7 +277,10 @@ async def on_message(message):
return
# Analyze image (objective description)
qwen_description = await analyze_image_with_qwen(base64_img)
qwen_description = await analyze_image_with_qwen(base64_img, user_prompt=prompt)
if not qwen_description or not qwen_description.strip():
await message.channel.send("I couldn't see that image clearly, sorry! Try sending it again.")
return
# For DMs, pass None as guild_id to use DM mood
guild_id = message.guild.id if message.guild else None
miku_reply = await rephrase_as_miku(
@@ -349,7 +352,10 @@ async def on_message(message):
logger.debug(f"📹 Extracted {len(frames)} frames from {attachment.filename}")
# Analyze the video/GIF with appropriate media type
video_description = await analyze_video_with_vision(frames, media_type=media_type)
video_description = await analyze_video_with_vision(frames, media_type=media_type, user_prompt=prompt)
if not video_description or not video_description.strip():
await message.channel.send(f"I couldn't analyze that {media_type} clearly, sorry! Try sending it again.")
return
# For DMs, pass None as guild_id to use DM mood
guild_id = message.guild.id if message.guild else None
miku_reply = await rephrase_as_miku(
@@ -432,7 +438,10 @@ async def on_message(message):
logger.info(f"📹 Extracted {len(frames)} frames from Tenor GIF")
# Analyze the GIF with tenor_gif media type
video_description = await analyze_video_with_vision(frames, media_type="tenor_gif")
video_description = await analyze_video_with_vision(frames, media_type="tenor_gif", user_prompt=prompt)
if not video_description or not video_description.strip():
await message.channel.send("I couldn't analyze that GIF clearly, sorry! Try sending it again.")
return
guild_id = message.guild.id if message.guild else None
miku_reply = await rephrase_as_miku(
video_description,
@@ -490,7 +499,7 @@ async def on_message(message):
if base64_img:
logger.info(f"Image downloaded, analyzing with vision model...")
# Analyze image
qwen_description = await analyze_image_with_qwen(base64_img)
qwen_description = await analyze_image_with_qwen(base64_img, user_prompt=prompt)
truncated = (qwen_description[:50] + "...") if len(qwen_description) > 50 else qwen_description
logger.error(f"Vision analysis result: {truncated}")
if qwen_description and qwen_description.strip():
@@ -514,7 +523,7 @@ async def on_message(message):
frames = await extract_video_frames(media_bytes, num_frames=6)
if frames:
logger.info(f"📹 Extracted {len(frames)} frames, analyzing with vision model...")
video_description = await analyze_video_with_vision(frames, media_type="video")
video_description = await analyze_video_with_vision(frames, media_type="video", user_prompt=prompt)
logger.info(f"Video analysis result: {video_description[:100]}...")
if video_description and video_description.strip():
embed_context_parts.append(f"[Embedded video shows: {video_description}]")