HIGH: Add Retry Logic for External API Calls #27

Open
opened 2026-02-16 22:50:36 +02:00 by Koko210 · 0 comments
Owner

External API calls (Cheshire Cat, LLM, voice services) lack retry logic, causing failures when services are temporarily unavailable.

Where It Occurs

  • cat-plugins/cat_client.py - Cheshire Cat API calls
  • bot/utils/llm.py - LLM API calls
  • bot/bot.py - Voice service calls
  • Various HTTP/WebSocket calls

Why This Is a Problem

  1. Fragility: Temporary failures cause immediate errors
  2. Poor UX: Users see errors for transient issues
  3. Data Loss: Failed requests are not retried
  4. No Backoff: Rapid retries overwhelm services
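The "No Backoff" point is the key one: with a multiplier of 2, the wait between attempts doubles each time, so retries thin out instead of hammering a service that is trying to recover. A minimal sketch of the delay schedule (illustrative helper, not project code):

```python
# Illustrative only: the delay before each retry grows geometrically
# (exponential backoff), capped at a maximum so waits never run away.
def backoff_delays(attempts: int, initial: float = 1.0,
                   multiplier: float = 2.0, cap: float = 60.0) -> list[float]:
    """Return the sleep intervals used between retry attempts."""
    delays = []
    delay = initial
    for _ in range(attempts - 1):  # no sleep after the final attempt
        delays.append(min(delay, cap))
        delay *= multiplier
    return delays

print(backoff_delays(5))  # -> [1.0, 2.0, 4.0, 8.0]
```

With the defaults proposed below (initial 1.0s, multiplier 2.0, cap 60s), a long retry sequence plateaus at the 60-second ceiling rather than growing without bound.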

What Can Go Wrong

Scenario 1: Temporary Network Glitch

  1. User sends message requiring LLM call
  2. Network has 2-second hiccup
  3. LLM API call fails with timeout
  4. Bot shows error to user
  5. Network is fine 1 second later, but request not retried
  6. User frustrated

Scenario 2: Service Overload

  1. Cheshire Cat service overloaded (503)
  2. Memory consolidation request fails
  3. Bot shows error message
  4. Service recovers in 30 seconds
  5. User has to manually retry
  6. Poor user experience
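Both scenarios follow the same shape: a transient failure that resolves within seconds of the first attempt. A toy simulation of that pattern under a simple retry loop (illustrative only, not project code; the sleep is zeroed out so the demo runs instantly):

```python
import asyncio

# A fake service that fails twice with a transient error, then succeeds --
# the pattern both scenarios above describe.
class FlakyService:
    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    async def request(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("temporarily unavailable")
        return "ok"

async def call_with_retry(service: FlakyService, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            return await service.request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the caller
            await asyncio.sleep(0)  # real code backs off here

svc = FlakyService(failures=2)
print(asyncio.run(call_with_retry(svc)))  # -> ok, on the third call
```

Without the retry loop, the first `ConnectionError` would propagate straight to the user, exactly as the scenarios describe.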

Proposed Fix

Implement retry logic with exponential backoff:

# bot/utils/retry.py - NEW FILE
import asyncio
import functools
import logging
from typing import Callable, Tuple, Type

import aiohttp  # needed for the default retryable aiohttp.ClientError

logger = logging.getLogger(__name__)

class RetryConfig:
    def __init__(
        self,
        max_attempts: int = 3,
        initial_delay: float = 1.0,
        max_delay: float = 60.0,
        backoff_multiplier: float = 2.0,
        retryable_exceptions: Tuple[Type[Exception], ...] = (
            TimeoutError,
            ConnectionError,
            aiohttp.ClientError,
        )
    ):
        self.max_attempts = max_attempts
        self.initial_delay = initial_delay
        self.max_delay = max_delay
        self.backoff_multiplier = backoff_multiplier
        self.retryable_exceptions = retryable_exceptions

def with_retry(config: RetryConfig):
    """Decorator to add retry logic to async functions"""
    def decorator(func: Callable):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            delay = config.initial_delay
            for attempt in range(config.max_attempts):
                try:
                    return await func(*args, **kwargs)
                except config.retryable_exceptions as e:
                    if attempt == config.max_attempts - 1:
                        logger.error(f"Function {func.__name__} failed after {config.max_attempts} attempts: {e}")
                        raise
                    
                    logger.warning(f"Function {func.__name__} failed (attempt {attempt + 1}/{config.max_attempts}): {e}")
                    await asyncio.sleep(min(delay, config.max_delay))
                    delay *= config.backoff_multiplier
            
            # Unreachable: the loop either returns or re-raises on the
            # final attempt, but this keeps type checkers satisfied.
            raise RuntimeError("Max retry attempts exceeded")
        return wrapper
    return decorator

# Example usage
RETRY_CONFIG = RetryConfig(
    max_attempts=3,
    initial_delay=1.0,
    max_delay=60.0,
    backoff_multiplier=2.0,
)

@with_retry(RETRY_CONFIG)
async def call_llm_api(prompt: str):
    # LLM_URL is assumed to be defined elsewhere (e.g. module config)
    async with aiohttp.ClientSession() as session:
        async with session.post(LLM_URL, json={"prompt": prompt}) as resp:
            return await resp.json()
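One gap worth flagging in the sketch above: aiohttp does not raise for non-2xx responses by default, so the 503 in Scenario 2 would come back as a normal response and bypass the retry path entirely. Calling `resp.raise_for_status()` inside the wrapped function converts it into `aiohttp.ClientResponseError`, which is a subclass of `aiohttp.ClientError` and therefore already covered by the default `retryable_exceptions`. The status classification itself is plain logic (helper name is illustrative, not project code):

```python
# Which HTTP statuses should trigger a retry: 429 and 5xx are transient;
# other 4xx errors are the caller's fault and retrying just wastes attempts.
def is_retryable_status(status: int) -> bool:
    return status == 429 or 500 <= status < 600

print(is_retryable_status(503))  # True  (service overloaded)
print(is_retryable_status(404))  # False (retrying will not help)
```

A status check like this could gate whether `raise_for_status()` failures are re-raised as retryable, so permanent client errors fail fast instead of burning all three attempts.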

Severity

HIGH - Lack of retry logic causes frequent failures and poor UX.

Files Affected

cat-plugins/cat_client.py, bot/utils/llm.py, bot/bot.py, new file: bot/utils/retry.py


Reference: Koko210/miku-discord#27